Small datasets: how to maximize their use


In the fast-paced field of artificial intelligence, innovation and the quest for performance constantly take center stage. Recently, the Chinese AI company DeepSeek shook up the landscape by dethroning ChatGPT as the most downloaded free application on the Apple App Store. Since its launch at the end of 2022, ChatGPT had dominated the field of AI, despite increasing competition from giants like Google, Meta, and Anthropic. The meteoric rise of DeepSeek, however, signals a possible paradigm shift in the AI industry: the model is attracting attention not only for its impressive performance but also for its strategic approach to data.
Founded in July 2023 by Liang Wenfeng in Hangzhou, DeepSeek quickly made a name for itself. Recent benchmarks show that its third-generation language model (DeepSeek-V3) has outperformed those of major American technology companies, while being developed at significantly lower cost, according to its founders. This feat sparked a great deal of interest, along with questions about how a young start-up could achieve what seemed impossible. The answer, as Salesforce CEO Marc Benioff pointed out, lies not only in the technology itself, but in the data and metadata that feed it. Calling DeepSeek a "Deepgold", Benioff said: "The real value of AI does not lie in the user interface or the model. Tomorrow's fortune? It's in our data!"
This perspective highlights a growing awareness within the AI community of the importance of datasets, and in particular of small datasets that make it possible to do without expensive, energy-intensive computing infrastructure. This is nothing new: several years ago, Andrew Ng was already discussing the subject on his blog.
In short, while attention has long been focused on model scale and computing power, the focus is now shifting to the quality and specificity of the data used to train these models. Small data sets, often underestimated in favor of large databases, have unique potential to meet niche applications, improve efficiency, and enable the development of AI even in resource-limited environments.
💡 In this article, we'll explore why small datasets are becoming a cornerstone of AI progress, how they compare to large datasets in terms of usefulness and impact, and what lessons can be learned from pioneers like DeepSeek (which, by the way, did not necessarily use small datasets, but that's another debate, since its training data was not yet known at the time of writing!). Whether you are an AI enthusiast, a data scientist, or simply curious, understanding the role of small datasets in AI development offers valuable perspective on the future of AI and its potential!
What is a small dataset?
In the world of big data and artificial intelligence, we often hear about the importance of large datasets. However, small datasets play an equally important role in many areas. But what exactly do we mean by "small dataset"?
A small dataset is generally defined as a dataset containing a relatively small number of observations or samples (i.e. little raw data, enriched with a limited amount of metadata). Although the exact definition varies with context, a dataset is generally considered "small" when it contains fewer than a few thousand entries. These sets can come from a variety of sources, such as scientific experiments, small-scale surveys, or data collections limited to a specific area.
💡 It is important to note that the size of a data set is relative to the field of application and the problem to be solved. For example, in the field of genomics, a set of 1000 DNA sequences could be considered small, while in a local sociological study, the same number of participants could be considered substantial. The concept of “small dataset” therefore depends on the context and standards specific to each discipline!
The benefits of small data sets
Contrary to what one might think, small data sets have numerous advantages that make them valuable in many situations. Some of these benefits include:
1. Ease of collection and management
Small datasets are generally faster and less expensive to collect. They require fewer resources in terms of time, money, and labor, making them accessible to more people.
2. Speed of analysis
With less data to process, analytics can be done more quickly, allowing for more frequent iterations and adjustments in the AI research and development process.
3. Better understanding of data
Smaller datasets allow for deeper exploration and a finer understanding of each data point. This can lead to valuable qualitative insights that would be lost in the analysis of large volumes of data.
4. Flexibility and agility
Small datasets offer more flexibility in experimenting and adjusting hypotheses. It is easier to change the settings or refocus the study if necessary.
5. Noise reduction
In some cases, small datasets may contain less noise and fewer errors, especially if they are carefully curated and therefore of higher quality. Such datasets can be used to develop more accurate and reliable models.
Challenges and limitations of small data sets
While small datasets have many benefits, they are not without challenges and limitations. Understanding these aspects is very important in order to use these data sets effectively:
1. Limited representativeness
One of the main challenges with small data sets is their limited ability to represent a larger population. There is a higher risk of sampling bias, which can lead to erroneous conclusions if one is not careful.
2. Reduced statistical power
With less data, the statistical power of analyses is often reduced. This means that it may be more difficult to detect subtle effects or to draw statistically significant conclusions.
3. Sensitivity to outliers
Small datasets are more sensitive to outliers or measurement errors. A single wrong data point can have a disproportionate impact on analysis results.
4. Limits in the application of certain analysis techniques
Some advanced analytics techniques, especially in the field of machine learning, require large amounts of data to be effective. Small data sets may limit the use of these methods.
5. Risk of overfitting
In the context of machine learning, models trained on small datasets are more likely to overfit, that is, to fit the training data too closely at the expense of generalization.
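To make this concrete, here is a minimal sketch (in Python with scikit-learn, using synthetic data purely for illustration) of how an unconstrained model can memorize a small training set while generalizing poorly:

```python
# Overfitting on a small dataset: an unconstrained decision tree
# memorizes 60 training points but generalizes noticeably worse.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a small, noisy dataset.
X, y = make_classification(n_samples=60, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
train_acc = tree.fit(X, y).score(X, y)             # typically 1.0
cv_acc = cross_val_score(tree, X, y, cv=5).mean()  # noticeably lower
print(f"Training accuracy: {train_acc:.2f} | Cross-validated: {cv_acc:.2f}")
```

The gap between the two scores is the signature of overfitting, and it is exactly what the techniques in the next section aim to reduce.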
Techniques for maximizing the use of small data sets
Faced with the challenges posed by small datasets, we have developed various techniques to make the most of them. Here are some approaches that we frequently recommend to our customers:
1. Cross-validation
This technique makes it possible to evaluate the performance of models on small data sets. It involves breaking the data into subsets, training the model on some, and testing it on others, repeating the process several times. This allows for a more robust estimate of model performance.
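As an illustration, here is a minimal cross-validation sketch with scikit-learn; the synthetic dataset and the logistic regression model are stand-ins for your own small dataset and model of choice:

```python
# 5-fold cross-validation: train on 4 folds, evaluate on the held-out
# fold, and repeat 5 times for a more robust performance estimate.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a small tabular dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Accuracy per fold: {scores}")
print(f"Mean: {scores.mean():.3f} +/- {scores.std():.3f}")
```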
2. Data augmentation
In some areas, such as image processing, we can artificially increase the size of the dataset by creating new instances from existing data, for example by cropping, rotating, flipping, or slightly altering the original images.
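A minimal sketch with torchvision (the specific transformations and their parameters are illustrative choices, to be tuned to your domain):

```python
# Image-augmentation pipeline: each pass over the same original image
# produces a slightly different training sample.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop
    transforms.RandomHorizontalFlip(p=0.5),                # mirror image
    transforms.RandomRotation(degrees=10),                 # slight rotation
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # slight alteration
    transforms.ToTensor(),
])
# Applying `augment` several times to one PIL image yields several
# distinct tensors, effectively multiplying the size of a small dataset.
```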
3. Regularization techniques
To avoid overfitting, we often use regularization methods such as L1 regularization (Lasso) or L2 regularization (Ridge). These techniques add a penalty to the model's loss function, encouraging simplicity and reducing the risk of overfitting.
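A short sketch with scikit-learn (the alpha values, which control the strength of the penalty, are arbitrary illustrative choices):

```python
# Comparing L1 (Lasso) and L2 (Ridge) regularization on a small
# regression dataset (442 samples) via cross-validated R^2 scores.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)  # a genuinely small dataset

for name, model in [("Lasso (L1)", Lasso(alpha=0.1)),
                    ("Ridge (L2)", Ridge(alpha=1.0))]:
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {score:.3f}")
```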
4. Transfer learning
This approach, known as transfer learning, consists of taking a model pre-trained on a large dataset and fine-tuning it on our small dataset. This makes it possible to benefit from the knowledge acquired on large volumes of data, even when our own data is limited.
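A minimal sketch with PyTorch and torchvision (the five-class problem is hypothetical; replace it with your own task):

```python
# Transfer learning: reuse a ResNet-18 pre-trained on ImageNet and
# retrain only its final layer on a small, hypothetical 5-class dataset.
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the backbone so the knowledge learned on ImageNet is preserved.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head; only this layer will be trained.
model.fc = nn.Linear(model.fc.in_features, 5)  # 5 = our number of classes

# From here, train model.fc with a standard loop on the small dataset.
```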
5. Use of a classifier to enrich the dataset
Finally, a powerful strategy (which we are seeing more and more) is to use a classifier to transform a small dataset into a much larger one.
Example of an approach (sketched in code after the list):
- Select a representative subset of 5000 well-labeled samples.
- Train a classifier on this data to create an initial model, then apply this classifier to a larger set of unlabeled data, in batches of 5000 samples.
- Manually correct errors after each iteration and monitor the improvement of model accuracy.
- Starting with around 70-80% accuracy, this iterative process makes it possible to progressively enrich the dataset while reducing errors. This approach is ideal for cases where large-scale manual collection is difficult or expensive.
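Here is a hedged sketch of this iterative loop in Python; `review_fn` stands for the manual correction step and, like the other names, is a hypothetical placeholder rather than a specific library API:

```python
# Iterative dataset enrichment: pre-label unlabeled batches with a
# classifier, correct the labels by hand, fold them in, and retrain.
import numpy as np
from sklearn.linear_model import LogisticRegression

def enrich_dataset(X_labeled, y_labeled, unlabeled_batches, review_fn):
    """Grow a labeled dataset batch by batch with human correction."""
    model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    for X_batch in unlabeled_batches:             # e.g. batches of 5000
        y_pred = model.predict(X_batch)           # machine pre-labeling
        y_corrected = review_fn(X_batch, y_pred)  # manual error correction
        # Fold the corrected batch into the training set and retrain.
        X_labeled = np.vstack([X_labeled, X_batch])
        y_labeled = np.concatenate([y_labeled, y_corrected])
        model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    return model, X_labeled, y_labeled
```

Each pass should raise the model's accuracy, so the share of predictions needing manual correction shrinks from one batch to the next.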
Areas of application for small data sets
Small datasets are useful in many areas, often where large-scale data collection is difficult, time-consuming, expensive, or simply impossible. Here are a few areas where we frequently see the effective use of small data sets:
1. Medical research
In clinical studies, especially for rare diseases, researchers often work with a limited number of patients. These small datasets are critical because the data is rare: they make it possible to understand the mechanisms of the disease and to develop new treatments.
2. Ecology and conservation
Studies of rare or endangered species often involve small sample sizes. However, these limited data are essential for the conservation and management of biodiversity.
3. Market research for small businesses
Small businesses or startups often don't have the resources to conduct large-scale market research. They therefore rely on small datasets to obtain insights about their customers and the market.
4. Psychology and behavioral sciences
Behavioral studies often involve relatively small samples due to recruitment constraints and the complexity of experimental protocols.
5. Engineering and quality control
In product testing or quality control processes, we often work with limited samples for reasons of cost or time.
6. Astronomy
Despite technological progress, some rare astronomical phenomena can only be observed a limited number of times, resulting in valuable small datasets.
7. Pilot studies and exploratory research
In many areas, pilot studies with small samples are used to test feasibility and refine hypotheses before engaging in larger scale studies.
Comparison between small and large datasets
The comparison between small datasets and large datasets (or "big data") is a frequent topic of discussion in the world of data analysis. Each approach has strengths and weaknesses, and the choice between the two often depends on the specific context of a study or project. In broad strokes: small datasets win on collection cost, speed of analysis, and fine-grained understanding of each data point, while large datasets offer greater representativeness, higher statistical power, and support for more complex models.
It is important to note that these comparisons are general and may vary depending on specific situations. In many cases, the ideal approach is to combine the benefits of both types of data sets:
- 1. Use small datasets for rapid exploratory analyses and pilot studies
- 2. Validate hypotheses and models on larger data sets where possible
- 3. Use intelligent sampling techniques to extract representative small datasets from large volumes of data (see the sketch below).
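A minimal sketch of such sampling with scikit-learn, using stratified sampling on a synthetic stand-in for a large labeled dataset:

```python
# Stratified sampling: extract a 5000-sample subset that preserves the
# class proportions of a much larger dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a large, imbalanced labeled dataset (90/10).
X_big, y_big = make_classification(n_samples=100_000, n_features=20,
                                   weights=[0.9, 0.1], random_state=0)

X_small, _, y_small, _ = train_test_split(
    X_big, y_big,
    train_size=5000,    # size of the small representative sample
    stratify=y_big,     # keep the original class distribution
    random_state=0,
)
print(np.bincount(y_small) / len(y_small))  # ~ [0.9, 0.1]
```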
In the end, the value of a data set depends not only on its size, but also on its quality, its relevance to the question being asked, and how it is analyzed and interpreted.
Case studies - read in the press, some successes with small data sets
To illustrate the power of small datasets, let's look at some case studies where the careful use of small data sets has led to significant discoveries or innovative applications:
1. Discovery of the exoplanet TRAPPIST-1e
In 2017, a team of astronomers discovered a potentially habitable exoplanet, TRAPPIST-1e, using a relatively small dataset. Their analysis was based on only 70 hours of observations from the Spitzer Space Telescope. Despite the limited size of the data, researchers were able to accurately identify the characteristics of this planet.
2. Early prediction of Alzheimer's disease
A study conducted by researchers at the University of San Francisco used a small dataset of only 65 patients to develop a machine learning model that could predict Alzheimer's disease with 82% accuracy up to six years before clinical diagnosis. This study demonstrates how limited but high-quality data can lead to significant advances in the medical field.
3. Optimization of agricultural production
An agricultural startup used a small dataset of 500 soil samples to develop a predictive model of crop quality. By combining this data with weather information and transfer learning techniques, this startup was able to create an accurate recommendation system for farmers, significantly improving yields in various regions.
4. Improving road safety
A municipality analyzed a data set of only 200 traffic accidents to identify the main safety issues. Despite the limited sample size, the in-depth analysis of each case made it possible to identify specific risk factors and to implement targeted measures, reducing the accident rate by 30% in one year.
5. Development of new materials
Materials science researchers used a small dataset of 150 compounds to train a model to predict the properties of new metal alloys. By using data augmentation and transfer learning techniques, they were able to successfully predict the characteristics of new materials, which significantly accelerated the development process.
In conclusion: the growing importance of small data sets
As we explore small datasets, it becomes clear that their importance in the data analytics landscape is constantly growing. Although the era of big data has revolutionized many fields, including artificial intelligence, we are seeing renewed interest in small datasets and optimization, rather than the massive use of GPUs, for several reasons:
- 1. Accessibility: small datasets are accessible to a greater number of organizations and individuals. They therefore democratize the adoption and development of AI: AI becomes accessible to everyone!
- 2. Fast iteration: they allow for faster analysis and experimentation cycles, which are essential in a world where agility is required.
- 3. Focus on quality: the use of small datasets encourages particular attention to the quality and relevance of each data point.
- 4. Ethics and confidentiality: in a context of growing concerns about data privacy, small datasets often offer a more ethical and less intrusive alternative.
- 5. Complementarity with big data: far from competing, small datasets and big data are often complementary, offering different and rewarding perspectives.
- 6. Methodological innovation: the challenges posed by small datasets stimulate innovation in analytical methods, benefiting the entire field of data science.
Are you ready to harness the power of small datasets in your projects? Contact us today to find out how we can develop datasets of any size for you. Together, let's transform your data into actionable insights, training data for your AIs, and competitive advantages!