Label Skew and Data Scarcity: the double challenge of annotation for AI


In the field of artificial intelligence, the quality and diversity of data play a fundamental role in the performance of machine learning models. However, data annotation challenges such as label imbalance (Label Skew) and data shortages (Data Scarcity) often complicate this process.
Let's start with some definitions: Label Skew refers to an unbalanced distribution of labels in a dataset, which can interfere with model training and skew results. Data Scarcity, on the other hand, limits a model's ability to generalize effectively.
💡 These two obstacles constitute a major double challenge for AI practitioners seeking to build robust and reliable systems. In this article, as usual, we offer some insights to help you better understand these concepts!
What is Label Skew and why is it a problem in data annotation?
Label Skew refers to an imbalance in the distribution of labels within an annotated dataset. This means that some categories or classes are overrepresented compared to others, which can skew the learning of artificial intelligence (AI) models.
For example, in an image classification dataset, if the majority of images belong to a single category (such as dogs) while the other categories (such as cats or birds) are very poorly represented, the model will develop a bias in favor of the dominant class.
This problem is particularly significant in data annotation, as AI models depend on the quality and diversity of data to generalize well. In case of Label Skew, the model is likely to overfit to the characteristics of the overrepresented class, leading to poor performance on the less frequent classes. This can be problematic for critical applications where a balance between classes is essential (such as the detection of rare diseases or the classification of safety anomalies). In addition, Label Skew can be particularly problematic for certain specific use cases, such as those involving ecological data or medical diagnoses, where accurate measurements are essential.
💡 Label Skew makes the work of processing and annotating data more complex, as it requires adjustments to rebalance classes or the use of special techniques (such as oversampling or undersampling) to mitigate the impact of the imbalance on model performance.
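One common adjustment is to weight the training loss by inverse class frequency, so that rare classes count more. A minimal sketch, using a hypothetical toy label list (the dog/cat/bird example above) and only the standard library:

```python
from collections import Counter

def class_weights(labels):
    """Compute inverse-frequency class weights to counter label skew.

    Each class gets weight n_samples / (n_classes * class_count), so
    rare classes contribute more to a weighted training loss.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical skewed dataset: 90 dogs, 8 cats, 2 birds
labels = ["dog"] * 90 + ["cat"] * 8 + ["bird"] * 2
weights = class_weights(labels)
# The dominant class gets a weight below 1; rare classes get weights above 1.
```

Many training frameworks accept such a mapping directly (for example as per-class loss weights), which rebalances learning without touching the data itself.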
What are the common causes of Label Skew in datasets?
Common causes of Label Skew in datasets are often linked to the nature of the data collected and to biases inherent in their source. Some of the main causes include:
Natural imbalance in data
Some classes or categories are naturally more common than others in the real world. For example, in fraud or disease detection tasks, fraudulent cases or rare diseases often represent a small proportion of the available data, creating an imbalance.
Data collection bias
The collection method may result in Label Skew if certain classes are easier to collect or are collected disproportionately. For example, a dataset of images taken in an urban environment could overrepresent vehicles or people and underrepresent wildlife or natural scenes. Likewise, some items, like pants in fashion data, may be overrepresented due to specific collection methods.
Limited annotation resources
In some situations, manual annotations, which require experts or a lot of time, may not cover all categories equally. This can lead to Label Skew if some classes are more expensive to annotate (due to a lack of available data, or because annotating certain complex shapes requires more time).
Data filtering
During the process of cleaning or filtering data, it is possible that some classes may be eliminated or reduced in number disproportionately, creating an imbalance.
Seasonality or temporality
In some types of data, such as those from e-commerce or social networks, certain classes may be influenced by seasonal or temporary events. For example, during a sales period, a specific product category could be overrepresented compared to the others.
Social or cultural biases
Biases introduced by users or annotators themselves can also cause Label Skew. For example, in image recognition tasks, objects or people belonging to certain cultures or ethnic groups may be underrepresented in the data.
These causes of Label Skew highlight the complexity of data collection and annotation for AI, where an imbalance that is not taken into account can strongly affect the performance and generalization of models.
How does Data Scarcity exacerbate the problem of Label Skew?
Data Scarcity exacerbates the constraints associated with Label Skew by further limiting the quantity and diversity of data available for training artificial intelligence models. Here's how these two problems make each other worse:
Underrepresentation of minority classes
Less frequent classes become even rarer, making it even harder for the model to learn them.
Overfitting to dominant classes
The model specializes in the overrepresented classes and neglects the minority ones, which increases bias.
Inability to generalize and balance
The lack of data limits the ability of the model to generalize correctly, especially for underrepresented classes.
Increased bias in predictions
The combination of data scarcity and Label Skew reinforces biases, especially in critical areas such as the detection of fraud or diseases.
How to overcome data scarcity when annotating for AI?
Overcoming data scarcity when annotating for AI requires a combination of strategies aimed at increasing the amount of data available or maximizing the effectiveness of existing data. Here are some of the most common approaches used to manage data scarcity in this context:
Synthetic data generation
A common method is to generate artificial data from existing data. Synthetic data can be created using techniques such as GANs (Generative Adversarial Networks) or through data augmentation, for example by applying transformations (rotation, zoom, blur) to images or by introducing noise into time series. This makes it possible to create more examples while maintaining the diversity and balance of the dataset.
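The noise-injection idea for time series can be sketched in a few lines. This is a hypothetical toy example, not a production augmentation pipeline, using only the standard library:

```python
import random

def augment_series(series, n_copies=5, noise_std=0.05, seed=0):
    """Generate synthetic variants of a time series by adding Gaussian noise.

    Each copy keeps the overall shape of the original while differing
    slightly, enlarging a scarce dataset without new data collection.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [
        [x + rng.gauss(0.0, noise_std) for x in series]
        for _ in range(n_copies)
    ]

# Hypothetical scarce signal: one labeled example becomes six usable ones
original = [0.0, 0.5, 1.0, 0.5, 0.0]
synthetic = augment_series(original)
```

For images, the same principle applies with rotations, zooms, or blurs instead of additive noise; the label of each synthetic copy is inherited from the original example.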
Reuse of existing datasets for other AI projects (transfer learning)
Transfer learning consists of using a model pre-trained on another, similar dataset and adjusting it (fine-tuning) on the small amount of data available. This method makes it possible to take advantage of large existing datasets to compensate for data scarcity in a new task.
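The core idea, freezing what was already learned and fitting only a small task-specific part on the scarce data, can be illustrated with a deliberately tiny sketch. Both components here are hypothetical stand-ins (a fixed function playing the role of a frozen feature extractor, and a single threshold as the trainable "head"); a real setup would use a pre-trained network and a deep learning framework:

```python
def pretrained_features(x):
    """Stand-in for a frozen, pre-trained feature extractor: maps a raw
    input to a learned feature (here just a fixed transformation)."""
    return x * 2.0 + 1.0

def fine_tune_threshold(samples):
    """Fit the only trainable parameter (a decision threshold) on a
    small labeled target set; the feature extractor stays frozen.

    `samples` is a list of (raw_input, label) pairs with labels 0/1,
    assumed separable in feature space for this toy example.
    """
    feats = [(pretrained_features(x), y) for x, y in samples]
    neg = max(f for f, y in feats if y == 0)  # highest negative feature
    pos = min(f for f, y in feats if y == 1)  # lowest positive feature
    return (neg + pos) / 2.0                  # midpoint as decision boundary

def predict(x, threshold):
    return int(pretrained_features(x) > threshold)

# Fine-tune on only four labeled target examples
threshold = fine_tune_threshold([(0.0, 0), (0.2, 0), (0.8, 1), (1.0, 1)])
```

The point of the sketch: because the representation is reused rather than learned from scratch, a handful of target examples can be enough to adapt the model.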
Semi-supervised annotation
In a semi-supervised approach, a small portion of the data is annotated manually, and a model trained on it is used to generate predictions for the remaining unlabeled data. This model is then refined over time, combining annotated and unannotated data to enrich the dataset.
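One round of this loop is often implemented as pseudo-labeling: predict on the unlabeled pool and keep only confident predictions as new training labels. A minimal sketch with a hypothetical one-dimensional nearest-centroid "model" (a real system would use a proper classifier and iterate):

```python
def centroid(values):
    """Mean of a list of 1-D feature values."""
    return sum(values) / len(values)

def pseudo_label(labeled, unlabeled, threshold=0.3):
    """One round of self-training on a toy 1-D nearest-centroid model.

    `labeled` maps class name -> list of feature values; `unlabeled` is a
    list of feature values. Returns (value, class) pairs for the points
    whose distance to the nearest class centroid is confidently small.
    """
    centroids = {cls: centroid(vals) for cls, vals in labeled.items()}
    new = []
    for x in unlabeled:
        dist, cls = min((abs(x - c), cls) for cls, c in centroids.items())
        if dist <= threshold:  # keep only confident pseudo-labels
            new.append((x, cls))
    return new

# Hypothetical data: two manually annotated points per class
labeled = {"low": [0.0, 0.1], "high": [1.0, 1.1]}
pseudo = pseudo_label(labeled, [0.05, 0.5, 1.05])
# 0.05 and 1.05 are confidently labeled; 0.5 is ambiguous and skipped.
```

Skipping ambiguous points matters: accepting every prediction would feed the model its own errors and amplify them across iterations.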
Use of surrogate data (Surrogate Data)
When direct data is scarce, it is sometimes possible to use indirectly related or substitute data. For example, in the health field, if there is insufficient data on a rare disease, it may be useful to train a model on similar diseases and then adapt the results for the target disease.
Crowdsourcing for annotation
Crowdsourcing makes it possible to gather a large number of human contributions to quickly annotate data sets. While this requires quality checks (as not all annotations are created equal), this approach can help overcome the data scarcity by increasing the volume of annotations, especially for simple or visual tasks. However, be careful to read the working conditions of the contributors working on your datasets: you could have (bad) surprises!
Oversampling and undersampling techniques
To overcome data scarcity in certain classes, oversampling techniques can be used, where rare examples are duplicated or generated synthetically in order to balance the dataset. Conversely, undersampling the overrepresented classes can also reduce the imbalance, but this approach reduces the overall amount of data available.
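Random oversampling, the simplest form of this, just duplicates minority-class examples until every class matches the largest one. A minimal sketch on a hypothetical toy dataset (libraries such as imbalanced-learn offer ready-made versions of this):

```python
import random

def oversample(data, seed=0):
    """Randomly duplicate minority-class examples so every class reaches
    the size of the largest one. `data` maps class -> list of examples."""
    rng = random.Random(seed)
    target = max(len(examples) for examples in data.values())
    return {
        cls: examples + [rng.choice(examples) for _ in range(target - len(examples))]
        for cls, examples in data.items()
    }

# Hypothetical skewed dataset (examples stand in for images or records)
data = {"dog": list(range(90)), "cat": list(range(8)), "bird": list(range(2))}
balanced = oversample(data)
# Every class now has 90 examples; rare classes contain duplicates.
```

Note the trade-off: duplicated examples add no new information, so oversampling rebalances the loss but can encourage overfitting on the minority class; undersampling would instead discard majority examples.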
Reinforcement learning with simulators
In environments where it is difficult to collect real data, simulators can be used to train models in virtual contexts, reducing dependence on real world data. This method is common in fields such as robotics or video games.
Use of active learning
This practice involves training a model on a small amount of data and then requesting additional annotations only for the examples on which the model is least confident. This optimizes the annotation process and maximizes the efficiency of available resources while mitigating data scarcity.
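The selection step is often plain uncertainty sampling: for a binary model, pick the examples whose predicted probability is closest to 0.5. A minimal sketch with hypothetical prediction scores:

```python
def select_for_annotation(probs, k=2):
    """Uncertainty sampling: return the k example ids whose predicted
    probability is closest to 0.5, i.e. where a binary model is least
    confident and a human label would help most.

    `probs` maps example id -> predicted probability of the positive class.
    """
    by_uncertainty = sorted(probs, key=lambda i: abs(probs[i] - 0.5))
    return by_uncertainty[:k]

# Hypothetical model predictions on five unlabeled examples
predictions = {"a": 0.97, "b": 0.52, "c": 0.08, "d": 0.45, "e": 0.80}
to_annotate = select_for_annotation(predictions)
# "b" (0.52) and "d" (0.45) are the most uncertain: annotate them first.
```

In practice this loop repeats: annotate the selected batch, retrain, re-score the pool, and select again, so each annotation budget goes to the examples the model learns the most from.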
Outsourcing to experts
When building datasets for AI, it is often necessary to call on the services of human experts to annotate complex or rare data. This method can ensure high-quality annotations, supported by efficient workflows to create and manage small, specialized datasets.
By combining several of these solutions, it is possible to overcome Data Scarcity and create richer, more balanced annotated datasets, which improves the robustness and performance of artificial intelligence models.
Conclusion
Label Skew and Data Scarcity represent significant challenges in data annotation for artificial intelligence. Label imbalance, combined with a limited amount of data, can hinder the performance of AI models, leading to biases and a reduced ability to generalize.
However, through a variety of strategies, such as the use of synthetic data, transfer learning, semi-supervised learning, or access to human expert services, it is possible to overcome these obstacles.
These approaches make it possible to maximize the efficiency of the available data and to rebalance datasets to ensure more robust and better-performing models. In a field where data quality is paramount, proactive management of these challenges is essential to develop reliable and effective AI systems!