Poor data: a major obstacle in Machine Learning


🔍 As the business applications of artificial intelligence and machine learning multiply and rapidly transform various sectors, one truth remains: Data quality is the pillar on which these technological advances are based.
Machine learning (ML) has become essential in many industries and has enabled a wide range of AI products in recent years. The dominant approach is data-centric, and for ML models to truly bring value to a business, the quality of the data used is of fundamental importance. In this article, we explore why data quality is critical, and why painstaking data preparation is the foundation of the vast majority of AI products.
Why is data quality the backbone of your AI projects?
ML algorithms use data to learn and make predictions. However, not all data has the same value. Data quality is a major determinant of the accuracy and reliability of ML models.
Professionals working on ML projects (Data Scientists, Developers, Data Labelers, etc.) know the challenges well. Many projects stagnate during the test phases, prior to deployment, mainly because of poor-quality annotation of data at scale. Human error, unclear guidelines, the subjective and ambiguous nature of the annotation task, and above all a lack of supervision and recognition of the work done by Data Labelers often contribute to these problems.
Data annotated en masse but only approximately... a disaster!
Data inaccuracy can be the result of human errors, faulty data collection techniques, or issues with the data source. When an ML model is trained on incorrect data, it can make poor decisions.
Some examples to illustrate the impacts of models trained with imperfect data on products and use cases:
1. Wrong medical diagnosis
Imagine an AI system designed to help doctors diagnose diseases. If this system is trained on incorrect or incomplete medical data, it could produce erroneous diagnoses, putting patients' lives at risk. Such a situation highlights the imperative of having accurate and complete medical data to ensure the reliability of AI systems in medicine. To avoid this, and to enable the development of effective medical AI products and the training of surgeons all over the world, the SDSC collective is working on an annotated medical database for AI.
2. Machine translation errors
Machine translation systems use machine learning models to translate texts. If the training data contains errors or incorrect translations, the machine translation results may be inaccurate, which can lead to misunderstandings and communication issues.
3. False positives in computer security
In the field of computer security, systems for detecting intrusions and malicious activities are based on ML models. If the data used to train these models contains incorrect or mislabeled examples, the result can be false positives: legitimate actions falsely flagged as threats. This triggers unnecessary reactions and wastes the time of security operations center (SOC) teams, whose threat monitoring becomes polluted by false alarms.
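As a rough illustration, the cost of mislabeled training data can be quantified by the false positive rate a detector produces on a correctly labeled evaluation set. This is a minimal sketch; the events and labels below are made up for illustration and are not drawn from any real SOC tool:

```python
def false_positive_rate(predictions, ground_truth):
    """Fraction of genuinely benign events that the detector wrongly
    flags as threats. Both inputs are lists of 'threat'/'benign'."""
    benign_preds = [p for p, g in zip(predictions, ground_truth) if g == "benign"]
    if not benign_preds:
        return 0.0
    return sum(p == "threat" for p in benign_preds) / len(benign_preds)

# Hypothetical detector output over 6 events, 4 of them actually benign:
preds = ["threat", "benign", "threat", "benign", "threat", "benign"]
truth = ["threat", "benign", "benign", "benign", "threat", "benign"]
print(false_positive_rate(preds, truth))  # 1 of 4 benign events flagged -> 0.25
```

Tracking this metric on a trusted held-out set makes the impact of label noise visible before the detector ever reaches a SOC.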
4. Imperfect movie recommendation systems
Imagine a movie recommendation system that, based on machine learning, recommends movies to users according to their past preferences. An insidious bias creeps into the model, so that users are primarily recommended movies of a specific genre, such as action, at the expense of other genres such as comedy or drama.
The dataset used to train the model was unbalanced, with a massive over-representation of action movies, while other genres were underrepresented. The model therefore learned to favor action movies, neglecting the varied preferences of users. This example highlights the importance of balanced and representative training data to ensure accurate and relevant recommendations.
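One common mitigation for this kind of imbalance is to reweight classes by inverse frequency, so that under-represented genres contribute as much to the training loss as the dominant one. A minimal sketch, where the genre counts are made up for illustration:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: rare classes get larger weights so
    they count as much in the loss as heavily represented classes."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return {c: total / (n_classes * n) for c, n in counts.items()}

# Hypothetical, heavily skewed training set: 8 action, 1 comedy, 1 drama.
genres = ["action"] * 8 + ["comedy"] + ["drama"]
weights = class_weights(genres)
print(weights)  # action gets a small weight, comedy and drama large ones
```

This is the same formula scikit-learn uses for its "balanced" class-weight mode; collecting more data for the underrepresented genres remains the better fix when feasible.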
5. Vehicle emergency braking system failure
Imagine a situation where a car manufacturer sets up an automated emergency braking system, designed to detect obstacles and stop the car in case of imminent danger. This system relies on sensors, cameras, and map data to function properly.
During initial road tests, the emergency braking system does not respond appropriately to pedestrians and obstacles. In some situations it brakes sharply for no reason, while in others it does not react at all to moving objects. These malfunctions are due to erroneous sensor data and inconsistencies in the mapping data used to train the system's model.
It turns out that the data collected to train the emergency braking model was incomplete and inaccurate. The test scenarios did not cover enough real-world situations, leading to a system that was ill-prepared to respond correctly in an emergency.
This example highlights that, even in a sector like the automotive industry, where safety is paramount, the quality of the data used to train autonomous systems is crucial. Wrong or incomplete data can endanger the lives of drivers, passengers, and pedestrians, underscoring the importance of rigor in data collection and validation to ensure the reliability of autonomous driving systems.
To mitigate the impact of inaccurate data, it is essential to carefully validate the data before using it. Annotators should be trained in the task, in the annotation software (LabelBox, Encord, V7 Labs, Label Studio, CVAT, etc.), and in the level of accuracy required. Clear guidelines and annotated sample data help ensure data consistency and accuracy.
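Part of that validation can be automated with simple sanity checks run on every annotation batch before it reaches training. A minimal sketch, assuming a hypothetical bounding-box annotation format; the label set and record fields are illustrative and not tied to any of the tools mentioned above:

```python
ALLOWED_LABELS = {"pedestrian", "vehicle", "cyclist"}  # hypothetical label set

def validate_batch(records):
    """Return (index, reason) pairs for records that break basic rules,
    so they can be sent back to annotators instead of into training."""
    problems = []
    for i, rec in enumerate(records):
        if rec.get("label") not in ALLOWED_LABELS:
            problems.append((i, "unknown label"))
        x1, y1, x2, y2 = rec.get("bbox", (0, 0, 0, 0))
        if x2 <= x1 or y2 <= y1:
            problems.append((i, "degenerate bounding box"))
    return problems

batch = [
    {"label": "pedestrian", "bbox": (10, 10, 50, 80)},  # valid
    {"label": "person",     "bbox": (10, 10, 50, 80)},  # label not in the guideline
    {"label": "vehicle",    "bbox": (60, 40, 60, 90)},  # zero-width box
]
print(validate_batch(batch))  # [(1, 'unknown label'), (2, 'degenerate bounding box')]
```

Checks like these catch mechanical errors cheaply; the subjective quality of labels still requires human review and inter-annotator agreement measures.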
The trap of unrepresentative data
Unrepresentative data can skew ML models. Numerous examples in the field of facial recognition have hit the headlines. These systems are increasingly used for authentication, security, and other applications, yet several of them have shown patterns of racial and ethnic bias due to imbalanced training data.
Take the case of a facial recognition system used by law enforcement agencies to identify suspects. If training data is mostly composed of faces from a single ethnicity, the system may have trouble correctly identifying faces from other ethnic groups. This can lead to misidentification, unfair arrests, and the perpetuation of discriminatory stereotypes.
This example highlights the need for diverse and representative training data, so that facial recognition systems do not favor one ethnic group over another, and to avoid the harmful consequences of discrimination and biased justice. In addition, depending on the use case, such data benefits from being prepared by groups of annotators with varied profiles.
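A first, crude check on representativeness is simply to measure how each group is distributed in the dataset and flag those below a minimum share. A minimal sketch; the group labels and the 10% threshold are illustrative assumptions, and real fairness auditing goes well beyond raw counts:

```python
from collections import Counter

def underrepresented_groups(samples, attribute, min_share=0.10):
    """Return each group whose share of the dataset falls below
    min_share, mapped to its actual share."""
    counts = Counter(s[attribute] for s in samples)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items() if n / total < min_share}

# Hypothetical face dataset: 90 samples from group A, 5 each from B and C.
samples = [{"group": "A"}] * 90 + [{"group": "B"}] * 5 + [{"group": "C"}] * 5
print(underrepresented_groups(samples, "group"))  # {'B': 0.05, 'C': 0.05}
```

Running such a report before training makes imbalance visible early, when it can still be fixed by targeted data collection rather than post-hoc corrections.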
In conclusion...
Data quality is an essential pillar for the success of your AI projects. Annotation errors, biased data, and missing information can put the reliability of ML models at risk. By following best practices such as training image, video, and text annotators, validating data, and monitoring quality on an ongoing basis, Data Scientists and other AI developers can maximize the value of their ML initiatives and avoid many of the pitfalls of data preparation.