Strategy for manual data annotation in AI: still valid in 2025?


🔍 Creating datasets through data annotation: is it necessary for my AI development project, and what strategy should I adopt?
Introduction
Training data quality plays a leading role in the development of accurate, efficient, and reliable AI algorithms, which underlines the importance of professional data annotation teams for the success of AI initiatives.
When undertaking an AI project based on unstructured data, it is important to keep in mind the role of data annotation in AI development cycles. This article aims to serve as a comprehensive guide to help you set up your data annotation strategy for AI development. Although this step is not always required, it plays a decisive role in understanding and exploiting data to build effective products.
We will repeat it several times in this article: machine learning, a fundamental aspect of modern AI systems, depends heavily on data annotation. This practice allows machines to improve their results by mimicking human cognitive processes, without direct human intervention. It is therefore important to understand this process, and especially the issues associated with it.
Reminder: understanding data annotation in a few words
Defining the different types of data annotation
The term “data annotation” encompasses a variety of methods used to enrich data in formats such as image, text, audio, or video. It involves enriching structured or, more frequently, unstructured data with metadata to facilitate its interpretation by artificial intelligence algorithms.
Below, we explore each category in more detail.
Image annotation
Image annotation allows artificial intelligence (AI) models to instantly and accurately distinguish between various visual elements, such as eyes, nose, and eyelashes, when analyzing an individual's photo. This precision is necessary for applications such as facial filters or facial recognition, which adapt to the shape of the face and the distance from the camera. Annotations can include captions or labels, helping algorithms recognize and understand images for autonomous learning. The main types of image annotation include classification, object detection, and segmentation.
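As an illustration, a single annotated image might be represented by a structure like the following. This is a minimal sketch: the field names loosely follow COCO-style conventions, but they are assumptions for the example, not a specific platform's export schema.

```python
from collections import Counter

# A minimal, COCO-inspired representation of one annotated image.
# Field names are illustrative, not a specific platform's schema.
image_annotation = {
    "image_id": 42,
    "file_name": "portrait_001.jpg",
    "annotations": [
        {
            "label": "face",                     # classification label
            "bbox": [120, 80, 200, 260],         # [x, y, width, height] in pixels
            "segmentation": [[130, 90, 310, 90,  # polygon outlining the object
                              310, 330, 130, 330]],
        },
        {
            "label": "eye",
            "bbox": [160, 150, 40, 20],
        },
    ],
}

# Example: count how many objects of each label were annotated.
counts = Counter(a["label"] for a in image_annotation["annotations"])
print(counts)  # Counter({'face': 1, 'eye': 1})
```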
Audio annotation
Audio annotation deals with dynamic files and must take into account various parameters such as language, speaker demographics, dialects, and emotions. Techniques like timestamping and audio tagging are critical, including the annotation of nonverbal features such as silences and background noises.
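For illustration, a timestamped audio annotation might be represented like this. The schema (segment boundaries in seconds, speaker, emotion, nonverbal events) is an assumption for the sketch, not a standard format.

```python
# Illustrative sketch: timestamped audio annotation segments.
audio_segments = [
    {"start": 0.0, "end": 3.2, "speaker": "A", "text": "Hello, how can I help?",
     "language": "en", "emotion": "neutral"},
    {"start": 3.2, "end": 4.0, "speaker": None, "text": "",  # nonverbal event
     "event": "silence"},
    {"start": 4.0, "end": 7.5, "speaker": "B", "text": "I'd like to open an account.",
     "language": "en", "emotion": "calm"},
]

# Example: total annotated speech time across the file.
speech = sum(s["end"] - s["start"] for s in audio_segments if s.get("text"))
print(f"Speech: {speech:.1f}s")  # Speech: 6.7s
```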
Video annotation
It may seem obvious, but unlike a still image, a video consists of a series of frames that simulate movement. Video annotation includes adding key points, polygons, and bounding boxes to mark objects across successive frames. This approach allows AI models to learn the movement and behavior of objects, which is essential for functions like object localization and tracking.
Video annotation tasks use specific techniques such as interpolation. In video annotation, interpolation is a technique used to simplify and speed up video processing, especially when tracking moving objects across multiple frames.
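To make this concrete, here is a minimal sketch of linear interpolation between two annotated keyframes. The function name and box format are illustrative assumptions, not a specific tool's API; real annotation tools may use more sophisticated motion models, but this is the basic idea.

```python
def interpolate_box(box_start, box_end, frame, frame_start, frame_end):
    """Linearly interpolate an [x, y, w, h] bounding box between two keyframes."""
    t = (frame - frame_start) / (frame_end - frame_start)  # progress from 0.0 to 1.0
    return [s + t * (e - s) for s, e in zip(box_start, box_end)]

# The annotator draws the box only on frames 10 and 20;
# frames 11-19 are filled in automatically.
box_f10 = [100, 50, 80, 60]   # keyframe at frame 10
box_f20 = [160, 70, 80, 60]   # keyframe at frame 20
print(interpolate_box(box_f10, box_f20, frame=15,
                      frame_start=10, frame_end=20))
# -> [130.0, 60.0, 80.0, 60.0]
```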
Text annotation
Text data is everywhere, from customer reviews to social media mentions. Annotating text requires an understanding of the context, the meaning of words, and the relationship between certain sentences.
Annotation tasks such as semantic annotation, intent annotation, and sentiment annotation allow AI models to navigate the complexity of human language, including sarcasm and humor. Other processes include named entity recognition and entity linking, which identify textual elements and connect them to specific entities, and text categorization, which sorts text by topic or sentiment.
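For illustration, here is a minimal sketch of span-based text annotation using character offsets; the schema and label set are assumptions for the example, not a particular platform's format.

```python
# Illustrative sketch: span-based text annotation with character offsets.
text = "Apple opened a new store in Paris last week."

entity_spans = [
    {"start": 0,  "end": 5,  "label": "ORG"},   # "Apple"
    {"start": 28, "end": 33, "label": "LOC"},   # "Paris"
]

document_labels = {
    "topic": "retail",
    "sentiment": "neutral",
    "intent": "inform",
}

# Sanity check: the offsets must point at the intended surface strings.
for span in entity_spans:
    print(text[span["start"]:span["end"]], "->", span["label"])
```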
Using data annotation tasks: yes, but why?
Data annotation is a critical process in which the accuracy and authenticity of annotated datasets directly condition the quality of machine learning. It is an important task that should not be overlooked when preparing the datasets used to train artificial intelligence models.
💡 Through this article, we want to explore the need for an industrial annotation phase in your artificial intelligence development cycles. We will look at the strategies to adopt (manual or automated annotation, or even automated annotation enriched by manual validation).
What data? Structured, semi-structured, or unstructured?
Understanding the nature of data
When working on your annotation strategy for AI, the first step is to understand the nature of the data to be analyzed. This can be textual data, videos, or images from various sectors: healthcare for the annotation of medical images, retail for product images, or industry for images of manufacturing processes, for example.
The nature of this data (structured or not), as well as its total volume, are decisive factors: should we annotate, and if so, what approach should we take? Manual data annotation plays a critical role in industries such as healthcare, as it is often the only way to obtain reliable and unbiased datasets to train object detection models, for example.
Is it really essential to label data?
Data labeling, or the act of annotating and tagging data to make it recognizable and intelligible to machines, includes processes such as cleaning, transcription, the labeling itself, and quality assurance.
This step, which is critical when training machine learning and artificial intelligence models, is what later allows AI models to solve real-world challenges without human intervention.
Discerning the differences between manual and automatic annotation is essential in the data processing process prior to the development of an AI product.
Manual or automatic data annotation: what are the differences?
What about manual annotation?
Manual annotation involves the assignment of labels to documents, or to subsets of documents, by human actors (data annotators, also called Data Labelers). This critical task in the AI development process ensures that data can be recognized by machines for prediction and machine learning applications.
Is automating data annotation with LLMs a reality?
Automatic annotation involves computer programs taking over this task, and covers a wide range of AI applications such as autonomous driving, highlighting its critical role in AI technologies. Recently, many businesses have raised the possibility of annotating data with LLMs. What is the reality?
In reality, data annotation tasks can be automated through various methods, including rule-based techniques or supervised learning algorithms used for annotation (whose purpose is therefore not to be an end-user product, but rather an AI that prepares data for other AIs). Whatever one may say, these supervised learning algorithms still require a prior phase of data annotation.
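As an example of the rule-based approach, here is a minimal sketch in Python; the patterns and labels are invented for illustration.

```python
import re

# A minimal sketch of rule-based pre-annotation: hand-written patterns
# assign labels automatically. Rules and labels are illustrative only.
RULES = [
    (re.compile(r"\b(refund|money back)", re.I),  "billing"),
    (re.compile(r"\b(crash|error|bug)", re.I),    "technical_issue"),
]

def rule_based_label(text):
    """Return the first matching label, or None if no rule fires."""
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return None  # left for a human annotator or a trained model

tickets = ["The app keeps crashing on startup.",
           "I want my money back.",
           "How do I change my avatar?"]
for t in tickets:
    print(t, "->", rule_based_label(t))
```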
How do I choose between manual and automatic annotation?
The choice between manual and automatic annotation depends largely on the characteristics of the project. Keep your end goal in mind: if you want to build a “ground truth” dataset, it is unlikely that automatic annotation, which is often not very accurate, will meet your needs. Manual annotation, while often unequalled in accuracy, can on the other hand be expensive and time-consuming.
It is also possible to opt for a hybrid approach, combining the advantages of both methods to maximize efficiency while maintaining the quality of the annotations. We can't say it enough: understanding the needs of your use case and the expected level of data quality are the main criteria that will allow you to choose the most appropriate annotation method for training your AI.
Don't be fooled by the promises of 100% automatic annotation
Promises, always promises
The promise of 100% automatic annotation is attractive, especially because of the speed, lower costs and the possibility of automating large volumes of data. However, it's important not to be fooled by the idea that automated annotation can completely replace human intervention, especially in cases where data accuracy and contextualization are critical.
Large language models, such as OpenAI's GPT-4, offer promising capabilities for automatic annotation, processing large amounts of textual data quickly and cheaply. They can be used for annotation tasks in the social sciences, showing an ability to reproduce annotation tasks on data already labeled by humans with reasonable accuracy. However, the performance of these models varies and is often stronger in recall than in precision, indicating a tendency to catch most positive cases but with a higher risk of false positives.
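As an illustration, here is a minimal sketch of what an LLM-based annotation call might look like with the OpenAI Python client (v1+). The model name, label set, and prompt are assumptions to adapt; a real pipeline would add batching, retries, and human validation of the outputs.

```python
from openai import OpenAI  # assumes the openai Python package (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["positive", "negative", "neutral"]  # illustrative label set

def llm_annotate(text: str) -> str:
    """Ask an LLM to classify a text. A sketch, not a production pipeline."""
    response = client.chat.completions.create(
        model="gpt-4",  # model name is an example; pick what fits your budget
        messages=[
            {"role": "system",
             "content": f"Classify the sentiment of the user's text. "
                        f"Answer with exactly one of: {', '.join(LABELS)}."},
            {"role": "user", "content": text},
        ],
        temperature=0,  # stable output is preferable for annotation tasks
    )
    answer = response.choices[0].message.content.strip().lower()
    # Route unexpected answers to a human rather than trusting them blindly.
    return answer if answer in LABELS else "needs_human_review"

print(llm_annotate("The delivery was late and the box was damaged."))
```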
Tools to optimize manual annotation processes
On the other hand, annotation platforms such as CVAT offer automated annotation features for computer vision tasks, allowing for increased scale and precision in specific projects. They support the annotation of bounding boxes, object detection, image segmentation, and more, with task-based automation that helps process larger volumes of data. While this eases the work of annotators, it does not make their intervention any less important: combining these functionalities with automation is really about making manual tasks more efficient, not about automating a workflow at 100%!
Other platforms, like Argilla, are designed to facilitate data annotation, dataset management, and model monitoring in the development of machine learning systems. This platform allows users to build and refine datasets through an intuitive interface that supports a variety of annotation types, such as text tags and image annotations. While this is not automation per se, platforms like Argilla are paving the way for a hybrid approach to data annotation for AI...
A hybrid approach: the key to success?
Hybrid approaches, combining manual and automatic annotation, can also be implemented, improving accuracy while reducing the time and costs associated with annotating large data sets.
These approaches take advantage of AI to pre-annotate data, which human annotators can then check and adjust as needed. A hybrid approach makes it possible to obtain high-quality annotations by exploiting both the efficiency of automation and the refinement of human analysis.
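Here is a minimal sketch of the routing logic such a hybrid setup might use, assuming the pre-annotation model exposes a confidence score; the 0.9 threshold is an arbitrary example to tune against your own quality requirements.

```python
# Sketch of a hybrid routing rule: auto-accept confident pre-annotations,
# send uncertain ones to human annotators.
CONFIDENCE_THRESHOLD = 0.9  # assumption; tune on a validation set

def route(pre_annotations):
    """Split model pre-annotations into auto-accepted and human-review queues."""
    auto_accepted, needs_review = [], []
    for item in pre_annotations:
        if item["confidence"] >= CONFIDENCE_THRESHOLD:
            auto_accepted.append(item)
        else:
            needs_review.append(item)
    return auto_accepted, needs_review

batch = [
    {"id": 1, "label": "cat", "confidence": 0.97},
    {"id": 2, "label": "dog", "confidence": 0.62},
]
accepted, review = route(batch)
print(len(accepted), "auto-accepted;", len(review), "sent to annotators")
```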
The integration of these advanced automatic and semi-automatic annotation tools is essential for Machine Learning projects, especially computer vision, allowing companies and researchers to develop more robust and accurate models.
Challenges in perspective
However, challenges remain, especially in terms of maintaining accuracy as data structures change, which requires continuous adjustments to models to account for newly introduced information. Manual annotation remains essential for providing accurate references and for validating automatic annotations, especially in areas where nuance and context matter.
While automatic annotation tools offer significant advantages in terms of speed and cost, they should not be considered a complete solution without human supervision. Integrating human verification and using automatic annotation strategically within a broader annotation workflow is essential to maintain the quality and reliability of annotated data... and to avoid data bias!
Improving manual annotation using artificial intelligence (AI): in which cases is it relevant?
When should manual annotation be used vs. automatic annotation?
The relevance of using AI methods to structure data depends closely on the volume of data to be processed. For example, when it comes to analyzing responses to a questionnaire with a relatively modest volume of data, it may be more appropriate to opt for a manual annotation approach.
This method, although time-consuming, can accurately meet the objective of analyzing the themes addressed (by survey respondents, for example). It is important to note that the volume of data needed to develop an AI is not determined solely by a fixed threshold number of documents, but rather by criteria such as the nature and length of the documents and the complexity of the annotation task.
Machine learning can be applied to improve manual annotation, by allowing systems to learn from each annotation task to become more accurate and effective. Integrating AI into data annotation processes significantly improves the efficiency and accuracy of manual annotation, underlining its importance in developing accurate and effective AI and machine learning models.
However, when faced with a large volume of documents or a continuous flow of data, automating the annotation process generally becomes a relevant option. In these situations, the purpose of the annotation phase is to initially annotate a portion of the documents, depending on the nature of the documents and the complexity of the task.
A partial annotation of the data can be used to train a supervised algorithm, making it possible to automate annotation effectively across the entire corpus. However, be careful not to assume that automatic annotation is sufficient by itself. Generally, it produces pre-labelled data that still needs to be qualified by professional annotators before it can be exploited by an AI model.
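To illustrate this bootstrap step, here is a sketch using scikit-learn: a small manually annotated subset trains a simple text classifier, which then pre-labels the rest of the corpus for human review. The toy data and the model choice are assumptions for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A small manually annotated subset bootstraps the model.
labeled_texts = ["great product", "terrible service", "works perfectly",
                 "never again", "very satisfied", "waste of money"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(labeled_texts, labels)

# Pre-label the unannotated remainder; humans then qualify these labels.
unlabeled = ["really happy with it", "stopped working after a day"]
for text, label, proba in zip(unlabeled,
                              model.predict(unlabeled),
                              model.predict_proba(unlabeled).max(axis=1)):
    print(f"{text!r} -> {label} (confidence {proba:.2f})")
```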
How to implement AI technologies in annotation cycles?
The implementation of AI technologies in data annotation projects matters because it contributes to the quality of training data and to the performance of AI and machine learning models. The annotation task becomes more targeted for annotators, making their work more efficient. Integrating capabilities such as speech recognition is a good example of how AI-enhanced annotation can handle various types of data, including natural language data, to help understand and classify information reliably.
An approach that is often recommended is to use Active Learning in annotation processes, to improve working conditions and the efficiency of annotators. Active Learning consists in intelligently selecting the most informative examples for the algorithm in order to progressively improve its performance.
By integrating Active Learning into the manual annotation process, the process can be optimized by specifically targeting the most complex or ambiguous data, which helps to increase the efficiency and accuracy of the algorithm over time.
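As a sketch of the selection step, here is what least-confidence uncertainty sampling might look like; the function is illustrative, not a specific library's API.

```python
import numpy as np

def least_confident(probabilities, k):
    """Pick the k examples the model is least sure about (uncertainty sampling).

    `probabilities` is an (n_samples, n_classes) array of predicted class
    probabilities; a minimal sketch of the selection step in Active Learning.
    """
    confidence = probabilities.max(axis=1)   # the model's top-class probability
    return np.argsort(confidence)[:k]        # lowest confidence first

probs = np.array([[0.98, 0.02],   # easy example, little to learn from
                  [0.55, 0.45],   # ambiguous -> worth annotating next
                  [0.70, 0.30]])
print(least_confident(probs, k=1))  # -> [1]
```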
For example, let's take a task of annotating real estate ads (30 to 40 labels on average for each 500-word listing). By integrating Active Learning after 2,000 texts have been annotated, pre-annotated data can be generated for the 5,000 remaining ads, for example. This pre-annotated data is then submitted to the annotators for manual qualification: their task is to check and correct pre-annotation errors, rather than manually apply the 30 to 40 labels mentioned above to each ad.
What tools can I use to make my manual data annotation processes more efficient?
1. Collaborative annotation platforms
Introduction to collaboration and project management
For manual data annotation projects, efficiency can be greatly improved through the use of collaborative platforms that allow multiple annotators to work simultaneously on the same data set. Tools like LabelBox offer features that facilitate the distribution of tasks and the monitoring of progress in real time.
Key features and benefits
These platforms often incorporate project management functions, allowing supervisors to monitor progress, assign specific tasks, and monitor the quality of annotations on an ongoing basis. The user interface of these tools is designed to minimize human error and maximize productivity through keyboard shortcuts, customizable tagging templates, and simplified review options.
2. Using Artificial Intelligence to Assist Manual Annotation
AI support techniques
Integrating AI into manual annotation processes can dramatically speed up work while maintaining high precision. For example, tools like Snorkel AI use weak supervision approaches to automatically generate preliminary annotations that annotators can then review and refine.
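To illustrate the pattern, here is a sketch of weak supervision with hand-written labeling functions combined by a simple majority vote. Snorkel itself replaces the vote with a learned statistical label model; the functions below are invented for the example.

```python
from collections import Counter

ABSTAIN = None  # a labeling function may decline to vote

def lf_keyword_spam(text):
    return "spam" if "free money" in text.lower() else ABSTAIN

def lf_excess_punctuation(text):
    return "spam" if text.count("!") >= 3 else ABSTAIN

def lf_short_greeting(text):
    return "ham" if text.lower().startswith(("hi", "hello")) else ABSTAIN

LABELING_FUNCTIONS = [lf_keyword_spam, lf_excess_punctuation, lf_short_greeting]

def weak_label(text):
    """Combine noisy labeling-function votes into a preliminary label."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

print(weak_label("FREE MONEY!!! Click now!!!"))       # -> spam
print(weak_label("Hello, see you at the meeting."))   # -> ham
```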
Benefits of the hybrid approach
A hybrid method combining manual annotations with automated workflows not only reduces the time spent annotating each data point but also improves the consistency of annotated data by offering initial labels based on advanced machine learning algorithms.
3. Review and quality control systems
Importance of quality control
Quality control is essential in any data annotation process to ensure the reliability and usefulness of annotated data. Integrating review systems where annotations are regularly checked and validated by other team members or supervisors can help maintain the high quality standards needed to train models.
Review tools and methods
Features like built-in comments, change histories, and alerts for inconsistencies are key elements that platforms like Prodigy and LightTag offer to facilitate textual annotation processes, for example. These tools also make it possible to produce detailed metrics on the performance of annotators, which helps identify training or continuous improvement needs.
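One common quality metric, for example, is inter-annotator agreement. The sketch below computes Cohen's kappa with scikit-learn on illustrative data; kappa corrects raw agreement for the agreement expected by chance.

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators labeled the same 8 items; the data is illustrative only.
annotator_a = ["pos", "neg", "pos", "pos", "neg", "neu", "pos", "neg"]
annotator_b = ["pos", "neg", "pos", "neu", "neg", "neu", "pos", "pos"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
# Values above ~0.8 are usually considered strong agreement.
```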
4. Ongoing training and support for annotators
Role of training
Ongoing training for annotators plays an important role in improving the quality of annotated data. Offering regular training sessions and learning resources for annotators can help align their understanding of annotation criteria and increase their effectiveness. We can't say it enough: before using the services of a Data Labeling provider, consider formalizing an annotation manual!
Use of online resources and tutorials
Platforms like Coursera and Udemy offer specific courses on data annotation that may be useful. Additionally, video tutorials and step-by-step guides available on these annotation platforms can also be valuable resources.
The importance of ethical responsibilities in data labeling
Ensuring fair and equitable practices
It is important to consider one's ethical responsibilities when it comes to Data Labeling, to ensure fair and equitable practices in the development of AI models. Ensuring an ethical data annotation process involves establishing safe, sustainable, and equitable employment practices for those who do this work, ensuring that they are provided with dignified working conditions and fair remuneration. Annotation work is often dismissed as laborious and degrading; we believe, on the contrary, that it is a vector of job creation and development in countries where opportunities are sometimes few and far between.
Furthermore, diversity and inclusion must be at the heart of annotation practices to avoid the introduction of biases that could negatively affect the equity and representativeness of AI models. This means integrating diverse perspectives and maintaining an inclusive environment among data annotation teams, so that all cultures and individuals involved in AI models are fairly represented.
Detect and reduce model biases
In addition, it is essential to adopt proactive measures to detect and reduce bias in the early stages of data collection and processing. This includes using pre-processing techniques to balance datasets and post-processing methods to adjust models to minimize persistent biases.
For these efforts to be effective, it is recommended to establish a continuous evaluation and feedback system, allowing the accuracy and precision of annotations to be monitored and improved on a regular basis. Regular data audits can also be beneficial, providing an independent perspective on annotation practices and helping to maintain accountability and transparency.
💡 In short, the adoption of these ethical practices in data annotation is not only a legal or moral necessity, but also an essential component for the development of fair and reliable AI technologies.
Recognizing the work of Data Labeling for its true value
Finally, it is essential to recognize that for many Data Labelers across the world, artificial intelligence offers significant opportunities for professional and economic development.
In many countries (for example, this is the case in Madagascar), jobs in the field of Data Labeling provide a stable source of income and allow individuals to acquire valuable technical skills in a fast-growing sector. These opportunities can be especially valuable in regions where traditional employment options are limited or in decline.
Businesses that employ Data Labelers therefore have a responsibility to maximize these opportunities by not only providing fair and safe working conditions, but also by offering training and advancement opportunities.
By doing so, they contribute not only to the improvement of the living conditions of their employees but also to the promotion of local economic development. This creates a virtuous circle where technological advancements not only benefit businesses, but also the communities that support these technologies through their daily work.
Conclusion
The balance between manual and automatic annotation is adjusted according to the specific requirements of data annotation campaigns and artificial intelligence projects. We believe that a dynamic approach that evolves over time is essential.
In this context, Innovatiana distinguishes itself by offering a complete solution through its services and its platform accessible at https://dashboard.innovatiana.com. This platform allows access to labelled data on demand, to meet the varied needs of projects, while offering the possibility of strengthening the labeling teams by mobilizing our team of Data Labelers.
Innovatiana is thus fully in line with a dynamic and progressive vision of annotation in artificial intelligence projects, offering a complete response adapted to current challenges. In a nutshell, selecting a company that specializes in data annotation, or “labeling,” is important for the success of AI projects. It is up to you to select the right partner to build your datasets and obtain accurate and reliable AI models!