How-to

Conducting your data annotation campaign: our guide (1/2)

Written by Nicolas
Published on 2023-12-18
🧐 Why annotate images, videos, and texts? Why does it matter in AI?

To analyze the content of your data, train supervised algorithms, and bring your artificial intelligence project to fruition, "structured" (i.e. "annotated") data is essential.

If your data is already structured, it has been organized beforehand so that it can be represented as a table, with rows corresponding to observations and columns to variables. Structuring data upstream saves significant time, and you most likely do not need an annotation phase at all, since the data is already structured.
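
To make the distinction concrete, here is a minimal sketch of what such structured data looks like in practice, using pandas. The table mirrors the real estate example used later in this guide; the column names and values are purely illustrative:

```python
# Structured data: rows are observations, columns are variables.
import pandas as pd

ads = pd.DataFrame(
    {
        "city": ["Lyon", "Paris", "Nantes"],
        "housing_type": ["apartment", "house", "apartment"],
        "price_eur": [245_000, 890_000, 310_000],
    }
)
print(ads.head())  # directly usable by a learning algorithm, no annotation needed
```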

On the other hand, if your data is "unstructured", meaning it cannot be described by a predefined model, is not categorized, and can take very diverse forms (images, text, videos, etc.), you very likely need to annotate it. Its unstructured nature makes it much harder for artificial intelligence algorithms to exploit. In this case, organizing an annotation phase becomes necessary.

The annotation phase, which consists of assigning one or more labels to the elements of a dataset, thus produces a structured dataset that can be used to train supervised algorithms.

💡 Annotation consists of assigning to each piece of data the label that best suits it. For example, this may mean applying labels such as "dog" or "cat" to a collection of animal photographs, or selecting the appropriate labels among "city", "type of housing" and "price offered for purchase" for a series of real estate ads.
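
As an illustration, the output of such a task can be as simple as a list of records pairing each data item with its label. The file names below are hypothetical:

```python
# Hypothetical annotation records for the "dog"/"cat" example above.
annotations = [
    {"file": "img_001.jpg", "label": "dog"},
    {"file": "img_002.jpg", "label": "cat"},
    {"file": "img_003.jpg", "label": "dog"},
]

# Once labeled, the collection is structured and can feed supervised training.
for record in annotations:
    print(f"{record['file']} -> {record['label']}")
```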

The relevance and performance of your AI solution will be strongly influenced by the quality of the data. Label accuracy is an important aspect of that quality, though other aspects also play a role (completeness of the explanatory variables, detection of outliers, etc.). It is therefore essential to carry out the annotation phase with particular care so as to obtain high-quality labels. This guide outlines the key steps and some best practices to reach that goal.
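
One common way to quantify label quality (not prescribed by this guide, but widely used) is to have two annotators label the same sample and measure their agreement, for example with Cohen's kappa from scikit-learn:

```python
# Sketch: double-annotate a small sample and measure inter-annotator agreement.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["dog", "cat", "dog", "dog", "cat"]
annotator_b = ["dog", "cat", "cat", "dog", "cat"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, ~0 = chance level
```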

How to prepare your data annotation campaign? Start by identifying stakeholders

Conducting a text, image, or video annotation campaign requires a specialized team: annotators (or Data Labelers), a project manager, a Data Scientist, and possibly an administrator for the annotation platform (a labeling solution such as Label Studio or CVAT).

Below is a brief overview of the different profiles involved in annotation campaigns for AI:

The project manager (Business Expert)

The project manager, a business expert, plays an essential role in planning and monitoring the annotation process. Their responsibilities include designing the annotation schema and the associated manual, training the annotators, estimating the time required for the various annotation tasks, drawing up an annotation plan, and monitoring the project both qualitatively and quantitatively.

The Data Scientist (Technical Expert)

The Data Scientist uses tools and methods to assess the progress and quality of the annotations against the needs of an AI model. They can also pre-annotate documents, prioritize annotations, and implement programmatic methods to speed up the annotation process. Upstream of the annotation, the Data Scientist can define a data curation strategy, doing initial work on the raw data to eliminate noise (for example, unreadable frames in a video dataset).
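
For the curation example above, a minimal sketch (assuming frames have already been extracted as image files, and using OpenCV) could simply drop anything the decoder cannot read. The paths are hypothetical:

```python
# Sketch: keep only the frames OpenCV can decode before annotation starts.
import glob

import cv2

def curate_frames(pattern: str) -> list[str]:
    readable = []
    for path in glob.glob(pattern):
        if cv2.imread(path) is not None:  # imread returns None on unreadable files
            readable.append(path)
    return readable

frames = curate_frames("raw_frames/*.jpg")
print(f"{len(frames)} readable frames kept for annotation")
```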

The annotation platform administrator

The platform administrator is responsible for installing the annotation software, managing user accounts, providing documents and preparing the labeling environments, and regularly backing up annotations to avoid data loss. They also ensure the solution remains fit for purpose and carry out all the technical tests needed to use the data and metadata produced (for example: can complete data be extracted in JSON format with an acceptable level of performance?).
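
A technical test of that kind can be as simple as loading the exported JSON and checking completeness and load time. The file path and record keys below are assumptions about the export format, not a fixed standard:

```python
# Sketch: validate an exported annotation file (hypothetical path and keys).
import json
import time

REQUIRED_KEYS = {"file", "label", "annotator"}  # assumed record structure

start = time.perf_counter()
with open("export/annotations.json", encoding="utf-8") as f:
    records = json.load(f)
elapsed = time.perf_counter() - start

incomplete = [r for r in records if not REQUIRED_KEYS <= r.keys()]
print(f"{len(records)} records loaded in {elapsed:.2f}s, {len(incomplete)} incomplete")
```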

Data annotators

The profile of the annotators varies with the annotation task. Some cases simply require fluency in a language such as English or French, while others call for specific expertise (for example, knowledge of anatomy, or deep familiarity with a particular sport). Annotators are responsible for understanding the task, annotating the documents, and reporting questions or difficulties to the campaign manager as annotation proceeds.

Define the problem

The annotation process, often a preliminary phase of a larger AI project, requires thorough reflection on the project's problem before annotation actually starts. This precaution ensures that the annotations made contribute effectively to solving the project's specific problem.

The annotation process may vary depending on the intended application and the nature of the problem chosen. Therefore, it is imperative to answer a series of essential questions:

 • What problem is the project trying to solve?

 • What is the overall context of the project and what public service mission does it support?

 • What are the strategic goals of the project and how do they align with the organization's goals?

 • What are the operational goals of the project?

 • What are the expected impacts of the solution on the organization of the service, both from the point of view of public officials and users?

 • Are there similar projects that could be beneficial to explore?

 • What is the scope of the solution being considered, and how does this affect the field of data to be annotated?

Develop a data annotation schema

The annotation schema is a template that describes the annotations in your project. It must stem from the problem defined above. Concretely, it consists, at a minimum, of a set of labels (terms that characterize a given piece of information in a document) and a precise definition of each of these labels. For some projects, the annotation schema is further defined by a hierarchy between labels or by relationships between terms; labels can indeed be organized hierarchically. The schema is sometimes complemented by a task of identifying relationships between the annotated entities (for example, relating a pronoun to the noun it refers to).
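
One possible way to encode such a schema (labels, their definitions, and an optional hierarchy) is sketched below in plain Python; the labels are illustrative, not a recommendation:

```python
# Sketch: an annotation schema with per-label definitions and a hierarchy.
schema = {
    "animal": {
        "definition": "Any living animal visible in the photograph.",
        "children": {
            "dog": {"definition": "Domestic dog, any breed."},
            "cat": {"definition": "Domestic cat, any breed."},
        },
    },
}

def flat_labels(node: dict, prefix: str = "") -> list[str]:
    """Flatten the hierarchy into 'parent/child' label paths."""
    labels = []
    for name, spec in node.items():
        path = f"{prefix}{name}"
        labels.append(path)
        labels += flat_labels(spec.get("children", {}), prefix=f"{path}/")
    return labels

print(flat_labels(schema))  # ['animal', 'animal/dog', 'animal/cat']
```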

The business problem the project addresses is often complex, with many special cases or exceptions to the usual rules. Establishing an annotation schema therefore often involves simplification (which also means some loss of information or precision). However, it is important not to simplify to the extreme: a good balance must be found between simplicity and fit with the business problem. An iterative process is generally the best way to find this balance. If the purpose of the annotation is to train an artificial intelligence algorithm, it may be necessary to exclude specific cases or instructions that would be too difficult for an automated solution to reproduce.

Develop and update documentation for the annotation campaign

Documentation is a fundamental element and must evolve throughout the annotation campaign. By methodically recording the steps taken and listing the challenges encountered, documentation proves to be a valuable tool for keeping information consistent within the project team. It also plays a beneficial role in sharing lessons learned with other, similar projects.

Various types of documentation, each targeting specific functions within the project, are essential: general documentation, documentation for annotators, and documentation specifically designed for the administrator of the annotation platform.

Guide for annotators

Documentation for annotators is of paramount importance as training material. It should include a detailed description of the project to give a clear view of the intended application, a summary of the annotation hierarchy if applicable, and precise explanations of the various labels, including the methodological choices and the logic underlying the annotation. Instructions for using the annotation software, concrete examples of specific cases, and a Q&A section also help facilitate the annotation process.

Guide for the administrator of the annotation platform (V7 Labs, Encord or CVAT)

Documenting how the annotation platform works is just as important. A specific guide for the platform administrator should explain how to create accounts for annotators, upload documents, assign tasks, monitor progress, correct annotations, and export annotated documents. This documentation ensures efficient and smooth management of the platform throughout the annotation campaign.
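
As an illustration, here is a minimal admin workflow sketched with Label Studio's Python SDK (the legacy Client interface); the URL, API key, project title, and task URL are placeholders to adapt to your own instance:

```python
# Sketch: create a project, load a document, export annotations for backup.
from label_studio_sdk import Client

ls = Client(url="http://localhost:8080", api_key="YOUR_API_KEY")

# Prepare a labeling environment: an image classification project.
project = ls.start_project(
    title="Animal photos",
    label_config="""
    <View>
      <Image name="image" value="$image"/>
      <Choices name="label" toName="image">
        <Choice value="dog"/>
        <Choice value="cat"/>
      </Choices>
    </View>
    """,
)

# Provide documents to annotators, then regularly export (and back up) results.
project.import_tasks([{"image": "https://example.com/img_001.jpg"}])
backup = project.export_tasks()  # list of annotated tasks, JSON-serializable
print(f"{len(backup)} tasks exported")
```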

(The rest of this guide is available at this address.)

Innovatiana distinguishes itself by offering an end-to-end solution through its platform, accessible at https://dashboard.innovatiana.com. The platform covers data collection and annotation requirements within a single environment, centralizing everything these processes need, and can be tailored to the specific requirements of each project. It also offers the flexibility to reinforce labeling teams, promoting a collaborative and effective approach. Innovatiana thus fits a dynamic, evolving view of annotation, providing a complete solution adapted to the current challenges of artificial intelligence projects.