Ethical Regulation of AI and Data Labelling


Data labelling and data ethics: call for ethical regulation of AI in Europe
💡 Artificial intelligence (AI) is revolutionizing our world, but to ensure its ethical and responsible use, adequate regulation is essential.
Data labeling refers to the process of assigning meaningful tags or categories to raw data, such as images, text, or audio, to make it usable for training machine learning models.
In this article, we discuss the importance of data labeling in building AI products, covering data annotation, crowdsourcing, and ethical labeling. Manual labeling, in which human annotators carefully review data and assign labels, is a crucial step for ensuring the accuracy and quality of datasets. We call on the European Union (EU) to adopt the EU AI Act, while highlighting the shortcomings of the current text with regard to the AI supply chain and data management. Data labeling work, especially with human-in-the-loop processes, is essential for creating reliable training data and improving model performance.
Data labelling for AI
Data labelling is a key step in the development of AI. It consists of assigning tags or labels to data sets (or "datasets"), allowing machine learning algorithms to understand and interpret the information. Labelling applies to many types of data: text for sentiment analysis and entity recognition, images for object detection and image segmentation, and audio for speech recognition and emotion detection. However, it is imperative that this labelling is carried out carefully, accurately, and under ethical conditions in order to avoid bias and prejudice. High-quality, accurately labelled data is essential for effective model training, better model predictions, and stronger performance of machine learning models. A data-centric approach and robust data processing pipelines are crucial for reliable AI outcomes.
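To make this concrete, here is a minimal sketch in Python of what labelled examples can look like for two of the tasks mentioned above, sentiment analysis on text and object detection on images. The field names and values are purely illustrative, not a formal schema.

```python
# Minimal sketch of labelled examples for two common tasks.
# Field names ("text", "label", "bbox", ...) are illustrative, not a standard schema.

# Text data labelled for sentiment analysis: each raw sentence gets a tag.
sentiment_dataset = [
    {"text": "The product arrived on time and works perfectly.", "label": "positive"},
    {"text": "Support never answered my emails.", "label": "negative"},
]

# Image data labelled for object detection: each object gets a class and a
# bounding box expressed here as pixel coordinates (x_min, y_min, x_max, y_max).
detection_dataset = [
    {
        "image": "street_001.jpg",
        "objects": [
            {"class": "car", "bbox": (34, 120, 410, 360)},
            {"class": "pedestrian", "bbox": (450, 100, 520, 340)},
        ],
    },
]
```

In practice, such records are typically stored in formats like JSON or CSV and managed through a dedicated annotation tool.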
Data annotation requires human expertise. It involves adding supplementary information to the data, that is, a semantic layer attached to images, videos, or text, such as metadata or detailed descriptions. For text, annotation tasks include named entity recognition, which is key to labeling text data in natural language processing. When annotating images and videos, fundamental techniques include drawing bounding boxes around objects, applying image segmentation to divide images into meaningful regions, and using object tracking to follow objects across video frames. In the context of AI, it is essential that data annotation is done ethically. This means that annotators (or Data Labelers) must follow strict guidelines to ensure the integrity and objectivity of annotated data, avoiding stereotypes, discrimination, and value judgments. They must also work in decent conditions (reasonable working hours, stability, career prospects) and receive training and support to produce quality data.
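As an illustration, the sketch below shows what annotated records might look like for named entity recognition and bounding-box annotation, including traceability metadata (annotator identifier, guideline version) that supports the accountability described above. The schema and field names are hypothetical.

```python
# Sketch of annotated records as a human annotator might produce them.
# The schema ("annotator_id", "guideline_version", label names) is illustrative.

# Named-entity annotation on text: character offsets plus an entity type.
text_annotation = {
    "text": "The European Commission presented the AI Act in Brussels.",
    "entities": [
        {"start": 4, "end": 23, "label": "ORG"},   # "European Commission"
        {"start": 38, "end": 44, "label": "LAW"},  # "AI Act"
        {"start": 48, "end": 56, "label": "LOC"},  # "Brussels"
    ],
    "annotator_id": "annotator_17",  # who labelled it, for accountability and feedback
    "guideline_version": "v2.3",     # which written guidelines were applied
}

# A quick consistency check: the offsets must point at the text they claim to label.
for entity in text_annotation["entities"]:
    span = text_annotation["text"][entity["start"]:entity["end"]]
    print(span, "->", entity["label"])

# Bounding-box annotation on an image, with the same traceability metadata.
image_annotation = {
    "image": "frame_0042.jpg",
    "objects": [{"label": "bicycle", "bbox": [12, 80, 220, 310]}],  # x_min, y_min, x_max, y_max
    "annotator_id": "annotator_04",
    "guideline_version": "v2.3",
}
```

Recording who annotated what, and under which version of the guidelines, is one practical way to make the ethical requirements above auditable rather than purely declarative.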
A labeled dataset, in which each data point is annotated, is the basis of supervised learning: it lets models learn from ground truth, while unlabeled data is used in unsupervised learning to discover patterns without predefined labels. Labeled data trains natural language processing models for applications such as virtual assistants, speech recognition, machine translation, and extracting key points from text. Incorporating new data over time is also important to improve model performance and keep up with evolving challenges in AI systems.
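The sketch below illustrates this relationship between labelled data and supervised learning with a tiny sentiment classifier. scikit-learn is used only as an example library, and the texts and labels are invented for the illustration.

```python
# Minimal sketch: training a sentiment classifier on a small labelled dataset
# with scikit-learn (chosen here purely as an illustration).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labelled data: each text comes with a ground-truth tag assigned by an annotator.
texts = [
    "Excellent service, I would recommend it.",
    "Fast delivery and friendly staff.",
    "The package arrived broken and late.",
    "Terrible experience, never again.",
]
labels = ["positive", "positive", "negative", "negative"]

# The model learns the mapping from raw text to label from these examples.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

# New, unseen data is then classified using what was learned from the labels.
print(model.predict(["The staff was friendly and the delivery fast."]))
```

The quality of the predictions on new data depends directly on the accuracy and representativeness of the labels the model was trained on.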
The importance of crowdsourcing in (legacy) labeling processes
Crowdsourcing is an effective method for labeling and annotating large amounts of data. By calling on a community of contributors, it is possible to obtain fast and accurate results. However, some labeling tasks can be time-consuming, especially when high accuracy is required, so it is important to put rigorous quality control mechanisms in place to guarantee the reliability of the data produced by crowdsourcing (one simple mechanism is sketched below). If contributors need clarification, more detailed instructions are essential to ensure consistent, high-quality results. It is also worth remembering that crowdsourcing is not the only way to label large quantities of data: it is often more efficient to use a panel of domain specialists to annotate data and accept that their expertise builds gradually, rather than demanding maximum quality from the outset (as crowdsourced labeling processes often do). Data labeling is an important job, and the people willing to invest in it, Data Labelers, should be treated with dignity and regarded as AI specialists, just like Data Scientists.
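As a concrete example of such a quality control mechanism, the sketch below applies simple majority voting across several contributors and flags low-agreement items for review by an expert annotator. The votes and the agreement threshold are invented for the illustration.

```python
# Sketch of a simple quality-control step for crowdsourced labels:
# several contributors label the same item, we keep the majority label,
# and items with low agreement are escalated to an expert reviewer.
from collections import Counter

# Hypothetical raw votes: item id -> labels proposed by different contributors.
votes = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["dog", "cat", "dog"],
    "img_003": ["dog", "cat", "bird"],
}

AGREEMENT_THRESHOLD = 0.66  # illustrative value; real projects tune this

for item_id, item_votes in votes.items():
    label, count = Counter(item_votes).most_common(1)[0]
    agreement = count / len(item_votes)
    if agreement >= AGREEMENT_THRESHOLD:
        print(f"{item_id}: keep '{label}' (agreement {agreement:.0%})")
    else:
        print(f"{item_id}: low agreement ({agreement:.0%}), send to an expert annotator")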
Automated labeling and efficiency: balancing automation with ethical oversight
Automated labeling has become a cornerstone of modern data analysis, enabling organizations to process and label vast amounts of raw data with unprecedented speed. By leveraging advanced machine learning algorithms, businesses can automatically apply labels to data points, streamlining the creation of high quality training data for machine learning models. This efficiency is especially valuable when handling large-scale datasets for applications such as object detection, sentiment analysis, and natural language processing.
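A minimal sketch of this idea is shown below: a model proposes labels for raw, unlabeled items and records a confidence score alongside each label. The pretrained_model function is a toy placeholder standing in for a real classifier.

```python
# Sketch of automated labeling: a pre-trained model proposes labels for raw,
# unlabeled items. `pretrained_model` is a toy placeholder, not a real model.

def pretrained_model(text: str) -> tuple[str, float]:
    """Return a (label, confidence) pair for a piece of raw text (toy heuristic)."""
    positive_words = {"great", "excellent", "love"}
    hits = sum(word in text.lower() for word in positive_words)
    if hits > 0:
        return "positive", min(0.5 + 0.2 * hits, 0.99)
    return "negative", 0.55

unlabeled_texts = [
    "Great product, excellent support.",
    "It stopped working after two days.",
]

# The automatically produced dataset keeps the confidence score so that
# downstream checks (auditing, human review) know how much to trust each label.
auto_labeled = [
    {"text": text, "label": label, "confidence": confidence}
    for text, (label, confidence) in ((t, pretrained_model(t)) for t in unlabeled_texts)
]
print(auto_labeled)
```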
However, as automated labeling tools become more prevalent, it is crucial to ensure that data ethics remain at the forefront of data practices. Automated systems, while powerful, can inadvertently reinforce existing biases present in the training data or make errors when labeling sensitive data. Without proper oversight, these issues can compromise the integrity of labeled datasets and undermine the predictive capabilities of machine learning models.
To address these challenges, organizations should adopt best practices that prioritize ethical considerations alongside efficiency. Regular label auditing and validation by data scientists or experienced data labelers can help identify and correct errors introduced by automated labeling. Incorporating human-in-the-loop processes ensures that complex or ambiguous labeling tasks receive the necessary attention, reducing the risk of bias and improving the quality of labeled data. Transparency in the labeling process, including clear documentation of how data is labeled and which algorithms are used, further supports responsible data analysis.
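One possible shape for such a human-in-the-loop step is sketched below: low-confidence automatic labels are routed to a human reviewer, and a random sample of the confident ones is still audited for silent errors or bias. The threshold and audit rate are illustrative values.

```python
# Sketch of human-in-the-loop routing: ambiguous auto-labels go to a reviewer,
# and a fraction of confident labels is spot-checked anyway.
import random

CONFIDENCE_THRESHOLD = 0.8  # below this, a human must review the label
AUDIT_RATE = 0.1            # fraction of confident labels double-checked anyway

def route(auto_labeled_items):
    """Split auto-labeled items into accepted, human-review, and audit queues."""
    accepted, needs_review, audit_sample = [], [], []
    for item in auto_labeled_items:
        if item["confidence"] < CONFIDENCE_THRESHOLD:
            needs_review.append(item)      # ambiguous: a human decides
        elif random.random() < AUDIT_RATE:
            audit_sample.append(item)      # spot check for silent errors or bias
        else:
            accepted.append(item)
    return accepted, needs_review, audit_sample

items = [
    {"text": "Great product.", "label": "positive", "confidence": 0.95},
    {"text": "It is fine, I guess.", "label": "positive", "confidence": 0.62},
]
accepted, needs_review, audit_sample = route(items)
print(len(accepted), "accepted,", len(needs_review), "for review,", len(audit_sample), "audited")
```

Documenting these thresholds, and who reviewed what, is also part of the transparency that the paragraph above calls for.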
Ultimately, balancing the speed and scalability of automated labeling with robust ethical oversight is essential for building trustworthy machine learning models. By embedding data ethics principles into every stage of the labeling process, organizations can harness the benefits of automation while upholding their moral obligations to handle data ethically and responsibly.
Ethical labelling: a fundamental requirement
Ethical labeling is a fundamental aspect of responsible AI. It is grounded in the basic principles of ethical data collection and usage, which guide responsible practices throughout the data lifecycle. It aims to ensure that the data used to train AI models is collected, labelled, and annotated in an ethical and human-friendly manner. Business professionals play a crucial role in implementing and maintaining ethical labeling practices, ensuring compliance with regulations and industry standards. Protecting consumer data and maintaining trust with the customer base are essential for building long-term relationships and sustaining a positive reputation.
Engaging stakeholders in the development and implementation of ethical labeling policies helps promote transparency and accountability. Additionally, failure to adhere to ethical standards can result in significant legal issues, including lawsuits and regulatory penalties, especially when consumer data is misused or shared without proper consent. Transparency and fairness are key principles of ethical labeling, making it possible to avoid prejudices and discrimination during automated decision-making.
Weaknesses of the EU AI Act: AI Supply Chain and Data Management
Despite the progress made on the EU AI Act, the current text still has weaknesses in terms of the AI supply chain and data management. Clear measures are needed to ensure transparency and ethics throughout the life cycle of AI systems, from data collection to deployment and use. Accountability and control mechanisms should be put in place to guarantee adequate data management and to prevent abuse.
Data Labeling in the service of ethical AI: conclusion
It is imperative that the European Union adopt solid ethical regulation to frame the development and use of AI. Such regulation is necessary and need not hinder innovation. Data labelling and sourcing ethics are essential to responsible AI and to an AI data supply chain that respects human life and fundamental rights. It is equally important to address the current weaknesses of the EU AI Act regarding the AI supply chain and data management, in order to strengthen the protection of these rights.