Data quality in Artificial Intelligence: an information theory approach


The expression “Garbage In, Garbage Out” is often cited in Artificial Intelligence (AI), but few understand its theoretical foundations.
The race for performance in artificial intelligence often focuses on model architecture, computing power, or optimization techniques.
However, a crucial aspect remains underestimated: the quality of the training data. Imagine building a house on an unstable foundation: no matter how sophisticated the architecture, the structure will be compromised.
Likewise, an AI model trained on noisy or poorly labeled data will inevitably reproduce those defects. This reality is not merely empirical; it derives directly from the fundamental principles of information theory. Understanding these principles explains why investing in data quality is often more important than investing in model complexity.
Theoretical foundations
Shannon's Entropy: the measurement of information
Claude Shannon revolutionized our understanding of information by offering a quantitative measure. Shannon entropy is given by
H = -Σ p(x) log₂(p(x))
Where:
- H is the entropy (measured in bits)
- p(x) is the probability of occurrence of an event x
- Σ represents the sum over all possible events
This formula tells us something fundamental: information is linked to unpredictability. A certain event (p=1) provides no new information, while a rare event provides a lot of information.
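As a quick illustration, the formula can be evaluated in a few lines of Python (a minimal sketch, not tied to any particular library):

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy H = -sum(p * log2(p)), in bits.

    Events with probability 0 contribute nothing (lim p*log2(p) = 0),
    so they are skipped to avoid log2(0).
    """
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# A certain event (p = 1) carries no information...
print(shannon_entropy([1.0]))       # 0.0 bits
# ...while a fair coin flip carries exactly 1 bit.
print(shannon_entropy([0.5, 0.5]))  # 1.0 bit
```

Note how the unpredictable source (the fair coin) maximizes entropy, while the certain one minimizes it.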
Application to training data
In a training dataset, the total information can be broken down as follows:
H_total = H_useful + H_noise
Where:
- H_useful represents the information relevant to our task
- H_noise represents imperfections, errors, and artifacts
This decomposition has a crucial consequence: an AI model cannot intrinsically distinguish useful information from noise, so it will learn both, and therefore risks reproducing the noise in its output.
The principle of information conservation
The fundamental limit
A fundamental result of information theory (the data processing inequality) states that a system cannot create information; it can only transform it. For an AI model, this means:
Output_quality ≤ Input_quality
This inequality is strict: no architecture, however sophisticated, can exceed this limit.
Practical case: image upscaling
Let's take the concrete example of photo upscaling, where we want to increase the resolution of an image:

The quality chain
For a high resolution (HR) image generated from a low resolution (LR) image:
psnr_output ≤ psnr_input - 10*log10 (upscaling_factor²)
Where:
- PSNR (Peak Signal-to-Noise Ratio) measures image quality
- upscaling_factor is the ratio between the resolutions (e.g. 2 to double the resolution)
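The bound above can be worked through numerically. The sketch below only does the arithmetic of the inequality; `psnr_upper_bound` is an illustrative helper, not a standard function:

```python
import math

def psnr_upper_bound(psnr_input_db, upscaling_factor):
    """Upper bound on output PSNR from the inequality above:
    psnr_output <= psnr_input - 10 * log10(upscaling_factor^2)
    """
    return psnr_input_db - 10 * math.log10(upscaling_factor ** 2)

# A 45 dB dataset upscaled x2 can at best reach about 39 dB.
print(round(psnr_upper_bound(45, 2), 1))  # 39.0
```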
Impact of training data
Let's consider two training scenarios:
1. High Quality Dataset
- HR images: Uncompressed 4K photos
- Average PSNR: 45dB
- Possible result: ~35dB after upscaling x2
2. Mediocre Dataset
- HR images: JPEG compressed photos
- Average PSNR: 30dB
- Maximum result: ~20dB after upscaling x2
The 15 dB difference in the final result is directly linked to the quality of the training data.
The PSNR in dB is a logarithmic measure that compares the maximum possible signal with the noise (the error).
The higher the dB, the better the quality:
The PSNR (Peak Signal-to-Noise Ratio) is defined as:
PSNR = 10 * log10 (MAX²/MSE)
Where:
- MAX is the maximum possible pixel value (255 for 8-bit images)
- MSE is the mean squared error
For upscaling, when you increase the resolution by a factor n, MSE tends to increase, which effectively decreases the PSNR.
The quality of the result is therefore very sensitive to the noise level.
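A minimal PSNR computation from the definition above, assuming 8-bit pixel values; the sample pixel lists are purely illustrative:

```python
import math

def mse(img_a, img_b):
    """Mean squared error between two flat lists of pixel values."""
    return sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)

def psnr(mean_squared_error, max_value=255):
    """PSNR = 10 * log10(MAX^2 / MSE), in dB.

    Identical images (MSE = 0) have infinite PSNR.
    """
    if mean_squared_error == 0:
        return float("inf")
    return 10 * math.log10(max_value ** 2 / mean_squared_error)

reference = [100, 150, 200, 250]   # "ground truth" pixels
noisy     = [102, 149, 195, 251]   # slightly degraded version
print(round(psnr(mse(reference, noisy)), 1))  # ~39.2 dB
```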
Order of magnitude of PSNR in dB for images
- A high quality JPEG image: ~40-45 dB
- Average JPEG compression: ~30-35 dB
- A very compressed image: ~20-25dB
The dB being a logarithmic scale:
- +3 dB = 2x better quality
- +10 dB = 10x better quality
- +20 dB = 100x better quality
So when we say “~35 dB after x2 upscaling”, it means that:
- The resulting image has good quality
- The differences with the “perfect” image are hard to see
- This is typical of a good upscaling algorithm
The waterfall effect: the danger of AI-generated data
When AI-generated images are used to train other models, degradation follows a geometric progression:
Generation_quality_n = Original_quality * (1 - τ)^n
Where:
- τ (tau) is the degradation rate per generation
- n is the number of generations
This formula explains why using AI-generated images to train other models leads to a rapid degradation in quality.
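This geometric decay can be sketched in a few lines, assuming (as an illustrative simplification) a constant degradation rate τ per generation:

```python
def cascade_quality(original_quality, tau, generations):
    """Quality after n generations of training on generated data:
    quality_n = original_quality * (1 - tau) ** n
    """
    return original_quality * (1 - tau) ** generations

# Even a modest 10% loss per generation roughly halves
# quality within 7 generations.
for n in (0, 7):
    print(n, round(cascade_quality(100, 0.10, n), 1))
```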
The importance of labelling
The quality of the labels is as crucial as that of the data itself. For a supervised model:
Maximum_accuracy = min(Data_quality, Label_accuracy)
This simple formula shows that even with perfect data, inaccurate labels strictly limit the achievable performance.
Practical recommendations
1. Preparing the dataset
The simplified demonstration above illustrates the crucial importance of the quality of the data used for training. We invite you to consult this article to learn more about how to prepare a quality dataset for your artificial intelligence models.
We cannot elaborate on it in this article, but the attentive reader will notice that the definition of “noise” raises philosophical questions. How do you define noise?
2. Reflection: the subjective nature of noise
The very definition of “noise” in data raises profound philosophical questions. What is considered noise for one application can be critical information for another.
Let's take a photo as an example:
- For a facial recognition model, lighting variations are “noise”
- For a lighting analysis model, these same variations are the main information
This subjectivity of noise reminds us that the “quality” of the data is intrinsically linked to our objective. Like Schrödinger's cat, noise exists in a superposition: it is both information and disturbance, until we define our context of observation.
This duality highlights the importance of a clear and contextual definition of “quality” in our AI projects, calling into question the idea of absolute data quality.
3. Quality metrics
For each type of data, define minimum thresholds, for example:
- Images: PSNR > 40 dB, SSIM > 0.95
- Labels: accuracy > 98%
- Consistency: cross-tests > 95%
The 40dB threshold is not arbitrary. In practice:
- >40dB: Practically imperceptible differences
- 35-40dB: Very good quality, differences visible only to experts
- 30-35dB: Acceptable quality for general use
- <30dB: Visible degradation
SSIM (Structural Similarity Index)
The SSIM is complementary to the PSNR:
SSIM_thresholds = {“Excellent”: “>0.95”, “Good”: “0.90-0.95”, “Acceptable”: “0.85-0.90”, “Problematic”: “<0.85”}
The SSIM is closer to human perception because it considers the structure of the image.
Consistency tests
Cross-tests > 95% involve:
- Cross validation K-Fold
- Internal consistency tests
- Verification of outliers
- Distribution analysis
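The thresholds suggested above can be wired into a simple quality gate before training. The sketch below is illustrative only: the metric names and the `dataset_passes` helper are assumptions of this example, not a standard API:

```python
# Minimum thresholds suggested in this article (illustrative values).
THRESHOLDS = {
    "psnr_db": 40.0,           # images
    "ssim": 0.95,              # images
    "label_accuracy": 0.98,    # labels
    "cross_test_score": 0.95,  # consistency
}

def dataset_passes(metrics):
    """Return the list of metrics that fall below their minimum threshold
    (an empty list means the dataset passes the quality gate)."""
    return [name for name, minimum in THRESHOLDS.items()
            if metrics.get(name, 0.0) < minimum]

sample = {"psnr_db": 42.0, "ssim": 0.96,
          "label_accuracy": 0.97, "cross_test_score": 0.99}
print(dataset_passes(sample))  # ['label_accuracy']
```

Here the dataset fails only on label accuracy, pointing directly at where to invest cleaning effort.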
Conclusion
Information theory provides us with a rigorous framework that shows that data quality is not an option but a strict mathematical limit. An AI model, no matter how sophisticated, cannot exceed the quality of its training data.
This understanding should guide our investments: rather than looking only for ever more complex architectures, we must prioritize the quality of our training data!
Sources
Shannon entropy: 🔗 https://fr.wikipedia.org/wiki/Entropie_de_Shannon
Illustration: 🔗 https://replicate.com/philz1337x/clarity-upscaler
Academic and technical sources
- Shannon, C.E. (1948). “A Mathematical Theory of Communication.” Bell System Technical Journal.
- Wang, Z. et al. (2004). “Image Quality Assessment: From Error Visibility to Structural Similarity.” IEEE Transactions on Image Processing.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). “Deep Learning.” MIT Press.
- Zhang, K. et al. (2020). “Deep Learning for Image Super-Resolution: A Survey.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Dodge, S., & Karam, L. (2016). “Understanding how image quality affects deep neural networks.” International Conference on Quality of Multimedia Experience (QOMex).