
Data quality in Artificial Intelligence: an information theory approach

Written by
Nanobaly
Published on
2024-10-26

The expression Garbage In, Garbage Out is often cited in Artificial Intelligence (AI), but few understand its theoretical foundations.

The race for performance in artificial intelligence often focuses on model architecture, computing power, or optimization techniques.


However, a crucial aspect remains underestimated: the quality of training data. Imagine building a house on an unstable foundation: no matter how sophisticated the architecture, the structure will be compromised.


Likewise, an AI model trained on noisy or poorly labeled data will inevitably reproduce these defects. This reality is not only empirical; it derives directly from the fundamental principles of information theory. Understanding these principles helps to understand why investing in data quality is often more important than investing in model complexity.

Theoretical foundations

Shannon's Entropy: the measurement of information

Claude Shannon revolutionized our understanding of information by offering a quantitative measure. Shannon entropy is given by

H = -Σ p(x) log₂ p(x)

Where:

  • H is the entropy (measured in bits)
  • p(x) is the probability of occurrence of an event x
  • Σ represents the sum over all possible events

This formula tells us something fundamental: information is linked to unpredictability. A certain event (p=1) provides no new information, while a rare event provides a lot of information.
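As a quick illustration, the formula can be evaluated directly. The helper below (`shannon_entropy` is our own name, not a standard library function) sums −p·log₂p over a discrete distribution:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy H = -sum(p * log2(p)) in bits; p = 0 terms contribute 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# A certain event carries no information
print(shannon_entropy([1.0]))        # 0.0 bits
# A fair coin is maximally unpredictable over two outcomes
print(shannon_entropy([0.5, 0.5]))   # 1.0 bit
# A biased coin is more predictable, hence less informative on average
print(shannon_entropy([0.9, 0.1]))   # ≈ 0.469 bits
```

The fair coin reaching exactly 1 bit is the two-outcome maximum: any bias makes the source more predictable and lowers its entropy.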

Application to training data

In a training dataset, the total information can be broken down as follows:

H_total = H_useful + H_noise

Where:

  • H_useful represents the information relevant to our task
  • H_noise represents imperfections, errors, and artifacts

This decomposition has a crucial consequence: an AI model cannot intrinsically distinguish useful information from noise, so it will learn both, at the risk of reproducing the noise at the model's output.
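One way to make H_noise concrete: if binary labels are flipped with probability ε, the uncertainty injected by the flips is the binary entropy H(ε). This label-flip model and the helper name are illustrative assumptions, not from the article:

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) source; 0 when p is 0 or 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Extra uncertainty (a form of H_noise) injected by flipping labels at rate eps
for eps in (0.0, 0.01, 0.05, 0.5):
    print(f"flip rate {eps}: H_noise = {binary_entropy(eps):.3f} bits/label")
```

Even a 5% labeling error rate injects roughly 0.29 bits of noise per label into a 1-bit task, which is why small annotation errors are far from negligible.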

The principle of maintaining information

The fundamental limit

A fundamental theorem of information theory states that a system cannot create information; it can only transform it. For an AI model, this means:

Output_quality ≤ Input_quality

This inequality is strict: no architecture, however sophisticated, can exceed this limit.

Practical case: image upscaling

Let's take the concrete example of photo upscaling, where we want to increase the resolution of an image:

[Figure: an upscaled image, with its resolution increased, next to the original image for comparison]
(You can find a list of tools used to upscale a photo here)

The quality chain

For a high resolution (HR) image generated from a low resolution (LR) image:

PSNR_output ≤ PSNR_input - 10 * log10(upscaling_factor²)

Where:

  • PSNR (Peak Signal-to-Noise Ratio) measures image quality
  • upscaling_factor is the ratio between the resolutions (e.g., 2 to double)
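A back-of-the-envelope check of this bound (the function name is ours; the bound itself is the heuristic inequality stated above):

```python
import math

def psnr_output_bound(psnr_input_db, upscaling_factor):
    """Heuristic ceiling from the text: upscaling by a factor f
    costs about 10 * log10(f^2) dB of PSNR."""
    return psnr_input_db - 10 * math.log10(upscaling_factor ** 2)

# A 2x upscale costs ~6 dB under this bound
print(f"{psnr_output_bound(45, 2):.2f} dB")  # ≈ 38.98 dB
```

Starting from a 45 dB input, the ceiling for a 2x upscale is about 39 dB; the ~35 dB result cited in the scenarios below sits under that ceiling, as an inequality requires.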

Impact of training data

Let's consider two training scenarios:

1. High Quality Dataset

- HR images: Uncompressed 4K photos

- Average PSNR: 45dB

- Possible result: ~35dB after upscaling x2



2. Mediocre Dataset

- HR images: JPEG compressed photos

- Average PSNR: 30dB

- Maximum result: ~20dB after upscaling x2

The 15 dB difference in the final result is directly linked to the quality of the training data.

The PSNR in dB is a logarithmic measure that compares the maximum possible signal with the noise (the error).
The higher the dB, the better the quality:

The PSNR (Peak Signal-to-Noise Ratio) is defined as:

PSNR = 10 * log10 (MAX²/MSE)

Where:

  • MAX is the maximum possible pixel value (255 for 8 bits)
  • MSE is the mean squared error

For upscaling, when you increase the resolution by a factor n, MSE tends to increase, which effectively decreases the PSNR.
The quality of the result is therefore very sensitive to the noise level.
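The definition above translates directly into code. Here is a minimal pure-Python sketch for 8-bit pixel sequences (the function names are ours; in practice a library such as scikit-image provides this):

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length pixel sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(mse_value, max_val=255.0):
    """PSNR = 10 * log10(MAX^2 / MSE), in dB; 8-bit pixels by default."""
    if mse_value == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_val ** 2 / mse_value)

original = [52, 55, 61, 59, 79]
degraded = [54, 55, 60, 58, 80]
print(f"PSNR: {psnr(mse(original, degraded)):.2f} dB")
```

Note the logarithm: halving the MSE only gains about 3 dB, which is why the dB differences quoted above correspond to large differences in raw error.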

Order of magnitude of PSNR in dB for images

  • A high quality JPEG image: ~40-45 dB
  • Average JPEG compression: ~30-35 dB
  • A very compressed image: ~20-25dB

The dB being a logarithmic scale:

  • +3 dB ≈ error power halved
  • +10 dB = error power divided by 10
  • +20 dB = error power divided by 100

So when we say "~35 dB after upscaling x2", it means that:

  1. The resulting image has good quality
  2. The differences with the “perfect” image are hard to see
  3. This is typical of a good upscaling algorithm

The waterfall effect: the danger of AI-generated data

When AI-generated images are used to train other models, degradation follows a geometric progression:

Quality_n = Quality_0 * (1 - τ)^n

Where:

  • τ (tau) is the degradation rate per generation
  • n is the number of generations

This formula explains why using AI-generated images to train other models leads to rapid degradation of quality.
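The geometric decay is easy to see numerically; a minimal sketch (the function name is ours, and 10% per generation is an illustrative rate, not a measured one):

```python
def cascade_quality(original_quality, tau, n_generations):
    """Quality after n generations of training on AI-generated data:
    Quality_n = Quality_0 * (1 - tau) ** n (geometric decay)."""
    return original_quality * (1 - tau) ** n_generations

# Even a modest 10% loss per generation compounds quickly
for n in range(6):
    print(f"generation {n}: {cascade_quality(100.0, 0.10, n):.1f}% of original quality")
```

After only five generations at τ = 0.10, roughly 41% of the original quality is already gone, which is the "waterfall effect" in action.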

The importance of labelling

The quality of the labels is as crucial as that of the data itself. For a supervised model:

Max_accuracy = min(Data_quality, Label_accuracy)

This simple formula shows that even with perfect data, inaccurate labels strictly limit the achievable performance.

Practical recommendations

1. Preparing the dataset

The simplified demonstration above illustrates the crucial importance of the quality of the data used for training. We invite you to consult this article to learn more about how to prepare a quality dataset for your artificial intelligence models.

We cannot elaborate further in this article, but the informed reader will notice that the very definition of "noise" raises philosophical questions. How do you define noise?

2. Reflection: the subjective nature of noise

The very definition of “noise” in data raises profound philosophical questions. What is considered noise for one application can be critical information for another.

Let's take a photo as an example:

  • For a facial recognition model, lighting variations are “noise”
  • For a lighting analysis model, these same variations are the main information

This subjectivity of noise reminds us that the “quality” of the data is intrinsically linked to our objective. Like Schrödinger's cat, noise exists in a superposition: it is both information and disturbance, until we define our context of observation.

This duality highlights the importance of a clear and contextual definition of “quality” in our AI projects, calling into question the idea of absolute data quality.

3. Quality metrics

For each type of data, define minimum thresholds, for example:

  • Images: PSNR > 40 dB, SSIM > 0.95
  • Labels: accuracy > 98%
  • Consistency: cross-tests > 95%


The 40dB threshold is not arbitrary. In practice:

  • 40dB: Practically imperceptible differences
  • 35-40dB: Very good quality, differences visible only to experts
  • 30-35dB: Acceptable quality for general use
  • <30dB: Visible degradation

SSIM (Structural Similarity Index)

The SSIM is complementary to the PSNR:

SSIM thresholds:

  • Excellent: > 0.95
  • Good: 0.90-0.95
  • Acceptable: 0.85-0.90
  • Problematic: < 0.85

The SSIM is closer to human perception because it considers the structure of the image.

Consistency tests

Cross-tests > 95% involve:

  1. Cross validation K-Fold
  2. Internal consistency tests
  3. Verification of outliers
  4. Distribution analysis
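Point 1 above can be sketched in a few lines. Here is a contiguous (non-shuffled) K-fold index split in pure Python; the helper name is ours, and real projects would typically use scikit-learn's KFold instead:

```python
def k_fold_indices(n_samples, k=5):
    """Split sample indices into k folds; each fold serves once as the
    validation set, with the remaining indices used for training."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder when n_samples % k != 0
        end = start + fold_size if i < k - 1 else n_samples
        val = indices[start:end]
        train = indices[:start] + indices[end:]
        folds.append((train, val))
    return folds

# 10 samples, 5 folds: every sample appears in exactly one validation set
for train, val in k_fold_indices(10, k=5):
    print(f"train={train} val={val}")
```

Consistency checking then amounts to verifying that per-fold metrics (label accuracy, agreement between annotators, etc.) stay above the chosen threshold on every fold, not just on average.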

Conclusion

Information theory provides us with a rigorous framework that shows that data quality is not an option but a strict mathematical limit. An AI model, no matter how sophisticated, cannot exceed the quality of its training data.

This understanding should guide our investments: rather than looking only for ever more complex architectures, we must prioritize the quality of our training data!

Sources

Shannon entropy: 🔗 https://fr.wikipedia.org/wiki/Entropie_de_Shannon
Illustration: 🔗
https://replicate.com/philz1337x/clarity-upscaler

Academic and technical sources

  1. Shannon, C.E. (1948). “A Mathematical Theory of Communication.” Bell System Technical Journal.
  2. Wang, Z. et al. (2004). “Image Quality Assessment: From Error Visibility to Structural Similarity.” IEEE Transactions on Image Processing
  3. Goodfellow, I., Bengio, Y., & Courville, A. (2016). “Deep Learning.” MIT Press.
  4. Zhang, K. et al. (2020). “Deep Learning for Image Super-Resolution: A Survey.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
  5. Dodge, S., & Karam, L. (2016). “Understanding how image quality affects deep neural networks.” International Conference on Quality of Multimedia Experience (QOMex).