Data quality in Artificial Intelligence: an information theory approach


The expression “Garbage In, Garbage Out” is often cited in Artificial Intelligence (AI), but few understand its theoretical foundations.
The race for performance in artificial intelligence often focuses on model architecture, computing power, or optimization techniques.
However, a crucial aspect remains underestimated: the quality of the training data. Imagine building a house on an unstable foundation: no matter how sophisticated the architecture, the structure will be compromised.
Likewise, an AI model trained on noisy or poorly labeled data will inevitably reproduce those defects. This reality is not merely empirical; it derives directly from the fundamental principles of information theory. Understanding these principles explains why investing in data quality is often more important than investing in model complexity.
Theoretical foundations
Shannon's Entropy: the measurement of information
Claude Shannon revolutionized our understanding of information by offering a quantitative measure. Shannon entropy is given by
H = -Σ p(x) log₂(p(x))
Where:
- H is the entropy (measured in bits)
- p(x) is the probability of occurrence of an event x
- Σ represents the sum over all possible events
This formula tells us something fundamental: information is linked to unpredictability. A certain event (p=1) provides no new information, while a rare event provides a lot of information.
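As a quick illustration, the formula can be evaluated in a few lines of Python (a minimal sketch, not tied to any particular library):

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy H = -sum(p * log2(p)), in bits.

    Events with probability 0 contribute nothing (lim p*log2(p) = 0),
    so they are skipped to avoid log2(0).
    """
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# A certain event (p = 1) carries no information...
print(shannon_entropy([1.0]))       # 0.0 bits
# ...while a fair coin flip carries exactly 1 bit.
print(shannon_entropy([0.5, 0.5]))  # 1.0 bit
```

Note how the unpredictable source (the fair coin) maximizes entropy, while the certain one minimizes it.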
Application to training data
In a training dataset, the total information can be broken down as follows:
H_total = H_useful + H_noise
Where:
- H_useful represents the information relevant to our task
- H_noise represents imperfections, errors, and artifacts
This decomposition has a crucial consequence: an AI model cannot intrinsically distinguish useful information from noise, so it will learn both, and therefore risks reproducing the noise in its output.
The principle of information conservation
The fundamental limit
A fundamental result of information theory (the data processing inequality) states that a system cannot create information; it can only transform it. For an AI model, this means:
Output_quality ≤ Input_quality
This inequality is strict: no architecture, however sophisticated, can exceed this limit.
Practical case: image upscaling
Let's take the concrete example of photo upscaling, where we want to increase the resolution of an image:

The quality chain
For a high resolution (HR) image generated from a low resolution (LR) image:
psnr_output ≤ psnr_input - 10*log10 (upscaling_factor²)
Where:
- PSNR (Peak Signal-to-Noise Ratio) measures image quality
- upscaling_factor is the ratio between the resolutions (e.g. 2 to double the resolution)
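The bound above can be worked through numerically. The sketch below only does the arithmetic of the inequality; `psnr_upper_bound` is an illustrative helper, not a standard function:

```python
import math

def psnr_upper_bound(psnr_input_db, upscaling_factor):
    """Upper bound on output PSNR from the inequality above:
    psnr_output <= psnr_input - 10 * log10(upscaling_factor^2)
    """
    return psnr_input_db - 10 * math.log10(upscaling_factor ** 2)

# A 45 dB dataset upscaled x2 can at best reach about 39 dB.
print(round(psnr_upper_bound(45, 2), 1))  # 39.0
```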
Impact of training data
Let's consider two training scenarios:
1. High Quality Dataset
- HR images: Uncompressed 4K photos
- Average PSNR: 45dB
- Possible result: ~35dB after upscaling x2
2. Mediocre Dataset
- HR images: JPEG compressed photos
- Average PSNR: 30dB
- Maximum result: ~20dB after upscaling x2
The 15 dB difference in the final result is directly linked to the quality of the training data.
The PSNR in dB is a logarithmic measure that compares the maximum possible signal with the noise (the error).
The higher the dB, the better the quality:
The PSNR (Peak Signal-to-Noise Ratio) is defined as:
PSNR = 10 * log10 (MAX²/MSE)
Where:
- MAX is the maximum possible pixel value (255 for 8-bit images)
- MSE is the mean squared error
For upscaling, when you increase the resolution by a factor n, MSE tends to increase, which effectively decreases the PSNR.
The quality of the result is therefore very sensitive to the noise level.
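A minimal PSNR computation from the definition above, assuming 8-bit pixel values; the sample pixel lists are purely illustrative:

```python
import math

def mse(img_a, img_b):
    """Mean squared error between two flat lists of pixel values."""
    return sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)

def psnr(mean_squared_error, max_value=255):
    """PSNR = 10 * log10(MAX^2 / MSE), in dB.

    Identical images (MSE = 0) have infinite PSNR.
    """
    if mean_squared_error == 0:
        return float("inf")
    return 10 * math.log10(max_value ** 2 / mean_squared_error)

reference = [100, 150, 200, 250]   # "ground truth" pixels
noisy     = [102, 149, 195, 251]   # slightly degraded version
print(round(psnr(mse(reference, noisy)), 1))  # ~39.2 dB
```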
Order of magnitude of PSNR in dB for images
- A high quality JPEG image: ~40-45 dB
- Average JPEG compression: ~30-35 dB
- A very compressed image: ~20-25dB
The dB being a logarithmic scale:
- +3 dB = 2x better quality
- +10 dB = 10x better quality
- +20 dB = 100x better quality
So when we say “~35 dB after x2 upscaling”, it means that:
- The resulting image has good quality
- The differences with the “perfect” image are hard to see
- This is typical of a good upscaling algorithm
The waterfall effect: the danger of AI-generated data
When AI-generated images are used to train other models, degradation follows a geometric progression:
Generation_quality_n = Original_quality * (1 - τ)^n
Where:
- τ (tau) is the degradation rate per generation
- n is the number of generations
This formula explains why using AI-generated images to train other models leads to a rapid degradation in quality.
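This geometric decay can be sketched in a few lines, assuming (as an illustrative simplification) a constant degradation rate τ per generation:

```python
def cascade_quality(original_quality, tau, generations):
    """Quality after n generations of training on generated data:
    quality_n = original_quality * (1 - tau) ** n
    """
    return original_quality * (1 - tau) ** generations

# Even a modest 10% loss per generation roughly halves
# quality within 7 generations.
for n in (0, 7):
    print(n, round(cascade_quality(100, 0.10, n), 1))
```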
The importance of labelling
The quality of the labels is as crucial as that of the data itself. For a supervised model:
Maximum_accuracy = min(Data_quality, Label_accuracy)
This simple formula shows that even with perfect data, inaccurate labels strictly limit the achievable performance.
Practical recommendations
1. Preparing the dataset
The simplified demonstration above illustrates the crucial importance of the quality of the data used for training. We invite you to consult this article to learn more about how to prepare a quality dataset for your artificial intelligence models.
We cannot elaborate on it in this article, but the attentive reader will notice that the definition of “noise” raises philosophical questions. How do you define noise?
2. Reflection: the subjective nature of noise
The very definition of “noise” in data raises profound philosophical questions. What is considered noise for one application can be critical information for another.
Let's take a photo as an example:
- For a facial recognition model, lighting variations are “noise”
- For a lighting analysis model, these same variations are the main information
This subjectivity of noise reminds us that the “quality” of the data is intrinsically linked to our objective. Like Schrödinger's cat, noise exists in a superposition: it is both information and disturbance, until we define our context of observation.
This duality highlights the importance of a clear and contextual definition of “quality” in our AI projects, calling into question the idea of absolute data quality.
3. Quality metrics
For each type of data, define minimum thresholds, for example:
- Images: PSNR > 40 dB, SSIM > 0.95
- Labels: accuracy > 98%
- Consistency: cross-tests > 95%
The 40dB threshold is not arbitrary. In practice:
- >40dB: Practically imperceptible differences
- 35-40dB: Very good quality, differences visible only to experts
- 30-35dB: Acceptable quality for general use
- <30dB: Visible degradation
SSIM (Structural Similarity Index)
The SSIM is complementary to the PSNR:
SSIM_thresholds = {“Excellent”: “>0.95”, “Good”: “0.90-0.95”, “Acceptable”: “0.85-0.90”, “Problematic”: “<0.85”}
The SSIM is closer to human perception because it considers the structure of the image.
Consistency tests
Cross-tests > 95% involve:
- Cross validation K-Fold
- Internal consistency tests
- Verification of outliers
- Distribution analysis
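The thresholds suggested above can be wired into a simple quality gate before training. The sketch below is illustrative only: the metric names and the `dataset_passes` helper are assumptions of this example, not a standard API:

```python
# Minimum thresholds suggested in this article (illustrative values).
THRESHOLDS = {
    "psnr_db": 40.0,           # images
    "ssim": 0.95,              # images
    "label_accuracy": 0.98,    # labels
    "cross_test_score": 0.95,  # consistency
}

def dataset_passes(metrics):
    """Return the list of metrics that fall below their minimum threshold
    (an empty list means the dataset passes the quality gate)."""
    return [name for name, minimum in THRESHOLDS.items()
            if metrics.get(name, 0.0) < minimum]

sample = {"psnr_db": 42.0, "ssim": 0.96,
          "label_accuracy": 0.97, "cross_test_score": 0.99}
print(dataset_passes(sample))  # ['label_accuracy']
```

Here the dataset fails only on label accuracy, pointing directly at where to invest cleaning effort.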
Conclusion
Information theory provides us with a rigorous framework that shows that data quality is not an option but a strict mathematical limit. An AI model, no matter how sophisticated, cannot exceed the quality of its training data.
This understanding should guide our investments: rather than looking only for ever more complex architectures, we must prioritize the quality of our training data!
Sources
Shannon entropy: 🔗 https://fr.wikipedia.org/wiki/Entropie_de_Shannon
Illustration: 🔗 https://replicate.com/philz1337x/clarity-upscaler
Academic and technical sources
- Shannon, C.E. (1948). “A Mathematical Theory of Communication.” Bell System Technical Journal.
- Wang, Z. et al. (2004). “Image Quality Assessment: From Error Visibility to Structural Similarity.” IEEE Transactions on Image Processing.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). “Deep Learning.” MIT Press.
- Zhang, K. et al. (2020). “Deep Learning for Image Super-Resolution: A Survey.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Dodge, S., & Karam, L. (2016). “Understanding how image quality affects deep neural networks.” International Conference on Quality of Multimedia Experience (QOMex).