AudioMNIST
AudioMNIST is an audio dataset designed for automatic speech recognition. It contains recordings of spoken digits (0 to 9) from several dozen speakers, captured under controlled conditions. It serves as a reference dataset for short-word classification tasks and for the study of vocal representations.
Approximately 30,000 audio files, WAV format
Open access for academic and research use, under a Creative Commons Attribution license
Description
Each recording is a WAV file containing a single spoken digit. The dataset is structured with:
- 30,000 audio clips of digits (0 to 9)
- 60 different speakers (male and female)
- Information on the gender, age, and linguistic background of participants
- A controlled sound environment to minimize extraneous noise
- A 48 kHz sampling rate for high-fidelity analysis
The dataset is often used for supervised classification and self-supervised learning tasks in audio.
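As a minimal sketch of what the description above implies, the following uses only Python's standard `wave` module to inspect a WAV file's sampling rate, channel count, and duration. Since no real file path or naming scheme is given here, the clip is a synthetic sine tone written to a temporary file; `write_test_tone` and `inspect_wav` are hypothetical helpers, and mono 16-bit audio is an assumption. The last calculation simply divides the stated totals (30,000 clips, 60 speakers, 10 digits) to get the implied number of clips per speaker and digit.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 48_000  # sampling rate stated in the dataset description


def write_test_tone(path, seconds=0.5, freq=440.0, rate=SAMPLE_RATE):
    """Write a mono 16-bit sine tone; a stand-in for a real clip (assumption)."""
    n = int(seconds * rate)
    frames = b"".join(
        struct.pack("<h", int(0.5 * 32767 * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(n)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(frames)


def inspect_wav(path):
    """Return (sample_rate, channels, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnchannels(), wf.getnframes() / rate


# Clips per speaker and digit implied by the stated totals.
clips_per_speaker_digit = 30_000 // (60 * 10)

tmp_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_test_tone(tmp_path)
rate, channels, duration = inspect_wav(tmp_path)
```

Running `inspect_wav` on a real clip from the dataset is a quick way to confirm the sampling rate before choosing any downstream resampling or feature settings.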
What is this dataset for?
AudioMNIST is used for:
- Training audio classification models on simple spoken commands
- Benchmarking neural networks for speech recognition
- The study of inter-speaker variability (age, gender, accent)
- Research on vocal embeddings, phonetics, and acoustics
- Experimentation with CNN or Transformer models on spectrograms
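Models such as CNNs or Transformers typically take a time-frequency representation rather than raw samples. The sketch below computes a log-magnitude spectrogram with a Hann window and a direct DFT, using only the standard library so it runs anywhere. The frame length, hop size, and test tone are illustrative assumptions, not settings from the dataset's authors; in practice an FFT-based library would be used for speed.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def spectrogram(samples, frame_len=64, hop=32):
    """Log-magnitude spectrogram: one list of frame_len // 2 + 1 bins per frame."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = [samples[start + i] * window[i] for i in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            # Direct DFT of the windowed frame at frequency bin k.
            acc = sum(
                chunk[t] * cmath.exp(-2j * math.pi * k * t / frame_len)
                for t in range(frame_len)
            )
            bins.append(math.log(abs(acc) + 1e-10))
        frames.append(bins)
    return frames


rate = 48_000
tone_hz = 6_000  # bin width is 48000 / 64 = 750 Hz, so this tone sits on bin 8
signal = [math.sin(2 * math.pi * tone_hz * i / rate) for i in range(512)]
spec = spectrogram(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

The resulting list of frames can be treated as a 2D image (time by frequency) and fed to an image-style classifier.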
Can it be enriched or improved?
Yes, several possible paths:
- Add background noise or distortions to test robustness
- Extend the dataset to other languages or accents
- Supplement with visual data for audio-visual approaches
- Use the data for contrastive learning or audio autoencoding
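The first path above, adding background noise to test robustness, can be sketched as mixing white Gaussian noise into a clean signal at a chosen signal-to-noise ratio (SNR). This is one common augmentation, shown here under assumptions: `add_noise` and `power` are hypothetical helpers, the clean signal is a synthetic tone rather than a real clip, and the 10 dB target is arbitrary.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def add_noise(samples, snr_db, rng):
    """Return samples plus white Gaussian noise scaled to the requested SNR."""
    noise = [rng.gauss(0.0, 1.0) for _ in samples]
    target_noise_power = power(samples) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(samples, noise)]


rng = random.Random(0)  # fixed seed for reproducibility
clean = [math.sin(2 * math.pi * 440 * i / 48_000) for i in range(4_800)]
noisy = add_noise(clean, snr_db=10.0, rng=rng)

# Measure the SNR actually achieved by comparing clean signal and added noise.
residual = [n - c for n, c in zip(noisy, clean)]
measured_snr_db = 10 * math.log10(power(clean) / power(residual))
```

Evaluating a trained classifier across a range of `snr_db` values gives a simple robustness curve.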
🔗 Source: AudioMnist Dataset
Frequently Asked Questions
Can this dataset be used for commercial purposes?
No. Use is limited to academic research; for commercial use, contact the dataset's authors.
Why is it called AudioMNIST?
The name refers to the well-known MNIST dataset of handwritten digits. AudioMNIST is its spoken counterpart and follows the same idea: classifying single digits.
Are the speakers multilingual?
Yes. Although the recordings are in English, the speakers come from diverse linguistic backgrounds, which introduces a range of accents.