VoxCeleb
VoxCeleb is a massive dataset of voice recordings taken from public videos, mostly interviews and media appearances. It contains the voices of several thousand speakers, mostly celebrities, and is designed for the robust identification of people from their voices, despite noise, accents, or changes in the environment.
Over 1 million audio clips of human voices, WAV format
Free access for non-commercial use (restricted license with prior access request)
Description
The dataset was built by extracting audio from YouTube videos, with semi-automatic verification that each voice matches the face on screen. It includes:
- Over 1 million voice clips
- Several thousand speakers identified (VoxCeleb1 and VoxCeleb2)
- Metadata about each speaker (identity, nationality, gender...)
- Recordings in real, noisy or varied environments
- A reasonable balance of male and female voices, with speakers from a wide range of linguistic backgrounds
It is used to train systems that can recognize or distinguish individuals based on their voiceprints alone.
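As a rough illustration of how two voiceprints can be compared, the sketch below scores a pair of speaker embeddings with cosine similarity and accepts the pair when the score reaches a threshold. The function names, the toy vectors, and the 0.5 threshold are assumptions made for this example; they are not part of VoxCeleb or of any particular toolkit, and real systems tune the threshold on held-out trials.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.5):
    """Hypothetical decision rule: accept when the similarity reaches the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy 2-dimensional "voiceprints"; real embeddings have hundreds of dimensions.
print(same_speaker([1.0, 0.0], [0.9, 0.1]))  # nearly parallel vectors
print(same_speaker([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

In practice the embeddings would come from a speaker model trained on the dataset; only the scoring step is shown here.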
What is this dataset for?
VoxCeleb is used in numerous projects related to:
- Automatic speaker identification and verification
- Improving speech recognition systems in noisy environments
- Research in voice biometrics and audio security
- Training speaker-embedding models such as ECAPA-TDNN, and fine-tuning or evaluating pretrained encoders such as Wav2Vec or Whisper on speaker tasks
- The creation of voiceprints for personalized voice assistants
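Speaker verification systems are commonly compared using the equal error rate (EER), the operating point where the false-acceptance and false-rejection rates coincide. The following is a minimal sketch, assuming a list of similarity scores and a matching list of labels (1 for same-speaker trials, 0 for different-speaker trials); the function name and the toy trials are illustrative only. It sweeps every observed score as a threshold, which is simple but quadratic in the number of trials.

```python
def equal_error_rate(scores, labels):
    """Return (eer, threshold) for trial scores and 0/1 labels.

    A trial is accepted when its score is >= the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both same-speaker and different-speaker trials")

    best = None  # (gap between FAR and FRR, estimated EER, threshold)
    for t in sorted(set(scores)):
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        far = false_accepts / n_neg
        frr = false_rejects / n_pos
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy trials: two same-speaker pairs and two different-speaker pairs.
eer, threshold = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
print(eer, threshold)
```

With only a handful of trials the result is coarse; on a real trial list the estimate becomes much finer.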
Can it be enriched or improved?
Yes, for example:
- By adding data from underrepresented languages
- By supplementing with extracts from non-media domains (podcasts, calls)
- By standardizing audio signals for better comparative performance
- By testing voice-spoofing attacks and measuring how well systems resist them
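One simple way to standardize audio signals is to bring every clip to the same loudness before comparing systems. The sketch below is an illustration under stated assumptions: samples are floats in the range [-1, 1], the target RMS level of 0.1 is an arbitrary choice, and the function names are hypothetical. It rescales a clip to the target RMS and clips any sample that would leave the valid range.

```python
import math


def rms(samples):
    """Root-mean-square level of a non-empty sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_rms(samples, target_rms=0.1):
    """Scale samples toward target_rms, clipping the result to [-1, 1]."""
    if not samples:
        raise ValueError("samples must not be empty")
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_rms / current
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


clip = [0.5, -0.5, 0.5, -0.5]
normalized = normalize_rms(clip)
print(rms(clip), rms(normalized))
```

If clipping is triggered, the resulting RMS falls slightly below the target, so very peaky clips may call for a lower target level.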
🔗 Source: VoxCeleb Dataset
Frequently Asked Questions
Are voices anonymized or identifiable?
The voices are identifiable: they are linked to public identities (mainly celebrities) and come with detailed metadata. Their use is reserved for research.
Can this dataset be used for commercial projects?
No. VoxCeleb is available only for academic or non-commercial use. An access request must be submitted to the research team.
Is the dataset multilingual?
Yes, it covers a wide range of languages and accents, making it a robust basis for multilingual voice identification tasks.