VoxCeleb

VoxCeleb is a massive dataset of voice recordings taken from public videos, mostly interviews and media appearances. It contains the voices of several thousand speakers, mostly celebrities, and is designed for the robust identification of people from their voices, despite noise, accents, or changes in the environment.

Size

Over 1 million audio clips of human voices (VoxCeleb1 is distributed as WAV; VoxCeleb2 as AAC/M4A)

Licence

Free access for non-commercial use (restricted license with prior access request)

Description

The dataset is built by extracting audio from YouTube videos, with semi-automatic verification that each voice matches the face on screen. It includes:

Over 1 million voice clips
More than 7,000 identified speakers across VoxCeleb1 and VoxCeleb2
Metadata about each speaker (identity, nationality, gender...)
Recordings in real, noisy or varied environments
A reasonably balanced mix of male and female voices, with a wide diversity of linguistic origins

It is used to train systems that can recognize or distinguish individuals based on their voiceprints alone.

What is this dataset for?

VoxCeleb is used in numerous projects related to:

Automatic speaker recognition (identification and verification)
Improving speech recognition systems in noisy environments
Research in voice biometrics and audio security
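Training speaker-embedding models such as ECAPA-TDNN, and fine-tuning or evaluating pre-trained speech models such as Wav2Vec or Whisper on speaker tasks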
The creation of voiceprints for personalized voice assistants
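
The identification and verification tasks listed above are usually framed as comparisons between fixed-length speaker embeddings (voiceprints). The following is a minimal sketch, assuming the embeddings have already been produced by some speaker model. The toy vectors, the function names and the 0.7 threshold are illustrative assumptions, not values taken from VoxCeleb or from any specific library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def verify(enrolled, probe, threshold=0.7):
    """Verification (1:1): accept if the similarity reaches the threshold.

    The threshold is a hypothetical value; in practice it is tuned on held-out trials.
    """
    return cosine_similarity(enrolled, probe) >= threshold


def identify(enrolled_by_speaker, probe):
    """Closed-set identification (1:N): return the most similar enrolled speaker."""
    return max(
        enrolled_by_speaker,
        key=lambda name: cosine_similarity(enrolled_by_speaker[name], probe),
    )


# Toy 3-dimensional embeddings (real speaker embeddings have far more dimensions).
enrolled = {"speaker_a": [1.0, 0.0, 0.0], "speaker_b": [0.0, 1.0, 0.0]}
probe = [0.9, 0.1, 0.0]

print(identify(enrolled, probe))
print(verify(enrolled["speaker_a"], probe))
```

Verification answers "is this the claimed speaker?" with a yes/no decision, while identification picks the best match among all enrolled speakers, so the same similarity function serves both tasks.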

Can it be enriched or improved?

Yes, for example:

By adding data from underrepresented languages
By supplementing with extracts from non-media domains (podcasts, calls)
By standardizing audio signals for better comparative performance
By testing voice spoofing attacks and the countermeasures designed to resist them
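
As one illustration of the standardization point above, clips can be rescaled to a common loudness before systems are compared. The following is a minimal sketch, assuming the audio has already been decoded into float samples in the range [-1.0, 1.0]. The target RMS of 0.1 and the function names are hypothetical choices, not part of any VoxCeleb tooling.

```python
import math


def rms(samples):
    """Root-mean-square level of a non-empty sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_rms(samples, target_rms=0.1, peak_limit=1.0):
    """Scale samples toward target_rms without letting any sample exceed peak_limit.

    target_rms and peak_limit are illustrative defaults (assumptions).
    """
    if not samples:
        return []
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_rms / current
    peak = max(abs(s) for s in samples)
    # Reduce the gain if the loudest sample would otherwise clip.
    gain = min(gain, peak_limit / peak)
    return [s * gain for s in samples]


quiet_clip = [0.01, -0.01, 0.02, -0.02]
loud_clip = [0.5, -0.5, 0.5, -0.5]

print(rms(normalize_rms(quiet_clip)))
print(rms(normalize_rms(loud_clip)))
```

After normalization both clips sit at approximately the same RMS level, which removes recording gain as a confounding factor when scores are compared across sources.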

🔗 Source: VoxCeleb Dataset

Frequently Asked Questions

Are voices anonymized or identifiable?

They are identifiable: each voice is linked to a public identity (mainly celebrities) with detailed metadata. Use of the data is reserved for research.

Can this dataset be used for commercial projects?

No. VoxCeleb is only available for academic or non-commercial use. An access request must be submitted to the research team.

Is the dataset multilingual?

Yes, it covers a wide range of languages and accents, which makes it a robust basis for multilingual voice identification tasks, although coverage is uneven and some languages remain underrepresented.

Copyright © Innovatiana SAS (SIREN 913 684 668), a French & Malagasy company, 2021-2026. All rights reserved
