LibriSpeech
LibriSpeech is a reference audio dataset in the field of automatic speech recognition (ASR). It consists of recordings of public domain books read aloud by English speakers, accompanied by accurate text transcriptions.
Approximately 1000 hours of audio in FLAC format, with associated transcripts in TXT format
Free for academic and commercial use, under a Creative Commons license (CC BY 4.0)
Description
The LibriSpeech dataset includes:
- Approximately 1000 hours of audio in English in FLAC format
- Word-for-word transcripts in TXT format, paired with the audio files (see the sketch after this list)
- Subsets organized by recording quality and difficulty (clean, other)
- Source audio drawn from the LibriVox project, read from public domain texts
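For reference, the corpus is distributed as nested speaker/chapter folders, with one .trans.txt file per chapter listing the transcript of each FLAC utterance on a separate line. The short sketch below shows one way to pair audio files with their transcript lines using only the Python standard library; the root path is illustrative and assumes a locally extracted copy.

```python
# Minimal sketch: pairing LibriSpeech FLAC files with their transcripts.
# Assumes the standard archive layout LibriSpeech/<subset>/<speaker>/<chapter>/,
# where each chapter folder holds one <speaker>-<chapter>.trans.txt file whose
# lines read "<utterance_id> <TRANSCRIPT>". The root path below is illustrative.
from pathlib import Path
from typing import Iterator, Tuple

def iter_utterances(subset_dir: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (flac_path, transcript) pairs for one subset, e.g. train-clean-100."""
    for trans_file in subset_dir.rglob("*.trans.txt"):
        chapter_dir = trans_file.parent
        for line in trans_file.read_text(encoding="utf-8").splitlines():
            utt_id, _, text = line.partition(" ")
            flac_path = chapter_dir / f"{utt_id}.flac"
            if flac_path.exists():
                yield flac_path, text

if __name__ == "__main__":
    root = Path("LibriSpeech/train-clean-100")  # adjust to your local copy
    for audio, text in iter_utterances(root):
        print(audio.name, "->", text[:60])
        break  # show just the first pair
```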
What is this dataset for?
LibriSpeech is widely used for:
- Training speech recognition models (ASR)
- Fine-tuning or evaluating pre-trained models such as Whisper, Wav2Vec, or DeepSpeech (see the sketch after this list)
- Research on speech comprehension, audio segmentation, or audio-text alignment
- Improving speech synthesis and interaction technologies
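As an illustration of the evaluation use case, the sketch below transcribes a few utterances from the "clean" validation split with a pre-trained Whisper model and computes a rough word error rate. It assumes the datasets, transformers, and jiwer packages are installed; the dataset identifier (librispeech_asr) and model checkpoint (openai/whisper-tiny.en) are illustrative choices from the Hugging Face Hub and may need adjusting.

```python
# A minimal sketch of evaluating a pre-trained ASR model on LibriSpeech with the
# Hugging Face ecosystem. Dataset id, model name, and split are illustrative;
# assumes `datasets`, `transformers`, and `jiwer` are installed.
from datasets import load_dataset
from transformers import pipeline
from jiwer import wer

# Stream the "clean" validation split so nothing has to be fully downloaded first.
ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")

refs, hyps = [], []
for sample in ds.take(5):  # a handful of utterances for a quick sanity check
    prediction = asr(sample["audio"])  # audio dict with "array" and "sampling_rate"
    refs.append(sample["text"].lower())
    hyps.append(prediction["text"].strip().lower())

print(f"Word error rate over {len(refs)} utterances: {wer(refs, hyps):.2%}")
```

For a proper benchmark, references and hypotheses should go through the same text normalization: LibriSpeech transcripts are uppercase without punctuation, while Whisper outputs cased, punctuated text.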
Can it be enriched or improved?
Yes. Although it is already highly structured, LibriSpeech can be adapted to:
- Add prosodic or phonetic annotations
- Combine with multilingual corpora for code-switching recognition
- Create noisy or accented variants to test model robustness (a minimal example follows this list)
- Integrate audio-text pairs into multimodal alignment pipelines
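As an example of the "noisy variants" idea above, here is a minimal sketch that mixes white Gaussian noise into an utterance at a chosen signal-to-noise ratio. It assumes numpy and soundfile are installed, and the file names are purely illustrative.

```python
# A minimal sketch of creating a noisy variant of a LibriSpeech utterance to probe
# model robustness. Adds white Gaussian noise at a chosen signal-to-noise ratio;
# the input/output paths are illustrative and `soundfile` + `numpy` are assumed.
import numpy as np
import soundfile as sf

def add_noise(in_path: str, out_path: str, snr_db: float = 10.0) -> None:
    """Write a copy of `in_path` with white noise mixed in at `snr_db` dB SNR."""
    speech, sr = sf.read(in_path)
    noise = np.random.randn(*speech.shape)
    # Scale the noise so that 10*log10(P_speech / P_noise) == snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    noisy = speech + scale * noise
    # Prevent clipping before writing back to FLAC.
    noisy /= max(1.0, np.max(np.abs(noisy)))
    sf.write(out_path, noisy, sr)

add_noise("61-70968-0000.flac", "61-70968-0000_snr10.flac", snr_db=10.0)
```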
🔗 Source: LibriSpeech Dataset
Frequently Asked Questions
What is the difference between the “clean” and “other” subsets?
“Clean” recordings have better audio quality and clearer diction, while “other” recordings are more challenging (stronger accents, background noise, faster speaking rate, etc.).
Can LibriSpeech be used for languages other than English?
No, LibriSpeech is exclusively in English. For other languages, there are equivalents like Common Voice, Multilingual LibriSpeech, or VoxPopuli.
Is LibriSpeech adapted to speech synthesis?
Yes, even though that is not its primary use. The well-segmented recordings and aligned transcripts make it useful for training or evaluating text-to-speech (TTS) systems.