LibriSpeech
LibriSpeech is a reference audio dataset in the field of automatic speech recognition (ASR). It consists of recordings of public domain books read aloud by English speakers, accompanied by accurate text transcriptions.
Approximately 1000 hours of audio in FLAC format, with associated transcripts in TXT format
Free for academic and commercial use, under a Creative Commons license (CC BY 4.0)
Description
The LibriSpeech dataset includes:
- Approximately 1000 hours of audio in English in FLAC format
- Word-for-word transcripts in TXT format, paired with the audio files (see the sketch after this list)
- Subsets organized by recording quality and difficulty (clean, other)
- Source audio drawn from the LibriVox project, read from public domain texts
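For reference, the corpus is distributed as nested speaker/chapter folders, with one .trans.txt file per chapter listing the transcript of each FLAC utterance on a separate line. The short sketch below shows one way to pair audio files with their transcript lines using only the Python standard library; the root path is illustrative and assumes a locally extracted copy.

```python
# Minimal sketch: pairing LibriSpeech FLAC files with their transcripts.
# Assumes the standard archive layout LibriSpeech/<subset>/<speaker>/<chapter>/,
# where each chapter folder holds one <speaker>-<chapter>.trans.txt file whose
# lines read "<utterance_id> <TRANSCRIPT>". The root path below is illustrative.
from pathlib import Path
from typing import Iterator, Tuple

def iter_utterances(subset_dir: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (flac_path, transcript) pairs for one subset, e.g. train-clean-100."""
    for trans_file in subset_dir.rglob("*.trans.txt"):
        chapter_dir = trans_file.parent
        for line in trans_file.read_text(encoding="utf-8").splitlines():
            utt_id, _, text = line.partition(" ")
            flac_path = chapter_dir / f"{utt_id}.flac"
            if flac_path.exists():
                yield flac_path, text

if __name__ == "__main__":
    root = Path("LibriSpeech/train-clean-100")  # adjust to your local copy
    for audio, text in iter_utterances(root):
        print(audio.name, "->", text[:60])
        break  # show just the first pair
```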
What is this dataset for?
LibriSpeech is widely used for:
- Training speech recognition models (ASR)
- Fine-tuning or evaluating pre-trained models such as Whisper, Wav2Vec, or DeepSpeech (see the sketch after this list)
- Research on speech comprehension, audio segmentation, or audio-text alignment
- Improving speech synthesis and interaction technologies
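As an illustration of the evaluation use case, the sketch below transcribes a few utterances from the "clean" validation split with a pre-trained Whisper model and computes a rough word error rate. It assumes the datasets, transformers, and jiwer packages are installed; the dataset identifier (librispeech_asr) and model checkpoint (openai/whisper-tiny.en) are illustrative choices from the Hugging Face Hub and may need adjusting.

```python
# A minimal sketch of evaluating a pre-trained ASR model on LibriSpeech with the
# Hugging Face ecosystem. Dataset id, model name, and split are illustrative;
# assumes `datasets`, `transformers`, and `jiwer` are installed.
from datasets import load_dataset
from transformers import pipeline
from jiwer import wer

# Stream the "clean" validation split so nothing has to be fully downloaded first.
ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny.en")

refs, hyps = [], []
for sample in ds.take(5):  # a handful of utterances for a quick sanity check
    prediction = asr(sample["audio"])  # audio dict with "array" and "sampling_rate"
    refs.append(sample["text"].lower())
    hyps.append(prediction["text"].strip().lower())

print(f"Word error rate over {len(refs)} utterances: {wer(refs, hyps):.2%}")
```

For a proper benchmark, references and hypotheses should go through the same text normalization: LibriSpeech transcripts are uppercase without punctuation, while Whisper outputs cased, punctuated text.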
Can it be enriched or improved?
Yes. Although it is already highly structured, LibriSpeech can be adapted to:
- Add prosodic or phonetic annotations
- Combine with multilingual corpora for code-switching recognition
- Create noisy or accented variants to test model robustness (a minimal example follows this list)
- Integrate audio-text pairs into multimodal alignment pipelines
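As an example of the "noisy variants" idea above, here is a minimal sketch that mixes white Gaussian noise into an utterance at a chosen signal-to-noise ratio. It assumes numpy and soundfile are installed, and the file names are purely illustrative.

```python
# A minimal sketch of creating a noisy variant of a LibriSpeech utterance to probe
# model robustness. Adds white Gaussian noise at a chosen signal-to-noise ratio;
# the input/output paths are illustrative and `soundfile` + `numpy` are assumed.
import numpy as np
import soundfile as sf

def add_noise(in_path: str, out_path: str, snr_db: float = 10.0) -> None:
    """Write a copy of `in_path` with white noise mixed in at `snr_db` dB SNR."""
    speech, sr = sf.read(in_path)
    noise = np.random.randn(*speech.shape)
    # Scale the noise so that 10*log10(P_speech / P_noise) == snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    noisy = speech + scale * noise
    # Prevent clipping before writing back to FLAC.
    noisy /= max(1.0, np.max(np.abs(noisy)))
    sf.write(out_path, noisy, sr)

add_noise("61-70968-0000.flac", "61-70968-0000_snr10.flac", snr_db=10.0)
```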
🔗 Source: LibriSpeech Dataset
Frequently Asked Questions
What is the difference between the “clean” and “other” subsets?
“Clean” recordings have better audio quality and clearer diction, while “other” recordings are more challenging (stronger accents, background noise, faster speaking rate, etc.).
Can LibriSpeech be used for languages other than English?
No, LibriSpeech is exclusively in English. For other languages, there are equivalents like Common Voice, Multilingual LibriSpeech, or VoxPopuli.
Is LibriSpeech adapted to speech synthesis?
Yes, even though that is not its primary use. The well-segmented recordings and aligned transcripts make it useful for training or evaluating text-to-speech (TTS) systems.