TED-LIUM Dataset
The TED-LIUM Dataset contains audio recordings of TED talks together with their transcripts. It is a valuable resource for training automatic speech recognition (ASR) models and for analyzing spoken language in a real but structured setting.
Several hundred hours of recordings; the official releases ship NIST SPHERE (.sph) audio and STM transcripts, which are commonly converted to WAV and plain text
Freely available for research under a non-commercial Creative Commons license (BY-NC-ND 3.0)
Description
The dataset contains:
- Recordings of hundreds of TED talks (from TED.com)
- Transcripts time-aligned with the audio
- A wide range of speakers, accents and topics (education, technology, society...)
- Professional audio quality (captured in the room with a lavalier microphone)
- Three successive releases (v1, v2, v3), each with improved alignments and a larger corpus
It is often used for transcription, machine translation, or linguistic research projects.
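To make the transcript structure concrete, here is a minimal sketch of reading one transcript segment. It assumes the NIST STM layout used by the official releases (recording, channel, speaker, start time, end time, label, then the text). The `Segment` class, the `parse_stm_line` helper and the sample line are hypothetical names and data invented for illustration, not part of the corpus or of any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    recording: str  # talk / recording identifier
    channel: str
    speaker: str
    start: float    # segment start, in seconds
    end: float      # segment end, in seconds
    label: str      # free-form label field, e.g. "<o,f0,male>"
    text: str       # transcript of the segment


def parse_stm_line(line: str) -> Optional[Segment]:
    """Parse one STM line; return None for blank lines and ';;' comments."""
    line = line.strip()
    if not line or line.startswith(";;"):
        return None
    # Six whitespace-separated fields, then the transcript as the remainder.
    parts = line.split(maxsplit=6)
    if len(parts) < 6:
        raise ValueError(f"malformed STM line: {line!r}")
    recording, channel, speaker, start, end, label = parts[:6]
    text = parts[6] if len(parts) == 7 else ""
    return Segment(recording, channel, speaker,
                   float(start), float(end), label, text)


# Made-up line for illustration only (not copied from the dataset).
example = ("SpeakerName_2010 1 SpeakerName_2010 12.50 17.25 "
           "<o,f0,male> thank you all for being here")
seg = parse_stm_line(example)
print(seg.speaker, seg.end - seg.start, seg.text)
```

If the files you downloaded were already converted to another layout (for example JSON or CSV manifests), the field order above will not apply and the parser would need to be adapted.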
What is this dataset for?
TED-LIUM is used for:
- Training or fine-tuning automatic speech recognition models (wav2vec 2.0, Whisper...)
- Generating multilingual subtitles for video content
- Stylistic or lexical analysis of oral language
- The study of prosodic dynamics and discursive markers
- Training multimodal models combining audio, text and video
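For the speech recognition use case, model output is usually scored against the reference transcripts with word error rate (WER). The following is a minimal sketch of that metric using a word-level edit distance. The `word_error_rate` function is a hypothetical helper written for this page, and it assumes both strings were already normalized the same way (casing, punctuation), which real evaluations handle separately.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (fox -> box) and one insertion (jumps) over 4 words.
wer = word_error_rate("the quick brown fox", "the quick brown box jumps")
print(wer)
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.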
Can it be enriched or improved?
Yes, in particular by:
- Adding emotional, prosodic, or linguistic labels
- Combining it with TED videos for audio-visual approaches
- Producing more accurate time alignments for fine-grained segmentation tasks
- Combining it with other public speech corpora (e.g. LibriVox, Mozilla Common Voice)
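Segmentation work of the kind listed above comes down to mapping segment timestamps onto sample positions in the audio. The sketch below shows that conversion under the assumption of 16 kHz mono audio already loaded as a sequence of samples. Both helper names are hypothetical, and the sample rate should be checked against the files actually in use.

```python
from typing import Sequence, Tuple


def segment_to_samples(start_s: float, end_s: float,
                       sample_rate: int = 16000) -> Tuple[int, int]:
    """Convert a [start, end) interval in seconds to sample indices."""
    if start_s < 0 or end_s <= start_s:
        raise ValueError("expected 0 <= start < end")
    return round(start_s * sample_rate), round(end_s * sample_rate)


def slice_segment(samples: Sequence[int], start_s: float, end_s: float,
                  sample_rate: int = 16000) -> Sequence[int]:
    """Return the samples covered by one segment, clipped to the audio length."""
    first, last = segment_to_samples(start_s, end_s, sample_rate)
    return samples[first:min(last, len(samples))]


# Three seconds of silence standing in for real audio (assumed 16 kHz).
audio = [0] * (3 * 16000)
clip = slice_segment(audio, 0.5, 1.25)
print(len(clip))
```

Rounding to the nearest sample keeps adjacent segments from overlapping or leaving gaps when one segment's end time equals the next segment's start time.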
🔗 Source: TED-LIUM Dataset
Frequently Asked Questions
Are the recordings multilingual?
No. The dataset is in English, although related TEDx projects exist in other languages.
What is the advantage of this dataset compared to LibriSpeech?
TED-LIUM contains more natural and varied speech than LibriSpeech, which consists of read audiobooks. It is therefore closer to the conditions in which speech is actually used.
Can it be used to detect themes or feelings?
Yes, TED talks cover diverse and emotionally charged topics, making them a good medium for thematic or emotional discourse analysis.