TED-LIUM Dataset

The TED-LIUM dataset contains audio recordings of TED talks together with their transcripts. It is a valuable resource for training automatic speech recognition (ASR) models and for analysing spoken language in a real yet structured setting.

Size

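Several hundred hours of recordings, growing from about 118 hours in release 1 to about 452 hours in release 3. The original releases distribute audio as NIST SPHERE (.sph) files with transcripts in STM format; repackaged copies may use WAV audio and plain-text transcripts.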

Licence

Freely available for non-commercial research use under the Creative Commons BY-NC-ND 3.0 licence

Description


The dataset contains:

  • Recordings of hundreds of TED talks (from TED.com)
  • Transcripts time-aligned with the audio
  • A great diversity of speakers, accents and themes (education, technology, society...)
  • Professional audio quality (captured in the room with a lavalier microphone)
  • Three successive releases (v1, v2, v3), each with improved alignments and a larger corpus

It is often used for transcription, machine translation, or linguistic research projects.
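
As an illustration of how time-aligned transcripts can be consumed, here is a minimal sketch of a parser for STM-style segment lines, the transcript format used by the original TED-LIUM releases. It assumes each line holds a recording ID, a channel, a speaker ID, a start time, an end time, a label field, and the transcript text. The names `Segment` and `parse_stm_line` are hypothetical and are not part of any official tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """One time-aligned transcript segment (hypothetical container)."""
    talk_id: str
    channel: str
    speaker: str
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    label: str    # label field, e.g. "<o,f0,male>"
    text: str

    @property
    def duration(self) -> float:
        return self.end - self.start


def parse_stm_line(line: str) -> Optional[Segment]:
    """Parse one STM-style line; return None for blank or comment lines.

    Assumed layout (whitespace separated):
        <talk_id> <channel> <speaker> <start> <end> <label> <transcript...>
    """
    line = line.strip()
    # STM files mark comment lines with a leading ";;".
    if not line or line.startswith(";;"):
        return None
    # Split off the six leading fields; the remainder is the transcript.
    parts = line.split(maxsplit=6)
    if len(parts) < 6:
        raise ValueError(f"Not enough fields in STM line: {line!r}")
    talk_id, channel, speaker, start, end, label = parts[:6]
    text = parts[6] if len(parts) == 7 else ""
    return Segment(talk_id, channel, speaker, float(start), float(end), label, text)
```

A line such as `ExampleTalk_2010 1 ExampleSpeaker 15.86 28.57 <o,f0,male> hello and welcome` (an invented sample, not taken from the corpus) would yield a `Segment` whose `text` is the transcript and whose `start` and `end` give the segment boundaries in seconds.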

What is this dataset for?


TED-LIUM is used for:

  • Training or fine-tuning automatic speech recognition models (e.g. wav2vec 2.0, Whisper)
  • Generating multilingual subtitles for video content
  • Stylistic or lexical analysis of oral language
  • Studying prosodic dynamics and discourse markers
  • Training multimodal models combining audio, text and video
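
Transcription models trained on a corpus such as this one are commonly scored with word error rate (WER), the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration of that metric. The function name `word_error_rate` is hypothetical, and the code assumes whitespace tokenisation with no text normalisation, which real evaluations usually add.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For instance, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" involves one substitution and one deletion over six reference words, so the function returns 2/6.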

Can it be enriched or improved?


Yes, in particular by:

  • Adding emotional, prosodic, or linguistic labels
  • Combining it with TED videos for audio-visual approaches
  • Refining the time alignment for fine-grained segmentation tasks
  • Combining it with other speech corpora (e.g. LibriVox, Mozilla Common Voice)

🔗 Source: TED-LIUM Dataset

Frequently Asked Questions

Are the recordings multilingual?

No, the dataset is mostly in English, although related TEDx projects in other languages exist.

What is the advantage of this dataset compared to LibriSpeech?

TED-LIUM offers more natural and varied spoken language than LibriSpeech, which consists of read audiobook speech. It is therefore closer to the conditions in which speech is actually used.

Can it be used to detect topics or sentiment?

Yes, TED talks cover diverse and emotionally charged topics, making them a suitable source for thematic or emotional discourse analysis.
