GigaSpeech

GigaSpeech is a large multi-domain English corpus of up to 10,000 hours of high-quality transcribed audio drawn from audiobooks, podcasts, and YouTube videos. It covers speech styles ranging from read to spontaneous speech across a variety of topics. The dataset is designed for automatic speech recognition (ASR) and text-to-speech synthesis (TTS).

Size

Up to 10,000 hours of transcribed audio, stored as WAV/Opus files and split into segments

Licence

Apache 2.0

Description

The GigaSpeech dataset contains a wide range of transcribed English audio collected from audiobooks, podcasts, and YouTube videos. It is available in several configurations, from 10 hours (XS) to 10,000 hours (XL), to suit both research and industrial needs. Each audio segment comes with an accurate text transcript, which makes the corpus suitable for training robust speech recognition and synthesis models.
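
As a rough illustration of how the configurations might be used, the sketch below picks the smallest subset that covers a given training budget in hours. The XS and XL sizes come from the description above; the S, M and L sizes are those reported for the GigaSpeech release and should be checked against the version you download. The helper names `pick_subset` and `load_subset` are hypothetical, and the Hub identifier "speechcolab/gigaspeech" is an assumption about where your copy is hosted.

```python
# Subset sizes in hours. XS and XL are stated in the description; S, M and L
# are assumptions taken from the GigaSpeech release and should be verified.
SUBSET_HOURS = {"xs": 10, "s": 250, "m": 1000, "l": 2500, "xl": 10000}


def pick_subset(target_hours: float) -> str:
    """Return the smallest subset offering at least target_hours of audio."""
    if target_hours <= 0:
        raise ValueError("target_hours must be positive")
    # Walk the subsets from smallest to largest and stop at the first fit.
    for name, hours in sorted(SUBSET_HOURS.items(), key=lambda kv: kv[1]):
        if hours >= target_hours:
            return name
    raise ValueError(f"no subset offers {target_hours} hours of audio")


def load_subset(name: str):
    """Load one subset with the Hugging Face `datasets` library.

    Assumption: the corpus is published on the Hub as
    "speechcolab/gigaspeech", the subset name is the configuration name,
    and access has already been granted to your account.
    """
    from datasets import load_dataset  # optional dependency, imported lazily

    return load_dataset("speechcolab/gigaspeech", name, split="train")
```

For a quick experiment, `pick_subset(10)` selects the XS configuration, which is usually enough to validate a training pipeline before moving to the larger subsets.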

What is this dataset for?

Train automatic speech recognition (ASR) models in English on large amounts of data.
Train text-to-speech (TTS) systems on varied, high-quality audio.
Test and evaluate models in various thematic areas and speech styles.
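
ASR evaluation is commonly reported as word error rate (WER): the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python version, assuming whitespace tokenization and lowercasing as the only normalization; the function name `word_error_rate` is illustrative, and a real evaluation would also apply the text normalization appropriate to the corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Computing this score separately per source (audiobooks, podcasts, YouTube) is one way to compare model behavior across domains and speech styles.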

Can it be enriched or improved?

Yes. The dataset can be supplemented with additional annotations, finer segmentation, or new audio sources. Transcripts can also be adapted to specific use cases, and extra metadata can be added to support downstream applications.

🔎 In summary

Criterion	Evaluation
🧩Ease of Use	⭐⭐⭐☆☆ (Requires handling large volumes and varied formats)
🧼Cleaning Required	⭐⭐⭐☆☆ (Moderate – quality control recommended depending on audio sources)
🏷️Annotation Richness	⭐⭐⭐☆☆ (Accurate text transcriptions, few additional annotations)
📜Commercial License	✅ Free and commercial (Apache 2.0)
👨‍💻Ideal for Beginners	⚠️ Recommended for users with audio experience
🔁Reusable for Fine-Tuning	🔥 Excellent for ASR and TTS fine-tuning
🌍Cultural Diversity	🌐 English only, multi-domain