Synthetic Speech Commands
An open-source audio corpus of isolated words, generated by speech synthesis, designed for training voice command detection models.
Description
This dataset contains over 83,000 audio files generated by text-to-speech (TTS), each representing a simple word (such as “up”, “down”, “yes”, “go”). Each word is generated with variations in voice, pitch, speed, and background noise (e.g. street, train, sea). The files are one-second, 16 kHz, mono WAV recordings.
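As an illustration, a single clip can be inspected with Librosa; the file path below is hypothetical and depends on how you unpack the corpus:

```python
import librosa
import numpy as np

# Hypothetical path to one clip from the corpus
path = "synthetic_speech_commands/up/up_001.wav"

# sr=None keeps the native sampling rate instead of resampling
waveform, sr = librosa.load(path, sr=None, mono=True)

print(f"sample rate: {sr} Hz")                     # expected: 16000
print(f"duration:    {len(waveform) / sr:.2f} s")  # expected: ~1.00
print(f"peak level:  {np.max(np.abs(waveform)):.3f}")
```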
What is this dataset for?
- Train keyword spotting models (a minimal sketch follows this list)
- Test model robustness against different types of noise (synthetic and environmental)
- Create voice assistants or voice-controlled interfaces (IoT, robotics)
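For the first use case, a minimal keyword-spotting sketch with PyTorch and Torchaudio might look like the following; the word list and feature settings are assumptions, and the model is untrained (a training loop would still be needed):

```python
import torch
import torch.nn as nn
import torchaudio

# Assumed label set; the real corpus contains more words
WORDS = ["up", "down", "yes", "go"]

# Log-mel front end for 1 s clips at 16 kHz
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=40
)
to_db = torchaudio.transforms.AmplitudeToDB()

# Small CNN classifier over (1, 40, T) log-mel patches
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, len(WORDS)),
)

def predict(wav_path: str) -> str:
    """Classify one clip; assumes a mono 16 kHz WAV as in this dataset."""
    waveform, sr = torchaudio.load(wav_path)          # shape: (1, 16000)
    features = to_db(melspec(waveform)).unsqueeze(0)  # shape: (1, 1, 40, T)
    logits = model(features)
    return WORDS[int(logits.argmax(dim=-1))]
```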
Can it be enriched or improved?
Yes. This data can be mixed with real recordings to improve model robustness. Other words can also be added via the same TTS pipeline. Finally, finer-grained labels for noise type or synthetic speaker could enrich the annotations.
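As an illustration of extending the word list, the sketch below uses pyttsx3 as the TTS engine. This is an assumption: the dataset's own generation pipeline may rely on a different engine, and the intermediate output format depends on the system's TTS backend. The new words and speaking rates are illustrative choices.

```python
import pyttsx3
import librosa
import soundfile as sf

NEW_WORDS = ["left", "right", "start"]  # illustrative additions

engine = pyttsx3.init()

for word in NEW_WORDS:
    for rate in (120, 160, 200):        # vary speaking speed
        engine.setProperty("rate", rate)
        raw_path = f"raw_{word}_{rate}.wav"
        engine.save_to_file(word, raw_path)
        engine.runAndWait()

        # Convert to the dataset's format: 1 s, 16 kHz, mono WAV
        audio, _ = librosa.load(raw_path, sr=16000, mono=True)
        audio = librosa.util.fix_length(audio, size=16000)  # pad or trim to 1 s
        sf.write(f"{word}_{rate}.wav", audio, 16000)
```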
🔎 In summary
🧠 Recommended for
- Beginners in audio processing
- Voice assistant creators
- TTS robustness researchers
🔧 Compatible tools
- TensorFlow
- PyTorch
- SpeechBrain
- Torchaudio
- Librosa
💡 Tip
To simulate realistic environments, combine this dataset with natural speech samples of the same words.
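One way to do this, assuming both corpora follow a `<root>/<word>/<clip>.wav` layout (the directory names below are hypothetical):

```python
from pathlib import Path
import random

# Hypothetical roots; both corpora are assumed to use a <root>/<word>/<clip>.wav layout
synthetic_root = Path("synthetic_speech_commands")
natural_root = Path("speech_commands_v2")  # e.g. the Google Speech Commands dataset

def list_clips(root: Path) -> list[tuple[str, str]]:
    """Return (path, label) pairs, taking the label from the parent folder name."""
    return [(str(p), p.parent.name) for p in root.rglob("*.wav")]

synthetic = list_clips(synthetic_root)
natural = list_clips(natural_root)

# Keep only words present in both corpora so the label set stays consistent
shared = {label for _, label in synthetic} & {label for _, label in natural}
combined = [(path, label) for path, label in synthetic + natural if label in shared]
random.shuffle(combined)
```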
Frequently Asked Questions
Can this dataset replace human voice recordings?
It can complement or augment a real dataset, but it remains synthetic. For the best accuracy, a blend of real and synthetic data is preferred.
Is background noise included in the files?
Yes, each file combines a synthetic voice with added noise (environmental or generated) to simulate real-world conditions.
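The mixing step itself is easy to reproduce if you want additional noise conditions. Below is a rough sketch that adds a noise track to a clean clip at a chosen signal-to-noise ratio; the file names are hypothetical and this is not the dataset's own mixing code:

```python
import numpy as np
import librosa
import soundfile as sf

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR, then add it to `speech`."""
    noise = np.resize(noise, len(speech))  # loop or truncate noise to the clip length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Hypothetical input files, both loaded as 16 kHz mono
speech, sr = librosa.load("up_clean.wav", sr=16000, mono=True)
noise, _ = librosa.load("street_noise.wav", sr=16000, mono=True)

mix = mix_at_snr(speech, noise, snr_db=10.0)
mix = np.clip(mix, -1.0, 1.0)  # keep the mixture within full scale
sf.write("up_street_10dB.wav", mix, sr)
```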
Can you add your own words to this dataset?
Yes, the provided source code makes it possible to generate new synthetic words with different vocal and acoustic parameters.




