Synthetic Speech Commands
An open-source audio corpus of isolated words, generated by speech synthesis, designed for training voice command detection models.
Description
This dataset contains over 83,000 audio files generated by text-to-speech (TTS), each representing a simple word (such as “up”, “down”, “yes”, “go”). Each word is generated with variations in voice, pitch, speed, and background noise (e.g. street, train, sea). The files are one-second, 16 kHz, mono WAV recordings.
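As an illustration, a single clip can be inspected with Librosa; the file path below is hypothetical and depends on how you unpack the corpus:

```python
import librosa
import numpy as np

# Hypothetical path to one clip from the corpus
path = "synthetic_speech_commands/up/up_001.wav"

# sr=None keeps the native sampling rate instead of resampling
waveform, sr = librosa.load(path, sr=None, mono=True)

print(f"sample rate: {sr} Hz")                     # expected: 16000
print(f"duration:    {len(waveform) / sr:.2f} s")  # expected: ~1.00
print(f"peak level:  {np.max(np.abs(waveform)):.3f}")
```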
What is this dataset for?
- Train keyword spotting models (a minimal sketch follows this list)
- Test model robustness against different types of noise (synthetic and environmental)
- Create voice assistants or voice-controlled interfaces (IoT, robotics)
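For the first use case, a minimal keyword-spotting sketch with PyTorch and Torchaudio might look like the following; the word list and feature settings are assumptions, and the model is untrained (a training loop would still be needed):

```python
import torch
import torch.nn as nn
import torchaudio

# Assumed label set; the real corpus contains more words
WORDS = ["up", "down", "yes", "go"]

# Log-mel front end for 1 s clips at 16 kHz
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=40
)
to_db = torchaudio.transforms.AmplitudeToDB()

# Small CNN classifier over (1, 40, T) log-mel patches
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, len(WORDS)),
)

def predict(wav_path: str) -> str:
    """Classify one clip; assumes a mono 16 kHz WAV as in this dataset."""
    waveform, sr = torchaudio.load(wav_path)          # shape: (1, 16000)
    features = to_db(melspec(waveform)).unsqueeze(0)  # shape: (1, 1, 40, T)
    logits = model(features)
    return WORDS[int(logits.argmax(dim=-1))]
```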
Can it be enriched or improved?
Yes. This data can be mixed with real recordings to improve model robustness. Other words can also be added via the same TTS pipeline. Finally, finer-grained labels for noise type or synthetic speaker could enrich the annotations.
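As an illustration of extending the word list, the sketch below uses pyttsx3 as the TTS engine. This is an assumption: the dataset's own generation pipeline may rely on a different engine, and the intermediate output format depends on the system's TTS backend. The new words and speaking rates are illustrative choices.

```python
import pyttsx3
import librosa
import soundfile as sf

NEW_WORDS = ["left", "right", "start"]  # illustrative additions

engine = pyttsx3.init()

for word in NEW_WORDS:
    for rate in (120, 160, 200):        # vary speaking speed
        engine.setProperty("rate", rate)
        raw_path = f"raw_{word}_{rate}.wav"
        engine.save_to_file(word, raw_path)
        engine.runAndWait()

        # Convert to the dataset's format: 1 s, 16 kHz, mono WAV
        audio, _ = librosa.load(raw_path, sr=16000, mono=True)
        audio = librosa.util.fix_length(audio, size=16000)  # pad or trim to 1 s
        sf.write(f"{word}_{rate}.wav", audio, 16000)
```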
🔎 In summary
🧠 Recommended for
- Beginners in audio processing
- Voice assistant creators
- TTS robustness researchers
🔧 Compatible tools
- TensorFlow
- PyTorch
- SpeechBrain
- Torchaudio
- Librosa
💡 Tip
To simulate realistic environments, combine this dataset with natural speech samples of the same words.
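One way to do this, assuming both corpora follow a `<root>/<word>/<clip>.wav` layout (the directory names below are hypothetical):

```python
from pathlib import Path
import random

# Hypothetical roots; both corpora are assumed to use a <root>/<word>/<clip>.wav layout
synthetic_root = Path("synthetic_speech_commands")
natural_root = Path("speech_commands_v2")  # e.g. the Google Speech Commands dataset

def list_clips(root: Path) -> list[tuple[str, str]]:
    """Return (path, label) pairs, taking the label from the parent folder name."""
    return [(str(p), p.parent.name) for p in root.rglob("*.wav")]

synthetic = list_clips(synthetic_root)
natural = list_clips(natural_root)

# Keep only words present in both corpora so the label set stays consistent
shared = {label for _, label in synthetic} & {label for _, label in natural}
combined = [(path, label) for path, label in synthetic + natural if label in shared]
random.shuffle(combined)
```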
Frequently Asked Questions
Can this dataset replace human voice recordings?
It can complement or augment a real dataset, but it remains synthetic. For the best accuracy, a blend of real and synthetic data is preferred.
Is background noise included in the files?
Yes, each file combines a synthetic voice with added noise (environmental or generated) to simulate real-world conditions.
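The mixing step itself is easy to reproduce if you want additional noise conditions. Below is a rough sketch that adds a noise track to a clean clip at a chosen signal-to-noise ratio; the file names are hypothetical and this is not the dataset's own mixing code:

```python
import numpy as np
import librosa
import soundfile as sf

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR, then add it to `speech`."""
    noise = np.resize(noise, len(speech))  # loop or truncate noise to the clip length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Hypothetical input files, both loaded as 16 kHz mono
speech, sr = librosa.load("up_clean.wav", sr=16000, mono=True)
noise, _ = librosa.load("street_noise.wav", sr=16000, mono=True)

mix = mix_at_snr(speech, noise, snr_db=10.0)
mix = np.clip(mix, -1.0, 1.0)  # keep the mixture within full scale
sf.write("up_street_10dB.wav", mix, sr)
```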
Can you add your own words to this dataset?
Yes, the provided source code makes it possible to generate new synthetic words with different vocal and acoustic parameters.




