Synthetic Speech Commands

An open source audio corpus of isolated words, generated by speech synthesis, designed to train voice command detection models.

Size

83,700 WAV files (1s, mono, 16 kHz)

Licence

CC BY-SA 4.0

Description

This dataset contains over 83,000 audio files generated by text-to-speech, each representing a simple word (such as “up”, “down”, “yes”, or “go”). Each word is generated with variations in voice, pitch, speed, and background noise (e.g. street, train, sea). The files are 1-second mono WAV recordings sampled at 16 kHz.
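
As a quick sanity check on the format described above, the following minimal sketch reads a WAV header with Python's standard `wave` module and compares it with the stated properties (mono, 16 kHz, about 1 second). The function names and the 0.05-second tolerance are illustrative assumptions, not part of the dataset or its tooling.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_s) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    return channels, sample_rate, duration


def matches_expected_format(path, sample_rate=16000, channels=1,
                            duration_s=1.0, tolerance_s=0.05):
    """Check a file against the format stated in the description.

    The tolerance is an assumed value, chosen only for illustration.
    """
    ch, sr, dur = describe_wav(path)
    return (ch == channels
            and sr == sample_rate
            and abs(dur - duration_s) <= tolerance_s)
```

Running `matches_expected_format` over every file before training is a cheap way to catch truncated or resampled files early.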

What is this dataset for?

  • Train keyword spotting models
  • Test model robustness against different types of noise (synthetic and environmental)
  • Create voice assistants or voice-controlled interfaces (IoT, robotics)
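
For the robustness use case, one common approach is to add extra noise to a clip at a controlled signal-to-noise ratio (SNR) and observe how accuracy changes. The sketch below is dependency-free and assumes both signals are equal-length sequences of float samples. The function names are hypothetical, and this is not necessarily how the noise in the dataset itself was mixed.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a non-empty sequence of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then add it.

    Assumes `speech` and `noise` have the same length. No clipping or
    renormalisation is applied to the result.
    """
    noise_rms = rms(noise)
    if noise_rms == 0:
        return list(speech)  # silent noise: nothing to add
    # SNR in dB is 20 * log10(rms_speech / rms_noise), solved for rms_noise.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]
```

Evaluating the same model at several SNR values (for example 20, 10, and 0 dB) gives a simple robustness curve. In practice the same arithmetic is usually done on NumPy arrays or tensors.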

Can it be enriched or improved?

Yes. This data can be mixed with real recordings to improve model robustness. Other words can also be added via the same TTS pipeline. Finally, finer-grained labels for noise type or synthetic speaker could enrich the annotations.

🔎 In summary

Criterion | Evaluation
🧩 Ease of use | ⭐⭐⭐⭐⭐ (Very simple – well-formatted audio data)
🧼 Need for cleaning | ⭐⭐⭐⭐⭐ (Low – uniform audio quality)
🏷️ Annotation richness | ⭐⭐⭐✩✩ (Medium – only the spoken word)
📜 Commercial license | ✅ Yes (CC BY-SA 4.0)
👨‍💻 Beginner friendly | 🌟 Yes – perfect for getting started with audio
🔁 Fine-tuning ready | ⚡ Very useful for fine-tuning a lightweight speech recognition model
🌍 Cultural diversity | ⚠️ Limited – only synthetic English voices

🧠 Recommended for

  • Beginners in audio processing
  • Voice assistant creators
  • TTS robustness researchers

🔧 Compatible tools

  • TensorFlow
  • PyTorch
  • SpeechBrain
  • Torchaudio
  • Librosa

💡 Tip

To simulate realistic environments, combine this dataset with natural speech recordings of the same words.

Frequently Asked Questions

Can this dataset replace human voice recordings?

It can complement or augment a real dataset, but it remains synthetic. For the best accuracy, a blend of real and synthetic data is preferred.
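
A real/synthetic blend can be sketched as a sampling step over two lists of file paths. The function below is a hypothetical helper, and the `synthetic_fraction` parameter is an assumption for illustration: the source does not recommend a specific ratio, so it is best treated as a hyperparameter to tune.

```python
import random


def blend_file_lists(real_files, synthetic_files, synthetic_fraction=0.5,
                     total=None, seed=0):
    """Build a shuffled training list with roughly `synthetic_fraction` synthetic items.

    If either pool is too small to reach the requested count, the result
    is shorter than `total` rather than containing repeated files.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    if total is None:
        total = len(real_files) + len(synthetic_files)
    n_synthetic = min(round(total * synthetic_fraction), len(synthetic_files))
    n_real = min(total - n_synthetic, len(real_files))
    blended = (rng.sample(synthetic_files, n_synthetic)
               + rng.sample(real_files, n_real))
    rng.shuffle(blended)
    return blended
```

Comparing validation accuracy on real recordings across a few values of `synthetic_fraction` is one way to find out how much synthetic data actually helps for a given task.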

Is background noise included in the files?

Yes, each file combines a synthetic voice with added noise (environmental or generated) to simulate real conditions.

Can you add your own words to this dataset?

Yes, the source provided makes it possible to generate new synthetic words with different vocal and acoustic parameters.
