AudioSet
AudioSet is a large-scale audio corpus compiled by Google, containing over 2 million sound clips drawn from YouTube videos. Each clip lasts 10 seconds and is annotated with one or more labels from a structured vocabulary of more than 600 sound categories.
Over 2 million annotated audio clips; audio as WAV (via extraction), segment annotations as CSV, ontology as JSON
Free access for research purposes, with annotations provided by Google under a Creative Commons license (original audio remains hosted on YouTube)
Description
AudioSet covers a wide variety of sounds from the real world:
- Human sounds: speech, laughter, coughing, screams, applause, ...
- Animal sounds: barking, birdsong, neighing, ...
- Mechanical sounds: engines, alarms, sirens, tools, vehicles, ...
- Environments: rain, wind, crowd, forest, classroom, ...
- Music: instruments, songs, various musical genres
The annotations are organized hierarchically and result from a semi-automated process that was manually validated on a subset of the clips.
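As an illustration of how a structured vocabulary like this can be traversed, the sketch below collects every category found beneath a given node. It assumes the vocabulary is available as a flat list of records with `id`, `name`, and `child_ids` fields; the three sample records and their IDs are made up for the example, and `descendants` is a hypothetical helper, not part of any official tooling.

```python
# Made-up sample in the assumed shape of the vocabulary: a flat list of
# category records, each pointing at its children by ID.
SAMPLE_ONTOLOGY = [
    {"id": "/x/human", "name": "Human sounds", "child_ids": ["/x/voice"]},
    {"id": "/x/voice", "name": "Human voice", "child_ids": ["/x/speech"]},
    {"id": "/x/speech", "name": "Speech", "child_ids": []},
]


def descendants(ontology, root_id):
    """Return the names of all categories below root_id, depth-first."""
    by_id = {node["id"]: node for node in ontology}
    names = []
    seen = set()  # guards against categories reachable through two parents
    stack = list(reversed(by_id[root_id]["child_ids"]))
    while stack:
        node_id = stack.pop()
        if node_id in seen:
            continue
        seen.add(node_id)
        node = by_id[node_id]
        names.append(node["name"])
        stack.extend(reversed(node["child_ids"]))
    return names


print(descendants(SAMPLE_ONTOLOGY, "/x/human"))
```

A lookup of this kind makes it possible to treat a clip tagged with a narrow category as also belonging to each of its broader parent categories.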
What is this dataset for?
AudioSet is used for:
- Training models for the classification and detection of environmental sounds
- The development of real-time sound recognition systems
- Annotating complex audio scenes for robotics or embedded devices
- The study of acoustic contexts in audio or multimodal AI projects
- The analysis of sound events for the creation of audio banks or generative synthesis
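Because each clip can carry several tags at once, classification on this kind of data is usually framed as a multi-label problem. The following minimal sketch shows one common way to turn per-clip tag lists into multi-hot target vectors. The clip labels are made up, and `build_label_index` and `multi_hot` are hypothetical helper names.

```python
def build_label_index(clips):
    """Assign a stable column index to every label seen in the clips."""
    labels = sorted({label for clip in clips for label in clip})
    return {label: i for i, label in enumerate(labels)}


def multi_hot(clip_labels, label_index):
    """Encode one clip's labels as a 0/1 vector over all known labels."""
    vector = [0] * len(label_index)
    for label in clip_labels:
        vector[label_index[label]] = 1
    return vector


# Made-up annotations: one list of labels per clip.
clips = [["Speech", "Laughter"], ["Rain"], ["Speech", "Rain"]]
index = build_label_index(clips)
targets = [multi_hot(clip, index) for clip in clips]
print(index)
print(targets)
```

Each row of `targets` can then serve as the training target for a classifier with one independent output per label.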
Can it be enriched or improved?
Yes, for example:
- By combining AudioSet with extracts that are locally stored or captured in real time
- By refining categories for specific industrial or medical contexts
- By applying segmentation or source separation techniques
- By using audio embeddings as input to multimodal models
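Refining categories for a specific context can be as simple as regrouping existing labels. The sketch below is one possible approach under stated assumptions: the mapping `CUSTOM_GROUPS` and the helper `regroup` are hypothetical, and the group names describe a made-up monitoring use case.

```python
# Hypothetical mapping from general-purpose labels to coarser,
# domain-specific groups (a made-up monitoring use case).
CUSTOM_GROUPS = {
    "Cough": "respiratory",
    "Sneeze": "respiratory",
    "Alarm": "alert",
    "Siren": "alert",
}


def regroup(labels, groups=CUSTOM_GROUPS):
    """Map a clip's labels to custom groups, dropping unmapped labels."""
    return sorted({groups[label] for label in labels if label in groups})


print(regroup(["Cough", "Speech", "Siren"]))
```

Labels that fall outside the target domain are dropped here; a real project might instead route them to an explicit "other" group.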
🔗 Source: AudioSet Dataset
Frequently Asked Questions
Are the audio files directly downloadable?
No. Only the annotations and video links are provided. Audio samples must be extracted via the YouTube links, in accordance with YouTube's terms of use.
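To show what working with the annotations alone might look like, here is a minimal sketch that reads a segment list and builds a link to each clip's start time. It assumes the segment lists are CSV files with `#` comment lines and rows of the form `YTID, start_seconds, end_seconds, positive_labels`; the rows, video IDs, and label IDs below are placeholders, and `read_segments` and `watch_url` are hypothetical helpers. No audio is downloaded.

```python
import csv
import io

# Placeholder rows in the assumed segment-list layout.
SAMPLE_CSV = """\
# Made-up sample rows
# YTID, start_seconds, end_seconds, positive_labels
abc123XYZ_0, 30.000, 40.000, "/m/0aaaa,/t/dd00000"
def456UVW_1, 0.000, 10.000, "/m/0bbbb"
"""


def read_segments(text):
    """Parse segment rows, skipping comment lines and blank lines."""
    rows = (
        line for line in io.StringIO(text)
        if line.strip() and not line.startswith("#")
    )
    segments = []
    # skipinitialspace lets the quoted label list after ", " parse as one field.
    for ytid, start, end, labels in csv.reader(rows, skipinitialspace=True):
        segments.append({
            "ytid": ytid,
            "start": float(start),
            "end": float(end),
            "labels": labels.split(","),
        })
    return segments


def watch_url(segment):
    """Build a YouTube link that opens the video at the clip's start time."""
    return f"https://www.youtube.com/watch?v={segment['ytid']}&t={int(segment['start'])}s"


segments = read_segments(SAMPLE_CSV)
for segment in segments:
    print(watch_url(segment), segment["labels"])
```

The parsed start and end times are what an extraction step would need in order to cut the 10-second clip out of the full video.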
Can AudioSet be used commercially?
The annotations are freely available, but the original audio remains subject to the copyright of each video on YouTube, so the license of the source videos must be checked before any commercial use.
Is the dataset multilingual?
Indirectly, yes. The speech comes from videos in many languages, but the annotations themselves are in English.