AudioSet

AudioSet is a large-scale audio corpus compiled by Google, containing more than 2 million sound clips drawn from YouTube videos. Each 10-second clip is annotated with one or more labels from a structured vocabulary of more than 600 sound categories.

Size

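Over 2 million annotated audio clips. Audio can be obtained as WAV (via extraction from the YouTube source); segment annotations are distributed as CSV files, and the category vocabulary as JSON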
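
As a minimal sketch of reading the annotations, the snippet below parses segment rows of the form `YTID, start_seconds, end_seconds, positive_labels`. It assumes the layout of the published segment files (comment lines starting with `#`, labels given as one quoted, comma-separated field). The sample rows, the clip IDs, and the `parse_segments` helper are illustrative and not taken from the dataset.

```python
import csv
import io

# Illustrative sample only: the clip IDs below are made up.
SAMPLE = '''# Segments csv (illustrative sample)
# YTID, start_seconds, end_seconds, positive_labels
abc123XYZ_0, 30.000, 40.000, "/m/09x0r,/m/04rlf"
def456UVW_1, 0.000, 10.000, "/m/09x0r"
'''


def parse_segments(text):
    """Return one dict per annotated clip, skipping '#' comment lines."""
    data_lines = (line for line in io.StringIO(text) if not line.startswith("#"))
    # skipinitialspace lets the quoted label field follow ", " correctly.
    reader = csv.reader(data_lines, skipinitialspace=True)
    segments = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        ytid, start, end, labels = row
        segments.append({
            "ytid": ytid,
            "start": float(start),
            "end": float(end),
            "labels": labels.split(","),
        })
    return segments


segments = parse_segments(SAMPLE)
for seg in segments:
    print(seg["ytid"], seg["end"] - seg["start"], seg["labels"])
```

For a real file, the same function could be given the file's text content; the audio itself still has to be fetched and trimmed separately.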

License

Free access for research purposes, with annotations provided by Google under a Creative Commons license (original audio remains hosted on YouTube)

Description

AudioSet covers a wide variety of sounds from the real world:

Human sounds: speech, laughter, coughing, screams, applause, ...
Animal sounds: barking, birdsong, neighing, ...
Mechanical sounds: engines, alarms, sirens, tools, vehicles, ...
Environments: rain, wind, crowd, forest, classroom, ...
Music: instruments, songs, various musical genres

The annotations are organized hierarchically and are the result of a semi-automated process validated manually on a subset.
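
Because the label vocabulary is structured, a common step is to look up the broader categories a label belongs to. The sketch below assumes each vocabulary entry carries `id`, `name`, and `child_ids` fields; the three entries shown and their IDs are hypothetical, and `build_parent_map` and `ancestors` are illustrative helper names.

```python
import json

# Hypothetical excerpt shaped like vocabulary entries (id, name, child_ids).
SAMPLE_ONTOLOGY = json.loads('''[
  {"id": "cat_animal", "name": "Animal", "child_ids": ["cat_dog"]},
  {"id": "cat_dog", "name": "Dog", "child_ids": ["cat_bark"]},
  {"id": "cat_bark", "name": "Bark", "child_ids": []}
]''')


def build_parent_map(entries):
    """Invert child_ids so each category ID maps to its direct parents."""
    parents = {}
    for entry in entries:
        for child_id in entry["child_ids"]:
            parents.setdefault(child_id, []).append(entry["id"])
    return parents


def ancestors(label_id, parents):
    """Return every broader category of label_id, nearest first."""
    found = []
    stack = list(parents.get(label_id, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(parents.get(current, []))
    return found


parent_map = build_parent_map(SAMPLE_ONTOLOGY)
names = {entry["id"]: entry["name"] for entry in SAMPLE_ONTOLOGY}
print([names[a] for a in ancestors("cat_bark", parent_map)])  # ['Dog', 'Animal']
```

Propagating a clip's labels up to their ancestors in this way is one option when a model should also predict the broader categories.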

What is this dataset for?

AudioSet is used for:

Training models for the classification and detection of environmental sounds
Developing real-time sound recognition systems
Annotating complex audio scenes for robotics or embedded devices
Studying acoustic contexts in audio or multimodal AI projects
Analyzing sound events to build audio banks or support generative synthesis
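
Since each clip can carry several labels at once, classification on this kind of data is usually framed as a multi-label problem. The sketch below shows one common way to encode a clip's labels as a multi-hot target vector; the four-class `VOCABULARY` and the `multi_hot` helper are hypothetical and stand in for the real category list.

```python
def multi_hot(clip_labels, vocabulary):
    """Encode a clip's label list as a 0/1 vector aligned with vocabulary."""
    index = {label: i for i, label in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for label in clip_labels:
        if label in index:  # labels outside the vocabulary are ignored
            vector[index[label]] = 1.0
    return vector


# Hypothetical 4-class vocabulary, for illustration only.
VOCABULARY = ["speech", "dog_bark", "siren", "rain"]

target = multi_hot(["speech", "rain"], VOCABULARY)
print(target)  # [1.0, 0.0, 0.0, 1.0]
```

A vector like this would typically be paired with a per-class sigmoid output and a binary cross-entropy loss, rather than a single softmax over all classes.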

Can it be enriched or improved?

Yes, for example:

By combining AudioSet with recordings that are stored locally or captured in real time
By refining categories for specific industrial or medical contexts
By applying segmentation or source separation techniques
By using audio embeddings as input to multimodal models
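
Refining categories for a specific context can be as simple as remapping the generic labels onto a smaller, domain-specific set. The sketch below assumes a factory-monitoring scenario; the `CUSTOM_MAP` table, its label names, and the `refine_labels` helper are all hypothetical.

```python
# Hypothetical mapping from generic labels to a factory-monitoring taxonomy.
CUSTOM_MAP = {
    "engine": "machine_running",
    "tools": "machine_running",
    "alarm": "alert",
    "siren": "alert",
}


def refine_labels(labels, mapping, default="other"):
    """Map generic labels to domain labels, deduplicated and sorted.

    Unmapped labels fall back to `default`, which is dropped whenever
    at least one label maps to a domain-specific category.
    """
    refined = {mapping.get(label, default) for label in labels}
    if len(refined) > 1:
        refined.discard(default)
    return sorted(refined)


print(refine_labels(["engine", "alarm", "rain"], CUSTOM_MAP))  # ['alert', 'machine_running']
print(refine_labels(["rain"], CUSTOM_MAP))                     # ['other']
```

Keeping the mapping in a plain table makes it easy to review with domain experts before any relabeling is applied.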

🔗 Source: AudioSet Dataset
