HowTo100M
HowTo100M is a large-scale multimodal dataset built from YouTube instructional videos. It combines visual (video frames), auditory (speech/audio), and textual data (automatic subtitles) to enable the training of video-text alignment models, instruction understanding, and multimodal research. It is a key resource for pre-training large-scale vision-language models.
Approximately 136 million video clip / narration pairs, from 1.2 million YouTube videos (roughly 134,000 hours of content)
Free access for academic research under the MIT license; the underlying videos remain subject to YouTube's terms of use
Description
The dataset contains:
- 1.2 million “How-To” videos from YouTube
- Video segments automatically aligned with YouTube's auto-generated subtitles
- A wide variety of domains: cooking, DIY, beauty, sport, etc.
- Audio data (speech, ambient sound), video (extracted frames), and text (raw transcripts)
- Extractions in the form of synchronized triples (keyframe, text, timestamp); see the loading sketch after this list
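A minimal loading sketch, assuming the per-video subtitles have been exported to a JSON file keyed by video ID (the file name `captions.json` and the `start`/`end`/`text` field names are assumptions for illustration, not the official release format):

```python
import json

# Assumed layout: {"video_id": [{"start": 1.2, "end": 4.8, "text": "..."}, ...], ...}
# File name and field names are illustrative, not the official HowTo100M release format.
with open("captions.json", "r", encoding="utf-8") as f:
    captions = json.load(f)

def iter_clip_text_triples(captions):
    """Yield (video_id, start, end, text) for every subtitle segment."""
    for video_id, segments in captions.items():
        for seg in segments:
            yield video_id, seg["start"], seg["end"], seg["text"]

for video_id, start, end, text in iter_clip_text_triples(captions):
    print(f"{video_id} [{start:.1f}s - {end:.1f}s]: {text}")
    break  # show only the first segment
```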
Although the subtitles are generated automatically, their sheer volume enables robust learning under weak supervision, as illustrated by the sketch below.
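As an illustration of how this weak supervision is typically exploited, here is a minimal sketch of an InfoNCE-style video-text contrastive loss (in the spirit of MIL-NCE, mentioned in the FAQ below); the encoders producing `video_emb` and `text_emb` are assumed and are not part of the dataset itself:

```python
import torch
import torch.nn.functional as F

def video_text_infonce(video_emb, text_emb, temperature=0.07):
    """InfoNCE-style contrastive loss over a batch of (video, text) pairs.

    video_emb, text_emb: (batch, dim) tensors from hypothetical video/text encoders.
    Matching pairs share the same row index; all other rows act as negatives,
    which is how noisy subtitle alignment averages out at scale.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: video-to-text and text-to-video directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```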
What is this dataset for?
HowTo100M is designed for:
- Training multimodal models (text + video + audio)
- Pre-training for tasks such as video retrieval, automatic captioning, or instruction understanding
- Building shared representations between vision and language (e.g. VideoCLIP, Florence, Flamingo)
- Improving video-guided assistants (e.g. for robots, voice tutorials)
- Zero-shot retrieval over video-text data (see the sketch after this list)
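As a rough sketch of what zero-shot retrieval looks like downstream, assuming clip and query embeddings have already been produced by a jointly trained encoder pair (the embeddings below are random placeholders standing in for real encoder outputs):

```python
import torch
import torch.nn.functional as F

def zero_shot_retrieval(query_text_emb, video_embs, top_k=5):
    """Rank video clips against a text query by cosine similarity.

    query_text_emb: (dim,) embedding of the query from a hypothetical text encoder.
    video_embs: (num_videos, dim) pre-computed clip embeddings.
    Returns indices of the top_k closest clips, with no task-specific training.
    """
    query = F.normalize(query_text_emb, dim=-1)
    videos = F.normalize(video_embs, dim=-1)
    scores = videos @ query  # (num_videos,) cosine similarities
    return torch.topk(scores, k=top_k).indices

# Placeholder embeddings for demonstration only
video_embs = torch.randn(1000, 512)
query_emb = torch.randn(512)
print(zero_shot_retrieval(query_emb, video_embs))
```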
Can it be enriched or improved?
Yes, for example:
- Improve text-video alignment by re-transcribing the audio with more accurate ASR models (e.g. Whisper; see the sketch after this list)
- Manually annotate segments for strongly supervised benchmarks
- Add semantic tags or action categories per frame
- Use it to fine-tune multimodal generative models (video-to-text or text-to-video)
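For example, re-transcribing a clip's audio with the openai-whisper package yields timestamped segments in the same (start, end, text) shape as the original triples; the audio file name below is illustrative:

```python
import whisper  # openai-whisper package

# "base" is one of the published Whisper checkpoints; larger ones are more accurate but slower.
model = whisper.load_model("base")

# Illustrative path: in practice the audio would be extracted from a HowTo100M video.
result = model.transcribe("howto_clip_audio.mp3")

# Whisper returns timestamped segments that can replace the noisy YouTube subtitles
# while preserving the (start, end, text) triple structure.
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text'].strip()}")
```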
🔗 Source: HowTo100M Dataset GitHub
Frequently Asked Questions
Are the subtitles reliable?
They are generated automatically and are therefore sometimes noisy. However, their massive volume compensates for this imprecision at the aggregate level.
Can this dataset be used to train generative models?
Yes, it is well suited to training or fine-tuning next-generation video-to-text or multimodal models.
What architectures have been pre-trained with HowTo100M?
Models such as VideoCLIP, Frozen, MIL-NCE, and X-CLIP have used this corpus for large-scale vision-language pre-training.