HowTo100M
HowTo100M is a large-scale multimodal dataset built from YouTube instructional videos. It combines visual (video frames), auditory (speech/audio), and textual data (automatic subtitles) to enable the training of video-text alignment models, instruction understanding, and multimodal research. It is a key resource for pre-training large-scale vision-language models.
Approximately 136 million video clip / narration pairs, from 1.2 million YouTube videos (roughly 134,000 hours of content)
Free access for academic research under the MIT license; the underlying videos remain subject to YouTube's terms of use
Description
The dataset contains:
- 1.2 million “How-To” videos from YouTube
- Video segments automatically aligned with YouTube's auto-generated subtitles
- A wide variety of domains: cooking, DIY, beauty, sport, etc.
- Audio data (speech, ambient sound), video (extracted frames), and text (raw transcripts)
- Extractions in the form of synchronized triples (keyframe, text, timestamp); see the loading sketch after this list
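A minimal loading sketch, assuming the per-video subtitles have been exported to a JSON file keyed by video ID (the file name `captions.json` and the `start`/`end`/`text` field names are assumptions for illustration, not the official release format):

```python
import json

# Assumed layout: {"video_id": [{"start": 1.2, "end": 4.8, "text": "..."}, ...], ...}
# File name and field names are illustrative, not the official HowTo100M release format.
with open("captions.json", "r", encoding="utf-8") as f:
    captions = json.load(f)

def iter_clip_text_triples(captions):
    """Yield (video_id, start, end, text) for every subtitle segment."""
    for video_id, segments in captions.items():
        for seg in segments:
            yield video_id, seg["start"], seg["end"], seg["text"]

for video_id, start, end, text in iter_clip_text_triples(captions):
    print(f"{video_id} [{start:.1f}s - {end:.1f}s]: {text}")
    break  # show only the first segment
```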
Although the subtitles are generated automatically, their sheer volume enables robust learning under weak supervision, as illustrated by the sketch below.
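As an illustration of how this weak supervision is typically exploited, here is a minimal sketch of an InfoNCE-style video-text contrastive loss (in the spirit of MIL-NCE, mentioned in the FAQ below); the encoders producing `video_emb` and `text_emb` are assumed and are not part of the dataset itself:

```python
import torch
import torch.nn.functional as F

def video_text_infonce(video_emb, text_emb, temperature=0.07):
    """InfoNCE-style contrastive loss over a batch of (video, text) pairs.

    video_emb, text_emb: (batch, dim) tensors from hypothetical video/text encoders.
    Matching pairs share the same row index; all other rows act as negatives,
    which is how noisy subtitle alignment averages out at scale.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: video-to-text and text-to-video directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```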
What is this dataset for?
HowTo100M is designed for:
- Training multimodal models (text + video + audio)
- Pre-training for tasks such as video retrieval, automatic captioning, or instruction understanding
- Building shared representations between vision and language (e.g. VideoCLIP, Florence, Flamingo)
- Improving video-guided assistants (e.g. for robots, voice tutorials)
- Zero-shot retrieval over video-text data (see the sketch after this list)
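As a rough sketch of what zero-shot retrieval looks like downstream, assuming clip and query embeddings have already been produced by a jointly trained encoder pair (the embeddings below are random placeholders standing in for real encoder outputs):

```python
import torch
import torch.nn.functional as F

def zero_shot_retrieval(query_text_emb, video_embs, top_k=5):
    """Rank video clips against a text query by cosine similarity.

    query_text_emb: (dim,) embedding of the query from a hypothetical text encoder.
    video_embs: (num_videos, dim) pre-computed clip embeddings.
    Returns indices of the top_k closest clips, with no task-specific training.
    """
    query = F.normalize(query_text_emb, dim=-1)
    videos = F.normalize(video_embs, dim=-1)
    scores = videos @ query  # (num_videos,) cosine similarities
    return torch.topk(scores, k=top_k).indices

# Placeholder embeddings for demonstration only
video_embs = torch.randn(1000, 512)
query_emb = torch.randn(512)
print(zero_shot_retrieval(query_emb, video_embs))
```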
Can it be enriched or improved?
Yes, for example:
- Improve text-video alignment by re-transcribing the audio with more accurate ASR models (e.g. Whisper; see the sketch after this list)
- Manually annotate segments for strongly supervised benchmarks
- Add semantic tags or action categories per frame
- Use it to fine-tune multimodal generative models (video-to-text or text-to-video)
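For example, re-transcribing a clip's audio with the openai-whisper package yields timestamped segments in the same (start, end, text) shape as the original triples; the audio file name below is illustrative:

```python
import whisper  # openai-whisper package

# "base" is one of the published Whisper checkpoints; larger ones are more accurate but slower.
model = whisper.load_model("base")

# Illustrative path: in practice the audio would be extracted from a HowTo100M video.
result = model.transcribe("howto_clip_audio.mp3")

# Whisper returns timestamped segments that can replace the noisy YouTube subtitles
# while preserving the (start, end, text) triple structure.
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text'].strip()}")
```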
🔗 Source: HowTo100M Dataset GitHub
Frequently Asked Questions
Are the subtitles reliable?
They are generated automatically and are therefore sometimes noisy. However, their massive volume compensates for this imprecision at the aggregate level.
Can this dataset be used to train generative models?
Yes, it is well suited to training or fine-tuning next-generation video-to-text or multimodal models.
What architectures have been pre-trained with HowTo100M?
Models such as VideoCLIP, Frozen, MIL-NCE, and X-CLIP have used this corpus for large-scale vision-language pre-training.