Multimodal

HowTo100M

HowTo100M is a large-scale multimodal dataset extracted from YouTube instructional videos. It combines visual (video frames), auditory (audio/speech), and textual data (automatic subtitles), supporting the training of video-text alignment models, instruction understanding, and multimodal research more broadly. It is a key resource for pre-training large-scale vision-language models.

Size

Approximately 136 million clip-caption pairs drawn from 1.2 million YouTube videos (about 15 years of content in total)

Licence

Freely available for academic research under the MIT license; the underlying videos remain subject to YouTube's terms of service

Description


The dataset contains:

  • 1.2 million "How-To" videos from YouTube
  • Video clips automatically aligned with YouTube's auto-generated subtitles
  • A wide variety of domains: cooking, DIY, beauty, sport, etc.
  • Audio (speech, ambient sound), video (extracted frames), and text (raw transcripts)
  • Extractions provided as synchronized triples (key frame, text, timestamp); a minimal loading sketch follows this section

Although the subtitles are generated automatically, their sheer volume enables robust learning under weak supervision.
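
For orientation, the released captions are generally distributed as a mapping from video ID to lists of segment start times, end times, and text. The file name caption.json and the exact schema below are assumptions; adapt them to the release you download. A minimal loading sketch in Python:

```python
import json

def load_caption_triples(path="caption.json"):
    """Yield (video_id, start, end, text) tuples from a HowTo100M-style caption
    file shaped like {video_id: {"start": [...], "end": [...], "text": [...]}}.
    File name and schema are assumptions; adjust to the actual release."""
    with open(path, "r", encoding="utf-8") as f:
        captions = json.load(f)
    for video_id, seg in captions.items():
        for start, end, text in zip(seg["start"], seg["end"], seg["text"]):
            yield video_id, float(start), float(end), text.strip()

if __name__ == "__main__":
    # Preview the first few synchronized (clip, caption) segments
    for i, (vid, start, end, text) in enumerate(load_caption_triples()):
        print(f"{vid} [{start:.1f}s - {end:.1f}s] {text}")
        if i >= 4:
            break
```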

What is this dataset for?


HowTo100M is designed for:

  • Training multimodal models (text + video + audio)
  • Pre-training for tasks such as video retrieval, automatic captioning, or instruction understanding
  • Building shared representations between vision and language (e.g. VideoCLIP, Florence, Flamingo), as in the contrastive sketch after this list
  • Improving video-guided assistants (e.g. for robots or voice-driven tutorials)
  • Zero-shot retrieval over video-text data
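
To make the shared-representation idea concrete, here is a minimal sketch that projects pre-extracted clip and caption features into a joint space and trains them with a symmetric InfoNCE objective. It is a simplified stand-in for the losses actually used with HowTo100M (such as MIL-NCE); the encoders, feature dimensions, and temperature are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipTextAlignment(nn.Module):
    """Toy joint-embedding model: both branches are simple MLP projection
    heads over pre-extracted features, not the full architectures trained
    on HowTo100M."""
    def __init__(self, video_dim=1024, text_dim=768, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Sequential(
            nn.Linear(video_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.text_proj = nn.Sequential(
            nn.Linear(text_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def infonce_loss(v, t, temperature=0.07):
    """Symmetric InfoNCE over a batch of aligned (clip, caption) pairs:
    the diagonal entries are positives, all other pairs act as negatives."""
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Usage with random stand-in features (e.g. pooled frame features and caption embeddings)
model = ClipTextAlignment()
v, t = model(torch.randn(32, 1024), torch.randn(32, 768))
loss = infonce_loss(v, t)
loss.backward()
```

Zero-shot retrieval then amounts to embedding a text query with the text head and ranking clips by cosine similarity in the joint space.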

Can it be enriched or improved?


Yes, for example:

  • Improve text-video alignment with more accurate transcription models such as Whisper (see the sketch after this list)
  • Manually annotate segments to build fully supervised benchmarks
  • Add semantic tags or action categories per frame
  • Use it for fine-tuning multimodal generative models (video-to-text or text-to-video)
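
For the first point, a re-transcription pass with the open-source openai-whisper package might look like the sketch below. The audio path, model size, and output format are illustrative choices for this sketch, not part of the original HowTo100M pipeline (the audio tracks are assumed to have been extracted to files beforehand).

```python
import json
import whisper  # pip install openai-whisper

def retranscribe(audio_path, model_name="base"):
    """Re-transcribe one audio track and return HowTo100M-style
    (start, end, text) segments with timestamps in seconds."""
    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"].strip()}
        for seg in result["segments"]
    ]

if __name__ == "__main__":
    segments = retranscribe("video_audio.wav")  # hypothetical extracted audio file
    print(json.dumps(segments[:3], indent=2, ensure_ascii=False))
```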

🔗 Source: HowTo100M Dataset GitHub

Frequently Asked Questions

Are the subtitles reliable?

They are generated automatically and are therefore somewhat noisy. Their massive volume, however, compensates for this imprecision at the global level.
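
In practice, a light filtering pass over the ASR captions is often applied before training. The heuristics below (a minimum segment length and a cap on repeated tokens) are illustrative choices, not part of the official release.

```python
def keep_caption(text, min_words=3, max_repeat_ratio=0.5):
    """Heuristic filter for noisy ASR captions: drop very short segments and
    segments dominated by a single repeated token (a common ASR artifact).
    Thresholds are illustrative."""
    words = text.lower().split()
    if len(words) < min_words:
        return False
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) <= max_repeat_ratio

captions = ["so so so so so", "add two cups of flour", "uh"]
print([c for c in captions if keep_caption(c)])  # ['add two cups of flour']
```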

Can this dataset be used to train generative models?

Yes, it is ideal for training or fine-tuning next-generation video-to-text or multimodal models.

What architectures have been pre-trained with HowTo100M?

Models such as VideoCLIP, Frozen, MIL-NCE, and X-CLIP have used this corpus for large-scale vision-language pre-training.
