Ego4D Video — Embodied planning dataset
Multimodal

A dataset derived from Ego4D that pairs first-person videos with natural-language instructions generated automatically and then verified manually. It is designed for embodied planning and multimodal reasoning tasks.

Download dataset
Size

Hundreds of hours of egocentric video with paired text instructions; video files + JSON annotations

Licence

Apache 2.0

Description

Ego4D Video is a multimodal dataset combining egocentric (first-person) videos with detailed step-by-step instructions. It is built from the well-known Ego4D dataset by selecting relevant sequences and enriching them with language descriptions that were generated automatically and then verified by humans. This makes it well suited for training embodied planning, navigation, or instruction-comprehension models in real-world contexts.
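
Since the dataset ships video files alongside JSON annotations, a typical first step is to parse the annotation file into (video, instruction, steps) triples. The sketch below assumes a hypothetical record layout (field names like `video`, `start_sec`, and `steps` are illustrative; the actual schema may differ):

```python
import json

# Hypothetical annotation record pairing an Ego4D clip with verified
# instruction steps. Field names are assumptions, not the dataset's
# actual format.
SAMPLE = """
[
  {
    "video": "clips/kitchen_0012.mp4",
    "start_sec": 4.0,
    "end_sec": 21.5,
    "instruction": "Make a cup of tea",
    "steps": ["Fill the kettle", "Boil the water", "Pour over the tea bag"]
  }
]
"""

def load_annotations(raw: str):
    """Parse annotation JSON into (video, instruction, steps) tuples."""
    records = json.loads(raw)
    return [(r["video"], r["instruction"], r["steps"]) for r in records]

pairs = load_annotations(SAMPLE)
print(pairs[0][1])  # prints "Make a cup of tea"
```

From here, each video path can be handed to a decoder (e.g. OpenCV or PyAV) while the instruction text goes to the language side of the model.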

What is this dataset for?

  • Train vision-language models to follow instructions in complex environments
  • Test multimodal reasoning skills through embodied planning
  • Develop autonomous agents that interact with the real world by following instructions
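
For any of these uses, the videos must be turned into model inputs. A common preprocessing step is sampling frame timestamps aligned to instruction steps. The sketch below makes the simplifying assumption that steps span the clip evenly; real annotations may carry per-step time boundaries instead:

```python
def frame_timestamps(start_sec, end_sec, steps, frames_per_step=4):
    """Sample timestamps uniformly within each step's share of the clip.

    Assumes steps divide the clip evenly (a simplification: if the
    annotations provide per-step boundaries, use those instead).
    Returns a list of (step_text, [timestamps]) pairs.
    """
    n = len(steps)
    span = (end_sec - start_sec) / n
    plan = []
    for i, step in enumerate(steps):
        s = start_sec + i * span  # start of this step's window
        ts = [s + (j + 0.5) * span / frames_per_step
              for j in range(frames_per_step)]
        plan.append((step, ts))
    return plan

# Example: an 8-second clip with two steps, two frames per step.
plan = frame_timestamps(0.0, 8.0, ["open fridge", "take milk"], frames_per_step=2)
print(plan[0])  # prints ('open fridge', [1.0, 3.0])
```

Each (step, timestamps) pair then yields one supervised training example: the frames decoded at those timestamps paired with the step text.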

Can it be enriched or improved?

Yes. New videos can be added, the range of represented tasks broadened, and additional annotations included (objects, actions, locations). The structure also allows multilingual translations or user feedback to be added to refine the instructions.

🔎 In summary

🧩 Ease of use: ⭐⭐⭐✩✩ (requires synchronized video + text processing)
🧼 Need for cleaning: ⭐⭐⭐⭐⭐ (low: instructions already filtered and validated)
🏷️ Annotation richness: ⭐⭐⭐⭐⭐ (very rich: structured instructions, real-world views)
📜 Commercial license: ✅ Yes (Apache 2.0)
👨‍💻 Beginner friendly: ⚠️ Medium (easier with video + NLP experience)
🔁 Fine-tuning ready: 🎬 Excellent for action grounding and instruction-following models
🌍 Cultural diversity: ⚠️ Reflects the real-world scene diversity of Ego4D

🧠 Recommended for

  • Robotics researchers
  • AI planning
  • Embodied VLMs

🔧 Compatible tools

  • PyTorch
  • OpenCV
  • Hugging Face Datasets
  • CLIP
  • VideoMAE

💡 Tip

Use the video-instruction alignment to train a step-by-step planning model with fine-grained supervision.
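
One way to realize this tip is to convert each annotated step sequence into next-step prediction targets: the model observes the steps completed so far and must predict the next one. A minimal, framework-agnostic sketch (the `<END>` sentinel is an illustrative choice, not part of the dataset):

```python
def planning_pairs(steps):
    """Build (steps observed so far, next step) supervision pairs for a
    step-by-step planner, plus a final end-of-task target.

    The "<END>" sentinel marking task completion is an assumption of
    this sketch, not something the dataset defines.
    """
    pairs = []
    for i in range(len(steps)):
        pairs.append((tuple(steps[:i]), steps[i]))
    pairs.append((tuple(steps), "<END>"))
    return pairs

# Example: a two-step task yields three supervision pairs.
for history, target in planning_pairs(["boil water", "pour tea"]):
    print(history, "->", target)
```

Pairing each target with the video frames up to that step gives the fine-grained, step-level supervision the tip describes.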

Frequently Asked Questions

What is the difference between the original Ego4D and this dataset?

This dataset selects specific segments of Ego4D and enriches them with detailed and validated language instructions.

Can this dataset be used for autonomous navigation?

Yes, it is particularly suited to embodied navigation and instruction-following tasks in real-world contexts.

Do you need advanced skills to use it?

A good command of video processing and multimodal models is recommended to use it effectively.
