Ego4D Video — Embodied planning dataset
Multimodal

A dataset derived from Ego4D that pairs first-person videos with natural-language instructions generated automatically and then verified manually. It is designed for embodied planning and multimodal reasoning tasks.

Download dataset
Size

Hundreds of hours of egocentric video with paired text instructions; video files + JSON annotations

Licence

Apache 2.0

Description

Ego4D Video is a multimodal dataset combining egocentric (first-person) videos with detailed step-by-step instructions. It is built from the well-known Ego4D dataset by selecting relevant sequences and enriching them with language descriptions that were generated automatically and then verified by humans. This makes it well suited for training embodied planning, navigation, or instruction-comprehension models in real-world contexts.
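
Since the dataset ships video files alongside JSON annotations, a typical first step is to parse the annotation file into (video, instruction, steps) triples. The sketch below assumes a hypothetical record layout (field names like `video`, `start_sec`, and `steps` are illustrative; the actual schema may differ):

```python
import json

# Hypothetical annotation record pairing an Ego4D clip with verified
# instruction steps. Field names are assumptions, not the dataset's
# actual format.
SAMPLE = """
[
  {
    "video": "clips/kitchen_0012.mp4",
    "start_sec": 4.0,
    "end_sec": 21.5,
    "instruction": "Make a cup of tea",
    "steps": ["Fill the kettle", "Boil the water", "Pour over the tea bag"]
  }
]
"""

def load_annotations(raw: str):
    """Parse annotation JSON into (video, instruction, steps) tuples."""
    records = json.loads(raw)
    return [(r["video"], r["instruction"], r["steps"]) for r in records]

pairs = load_annotations(SAMPLE)
print(pairs[0][1])  # prints "Make a cup of tea"
```

From here, each video path can be handed to a decoder (e.g. OpenCV or PyAV) while the instruction text goes to the language side of the model.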

What is this dataset for?

  • Train vision-language models to follow instructions in complex environments
  • Test multimodal reasoning skills through embodied planning
  • Develop autonomous agents that interact with the real world by following instructions
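
For any of these uses, the videos must be turned into model inputs. A common preprocessing step is sampling frame timestamps aligned to instruction steps. The sketch below makes the simplifying assumption that steps span the clip evenly; real annotations may carry per-step time boundaries instead:

```python
def frame_timestamps(start_sec, end_sec, steps, frames_per_step=4):
    """Sample timestamps uniformly within each step's share of the clip.

    Assumes steps divide the clip evenly (a simplification: if the
    annotations provide per-step boundaries, use those instead).
    Returns a list of (step_text, [timestamps]) pairs.
    """
    n = len(steps)
    span = (end_sec - start_sec) / n
    plan = []
    for i, step in enumerate(steps):
        s = start_sec + i * span  # start of this step's window
        ts = [s + (j + 0.5) * span / frames_per_step
              for j in range(frames_per_step)]
        plan.append((step, ts))
    return plan

# Example: an 8-second clip with two steps, two frames per step.
plan = frame_timestamps(0.0, 8.0, ["open fridge", "take milk"], frames_per_step=2)
print(plan[0])  # prints ('open fridge', [1.0, 3.0])
```

Each (step, timestamps) pair then yields one supervised training example: the frames decoded at those timestamps paired with the step text.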

Can it be enriched or improved?

Yes. New videos can be added, the range of represented tasks broadened, and additional annotations included (objects, actions, locations). The structure also allows multilingual translations or user feedback to be added to refine the instructions.

🔎 In summary

🧩 Ease of use: ⭐⭐⭐✩✩ (requires synchronized video + text processing)
🧼 Need for cleaning: ⭐⭐⭐⭐⭐ (low: instructions already filtered and validated)
🏷️ Annotation richness: ⭐⭐⭐⭐⭐ (very rich: structured instructions, real-world views)
📜 Commercial license: ✅ Yes (Apache 2.0)
👨‍💻 Beginner friendly: ⚠️ Medium (easier with video + NLP experience)
🔁 Fine-tuning ready: 🎬 Excellent for action grounding and instruction-following models
🌍 Cultural diversity: ⚠️ Reflects the real-world scene diversity of Ego4D

🧠 Recommended for

  • Robotics researchers
  • AI planning
  • Embodied VLMs

🔧 Compatible tools

  • PyTorch
  • OpenCV
  • Hugging Face Datasets
  • CLIP
  • VideoMAE

💡 Tip

Use the video-instruction alignment to train a step-by-step planning model with fine-grained supervision.
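
One way to realize this tip is to convert each annotated step sequence into next-step prediction targets: the model observes the steps completed so far and must predict the next one. A minimal, framework-agnostic sketch (the `<END>` sentinel is an illustrative choice, not part of the dataset):

```python
def planning_pairs(steps):
    """Build (steps observed so far, next step) supervision pairs for a
    step-by-step planner, plus a final end-of-task target.

    The "<END>" sentinel marking task completion is an assumption of
    this sketch, not something the dataset defines.
    """
    pairs = []
    for i in range(len(steps)):
        pairs.append((tuple(steps[:i]), steps[i]))
    pairs.append((tuple(steps), "<END>"))
    return pairs

# Example: a two-step task yields three supervision pairs.
for history, target in planning_pairs(["boil water", "pour tea"]):
    print(history, "->", target)
```

Pairing each target with the video frames up to that step gives the fine-grained, step-level supervision the tip describes.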

Frequently Asked Questions

What is the difference between the original Ego4D and this dataset?

This dataset selects specific segments of Ego4D and enriches them with detailed and validated language instructions.

Can this dataset be used for autonomous navigation?

Yes, it is particularly suited to embodied navigation and instruction-following tasks in real-world contexts.

Do you need advanced skills to use it?

A good command of video processing and multimodal models is recommended to use it effectively.
