VisualWebInstruct
VisualWebInstruct is a large multimodal question-answering (QA) dataset in which roughly 40% of the pairs are visual, drawing on more than 163,000 unique images. It spans several scientific fields and focuses on complex, multi-step reasoning.
Description
VisualWebInstruct is a large-scale multimodal instruction corpus comprising more than 1.9 million question-answer pairs, a high proportion of which have associated images. The fields covered include mathematics, physics, finance, chemistry, and more. The dataset is designed to improve the reasoning ability of vision-language models on complex, multi-step tasks.
What is this dataset for?
- Train multimodal models that reason over complex questions combining text and images (see the loading sketch after this list)
- Improve understanding and answer quality across a variety of scientific fields
- Test the robustness of models on visual and textual QA tasks
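The snippet below is a minimal sketch of loading the dataset with Hugging Face Datasets and inspecting one example. The repository ID and field names are assumptions; check the dataset card on the Hub for the exact configuration, splits, and schema.

```python
# Minimal loading sketch (repository ID and schema assumed, not guaranteed).
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/VisualWebInstruct", split="train")  # repo ID assumed

sample = ds[0]
print(list(sample.keys()))  # discover the actual field names before training
```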
Can it be enriched or improved?
The dataset can be enriched by adding new domains, extending visual or textual annotations, and increasing the number of images and questions. Incorporating human feedback to validate responses can also improve quality.
🔎 In summary
🧠 Recommended for
- Multimodal AI researchers
- QA model developers
- Vision-language R&D teams
🔧 Compatible tools
- Hugging Face Datasets
- PyTorch (see the DataLoader sketch after this list)
- TensorFlow
- Vision-language frameworks
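As a rough illustration of the PyTorch integration, the sketch below streams QA pairs through a DataLoader. The field names ("question", "answer", "image") are assumptions; adapt the collate function to the dataset's actual schema and to your model's image processor.

```python
# Sketch: feeding the dataset to a PyTorch training loop (field names assumed).
from datasets import load_dataset
from torch.utils.data import DataLoader

ds = load_dataset("TIGER-Lab/VisualWebInstruct", split="train")  # repo ID assumed

def collate(batch):
    # Keep raw fields; image preprocessing depends on the target vision-language model.
    return {
        "questions": [ex.get("question") for ex in batch],
        "answers": [ex.get("answer") for ex in batch],
        "images": [ex.get("image") for ex in batch],
    }

loader = DataLoader(ds, batch_size=8, shuffle=True, collate_fn=collate)
batch = next(iter(loader))  # one batch ready for a forward pass
```

A custom collate function is used because the default PyTorch collation cannot stack variable-size images or free-form text.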
💡 Tip
Use the conversational subsets for fine-tuning geared toward natural, dialogue-style interactions; a minimal formatting sketch follows.
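The sketch below converts a QA pair into a chat-style record for instruction fine-tuning. The "question"/"answer" field names and the message layout are assumptions, not the dataset's official format.

```python
# Hedged sketch: turn a QA pair into a chat-style record (schema assumed).
def to_chat(example):
    return {
        "messages": [
            {"role": "user", "content": example["question"]},
            {"role": "assistant", "content": example["answer"]},
        ]
    }

# Usage with Hugging Face Datasets (repository ID assumed):
# from datasets import load_dataset
# chat_ds = load_dataset("TIGER-Lab/VisualWebInstruct", split="train").map(to_chat)
```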
Frequently Asked Questions
What scientific areas are covered by VisualWebInstruct?
Mathematics, physics, finance, chemistry, engineering, and several other scientific disciplines.
How many images are associated with the questions and answers?
163,743 unique images are associated with roughly 40% of the question-answer pairs.
Is this dataset suitable for commercial use?
Yes. The Apache 2.0 license permits free use, including commercial use, provided the license terms are respected.