Cambrian-Alignment Dataset
Cambrian-Alignment is a question-answer alignment dataset drawing on multiple sources, including LLaVA, Mini-Gemini, ALLaVA, and ShareGPT4V. It is used to improve the consistency of responses from multimodal models that combine vision and language. The dataset is large and is distributed as archives that must be merged and extracted before use.
Description
The Cambrian-Alignment dataset gathers question-answer pairs used to align multimodal models that combine text and images. It draws on data from several projects, including LLaVA, Mini-Gemini, ALLaVA, and ShareGPT4V, and is primarily used to fine-tune and evaluate a model's ability to produce consistent, relevant responses in a multimodal context.
What is this dataset for?
- Train and align multimodal models (vision + language) to improve contextual understanding
- Evaluate the quality of LLM responses on multimodal interaction tasks
- Create robust benchmarks for advanced multimodal systems
Can it be enriched or improved?
This dataset can be supplemented with alignment data from other sources or adapted to specific domains. More detailed annotation of the answers can also improve training quality, and additional multimodal dialogue data can be integrated to strengthen diversity and coverage.
🔎 In summary
🧠 Recommended for
- Multimodality researchers
- LLM developers
- Advanced AI R&D teams
🔧 Compatible tools
- PyTorch
- Hugging Face Datasets
- Multimodal frameworks
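For example, if the archives are hosted on the Hugging Face Hub, they can be fetched with `huggingface_hub` before extraction. This is a minimal sketch; the repo id `nyu-visionx/Cambrian-Alignment` is an assumption and should be replaced with wherever the dataset is actually hosted.

```python
from huggingface_hub import snapshot_download

# Assumed repo id -- replace with the actual Hub location of the dataset.
local_dir = snapshot_download(
    repo_id="nyu-visionx/Cambrian-Alignment",
    repo_type="dataset",
)
print(f"Archives downloaded to {local_dir}")
```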
💡 Tip
Provision sufficient storage and automate the merging and extraction of the archives before training, as in the sketch below.
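As a rough sketch using only the standard library, the split archive parts can be concatenated back into a single tar file and then extracted. The part naming pattern below is hypothetical and must be adapted to the actual file names.

```python
import glob
import shutil
import tarfile

# Hypothetical naming pattern -- adapt to the actual archive part names.
parts = sorted(glob.glob("cambrian_alignment.tar.part_*"))

# Concatenate the parts back into a single tar archive.
with open("cambrian_alignment.tar", "wb") as merged:
    for part in parts:
        with open(part, "rb") as src:
            shutil.copyfileobj(src, merged)

# Extract the merged archive into a data directory.
with tarfile.open("cambrian_alignment.tar") as tar:
    tar.extractall("data/")
```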
Frequently Asked Questions
What is the approximate size of the Cambrian-Alignment dataset?
The dataset exceeds 50 GB and is split into several tar archives that must be merged and then extracted.
Is this dataset suitable for machine learning beginners?
No. It requires technical skills to handle large files and to merge and extract the archives.
Can this dataset be used to train multimodal models?
Yes, it is specifically designed for the alignment and fine-tuning of multimodal models combining vision and language.
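Once extracted, the question-answer pairs can be inspected before fine-tuning. The sketch below assumes an LLaVA-style JSON layout, which is common for alignment sets of this kind; the file name and schema are assumptions to verify against the actual data.

```python
import json

# Hypothetical file name and schema: LLaVA-style alignment files typically
# store a list of records, each with an "image" path and a "conversations"
# list of {"from": "human" | "gpt", "value": ...} turns.
with open("data/alignment.json") as f:
    records = json.load(f)

# Inspect the first few question-answer pairs.
for record in records[:3]:
    print("image:", record.get("image"))
    for turn in record.get("conversations", []):
        print(f'{turn["from"]}: {turn["value"][:80]}')
```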