AyavisionBench

AyavisionBench is a benchmark designed to test vision-language models in 23 languages, covering 9 task categories, ranging from graph comprehension to OCR and transcription.


Size

3,105 JPG image-question pairs, 23 languages, total size ~1.34 GB
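As a rough sanity check on these figures, the sketch below works out the number of pairs per language and the average size per pair. It assumes the pairs are split evenly across languages and that 1 GB = 1,000 MB; neither assumption is stated above.

```python
# Figures quoted in the Size entry above.
TOTAL_PAIRS = 3105
NUM_LANGUAGES = 23
TOTAL_SIZE_GB = 1.34  # approximate

# Assumption: pairs are split evenly across languages.
pairs_per_language = TOTAL_PAIRS / NUM_LANGUAGES

# Assumption: decimal units (1 GB = 1000 MB).
avg_mb_per_pair = TOTAL_SIZE_GB * 1000 / TOTAL_PAIRS

print(f"{pairs_per_language:.0f} pairs per language")
print(f"~{avg_mb_per_pair:.2f} MB per pair on average")
```

Under those assumptions, each language gets 135 pairs and each pair averages a little under half a megabyte.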

License

Apache 2.0

Description


AyavisionBench is a multilingual dataset for assessing models that combine vision and natural language. It pairs JPG images with questions that cannot be answered without the visual context, in 23 major languages spoken by approximately half of the world's population. Tasks include image description, chart and graph comprehension, optical character recognition, and more.


What is this dataset for?


Assess the multimodal and multilingual understanding of AI models
Test robustness on varied visual tasks such as OCR, transcription, and visual reasoning
Train models that generalize across multiple languages and scripts
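For the assessment use case, one way to compare a model across languages and task types is to aggregate its per-question results by group. The following is a minimal sketch, not the benchmark's official scoring method: the record fields `language`, `category`, and `correct` are hypothetical, the sample results are made up, and how `correct` is decided (exact match, human or model judge) is left open.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: fraction of correct answers} for the given record key."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        hits[group] += int(rec["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Made-up results for illustration only; not real benchmark scores.
results = [
    {"language": "fr", "category": "ocr", "correct": True},
    {"language": "fr", "category": "chart", "correct": False},
    {"language": "ja", "category": "ocr", "correct": True},
    {"language": "ja", "category": "chart", "correct": True},
]

print(accuracy_by_group(results, "language"))
print(accuracy_by_group(results, "category"))
```

Reporting scores per language and per category, rather than a single overall number, makes it easier to see whether a model is weak on a particular script or task type.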


Can it be enriched or improved?


Yes. The dataset could be extended with more languages, more varied image types, or human-annotated questions, which would improve answer quality and broaden the range of cases covered.


🔎 In summary

Criterion	Evaluation
🧩 Ease of use	⭐⭐⭐⭐✩ (Clear dataset, requires multilingual handling)
🧼 Need for cleaning	⭐⭐⭐⭐⭐ (Low – data well verified)
🏷️ Annotation richness	⭐⭐⭐⭐✩ (Good – varied questions per image)
📜 Commercial license	✅ Yes (Apache 2.0)
👨‍💻 Beginner friendly	⚠️ Limited – better suited to advanced multimodal projects
🔁 Fine-tuning ready	✅ Perfect for multilingual multimodal fine-tuning
🌍 Cultural diversity	🌐 Very high – 23 languages across diverse families and scripts