Flickr30k Image‑Caption Dataset
Dataset of more than 30,000 images, each with 5 English captions written by human annotators, for training vision-and-language models.
Description
The Flickr30k dataset provides more than 30,000 images, each accompanied by 5 human-written captions. The images are hosted on Flickr and the annotations are distributed in CSV format. It is well suited to training and evaluating caption generation, VQA, or vision-language models.
What is this dataset for?
- Generating image captions (image captioning)
- Vision-language modeling and image-text search
- Visual Question Answering (VQA) or multimodal retrieval
Can it be enriched or improved?
Yes. By downloading the images via their URLs, you can build a local copy of the set. You can also add visual annotations (objects, regions) or translate the captions into other languages.
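The captions file pairs each image filename with one caption per row, so the first step in most workflows is grouping the rows by image. A minimal sketch, assuming the CSV has `image_name` and `caption` columns (check the actual file header before use):

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of captions.csv; the column names ("image_name",
# "caption") are assumptions -- verify them against the real file header.
SAMPLE_CSV = """image_name,caption
1000092795.jpg,Two young guys with shaggy hair look at their hands.
1000092795.jpg,Two young men standing outside near bushes.
10002456.jpg,Several men in hard hats operate a pulley system.
"""

def group_captions(csv_text):
    """Group caption rows by image filename: filename -> list of captions."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["image_name"]].append(row["caption"])
    return dict(grouped)

captions = group_captions(SAMPLE_CSV)
print(len(captions["1000092795.jpg"]))  # number of captions for that image
```

In the full dataset each image should end up with 5 captions in its list.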
🔎 In summary
🧠 Recommended for
- Multimodal AI researchers
- VQA students
- Vision-language engineers
🔧 Compatible tools
- Hugging Face Datasets
- PyTorch
- TensorFlow
- Deep Lake
- CLIP
- BLIP
- ViLT
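To feed the image-caption pairs into PyTorch, a class that follows the `torch.utils.data.Dataset` protocol (`__len__`/`__getitem__`) is enough; it can then be wrapped in a `DataLoader`. A minimal sketch in plain Python (the pair structure and image-loading step are illustrative):

```python
class CaptionPairs:
    """Image-caption pairs exposed through the torch Dataset protocol."""

    def __init__(self, pairs):
        # pairs: list of (image_path, caption) tuples built from captions.csv
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        image_path, caption = self.pairs[idx]
        # In practice, load and transform the image here, e.g. with PIL:
        #   image = transform(Image.open(image_path).convert("RGB"))
        return image_path, caption

dataset = CaptionPairs([("1000092795.jpg", "Two young guys look at their hands.")])
print(len(dataset))
```

The same `(image, text)` pair format is what CLIP-style contrastive training and BLIP/ViLT fine-tuning pipelines generally expect.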
💡 Tip
Download the images in batches and keep a local snapshot, since images can be removed from Flickr at any time.
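A minimal snapshot sketch using only the standard library; the output directory name is an assumption, and real Flickr URLs may require rate limiting:

```python
import os
import urllib.request

def snapshot_images(url_list, out_dir="flickr30k_snapshot"):
    """Download each image once, skipping files already saved locally.

    Dead links are skipped silently, since Flickr images can disappear.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in url_list:
        filename = os.path.join(out_dir, url.rsplit("/", 1)[-1])
        if not os.path.exists(filename):
            try:
                urllib.request.urlretrieve(url, filename)
            except OSError:
                continue  # unreachable or removed image: skip it
        saved.append(filename)
    return saved
```

Rerunning the function is cheap because existing files are never downloaded twice, which makes it safe to resume an interrupted snapshot.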
Frequently Asked Questions
Are the images included in the dataset?
Yes — they are provided in the “flickr30k-images” version (~4.43 GB) on Kaggle.
Can I use this dataset commercially without attribution?
Yes, the CC0 license allows commercial use without attribution requirements.
Is it possible to translate the captions into other languages?
Yes, the caption fields in captions.csv can be translated to create multilingual versions, which can extend a model's coverage to languages other than English.
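One way to produce such a version is to pass every caption field through a translation function while keeping the CSV structure intact. A sketch with a placeholder translator (the `translate_stub` function and the `image_name`/`caption` column names are assumptions; swap in a real MT model or API):

```python
import csv
import io

def translate_stub(text, target_lang):
    # Placeholder translator (hypothetical): replace with a real MT system.
    return f"[{target_lang}] {text}"

def translate_captions(csv_text, target_lang, translate=translate_stub):
    """Return a new CSV in which every caption has been translated."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row["caption"] = translate(row["caption"], target_lang)
        writer.writerow(row)
    return out.getvalue()

SOURCE = "image_name,caption\n1000092795.jpg,Two young guys look at their hands.\n"
print(translate_captions(SOURCE, "fr"))
```

Because only the caption column changes, the translated file stays aligned with the original images and can be used as a drop-in multilingual variant.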