Flickr30k Image‑Caption Dataset

Multimodal dataset of more than 30,000 images, each with 5 annotator-written captions, for training vision-language models.

Size

≈ 30,000 images + CSV annotations (captions), ~4.43 GB

Licence

CC0: public domain

Description

The Flickr30k dataset provides more than 30,000 images, each accompanied by 5 human-written captions. The images are hosted on Flickr and the annotations are available in CSV format. It is well suited to training and evaluating image captioning, VQA, or vision-language models.
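As a minimal sketch of working with the CSV annotations, the snippet below groups captions by image and flags any image that does not have exactly 5 captions. The column names `image` and `caption` and the sample rows are assumptions for illustration; check the header of your own copy before relying on them.

```python
import csv
import io
from collections import defaultdict


def group_captions(csv_file, image_col="image", caption_col="caption"):
    """Return {image name: [captions]} from an open CSV file object.

    The column names are assumptions; adjust them to your file's header.
    """
    captions = defaultdict(list)
    for row in csv.DictReader(csv_file):
        captions[row[image_col].strip()].append(row[caption_col].strip())
    return dict(captions)


def images_without_five(captions, expected=5):
    """List images whose caption count differs from the expected number."""
    return sorted(name for name, caps in captions.items() if len(caps) != expected)


# Tiny in-memory sample standing in for the real annotation file.
sample = io.StringIO(
    "image,caption\n"
    "1000.jpg,A dog runs on grass.\n"
    "1000.jpg,A brown dog is running.\n"
    "1001.jpg,Two children play football.\n"
)
grouped = group_captions(sample)
print(grouped["1000.jpg"])
print(images_without_five(grouped))
```

With the real file, pass `open(path, newline="", encoding="utf-8")` instead of the in-memory sample. An empty result from `images_without_five` would confirm the 5-captions-per-image layout.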

What is this dataset for?

  • Generating image captions (image captioning)
  • Vision-language modeling and image-text search
  • Visual Question Answering (VQA) or multimodal retrieval

Can it be enriched or improved?

Yes. Downloading the images via their URLs lets you build local copies of the dataset. You can also add visual annotations (objects, regions) or translate the captions into other languages.

🔎 In summary

Criterion | Evaluation
🧩 Ease of use | ⭐⭐⭐⭐⭐ (Annotations ready, image download required)
🧼 Need for cleaning | ⭐⭐⭐⭐⭐ (Low: well-formatted CSV; URL handling needed)
🏷️ Annotation richness | ⭐⭐⭐⭐⭐ (5 captions per image, very rich)
📜 Commercial license | ✅ CC0, commercial use allowed
👨‍💻 Beginner friendly | ✅ Yes, a standard starting point for multimodal work
🔁 Fine-tuning ready | 🖼️ Excellent for vision-language fine-tuning
🌍 Cultural diversity | 🌐 Large diversity of everyday human scenes

🧠 Recommended for

  • Multimodal AI researchers
  • VQA students
  • Vision-language engineers

🔧 Compatible tools

  • Hugging Face Datasets
  • PyTorch
  • TensorFlow
  • Deep Lake
  • CLIP
  • BLIP
  • ViLT

💡 Tip

Download the images in batches and keep a local snapshot in case they are later removed from Flickr.
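One way to follow this tip is sketched below: a small batch downloader that skips files already in the local snapshot and records failures instead of stopping. The function names, the `{file name: URL}` mapping, and the example URLs are hypothetical; the demo uses a stand-in fetcher so it runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch_bytes(url, timeout=30):
    """Default fetcher: download the raw bytes at a URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def snapshot_images(urls_by_name, out_dir, fetch=fetch_bytes):
    """Save each image under out_dir, skipping files that already exist.

    Returns (saved names, [(failed name, error message), ...]).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []
    for name, url in urls_by_name.items():
        target = out_dir / name
        if target.exists():
            continue  # already in the snapshot
        try:
            target.write_bytes(fetch(url))
            saved.append(name)
        except OSError as err:  # urllib's URLError and timeouts are OSErrors
            failed.append((name, str(err)))
    return saved, failed


def fake_fetch(url):
    """Stand-in fetcher for the demo; simulates one removed image."""
    if "missing" in url:
        raise OSError("not found")
    return b"image-bytes"


# Hypothetical URLs, for illustration only.
urls = {
    "1000.jpg": "https://example.com/1000.jpg",
    "1001.jpg": "https://example.com/missing.jpg",
}
with tempfile.TemporaryDirectory() as tmp:
    first = snapshot_images(urls, tmp, fetch=fake_fetch)
    second = snapshot_images(urls, tmp, fetch=fake_fetch)
print(first)
print(second)
```

Because existing files are skipped, the function can be re-run after an interruption and only the missing or previously failed images are retried.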

Frequently Asked Questions

Are the images included in the dataset?

Yes, they are provided in the "flickr30k-images" version (~4.43 GB) on Kaggle.

Can I use this dataset commercially without attribution?

Yes, the CC0 license allows commercial use without attribution requirements.

Is it possible to translate the captions into other languages?

Yes, the caption fields in captions.csv can be translated to create multilingual versions, which can help when training models for other languages.
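A rough sketch of how translated rows could be produced alongside the originals follows. The `translate` callable is a hypothetical placeholder for whatever translation system you use, the `image`/`caption` column names and the added `lang` column are assumptions, and the original captions are assumed to be in English.

```python
import csv
import io


def add_translations(rows, translate, target_lang, source_lang="en"):
    """Yield each original row plus a translated copy tagged with its language."""
    for row in rows:
        yield {**row, "lang": source_lang}
        yield {
            **row,
            "caption": translate(row["caption"], target_lang),
            "lang": target_lang,
        }


# Toy lookup standing in for a real translation system.
toy_dictionary = {"A dog runs on grass.": "Un chien court sur l'herbe."}


def toy_translate(text, target_lang):
    """Hypothetical translator: returns the text unchanged if unknown."""
    return toy_dictionary.get(text, text)


rows = [{"image": "1000.jpg", "caption": "A dog runs on grass."}]
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["image", "caption", "lang"])
writer.writeheader()
writer.writerows(add_translations(rows, toy_translate, "fr"))
print(out.getvalue())
```

Keeping the original row next to each translation preserves the image-to-caption link and lets you filter by language later.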
