Flickr Image Dataset
A multimodal dataset containing Flickr images with captions, annotated entities, and bounding boxes for vision-and-language learning.
31,800 images, 158,000 captions, 276,000 bounding boxes — JPEG, CSV
CC0: Public Domain
Description
The Flickr Image Dataset is a multimodal resource based on the Flickr30k dataset. It pairs 31,800 images with 158,000 text captions, enriched with more than 244,000 coreference chains and 276,000 manually annotated bounding boxes. It serves as a reference benchmark for image captioning, image-text alignment, and visual grounding tasks.
What is this dataset for?
- Train models for generating image captions (image captioning)
- Align text entities and visual regions in the same image (grounded NLP)
- Test multimodal models on the joint understanding of image + language
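Before any of these tasks, the images and captions have to be loaded. Below is a minimal loading sketch, assuming a common distribution layout of this dataset: a `flickr30k_images/` folder of JPEGs plus a pipe-delimited `results.csv` with `image_name` and `comment` columns. These file names and columns are assumptions; adjust them to match your copy.

```python
# Minimal loading sketch. Paths, file names, and the CSV delimiter are
# assumptions about a typical distribution of this dataset, not a fixed API.
import pandas as pd
from PIL import Image
from pathlib import Path

DATA_DIR = Path("flickr_image_dataset")            # hypothetical root folder

# Captions file: assumed to be pipe-delimited with one row per caption.
captions = pd.read_csv(DATA_DIR / "results.csv", sep="|", skipinitialspace=True)
captions.columns = [c.strip() for c in captions.columns]

# Each image has several captions; group them by file name.
grouped = captions.groupby("image_name")["comment"].apply(list)

first_image, first_captions = grouped.index[0], grouped.iloc[0]
img = Image.open(DATA_DIR / "flickr30k_images" / first_image)
print(img.size, first_captions[:2])
```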
Can it be enriched or improved?
Yes. Additional annotations (relational, linguistic, or visual) can be added, the captions can be translated into other languages, or automatic detection techniques can be applied and compared against the manual annotations (see the sketch below). The dataset can also be used to pre-train newer vision-language architectures.
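As one example of that comparison, a pretrained detector can be run on an image and its predictions scored against the manual boxes with IoU. This is a hedged sketch using torchvision's Faster R-CNN; the image path, confidence threshold, and manual-annotation format (a list of [x1, y1, x2, y2] boxes) are assumptions.

```python
# Hedged sketch: compare automatic detections with manually annotated boxes
# via IoU. The detector choice and the annotation format are assumptions.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from torchvision.ops import box_iou
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("example.jpg").convert("RGB")            # any image from the dataset
manual_boxes = torch.tensor([[34.0, 50.0, 210.0, 300.0]])   # hypothetical manual annotation

with torch.no_grad():
    pred = model([to_tensor(image)])[0]

# Keep confident detections and measure overlap with the manual boxes.
keep = pred["scores"] > 0.7
if keep.any():
    ious = box_iou(manual_boxes, pred["boxes"][keep])
    print("best IoU per manual box:", ious.max(dim=1).values)
else:
    print("no confident detections for this image")
```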
🔎 In summary
🧠 Recommended for
- Multimodality researchers
- Visual assistant developers
- Vision/language students
🔧 Compatible tools
- Hugging Face Transformers
- CLIP
- BLIP
- Detectron2
- spaCy
- OpenCV
💡 Tip
For training, group entities of the same type and learn cross-modal embeddings that align bounding boxes with their corresponding text segments.
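A minimal sketch of this tip, assuming CLIP as the shared embedding space: crop each annotated box, embed the crops and their text segments, and compare them by cosine similarity. The box coordinates, phrases, and image path below are placeholders, not real annotations from the dataset.

```python
# Hedged sketch: embed bounding-box crops and text segments in a shared CLIP
# space, then compute a region-to-phrase similarity matrix.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")
boxes = [(30, 40, 200, 320), (210, 60, 380, 300)]     # hypothetical entity boxes
phrases = ["a man in a red jacket", "a brown dog"]    # matching text segments

crops = [image.crop(b) for b in boxes]
inputs = processor(text=phrases, images=crops, return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Normalize and compare: rows are regions, columns are phrases.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)
```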
Frequently Asked Questions
Can images and annotations be used for a commercial project?
Yes, the dataset is licensed under CC0, which allows unrestricted commercial use.
Is it suitable for training CLIP or BLIP models?
Absolutely: the dataset is rich in image-text pairs and region annotations, making it well suited to these multimodal architectures.
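For instance, a pretrained BLIP checkpoint from the Hugging Face Hub can be run on dataset images as a baseline before fine-tuning on the caption pairs. This is a hedged sketch; the checkpoint choice and the image path are assumptions.

```python
# Hedged sketch: caption one dataset image with a pretrained BLIP checkpoint.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("flickr30k_images/example.jpg").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```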
Are the captions in multiple languages?
No, all descriptions are in English. However, it is possible to generate automatic translations to broaden language coverage.