Flickr Image Dataset

A multimodal dataset containing Flickr images with captions, annotated entities, and bounding boxes for vision-and-language learning.

Size

31,800 images, 158,000 captions, 276,000 bounding boxes — JPEG, CSV

License

CC0: Public Domain

Description

The Flickr Image Dataset is a multimodal resource based on the Flickr30k dataset. It pairs 31,800 images with 158,000 text captions, enriched with more than 244,000 coreference chains and 276,000 manually annotated bounding boxes. It is a reference benchmark for image caption generation, image-text alignment, and visual grounding tasks.

What is this dataset for?

  • Train image-captioning models (a data-loading sketch follows this list)
  • Align textual entities with the corresponding visual regions of an image (grounded NLP)
  • Test multimodal models on joint image + language understanding
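
To get started with captioning, the image-caption pairs can be read straight from the distributed files. Here is a minimal sketch, assuming the common Kaggle layout of a flickr30k_images/ folder plus a pipe-delimited results.csv with image_name, comment_number, and comment columns; adjust paths and column names to your download:

```python
from pathlib import Path

import pandas as pd
from PIL import Image

# Assumed Kaggle-style layout; adjust paths and column names to your copy.
ROOT = Path("flickr30k_images")
IMAGES = ROOT / "flickr30k_images"

# results.csv is pipe-delimited: image_name | comment_number | comment
captions = pd.read_csv(ROOT / "results.csv", sep="|", skipinitialspace=True)
captions.columns = [c.strip() for c in captions.columns]  # guard against padded headers

def iter_pairs(df):
    """Yield one (PIL.Image, caption) pair per annotated caption."""
    for row in df.itertuples(index=False):
        yield Image.open(IMAGES / row.image_name).convert("RGB"), str(row.comment).strip()

# Peek at the first pair.
image, caption = next(iter_pairs(captions))
print(image.size, "-", caption)
```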

Can it be enriched or improved?

Yes. Additional annotations (relational, linguistic, or visual) can be added, the captions can be translated into other languages, or automatic detection techniques can be applied and compared against the manual annotations, as sketched below. The dataset can also be used to pre-train newer vision-language architectures.
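
To illustrate the detection comparison, a pretrained detector can be run on an image and its predicted boxes scored against the manual annotations with IoU. A rough sketch using torchvision's COCO-pretrained Faster R-CNN (the filename and ground-truth box are hypothetical, and COCO labels will not map one-to-one to Flickr30k entities):

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import (FasterRCNN_ResNet50_FPN_Weights,
                                          fasterrcnn_resnet50_fpn)
from torchvision.ops import box_iou

# COCO-pretrained detector; run in eval mode for inference.
weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()
preprocess = weights.transforms()

image = read_image("flickr30k_images/flickr30k_images/36979.jpg")  # hypothetical file
with torch.no_grad():
    prediction = model([preprocess(image)])[0]

# Hypothetical manual box from the annotations, in (x1, y1, x2, y2) pixel coords.
manual_boxes = torch.tensor([[34.0, 50.0, 200.0, 310.0]])

# IoU matrix between predicted and manual boxes; values near 1 indicate agreement.
print(box_iou(prediction["boxes"], manual_boxes))
```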

🔎 In summary

  • 🧩 Ease of Use: ⭐⭐⭐☆☆ (structured, but requires both image and text processing)
  • 🧼 Cleaning Required: ⭐⭐☆☆☆ (low to moderate, depending on the target task)
  • 🏷️ Annotation Richness: ⭐⭐⭐⭐☆ (excellent: captions + entities + bounding boxes + coreference chains)
  • 📜 Commercial License: ✅ Yes (CC0)
  • 👨‍💻 Beginner Friendly: 👍 Moderate; some multimodal background recommended
  • 🔁 Reusable for Fine-tuning: 🔥 Excellent base for CLIP, BLIP, Flamingo, etc.
  • 🌍 Cultural Diversity: 🌍 Medium: captions mainly in English, but varied content

🧠 Recommended for

  • Multimodality researchers
  • Visual assistant developers
  • Vision/language students

🔧 Compatible tools

  • Hugging Face Transformers
  • CLIP
  • BLIP
  • Detectron2
  • spaCy (see the caption-parsing sketch after this list)
  • OpenCV
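
As one example of these tools in action, spaCy's noun chunks give a quick, automatic approximation of the entity mentions that the dataset links to bounding boxes. A small sketch (the caption is a made-up example; the manual annotations remain the reference):

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

caption = "Two young men are playing guitars on a dimly lit stage."  # made-up example
doc = nlp(caption)

# Noun chunks roughly approximate the entity mentions that the dataset
# links to bounding boxes.
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.start_char, chunk.end_char)
```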

💡 Tip

For training, group entity mentions of the same type and align bounding boxes with their text segments in a shared embedding space, as in the sketch below.
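
One way to implement this is to encode box crops and entity phrases with the same vision-language model and match them by similarity. A sketch using Hugging Face's CLIP (the file path, phrases, and box coordinates are hypothetical placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("flickr30k_images/flickr30k_images/36979.jpg").convert("RGB")  # hypothetical
phrases = ["a young man", "a guitar", "a dimly lit stage"]               # hypothetical mentions
boxes = [(30, 40, 210, 320), (120, 180, 260, 300), (0, 250, 500, 375)]   # hypothetical (x1, y1, x2, y2)

# Crop each annotated region and score it against every entity phrase.
crops = [image.crop(box) for box in boxes]
inputs = processor(text=phrases, images=crops, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # (n_regions, n_phrases) similarity

# Row i = region i; the argmax column is its best-matching phrase.
for i, j in enumerate(logits.argmax(dim=1).tolist()):
    print(f"box {boxes[i]} -> {phrases[j]}")
```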

Frequently Asked Questions

Can images and annotations be used for a commercial project?

Yes, the dataset is licensed under CC0, which allows unrestricted commercial use.

Is it suitable for training CLIP or BLIP models?

Absolutely, the dataset is rich in image-text pairs and annotations, making it ideal for these multimodal architectures.
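
As a concrete illustration, a single contrastive fine-tuning step on a toy batch of pairs might look like the following sketch with Hugging Face's CLIP (file paths and captions are placeholders; a real run would batch over the full dataset with a dataloader):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

# One toy batch of matched pairs (hypothetical files); the contrastive loss
# needs more than one pair per batch to be informative.
images = [Image.open(p).convert("RGB")
          for p in ("flickr30k_images/flickr30k_images/36979.jpg",
                    "flickr30k_images/flickr30k_images/65567.jpg")]
captions = ["Two men are playing guitars on a stage.",
            "A dog runs across a snowy field."]

batch = processor(text=captions, images=images, return_tensors="pt", padding=True)
loss = model(**batch, return_loss=True).loss  # symmetric image-text contrastive loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```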

Are the captions in multiple languages?

No, all captions are in English. However, automatic translations can be generated to broaden language coverage, as sketched below.
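
For instance, captions could be machine-translated with an off-the-shelf model. A minimal sketch using a Hugging Face translation pipeline, with English-to-French shown as one possible target:

```python
from transformers import pipeline

# Helsinki-NLP Opus-MT models cover many language pairs; French is one example.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

caption = "A man in a blue shirt is riding a bicycle down the street."
print(translator(caption)[0]["translation_text"])
```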
