Flickr Image Dataset
A multimodal dataset containing Flickr images with captions, annotated entities, and bounding boxes for vision-and-language learning.
31,800 images, 158,000 captions, 276,000 bounding boxes — JPEG, CSV
CC0: Public Domain
Description
The Flickr Image Dataset is a multimodal resource based on the Flickr30k dataset. It pairs 31,800 images with 158,000 text captions, enriched with more than 244,000 coreference chains and 276,000 manually annotated bounding boxes. It serves as a reference benchmark for image captioning, image-text alignment, and visual grounding tasks.
What is this dataset for?
- Train models for generating image captions (image captioning)
- Align text entities and visual regions in the same image (grounded NLP)
- Test multimodal models on the joint understanding of image + language
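Before any of these tasks, the images and captions have to be loaded. Below is a minimal loading sketch, assuming a common distribution layout of this dataset: a `flickr30k_images/` folder of JPEGs plus a pipe-delimited `results.csv` with `image_name` and `comment` columns. These file names and columns are assumptions; adjust them to match your copy.

```python
# Minimal loading sketch. Paths, file names, and the CSV delimiter are
# assumptions about a typical distribution of this dataset, not a fixed API.
import pandas as pd
from PIL import Image
from pathlib import Path

DATA_DIR = Path("flickr_image_dataset")            # hypothetical root folder

# Captions file: assumed to be pipe-delimited with one row per caption.
captions = pd.read_csv(DATA_DIR / "results.csv", sep="|", skipinitialspace=True)
captions.columns = [c.strip() for c in captions.columns]

# Each image has several captions; group them by file name.
grouped = captions.groupby("image_name")["comment"].apply(list)

first_image, first_captions = grouped.index[0], grouped.iloc[0]
img = Image.open(DATA_DIR / "flickr30k_images" / first_image)
print(img.size, first_captions[:2])
```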
Can it be enriched or improved?
Yes. Additional annotations (relational, linguistic, or visual) can be added, the captions can be translated into other languages, or automatic detection techniques can be applied and compared against the manual annotations (see the sketch below). The dataset can also be used to pre-train newer vision-language architectures.
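As one example of that comparison, a pretrained detector can be run on an image and its predictions scored against the manual boxes with IoU. This is a hedged sketch using torchvision's Faster R-CNN; the image path, confidence threshold, and manual-annotation format (a list of [x1, y1, x2, y2] boxes) are assumptions.

```python
# Hedged sketch: compare automatic detections with manually annotated boxes
# via IoU. The detector choice and the annotation format are assumptions.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from torchvision.ops import box_iou
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("example.jpg").convert("RGB")            # any image from the dataset
manual_boxes = torch.tensor([[34.0, 50.0, 210.0, 300.0]])   # hypothetical manual annotation

with torch.no_grad():
    pred = model([to_tensor(image)])[0]

# Keep confident detections and measure overlap with the manual boxes.
keep = pred["scores"] > 0.7
if keep.any():
    ious = box_iou(manual_boxes, pred["boxes"][keep])
    print("best IoU per manual box:", ious.max(dim=1).values)
else:
    print("no confident detections for this image")
```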
🔎 In summary
🧠 Recommended for
- Multimodality researchers
- Visual assistant developers
- Vision/language students
🔧 Compatible tools
- Hugging Face Transformers
- CLIP
- BLIP
- Detectron2
- spaCy
- OpenCV
💡 Tip
For training, group entities of the same type and learn cross-modal embeddings that align bounding boxes with their corresponding text segments.
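A minimal sketch of this tip, assuming CLIP as the shared embedding space: crop each annotated box, embed the crops and their text segments, and compare them by cosine similarity. The box coordinates, phrases, and image path below are placeholders, not real annotations from the dataset.

```python
# Hedged sketch: embed bounding-box crops and text segments in a shared CLIP
# space, then compute a region-to-phrase similarity matrix.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")
boxes = [(30, 40, 200, 320), (210, 60, 380, 300)]     # hypothetical entity boxes
phrases = ["a man in a red jacket", "a brown dog"]    # matching text segments

crops = [image.crop(b) for b in boxes]
inputs = processor(text=phrases, images=crops, return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Normalize and compare: rows are regions, columns are phrases.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)
```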
Frequently Asked Questions
Can images and annotations be used for a commercial project?
Yes, the dataset is licensed under CC0, which allows unrestricted commercial use.
Is it suitable for training CLIP or BLIP models?
Absolutely: the dataset is rich in image-text pairs and region annotations, making it well suited to these multimodal architectures.
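For instance, a pretrained BLIP checkpoint from the Hugging Face Hub can be run on dataset images as a baseline before fine-tuning on the caption pairs. This is a hedged sketch; the checkpoint choice and the image path are assumptions.

```python
# Hedged sketch: caption one dataset image with a pretrained BLIP checkpoint.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("flickr30k_images/example.jpg").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```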
Are the captions in multiple languages?
No, all descriptions are in English. However, it is possible to generate automatic translations to broaden language coverage.