Flickr30k Image‑Caption Dataset
Dataset of more than 30,000 images, each with 5 English captions written by human annotators, for training vision-and-language models.
Description
The Flickr30k dataset provides more than 30,000 images, each accompanied by 5 human-written captions. The images are hosted on Flickr and the annotations are distributed in CSV format. It is well suited to training and evaluating caption generation, VQA, or vision-language models.
What is this dataset for?
- Generating image captions (image captioning)
- Vision-language modeling and image-text search
- Visual Question Answering (VQA) or multimodal retrieval
Can it be enriched or improved?
Yes. By downloading the images via their URLs, you can build a local copy of the set. You can also add visual annotations (objects, regions) or translate the captions into other languages.
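The captions file pairs each image filename with one caption per row, so the first step in most workflows is grouping the rows by image. A minimal sketch, assuming the CSV has `image_name` and `caption` columns (check the actual file header before use):

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of captions.csv; the column names ("image_name",
# "caption") are assumptions -- verify them against the real file header.
SAMPLE_CSV = """image_name,caption
1000092795.jpg,Two young guys with shaggy hair look at their hands.
1000092795.jpg,Two young men standing outside near bushes.
10002456.jpg,Several men in hard hats operate a pulley system.
"""

def group_captions(csv_text):
    """Group caption rows by image filename: filename -> list of captions."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["image_name"]].append(row["caption"])
    return dict(grouped)

captions = group_captions(SAMPLE_CSV)
print(len(captions["1000092795.jpg"]))  # number of captions for that image
```

In the full dataset each image should end up with 5 captions in its list.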
🔎 In summary
🧠 Recommended for
- Multimodal AI researchers
- VQA students
- Vision-language engineers
🔧 Compatible tools
- Hugging Face Datasets
- PyTorch
- TensorFlow
- Deep Lake
- CLIP
- BLIP
- ViLT
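To feed the image-caption pairs into PyTorch, a class that follows the `torch.utils.data.Dataset` protocol (`__len__`/`__getitem__`) is enough; it can then be wrapped in a `DataLoader`. A minimal sketch in plain Python (the pair structure and image-loading step are illustrative):

```python
class CaptionPairs:
    """Image-caption pairs exposed through the torch Dataset protocol."""

    def __init__(self, pairs):
        # pairs: list of (image_path, caption) tuples built from captions.csv
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        image_path, caption = self.pairs[idx]
        # In practice, load and transform the image here, e.g. with PIL:
        #   image = transform(Image.open(image_path).convert("RGB"))
        return image_path, caption

dataset = CaptionPairs([("1000092795.jpg", "Two young guys look at their hands.")])
print(len(dataset))
```

The same `(image, text)` pair format is what CLIP-style contrastive training and BLIP/ViLT fine-tuning pipelines generally expect.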
💡 Tip
Download the images in batches and keep a local snapshot, since images can be removed from Flickr at any time.
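A minimal snapshot sketch using only the standard library; the output directory name is an assumption, and real Flickr URLs may require rate limiting:

```python
import os
import urllib.request

def snapshot_images(url_list, out_dir="flickr30k_snapshot"):
    """Download each image once, skipping files already saved locally.

    Dead links are skipped silently, since Flickr images can disappear.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in url_list:
        filename = os.path.join(out_dir, url.rsplit("/", 1)[-1])
        if not os.path.exists(filename):
            try:
                urllib.request.urlretrieve(url, filename)
            except OSError:
                continue  # unreachable or removed image: skip it
        saved.append(filename)
    return saved
```

Rerunning the function is cheap because existing files are never downloaded twice, which makes it safe to resume an interrupted snapshot.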
Frequently Asked Questions
Are the images included in the dataset?
Yes — they are provided in the “flickr30k-images” version (~4.43 GB) on Kaggle.
Can I use this dataset commercially without attribution?
Yes, the CC0 license allows commercial use without attribution requirements.
Is it possible to translate the captions into other languages?
Yes, the caption fields in captions.csv can be translated to create multilingual versions, which can extend a model's coverage to languages other than English.
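One way to produce such a version is to pass every caption field through a translation function while keeping the CSV structure intact. A sketch with a placeholder translator (the `translate_stub` function and the `image_name`/`caption` column names are assumptions; swap in a real MT model or API):

```python
import csv
import io

def translate_stub(text, target_lang):
    # Placeholder translator (hypothetical): replace with a real MT system.
    return f"[{target_lang}] {text}"

def translate_captions(csv_text, target_lang, translate=translate_stub):
    """Return a new CSV in which every caption has been translated."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row["caption"] = translate(row["caption"], target_lang)
        writer.writerow(row)
    return out.getvalue()

SOURCE = "image_name,caption\n1000092795.jpg,Two young guys look at their hands.\n"
print(translate_captions(SOURCE, "fr"))
```

Because only the caption column changes, the translated file stays aligned with the original images and can be used as a drop-in multilingual variant.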