TextOCR: Text extraction from natural images
A corpus of natural images with word-level text annotations, intended for training optical character recognition (OCR) models and visual understanding models.
25,000 images, around 1 million word annotations, PNG and JSON formats
CC0: Public Domain
Description
TextOCR is an open-source dataset designed for extracting text from images of natural scenes. It contains over 25,000 images from TextVQA, annotated with nearly 1 million words. Each word is localized by a polygon, which allows precise training of optical character recognition (OCR) models on straight or curved text under varied conditions.
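As a minimal sketch of how a polygon annotation can be consumed, the snippet below converts one word polygon into a tight axis-aligned bounding box. It assumes each annotation stores its polygon as a flat list `[x1, y1, x2, y2, ...]` under a `points` key and its transcription under a `utf8_string` key. These field names and the sample values are assumptions for illustration and should be checked against the JSON file you download.

```python
from typing import Dict, List, Tuple


def polygon_to_pairs(points: List[float]) -> List[Tuple[float, float]]:
    """Turn a flat [x1, y1, x2, y2, ...] list into [(x1, y1), (x2, y2), ...]."""
    if len(points) < 6 or len(points) % 2 != 0:
        raise ValueError("a polygon needs at least 3 (x, y) pairs")
    return list(zip(points[0::2], points[1::2]))


def polygon_to_bbox(points: List[float]) -> Tuple[float, float, float, float]:
    """Return the tight axis-aligned box (x_min, y_min, x_max, y_max)."""
    pairs = polygon_to_pairs(points)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical annotation: key names and values are assumed, not confirmed.
example_annotation: Dict[str, object] = {
    "utf8_string": "STOP",
    "points": [10.0, 20.0, 110.0, 25.0, 108.0, 60.0, 8.0, 55.0],
}

bbox = polygon_to_bbox(example_annotation["points"])
print(example_annotation["utf8_string"], bbox)
```

Keeping the polygon and deriving the box only when needed preserves the extra precision on curved or rotated text, while still supporting tools that expect rectangles.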
What is this dataset for?
- Train OCR models capable of recognizing text in complex contexts (curved, partially visible, etc.)
- Improve VQA (Visual Question Answering) or multimodal captioning models
- Test the robustness of models to different types of fonts and backgrounds
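For the robustness use case, one possible way to score a recognizer against the word transcriptions is sketched below, using case-insensitive word accuracy and character error rate. The choice of metrics and the function names are assumptions for illustration. The dataset does not prescribe an evaluation protocol here.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def word_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of words predicted exactly, ignoring case."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align one to one")
    if not references:
        return 0.0
    hits = sum(p.lower() == r.lower() for p, r in zip(predictions, references))
    return hits / len(references)


def character_error_rate(predictions: List[str], references: List[str]) -> float:
    """Total edit distance divided by the total number of reference characters."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align one to one")
    total_edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0
```

If you group images yourself (for example by font style or background type), computing these scores per group gives a rough picture of where a model degrades.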
Can it be enriched or improved?
Yes. The dataset can be extended with additional languages, combined with synthetic data, or adapted to tasks such as classifying the text in an image. The annotations can also be enriched with semantic meta-information (location, type of sign, etc.).
🔎 In summary
🧠 Recommended for
- Advanced OCR projects
- VQA
- Understanding street images
🔧 Compatible tools
- PaddleOCR
- Tesseract
- Detectron2
- MMDetection
- EasyOCR
💡 Tip
For better performance, use a pipeline that combines text detection with fine-grained text recognition, using the provided polygons.
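A minimal sketch of such a two-stage flow follows. It uses the annotated polygons in place of a detector's output, crops a padded box around each one, and hands the crop to a recognizer. The `image` object is assumed to expose a Pillow-style `size` attribute and `crop(box)` method, polygons are assumed to be flat `[x1, y1, x2, y2, ...]` lists, and `recognize` is a hypothetical callable standing in for whichever OCR model you use.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]


def padded_crop_box(points: Sequence[float],
                    image_size: Tuple[int, int],
                    pad: int = 2) -> Box:
    """Axis-aligned crop box around a flat polygon, clamped to the image.

    Returns (left, upper, right, lower) with right/lower exclusive,
    matching the box convention used by Pillow's Image.crop.
    """
    width, height = image_size
    xs, ys = points[0::2], points[1::2]
    left = max(0, int(min(xs)) - pad)
    upper = max(0, int(min(ys)) - pad)
    right = min(width, int(max(xs)) + 1 + pad)
    lower = min(height, int(max(ys)) + 1 + pad)
    return left, upper, right, lower


def read_words(image,
               polygons: List[Sequence[float]],
               recognize: Callable[[object], str],
               pad: int = 2) -> List[Dict[str, object]]:
    """Crop each polygon region and run the recognizer on the crop."""
    results = []
    for points in polygons:
        box = padded_crop_box(points, image.size, pad)
        crop = image.crop(box)  # assumed Pillow-style interface
        results.append({"box": box, "text": recognize(crop)})
    return results
```

At training time the polygons come from the annotations. At inference time they would come from a text detection model, and the recognizer would see the same kind of crops it was trained on.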
Frequently Asked Questions
Is the text always well centered in the images?
No. Text appears in varied contexts and is sometimes partial or angled, which makes the dataset a good challenge for OCR models.
Does the dataset only contain English?
Mostly yes. However, some words or signs may be multilingual depending on the context of the images.
Can it be used to train a captioning model?
Yes. Combined with visual annotations, this dataset can be used to generate image captions that include the text visible in the image.




