TextOCR: Text extraction from natural images

A corpus of natural images annotated with text, for training optical character recognition (OCR) and visual understanding models.

Size

Over 25,000 images, nearly 1 million word annotations; PNG and JSON formats

License

CC0: Public Domain

Description

TextOCR is an open-source dataset designed for extracting text from images of natural scenes. It contains over 25,000 images from TextVQA, annotated with nearly 1 million words. Each word is localized with a polygon, which supports precise training of optical character recognition (OCR) models on straight or curved text under varied conditions.
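
As a minimal sketch of how polygon annotations of this kind might be read, the Python snippet below assumes a JSON layout with `imgs`, `anns` and `imgToAnns` keys, where each annotation stores its text in `utf8_string` and its polygon as a flat `points` list. These key names, the sample record and the helper functions are assumptions for illustration, so check them against the file you actually download.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical sample mirroring the ASSUMED annotation layout; not real data.
SAMPLE_JSON = """
{
  "imgs": {
    "img_1": {"id": "img_1", "file_name": "train/img_1.png",
              "width": 640, "height": 480}
  },
  "anns": {
    "img_1_1": {"id": "img_1_1", "image_id": "img_1", "utf8_string": "STOP",
                "points": [10.0, 20.0, 110.0, 22.0, 108.0, 60.0, 9.0, 58.0]}
  },
  "imgToAnns": {"img_1": ["img_1_1"]}
}
"""

Point = Tuple[float, float]


def polygon_from_points(flat: List[float]) -> List[Point]:
    """Turn [x1, y1, x2, y2, ...] into [(x1, y1), (x2, y2), ...]."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of coordinates")
    return list(zip(flat[0::2], flat[1::2]))


def bounding_box(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def words_for_image(data: Dict, image_id: str) -> List[Tuple[str, List[Point]]]:
    """Return (text, polygon) pairs for every word annotated on one image."""
    pairs = []
    for ann_id in data["imgToAnns"].get(image_id, []):
        ann = data["anns"][ann_id]
        pairs.append((ann["utf8_string"], polygon_from_points(ann["points"])))
    return pairs


data = json.loads(SAMPLE_JSON)
for text, polygon in words_for_image(data, "img_1"):
    print(text, bounding_box(polygon))
```

Keeping the polygon rather than only its bounding box matters for curved text: the box is a convenient crop region, while the polygon preserves the actual outline of the word.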

What is this dataset for?

Train OCR models to recognize text in complex conditions (curved, partially visible, etc.)
Improve VQA (Visual Question Answering) or multimodal captioning models
Test the robustness of models to different fonts and backgrounds
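
For the robustness use case, one common way to compare models across subsets (for example curved versus straight text) is to report word accuracy and character error rate. The sketch below is a plain-Python illustration under that assumption; the `evaluate` helper and the sample predictions are hypothetical and are not part of the dataset or of any specific OCR library.

```python
from typing import Iterable, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def evaluate(pairs: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (word_accuracy, character_error_rate) for (truth, prediction) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (truth, prediction) pair is required")
    exact = sum(1 for truth, pred in pairs if truth == pred)
    errors = sum(edit_distance(truth, pred) for truth, pred in pairs)
    total_chars = sum(len(truth) for truth, _ in pairs)
    return exact / len(pairs), errors / max(total_chars, 1)


# Hypothetical model outputs for a "curved text" subset.
curved_subset = [("STOP", "STOP"), ("Coffee", "Coffce"), ("EXIT", "EX1T")]
accuracy, cer = evaluate(curved_subset)
print(f"word accuracy={accuracy:.2f}, CER={cer:.2f}")
```

Running the same function on each subset (per font style, per background type, and so on) gives comparable numbers, which makes it easier to see where a model degrades.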

Can it be enriched or improved?

Yes. Additional languages can be added, the images can be combined with synthetic data, and the dataset can be extended to tasks such as classifying the text found in an image. The annotations can also be enriched with semantic metadata (location, type of sign, etc.).
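
A rough sketch of such enrichment, assuming annotations are stored as dictionaries keyed by annotation ID under an `anns` key: the `meta` field, the `add_metadata` helper and the sample values below are hypothetical names chosen for illustration, not part of the published annotation schema.

```python
import copy
from typing import Dict


def add_metadata(data: Dict, ann_id: str, **meta: str) -> Dict:
    """Return a copy of the dataset with `meta` merged into one annotation.

    The extra information lives under a separate "meta" key (an assumed,
    hypothetical field) so the original annotation fields stay untouched.
    """
    enriched = copy.deepcopy(data)
    annotation = enriched["anns"][ann_id]
    annotation.setdefault("meta", {}).update(meta)
    return enriched


# Hypothetical annotation record; not real data.
dataset = {
    "anns": {
        "img_1_1": {"id": "img_1_1", "image_id": "img_1", "utf8_string": "STOP"}
    }
}

enriched = add_metadata(dataset, "img_1_1", sign_type="road sign", language="en")
print(enriched["anns"]["img_1_1"]["meta"])
```

Working on a copy keeps the downloaded annotations intact, so the enriched file can be versioned separately from the original.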

🔎 In summary

Criterion	Evaluation
🧩 Ease of use	⭐⭐⭐⭐⭐ (High – well-structured JSON annotations)
🧼 Need for cleaning	⭐⭐⭐⭐⭐ (Low – ready-to-use for training)
🏷️ Annotation richness	⭐⭐⭐⭐⭐ (Excellent – fine localization down to individual words)
📜 Commercial license	✅ Yes (CC0)
👨‍💻 Beginner friendly	🌟 Yes – perfect for starting OCR projects
🔁 Fine-tuning ready	🎯 Ideal for fine-tuning OCR or multimodal models
🌍 Cultural diversity	⚠️ Medium – mostly English