TextOCR: Extraction of text on natural images

A corpus of natural images with text annotations, for training optical character recognition (OCR) and visual understanding models.

Size

Over 25,000 images, nearly 1 million word annotations, PNG and JSON formats

Licence

CC0: Public Domain

Description

TextOCR is an open-source dataset designed for extracting text from images of natural scenes. It contains over 25,000 images from TextVQA, annotated with nearly 1 million words. Each word is localized with a polygon, which supports precise training of optical character recognition (OCR) models on straight or curved text under varied conditions.
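As a rough illustration of working with the word polygons, here is a minimal sketch that groups annotations by image. The JSON layout is an assumption (top-level `imgs` and `anns` maps, with `image_id`, `points` as a flat coordinate list, and `utf8_string` per word), and so is the convention that a lone "." marks an unreadable word; check both against the files you download.

```python
import json
from collections import defaultdict

# Tiny inline sample that mimics the ASSUMED layout; real files are much larger.
SAMPLE = """
{
  "imgs": {"img_a": {"id": "img_a", "width": 640, "height": 480}},
  "anns": {
    "img_a_1": {"id": "img_a_1", "image_id": "img_a",
                "points": [10, 20, 110, 20, 110, 60, 10, 60],
                "utf8_string": "STOP"},
    "img_a_2": {"id": "img_a_2", "image_id": "img_a",
                "points": [200, 50, 260, 40, 265, 70, 205, 80],
                "utf8_string": "."}
  }
}
"""


def to_xy_pairs(flat_points):
    """Turn [x1, y1, x2, y2, ...] into [(x1, y1), (x2, y2), ...]."""
    if len(flat_points) % 2 != 0:
        raise ValueError("a polygon needs an even number of coordinates")
    return list(zip(flat_points[0::2], flat_points[1::2]))


def words_by_image(dataset, skip_text=(".",)):
    """Group word polygons by image id, skipping placeholder transcriptions."""
    grouped = defaultdict(list)
    for ann in dataset["anns"].values():
        if ann["utf8_string"] in skip_text:
            continue  # assumed marker for illegible text
        grouped[ann["image_id"]].append(
            {"text": ann["utf8_string"], "polygon": to_xy_pairs(ann["points"])}
        )
    return dict(grouped)


dataset = json.loads(SAMPLE)
index = words_by_image(dataset)
print(index)
```

Building the per-image index from `image_id` avoids depending on any precomputed image-to-annotation map in the file.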

What is this dataset for?

  • Train OCR models capable of recognizing text in complex contexts (curved, partially visible, etc.)
  • Improve VQA (Visual Question Answering) or multimodal captioning models
  • Test the robustness of models to different types of fonts and backgrounds
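
To compare models across fonts and backgrounds, you need a score per subset. The sketch below is one possible way to do it, not a prescribed protocol for this dataset: it computes exact-match word accuracy and a mean normalized edit distance over predicted and reference transcriptions. The function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def score(predictions, references):
    """Return word accuracy and mean normalized edit distance."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align one to one")
    if not references:
        raise ValueError("nothing to score")
    pairs = list(zip(predictions, references))
    exact = sum(p == r for p, r in pairs)
    # Normalize by the longer string so each word contributes a value in [0, 1].
    ned = sum(edit_distance(p, r) / max(len(p), len(r), 1) for p, r in pairs)
    n = len(pairs)
    return {"word_accuracy": exact / n, "mean_normalized_edit_distance": ned / n}


result = score(["STOP", "EX1T"], ["STOP", "EXIT"])
print(result)
```

Running the same scoring on separate subsets (for example, curved versus straight text) gives a simple robustness comparison.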

Can it be enriched or improved?

Yes. You can add other languages, combine the dataset with synthetic data, or extend it to tasks such as classifying the text found in an image. The annotations can also be enriched with semantic metadata (location, type of sign, etc.).
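
As a small sketch of what such enrichment could look like, the snippet below attaches a hypothetical `meta` field to a word annotation. The `meta` key and its contents are invented for illustration and are not part of the dataset; `utf8_string` is the assumed name of the transcription field.

```python
def char_class(text):
    """Coarse label for a transcription: digits, letters, or mixed."""
    if text.isdigit():
        return "digits"
    if text.isalpha():
        return "letters"
    return "mixed"


def enrich(annotation):
    """Return a copy of the annotation with a hypothetical `meta` field added."""
    text = annotation["utf8_string"]
    meta = {
        "char_class": char_class(text),
        "is_ascii": text.isascii(),  # rough hint for non-English words
    }
    # Copy rather than mutate, so the original annotations stay untouched.
    return {**annotation, "meta": meta}


word = {"image_id": "img_a", "utf8_string": "café"}
print(enrich(word))
```

Richer labels such as sign type or location would need manual annotation or a separate classifier; the same `meta` slot could hold them.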

🔎 In summary

Criterion | Evaluation
🧩 Ease of use | ⭐⭐⭐⭐⭐ (High – well-structured JSON annotations)
🧼 Need for cleaning | ⭐⭐⭐⭐⭐ (Low – ready-to-use for training)
🏷️ Annotation richness | ⭐⭐⭐⭐⭐ (Excellent – fine localization down to individual words)
📜 Commercial license | ✅ Yes (CC0)
👨‍💻 Beginner friendly | 🌟 Yes – perfect for starting OCR projects
🔁 Fine-tuning ready | 🎯 Ideal for fine-tuning OCR or multimodal models
🌍 Cultural diversity | ⚠️ Medium – mostly English

🧠 Recommended for

  • Advanced OCR projects
  • VQA
  • Understanding street images

🔧 Compatible tools

  • PaddleOCR
  • Tesseract
  • Detectron2
  • MMDetection
  • EasyOCR

💡 Tip

For better performance, use a two-stage pipeline: detect text regions first, then run recognition on crops derived from the polygons provided.
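
The cropping step between detection and recognition can be sketched as follows. This is a simplification under stated assumptions: it takes the axis-aligned box around a polygon instead of rectifying curved or rotated text, and `polygon_to_crop_box` is an illustrative helper, not part of any of the tools listed above.

```python
import math


def polygon_to_crop_box(polygon, image_width, image_height, pad=2):
    """Axis-aligned (left, top, right, bottom) box around a word polygon.

    The box is padded by `pad` pixels and clamped to the image bounds.
    """
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    left = max(0, math.floor(min(xs)) - pad)
    top = max(0, math.floor(min(ys)) - pad)
    right = min(image_width, math.ceil(max(xs)) + pad)
    bottom = min(image_height, math.ceil(max(ys)) + pad)
    return left, top, right, bottom


# A slightly tilted word polygon inside a 640x480 image.
polygon = [(200, 50), (260, 40), (265, 70), (205, 80)]
box = polygon_to_crop_box(polygon, image_width=640, image_height=480)
print(box)
```

The returned tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so the crop can be handed to a recognizer directly. For strongly curved text, a perspective or thin-plate rectification would likely work better than a plain box.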

Frequently Asked Questions

Is the text always well centered in the images?

No, text is present in various contexts, sometimes partial or angled, making it a good challenge for OCR models.
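
To get a feel for how much of the text is angled, one option is to estimate each word's tilt from its polygon. The sketch below assumes the first two polygon points trace the top edge of the word from left to right; that ordering is an assumption to verify on your copy of the annotations.

```python
import math


def baseline_angle_degrees(polygon):
    """Angle of the edge from the first to the second polygon point.

    Image coordinates have y pointing down, so a positive angle means
    the word is rotated clockwise on screen.
    """
    (x0, y0), (x1, y1) = polygon[0], polygon[1]
    return math.degrees(math.atan2(y1 - y0, x1 - x0))


def is_angled(polygon, tolerance_degrees=10.0):
    """True when the estimated tilt exceeds the tolerance in either direction."""
    return abs(baseline_angle_degrees(polygon)) > tolerance_degrees


flat = [(0, 0), (100, 0), (100, 30), (0, 30)]
tilted = [(0, 0), (70, 70), (50, 90), (-20, 20)]
print(baseline_angle_degrees(flat), baseline_angle_degrees(tilted))
```

Counting words where `is_angled` returns true gives a quick, approximate picture of how challenging a subset is for a recognizer trained on horizontal text.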

Does the dataset only contain English?

Mostly yes. However, some words or signs may be multilingual depending on the context of the images.

Can it be used to train a captioning model?

Yes, combined with visual annotations, it is possible to use this dataset to generate image captions that contain text.
