
Text Extraction for OCR

Multimodal dataset consisting of invoice images and XML files containing the extracted data. Each image is paired with a corresponding XML file that can be used to extract entities such as invoice number, date, company name, telephone number, and address.

Size

Approximately 1560 image/XML pairs: invoice images in JPG or PNG format, each with an XML file holding the extracted tabular data

Licence

CC0: Public Domain

Description

The Text Extraction for OCR dataset contains approximately 1560 images of old invoices with their corresponding XML files. These XML files provide information extracted from each invoice, including the invoice number, date, business names, telephone numbers, and addresses. The images often contain visual errors such as character substitutions (e.g. '0' replaced by 'O'), which simulate real-world conditions.
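
As an illustration of how the image/XML pairs might be loaded, here is a minimal Python sketch. It rests on two assumptions that the description above does not confirm: that each image and its XML file share the same file name stem, and that the entities are stored as leaf elements. The helper names `pair_files` and `read_fields` are hypothetical, and the XML schema is not documented here, so the code simply flattens whatever leaf elements it finds.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pair_files(root):
    """Return (image_path, xml_path) pairs that share a file name stem.

    Assumption: an image and its annotation differ only by extension.
    """
    root = Path(root)
    xml_by_stem = {path.stem: path for path in root.rglob("*.xml")}
    pairs = []
    for image in sorted(root.rglob("*")):
        if image.suffix.lower() in IMAGE_SUFFIXES and image.stem in xml_by_stem:
            pairs.append((image, xml_by_stem[image.stem]))
    return pairs


def read_fields(xml_path):
    """Flatten every non-empty leaf element into a tag -> text dictionary.

    Assumption: the real schema is unknown, so no tag names are hard-coded.
    If a tag repeats, the last occurrence wins.
    """
    fields = {}
    for element in ET.parse(xml_path).getroot().iter():
        if len(element) == 0 and element.text and element.text.strip():
            fields[element.tag] = element.text.strip()
    return fields
```

Images without a matching XML file are skipped rather than reported, which keeps the sketch short; a real loader would probably log them.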

What is this dataset for?

  • Developing and testing entity extraction (NER) algorithms specific to invoice documents
  • Improving the recognition of tabular data in complex images
  • Building OCR models that are robust to typographical errors and poor image quality
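
One common way to measure how robust an OCR model is to such errors is the character error rate (CER): the edit distance between the reference text and the OCR output, divided by the reference length. The sketch below is a plain-Python illustration and is not part of the dataset; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference string."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, reading `INV-2024-001` as `INV-2O24-OO1` involves three substitutions over twelve reference characters, so `character_error_rate` gives 0.25 for that pair.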

Can it be enriched or improved?

Yes. Additional manual annotations can increase the richness of the extracted entities, and other similar document types (receipts, purchase orders) can be integrated. Correcting errors in the XML files can also improve the quality of the dataset.
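
A minimal sketch of one possible correction pass is shown below. It assumes the errors are letter/digit confusions like the '0'/'O' example, and that the caller knows which fields should be purely numeric. The field names and the confusion table are illustrative assumptions, not part of the dataset.

```python
# Letters that are commonly misread in place of digits (illustrative, not exhaustive).
CONFUSIONS = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)


def fix_numeric_field(value):
    """Map look-alike letters back to digits; other characters are untouched."""
    return value.translate(CONFUSIONS)


def clean_record(record, numeric_fields):
    """Apply the digit fix only to fields expected to contain digits."""
    return {
        name: fix_numeric_field(value) if name in numeric_fields else value
        for name, value in record.items()
    }


# Hypothetical record: "telephone" and "company" are assumed field names.
record = {"telephone": "O3-555 l2O4", "company": "SOLO Ltd"}
cleaned = clean_record(record, numeric_fields={"telephone"})
```

Restricting the fix to numeric fields matters: applied to a company name or to an invoice number with a letter prefix, the same mapping would introduce new errors.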

🔎 In summary

Criterion: Evaluation
🧩 Ease of Use: ⭐⭐⭐☆☆ (Medium: requires joint handling of images and XML)
🧼 Cleaning Required: ⭐☆☆☆☆ (High: XML data errors need correction)
🏷️ Annotation Richness: ⭐⭐⭐☆☆ (Good: multiple entities extracted with XML structure)
📜 Commercial License: ✅ Free (CC0)
👨‍💻 Beginner Friendly: ⚠️ Medium (requires OCR and XML knowledge)
🔁 Reusable for Fine-Tuning: 🔥 Excellent for training OCR and task-specific NER models
🌍 Cultural Diversity: Medium (invoice-focused dataset, context not specified)

🧠 Recommended for

  • OCR researchers
  • NER tool developers
  • Document digitization projects

🔧 Compatible tools

  • Tesseract
  • EasyOCR
  • spaCy
  • Transformer-based OCR models

💡 Tip

Combine visual image analysis with XML data to improve the accuracy of extractions.
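
One way to act on this tip is to compare the text an OCR engine reads from the image with the values annotated in the XML, and flag the fields where they disagree. The sketch below uses only Python's standard `difflib`. The OCR text is passed in as a plain string (it could come from Tesseract or EasyOCR, but no OCR call is made here), and the field names, function names, and 0.8 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher


def best_match(value, ocr_lines):
    """Return (line, similarity) for the OCR line closest to an annotated value."""
    scored = [
        (line, SequenceMatcher(None, value.lower(), line.lower()).ratio())
        for line in ocr_lines
    ]
    return max(scored, key=lambda item: item[1], default=(None, 0.0))


def align(fields, ocr_text, threshold=0.8):
    """For each annotated field, find the closest OCR line and check agreement."""
    lines = [line.strip() for line in ocr_text.splitlines() if line.strip()]
    report = {}
    for name, value in fields.items():
        line, score = best_match(value, lines)
        report[name] = {
            "ocr_line": line,
            "score": round(score, 2),
            "agrees": score >= threshold,
        }
    return report
```

Fields where `agrees` is false are candidates for manual review: either the OCR misread the image or the XML annotation itself contains an error.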

Frequently Asked Questions

Can this dataset be used to automatically extract invoice data?

Yes. It is designed specifically for automatically extracting key entities from invoice images, using the associated XML files as annotations.

Do XML files need to be cleaned before use?

Yes. Some typographical errors are present in the XML files, and correcting them is advisable for better results.

Does the dataset only contain images or also annotations?

It contains both invoice images and their structured XML files that serve as annotations.
