Text Extraction for OCR

A multimodal dataset of invoice images paired with XML files containing the extracted data. Each image has a corresponding XML file that can be used to extract entities such as the invoice number, date, company name, telephone number, and address.

Size

Approximately 1560 image/XML pairs: invoice images in JPG or PNG format, each with an XML file holding the extracted tabular data
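
Before training on the pairs, it can help to confirm that every image actually has an XML partner. The sketch below is one possible check, not part of the dataset's tooling. It assumes that an image and its XML file share the same file name apart from the extension, which the description does not confirm; `pair_files` and the folder layout are hypothetical.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def pair_files(root):
    """Return (pairs, orphans); pairs maps a file stem to (image_path, xml_path)."""
    images, xmls = {}, {}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix in IMAGE_SUFFIXES:
            images[path.stem] = path
        elif suffix == ".xml":
            xmls[path.stem] = path
    # Assumption: matching stems identify an image/XML pair.
    shared = images.keys() & xmls.keys()
    pairs = {stem: (images[stem], xmls[stem]) for stem in sorted(shared)}
    # Stems present on only one side have no partner file.
    orphans = sorted(images.keys() ^ xmls.keys())
    return pairs, orphans
```

Calling `pair_files` on the extracted dataset folder and comparing `len(pairs)` with the stated size of roughly 1560 gives a quick sanity check; any names in `orphans` deserve a manual look.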

Licence

CC0: Public Domain

Description

The Text Extraction for OCR dataset contains approximately 1560 images of old invoices with their corresponding XML files. These XML files provide information extracted from each invoice, including the invoice number, date, business names, telephone numbers, and addresses. Images often contain visual errors such as character substitutions (e.g. '0' replaced by 'O'), simulating real-world conditions.
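
Since the XML schema is not documented here, a schema-agnostic reader is a reasonable first step. The sketch below flattens any XML document into a mapping from tag path to text values using the standard library. The sample document and its tag names (`invoice_number`, `company`, and so on) are invented for illustration and may not match the real files.

```python
import xml.etree.ElementTree as ET

def flatten_xml(xml_text):
    """Collect the text of every leaf element, keyed by its tag path."""
    root = ET.fromstring(xml_text)
    fields = {}

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        children = list(node)
        if not children:
            text = (node.text or "").strip()
            if text:
                # A path can repeat (e.g. table rows), so keep a list of values.
                fields.setdefault(path, []).append(text)
        for child in children:
            walk(child, path)

    walk(root, "")
    return fields

# Hypothetical sample; the real files may use different tags or store values in attributes.
SAMPLE = """<invoice>
  <invoice_number>INV-1O23</invoice_number>
  <date>2019-04-12</date>
  <company><name>Acme Ltd</name><telephone>555 0100</telephone></company>
</invoice>"""

fields = flatten_xml(SAMPLE)
print(fields["invoice/invoice_number"])
```

Running the same function over a handful of real files and inspecting the keys is a cheap way to discover which tag paths carry the invoice number, date, and contact details.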

What is this dataset for?

Developing and testing entity extraction (NER) algorithms specific to invoice documents
Improving the recognition of tabular data in complex images
Building OCR models that are robust against typographical or image quality errors
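
For any of these uses, model output has to be scored against the XML values. One common approach, sketched below, is to report an exact-match flag and a character error rate (edit distance divided by reference length) per field. The function names and the flat `name -> string` field layout are assumptions made for this example, not something the dataset prescribes.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def score_fields(predicted, reference):
    """Return {field: (exact_match, character_error_rate)} against the reference."""
    scores = {}
    for name, truth in reference.items():
        guess = predicted.get(name, "")  # a missing field counts as an empty guess
        cer = edit_distance(guess, truth) / max(len(truth), 1)
        scores[name] = (guess == truth, cer)
    return scores
```

Because the XML files themselves are reported to contain errors, scores computed this way measure agreement with the annotations rather than true accuracy.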

Can it be enriched or improved?

Yes. Additional manual annotations can enrich the set of extracted entities, and similar document types (receipts, purchase orders) can be integrated. Correcting errors in the XML files can also improve the quality of the dataset.
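
Some XML corrections can be partly automated. As a rough sketch, the function below repairs look-alike substitutions such as 'O' for '0' in fields that are expected to be numeric, and leaves everything else for manual review. The substitution table and the `fix_numeric_field` helper are illustrative assumptions, not derived from the dataset.

```python
# Letters commonly confused with digits; an illustrative table, not taken from the dataset.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"})

def fix_numeric_field(value):
    """Map look-alike letters to digits, but only if the result is fully numeric."""
    candidate = value.translate(LOOKALIKES)
    digits_only = candidate.replace(" ", "").replace("-", "").replace("+", "")
    if digits_only.isdigit():
        return candidate
    return value  # still ambiguous, so keep the original for manual review

print(fix_numeric_field("555 O1OO"))  # 555 0100
print(fix_numeric_field("Acme Ltd"))  # Acme Ltd
```

This should only be applied to fields known to hold numbers, such as telephone numbers, since a short all-letter value like "SOS" would otherwise be rewritten as "505".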

🔎 In summary

Criterion	Evaluation
🧩 Ease of Use	⭐⭐⭐☆☆ (Medium: requires joint handling of images and XML)
🧼 Cleaning Required	⭐☆☆☆☆ (High: XML data errors need correction)
🏷️ Annotation Richness	⭐⭐⭐☆☆ (Good: multiple entities extracted with XML structure)
📜 Commercial License	✅ Free (CC0)
👨‍💻 Beginner Friendly	⚠️ Medium: requires OCR and XML knowledge
🔁 Reusable for Fine-Tuning	🔥 Excellent for training OCR and task-specific NER models
🌍 Cultural Diversity	🌍 Medium: invoice-focused dataset, context not specified