Text Extraction for OCR
Multimodal dataset consisting of invoice images and XML files containing the extracted data. Each image is paired with a corresponding XML file that can be used to extract entities such as invoice number, date, company name, telephone number, and address.
Approximately 1560 image-XML pairs: invoice images in JPG/PNG format, each with an XML file containing the extracted tabular data
CC0: Public Domain
Description
The Text Extraction for OCR dataset contains approximately 1560 images of old invoices with their corresponding XML files. These XML files provide the information extracted from each invoice, including the invoice number, date, business name, telephone number, and address. The images often contain visual errors such as character substitutions (e.g. '0' replaced by 'O'), which simulate real-world conditions.
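As a rough illustration of reading one annotation file, the sketch below parses an XML string into a dictionary of entities. The tag names (`invoice_number`, `date`, `company`, `telephone`, `address`) and the `<invoice>` root are assumptions made for this example; the dataset's actual schema may differ and should be checked before use.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names; adjust them to the dataset's actual XML schema.
FIELDS = ("invoice_number", "date", "company", "telephone", "address")

def parse_invoice_xml(xml_text: str) -> dict:
    """Map each expected field to its text content, or None if absent."""
    root = ET.fromstring(xml_text)
    record = {}
    for field in FIELDS:
        node = root.find(f".//{field}")
        has_text = node is not None and node.text
        record[field] = node.text.strip() if has_text else None
    return record

# Made-up sample, not taken from the dataset.
sample = """<invoice>
  <invoice_number>INV-0042</invoice_number>
  <date>2019-03-14</date>
  <company>Acme Ltd</company>
</invoice>"""
record = parse_invoice_xml(sample)
```

Missing fields come back as `None` rather than raising, which makes incomplete annotations easy to count.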
What is this dataset for?
- Developing and testing named entity recognition (NER) algorithms specific to invoice documents
- Improving the recognition of tabular data in complex images
- Building OCR models that are robust against typographical or image quality errors
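One common way to measure how robust an OCR model is on data like this is the character error rate (CER): the edit distance between the OCR output and the reference text, divided by the reference length. The following is a minimal pure-Python sketch; the function names are illustrative and not part of the dataset or of any library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def character_error_rate(predicted: str, reference: str) -> float:
    """Edit distance normalized by the length of the reference string."""
    if not reference:
        return 0.0 if not predicted else 1.0
    return levenshtein(predicted, reference) / len(reference)

# Two 'O' characters read in place of '0' in an 8-character reference.
cer = character_error_rate("INV-OO42", "INV-0042")
```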
Can it be enriched or improved?
Yes. Additional manual annotations can be added to enrich the extracted entities, and other similar document types (receipts, purchase orders) can be integrated. Correcting errors in the XML files can also improve the quality of the dataset.
🔎 In summary
🧠 Recommended for
- OCR researchers
- NER tool developers
- Document digitization projects
🔧 Compatible tools
- Tesseract
- EasyOCR
- spaCy
- Transformers OCR
💡 Tip
Combine visual image analysis with XML data to improve the accuracy of extractions.
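One possible way to apply this tip is to check whether a value from the XML file actually appears in the text produced by an OCR engine, tolerating small character errors. The sketch below uses the standard library's `difflib.SequenceMatcher` with a sliding window; `best_match_ratio` is a hypothetical helper, and the OCR text is a made-up string standing in for real engine output.

```python
from difflib import SequenceMatcher

def best_match_ratio(value: str, ocr_text: str) -> float:
    """Highest similarity (0.0 to 1.0) between value and any same-length
    window of ocr_text, compared case-insensitively."""
    value, ocr_text = value.lower(), ocr_text.lower()
    if not value or not ocr_text:
        return 0.0
    width = len(value)
    last_start = max(len(ocr_text) - width, 0)
    return max(
        SequenceMatcher(None, value, ocr_text[start:start + width]).ratio()
        for start in range(last_start + 1)
    )

# Made-up OCR output containing an 'O'-for-'0' substitution.
ocr_text = "Invoice No: INV-OO42 Date 14/03/2019"
score = best_match_ratio("INV-0042", ocr_text)
```

A threshold on the score (its value is a judgment call that depends on the field) can then flag pairs where the image and the XML disagree.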
Frequently Asked Questions
Can this dataset be used to automatically extract invoice data?
Yes. It is designed specifically for automatic extraction of key entities from invoice images, with the associated XML files providing the extracted values.
Do XML files need to be cleaned before use?
Yes. Some typographical errors are present in the XML files, so it is advisable to correct them for better results.
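A simple cleaning pass might map letters that are commonly confused with digits back to digits, but only in fields expected to be numeric, such as telephone numbers. The mapping below is an assumption for illustration: the dataset description mentions only the '0'/'O' confusion, and the other pairs are typical OCR confusions that should be verified against the data.

```python
# Assumed confusion pairs; only 'O' -> '0' is mentioned in the dataset description.
CONFUSABLE_TO_DIGIT = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)

def normalize_numeric_field(value: str) -> str:
    """Trim whitespace and replace look-alike letters with digits.
    Intended only for fields that should contain digits and separators."""
    return value.strip().translate(CONFUSABLE_TO_DIGIT)

cleaned = normalize_numeric_field(" O1 23 45 67 B9 ")
```

Applying this to free-text fields such as company names or addresses would corrupt them, so the field type should be checked first.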
Does the dataset only contain images or also annotations?
It contains both invoice images and their structured XML files that serve as annotations.
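To work with images and annotations together, the files first need to be paired. The sketch below assumes that each image and its XML file share the same base name (for example `0001.jpg` and `0001.xml`); that naming convention is an assumption and is not stated in the dataset description.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def pair_by_stem(paths):
    """Return (image, xml) path pairs that share a base name, sorted by name.
    Files without a counterpart are skipped."""
    images, annotations = {}, {}
    for p in map(Path, paths):
        suffix = p.suffix.lower()
        if suffix in IMAGE_SUFFIXES:
            images[p.stem] = p
        elif suffix == ".xml":
            annotations[p.stem] = p
    shared = sorted(images.keys() & annotations.keys())
    return [(images[stem], annotations[stem]) for stem in shared]

# Hypothetical file names; in practice, pass e.g. Path(data_dir).iterdir().
pairs = pair_by_stem(["0001.jpg", "0001.xml", "0002.png", "0003.xml"])
```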