Text Anonymization Benchmark
Structured corpus of European legal decisions annotated for anonymization: named entities, identifiers, sensitive attributes.
Description
The Text Anonymization Benchmark (TAB) dataset brings together 1,268 English-language judgments from the European Court of Human Rights, carefully annotated for studying and modeling automatic document anonymization. Each file contains the original text, named-entity mentions (persons, places, etc.) with their semantic category, confidentiality status, and co-reference relationships. The JSON standoff format allows fine-grained reuse in NLP pipelines.
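As a minimal sketch of working with such a standoff format, the snippet below loads a JSON file and masks annotated spans. The field names (`start_offset`, `end_offset`, `entity_type`) are illustrative assumptions; adjust them to the actual TAB schema.

```python
import json

def load_documents(path):
    """Load a standoff-style JSON file into Python objects.
    (Field names below are assumptions, not the guaranteed TAB schema.)"""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def mask_entities(text, mentions, placeholder="[MASKED]"):
    """Replace annotated character spans with a placeholder.
    Works right-to-left so earlier offsets remain valid after each edit."""
    for m in sorted(mentions, key=lambda m: m["start_offset"], reverse=True):
        text = text[:m["start_offset"]] + placeholder + text[m["end_offset"]:]
    return text

# Toy example with hypothetical annotations
doc_text = "Mr John Doe appealed to the Court in Strasbourg."
mentions = [
    {"start_offset": 3, "end_offset": 11, "entity_type": "PERSON"},
    {"start_offset": 37, "end_offset": 47, "entity_type": "LOC"},
]
print(mask_entities(doc_text, mentions))
# → Mr [MASKED] appealed to the Court in [MASKED].
```

Masking right-to-left is the standard trick for applying character-offset edits without recomputing the remaining offsets.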
What is this dataset for?
- Train models for the automatic anonymization of legal or sensitive texts
- Study biases related to personal and confidential information in documents
- Test named entity recognition (NER) and masking systems
Can it be enriched or improved?
Yes. Other languages or jurisdictions can be added for better geographic coverage. Annotations can be enriched with legal typologies or additional metadata (types of decisions, duration, etc.). The corpus can also be combined with other datasets to increase the diversity of cases.
🔎 In summary
🧠 Recommended for
- Digital law researchers
- NLP anonymization projects
- Legal labelling
🔧 Compatible tools
- spaCy
- Hugging Face Transformers
- Prodigy
- Doccano
💡 Tip
To detect annotation biases, compare the annotations produced by different annotators using the annotator_id field.
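A minimal sketch of such a comparison, assuming each annotation record carries an `annotator_id` and character offsets (hypothetical field names): group spans per annotator, then compute an exact-span F1 between two annotators.

```python
def spans_by_annotator(annotations):
    """Group (start, end) spans per annotator_id.
    Field names are assumptions about the annotation records."""
    groups = {}
    for a in annotations:
        span = (a["start_offset"], a["end_offset"])
        groups.setdefault(a["annotator_id"], set()).add(span)
    return groups

def pairwise_f1(spans_a, spans_b):
    """Exact-span-match F1 between two annotators' span sets."""
    if not spans_a and not spans_b:
        return 1.0
    overlap = len(spans_a & spans_b)
    precision = overlap / len(spans_a) if spans_a else 0.0
    recall = overlap / len(spans_b) if spans_b else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy example: a1 marked one span, a2 marked two
anns = [
    {"annotator_id": "a1", "start_offset": 3, "end_offset": 11},
    {"annotator_id": "a2", "start_offset": 3, "end_offset": 11},
    {"annotator_id": "a2", "start_offset": 37, "end_offset": 47},
]
groups = spans_by_annotator(anns)
print(round(pairwise_f1(groups["a1"], groups["a2"]), 2))
# → 0.67
```

Low pairwise F1 between annotators flags documents where the notion of "identifying information" was interpreted differently, which is a common source of bias in anonymization corpora.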
Frequently Asked Questions
Is this dataset suitable for areas other than law?
Yes. Although it comes from the legal field, the format and annotations make it relevant for anonymization in other sensitive domains such as health or education.
Can a NER model be trained only with this corpus?
Yes, it contains enough annotated examples to bootstrap or fine-tune a named entity recognition model.
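Training a NER model on standoff annotations typically starts by converting character offsets into token-level BIO tags. Below is a hedged sketch using whitespace tokenization; the field names are illustrative, and a real pipeline would use the tokenizer of the target model.

```python
def to_bio(text, mentions):
    """Convert character-offset mentions into token-level BIO tags.
    Uses naive whitespace tokenization for illustration only."""
    tokens, tags = [], []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)  # locate token in the original text
        end = start + len(tok)
        pos = end
        tag = "O"
        for m in mentions:
            # Token fully inside the mention span?
            if start >= m["start_offset"] and end <= m["end_offset"]:
                prefix = "B" if start == m["start_offset"] else "I"
                tag = f"{prefix}-{m['entity_type']}"
                break
        tokens.append(tok)
        tags.append(tag)
    return tokens, tags

# Toy example with a hypothetical PERSON mention
text = "Mr John Doe appealed"
mentions = [{"start_offset": 3, "end_offset": 11, "entity_type": "PERSON"}]
print(to_bio(text, mentions)[1])
# → ['O', 'B-PERSON', 'I-PERSON', 'O']
```

The resulting token/tag pairs can then be fed to standard sequence-labelling trainers such as those in spaCy or Hugging Face Transformers.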
Is the corpus multilingual?
No, it is only in English. However, it can be translated or extended to other languages.