Text Anonymization Benchmark
Structured corpus of European legal decisions annotated for anonymization: named entities, identifiers, sensitive attributes.
Description
The Text Anonymization Benchmark (TAB) dataset brings together 1,268 English-language judgments from the European Court of Human Rights, carefully annotated for studying and modeling automatic document anonymization. Each file contains the original text, named-entity mentions (persons, places, etc.) with their semantic category, confidentiality status, and co-reference relationships. The JSON standoff format allows fine-grained reuse in NLP pipelines.
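As a minimal sketch of working with such a standoff format, the snippet below loads a JSON file and masks annotated spans. The field names (`start_offset`, `end_offset`, `entity_type`) are illustrative assumptions; adjust them to the actual TAB schema.

```python
import json

def load_documents(path):
    """Load a standoff-style JSON file into Python objects.
    (Field names below are assumptions, not the guaranteed TAB schema.)"""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def mask_entities(text, mentions, placeholder="[MASKED]"):
    """Replace annotated character spans with a placeholder.
    Works right-to-left so earlier offsets remain valid after each edit."""
    for m in sorted(mentions, key=lambda m: m["start_offset"], reverse=True):
        text = text[:m["start_offset"]] + placeholder + text[m["end_offset"]:]
    return text

# Toy example with hypothetical annotations
doc_text = "Mr John Doe appealed to the Court in Strasbourg."
mentions = [
    {"start_offset": 3, "end_offset": 11, "entity_type": "PERSON"},
    {"start_offset": 37, "end_offset": 47, "entity_type": "LOC"},
]
print(mask_entities(doc_text, mentions))
# → Mr [MASKED] appealed to the Court in [MASKED].
```

Masking right-to-left is the standard trick for applying character-offset edits without recomputing the remaining offsets.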
What is this dataset for?
- Train models for the automatic anonymization of legal or sensitive texts
- Study biases related to personal and confidential information in documents
- Test named entity recognition (NER) and masking systems
Can it be enriched or improved?
Yes. Other languages or jurisdictions can be added for better geographic coverage. Annotations can be enriched with legal typologies or additional metadata (types of decisions, duration, etc.). The corpus can also be combined with other datasets to increase the diversity of cases.
🔎 In summary
🧠 Recommended for
- Digital law researchers
- NLP anonymization projects
- Legal labelling
🔧 Compatible tools
- spaCy
- Hugging Face Transformers
- Prodigy
- Doccano
💡 Tip
To detect annotation biases, compare the annotations produced by different annotators using the annotator_id field.
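A minimal sketch of such a comparison, assuming each annotation record carries an `annotator_id` and character offsets (hypothetical field names): group spans per annotator, then compute an exact-span F1 between two annotators.

```python
def spans_by_annotator(annotations):
    """Group (start, end) spans per annotator_id.
    Field names are assumptions about the annotation records."""
    groups = {}
    for a in annotations:
        span = (a["start_offset"], a["end_offset"])
        groups.setdefault(a["annotator_id"], set()).add(span)
    return groups

def pairwise_f1(spans_a, spans_b):
    """Exact-span-match F1 between two annotators' span sets."""
    if not spans_a and not spans_b:
        return 1.0
    overlap = len(spans_a & spans_b)
    precision = overlap / len(spans_a) if spans_a else 0.0
    recall = overlap / len(spans_b) if spans_b else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy example: a1 marked one span, a2 marked two
anns = [
    {"annotator_id": "a1", "start_offset": 3, "end_offset": 11},
    {"annotator_id": "a2", "start_offset": 3, "end_offset": 11},
    {"annotator_id": "a2", "start_offset": 37, "end_offset": 47},
]
groups = spans_by_annotator(anns)
print(round(pairwise_f1(groups["a1"], groups["a2"]), 2))
# → 0.67
```

Low pairwise F1 between annotators flags documents where the notion of "identifying information" was interpreted differently, which is a common source of bias in anonymization corpora.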
Frequently Asked Questions
Is this dataset suitable for areas other than law?
Yes. Although it comes from the legal field, the format and annotations make it relevant for anonymization in other sensitive domains such as health or education.
Can a NER model be trained only with this corpus?
Yes, it contains enough annotated examples to bootstrap or fine-tune a named entity recognition model.
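Training a NER model on standoff annotations typically starts by converting character offsets into token-level BIO tags. Below is a hedged sketch using whitespace tokenization; the field names are illustrative, and a real pipeline would use the tokenizer of the target model.

```python
def to_bio(text, mentions):
    """Convert character-offset mentions into token-level BIO tags.
    Uses naive whitespace tokenization for illustration only."""
    tokens, tags = [], []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)  # locate token in the original text
        end = start + len(tok)
        pos = end
        tag = "O"
        for m in mentions:
            # Token fully inside the mention span?
            if start >= m["start_offset"] and end <= m["end_offset"]:
                prefix = "B" if start == m["start_offset"] else "I"
                tag = f"{prefix}-{m['entity_type']}"
                break
        tokens.append(tok)
        tags.append(tag)
    return tokens, tags

# Toy example with a hypothetical PERSON mention
text = "Mr John Doe appealed"
mentions = [{"start_offset": 3, "end_offset": 11, "entity_type": "PERSON"}]
print(to_bio(text, mentions)[1])
# → ['O', 'B-PERSON', 'I-PERSON', 'O']
```

The resulting token/tag pairs can then be fed to standard sequence-labelling trainers such as those in spaCy or Hugging Face Transformers.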
Is the corpus multilingual?
No, it is only in English. However, it can be translated or extended to other languages.