
Text Anonymization Benchmark

Structured corpus of European legal decisions annotated for anonymization: named entities, identifiers, sensitive attributes.

Download dataset
Size

1,268 English documents in annotated JSON format

Licence

MIT

Description

The Text Anonymization Benchmark (TAB) brings together 1,268 English-language judgments from the European Court of Human Rights, carefully annotated for studying and modeling automatic document anonymization. Each file contains the original text and its named-entity mentions (persons, places, etc.), together with their semantic category, confidential status, and co-reference relationships. The JSON standoff format allows fine-grained reuse in NLP pipelines.
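As an illustration, standoff annotations can be resolved back to their surface strings by slicing the document text with the character offsets. The field names below (start_offset, end_offset, entity_type, confidential_status) and the sample document are assumptions for illustration only; check the released files for the exact schema.

```python
# Hypothetical TAB-style document: text plus standoff annotations.
# Field names are assumptions; verify them against the actual release.
doc = {
    "text": "The applicant, John Smith, lives in Oslo.",
    "annotations": [
        {"start_offset": 15, "end_offset": 25,
         "entity_type": "PERSON", "confidential_status": True},
        {"start_offset": 36, "end_offset": 40,
         "entity_type": "LOC", "confidential_status": False},
    ],
}

def extract_mentions(doc):
    """Resolve each standoff annotation to its surface string."""
    text = doc["text"]
    return [
        (m["entity_type"], text[m["start_offset"]:m["end_offset"]])
        for m in doc["annotations"]
    ]

mentions = extract_mentions(doc)
```

Because the annotations are standoff (offsets into the untouched text) rather than inline markup, the same text can carry several independent annotation layers.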

What is this dataset for?

  • Train models for the automatic anonymization of legal or sensitive texts
  • Study biases related to personal and confidential information in documents
  • Test named entity detection and masking (NER) systems
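For NER training or evaluation, the character-offset annotations are typically converted into token-level BIO tags first. A minimal sketch, assuming whitespace tokenization and a hypothetical (start, end, label) layout for the entities:

```python
# Convert character-offset entity spans into token-level BIO tags.
# Tokenization and the entity tuple layout are illustrative assumptions.
text = "The applicant, John Smith, lives in Oslo."
entities = [(15, 25, "PERSON"), (36, 40, "LOC")]

def bio_tags(text, entities):
    """Whitespace-tokenize and tag each token as B-/I-<label> or O."""
    tagged = []
    pos = 0
    for raw in text.split():
        start = text.index(raw, pos)  # character offset of this token
        pos = start + len(raw)
        tag = "O"
        for (s, e, label) in entities:
            if s <= start < e:
                # B- if the token opens the span, I- if it continues it.
                tag = ("B-" if start == s else "I-") + label
        tagged.append((raw, tag))
    return tagged

tags = bio_tags(text, entities)
```

A real pipeline would use the model's own tokenizer instead of `str.split`, but the offset arithmetic stays the same.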

Can it be enriched or improved?

Yes. Other languages or jurisdictions can be added for better geographic coverage. The annotations can be enriched with legal typologies or additional metadata (type of decision, duration, etc.). The corpus can also be combined with other datasets to increase the diversity of cases.

🔎 In summary

  • 🧩 Ease of use: ⭐⭐⭐⭐✩ (clear and documented JSON format)
  • 🧼 Need for cleaning: ⭐⭐⭐⭐⭐ (low: ready-to-use data)
  • 🏷️ Annotation richness: ⭐⭐⭐⭐⭐ (very detailed: identifiers, categories, coreferences)
  • 📜 Commercial licence: ✅ Yes (MIT)
  • 👨‍💻 Beginner friendly: ⚠️ Accessible with NLP basics
  • 🔁 Fine-tuning ready: 🎯 Yes, ideal for NER, anonymization, classification
  • 🌍 Cultural diversity: ⚠️ Limited to Europe and English

🧠 Recommended for

  • Digital law researchers
  • NLP anonymization projects
  • Legal labelling

🔧 Compatible tools

  • spaCy
  • Hugging Face Transformers
  • Prodigy
  • Doccano

💡 Tip

To detect biases, compare the annotations of multiple annotators using the annotator_id field.

Frequently Asked Questions

Is this dataset suitable for areas other than law?

Yes. Although it comes from the legal field, the format and annotations make it relevant for anonymization in other sensitive domains such as health or education.

Can a NER model be trained only with this corpus?

Yes, it contains enough annotated examples to bootstrap or fine-tune a named entity recognition model.

Is the corpus multilingual?

No, it is only in English. However, it can be translated or extended to cover other languages.
