
CoNLL-2003

The CoNLL-2003 dataset is a standard benchmark in natural language processing for named entity recognition (NER). It was introduced for the CoNLL-2003 shared task and contains texts annotated with four entity types: persons, organizations, locations, and miscellaneous names.

Size

Several hundred thousand annotated tokens (roughly 300,000 in the English portion), in CoNLL column format with BIO tags

Licence

Academic use under specific license. Verification required for commercial uses

Description


The CoNLL-2003 dataset includes:

  • Journalistic texts taken from Reuters RCV1
  • Several hundred thousand manually annotated tokens
  • A standardized BIO (Begin, Inside, Outside) format for NER
  • Named entities classified into 4 categories: PER (persons), LOC (locations), ORG (organizations), MISC (miscellaneous names)
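
To make the format concrete, here is a minimal sketch of reading CoNLL-style column data and turning BIO tags into entity spans. It assumes the common layout of one token per line with the NER tag in the last column, blank lines between sentences, and `-DOCSTART-` lines between documents. The sample text and the function names `read_conll` and `bio_to_spans` are illustrative assumptions, not part of the dataset distribution, and some releases use a slightly different tag variant (IOB1) that this sketch handles only leniently.

```python
from typing import List, Tuple

# Tiny hand-written sample in the assumed layout: token, POS, chunk, NER tag.
SAMPLE = """\
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def read_conll(text: str) -> List[List[Tuple[str, str]]]:
    """Return sentences as lists of (token, ner_tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags to (label, start, end_exclusive) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes an open span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((label, start, i))
            start, label = None, None
        # A stray "I-" with no open span is treated as a span start (lenient).
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


sentences = read_conll(SAMPLE)
for sentence in sentences:
    tokens = [tok for tok, _ in sentence]
    for label, start, end in bio_to_spans([tag for _, tag in sentence]):
        print(label, " ".join(tokens[start:end]))
```

Working with spans rather than per-token tags is usually more convenient downstream, since an entity such as a two-word person name is then handled as a single unit.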

What is this dataset for?


CoNLL-2003 is used for:

  • Training named entity recognition (NER) models
  • Comparative evaluation of new NLP architectures
  • Automatic extraction of information from structured or unstructured documents
  • Improving search engines, monitoring systems, or voice assistants
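
Comparative evaluation on this benchmark is conventionally reported as entity-level precision, recall, and F1, where a predicted entity counts as correct only if both its boundaries and its type match the gold annotation exactly. The following is a minimal sketch of that computation over BIO tag sequences. The function names `extract_spans` and `entity_f1` and the toy data are assumptions for illustration; published results should rely on an established scorer rather than this sketch.

```python
from typing import List, Set, Tuple

Span = Tuple[int, str, int, int]  # (sentence index, label, start, end_exclusive)


def extract_spans(sentences: List[List[str]]) -> Set[Span]:
    """Collect every entity span from a list of BIO tag sequences."""
    spans = set()
    for s_idx, tags in enumerate(sentences):
        start, label = None, None
        for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes an open span
            inside = tag.startswith("I-") and label == tag[2:]
            if start is not None and not inside:
                spans.add((s_idx, label, start, i))
                start, label = None, None
            if tag.startswith("B-") or (tag.startswith("I-") and start is None):
                start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall, and F1."""
    gold_spans = extract_spans(gold)
    pred_spans = extract_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the person span matches, the last entity has the wrong type.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]
print(entity_f1(gold, pred))
```

Because matching is exact, a prediction with the right boundaries but the wrong type counts as both a false positive and a false negative, which is why entity-level scores are typically lower than token-level accuracy.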

Can it be enriched or improved?


Yes, this corpus can be enriched in various ways:

  • Adaptation to other languages or specific fields (legal, medical, etc.)
  • Extending the annotation schema with new entity classes
  • Adding relations between entities for entity linking or coreference resolution tasks
  • Integration into comprehensive NLP pipelines including classification, parsing, or summarization

🔗 Source: CoNLL-2003 Dataset

Frequently Asked Questions

Why is CoNLL-2003 used so much for NER?

Because it offers a standardized, reproducible and well-annotated benchmark, which makes it a reference for comparing the performance of models on a fundamental NLP task.

Does the dataset cover multiple languages?

Yes, it includes data in English and German. For other languages, datasets such as WikiANN or MasakhaNER can be used.

Can CoNLL-2003 be adapted to business use cases?

Yes, by adjusting entity classes or combining this dataset with internal corpora, it can be used to train specialized NER models in specific business contexts.
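
As a rough illustration of adjusting entity classes, the sketch below remaps CoNLL-2003 BIO tags onto a different label set before merging with an internal corpus. The target labels in `LABEL_MAP` (`PERSON`, `COMPANY`, `LOCATION`) and the function name `remap_tags` are hypothetical; a real project would choose its own schema and decide case by case what to do with classes that have no counterpart.

```python
from typing import Dict, List, Optional

# Hypothetical target schema: None means "drop this class" (tag becomes O).
LABEL_MAP: Dict[str, Optional[str]] = {
    "PER": "PERSON",
    "ORG": "COMPANY",
    "LOC": "LOCATION",
    "MISC": None,
}


def remap_tags(tags: List[str], label_map: Dict[str, Optional[str]]) -> List[str]:
    """Rewrite BIO tags according to label_map, keeping the B-/I- prefixes."""
    remapped = []
    for tag in tags:
        if tag == "O":
            remapped.append("O")
            continue
        prefix, label = tag.split("-", 1)
        target = label_map.get(label)  # labels missing from the map are dropped too
        remapped.append(f"{prefix}-{target}" if target else "O")
    return remapped


original = ["B-ORG", "O", "B-MISC", "I-MISC", "B-PER", "I-PER"]
print(remap_tags(original, LABEL_MAP))
```

Dropping a class by mapping it to `O` is the simplest choice, but it teaches the model that those tokens are not entities, so it is worth checking whether that matches the intended behavior in the business context.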
