CoNLL-2003
The CoNLL-2003 dataset is a reference benchmark in natural language processing for the Named Entity Recognition (NER) task. It was introduced for the CoNLL-2003 shared task and contains news text annotated with entities such as persons, organizations, locations, and miscellaneous names.
Several hundred thousand annotated tokens, in BIO format (CoNLL)
Academic use under a specific license. Verification required for commercial use
Description
The CoNLL-2003 dataset includes:
- News articles taken from the Reuters corpus (RCV1)
- Several hundred thousand manually annotated tokens
- A standardized BIO (Begin, Inside, Outside) format for NER
- Named entities classified into 4 categories: PER (persons), LOC (locations), ORG (organizations), MISC (miscellaneous)
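To make the format concrete, here is a minimal Python sketch of reading CoNLL-style data. It assumes the usual four-column layout (word, part-of-speech tag, chunk tag, NER tag), blank lines between sentences, and `-DOCSTART-` lines between documents. The `SAMPLE` text and the `read_conll` helper are illustrative, not part of any official tooling, and the exact tag variant (IOB1 or IOB2) may differ between distributions of the corpus.

```python
from typing import List, Tuple

# Illustrative sample in the four-column layout: word, POS, chunk, NER tag.
SAMPLE = """\
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

Sentence = List[Tuple[str, str]]  # list of (token, NER tag) pairs


def read_conll(text: str) -> List[Sentence]:
    """Split CoNLL-style text into sentences of (token, NER tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers both close the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        # First column is the token, last column is the NER tag.
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences


sentences = read_conll(SAMPLE)
for sentence in sentences:
    print(sentence)
```

Only the first and last columns are kept here, which is enough for NER; the POS and chunk columns can be retained the same way if a model needs them.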
What is this dataset for?
CoNLL-2003 is used for:
- Training named entity recognition (NER) models
- Comparative evaluation of new NLP architectures
- Automatic extraction of information from structured or unstructured documents
- Improving search engines, monitoring systems, or voice assistants
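Comparative evaluation on this corpus is usually reported as entity-level precision, recall, and F1, where a predicted entity counts only if both its boundaries and its type match the gold annotation. The sketch below illustrates that idea for a single tag sequence. It is a simplified illustration, not the official evaluation script, and the helper names `extract_spans` and `entity_f1` are made up for this example.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def extract_spans(tags: Iterable[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            if etype is not None:
                spans.add((start, i, etype))
                start, etype = None, None
            # A stray "I-" tag without a preceding "B-" is treated as a start.
            if tag.startswith(("B-", "I-")):
                start, etype = i, tag[2:]
    return spans


def entity_f1(gold: Iterable[str], pred: Iterable[str]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall, and F1 for one sequence."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-ORG", "O", "B-MISC", "O", "O", "B-PER", "I-PER"]
pred = ["B-ORG", "O", "O", "O", "O", "B-PER", "O"]
precision, recall, f1 = entity_f1(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this example the truncated PER entity does not count as correct, because partial matches earn no credit. For a full corpus, spans would be extracted per sentence and the counts summed before computing the scores.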
Can it be enriched or improved?
Yes, this corpus can be enriched in various ways:
- Adaptation to other languages or specific fields (legal, medical, etc.)
- Extending the annotation schema with new entity classes
- Adding relations between entities for entity linking or coreference resolution tasks
- Integration into comprehensive NLP pipelines including classification, parsing, or summarization
🔗 Source: CoNLL-2003 Dataset
Frequently Asked Questions
Why is CoNLL-2003 used so much for NER?
Because it offers a standardized, reproducible and well-annotated benchmark, which makes it a reference for comparing the performance of models on a fundamental NLP task.
Does the dataset cover multiple languages?
Yes, it includes data in English and German. For other languages, datasets such as WikiANN or MasakhaNER can be used.
Can CoNLL-2003 be adapted to business use cases?
Yes, by adjusting entity classes or combining this dataset with internal corpora, it can be used to train specialized NER models in specific business contexts.
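As a rough illustration of adjusting entity classes, the sketch below renames or drops CoNLL-2003 types in a BIO tag sequence before the data is merged with an internal corpus. The target labels in `TAG_MAP` (`PERSON`, `COMPANY`, `LOCATION`) and the `remap_tags` helper are hypothetical and would depend on the business schema in question.

```python
from typing import Dict, List, Optional

# Hypothetical mapping from CoNLL-2003 types to a business schema.
# A value of None means the class is dropped and its tokens become "O".
TAG_MAP: Dict[str, Optional[str]] = {
    "PER": "PERSON",
    "ORG": "COMPANY",
    "LOC": "LOCATION",
    "MISC": None,
}


def remap_tags(tags: List[str], mapping: Dict[str, Optional[str]]) -> List[str]:
    """Rename or drop entity types while keeping the B-/I- prefixes."""
    remapped = []
    for tag in tags:
        if tag == "O":
            remapped.append("O")
            continue
        prefix, _, etype = tag.partition("-")
        new_type = mapping.get(etype)
        # Unknown or dropped types fall back to the "outside" tag.
        remapped.append("O" if new_type is None else f"{prefix}-{new_type}")
    return remapped


original = ["B-ORG", "O", "B-MISC", "I-MISC", "B-PER", "I-PER"]
converted = remap_tags(original, TAG_MAP)
print(converted)
```

Because a dropped class is removed for every token of the entity, the resulting sequence stays a valid BIO sequence.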