CoNLL-2003

The CoNLL-2003 dataset is a standard benchmark in natural language processing for the Named Entity Recognition (NER) task. It was introduced for the shared task of the CoNLL-2003 conference and contains news text annotated with four entity types: persons, organizations, locations, and miscellaneous names.

Size

Several hundred thousand annotated tokens (roughly 300,000 in the English portion), in the CoNLL column format with BIO tags

Licence

Academic use under a specific license. Check the license terms before any commercial use

Description

The CoNLL-2003 dataset includes:

News articles taken from the Reuters RCV1 corpus (English portion)
Several hundred thousand manually annotated tokens
A standardized BIO (Beginning, Inside, Outside) tagging format for NER
Named entities classified into 4 categories: PER (persons), LOC (locations), ORG (organizations), MISC (miscellaneous names)
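
To make the BIO format concrete, here is a minimal sketch of reading CoNLL-style text and grouping tagged tokens into entities. It assumes the usual four-column layout (token, part-of-speech tag, chunk tag, NER tag), blank lines between sentences, and `-DOCSTART-` lines between documents. The sample text and the function names `read_sentences` and `extract_entities` are illustrative, not part of the dataset or of any library.

```python
from typing import List, Tuple

# Illustrative sample in the style of the dataset (not copied from a file).
SAMPLE = """\
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

Sentence = List[Tuple[str, str]]  # list of (token, NER tag) pairs


def read_sentences(text: str) -> List[Sentence]:
    """Split CoNLL-style text into sentences of (token, NER tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # First column is the token, last column is the NER tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in sentence:
        prefix, _, tag_label = tag.partition("-")
        # An I- tag with the same label continues the open entity.
        if prefix == "I" and tag_label == label:
            tokens.append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if tokens:
            entities.append((" ".join(tokens), label))
            tokens, label = [], None
        # A B- tag (or an I- tag with no matching open entity) starts a new one.
        if prefix in ("B", "I"):
            tokens, label = [token], tag_label
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


for sentence in read_sentences(SAMPLE):
    print(extract_entities(sentence))
```

Treating an unmatched I- tag as the start of a new entity is a deliberate leniency, so the same function also tolerates files where entities do not always open with a B- tag.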


What is this dataset for?

CoNLL-2003 is used for:

Training named entity recognition (NER) models
Benchmarking new NLP architectures against published results
Extracting information automatically from structured or unstructured documents
Improving search engines, monitoring systems, or voice assistants
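
Comparative evaluation on this benchmark is usually reported as entity-level precision, recall, and F1, where a predicted entity counts as correct only if both its boundaries and its label match the gold annotation. The following is a minimal sketch of that metric over BIO tag sequences; the function names `spans` and `entity_f1` are illustrative, and an established scoring tool is preferable for reported results.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, label)


def spans(tags: Iterable[str]) -> Set[Span]:
    """Return the set of entity spans in one BIO tag sequence."""
    found: Set[Span] = set()
    start, label = None, None
    # The trailing "O" is a sentinel that closes a span ending the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "I" and tag_label == label:
            continue  # still inside the open span
        if label is not None:
            found.add((start, i, label))
            start, label = None, None
        if prefix in ("B", "I"):
            start, label = i, tag_label
    return found


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall, and F1 over a corpus."""
    true_pos = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans, pred_spans = spans(gold_tags), spans(pred_tags)
        true_pos += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]
print(entity_f1(gold, pred))
```

In this toy example the person span matches exactly while the last entity has the right boundaries but the wrong label, so one of two entities is correct and precision, recall, and F1 are all 0.5.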


Can it be enriched or improved?

Yes, this corpus can be enriched in various ways:

Adapting it to other languages or specific domains (legal, medical, etc.)
Extending the annotation schema with new entity classes
Adding relations between entities for entity linking or coreference resolution tasks
Integrating it into complete NLP pipelines that include classification, parsing, or summarization


🔗 Source: CoNLL-2003 Dataset


Frequently Asked Questions

Why is CoNLL-2003 used so much for NER?

Because it offers a standardized, reproducible and well-annotated benchmark, which makes it a reference for comparing the performance of models on a fundamental NLP task.

Does the dataset cover multiple languages?

Yes, it includes data in English and German. For other languages, alternative datasets such as WikiANN or MasakhaNER can be used.


Can CoNLL-2003 be adapted to business use cases?

Yes, by adjusting entity classes or combining this dataset with internal corpora, it can be used to train specialized NER models in specific business contexts.
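
As a rough illustration of adjusting entity classes, the sketch below rewrites BIO tags according to a label mapping, so that CoNLL-2003 annotations can be aligned with an internal schema before being merged with in-house data. The target labels (`PERSON`, `COMPANY`, `LOCATION`) and the function name `remap_tags` are hypothetical; a real mapping depends on the business schema.

```python
from typing import Dict, List, Optional

# Hypothetical mapping from CoNLL-2003 classes to an internal schema.
# A value of None means the class is dropped (its tokens become "O").
LABEL_MAP: Dict[str, Optional[str]] = {
    "PER": "PERSON",
    "ORG": "COMPANY",
    "LOC": "LOCATION",
    "MISC": None,
}


def remap_tags(tags: List[str], label_map: Dict[str, Optional[str]]) -> List[str]:
    """Rewrite BIO tags with new labels, dropping unmapped classes."""
    remapped = []
    for tag in tags:
        prefix, _, label = tag.partition("-")
        new_label = label_map.get(label) if prefix in ("B", "I") else None
        if new_label is None:
            remapped.append("O")  # outside tag, or a class we do not keep
        else:
            remapped.append(f"{prefix}-{new_label}")  # keep the B-/I- prefix
    return remapped


print(remap_tags(["B-ORG", "I-ORG", "O", "B-MISC", "I-MISC", "B-PER"], LABEL_MAP))
```

Because every token of a dropped class is mapped to "O", whole entities disappear together and the remaining tags stay valid BIO sequences.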

Similar datasets

Gutenberg Dataset

MNIST

ImageNet