WikiText-103 Dataset
WikiText-103 is a large text corpus extracted from Wikipedia and designed for training and evaluating language models. It stands out for the linguistic quality of its texts, which retain natural grammatical structure, unlike many datasets that contain noisy or unstructured content.
Over 100 million words in TXT format
Free for academic use; a license audit is recommended for commercial projects
Description
The WikiText-103 dataset includes:
- 28,475 Wikipedia articles
- Over 100 million words in English
- Complete, untruncated and low-noise texts
- Raw text (TXT) format, suitable for training autoregressive or bidirectional models (see the loading sketch below)
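As an illustration, here is a minimal sketch of loading the corpus with the Hugging Face datasets library. The configuration name "wikitext-103-raw-v1" is the one commonly published on the Hub; verify it against the hub entry you actually use.

```python
# Minimal sketch: load the raw WikiText-103 variant with Hugging Face `datasets`.
from datasets import load_dataset

# Configuration name assumed from the commonly published Hub entry.
dataset = load_dataset("wikitext", "wikitext-103-raw-v1")

# The corpus ships with train / validation / test splits of raw text lines.
print(dataset)
print(dataset["train"][10]["text"][:200])  # peek at one (possibly empty) line
```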
What is this dataset for?
WikiText-103 is used for:
- Training text generation models and LLMs (e.g., GPT, Transformer-XL)
- Evaluating models on pure language modeling tasks, typically measured by perplexity (see the sketch after this list)
- Fine-tuning models for sequence completion or prediction
- Studying syntactic structures and contextual coherence in NLP
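To make the evaluation use case concrete, below is a small sketch that scores a short WikiText-style passage with a pretrained GPT-2 model and reports perplexity. GPT-2 and the sample sentence are illustrative choices standing in for a held-out WikiText-103 line; a full evaluation would iterate over the test split, usually with a sliding window.

```python
# Sketch: compute perplexity of one short passage with GPT-2 (illustrative model).
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Placeholder passage; in practice this would be a line from the test split.
text = "The museum was founded in 1852 and houses a large collection of manuscripts."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing the input ids as labels makes the model return the mean
    # cross-entropy loss over the predicted tokens.
    loss = model(**enc, labels=enc["input_ids"]).loss

print(f"Perplexity: {math.exp(loss.item()):.2f}")
```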
Can it be enriched or improved?
Yes, the dataset can be enriched in several ways:
- By combining it with other specialized corpora for multilingual or domain-specific tasks
- By further cleaning the data or removing possible duplicates (a minimal deduplication sketch follows this list)
- By structuring the corpus for finer semantic or syntactic annotation
- By adding metadata or links to the mentioned entities for NER or entity linking tasks
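As a concrete example of the cleaning and deduplication idea above, here is a minimal sketch that drops exact duplicate paragraphs from a raw WikiText-style text file. The file name and the whitespace normalization are assumptions for illustration; adjust them to your local copy of the corpus.

```python
# Sketch: remove exact duplicate paragraphs from a raw WikiText-style dump.
import hashlib

def dedupe_paragraphs(in_path: str, out_path: str) -> None:
    seen: set[str] = set()
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for paragraph in src.read().split("\n\n"):
            # Hash a whitespace-normalized version so trivial spacing
            # differences do not hide exact duplicates.
            key = hashlib.sha1(" ".join(paragraph.split()).encode("utf-8")).hexdigest()
            if paragraph.strip() and key not in seen:
                seen.add(key)
                dst.write(paragraph.rstrip() + "\n\n")

# File name assumed to match the standard WikiText-103 distribution.
dedupe_paragraphs("wiki.train.tokens", "wiki.train.dedup.tokens")
```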
🔗 Source: WikiText Dataset
Frequently Asked Questions
What is the difference between WikiText-2 and WikiText-103?
WikiText-103 is a much larger and more comprehensive version of WikiText-2. It contains over 100 million words, compared to around 2 million for WikiText-2, which makes it possible to train deeper and more capable models.
Can WikiText-103 be used to train multilingual models?
No, WikiText-103 is only in English. For multilingual approaches, it is preferable to use datasets such as CC100, OSCAR or mC4.
Why use WikiText-103 instead of raw snippets from Wikipedia?
WikiText-103 was carefully curated to exclude articles that are too short, noisy, or uninformative. It preserves document structure and paragraph consistency, making it much more reliable for training high-quality language models.