WikiText-103 Dataset
WikiText-103 is a large text corpus extracted from Wikipedia and designed for training and evaluating language models. It stands out for the linguistic quality of its texts, which retain natural grammatical structure, unlike many datasets that contain noisy or unstructured content.
Over 100 million words in TXT format
Free for academic use; a license audit is recommended for commercial projects
Description
The WikiText-103 dataset includes:
- 28,475 Wikipedia articles
- Over 100 million words in English
- Complete, untruncated and low-noise texts
- Raw text (TXT) format, suitable for training autoregressive or bidirectional models (see the loading sketch below)
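As an illustration, here is a minimal sketch of loading the corpus with the Hugging Face datasets library. The configuration name "wikitext-103-raw-v1" is the one commonly published on the Hub; verify it against the hub entry you actually use.

```python
# Minimal sketch: load the raw WikiText-103 variant with Hugging Face `datasets`.
from datasets import load_dataset

# Configuration name assumed from the commonly published Hub entry.
dataset = load_dataset("wikitext", "wikitext-103-raw-v1")

# The corpus ships with train / validation / test splits of raw text lines.
print(dataset)
print(dataset["train"][10]["text"][:200])  # peek at one (possibly empty) line
```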
What is this dataset for?
WikiText-103 is used for:
- Training text generation models and LLMs (e.g., GPT, Transformer-XL)
- Evaluating models on pure language modeling tasks, typically measured by perplexity (see the sketch after this list)
- Fine-tuning models for sequence completion or prediction
- Studying syntactic structures and contextual coherence in NLP
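To make the evaluation use case concrete, below is a small sketch that scores a short WikiText-style passage with a pretrained GPT-2 model and reports perplexity. GPT-2 and the sample sentence are illustrative choices standing in for a held-out WikiText-103 line; a full evaluation would iterate over the test split, usually with a sliding window.

```python
# Sketch: compute perplexity of one short passage with GPT-2 (illustrative model).
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Placeholder passage; in practice this would be a line from the test split.
text = "The museum was founded in 1852 and houses a large collection of manuscripts."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing the input ids as labels makes the model return the mean
    # cross-entropy loss over the predicted tokens.
    loss = model(**enc, labels=enc["input_ids"]).loss

print(f"Perplexity: {math.exp(loss.item()):.2f}")
```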
Can it be enriched or improved?
Yes, the dataset can be enriched in several ways:
- By combining it with other specialized corpora for multilingual or domain-specific tasks
- By further cleaning the data or removing possible duplicates (a minimal deduplication sketch follows this list)
- By structuring the corpus for finer semantic or syntactic annotation
- By adding metadata or links to the mentioned entities for NER or entity linking tasks
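As a concrete example of the cleaning and deduplication idea above, here is a minimal sketch that drops exact duplicate paragraphs from a raw WikiText-style text file. The file name and the whitespace normalization are assumptions for illustration; adjust them to your local copy of the corpus.

```python
# Sketch: remove exact duplicate paragraphs from a raw WikiText-style dump.
import hashlib

def dedupe_paragraphs(in_path: str, out_path: str) -> None:
    seen: set[str] = set()
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for paragraph in src.read().split("\n\n"):
            # Hash a whitespace-normalized version so trivial spacing
            # differences do not hide exact duplicates.
            key = hashlib.sha1(" ".join(paragraph.split()).encode("utf-8")).hexdigest()
            if paragraph.strip() and key not in seen:
                seen.add(key)
                dst.write(paragraph.rstrip() + "\n\n")

# File name assumed to match the standard WikiText-103 distribution.
dedupe_paragraphs("wiki.train.tokens", "wiki.train.dedup.tokens")
```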
🔗 Source: WikiText Dataset
Frequently Asked Questions
What is the difference between WikiText-2 and WikiText-103?
WikiText-103 is a much larger and more comprehensive version of WikiText-2. It contains over 100 million words, compared to around 2 million for WikiText-2, which makes it possible to train deeper and more capable models.
Can WikiText-103 be used to train multilingual models?
No, WikiText-103 is only in English. For multilingual approaches, it is preferable to use datasets such as CC100, OSCAR or mC4.
Why use WikiText-103 instead of raw snippets from Wikipedia?
WikiText-103 was carefully curated to exclude articles that are too short, noisy, or uninformative. It preserves document structure and paragraph consistency, making it much more reliable for training high-quality language models.