
WikiText-103 Dataset

WikiText-103 is a large text dataset extracted from Wikipedia and designed for training and evaluating language models. It is distinguished by the linguistic quality of its texts, which preserve natural grammatical structure, unlike datasets built from noisy or unstructured content.

Size

Over 100 million words in TXT format

License

Free for academic use; a license review is recommended for commercial projects.

Description


The WikiText-103 dataset includes:

  • 28,475 Wikipedia articles
  • Over 100 million words in English
  • Complete, untruncated and low-noise texts
  • A raw format (TXT), suitable for training autoregressive or bidirectional models; a minimal loading sketch follows this list
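
For hands-on exploration, the corpus is mirrored on the Hugging Face Hub. Below is a minimal loading sketch, assuming the `datasets` library and the `wikitext-103-raw-v1` configuration (the raw, untokenized variant):

```python
from datasets import load_dataset

# Load the raw (untokenized) WikiText-103 variant; the three standard
# splits ship as plain text, one line per row.
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")

print(wikitext)                        # DatasetDict with train/validation/test
print(wikitext["train"][10]["text"])   # one raw line of article text
```

The companion `wikitext-103-v1` configuration applies the original tokenization, with rare words replaced by `<unk>`.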

What is this dataset for?


WikiText-103 is used for:

  • Training text-generation models / LLMs (e.g., GPT, Transformer-XL)
  • Evaluating models on pure language-modeling tasks; see the perplexity sketch after this list
  • Fine-tuning models for sequence completion or prediction
  • Studying syntactic structures and contextual coherence in NLP
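
As an illustration of the evaluation use case, here is a sketch of sliding-window perplexity on the WikiText-103 test split, following the common Hugging Face recipe. GPT-2 is only a placeholder checkpoint, and the window and stride values are assumptions you can tune:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")     # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_len, stride = 1024, 512          # GPT-2 context size; stride is an assumption
seq_len = enc.input_ids.size(1)
nll_sum, n_tokens, prev_end = 0.0, 0, 0

for begin in range(0, seq_len, stride):
    end = min(begin + max_len, seq_len)
    trg_len = end - prev_end         # tokens newly scored in this window
    input_ids = enc.input_ids[:, begin:end]
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask the overlapping context tokens
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nll_sum += loss.item() * trg_len
    n_tokens += trg_len
    prev_end = end
    if end == seq_len:
        break

print(f"perplexity = {torch.exp(torch.tensor(nll_sum / n_tokens)).item():.2f}")
```

The stride trades accuracy for speed: a smaller stride gives each token more left context at the cost of more forward passes.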

Can it be enriched or improved?


Yes, the dataset can be enriched in several ways:

  • By combining it with other specialized corpora for multilingual or sectoral tasks
  • By further cleaning the data or removing duplicates; a deduplication sketch follows this list
  • By structuring the corpus for finer semantic or syntactic annotation
  • By adding metadata or links to the entities mentioned for NER or linking tasks
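
As a concrete example of the cleaning and deduplication point, here is a minimal sketch that drops exact duplicates by hashing normalized lines. The lowercase/whitespace normalization is an assumption, and near-duplicate detection (e.g., MinHash) would need more machinery:

```python
import hashlib
from datasets import load_dataset

train = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

seen, kept, total = set(), [], 0
for row in train:
    text = row["text"].strip()
    if not text:
        continue  # skip the blank separator lines in the raw dump
    total += 1
    # Hash a normalized form so trivial whitespace/case variants collapse.
    key = hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()
    if key not in seen:
        seen.add(key)
        kept.append(text)

print(f"kept {len(kept):,} of {total:,} non-empty lines")
```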

🔗 Source: WikiText Dataset

Frequently Asked Questions

What is the difference between WikiText-2 and WikiText-103?

WikiText-103 is a much larger and more comprehensive version than WikiText-2. It contains over 100 million words compared to around 2 million for WikiText-2, which makes it possible to train deeper, better-performing models.
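
To see the size gap directly, a quick sketch comparing the two training splits (assuming the Hugging Face mirrors of both corpora; the whitespace word count is approximate, and the full pass over WikiText-103 takes a while):

```python
from datasets import load_dataset

# Compare training-split word counts for both corpus sizes.
for config in ("wikitext-2-raw-v1", "wikitext-103-raw-v1"):
    train = load_dataset("wikitext", config, split="train")
    n_words = sum(len(row["text"].split()) for row in train)
    print(f"{config}: {n_words:,} whitespace-separated words")
```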

Can WikiText-103 be used to train multilingual models?

No, WikiText-103 is only in English. For multilingual approaches, it is preferable to use datasets such as CC100, OSCAR or mC4.

Why use WikiText-103 instead of raw snippets from Wikipedia?

WikiText-103 was carefully curated to exclude entries that are too short, noisy, or uninformative. It preserves document structure and paragraph coherence, making it much more reliable for training quality language models.
