Gutenberg Dataset
The Gutenberg dataset is a collection of literary texts from the public domain, made available by Project Gutenberg. It is a valuable resource for NLP applications that focus on written language, literature, or text generation models.
Several tens of thousands of books, in TXT and EPUB formats
Public domain, under the Project Gutenberg terms and conditions; verification required for commercial redistribution
Description
The Gutenberg dataset includes:
- Several tens of thousands of books (novels, essays, plays, poetry...)
- Open formats: TXT, EPUB, HTML
- Texts mostly in English, with many other languages also represented
- A simple structure, compatible with traditional NLP pipelines
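Before feeding these files into an NLP pipeline, the surrounding boilerplate usually has to be removed. A minimal sketch, assuming the common Project Gutenberg convention of `*** START OF ... ***` / `*** END OF ... ***` marker lines (the exact wording varies between books, so the patterns are deliberately loose):

```python
import re

def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return only the body of a Project Gutenberg TXT file.

    Most files wrap the actual text between marker lines such as
    '*** START OF THE PROJECT GUTENBERG EBOOK ... ***' and
    '*** END OF THE PROJECT GUTENBERG EBOOK ... ***'.
    """
    start = re.search(
        r"\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
        raw, re.IGNORECASE,
    )
    end = re.search(
        r"\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
        raw, re.IGNORECASE,
    )
    body_start = start.end() if start else 0
    body_end = end.start() if end else len(raw)
    return raw[body_start:body_end].strip()

# Fabricated example of the typical file layout:
sample = (
    "Title: Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Call me Ishmael.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License terms follow...\n"
)
print(strip_gutenberg_boilerplate(sample))  # → Call me Ishmael.
```

Some older files use slightly different marker wording, so a production pipeline should log files where no marker is found rather than silently keeping the full text.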
What is this dataset for?
This corpus is widely used to:
- Train text generation or completion models
- Perform linguistic or stylistic analysis across corpora from different authors
- Develop automatic summarization or literary classification models
- Study the evolution of written language over time
Can it be enriched or improved?
Yes. Although already rich, the Gutenberg dataset can be:
- Cleaned and segmented into chapters, paragraphs, or dialogue units
- Annotated with metadata: author, genre, date, style, historical period
- Combined with other corpora for multilingual or comparative approaches
- Used to create benchmarks on long text generation or literary paraphrasing
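The segmentation step above can be sketched with a simple heading-based splitter. This assumes chapters are introduced by lines like `CHAPTER I` or `CHAPTER 12`; heading conventions differ widely across books (Roman numerals, `Chapter 1.`, all-caps titles), so the pattern is a starting point, not a universal rule:

```python
import re

def split_into_chapters(text: str) -> list[str]:
    """Split a cleaned Gutenberg text on 'CHAPTER ...' heading lines."""
    # (?m) makes ^ and $ match at line boundaries; the character class
    # accepts both Roman numerals and Arabic digits.
    parts = re.split(r"(?m)^CHAPTER\s+[IVXLC\d]+\.?\s*$", text)
    return [p.strip() for p in parts if p.strip()]

sample = (
    "CHAPTER I\n"
    "It was the best of times.\n"
    "CHAPTER II\n"
    "It was the worst of times.\n"
)
print(split_into_chapters(sample))
# → ['It was the best of times.', 'It was the worst of times.']
```

The same idea extends to paragraph or dialogue segmentation by splitting on blank lines or quotation marks instead of chapter headings.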
🔗 Source: Gutenberg Dataset
Frequently Asked Questions
Does the dataset only contain texts in English?
No. Although the collection is mostly in English, Project Gutenberg also offers books in several other languages, including French, Spanish, German, and Italian.
Is it suitable for training large models?
Yes. Due to its size and quality, it is a good complement to other corpora when training LLMs focused on literary language or long-form narration.
How do I filter or structure the texts in the dataset?
You can use the metadata provided (title, author, language) or cleaning scripts to extract only the literary content and skip notes, prefaces, and legal notices.
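A minimal sketch of such metadata filtering, using a small hypothetical in-memory catalog (in practice, Project Gutenberg publishes machine-readable catalog files from which such records can be built):

```python
# Hypothetical catalog entries for illustration only.
catalog = [
    {"title": "Candide", "author": "Voltaire", "language": "fr"},
    {"title": "Moby Dick", "author": "Herman Melville", "language": "en"},
    {"title": "Faust", "author": "Johann Wolfgang von Goethe", "language": "de"},
]

def filter_books(catalog, language=None, author=None):
    """Keep only entries matching the requested language and/or author."""
    return [
        book for book in catalog
        if (language is None or book["language"] == language)
        and (author is None or author.lower() in book["author"].lower())
    ]

french_books = filter_books(catalog, language="fr")
print([b["title"] for b in french_books])  # → ['Candide']
```

Combining a filter like this with the header/footer cleaning step yields a corpus restricted to the languages, authors, or periods a given experiment needs.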