
Gutenberg Dataset

The Gutenberg dataset is a collection of public-domain literary texts made available by Project Gutenberg. It is a valuable resource for NLP work on written language, literature, and text generation.

Size

Several tens of thousands of books, in TXT, EPUB, and HTML formats

Licence

Public domain, subject to Project Gutenberg's terms and conditions. Check those terms before any commercial redistribution.

Description


The Gutenberg dataset includes:

  • Several tens of thousands of books (novels, essays, plays, poetry, and more)
  • Open formats: TXT, EPUB, HTML
  • Texts mostly in English, with other languages also available
  • A simple structure, compatible with traditional NLP pipelines

What is this dataset for?


This corpus is widely used for:

  • Training text generation or completion models
  • Linguistic or stylistic analysis of corpora spanning many authors
  • Developing automatic summarization or literary classification models
  • Studying how written language has evolved over time

Can it be enriched or improved?


Yes. Although already rich, the Gutenberg dataset can be:

  • Cleaned and segmented into chapters, paragraphs, or dialogue units
  • Annotated with metadata: author, genre, date, style, historical period
  • Combined with other corpora for multilingual or comparative approaches
  • Used to create benchmarks on long text generation or literary paraphrasing
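
The segmentation step mentioned above can be sketched as follows. This is a minimal illustration, not an official tool: it assumes a plain-text book in which paragraphs are separated by blank lines and chapter headings look like "CHAPTER I" or "Chapter 2" on their own line. Many books use other conventions, so the heading pattern and the helper names `split_chapters` and `split_paragraphs` are hypothetical and would need adapting.

```python
import re

# Assumed heading style: "CHAPTER I", "Chapter 2", etc. at the start of a line.
# Real books vary widely; adjust this pattern per corpus.
_CHAPTER = re.compile(r"^[ \t]*(?:CHAPTER|Chapter)[ \t]+[IVXLCDM\d]+\b.*$", re.MULTILINE)


def split_paragraphs(text):
    """Split text on blank lines and collapse internal whitespace and line wraps."""
    blocks = re.split(r"\n\s*\n", text.replace("\r\n", "\n"))
    return [" ".join(block.split()) for block in blocks if block.strip()]


def split_chapters(text):
    """Return a list of (heading, body) pairs.

    Text before the first heading is dropped. If no heading is found,
    the whole text is returned as a single untitled chapter.
    """
    matches = list(_CHAPTER.finditer(text))
    if not matches:
        return [("", text.strip())]
    chapters = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append((match.group().strip(), text[match.end():end].strip()))
    return chapters


sample = (
    "CHAPTER I\n\nFirst paragraph, line one\nline two.\n\nSecond paragraph.\n\n"
    "CHAPTER II\n\nThird paragraph."
)
chapters = split_chapters(sample)
paragraphs = split_paragraphs(chapters[0][1])
```

Keeping chapter and paragraph splitting as separate functions lets the same paragraph splitter be reused on books that have no recognizable chapter headings.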

🔗 Source: Gutenberg Dataset

Frequently Asked Questions

Does the dataset only contain texts in English?

No. Although most texts are in English, Project Gutenberg also offers books in several other languages, including French, Spanish, German, and Italian.

Is it suitable for training large models?

Yes. Given its size and quality, it is a good complement to other corpora when training LLMs focused on literary language or long-form narrative.

How do I filter or structure the texts in the dataset?

You can use the provided metadata (title, author, language) to select texts, and cleaning scripts to extract only the literary content while ignoring notes, prefaces, and legal notices.
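
As a rough sketch of such a cleaning script: Project Gutenberg plain-text files usually wrap the book between "*** START OF ..." and "*** END OF ..." marker lines, with the legal notice outside them. The exact marker wording varies between files, so the patterns below are an assumption. The metadata layout (a list of dictionaries with a `language` key) and the function names are likewise hypothetical.

```python
import re

# Assumed marker wording; older and newer files differ ("THE" vs "THIS").
_START = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
_END = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text):
    """Keep only the text between the start and end marker lines.

    If a marker is missing, fall back to the corresponding end of the file.
    """
    start = _START.search(text)
    begin = start.end() if start else 0
    end = _END.search(text, begin)
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()


def filter_by_language(records, language):
    """Select metadata records (hypothetical dicts) matching a language code."""
    return [record for record in records if record.get("language") == language]


raw = (
    "Licence header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n\n"
    "Body text.\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Legal notice"
)
body = strip_gutenberg_boilerplate(raw)

records = [
    {"title": "A", "author": "X", "language": "en"},
    {"title": "B", "author": "Y", "language": "fr"},
]
french = filter_by_language(records, "fr")
```

This removes only the wrapper around the book. Prefaces, transcriber notes, and footnotes sit inside the markers and would need separate, book-specific rules.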
