Gutenberg Dataset
The Gutenberg dataset is a collection of literary texts from the public domain, made available by Project Gutenberg. It is a valuable resource for NLP applications that focus on written language, literature, or text generation models.
Several tens of thousands of books, in TXT and EPUB formats
Public domain, under the Project Gutenberg terms and conditions; verification required for commercial redistribution
Description
The Gutenberg dataset includes:
- Several tens of thousands of books (novels, essays, plays, poetry...)
- Open formats: TXT, EPUB, HTML
- Texts mostly in English, with many other languages also represented
- A simple structure, compatible with traditional NLP pipelines
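Before feeding these files into an NLP pipeline, the surrounding boilerplate usually has to be removed. A minimal sketch, assuming the common Project Gutenberg convention of `*** START OF ... ***` / `*** END OF ... ***` marker lines (the exact wording varies between books, so the patterns are deliberately loose):

```python
import re

def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return only the body of a Project Gutenberg TXT file.

    Most files wrap the actual text between marker lines such as
    '*** START OF THE PROJECT GUTENBERG EBOOK ... ***' and
    '*** END OF THE PROJECT GUTENBERG EBOOK ... ***'.
    """
    start = re.search(
        r"\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
        raw, re.IGNORECASE,
    )
    end = re.search(
        r"\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
        raw, re.IGNORECASE,
    )
    body_start = start.end() if start else 0
    body_end = end.start() if end else len(raw)
    return raw[body_start:body_end].strip()

# Fabricated example of the typical file layout:
sample = (
    "Title: Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Call me Ishmael.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License terms follow...\n"
)
print(strip_gutenberg_boilerplate(sample))  # → Call me Ishmael.
```

Some older files use slightly different marker wording, so a production pipeline should log files where no marker is found rather than silently keeping the full text.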
What is this dataset for?
This corpus is widely used to:
- Train text generation or completion models
- Perform linguistic or stylistic analysis across corpora from different authors
- Develop automatic summarization or literary classification models
- Study the evolution of written language over time
Can it be enriched or improved?
Yes. Although already rich, the Gutenberg dataset can be:
- Cleaned and segmented into chapters, paragraphs, or dialogue units
- Annotated with metadata: author, genre, date, style, historical period
- Combined with other corpora for multilingual or comparative approaches
- Used to create benchmarks on long text generation or literary paraphrasing
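The segmentation step above can be sketched with a simple heading-based splitter. This assumes chapters are introduced by lines like `CHAPTER I` or `CHAPTER 12`; heading conventions differ widely across books (Roman numerals, `Chapter 1.`, all-caps titles), so the pattern is a starting point, not a universal rule:

```python
import re

def split_into_chapters(text: str) -> list[str]:
    """Split a cleaned Gutenberg text on 'CHAPTER ...' heading lines."""
    # (?m) makes ^ and $ match at line boundaries; the character class
    # accepts both Roman numerals and Arabic digits.
    parts = re.split(r"(?m)^CHAPTER\s+[IVXLC\d]+\.?\s*$", text)
    return [p.strip() for p in parts if p.strip()]

sample = (
    "CHAPTER I\n"
    "It was the best of times.\n"
    "CHAPTER II\n"
    "It was the worst of times.\n"
)
print(split_into_chapters(sample))
# → ['It was the best of times.', 'It was the worst of times.']
```

The same idea extends to paragraph or dialogue segmentation by splitting on blank lines or quotation marks instead of chapter headings.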
🔗 Source: Gutenberg Dataset
Frequently Asked Questions
Does the dataset only contain texts in English?
No. Although the collection is mostly in English, Project Gutenberg also offers books in several other languages, including French, Spanish, German, and Italian.
Is it suitable for training large models?
Yes. Due to its size and quality, it is a good complement to other corpora when training LLMs focused on literary language or long-form narration.
How do I filter or structure the texts in the dataset?
You can use the metadata provided (title, author, language) or cleaning scripts to extract only the literary content and skip notes, prefaces, and legal notices.
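A minimal sketch of such metadata filtering, using a small hypothetical in-memory catalog (in practice, Project Gutenberg publishes machine-readable catalog files from which such records can be built):

```python
# Hypothetical catalog entries for illustration only.
catalog = [
    {"title": "Candide", "author": "Voltaire", "language": "fr"},
    {"title": "Moby Dick", "author": "Herman Melville", "language": "en"},
    {"title": "Faust", "author": "Johann Wolfgang von Goethe", "language": "de"},
]

def filter_books(catalog, language=None, author=None):
    """Keep only entries matching the requested language and/or author."""
    return [
        book for book in catalog
        if (language is None or book["language"] == language)
        and (author is None or author.lower() in book["author"].lower())
    ]

french_books = filter_books(catalog, language="fr")
print([b["title"] for b in french_books])  # → ['Candide']
```

Combining a filter like this with the header/footer cleaning step yields a corpus restricted to the languages, authors, or periods a given experiment needs.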