Cosmopedia: Massive Synthetic Corpus
Cosmopedia is a large corpus of synthetic texts generated by the Mixtral-8x7B-Instruct-v0.1 model. It includes millions of educational articles, tutorials, stories, and blog posts inspired by sources such as Stanford, wikiHow, and RedPajama.
Description
Cosmopedia is one of the largest open-source synthetic datasets. It contains over 30 million documents, generated automatically by the Mixtral-8x7B-Instruct-v0.1 model from prompts built on educational sources (KhanAcademy, Stanford, wikiHow, etc.) and web data. The objective is to recreate a broad map of world knowledge through diverse, well-structured content.
What is this dataset for?
- Large-scale fine-tuning of LLMs on coherent, multi-thematic content (a minimal loading sketch follows this list)
- Pre-training generation, question-answering, or summarization models
- Testing model robustness against synthetic variations that closely resemble natural language
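A minimal loading sketch with Hugging Face Datasets, assuming the repository ID "HuggingFaceTB/cosmopedia" and the "stanford" subset name; both should be checked against the dataset card on the Hub.

```python
# Minimal loading sketch (assumption: repository ID and subset name may differ;
# verify against the dataset card on the Hugging Face Hub).
from datasets import load_dataset

# Stream to avoid downloading the full 30M-document corpus up front.
ds = load_dataset("HuggingFaceTB/cosmopedia", "stanford",
                  split="train", streaming=True)

sample = next(iter(ds))
print(sample.keys())         # prompt, text, and metadata fields
print(sample["text"][:500])  # first characters of one synthetic article
```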
Can it be enriched or improved?
Yes. You can add a theme classification layer, filter out certain sources, or use Cosmopedia as a foundation for automated teaching systems. Partial human annotation could also improve quality in specific segments.
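As a hedged sketch of such an enrichment layer, the snippet below streams one subset and applies a simple keyword-based theme filter; the subset name and the "text" field are assumptions, and a production pipeline would use a trained classifier instead of keyword matching.

```python
# Illustrative theme filter over a streamed subset (keyword matching stands in
# for a real classifier; subset and field names are assumptions).
from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/cosmopedia", "web_samples_v1",
                  split="train", streaming=True)

KEYWORDS = ("physics", "chemistry", "biology")  # example theme vocabulary

def looks_scientific(example):
    text = example["text"].lower()
    return any(keyword in text for keyword in KEYWORDS)

# Plain Python filtering keeps the example independent of library version.
science_docs = filter(looks_scientific, ds)
for _, doc in zip(range(3), science_docs):
    print(doc["text"][:200])
```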
🔎 In summary
🧠 Recommended for
- LLM pre-training
- Educational projects
- Generative AI R&D
🔧 Compatible tools
- PyTorch
- Hugging Face Datasets
- DeepSpeed
- LoRA
- Axolotl
💡 Tip
Use Nomic's interactive map to filter themes before full ingestion into a pipeline.
Frequently Asked Questions
Is the content reliable for educational use?
The data is synthetic and has not been verified by humans, so it should be used with caution for critical applications.
Can I only extract wikiHow articles?
Yes, the dataset is organized into subsets according to the sources used for the prompts, so you can load only the wikiHow portion.
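For illustration, the sketch below lists the available subsets and then streams only the wikiHow one; the "wikihow" identifier is an assumption to verify against the printed list.

```python
# List source-based subsets, then stream only the wikiHow one
# (the "wikihow" identifier is an assumption; check the printed list).
from datasets import get_dataset_config_names, load_dataset

print(get_dataset_config_names("HuggingFaceTB/cosmopedia"))

wikihow = load_dataset("HuggingFaceTB/cosmopedia", "wikihow",
                       split="train", streaming=True)
print(next(iter(wikihow))["text"][:300])
```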
Can an LLM be trained using this dataset only?
Yes, its volume and diversity make Cosmopedia suitable for pre-training or large-scale fine-tuning of a language model.
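As a rough sketch of a pre-training ingestion step, the snippet below streams one subset and tokenizes documents on the fly; the tokenizer is a placeholder, and a real run would pack sequences into fixed-length blocks before feeding a trainer such as Axolotl or a DeepSpeed-backed loop.

```python
# Streaming tokenization sketch for pre-training (placeholder tokenizer;
# sequence packing and the training loop itself are omitted).
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for your target tokenizer

stream = load_dataset("HuggingFaceTB/cosmopedia", "stories",
                      split="train", streaming=True)

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=1024)

tokenized = stream.map(tokenize)

for _, example in zip(range(2), tokenized):
    print(len(example["input_ids"]))  # token count of each synthetic document
```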