Cosmopedia: Massive Synthetic Corpus
Cosmopedia is a huge corpus of synthetic texts generated by the Mixtral-8x7B-Instruct-v0.1 model. It includes millions of educational articles, tutorials, stories, and blog posts inspired by sources such as Stanford courses, wikiHow, and RedPajama.

Size

30 million documents, 25 billion tokens, JSON/parquet format

License

Apache 2.0

Description

Cosmopedia is one of the largest open-source synthetic datasets. It includes over 30 million documents, generated automatically by the Mixtral-8x7B-Instruct-v0.1 model from prompts seeded with educational sources (Khan Academy, Stanford, wikiHow, etc.) or web text. The objective is to recreate a broad map of textual knowledge through diverse, structured content.

What is this dataset for?

  • Large-scale fine-tuning of LLMs on coherent, multi-thematic content
  • Pre-training question-answering or summarization models
  • Testing model robustness against synthetic variations that resemble natural language
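Given its ~25 billion tokens, the corpus is best inspected in streaming mode rather than downloaded in full. A minimal sketch using Hugging Face Datasets — the Hub id "HuggingFaceTB/cosmopedia" and the "web_samples_v1" configuration name are taken from the Hub listing, and the "text" field is an assumption about the record schema:

```python
from itertools import islice
from typing import Iterable, Iterator


def preview(records: Iterable[dict], n: int = 3, width: int = 200) -> Iterator[str]:
    """Yield the first n text snippets from a stream of Cosmopedia records."""
    for rec in islice(records, n):
        yield rec["text"][:width]


# Against the real corpus (requires network access):
#   from datasets import load_dataset
#   ds = load_dataset("HuggingFaceTB/cosmopedia", "web_samples_v1",
#                     split="train", streaming=True)
#   for snippet in preview(ds):
#       print(snippet)
```

Streaming avoids materializing the full 30 million documents on disk while still letting you sample and inspect records.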

Can it be enriched or improved?

Yes. It is possible to add a theme-classification layer, to filter out certain sources, or to use Cosmopedia as a basis for automated tutoring systems. Partial human annotation could also improve quality in certain segments.
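As a sketch of such an enrichment, a naive keyword-based tagger could assign a theme label to each record before filtering. The theme names and keyword lists below are purely illustrative assumptions, not part of the dataset; a real pipeline would use a trained classifier:

```python
# Illustrative themes and keywords — NOT part of the Cosmopedia schema.
THEME_KEYWORDS = {
    "science": ("physics", "chemistry", "biology", "experiment"),
    "history": ("empire", "revolution", "century", "ancient"),
    "technology": ("software", "algorithm", "computer", "network"),
}


def tag_theme(text: str) -> str:
    """Return the theme whose keywords appear most often, or 'other'."""
    lowered = text.lower()
    scores = {theme: sum(kw in lowered for kw in kws)
              for theme, kws in THEME_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"
```

Such a tag could then drive source- or theme-level filtering before fine-tuning.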

🔎 In summary

Criterion                 Evaluation
🧩 Ease of use            ⭐⭐⭐✩✩ (large volume, requires an adapted pipeline)
🧼 Need for cleaning      ⭐⭐⭐⭐⭐ (low – content is generated and structured)
🏷️ Annotation richness    ⭐⭐✩✩✩ (raw text, without annotations, but very diverse)
📜 Commercial license     ✅ Yes (Apache 2.0)
👨‍💻 Beginner friendly      ⚠️ Complexity linked to corpus size
🔁 Fine-tuning ready      🤖 Perfect for LLM training
🌍 Cultural diversity     🎭 High thematic and stylistic diversity

🧠 Recommended for

  • LLM pre-training
  • Educational projects
  • Generative AI R&D

🔧 Compatible tools

  • PyTorch
  • Hugging Face Datasets
  • DeepSpeed
  • LoRA
  • Axolotl

💡 Tip

Use Nomic's interactive map to filter themes before full ingestion into a pipeline.

Frequently Asked Questions

Is the content reliable for educational use?

Cosmopedia is synthetic data that has not been human-verified, so it should be used with care in critical educational settings.

Can I extract only the wikiHow articles?

Yes, the dataset is divided into splits according to the sources used for the prompts. You can filter accordingly.
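The cleanest route is to load the source-specific configuration directly, e.g. `load_dataset("HuggingFaceTB/cosmopedia", "wikihow")`; as a fallback, records can also be filtered client-side. In the sketch below the "wikihow" configuration name matches the Hub listing, while the "seed_data" field name is an assumption about the record schema:

```python
def from_source(records, source="wikihow"):
    """Keep records whose seed text came from a given source.

    Note: "seed_data" is an assumed field name; loading the dedicated
    configuration avoids scanning other sources entirely.
    """
    return [r for r in records if source in r.get("seed_data", "")]
```

Client-side filtering like this only makes sense on an already-streamed subset; for the full corpus, prefer the per-source configuration.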

Can an LLM be trained using this dataset only?

Yes, its volume and diversity make Cosmopedia suitable for pre-training or large-scale fine-tuning of a language model.
