Cosmopedia: Massive Synthetic Corpus
Cosmopedia is a large corpus of synthetic texts generated by the Mixtral-8x7B-Instruct-v0.1 model. It includes millions of educational articles, tutorials, stories, and blog posts inspired by sources such as Stanford, wikiHow, and RedPajama.
Description
Cosmopedia is one of the largest open-source synthetic datasets. It contains over 30 million documents, generated automatically by the Mixtral-8x7B-Instruct-v0.1 model from prompts built on educational sources (KhanAcademy, Stanford, wikiHow, etc.) and web data. The objective is to recreate a broad map of world knowledge through diverse, well-structured content.
What is this dataset for?
- Large-scale fine-tuning of LLMs on coherent, multi-thematic content (a minimal loading sketch follows this list)
- Pre-training generation, question-answering, or summarization models
- Testing model robustness against synthetic variations that closely resemble natural language
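A minimal loading sketch with Hugging Face Datasets, assuming the repository ID "HuggingFaceTB/cosmopedia" and the "stanford" subset name; both should be checked against the dataset card on the Hub.

```python
# Minimal loading sketch (assumption: repository ID and subset name may differ;
# verify against the dataset card on the Hugging Face Hub).
from datasets import load_dataset

# Stream to avoid downloading the full 30M-document corpus up front.
ds = load_dataset("HuggingFaceTB/cosmopedia", "stanford",
                  split="train", streaming=True)

sample = next(iter(ds))
print(sample.keys())         # prompt, text, and metadata fields
print(sample["text"][:500])  # first characters of one synthetic article
```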
Can it be enriched or improved?
Yes. You can add a theme classification layer, filter out certain sources, or use Cosmopedia as a foundation for automated teaching systems. Partial human annotation could also improve quality in specific segments.
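As a hedged sketch of such an enrichment layer, the snippet below streams one subset and applies a simple keyword-based theme filter; the subset name and the "text" field are assumptions, and a production pipeline would use a trained classifier instead of keyword matching.

```python
# Illustrative theme filter over a streamed subset (keyword matching stands in
# for a real classifier; subset and field names are assumptions).
from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/cosmopedia", "web_samples_v1",
                  split="train", streaming=True)

KEYWORDS = ("physics", "chemistry", "biology")  # example theme vocabulary

def looks_scientific(example):
    text = example["text"].lower()
    return any(keyword in text for keyword in KEYWORDS)

# Plain Python filtering keeps the example independent of library version.
science_docs = filter(looks_scientific, ds)
for _, doc in zip(range(3), science_docs):
    print(doc["text"][:200])
```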
🔎 In summary
🧠 Recommended for
- LLM pre-training
- Educational projects
- Generative AI R&D
🔧 Compatible tools
- PyTorch
- Hugging Face Datasets
- DeepSpeed
- LoRA
- Axolotl
💡 Tip
Use Nomic's interactive map to filter themes before full ingestion into a pipeline.
Frequently Asked Questions
Is the content reliable for educational use?
The data is synthetic and has not been verified by humans, so it should be used with caution for critical applications.
Can I only extract wikiHow articles?
Yes, the dataset is organized into subsets according to the sources used for the prompts, so you can load only the wikiHow portion.
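For illustration, the sketch below lists the available subsets and then streams only the wikiHow one; the "wikihow" identifier is an assumption to verify against the printed list.

```python
# List source-based subsets, then stream only the wikiHow one
# (the "wikihow" identifier is an assumption; check the printed list).
from datasets import get_dataset_config_names, load_dataset

print(get_dataset_config_names("HuggingFaceTB/cosmopedia"))

wikihow = load_dataset("HuggingFaceTB/cosmopedia", "wikihow",
                       split="train", streaming=True)
print(next(iter(wikihow))["text"][:300])
```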
Can an LLM be trained using this dataset only?
Yes, its volume and diversity make Cosmopedia suitable for pre-training or large-scale fine-tuning of a language model.
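As a rough sketch of a pre-training ingestion step, the snippet below streams one subset and tokenizes documents on the fly; the tokenizer is a placeholder, and a real run would pack sequences into fixed-length blocks before feeding a trainer such as Axolotl or a DeepSpeed-backed loop.

```python
# Streaming tokenization sketch for pre-training (placeholder tokenizer;
# sequence packing and the training loop itself are omitted).
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for your target tokenizer

stream = load_dataset("HuggingFaceTB/cosmopedia", "stories",
                      split="train", streaming=True)

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=1024)

tokenized = stream.map(tokenize)

for _, example in zip(range(2), tokenized):
    print(len(example["input_ids"]))  # token count of each synthetic document
```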