OpenThoughts 114k

Structured reasoning corpus covering math, science, programming, and puzzles. Used to refine and test OpenThinker models.


Size

114,000 examples in JSON format (problems, solutions, reasoning, code), ready to use for training

Licence

Apache 2.0

Description


OpenThoughts-114k is a reasoning dataset with 114,000 high-quality examples. Each entry includes a problem, a reference solution, intermediate reasoning, and, in some cases, code. The dataset covers a variety of fields such as math, science, computer science, and puzzles, and has been used to train OpenThinker models (7B and 32B).
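As a rough illustration of the entry structure described above, the sketch below parses one JSON record and flattens it into a single training string. The field names (problem, reasoning, solution, code) and the sample record are assumptions made for this example, not the dataset's confirmed schema; check them against the actual files before relying on them.

```python
import json

# Hypothetical record: field names and values are assumptions for
# illustration, not the dataset's confirmed schema.
raw = """
{
  "problem": "What is 12 * 7?",
  "reasoning": "12 * 7 = 12 * (5 + 2) = 60 + 24 = 84.",
  "solution": "84",
  "code": null
}
"""


def to_training_text(record: dict) -> str:
    """Join the parts of one entry into a single text block."""
    parts = [
        f"Problem: {record['problem']}",
        f"Reasoning: {record['reasoning']}",
        f"Solution: {record['solution']}",
    ]
    # Code is optional, so append it only when it is present and non-empty.
    if record.get("code"):
        parts.append(f"Code:\n{record['code']}")
    return "\n\n".join(parts)


record = json.loads(raw)
text = to_training_text(record)
print(text)
```

Treating the code field as optional mirrors the description, where only some entries carry code.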


What is this dataset for?


Train models to perform multi-step reasoning
Create benchmarks to test LLMs on STEM tasks
Improve model performance on complex cases through fine-tuning


Can it be enriched or improved?


Yes. Reasoning steps can be annotated, problems can be classified by difficulty, and question variants can be generated. The dataset can also be combined with other resources to create multilingual or multi-domain sets.
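One simple way to approximate the difficulty classification mentioned above is to use the length of the reasoning as a proxy. The sketch below is only a heuristic: the field names, the word-count thresholds, and the sample records are all assumptions chosen for illustration, and a real labeling pass would likely need human review or a stronger signal.

```python
def label_difficulty(record: dict, short: int = 50, long: int = 200) -> str:
    """Label a record by the word count of its reasoning (heuristic only).

    The thresholds `short` and `long` are arbitrary assumptions.
    """
    n_words = len(record["reasoning"].split())
    if n_words < short:
        return "easy"
    if n_words < long:
        return "medium"
    return "hard"


# Hypothetical records; "step " repeated N times stands in for a long trace.
records = [
    {"problem": "What is 2 + 2?", "reasoning": "Add the two numbers."},
    {"problem": "Sample medium problem", "reasoning": "step " * 120},
    {"problem": "Sample hard problem", "reasoning": "step " * 400},
]

# Add a new "difficulty" field without modifying the original records.
enriched = [{**r, "difficulty": label_difficulty(r)} for r in records]

for r in enriched:
    print(r["problem"], "->", r["difficulty"])
```

Building new dictionaries rather than editing records in place keeps the original data intact, which makes it easier to compare alternative labeling schemes.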


🔎 In summary

Criterion	Evaluation
🧩Ease of Use	⭐⭐⭐⭐⭐ (Ready to use for training)
🧼Need for Cleaning	⭐⭐⭐⭐⭐ (Low — well-structured data)
🏷️Annotation Richness	⭐⭐⭐⭐⭐ (Complete — solutions, reasoning, metadata)
📜Commercial License	✅ Yes (Apache 2.0)
👨‍💻Beginner-Friendly	🧑‍🎓 Yes, with minimal technical background
🔁Reusable for Fine-Tuning	🔥 Excellent for STEM or reasoning models
🌍Cultural Diversity	🌍 Moderate — technical content, limited cultural scope