OpenThoughts 114k
Structured reasoning corpus covering math, science, programming, and puzzles. Used to fine-tune and test OpenThinker models.
114,000 examples in JSON format (problems, solutions, reasoning, code), ready for training
Apache 2.0
Description
OpenThoughts-114k is a generative reasoning dataset with 114,000 high-quality examples. Each entry includes a problem, a reference solution, intermediate reasoning, and sometimes code. The dataset covers a variety of fields such as math, science, computer science, and puzzles, and has been used to train OpenThinker models (7B and 32B).
What is this dataset for?
- Train models to perform multi-step reasoning
- Build benchmarks to evaluate LLMs on STEM tasks
- Improve model performance on complex cases through fine-tuning
Can it be enriched or improved?
Yes. You can annotate individual reasoning steps, classify problems by difficulty, or generate question variants. The dataset can also be combined with other resources to build multilingual or multi-domain sets.
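As an illustration of the difficulty classification mentioned above, here is a minimal sketch that labels each record by the length of its reasoning trace. The heuristic, the thresholds, and the field name `reasoning` are all assumptions made for this example; the real column name may differ, and reasoning length is only a rough proxy for difficulty.

```python
def annotate_difficulty(record, reasoning_key="reasoning", thresholds=(200, 1000)):
    """Return a copy of `record` with a crude difficulty label.

    Assumption: a longer reasoning trace suggests a harder problem.
    `reasoning_key` and `thresholds` are hypothetical defaults, not
    values documented by the dataset.
    """
    text = str(record.get(reasoning_key) or "")
    words = len(text.split())
    low, high = thresholds
    if words < low:
        label = "easy"
    elif words < high:
        label = "medium"
    else:
        label = "hard"
    # Build a new dict so the original record is left untouched.
    return {**record, "reasoning_words": words, "difficulty": label}


def annotate_all(records, **kwargs):
    """Apply annotate_difficulty to every record in an iterable."""
    return [annotate_difficulty(r, **kwargs) for r in records]
```

In practice the thresholds would need to be calibrated on a sample of the data, for example by looking at the distribution of reasoning lengths per field.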
🔎 In summary
🧠 Recommended for
- AI engineers
- NLP researchers
- Developers of reasoning models
🔧 Compatible tools
- Transformers
- Evalchemy
- Jupyter
- LoRA
- Curator Viewer
💡 Tip
Use the “metadata” subset for research tasks on reasoning or explainability strategies.
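A minimal sketch of how the “metadata” subset might be loaded and inspected with the Hugging Face `datasets` library. The repository id `open-thoughts/OpenThoughts-114k` is an assumption (it is not stated on this page), and `field_coverage` is a hypothetical helper written for this example.

```python
def load_metadata_subset(repo_id="open-thoughts/OpenThoughts-114k", split="train"):
    """Load the "metadata" configuration from the Hugging Face Hub.

    Assumption: `repo_id` points at the dataset described here.
    Requires the third-party `datasets` package and network access,
    so the import is done lazily inside the function.
    """
    from datasets import load_dataset

    return load_dataset(repo_id, name="metadata", split=split)


def field_coverage(records):
    """Count, for each field name, how many records hold a non-empty value."""
    counts = {}
    for record in records:
        for key, value in record.items():
            if value not in (None, "", [], {}):
                counts[key] = counts.get(key, 0) + 1
    return counts
```

Calling `field_coverage(load_metadata_subset())` would show which fields are actually populated, which is a useful first check before relying on any of them for research.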
Frequently Asked Questions
Does this dataset include step-by-step reasoning?
Yes. Each example contains model-generated intermediate reasoning, which supports detailed analysis of how the model arrived at its answer.
Can this dataset be used for code generation models?
Yes. Part of the dataset consists of coding problems with test cases and starter code, which makes it suitable for fine-tuning on coding tasks.
Is it possible to isolate examples by field (math, science, etc.)?
Yes. Each example is annotated with a “domain” field, so you can filter examples by problem type.
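A minimal sketch of filtering by the “domain” field, written against plain dictionaries so it works on any iterable of records. The domain values used below ("math", "code") are illustrative assumptions; check the values that actually occur in the data before filtering.

```python
from collections import Counter


def _normalized_domain(record):
    """Lower-cased, stripped domain value; empty string if missing or None."""
    return str(record.get("domain") or "").strip().lower()


def domain_counts(records):
    """Tally how many records fall under each domain value."""
    return Counter(_normalized_domain(r) for r in records)


def filter_by_domain(records, domain):
    """Keep only the records whose domain matches, ignoring case and whitespace."""
    wanted = domain.strip().lower()
    return [r for r in records if _normalized_domain(r) == wanted]
```

Running `domain_counts` first gives the list of available values. If the data is loaded as a `datasets.Dataset`, its built-in `filter` method with a predicate on `example["domain"]` is an alternative to the list comprehension shown here.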