By clicking "Accept", you agree to the storing of cookies on your device to enhance site navigation, analyze site usage, and assist in our marketing efforts. See our Privacy Policy for more information
Open Datasets
OpenThoughts 114k
Text

OpenThoughts 114k

Structured reasoning corpus covering math, science, programming, and puzzles. Used to refine and test OpenThinker models.

Download dataset
Size

114,000 examples in JSON format (problems, solutions, reasoning, code), data ready to be trained

Licence

Apache 2.0

Description

OpenThoughts-114k is a generative reasoning dataset with 114,000 high-quality examples. Each entry includes a problem, a reference solution, intermediate reasoning, and sometimes code. The dataset covers a variety of fields such as math, science, computer science, and puzzles, and has been used to train OpenThinker models (7B and 32B).

What is this dataset for?

  • Train Models to Make Multi-Stage Reasoning
  • Create benchmarks to test LLM models on STEM tasks
  • Improving the performance of models on complex cases via fine-tuning

Can it be enriched or improved?

Yes, it is possible to add annotations on the reasoning stages, to classify the problems by difficulty, or to generate question variants. The dataset can also be combined with other resources to create multi-lingual or multi-domain sets.

🔎 In summary

Criterion Evaluation
🧩Ease of Use ⭐⭐⭐⭐⭐ (Ready to use for training)
🧼Need for Cleaning ⭐⭐⭐⭐⭐ (Low — well-structured data)
🏷️Annotation Richness ⭐⭐⭐⭐⭐ (Complete — solutions, reasoning, metadata)
📜Commercial License ✅ Yes (Apache 2.0)
👨‍💻Beginner-Friendly 🧑‍🎓 Yes, with minimal technical background
🔁Reusable for Fine-Tuning 🔥 Excellent for STEM or reasoning models
🌍Cultural Diversity 🌍 Moderate — technical content, limited cultural scope

🧠 Recommended for

  • AI engineers
  • NLP researchers
  • Creators of Reasoning Models

🔧 Compatible tools

  • Transformers
  • Evalchemy
  • Jupyter
  • LoRa
  • Curator Viewer

💡 Tip

Use the “metadata” subset for research tasks on reasoning or explainability strategies.

Frequently Asked Questions

Does this dataset include step-by-step reasoning?

Yes, each example contains model-generated intermediate reasoning, facilitating detailed analysis of the simulated cognitive processes.

Can this dataset be used for code generation models?

Yes, part of the dataset contains code with test cases and starter code, ideal for fine-tuning on coding tasks.

Is it possible to isolate examples by field (math, science, etc.)?

Yes, each example is annotated with a “domain” field allowing precise thematic filtering according to the type of problem.

Similar datasets

See more
Category

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique.

Category

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique.

Category

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique.