MixInstruct — Multi-LLM comparison on instruction responses

A large dataset containing responses from 11 open-source LLMs to a shared set of instructions. Includes automatic quality scores (BLEU, ROUGE, BERTScore, BARTScore) as well as pairwise comparisons judged by ChatGPT. An ideal resource for training, comparing, or improving language models.

Download dataset
Size

110,000 examples in Parquet (582 MB)

License

MIT

Description

MixInstruct is a dataset of 110,000 examples composed of responses generated by 11 popular open-source language models on a shared set of instructions. Each response comes with several automatic metrics (BLEU, ROUGE, BERTScore, BARTScore), and pairwise comparisons judged by ChatGPT are provided for a subset of more than 4,700 examples.
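
To get a feel for the data, you can load the Parquet file with Pandas. This is a minimal sketch: the file name is a placeholder for wherever you saved the download, and the printed schema should be checked before relying on any column names.

```python
import pandas as pd

# Placeholder path: point this at the downloaded Parquet file.
df = pd.read_parquet("mixinstruct.parquet")

print(df.shape)              # expect on the order of 110,000 rows
print(df.columns.tolist())   # inspect the actual schema before relying on it
print(df.head())
```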

What is this dataset for?

  • Compare the performance of LLMs on instruction-following tasks (see the sketch after this list)
  • Train or evaluate multi-source generative models
  • Build a consistent instruction-following benchmark for open-source LLMs
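
For the first use case, a quick leaderboard can be built by averaging an automatic metric per model. A sketch assuming the schema exposes a `model` column (which of the 11 LLMs wrote the response) and a `bertscore` column; adapt the names to the real schema.

```python
import pandas as pd

df = pd.read_parquet("mixinstruct.parquet")  # placeholder path

# Average an automatic metric per model and rank the 11 LLMs.
leaderboard = (
    df.groupby("model")["bertscore"]  # assumed column names
      .mean()
      .sort_values(ascending=False)
)
print(leaderboard)
```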

Can it be enriched or improved?

Yes, this dataset can be enriched with new models, new instructions, or additional metrics (e.g., human evaluation or toxicity scores). Metadata such as generation time, model parameters, or inference cost can also be added, as sketched below.
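
As a sketch of such an enrichment, the snippet below adds simple length metadata; richer signals (toxicity scores, human ratings, inference cost) would follow the same column-adding pattern. The `output` column name is an assumption about the schema.

```python
import pandas as pd

df = pd.read_parquet("mixinstruct.parquet")  # placeholder path

# "output" is an assumed name for the response-text column.
df["response_chars"] = df["output"].str.len()
df["response_words"] = df["output"].str.split().str.len()

df.to_parquet("mixinstruct_enriched.parquet")
```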

🔎 In summary

Criterion | Evaluation
🧩 Ease of use | ⭐⭐⭐⭐⭐ (simple Parquet format, easy to handle)
🧼 Cleaning required | ⭐⭐⭐☆☆ (data already well structured)
🏷️ Annotation richness | ⭐⭐⭐⭐⭐ (scores + pairwise comparisons)
📜 Commercial license | ✅ Yes (MIT)
👨‍💻 Beginner-friendly | 🧠 Accessible with some NLP knowledge
🔁 Reusable for fine-tuning | 🔥 Ideal for instruction-model fine-tuning
🌍 Cultural diversity | 🌐 Mainly English, but generalizable

🧠 Recommended for

  • NLP researchers
  • LLM developers
  • Generative AI evaluators

🔧 Compatible tools

  • Transformers (see the loading sketch after this list)
  • OpenChat
  • DeepEval
  • LangChain
  • Pandas
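
Because the data ships as Parquet, it also loads directly into the Hugging Face ecosystem via the Datasets library that accompanies Transformers; the file name below is a placeholder.

```python
from datasets import load_dataset

# Read the local Parquet file (placeholder path) as a training split.
ds = load_dataset("parquet", data_files="mixinstruct.parquet", split="train")

print(ds)      # features and row count
print(ds[0])   # first example
```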

💡 Tip

Filter examples by the variance of their scores across models to build a hard subset for fine-grained evaluation.
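
A minimal sketch of that idea with Pandas, assuming `instruction` and `bertscore` columns (adjust to the real schema): instructions where the 11 models' scores disagree the most are kept as the hard set.

```python
import pandas as pd

df = pd.read_parquet("mixinstruct.parquet")  # placeholder path

# Variance of the models' scores for each shared instruction:
# high variance means the models disagree, i.e. a discriminative example.
variance = df.groupby("instruction")["bertscore"].var()

# Keep the 10% most discriminative instructions as the hard set.
threshold = variance.quantile(0.90)
hard_instructions = variance[variance >= threshold].index
hard_set = df[df["instruction"].isin(hard_instructions)]
print(f"Hard set: {len(hard_set)} rows")
```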

Frequently Asked Questions

Does this dataset include responses generated by GPT-4?

No. The responses themselves come from 11 open-source models; ChatGPT is only used to judge the pairwise comparisons.

Can I use this dataset to train a new LLM?

Yes, it can be used for fine-tuning or multi-reference distillation, especially for instruction-following tasks.
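
One simple recipe, sketched below under assumed column names (`instruction`, `output`, `bertscore`): keep the best-scoring response per instruction as the supervised fine-tuning target.

```python
import pandas as pd

df = pd.read_parquet("mixinstruct.parquet")  # placeholder path

# Keep the best-scoring response for each instruction as the SFT target.
best = df.loc[df.groupby("instruction")["bertscore"].idxmax()]
sft_pairs = best[["instruction", "output"]]

# Write one JSON object per line, a common fine-tuning input format.
sft_pairs.to_json("sft_pairs.jsonl", orient="records", lines=True)
```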

Are metrics calculated automatically or manually?

Scores like BLEU or ROUGE are computed automatically, while the pairwise comparisons are obtained by prompting ChatGPT.
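
The automatic scores can be reproduced, or extended to new responses, with the Hugging Face `evaluate` library; a small self-contained example:

```python
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

predictions = ["The cat sat on the mat."]
references = [["A cat was sitting on the mat."]]  # BLEU accepts multiple refs

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions,
                    references=["A cat was sitting on the mat."]))
```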
