MixInstruct — Multi-LLM comparison on instruction responses
A large dataset containing answers from 11 LLMs to a shared set of instructions. Includes automatic quality scores (BLEU, ROUGE, BERTScore, BARTScore) as well as pairwise comparisons judged by ChatGPT. An ideal resource for training, comparing, or improving language models.
Description
MixInstruct is a dataset of 110,000 examples composed of responses generated by 11 popular open-source language models to a common set of instructions. Each response comes with several automatic metrics (BLEU, ROUGE, BERTScore, BARTScore), and pairwise comparisons performed by ChatGPT are provided for a subset of more than 4,700 examples.
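To give an idea of the structure, here is a minimal sketch of loading the dataset with the Hugging Face datasets library. The hub ID and the field names (instruction, candidates, scores) are assumptions based on the typical layout of this dataset; check the dataset card for the exact schema.

```python
# Minimal sketch: loading MixInstruct with the Hugging Face `datasets` library.
# The hub ID and field names below are assumptions; verify them against the
# dataset card before relying on this snippet.
from datasets import load_dataset

ds = load_dataset("llm-blender/mix_instruct", split="train")  # assumed hub ID

example = ds[0]
print(example["instruction"])          # assumed field: the shared instruction
for cand in example["candidates"]:     # assumed field: one entry per model
    # each candidate is expected to carry the model name, the generated text,
    # and a dict of automatic scores (BLEU, ROUGE, BERTScore, BARTScore)
    print(cand["model"], cand["scores"])
```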
What is this dataset for?
- Compare the performance of LLMs on instruction-following tasks
- Train or evaluate multi-source generative models
- Build a consistent instruction-following benchmark for open-source LLMs
Can it be enriched or improved?
Yes, this dataset can be enriched with new models, new instructions, or additional metrics (e.g. human evaluation, toxicity scores). It is also possible to add metadata such as generation time, model parameters, or inference cost.
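As a concrete illustration, the sketch below adds one extra metric column and one metadata column with pandas. The score_toxicity function and the flat row layout are hypothetical placeholders, not part of the dataset.

```python
# Sketch of one way to enrich the data with an extra metric or metadata column
# using pandas. `score_toxicity` is a hypothetical placeholder for any scorer
# you plug in (a classifier, an API call, etc.).
import pandas as pd

def score_toxicity(text: str) -> float:
    """Hypothetical scorer; replace with a real toxicity model."""
    return 0.0  # placeholder value

# `rows` is assumed to be a flat list of dicts built by iterating over the
# candidates of each example; the keys are illustrative, not the real schema.
rows = [
    {"instruction": "Summarize...", "model": "open_llama_7b", "text": "..."},
]
df = pd.DataFrame(rows)
df["toxicity"] = df["text"].map(score_toxicity)   # new metric column
df["response_length"] = df["text"].str.len()      # simple extra metadata
```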
🔎 In summary
🧠 Recommended for
- NLP researchers
- LLMs developers
- Generative AI evaluators
🔧 Compatible tools
- Transformers
- OpenChat
- DeepEval
- LangChain
- Pandas
💡 Tip
Filter the examples by the variance of their scores across models to build a difficult subset (hard set) for fine-grained evaluation.
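A minimal sketch of that tip with pandas, assuming one row per (example, model) pair; the column names ("example_id", "bartscore") are assumptions, not the official schema.

```python
import pandas as pd

# Toy frame: one row per (example, model) pair with one automatic score.
# In practice this would be built from the real candidates.
df = pd.DataFrame({
    "example_id": [0, 0, 0, 1, 1, 1],
    "model":      ["a", "b", "c", "a", "b", "c"],
    "bartscore":  [-2.1, -2.0, -2.2, -1.0, -4.5, -3.0],
})

# Variance of the score across the candidates of each example: high variance
# means the models disagree strongly, one reasonable notion of "hard".
variance = df.groupby("example_id")["bartscore"].var()

threshold = variance.quantile(0.90)                # keep the top 10% by variance
hard_ids = variance[variance >= threshold].index
hard_set = df[df["example_id"].isin(hard_ids)]
print(hard_set)
```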
Frequently Asked Questions
Does this dataset include responses generated by GPT-4?
No. The pairwise comparisons are judged by ChatGPT, but the responses themselves come from 11 open-source models.
Can I use this dataset to train a new LLM?
Yes, it can be used for fine-tuning or multi-reference distillation, especially for instruction-following tasks.
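As a hedged illustration, the sketch below picks the best-scored candidate per instruction and writes simple (prompt, response) pairs to JSONL for supervised fine-tuning. The field names and the use of BARTScore (higher is better) as the selection criterion are assumptions.

```python
# Sketch of turning MixInstruct examples into simple SFT pairs: select the
# candidate with the best score for each instruction and write (prompt,
# response) records to JSONL. Field names and the "bartscore" key are
# assumptions; adapt them to the real schema.
import json

def best_candidate(example):
    # higher BARTScore is assumed to be better here
    return max(example["candidates"], key=lambda c: c["scores"]["bartscore"])

def to_sft_jsonl(examples, path="mixinstruct_sft.jsonl"):
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            best = best_candidate(ex)
            record = {"prompt": ex["instruction"], "response": best["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The same iteration could instead keep all 11 candidates per instruction as multiple references for distillation.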
Are metrics calculated automatically or manually?
Scores like BLEU or ROUGE are computed automatically, while the pairwise comparisons are obtained by prompting ChatGPT.