OpenAI HumanEval
OpenAI HumanEval is an evaluation dataset dedicated to Python code generation. It contains 164 problems, each with a function signature, an explanatory docstring, a canonical solution, and unit tests. The problems were hand-written to ensure they do not appear in model training corpora, allowing for reliable evaluation.
Description
The OpenAI HumanEval dataset includes 164 Python programming problems. Each example contains a function signature, a docstring describing the expected behavior, the body of the canonical solution, and unit tests to validate the generated code. The dataset is designed to assess the ability of models to generate correct, functional code.
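One convenient way to inspect the problems is through the Hugging Face `datasets` library, which hosts the dataset under the id `openai_humaneval`. The snippet below is a minimal sketch assuming that distribution and its published field names (`task_id`, `prompt`, `canonical_solution`, `test`, `entry_point`).

```python
# Sketch: loading HumanEval from the Hugging Face Hub and inspecting one problem.
from datasets import load_dataset

humaneval = load_dataset("openai_humaneval", split="test")

problem = humaneval[0]
print(problem["task_id"])             # e.g. "HumanEval/0"
print(problem["prompt"])              # function signature + docstring
print(problem["canonical_solution"])  # reference implementation (function body)
print(problem["test"])                # unit tests defining check(candidate)
print(problem["entry_point"])         # name of the function to implement
```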
What is this dataset for?
- Evaluate the quality of models for automatically generating Python code, typically reported with the pass@k metric (see the sketch after this list).
- Serve as a basis for fine-tuning specialized programming models.
- Test the robustness of models in understanding and producing complex functions.
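Results on HumanEval are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The sketch below reproduces the unbiased estimator from the original HumanEval paper, where n is the number of samples generated for a problem and c the number that pass.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so some passing sample is always drawn
    # Compute C(n-c, k) / C(n, k) as a numerically stable product.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 42 of them pass -> estimated pass@10
print(pass_at_k(n=200, c=42, k=10))
```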
Can it be enriched or improved?
Yes, it is possible to add new problems or extend the unit tests to cover more cases (a sketch of the expected field layout is shown below). You can also diversify the programming languages or increase task complexity for more advanced training.
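As a rough guide, a new problem can follow the same field layout as the released JSONL file. The example below is hypothetical (the `Custom/0` task id and the `humaneval_extra.jsonl` file name are placeholders); only the field names mirror the official release.

```python
import json

# Hypothetical extra problem, mirroring the field layout of the official JSONL.
new_problem = {
    "task_id": "Custom/0",
    "prompt": 'def is_palindrome(s: str) -> bool:\n    """Return True if s reads the same forwards and backwards."""\n',
    "entry_point": "is_palindrome",
    "canonical_solution": "    return s == s[::-1]\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate('level') is True\n"
        "    assert candidate('python') is False\n"
    ),
}

with open("humaneval_extra.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(new_problem) + "\n")
```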
🔎 In summary
🧠 Recommended for
- NLP/code researchers
- AI developers
- Programming educators
🔧 Compatible tools
- Classic ML frameworks
- Python environment
- Jupyter notebooks
💡 Tip
Always execute generated code in a sandboxed environment to avoid the risks of running arbitrary code.
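A minimal sketch of this precaution is shown below, assuming the `prompt`, `test`, and `entry_point` fields described above; `run_candidate` is a hypothetical helper, and a subprocess with a timeout is only a first layer of isolation, not a full sandbox (prefer a container or VM with no network access for untrusted code).

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(prompt: str, completion: str, test: str, entry_point: str,
                  timeout: float = 5.0) -> bool:
    """Write the candidate solution and its unit tests to a temporary file
    and run them in a separate Python process with a hard timeout."""
    program = prompt + completion + "\n" + test + f"\ncheck({entry_point})\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0  # tests passed without assertion errors
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```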
Frequently Asked Questions
What is the main particularity of the HumanEval dataset?
Its problems were hand-written specifically so that they do not appear in model training data, ensuring a fair evaluation of code generation models.
How many examples does this dataset contain?
It includes 164 Python programming problems, each with unit tests.
Is it possible to add your own problems to HumanEval?
Yes, the dataset can be enriched with new problems or tests, which makes it possible to adapt the difficulty and diversity of the tasks.