APT-Eval - Detection of texts polished by AI
A text corpus for assessing the ability of AI detectors to identify human texts lightly modified by different LLMs.
Approximately 15,000 texts in CSV/JSON format, classified by polishing model, degree, and type of edit
MIT
Description
APT-Eval is a benchmark designed to analyze how AI text detectors behave on human texts that have been polished by LLMs. It includes 15,000 text samples from six domains (blogs, news, speeches, etc.), modified by five major large language models (LLMs) using two approaches: degree-based and percentage-based polishing. The goal is to simulate realistic, light-touch use of AI in human writing.
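To illustrate how such a corpus might be filtered by polishing model and strategy, here is a minimal sketch. The field names (`text`, `polisher`, `polish_type`, `degree`) are assumptions for illustration, not the dataset's actual schema:

```python
# Hypothetical records mimicking the corpus structure; field names are
# illustrative assumptions, not the dataset's documented schema.
records = [
    {"text": "...", "polisher": "gpt-4o", "polish_type": "degree", "degree": "minor"},
    {"text": "...", "polisher": "llama", "polish_type": "percentage", "degree": None},
]

def filter_samples(records, polisher=None, polish_type=None):
    """Return records matching the given polishing model and/or strategy."""
    out = []
    for r in records:
        if polisher is not None and r["polisher"] != polisher:
            continue
        if polish_type is not None and r["polish_type"] != polish_type:
            continue
        out.append(r)
    return out

print(len(filter_samples(records, polisher="gpt-4o")))  # 1
```

The same kind of filtering could be done with `datasets.Dataset.filter` once the corpus is loaded through Hugging Face Datasets.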
What is this dataset for?
- Evaluate the robustness of AI text detectors against minimal edits made by LLMs
- Compare the impact of different models (GPT-4o, Llama, DeepSeek) across several polishing strategies
- Develop new detection tools or classify hybrid texts
Can it be enriched or improved?
Yes. Other languages and textual genres (such as poetry or social media posts) could be added, and the results could be cross-referenced with human evaluations. A multilingual extension would also help strengthen analyses of detector generalization.
🔎 In summary
🧠 Recommended for
- AI detection researchers
- Textual authenticity projects
- Ethical NLP
🔧 Compatible tools
- scikit-learn
- Hugging Face Datasets
- PyTorch
- spaCy
- LLM-Detectors
💡 Tip
Use similarity scores to train adaptive detection models with variable thresholds.
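One way to act on this tip is to make the detector's decision threshold depend on how similar the polished text is to its original: the lighter the edit (higher similarity), the more sensitive the detector should be. The sketch below is a hypothetical illustration of that idea; the function names and parameter values are assumptions, not part of the dataset:

```python
# Hedged sketch: similarity-dependent decision threshold for a detector.
# "similarity" is assumed to be in [0, 1]; base and slope are illustrative.
def adaptive_threshold(similarity, base=0.5, slope=0.3):
    """Lower the decision threshold for lightly edited (high-similarity) texts."""
    return base - slope * (similarity - 0.5)

def classify(detector_score, similarity):
    """Flag a text as AI-polished when the detector score exceeds the
    similarity-dependent threshold."""
    return detector_score >= adaptive_threshold(similarity)
```

With these defaults, a detector score of 0.45 would flag a near-identical text (similarity 0.9, threshold 0.38) but not a heavily rewritten one (similarity 0.1, threshold 0.62).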
Frequently Asked Questions
Does this dataset include the original texts before editing?
Yes, the initial human texts are available in a parallel version of the dataset for direct comparison.
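Having the originals alongside the polished versions makes direct pairwise comparison straightforward. As a minimal sketch, a character-level similarity ratio from Python's standard `difflib` module can quantify how light an edit is (the two texts below are made-up examples, not dataset samples):

```python
import difflib

# Illustrative original/polished pair; not actual corpus samples.
original = "The results was surprising and nobody expected them."
polished = "The results were surprising, and nobody expected them."

# ratio() returns a similarity in [0, 1]; values near 1 indicate light polishing.
ratio = difflib.SequenceMatcher(None, original, polished).ratio()
print(round(ratio, 2))
```

Such per-pair similarity scores are exactly what the adaptive-threshold tip above relies on.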
What is the difference between the two types of “polishing”?
The “degree-based” mode applies a defined level of modification (minor, major, etc.), while the “percentage-based” mode modifies a specified percentage of the original text.
Can texts modified by GPT-4o be detected accurately in this corpus?
Not reliably: the dataset shows that even strong detectors fail on subtle modifications, particularly those made by GPT-4o.