APT-Eval - Detection of texts polished by AI
A text corpus for assessing the ability of AI detectors to identify human texts lightly modified by different LLMs.
Approximately 15,000 texts in CSV/JSON format, classified by polishing model, degree, and type of edit
MIT
Description
APT-Eval is a benchmark designed to analyze how AI text detectors behave on human texts that have been polished by LLMs. It includes 15,000 text samples from six domains (blogs, news, speeches, etc.), modified by five major large language models (LLMs) using two approaches: degree-based and percentage-based polishing. The goal is to simulate realistic, light-touch use of AI in human writing.
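To illustrate how such a corpus might be filtered by polishing model and strategy, here is a minimal sketch. The field names (`text`, `polisher`, `polish_type`, `degree`) are assumptions for illustration, not the dataset's actual schema:

```python
# Hypothetical records mimicking the corpus structure; field names are
# illustrative assumptions, not the dataset's documented schema.
records = [
    {"text": "...", "polisher": "gpt-4o", "polish_type": "degree", "degree": "minor"},
    {"text": "...", "polisher": "llama", "polish_type": "percentage", "degree": None},
]

def filter_samples(records, polisher=None, polish_type=None):
    """Return records matching the given polishing model and/or strategy."""
    out = []
    for r in records:
        if polisher is not None and r["polisher"] != polisher:
            continue
        if polish_type is not None and r["polish_type"] != polish_type:
            continue
        out.append(r)
    return out

print(len(filter_samples(records, polisher="gpt-4o")))  # 1
```

The same kind of filtering could be done with `datasets.Dataset.filter` once the corpus is loaded through Hugging Face Datasets.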
What is this dataset for?
- Evaluate the robustness of AI text detectors against minimal edits made by LLMs
- Compare the impact of different models (GPT-4o, Llama, DeepSeek) across several polishing strategies
- Develop new detection tools or classify hybrid texts
Can it be enriched or improved?
Yes. Other languages and textual genres (such as poetry or social media posts) could be added, and the results could be cross-referenced with human evaluations. A multilingual extension would also help strengthen analyses of detector generalization.
🔎 In summary
🧠 Recommended for
- AI detection researchers
- Textual authenticity projects
- Ethical NLP
🔧 Compatible tools
- scikit-learn
- Hugging Face Datasets
- PyTorch
- spaCy
- LLM-Detectors
💡 Tip
Use similarity scores to train adaptive detection models with variable thresholds.
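One way to act on this tip is to make the detector's decision threshold depend on how similar the polished text is to its original: the lighter the edit (higher similarity), the more sensitive the detector should be. The sketch below is a hypothetical illustration of that idea; the function names and parameter values are assumptions, not part of the dataset:

```python
# Hedged sketch: similarity-dependent decision threshold for a detector.
# "similarity" is assumed to be in [0, 1]; base and slope are illustrative.
def adaptive_threshold(similarity, base=0.5, slope=0.3):
    """Lower the decision threshold for lightly edited (high-similarity) texts."""
    return base - slope * (similarity - 0.5)

def classify(detector_score, similarity):
    """Flag a text as AI-polished when the detector score exceeds the
    similarity-dependent threshold."""
    return detector_score >= adaptive_threshold(similarity)
```

With these defaults, a detector score of 0.45 would flag a near-identical text (similarity 0.9, threshold 0.38) but not a heavily rewritten one (similarity 0.1, threshold 0.62).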
Frequently Asked Questions
Does this dataset include the original texts before editing?
Yes, the initial human texts are available in a parallel version of the dataset for direct comparison.
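Having the originals alongside the polished versions makes direct pairwise comparison straightforward. As a minimal sketch, a character-level similarity ratio from Python's standard `difflib` module can quantify how light an edit is (the two texts below are made-up examples, not dataset samples):

```python
import difflib

# Illustrative original/polished pair; not actual corpus samples.
original = "The results was surprising and nobody expected them."
polished = "The results were surprising, and nobody expected them."

# ratio() returns a similarity in [0, 1]; values near 1 indicate light polishing.
ratio = difflib.SequenceMatcher(None, original, polished).ratio()
print(round(ratio, 2))
```

Such per-pair similarity scores are exactly what the adaptive-threshold tip above relies on.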
What is the difference between the two types of “polishing”?
The “degree-based” mode applies a defined level of modification (minor, major, etc.), while the “percentage-based” mode modifies a specified percentage of the original text.
Can texts modified by GPT-4o be detected accurately in this corpus?
Not reliably: the dataset shows that even strong detectors fail on subtle modifications, particularly those made by GPT-4o.