
APT-Eval - Detection of texts reworked by AI

Text corpus to assess the ability of AI detectors to identify human texts slightly modified by different LLMs.

Size

Approximately 15,000 texts in CSV/JSON format, labeled by polisher, editing degree, and polishing type

License

MIT

Description

APT-Eval is a new benchmark designed to analyze how AI text detectors behave on human-written texts that have been lightly reworked by AI. It contains about 15,000 text samples from six domains (blogs, news, speeches, etc.), each polished by one of five large language models (LLMs) using two approaches: degree-based polishing (a named level of modification) and percentage-based polishing (a bounded share of the original text). The goal is to simulate a realistic scenario of light AI assistance in human writing.

What is this dataset for?

  • Evaluate the robustness of AI text detectors against minimal edits made by LLMs
  • Compare the impact of different polishers (GPT-4o, Llama, DeepSeek) across polishing strategies
  • Develop new detection tools or classifiers for hybrid human-AI texts (see the loading sketch below)
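
To get started with any of these tasks, a minimal loading-and-inspection sketch in Python might look like the following. The file name (apt_eval.csv) and column names (polisher, polish_type, text) are assumptions for illustration, not the dataset's documented schema; check the actual columns after downloading.

```python
import pandas as pd

# Hypothetical file and column names; adjust to the dataset's real schema.
df = pd.read_csv("apt_eval.csv")

# How many samples did each LLM polisher contribute?
print(df["polisher"].value_counts())

# Peek at one degree-based sample per polisher
for polisher, group in df.groupby("polisher"):
    sample = group[group["polish_type"] == "degree-based"].head(1)
    print(polisher, sample["text"].str.slice(0, 80).tolist())
```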

Can it be enriched or improved?

Yes. Other languages or textual genres (such as poetry or social media posts) could be added, and the results could be cross-referenced with human evaluations. A multilingual extension would also help strengthen analyses of detector generalization.

🔎 In summary

Criterion | Evaluation
🧩 Ease of use | ⭐⭐⭐⭐⭐ (High: well structured and labeled)
🧼 Need for cleaning | ⭐⭐⭐⭐⭐ (No cleaning needed)
🏷️ Annotation richness | ⭐⭐⭐⭐⭐ (Very detailed: semantic score, textual distances, polisher, etc.)
📜 Commercial license | ✅ Yes (MIT)
👨‍💻 Beginner friendly | 🌟 Yes: easy to load, clear, and useful for NLP
🔁 Fine-tuning ready | ⚡ Yes, especially for AI detectors and classification tasks
🌍 Cultural diversity | ⚠️ Medium: mainly English, but varied genres

🧠 Recommended for

  • AI detection researchers
  • Textual authenticity projects
  • Ethical NLP

🔧 Compatible tools

  • scikit-learn
  • Hugging Face Datasets (loading example below)
  • PyTorch
  • spaCy
  • LLM detectors
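
The Hugging Face datasets library can read a CSV export of the corpus directly. This is a short sketch under the assumption of a local file named apt_eval.csv; the library calls themselves are standard.

```python
from datasets import load_dataset

# Load a local CSV export of the corpus (file name is a placeholder)
ds = load_dataset("csv", data_files="apt_eval.csv", split="train")
print(ds.column_names)  # inspect the actual schema
print(ds[0])            # first sample
```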

💡 Tip

Use similarity scores to train adaptive detection models with variable thresholds.
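
One way to act on this tip is to pick a separate decision threshold per similarity band rather than a single global cutoff. The sketch below assumes hypothetical columns (similarity, detector_score, label) holding a human-to-polished similarity score, a detector's probability output, and a 0/1 ground-truth label; none of these names come from the dataset itself.

```python
import numpy as np
import pandas as pd

# Hypothetical file of detector outputs joined with the corpus labels
df = pd.read_csv("apt_eval_scored.csv")

# Bin samples by how close the polished text stayed to the original
df["sim_band"] = pd.cut(df["similarity"], bins=[0.0, 0.9, 0.97, 1.0])

for band, group in df.groupby("sim_band", observed=True):
    # Choose, per band, the threshold that maximizes accuracy
    thresholds = np.linspace(0, 1, 101)
    accs = [((group["detector_score"] > t) == group["label"]).mean()
            for t in thresholds]
    best = thresholds[int(np.argmax(accs))]
    print(f"{band}: threshold={best:.2f}, accuracy={max(accs):.3f}")
```

Note that choosing thresholds on the same data you evaluate on will overfit; in practice, fit them on a held-out split.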

Frequently Asked Questions

Does this dataset include the original texts before editing?

Yes, the original human texts are available in a parallel version of the dataset for direct comparison.

What is the difference between the two types of “polishing”?

The “degree-based” mode applies a defined level of modification (minor, major, etc.), while the “percentage-based” mode constrains the edit to a specific percentage of the original text (see the filtering example below).
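
To study the two modes separately, a simple filter works. This assumes a hypothetical polish_type column with values such as "degree-based" and "percentage-based"; the real field names may differ.

```python
import pandas as pd

df = pd.read_csv("apt_eval.csv")  # placeholder file name

degree = df[df["polish_type"] == "degree-based"]
percentage = df[df["polish_type"] == "percentage-based"]
print(len(degree), len(percentage))
```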

Can texts modified by GPT-4o be accurately detected in this corpus?

Not reliably. The dataset shows that even strong detectors fail against subtle modifications, particularly those made by GPT-4o.
