AI-Generated Essays Dataset

This dataset offers a labeled corpus of texts written by humans and generated by AI, with a binary annotation (0 = human, 1 = AI). It is designed for training synthetic-text detectors or exploring the stylistic differences between human and machine-generated writing.

Size

1,460 texts in CSV format (≈ 200 tokens each)

Licence

CC0: Public Domain

Description

The AI-Generated Essays Dataset contains 1,460 essays, a small fraction of which (about 6%) were generated by artificial intelligence. Each row holds the full text of one essay and a label indicating whether it was written by a human or an AI. The corpus can serve as a reference for training, testing, and analyzing models that distinguish the origin of a text.
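
Because the stated AI share is small (about 6% of 1,460 would be roughly 88 essays), it is worth checking the label distribution before any modelling. The sketch below is one way to do that with the Python standard library. The column names `text` and `generated` and the in-memory sample rows are assumptions made for illustration; the real CSV may use different headers.

```python
import csv
import io
from collections import Counter


def label_distribution(csv_file, label_column="generated"):
    """Return {label: (row_count, share_of_rows)} for a labelled CSV.

    `label_column` is an assumed header name; adjust it to the real file.
    """
    reader = csv.DictReader(csv_file)
    counts = Counter(int(row[label_column]) for row in reader)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# Tiny in-memory stand-in for the real file (hypothetical rows).
sample = io.StringIO(
    "text,generated\n"
    '"An essay written by a student.",0\n'
    '"Another human essay, with a comma.",0\n'
    '"A generated essay.",1\n'
)

dist = label_distribution(sample)
for label, (n, share) in dist.items():
    print(f"label {label}: {n} rows ({share:.0%})")
```

For the actual dataset, the same function can be called on a file opened with `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded CSV.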

What is this dataset for?

Train an AI-generated text detection model (TF-IDF, transformers, etc.).
Analyze the stylistic differences between human and generated language.
Create educational tools or data science challenges around the detection of synthetic text.
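
As an illustration of the first use case, here is a minimal sketch of a TF-IDF detector written with the standard library only. It pairs TF-IDF vectors with a nearest-centroid classifier, which is a choice made for brevity and not a method prescribed by the dataset; in practice a library such as scikit-learn would likely be used. The four training sentences are invented toy examples, not rows from the corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency for each term in tokenized docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf(doc, idf):
    """L2-normalised sparse TF-IDF vector; terms unseen in training are dropped."""
    tf = Counter(term for term in doc if term in idf)
    vec = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {term: v / norm for term, v in vec.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(term, 0.0) for term, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(texts, labels):
    """Fit IDF weights and one mean TF-IDF vector (centroid) per label."""
    docs = [tokenize(t) for t in texts]
    idf = fit_idf(docs)
    vectors = [tfidf(doc, idf) for doc in docs]
    centroids = {}
    for label in set(labels):
        members = [v for v, y in zip(vectors, labels) if y == label]
        total = Counter()
        for vec in members:
            total.update(vec)  # adds the weights term by term
        centroids[label] = {term: w / len(members) for term, w in total.items()}
    return idf, centroids


def predict(text, idf, centroids):
    """Return the label whose centroid is most similar to the text."""
    vec = tfidf(tokenize(text), idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Invented toy data: 0 = human, 1 = AI.
texts = [
    "honestly i think school uniforms are kind of pointless",
    "my brother and i argued about this all summer",
    "in conclusion, it is important to note that uniforms foster equality",
    "furthermore, it is important to note that technology offers numerous benefits",
]
labels = [0, 0, 1, 1]

idf, centroids = train(texts, labels)
pred_ai = predict("it is important to note that exercise offers numerous benefits", idf, centroids)
pred_human = predict("i think my brother argued about school", idf, centroids)
print(pred_ai, pred_human)
```

With about 6% of essays labelled AI, a detector that always predicts "human" would already reach roughly 94% accuracy, so per-class recall or F1 is a more informative metric than accuracy on this corpus.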

Can it be enriched or improved?

Yes. The corpus can be extended with longer texts or with texts in other languages. It is also possible to add linguistic annotations (average sentence length, lexical complexity, etc.) or to combine augmentation methods (back translation, paraphrasing, etc.) to better balance the classes.
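
The linguistic annotations mentioned above can be computed per essay with a few lines of code. The following is a rough sketch: the feature names are hypothetical, sentence splitting on `.`, `!` and `?` is a crude approximation, and the type-token ratio is used here as one simple stand-in for lexical complexity.

```python
import re


def linguistic_features(text):
    """Compute a few simple stylometric annotations for one essay."""
    # Crude sentence split on terminal punctuation; abbreviations will confuse it.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words)
    return {
        "n_sentences": len(sentences),
        # Average number of words per sentence.
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Distinct words divided by total words (type-token ratio).
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        # Average number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
    }


features = linguistic_features("The cat sat. The cat ran away!")
print(features)
```

Applied to every row, these values could be stored as extra CSV columns next to the binary label. Note that the type-token ratio depends on text length, so it is most comparable across essays of similar size.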

🔎 In summary

Criterion	Evaluation
🧩Ease of Use	⭐⭐⭐⭐☆ (very simple, ready-to-use CSV)
🧼Cleaning Required	⭐⭐⭐☆☆ (data already clean)
🏷️Annotation Richness	⭐☆☆☆☆ (limited to a single binary label)
📜Commercial License	✅ Yes (CC0)
👨‍💻Beginner-Friendly	👶 Perfect for getting started with AI text detection
🔁Reusable for Fine-Tuning	⚠️ Low volume → useful for bootstrapping or testing
🌍Cultural Diversity	🌍 Low – texts likely in English, no geographic context