AI-Generated Essays Dataset

This dataset offers a corpus of human-written and AI-generated texts with a binary label (0 = human, 1 = AI). It is designed for training synthetic text detectors and for exploring the stylistic differences between human and machine writing.

Size

1,460 texts in CSV format (≈ 200 tokens each)

Licence

CC0: Public Domain

Description

The AI-Generated Essays Dataset contains 1,460 essays, a small fraction (about 6%) of which were generated by artificial intelligence. Each row includes the full text and a label indicating whether it was written by a human or an AI. The corpus serves as a reference for training, testing and analyzing models that distinguish the origin of a text.
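A first sanity check on a corpus like this is to count the rows per label. The sketch below is a minimal illustration, not the dataset's official loader: the column names `text` and `generated` are assumptions, and a tiny in-memory CSV stands in for the real file.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file, label_column="generated"):
    """Count rows per label (0 = human, 1 = AI) in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row[label_column]) for row in reader)


# Tiny in-memory stand-in for the real file; the column names are assumed.
sample = io.StringIO(
    "text,generated\n"
    '"An essay written by a student.",0\n'
    '"Another human essay, with a comma.",0\n'
    '"An essay produced by a language model.",1\n'
)

counts = label_counts(sample)
total = sum(counts.values())
ai_share = counts[1] / total  # fraction of rows labelled as AI
print(counts, f"AI share: {ai_share:.1%}")
```

With the real file, passing `open(path, newline="", encoding="utf-8")` instead of the `StringIO` object should work the same way, provided the label column name is adjusted to match the actual header.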

What is this dataset for?

  • Train an AI-generated text detection model (TF-IDF, transformers, etc.)
  • Analyze the stylistic differences between human and generated language.
  • Create educational tools or data science challenges around the detection of synthetic text.
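The first use case above can be prototyped with a TF-IDF baseline. The following is a rough sketch using scikit-learn, not a tuned detector: the toy sentences and labels are invented stand-ins for the CSV columns, and the n-gram range and other settings are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data; in practice these would be the CSV's text and label columns.
texts = [
    "i think school lunches are kinda bad but my friends like them",
    "my summer was great, we went to the lake and i lost my shoe",
    "honestly homework is too much and nobody reads the feedback",
    "In conclusion, education plays a vital role in shaping society.",
    "Furthermore, technology offers numerous benefits to modern society.",
    "Overall, it is important to consider the various aspects of this topic.",
]
labels = [0, 0, 0, 1, 1, 1]  # 0 = human, 1 = AI

# TF-IDF over word unigrams and bigrams, followed by a linear classifier.
detector = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
detector.fit(texts, labels)

# Probability estimates for a new text, ordered as in detector.classes_.
probs = detector.predict_proba(["Moreover, society benefits from education."])[0]
print(dict(zip(detector.classes_, probs)))
```

On the real corpus, a stratified train/test split would be needed before drawing any conclusion, since only a small share of the essays carries the AI label.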

Can it be enriched or improved?

Yes. The corpus can be extended with longer texts or in other languages. It is also possible to add linguistic annotations (average sentence length, lexical complexity, etc.) or to combine augmentation methods (back translation, paraphrase, etc.) to better balance the classes.
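The linguistic annotations mentioned above can be approximated with a few lines of standard-library Python. This sketch assumes average sentence length in words and the type-token ratio as a crude proxy for lexical complexity; the function name and the regex-based tokenization are illustrative choices, not part of the dataset.

```python
import re


def stylometric_features(text):
    """Return simple per-text style features as a dict."""
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased alphabetic tokens (apostrophes kept for contractions).
    words = re.findall(r"[a-z']+", text.lower())
    avg_sentence_length = len(words) / len(sentences) if sentences else 0.0
    type_token_ratio = len(set(words)) / len(words) if words else 0.0
    return {
        "avg_sentence_length": avg_sentence_length,
        "type_token_ratio": type_token_ratio,
    }


features = stylometric_features("The cat sat. The cat ran away!")
print(features)
```

Each feature could be stored as an extra CSV column next to the binary label. A proper tokenizer such as spaCy's would handle abbreviations and non-English text more reliably than these regular expressions.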

🔎 In summary

Criterion | Evaluation
🧩 Ease of Use | ⭐⭐⭐⭐☆ (very simple, ready-to-use CSV)
🧼 Cleaning Required | ⭐⭐⭐☆☆ (data already clean)
🏷️ Annotation Richness | ⭐☆☆☆☆ (limited to a single binary label)
📜 Commercial License | ✅ Yes (CC0)
👨‍💻 Beginner-Friendly | 👶 Perfect for getting started with AI text detection
🔁 Reusable for Fine-Tuning | ⚠️ Low volume → useful for bootstrapping or testing
🌍 Cultural Diversity | 🌍 Low – texts likely in English, no geographic context

🧠 Recommended for

  • NLP trainers
  • Data science students
  • Light AI detection projects

🔧 Compatible tools

  • Scikit-learn
  • spaCy
  • BERT
  • SHAP
  • LIME

💡 Tip

To compensate for the class imbalance, apply SMOTE oversampling to the vectorized features (not to the raw text) or use class weighting in the loss function.
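The weighting option can be illustrated with the common "balanced" heuristic, where each class receives the weight n_samples / (n_classes × class_count). This is the same formula scikit-learn applies for `class_weight="balanced"`. The label list below is a hypothetical 100-row sample with 6% AI texts, chosen only to mirror the proportion described above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical sample: 94 human texts (0) and 6 AI texts (1).
labels = [0] * 94 + [1] * 6
weights = balanced_class_weights(labels)
print(weights)
```

The rare class ends up with a much larger weight, so each misclassified AI text costs more in the loss. The resulting dictionary can be passed as `class_weight` to scikit-learn estimators that accept it, such as `LogisticRegression`.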

Frequently Asked Questions

Is this dataset sufficient to train a reliable AI detector?

It is suitable for prototyping experiments or educational projects, but a larger volume will be required for production.

Can it be adapted to other languages?

Yes, it is possible to translate it or create multilingual versions by generating AI texts in the desired language.

Can it be used for supervised training?

Absolutely, each example is annotated with a binary class (0 = human, 1 = AI), making it an ideal base for supervised learning.
