AI-Generated Essays Dataset
This dataset offers a labeled corpus of texts written by humans and generated by AI, with binary annotation (0 = human, 1 = AI). The classes are not balanced: AI-generated texts make up only a small share of the corpus. It is designed to train synthetic text detectors or to explore the stylistic differences between human and machine-generated writing.
Description
The AI-Generated Essays Dataset contains 1,460 essays, a small fraction (about 6%) of which were generated by artificial intelligence. Each row includes the full text and a label indicating whether it was written by a human or an AI. The corpus can serve as a reference for training, testing and analyzing models that distinguish the origin of a text.
What is this dataset for?
- Train an AI-generated text detection model (TF-IDF, transformers, etc.).
- Analyze the stylistic differences between human and generated language.
- Create educational tools or data science challenges around the detection of synthetic text.
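As an illustration of the first use case, here is a minimal sketch of a TF-IDF baseline detector built with Scikit-learn. The four toy sentences and the variable names (`texts`, `labels`, `detector`) are placeholders invented for this example, not rows or column names from the dataset; in practice you would load the essay texts and their 0/1 labels instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in rows (assumption): replace with the dataset's texts and labels.
texts = [
    "I went to the market and honestly forgot half my list.",
    "My grandmother told us stories every winter evening.",
    "In conclusion, it is important to note that technology plays a vital role.",
    "Overall, it is important to note that education plays a crucial role.",
]
labels = [0, 0, 1, 1]  # 0 = human, 1 = AI

# TF-IDF over unigrams and bigrams, followed by a linear classifier.
# class_weight="balanced" reweights classes inversely to their frequency,
# which matters when AI essays are a small minority.
detector = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
detector.fit(texts, labels)

# Predicted labels for the training texts, and the estimated probability
# of the AI class for one unseen sentence.
predictions = detector.predict(texts)
ai_probability = detector.predict_proba(
    ["It is important to note that science plays a key role."]
)[:, 1]
```

On the real corpus, hold out a stratified test split before fitting so that the few AI examples appear in both training and evaluation data.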
Can it be enriched or improved?
Yes. The corpus can be extended with longer texts or in other languages. It is also possible to add linguistic annotations (average sentence length, lexical complexity, etc.) or to combine augmentation methods (back translation, paraphrase, etc.) to better balance the classes.
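The linguistic annotations mentioned above could be computed along these lines. This is a rough sketch using only the Python standard library: the function name `stylometric_features`, the regex-based sentence splitting, and the use of type-token ratio as a proxy for lexical complexity are all assumptions made for illustration, and a proper tokenizer would be more robust.

```python
import re

def stylometric_features(text: str) -> dict:
    """Compute a few simple stylistic features for one essay."""
    # Naive sentence split on ., ! and ? (assumption: good enough for a sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased alphabetic tokens, keeping apostrophes inside words.
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words)
    return {
        # Average number of words per sentence.
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Distinct words divided by total words (crude lexical-diversity proxy).
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        # Average number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
    }

features = stylometric_features("The cat sat. The cat ran away!")
print(features["avg_sentence_length"])  # 3.5 (7 words over 2 sentences)
```

Features like these can be stored as extra columns next to the text and label, then compared between the two classes or fed to a classifier.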
🔎 In summary
🧠 Recommended for
- NLP trainers
- Data science students
- Light AI detection projects
🔧 Compatible tools
- Scikit-learn
- spaCy
- BERT
- SHAP
- LIME
💡 Tip
To compensate for the class imbalance, apply SMOTE oversampling to the vectorized features (SMOTE works on numeric vectors, not on raw text) or use class weighting in the loss function.
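One common way to set loss weights is the "balanced" heuristic, where each class gets the weight n_samples / (n_classes * class_count); this is the same formula Scikit-learn applies for `class_weight="balanced"`. The sketch below computes it in plain Python. The helper name `balanced_class_weights` and the 94/6 label split are illustrative assumptions that only mimic the roughly 6% AI share, not the dataset's actual counts.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Illustrative labels (assumption): 94 human essays, 6 AI essays.
labels = [0] * 94 + [1] * 6
weights = balanced_class_weights(labels)

# The minority class (1 = AI) receives the larger weight.
print(round(weights[0], 3), round(weights[1], 3))  # 0.532 8.333
```

The resulting dictionary can be passed to estimators that accept per-class weights, or used to scale each example's loss term during training.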
Frequently Asked Questions
Is this dataset sufficient to train a reliable AI detector?
It is suitable for prototyping experiments or educational projects, but a larger volume will be required for production.
Can it be adapted to other languages?
Yes, it is possible to translate it or to create multilingual versions by generating AI texts in the desired language.
Can it be used for supervised training?
Absolutely, each example is annotated with a binary class (0 = human, 1 = AI), making it an ideal base for supervised learning.