Human vs LLM Text Corpus — Generated Text Detection
A comparative corpus of human-written texts and texts generated by large language models (LLMs), useful for automated content detection and stylistic analysis.
Description
The Human vs LLM Text Corpus dataset contains over 788,000 text samples, split between content written by humans and content generated by various large language models (LLMs). It is a reference resource for AI-generated text detection, classification, and research in computational linguistics.
What is this dataset for?
- Train models to automatically detect AI-generated vs human-written text (a minimal training sketch follows this list)
- Analyze stylistic or structural differences between the two sources
- Evaluate the robustness of AI-text detectors across different contexts
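A minimal baseline detector could look like the sketch below, using scikit-learn (listed under compatible tools). The file name and the "text" / "source" column names are assumptions about the export format, not guarantees from the dataset itself.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Load the corpus; file and column names ("text", "source") are placeholders.
df = pd.read_csv("human_vs_llm_corpus.csv")
df["label"] = (df["source"] != "Human").astype(int)  # 1 = LLM-generated, 0 = human

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# Character n-grams tend to be a robust signal for style-based detection.
detector = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), max_features=50_000),
    LogisticRegression(max_iter=1000),
)
detector.fit(X_train, y_train)
print(classification_report(y_test, detector.predict(X_test)))
```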
Can it be enriched or improved?
Yes. Possible enhancements include adding metadata (generating model, length, theme) or rebalancing the corpora by content type. It can also be segmented by domain (scientific, creative, narrative, ...) to refine detection models, as sketched below.
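One possible enrichment workflow, sketched with pandas. The column names, the keyword-based domain tagging, and the per-group sample size are all illustrative assumptions; a real pipeline would likely use a proper topic classifier.

```python
import pandas as pd

# Column names ("text", "source") and the domain keywords are illustrative assumptions.
df = pd.read_csv("human_vs_llm_corpus.csv")

# Add simple length metadata.
df["n_chars"] = df["text"].str.len()
df["n_words"] = df["text"].str.split().str.len()

def tag_domain(text: str) -> str:
    """Naive keyword-based domain tagging; replace with a topic model if needed."""
    lowered = text.lower()
    if any(k in lowered for k in ("hypothesis", "experiment", "results")):
        return "scientific"
    if any(k in lowered for k in ("once upon", "she said", "he said")):
        return "narrative"
    return "other"

df["domain"] = df["text"].map(tag_domain)

# Rebalance: cap each (domain, source) group at the same sample size.
balanced = (
    df.groupby(["domain", "source"], group_keys=False)
      .apply(lambda g: g.sample(min(len(g), 5_000), random_state=42))
)
print(balanced.groupby(["domain", "source"]).size())
```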
🔎 In summary
🧠 Recommended for
- Researchers working on AI-generated text detection
- Academic projects in NLP
- Automatic moderation tools
🔧 Compatible tools
- Scikit-learn
- Hugging Face Transformers
- OpenAI
- spaCy
💡 Tip
Combine this dataset with public web texts to improve the generalization of an AI detection model.
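A sketch of that combination, assuming both sources can be exported to CSV with a "text" column; file names and the label convention (0 = human, 1 = LLM) are placeholders.

```python
import pandas as pd

# File names and column names are placeholders for your own exports.
corpus = pd.read_csv("human_vs_llm_corpus.csv")
corpus["label"] = (corpus["source"] != "Human").astype(int)

web_texts = pd.read_csv("public_web_texts.csv")   # extra human-written web text
web_texts = web_texts.assign(label=0)             # 0 = human

combined = (
    pd.concat([corpus[["text", "label"]], web_texts[["text", "label"]]],
              ignore_index=True)
      .drop_duplicates(subset="text")
      .sample(frac=1.0, random_state=42)           # shuffle before training
)
print(combined["label"].value_counts())
```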
Frequently Asked Questions
Is the dataset balanced between human content and generated content?
Yes, the texts are fairly evenly distributed between human and LLM sources, which makes the corpus suitable for binary classification tasks. A quick balance check is sketched below.
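A minimal check of the class balance, assuming a "source" column (adjust to the actual schema):

```python
import pandas as pd

# The "source" column name is an assumption about the export format.
df = pd.read_csv("human_vs_llm_corpus.csv")
print(df["source"].value_counts())
print("Human share:", round((df["source"] == "Human").mean(), 3))
```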
Are the models used to generate the texts specified?
Not always. Some texts indicate the model that produced them (ChatGPT, etc.), but this information can be incomplete depending on the sample.
Can it be used as-is for supervised fine-tuning?
Yes, it is ready to use for training supervised models, especially for detection or classification tasks (see the fine-tuning sketch below).
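A minimal fine-tuning sketch with Hugging Face Transformers (listed under compatible tools). The base model, file name, and column names are assumptions; labels follow the same convention as above (0 = human, 1 = LLM-generated).

```python
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# File and column names are placeholders for your own export of the corpus.
df = pd.read_csv("human_vs_llm_corpus.csv")
df["label"] = (df["source"] != "Human").astype(int)
dataset = Dataset.from_pandas(df[["text", "label"]]).train_test_split(
    test_size=0.1, seed=42
)

model_name = "distilbert-base-uncased"  # assumed base model; any encoder works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="detector",
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```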