Human vs LLM Text Corpus — Generated Text Detection
Text

Comparative corpus of human-written texts and texts generated by LLMs, useful for automated content detection or stylistic analysis.

Size

Approximately 790,000 text entries, CSV format

Licence

MIT

Description

The Human vs LLM Text Corpus dataset contains over 788,000 text examples, split between content written by humans and content generated by various large language models (LLMs). It serves as a resource for AI-generated text detection, text classification, and research in computational linguistics.

What is this dataset for?

  • Train models to automatically distinguish AI-generated texts from human-written ones
  • Analyze the stylistic or structural differences between the two sources
  • Evaluate the robustness of generated-text detectors in different contexts
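
As an illustration of the first use case, here is a minimal baseline sketch using scikit-learn (TF-IDF features with logistic regression). It is one common starting point, not a method prescribed by the dataset. The toy texts, the 0 = human / 1 = LLM label encoding, and the function name train_detector are assumptions made for this example; the real CSV's columns should be checked before adapting it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_detector(texts, labels):
    """Fit a TF-IDF + logistic regression baseline.

    Assumed label encoding for this sketch: 0 = human, 1 = LLM.
    """
    model = make_pipeline(
        # Word unigrams and bigrams, with sublinear term-frequency scaling.
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    return model


# Toy rows standing in for the real CSV (hypothetical content).
toy_texts = [
    "ugh missed the bus again, gonna be late lol",
    "my cat knocked the mug off the desk this morning",
    "In conclusion, it is important to note that several factors contribute.",
    "Overall, this topic highlights the importance of careful consideration.",
]
toy_labels = [0, 0, 1, 1]

detector = train_detector(toy_texts, toy_labels)

# One row per input text; columns follow detector.classes_ (here 0, then 1).
probs = detector.predict_proba(["It is important to note the following factors."])
print(probs)
```

On the full corpus, a held-out split would be needed before reading anything into the scores; four toy sentences only show that the pipeline runs.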

Can it be enriched or improved?

Yes. Possible enhancements include adding metadata (the generative model used, length, theme) or balancing the corpus across content types. It can also be segmented by domain (scientific, creative, narrative, etc.) to refine detection models.
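
Two of these enrichments, length metadata and class balancing, can be sketched with the Python standard library alone. The row layout (dictionaries with "text" and "label" keys), the label values, and both helper names are assumptions for illustration; they are not taken from the dataset's actual schema.

```python
import random
from collections import defaultdict


def add_length_metadata(rows):
    """Return copies of the rows with character and word counts added."""
    return [
        {**row, "n_chars": len(row["text"]), "n_words": len(row["text"].split())}
        for row in rows
    ]


def balance_by_label(rows, seed=0):
    """Downsample every class to the size of the smallest one."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["label"]].append(row)
    target = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], target))
    return balanced


# Hypothetical rows: three human texts, one generated text.
rows = [
    {"text": "short note from a person", "label": "human"},
    {"text": "another human sentence here", "label": "human"},
    {"text": "a third one", "label": "human"},
    {"text": "a generated paragraph", "label": "llm"},
]

enriched = add_length_metadata(rows)
balanced = balance_by_label(enriched)
print(len(balanced), [row["label"] for row in balanced])
```

Downsampling is the simplest balancing choice and discards data; on a corpus this size that is usually acceptable, but class weighting is an alternative when every example matters.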

🔎 In summary

Criterion | Evaluation
🧩 Ease of use | ⭐⭐⭐⭐✩ (Ready-to-use data)
🧼 Need for cleaning | ⭐⭐⭐⭐⭐ (Low – already structured and well-separated data)
🏷️ Annotation richness | ⭐⭐⭐✩✩ (Medium – AI/human distinction, but little context)
📜 Commercial license | ✅ Yes (MIT)
👨‍💻 Beginner friendly | 🌟 Very good for getting started with NLP classification
🔁 Fine-tuning ready | 🎯 Yes, ideal for binary or contrastive fine-tuning
🌍 Cultural diversity | ⚠️ Variable – depends on sources, to be validated beforehand

🧠 Recommended for

  • Researchers working on AI-generated text detection
  • Academic projects in NLP
  • Automatic moderation tools

🔧 Compatible tools

  • Scikit-learn
  • Hugging Face Transformers
  • OpenAI
  • spaCy

💡 Tip

Combine this dataset with public web texts to improve the generalization of an AI detection model.

Frequently Asked Questions

Is the dataset balanced between human content and generated content?

Yes, the texts are generally well distributed between humans and LLMs, making it suitable for binary classification tasks.
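
Since the balance is described only in general terms, it may be worth measuring it on the downloaded file before training. The sketch below counts label shares with the standard csv module. The column name "label" and the label values are assumptions, and the in-memory sample stands in for the real CSV.

```python
import csv
import io
from collections import Counter


def label_shares(csv_file, label_column="label"):
    """Return the share of rows per label in an open CSV file."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row[label_column] for row in reader)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Hypothetical four-row sample in place of the real file.
sample = io.StringIO(
    "text,label\n"
    "hello there,human\n"
    "see you at noon,human\n"
    "Certainly! Here is an overview.,llm\n"
    "In summary the results vary.,llm\n"
)

shares = label_shares(sample)
print(shares)
```

For the real file, the same function can be called on a handle opened with open(path, newline="", encoding="utf-8"), where path is wherever the CSV was saved.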

Are the models used to generate the texts specified?

Not always. Some texts specify their origin (ChatGPT, etc.), but this information is incomplete for others.

Can it be used as it is for supervised fine-tuning?

Yes, it is ready to use for training supervised models, especially for detection or classification tasks.
