Human vs LLM Text Corpus — Generated Text Detection

A comparative corpus of human-written texts and texts generated by LLMs, useful for automated content detection and stylistic analysis.

Size

Approximately 790,000 text entries, CSV format

Licence

MIT

Description

The Human vs LLM Text Corpus dataset contains over 788,000 text examples, split between human-written content and content generated by various large language models (LLMs). It is a reference resource for AI-generated text detection, text classification, and computational linguistics research.

What is this dataset for?

Train models to automatically distinguish AI-generated text from human-written text
Analyze stylistic or structural differences between the two sources
Evaluate the robustness of AI-generated text detectors in different contexts
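
The stylistic comparison mentioned above can be sketched with a few simple features. This is a minimal illustration, not the dataset's official tooling: the labels "human" and "llm", the function names, and the toy rows are assumptions, since the actual CSV column names and label values are not stated here.

```python
import re
from statistics import mean


def style_features(text):
    """Return a few simple stylometric features for one text."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        # Mean number of characters per word
        "avg_word_len": mean(len(w) for w in words) if words else 0.0,
        # Share of distinct words (lexical variety)
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        # Mean number of words per sentence
        "avg_sentence_len": len(words) / len(sentences) if sentences else 0.0,
    }


def compare_by_label(rows):
    """Average each feature per label; `rows` is an iterable of (text, label)."""
    grouped = {}
    for text, label in rows:
        grouped.setdefault(label, []).append(style_features(text))
    return {
        label: {name: mean(f[name] for f in feats) for name in feats[0]}
        for label, feats in grouped.items()
    }


# Toy rows standing in for the CSV (hypothetical labels, not real data).
sample = [
    ("I went out. It rained!", "human"),
    ("The weather is an important factor. The weather is an important topic.", "llm"),
]
summary = compare_by_label(sample)
```

In practice the rows would come from the CSV (for example via `csv.DictReader`), and the per-label averages could serve as a first sanity check before training a detector.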

Can it be enriched or improved?

Yes. Possible enhancements include adding metadata (generative model used, text length, theme) and balancing the corpus across content types. It can also be segmented by domain (scientific, creative, narrative, etc.) to refine detection models.
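
As a rough sketch of two of these enhancements, the snippet below attaches a length field to each entry and balances classes by downsampling. The keys `text`, `label`, and `n_words`, the label values, and the toy rows are hypothetical; they are not taken from the dataset's actual schema.

```python
import random
from collections import defaultdict


def add_length(rows):
    """Attach a word-count field to each row (a dict with a 'text' key)."""
    return [{**row, "n_words": len(row["text"].split())} for row in rows]


def balance(rows, key="label", seed=0):
    """Downsample every class to the size of the smallest one."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    target = min(len(g) for g in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Toy rows standing in for the CSV (hypothetical schema, not real data).
rows = [
    {"text": "Short note from a person", "label": "human"},
    {"text": "Another human sentence", "label": "human"},
    {"text": "A third one", "label": "human"},
    {"text": "Generated text sample here", "label": "llm"},
]
enriched = add_length(rows)
balanced = balance(enriched)
```

Downsampling is only one option; the same grouping step could be keyed on a domain or content-type field if such metadata were added.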

🔎 In summary

Criterion	Evaluation
🧩 Ease of use	⭐⭐⭐⭐✩ (Ready-to-use data)
🧼 Need for cleaning	⭐⭐⭐⭐⭐ (Low – already structured and well-separated data)
🏷️ Annotation richness	⭐⭐⭐✩✩ (Medium – AI/human distinction, but little context)
📜 Commercial license	✅ Yes (MIT)
👨‍💻 Beginner friendly	🌟 Very good for getting started with NLP text classification
🔁 Fine-tuning ready	🎯 Yes, ideal for binary or contrastive fine-tuning
🌍 Cultural diversity	⚠️ Variable – depends on sources, to be validated beforehand