Civil Comments - Corpus of moderated comments annotated for toxicity

The Civil Comments dataset contains a large set of moderated public comments collected between 2015 and 2017, annotated for various types of toxicity and abuse. It is used to train and evaluate automatic moderation and online civility analysis models.

Size

Approximately 2 million text comments in JSON format, each with labels for toxicity and types of abuse

License

CC0-1.0

Description

Civil Comments is a large corpus of English-language comments collected through a commenting plugin for news sites. Each comment is annotated for several types of toxicity (general toxicity, insults, threats, identity attacks, etc.).
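
Because one comment can carry several toxicity types at once, the annotations are naturally handled as multi-label targets. The sketch below assumes each label is stored as a score between 0 and 1 and uses field names found in commonly distributed releases (toxicity, severe_toxicity, obscene, threat, insult, identity_attack, sexual_explicit). Both the field names and the 0.5 threshold are assumptions to check against the file you actually download.

```python
# Assumed label fields; the actual names depend on the release you download.
LABEL_FIELDS = [
    "toxicity", "severe_toxicity", "obscene", "threat",
    "insult", "identity_attack", "sexual_explicit",
]


def binarize_labels(record, threshold=0.5):
    """Return the set of labels whose score meets the (assumed) threshold."""
    return {
        name for name in LABEL_FIELDS
        if float(record.get(name, 0.0)) >= threshold
    }


# Hypothetical record, made up for illustration only.
example = {"text": "You are an idiot.", "toxicity": 0.83, "insult": 0.8, "threat": 0.0}
print(sorted(binarize_labels(example)))  # ['insult', 'toxicity']
```

Lowering the threshold trades precision for recall, which may matter for rare labels such as threats.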

What is this dataset for?

Training toxicity detection and automated moderation models
Analyzing the dynamics of hostile online interactions
Testing multi-class and multi-label classification systems on long texts
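
When testing a multi-label system on this kind of data, plain accuracy can hide poor performance on rare labels, so a per-label score is often more informative. The following is a minimal sketch of per-label and macro-averaged F1 in plain Python. The function name and the toy gold and predicted label sets are hypothetical and not taken from the dataset.

```python
def per_label_f1(gold, predicted, labels):
    """Compute F1 separately for each label over parallel lists of label sets."""
    scores = {}
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if label in g and label in p)
        fp = sum(1 for g, p in pairs if label not in g and label in p)
        fn = sum(1 for g, p in pairs if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # By convention here, a label with no gold or predicted hits scores 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy data, invented for illustration.
gold = [{"toxicity", "insult"}, set(), {"toxicity"}, {"threat"}]
pred = [{"toxicity"}, set(), {"toxicity", "insult"}, set()]

scores = per_label_f1(gold, pred, ["toxicity", "insult", "threat"])
macro_f1 = sum(scores.values()) / len(scores)
print(scores, round(macro_f1, 3))
```

Macro averaging weights every label equally, so a model that ignores threats or identity attacks is penalized even when general toxicity is predicted well.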

Can it be enriched or improved?

Yes. Additional annotations (e.g. emotional nuances) can be added, and the corpus can be extended with comments in other languages. Targeted cleaning can also improve quality for certain uses.
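
As one possible form of targeted cleaning, the sketch below drops near-empty comments and removes duplicates that differ only in case or whitespace. It assumes each record is a dictionary with a "text" field. That field name, the helper names, and the 3-character minimum are illustrative assumptions, not properties of the dataset.

```python
import re
import unicodedata


def normalize(text):
    """Apply Unicode NFKC, lowercase, and collapse whitespace runs."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def clean_comments(records, min_chars=3):
    """Drop near-empty comments and keep the first of each normalized duplicate."""
    seen = set()
    kept = []
    for record in records:
        key = normalize(record.get("text", ""))
        if len(key) < min_chars or key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept


# Toy records, invented for illustration.
sample = [
    {"text": "Great article!"},
    {"text": "great   article!"},
    {"text": "  "},
    {"text": "I disagree with the author."},
]
cleaned = clean_comments(sample)
print(len(cleaned))  # 2
```

Exact-match deduplication after normalization is deliberately conservative. Catching paraphrased or lightly edited duplicates would need a fuzzier similarity measure.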

🔎 In summary

Criterion	Evaluation
🧩 Ease of use	⭐⭐⭐⭐✩ (Standardized and documented data)
🧼 Need for cleaning	⭐⭐⭐✩✩ (Moderate – possible duplicates and irrelevant texts)
🏷️ Annotation richness	⭐⭐⭐⭐✩ (Multi-criteria labels on different toxicity types)
📜 Commercial license	✅ Yes (CC0)
👨‍💻 Beginner friendly	🌟 Yes, widely used in NLP tutorials
🔁 Fine-tuning ready	🎯 Perfect for training classification and moderation models
🌍 Cultural diversity	⚡ Limited to English, but large and diverse corpus