Civil Comments - Corpus of moderated comments annotated for toxicity
The Civil Comments dataset contains a large set of moderated public comments collected between 2015 and 2017, annotated for various types of toxicity and abuse. It is used to train and evaluate automatic moderation and online civility analysis models.
Approximately 2 million text comments in JSON format, with toxicity labels and types of abuse
CC0-1.0
Description
Civil Comments is a massive corpus of comments in English from a comment plugin for news sites. Each comment is annotated for different types of toxicity (general toxicity, insults, threats, identity attacks, etc.).
What is this dataset for?
- Training toxicity detection and automated moderation models
- Analyzing the dynamics of hostile online interactions
- Testing multi-class and multi-label classification systems on long texts
Can it be enriched or improved?
Yes. Additional annotations (e.g. emotional nuances) can be added, and the corpus can be extended with comments in other languages. Targeted cleaning can also improve quality for certain uses.
🔎 In summary
🧠 Recommended for
- NLP researchers
- Moderation tool developers
- Social media analysts
🔧 Compatible tools
- Hugging Face Transformers
- TensorFlow
- PyTorch
- spaCy
💡 Tip
Use oversampling techniques for rare classes to balance the dataset during training.
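As an illustration of this tip, here is a minimal sketch of random oversampling in plain Python. It assumes the comments have already been loaded as a list of dictionaries; the `is_toxic` field, the toy rows, and the `oversample` helper are hypothetical and not part of the dataset or of any library.

```python
import random
from collections import Counter


def oversample(rows, label_key="is_toxic", seed=0):
    """Duplicate randomly chosen rows of the rarer classes until
    every class has as many rows as the largest class."""
    if not rows:
        return []
    rng = random.Random(seed)

    # Group rows by their class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    target = max(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))

    rng.shuffle(balanced)
    return balanced


# Toy data standing in for real comments (hypothetical field names).
train_rows = (
    [{"text": f"benign comment {i}", "is_toxic": 0} for i in range(8)]
    + [{"text": f"hostile comment {i}", "is_toxic": 1} for i in range(2)]
)

balanced_rows = oversample(train_rows)
class_counts = Counter(row["is_toxic"] for row in balanced_rows)
```

Oversampling should be applied to the training split only, so that evaluation data keeps the original class distribution.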
Frequently Asked Questions
How big is the Civil Comments dataset?
It contains approximately 2 million annotated comments.
What annotations are available in this dataset?
The dataset includes labels for toxicity, insults, threats, identity attacks, explicit sexual content, etc.
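For multi-label training, these annotations are typically turned into one binary target per label. The sketch below assumes each label is stored as a score between 0 and 1; the field names, the example record, and the 0.5 threshold are illustrative assumptions, so check them against the actual files before use.

```python
# Hypothetical field names for the label types listed above.
LABELS = ["toxicity", "insult", "threat", "identity_attack", "sexual_explicit"]


def to_multilabel(record, threshold=0.5):
    """Return a 0/1 vector with one entry per name in LABELS.

    A label counts as active when its score reaches the threshold;
    missing labels are treated as a score of 0.0.
    """
    return [1 if record.get(name, 0.0) >= threshold else 0 for name in LABELS]


# Made-up record for illustration only.
example = {
    "text": "example comment",
    "toxicity": 0.8,
    "insult": 0.6,
    "threat": 0.0,
    "identity_attack": 0.1,
    "sexual_explicit": 0.0,
}

vector = to_multilabel(example)
active = [name for name, flag in zip(LABELS, vector) if flag]
```

A single comment can activate several labels at once, which is why the task is multi-label rather than multi-class.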
Can this dataset be used to moderate comments in other languages?
Not directly. The dataset is English-only, but the same methodology can be applied to comparable corpora in other languages.




