Civil Comments - Corpus of moderated comments annotated for toxicity

The Civil Comments dataset contains a large set of moderated public comments collected between 2015 and 2017, annotated for various types of toxicity and abuse. It is used to train and evaluate automatic moderation and online civility analysis models.

Size

Approximately 2 million text comments in JSON format, with toxicity labels and types of abuse

Licence

CC0-1.0

Description

Civil Comments is a large corpus of English-language comments collected from Civil Comments, a commenting plugin used by news sites. Each comment is annotated for several types of toxicity (general toxicity, insults, threats, identity attacks, etc.).
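Because a single comment can carry several of these annotations at once, a common first step is to turn them into a multi-label target. The sketch below assumes that each label is stored as a score between 0 and 1 under the column names listed in `LABELS`, and that 0.5 is a sensible cut-off. Both are assumptions to check against the copy of the dataset you download, and the example record is hypothetical.

```python
from typing import Dict, List, Optional

# Assumed label columns; verify them against your copy of the dataset.
LABELS: List[str] = [
    "toxicity", "severe_toxicity", "obscene", "threat",
    "insult", "identity_attack", "sexual_explicit",
]


def binarize(record: Dict[str, object], threshold: float = 0.5) -> Dict[str, int]:
    """Return 1 for each label whose score reaches the threshold, else 0.

    Missing or null scores are treated as 0.
    """
    targets: Dict[str, int] = {}
    for label in LABELS:
        value: Optional[object] = record.get(label)
        score = 0.0 if value is None else float(value)
        targets[label] = int(score >= threshold)
    return targets


# Hypothetical record, for illustration only.
example = {"text": "placeholder comment", "toxicity": 0.8, "insult": 0.6, "threat": 0.0}
targets = binarize(example)
```

Lowering the threshold trades precision for recall on rare labels such as threats, so it is worth treating it as a tunable parameter rather than a fixed constant.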

What is this dataset for?

  • Training toxicity detection and automated moderation models
  • Analyzing the dynamics of hostile online interactions
  • Testing multi-class and multi-label classification systems on user-written comments

Can it be enriched or improved?

Yes. Additional annotations (e.g. emotional nuances) can be added, and the corpus can be extended with comments in other languages. Targeted cleaning, such as removing duplicates, can improve quality for certain uses.

🔎 In summary

Criterion | Evaluation
🧩 Ease of use | ⭐⭐⭐⭐✩ (Standardized and documented data)
🧼 Need for cleaning | ⭐⭐⭐✩✩ (Moderate: possible duplicates and irrelevant texts)
🏷️ Annotation richness | ⭐⭐⭐⭐✩ (Multi-criteria labels on different toxicity types)
📜 Commercial license | ✅ Yes (CC0)
👨‍💻 Beginner friendly | 🌟 Yes, widely used in NLP tutorials
🔁 Fine-tuning ready | 🎯 Well suited to training classification and moderation models
🌍 Cultural diversity | ⚡ Limited to English, but large and diverse corpus

🧠 Recommended for

  • NLP researchers
  • Moderation tool developers
  • Social media analysts

🔧 Compatible tools

  • Hugging Face Transformers
  • TensorFlow
  • PyTorch
  • spaCy

💡 Tip

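Toxic comments, and especially categories such as threats, are rare compared with non-toxic ones. Use oversampling techniques on the rare classes to balance the training set, and leave the validation and test sets untouched.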
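As a minimal sketch of this tip, the function below performs plain random oversampling for a single binary label: it duplicates randomly chosen minority examples until every class matches the size of the largest one. The function name `oversample` and the toy data are hypothetical, and this is only one of several balancing strategies (class weights are a common alternative).

```python
import random
from typing import Dict, List, Sequence, Tuple


def oversample(
    texts: Sequence[str], labels: Sequence[int], seed: int = 0
) -> Tuple[List[str], List[int]]:
    """Randomly duplicate minority-class examples until all classes are equal in size."""
    rng = random.Random(seed)

    # Group example indices by class label.
    by_class: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    target = max(len(indices) for indices in by_class.values())

    chosen: List[int] = []
    for indices in by_class.values():
        chosen.extend(indices)
        # Sample with replacement to fill the gap up to the largest class.
        chosen.extend(rng.choices(indices, k=target - len(indices)))

    rng.shuffle(chosen)
    return [texts[i] for i in chosen], [labels[i] for i in chosen]


# Toy data, for illustration only: four non-toxic examples and one toxic one.
toy_texts = ["a", "b", "c", "d", "e"]
toy_labels = [0, 0, 0, 0, 1]
balanced_texts, balanced_labels = oversample(toy_texts, toy_labels)
```

Apply this to the training split only. Oversampling before splitting would place copies of the same comment in both training and evaluation data and inflate the reported scores.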

Frequently Asked Questions

How big is the Civil Comments dataset?

It contains approximately 2 million annotated comments.

What annotations are available in this dataset?

The dataset includes labels for toxicity, severe toxicity, obscenity, insults, threats, identity attacks, and sexually explicit content.

Can this dataset be used to moderate comments in other languages?

Not directly. The dataset contains only English comments, but the same annotation methodology can be applied to comparable corpora in other languages.
