GoEmotions

GoEmotions is a text dataset of Reddit comments annotated with 27 emotion categories plus a neutral label. It makes it possible to train models on nuanced emotions expressed in a real-world context.

Size

Approximately 58,000 plain text examples with multi-label annotations (JSON)

Licence

Apache 2.0

Description

GoEmotions is a dataset built from Reddit comments that are manually annotated to identify the emotion expressed. Each entry can be associated with several emotions among 27 distinct categories or be neutral. It is a rich corpus for emotional classification, with complex and realistic cases.
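
To make the multi-label structure concrete, here is a minimal sketch that turns one annotated comment into a multi-hot vector. The record layout (`text` and `labels` keys) and the example comment are assumptions for illustration, not the dataset's documented schema. The label names are the 27 GoEmotions emotions plus neutral as commonly listed, so check them against the label file in your own download.

```python
# Assumed label inventory: 27 emotions plus "neutral" (verify against your copy).
EMOTIONS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise",
]
LABELS = EMOTIONS + ["neutral"]
LABEL_TO_INDEX = {name: i for i, name in enumerate(LABELS)}


def to_multi_hot(labels):
    """Return a 0/1 vector with one position per entry in LABELS."""
    vector = [0] * len(LABELS)
    for name in labels:
        vector[LABEL_TO_INDEX[name]] = 1  # raises KeyError on an unknown label
    return vector


# Hypothetical record; the real field names may differ.
record = {"text": "Thanks so much, this made my day!", "labels": ["gratitude", "joy"]}
vector = to_multi_hot(record["labels"])
print(sum(vector), len(vector))  # 2 28
```

A vector like this is the usual target format for multi-label classifiers, where each position is predicted independently.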

What is this dataset for?

Train emotion-detection models from text
Develop empathetic chatbots or more human-like virtual assistants
Improve automatic moderation and the detection of sensitive speech
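
A model trained on multi-label data such as GoEmotions typically outputs one score per label rather than a single class. The sketch below shows one common way to decode such scores: keep every label at or above a threshold and fall back to neutral when none qualifies. The scores, the 0.5 threshold, and the fallback rule are illustrative assumptions, not something the dataset prescribes.

```python
def decode_labels(scores, threshold=0.5, fallback="neutral"):
    """Keep labels whose score reaches the threshold, highest score first."""
    kept = [name for name, score in scores.items() if score >= threshold]
    kept.sort(key=lambda name: scores[name], reverse=True)
    return kept or [fallback]


# Hypothetical per-label scores, e.g. sigmoid outputs of a classifier.
example_scores = {"joy": 0.62, "gratitude": 0.91, "anger": 0.03}
print(decode_labels(example_scores))   # ['gratitude', 'joy']
print(decode_labels({"anger": 0.2}))   # ['neutral']
```

In practice the threshold is usually tuned on a validation split, sometimes per label, since rare emotions tend to receive lower scores.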

Can it be enriched or improved?

Yes. You can supplement the dataset with other sources of social media comments, or translate it into other languages. You can also add conversational context or attach metadata (e.g. subreddit) to each comment to refine emotion models. Additional annotations, such as emotional intensity, could be incorporated as well.
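
As a rough sketch of this kind of enrichment, the snippet below attaches optional metadata to each record by comment ID. All field names (`id`, `subreddit`, `intensity`), the sample records, and the metadata lookup are hypothetical; adapt them to whatever schema your copy of the data and your extra annotations actually use.

```python
def enrich(records, metadata_by_id):
    """Return copies of the records with optional metadata fields attached."""
    enriched = []
    for record in records:
        extra = metadata_by_id.get(record["id"], {})
        enriched.append({
            **record,
            "subreddit": extra.get("subreddit"),  # None when unknown
            "intensity": extra.get("intensity"),  # None until annotated
        })
    return enriched


# Hypothetical records and metadata, for illustration only.
records = [
    {"id": "c1", "text": "Thanks so much!", "labels": ["gratitude"]},
    {"id": "c2", "text": "Ugh, not again.", "labels": ["annoyance"]},
]
metadata_by_id = {"c1": {"subreddit": "example_subreddit", "intensity": 0.8}}

enriched = enrich(records, metadata_by_id)
print(enriched[0]["subreddit"], enriched[1]["intensity"])  # example_subreddit None
```

Keeping missing values as `None` rather than dropping the record preserves the original examples while making it clear which ones still lack the extra annotation.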

🔎 In summary

Criterion	Evaluation
🧩Ease of Use	⭐⭐⭐⭐☆ (Clear JSON format with explicit labels)
🧼Cleaning Required	⭐⭐⭐⭐⭐ (Very low, ready-to-use data)
🏷️Annotation Richness	⭐⭐⭐⭐☆ (Multi-label with 27 emotion categories plus neutral)
📜Commercial License	✅ Yes (Apache 2.0)
👨‍💻Ideal for Beginners	👩‍💻 Highly suitable, well-documented dataset
🔁Reusable for Fine-Tuning	🔥 Excellent base for emotion models
🌍Cultural Diversity	🌐 Moderate, English only with Reddit bias

🧠 Recommended for

Emotion detection projects
Conversational assistants
Social NLP research

🔧 Compatible tools

Hugging Face Transformers
Scikit-learn
PyTorch
TensorFlow
spaCy

💡 Tip

First train a model on GoEmotions, then fine-tune it on data specific to your domain (e.g. customer service conversations or forum posts).

Frequently Asked Questions

Does the GoEmotions dataset cover multiple languages?

No, it is entirely in English, but it can be translated manually or automatically for multilingual use cases.

Can GoEmotions be used in commercial projects?

Yes, the Apache 2.0 license allows commercial use, subject to compliance with the standard license terms.

Does this dataset contain biases?

Yes, like any social media data, it may contain biases related to Reddit and its users. It is important to take this into account when interpreting the results.

Similar datasets

MMLU

RAVDESS

MIMIC-III