SMS Spam Collection

Public dataset containing 5,574 SMS messages labeled spam or legitimate (ham), collected from various sources for SMS filtering research.

Size

5,574 SMS messages, plain text format (TXT/CSV)

Licence

CC BY 4.0

Description

The SMS Spam Collection dataset contains 5,574 SMS messages, each labeled as spam or ham (not spam). The messages were collected from several sources, including academic forums and corpora, which makes the dataset a solid basis for spam classification and filtering research.

What is this dataset for?

  • Training text classification algorithms for spam filtering
  • Research on natural language processing (NLP) applied to SMS
  • Evaluation of clustering and text analysis techniques
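
As a rough illustration of the first use case, here is a minimal sketch of a spam/ham classifier: a multinomial Naive Bayes model with add-one smoothing, written with the Python standard library only so that it stays self-contained. The class name `NaiveBayesSpamFilter` and the four training messages are made up for the example and are not taken from the dataset; in practice a library such as Scikit-learn would usually replace this hand-written model.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def fit(self, texts, labels):
        # Number of training messages per class (used for the class prior).
        self.class_counts = Counter(labels)
        # Token frequencies per class.
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        # Vocabulary shared by all classes.
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Start from the log prior, then add log likelihoods per token.
            score = math.log(doc_count / total_docs)
            for token in text.lower().split():
                if token in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy messages, standing in for the real labeled data.
texts = [
    "win a free prize now",
    "free entry claim your prize",
    "are we still on for lunch",
    "call me when you get home",
]
labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayesSpamFilter().fit(texts, labels)
prediction_spam = model.predict("claim your free prize")
prediction_ham = model.predict("see you at lunch")
```

With real data, a held-out test split is needed to judge how well the model performs; four toy messages only show the mechanics.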

Can it be enriched or improved?

Yes. It can be extended with more recent SMS data, with manual annotation of ambiguous messages, or with metadata (time, origin) to improve model performance.

🔎 In summary

Criterion: Evaluation
🧩 Ease of Use: ⭐⭐⭐⭐☆ (Simple, standard text format)
🧼 Cleaning Required: ⭐⭐⭐⭐☆ (Low to moderate – some duplicates and encoding issues to check)
🏷️ Annotation Richness: ⭐⭐⭐☆☆ (Basic – only spam/ham labels)
📜 Commercial License: ✅ Yes (CC BY 4.0)
👨‍💻 Beginner Friendly: 👍 Perfect for introduction to text classification
🔁 Reusable for Fine-Tuning: 🔥 Suitable for classic NLP models and fine-tuning
🌍 Cultural Diversity: 🌍 Mostly English messages, diverse sources
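
The duplicates and encoding issues mentioned under "Cleaning Required" can be checked with a few lines of Python. The sketch below assumes the data has already been loaded as `(label, text)` pairs; the helper names `find_duplicates` and `has_replacement_char` and the three sample rows are hypothetical.

```python
from collections import Counter


def find_duplicates(rows):
    """Return messages that occur more than once, ignoring case and extra whitespace."""
    counts = Counter(" ".join(text.lower().split()) for _, text in rows)
    return {text: n for text, n in counts.items() if n > 1}


def has_replacement_char(text):
    # U+FFFD usually signals that bytes were decoded with the wrong encoding.
    return "\ufffd" in text


# Made-up rows, standing in for the real data.
rows = [
    ("ham", "Ok lar"),
    ("ham", "ok  lar"),
    ("spam", "Free entry \ufffd1000 prize"),
]

duplicates = find_duplicates(rows)
suspect = [text for _, text in rows if has_replacement_char(text)]
```

Whether to drop duplicates depends on the goal: repeated spam messages may be a real signal, but they can also leak between training and test splits.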

🧠 Recommended for

  • Junior data scientists
  • NLP researchers
  • Anti-spam application developers

🔧 Compatible tools

  • Scikit-learn
  • NLTK
  • TensorFlow
  • PyTorch
  • spaCy

💡 Tip

Consider preprocessing text messages to standardize abbreviations and special characters before training.
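
One possible way to apply this tip is sketched below, using only the Python standard library. The `ABBREVIATIONS` map and the function name `normalize_sms` are assumptions made for illustration; a real abbreviation list should be built after inspecting the messages.

```python
import re
import unicodedata

# Hypothetical abbreviation map; extend it after inspecting the data.
ABBREVIATIONS = {
    "u": "you",
    "ur": "your",
    "txt": "text",
    "pls": "please",
    "msg": "message",
}


def normalize_sms(text, abbreviations=ABBREVIATIONS):
    # NFKC folds compatibility characters (e.g. full-width letters) to plain forms.
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace anything that is not a word character (letter, digit, underscore)
    # or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Expand known abbreviations token by token and collapse repeated spaces.
    tokens = [abbreviations.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


cleaned = normalize_sms("Pls txt ME back, U won!!")
```

Note that this sketch also strips currency symbols and exclamation marks, which can be useful spam indicators; it may be worth keeping them as separate tokens instead.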

Frequently Asked Questions

Is this dataset suitable for training an SMS spam filter?

Yes, it is specifically designed for spam/ham classification of SMS messages.

What is the format of the data in this dataset?

Messages are plain text, often distributed as a CSV file with two columns: label and message text.
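
A minimal loading sketch for the two-column layout described above follows. It assumes a tab-separated file with the label first and the message second; the delimiter, the column order, and the function name `load_sms` are assumptions to check against the file actually downloaded. An in-memory sample with made-up messages stands in for the real file.

```python
import csv
import io


def load_sms(fileobj, delimiter="\t"):
    """Read (label, text) pairs from a two-column, delimiter-separated file."""
    # QUOTE_NONE keeps quote characters inside messages as literal text.
    reader = csv.reader(fileobj, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        label = row[0].strip().lower()
        # Rejoin any extra fields in case the message itself contains the delimiter.
        text = delimiter.join(row[1:]).strip()
        rows.append((label, text))
    return rows


# Tiny in-memory sample standing in for the real file (messages are made up).
sample = io.StringIO(
    "ham\tSee you at lunch, ok?\n"
    "spam\tWINNER!! Claim your \"free\" prize now\n"
    "\n"
    "ham\tRunning late\tsorry\n"
)
rows = load_sms(sample)
```

For a comma-separated copy of the data, `delimiter=","` can be passed, although quoted fields would then need the default `csv` quoting behavior rather than `QUOTE_NONE`.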

Can this dataset be used for multilingual projects?

No. The messages are mostly in English, so data from other sources would need to be added for multilingual work.
