SMS Spam Collection

Public dataset containing 5,574 SMS messages labeled spam or legitimate (ham), collected from various sources for SMS filtering research.

Size

5,574 SMS messages, plain text format (TXT/CSV)

Licence

CC BY 4.0

Description

The SMS Spam Collection dataset contains 5,574 SMS messages, each labeled as spam or ham (not spam). The messages were collected from several sources, including academic forums and corpora, which makes the dataset a solid basis for spam classification and filtering research.
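
As an illustration of working with the labeled messages, here is a minimal Python sketch of a loader. It assumes each line of the text file holds a label (`ham` or `spam`), a tab, and the message; that layout and the function name `parse_sms_lines` are assumptions made for this example, so check them against the file you actually download. The sample lines are invented.

```python
from collections import Counter

VALID_LABELS = {"ham", "spam"}

def parse_sms_lines(lines):
    """Parse 'label<TAB>message' lines into (label, text) pairs.

    The tab-separated layout is an assumption for this sketch.
    """
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        label, sep, text = line.partition("\t")
        if not sep or label not in VALID_LABELS:
            continue  # skip lines that do not match the assumed layout
        records.append((label, text))
    return records

# Invented sample lines, not taken from the dataset.
sample = [
    "ham\tSee you at lunch?\n",
    "spam\tWIN a free prize, reply YES now\n",
    "\n",
    "ham\tRunning late, sorry\n",
]

records = parse_sms_lines(sample)
label_counts = Counter(label for label, _ in records)
print(label_counts)
```

For a real file, any iterable of lines works, so an open file handle (opened with an explicit `encoding`) can be passed directly in place of `sample`.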

What is this dataset for?

Training text classification algorithms for spam filtering
Research on natural language processing (NLP) applied to SMS
Evaluation of clustering and text analysis techniques
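
To make the first use case concrete, the sketch below trains a small multinomial Naive Bayes classifier with add-one (Laplace) smoothing, using only the Python standard library. Naive Bayes is chosen here as a common baseline for spam filtering, not because the dataset prescribes it. The helper names (`tokenize`, `train_nb`, `predict_nb`) and the four training messages are invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def train_nb(records):
    """Count documents and word occurrences per label."""
    doc_counts = Counter()
    word_counts = {}
    for label, text in records:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return doc_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-posterior score."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: share of training messages with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (label, word) pairs above zero.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy messages; a real run would use the full labeled dataset.
train = [
    ("spam", "Win a free prize now"),
    ("spam", "Free entry, claim your cash prize"),
    ("ham", "Are we still meeting for lunch"),
    ("ham", "Call me when you get home"),
]
model = train_nb(train)
print(predict_nb(model, "claim your free prize"))
print(predict_nb(model, "see you at lunch"))
```

On real data, hold out part of the messages for evaluation. Since only two labels are available and their proportions may be uneven, per-class precision and recall are usually more informative than accuracy alone.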

Can it be enriched or improved?

Yes. Adding more recent SMS data, manually re-annotating ambiguous messages, or attaching metadata (time, origin) could each help improve model performance.

🔎 In summary

Criterion	Evaluation
🧩Ease of Use	⭐⭐⭐⭐☆ (Simple, standard text format)
🧼Cleaning Required	⭐⭐⭐⭐☆ (Low to moderate – some duplicates and encoding issues to check)
🏷️Annotation Richness	⭐⭐⭐☆☆ (Basic – only spam/ham labels)
📜Commercial License	✅ Yes (CC BY 4.0)
👨‍💻Beginner Friendly	👍 Perfect for introduction to text classification
🔁Reusable for Fine-Tuning	🔥 Suitable for classic NLP models and fine-tuning
🌍Cultural Diversity	🌍 Mostly English messages, diverse sources
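
The "Cleaning Required" row mentions duplicates and encoding issues. The sketch below shows one possible way to check for both using the Python standard library. It treats two messages as duplicates when they match after Unicode normalization, whitespace collapsing, and case folding, and it flags messages containing the Unicode replacement character (U+FFFD) as likely decoding failures. Both criteria are assumptions chosen for this example, and the sample messages are invented.

```python
import unicodedata
from collections import Counter

def normalize(text):
    """Normalize Unicode form, collapse whitespace, and fold case."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split()).casefold()

def find_duplicates(messages):
    """Map each normalized message seen more than once to its count."""
    counts = Counter(normalize(m) for m in messages)
    return {msg: n for msg, n in counts.items() if n > 1}

def find_encoding_suspects(messages):
    """Return messages containing U+FFFD, a sign of a failed decode."""
    return [m for m in messages if "\ufffd" in m]

# Invented sample messages, not taken from the dataset.
messages = [
    "Free entry now",
    "free  entry NOW ",
    "See you soon",
    "Caf\ufffd tonight?",
]

duplicates = find_duplicates(messages)
suspects = find_encoding_suspects(messages)
print(duplicates)
print(suspects)
```

Whether near-duplicates should be dropped depends on the goal: removing them before a train/test split avoids the same message appearing on both sides, which would otherwise inflate evaluation scores.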