SMS Spam Collection
Public dataset containing 5,574 SMS messages labeled spam or legitimate (ham), collected from various sources for SMS filtering research.
Description
The SMS Spam Collection contains 5,574 English SMS messages, each labeled as spam or ham (not spam): 4,827 ham and 747 spam. The messages were gathered from several sources, including a public web forum and existing research SMS corpora, which makes the collection a common baseline for spam classification and filtering research.
What is this dataset for?
- Training text classification algorithms for spam filtering
- Research on natural language processing (NLP) applied to SMS
- Evaluation of clustering and text analysis techniques
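As an illustration of the first use case, here is a minimal sketch of a spam/ham classifier. It assumes scikit-learn is installed and pairs TF-IDF features with a multinomial naive Bayes model, which is one common baseline and not a method prescribed by the dataset. The messages in `TOY_MESSAGES` are made up for the example and are not taken from the dataset; `train_spam_filter` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up (label, text) pairs standing in for rows of the real dataset.
TOY_MESSAGES = [
    ("ham", "Are we still meeting for lunch today"),
    ("ham", "I'll call you later tonight"),
    ("ham", "Can you pick up milk on the way home"),
    ("ham", "See you at the gym tomorrow"),
    ("spam", "WINNER! Claim your free prize now"),
    ("spam", "Free entry in a weekly competition, text WIN to claim"),
    ("spam", "Urgent! Your account has won a cash prize, call now"),
    ("spam", "Claim your free ringtone now, reply YES"),
]


def train_spam_filter(messages):
    """Fit a TF-IDF + multinomial naive Bayes pipeline on (label, text) pairs."""
    labels = [label for label, _ in messages]
    texts = [text for _, text in messages]
    model = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())
    model.fit(texts, labels)
    return model


if __name__ == "__main__":
    model = train_spam_filter(TOY_MESSAGES)
    for text in ["claim your free prize", "see you at lunch tomorrow"]:
        print(text, "->", model.predict([text])[0])
```

On the real dataset, it would be sensible to hold out a test split and report precision and recall for the spam class rather than accuracy alone, since spam messages are the minority class.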
Can it be enriched or improved?
Yes. Options include adding more recent SMS data, manually re-annotating ambiguous messages, and attaching metadata (send time, origin), any of which may improve model performance.
🔎 In summary
🧠 Recommended for
- Junior data scientists
- NLP researchers
- Anti-spam application developers
🔧 Compatible tools
- Scikit-learn
- NLTK
- TensorFlow
- PyTorch
- spaCy
💡 Tip
Consider preprocessing text messages to standardize abbreviations and special characters before training.
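The tip above could be sketched as follows, using only the Python standard library. The abbreviation map is a small hypothetical sample, and the placeholder tokens `urltoken` and `numtoken` are illustrative choices; neither comes from the dataset. The function also drops every character outside a-z, which is an assumption that only suits English text.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny abbreviation map; extend it for real use.
ABBREVIATIONS = {
    "u": "you",
    "ur": "your",
    "pls": "please",
    "txt": "text",
    "msg": "message",
}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NUMBER_RE = re.compile(r"\d+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def normalize_sms(text):
    """Lowercase a message, mask URLs and numbers, expand abbreviations."""
    # Fold compatibility characters (e.g. full-width letters), then lowercase.
    text = unicodedata.normalize("NFKC", text).lower()
    # Mask URLs before digits and punctuation are touched.
    text = URL_RE.sub(" urltoken ", text)
    # Replace each run of digits with a single placeholder token.
    text = NUMBER_RE.sub(" numtoken ", text)
    # Turn every remaining non-letter character into a space.
    text = NON_LETTER_RE.sub(" ", text)
    # Expand known abbreviations and collapse whitespace.
    tokens = [ABBREVIATIONS.get(token, token) for token in text.split()]
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize_sms("Pls txt ur answer to 80082 NOW!!!"))
```

Masking numbers and URLs rather than deleting them keeps the signal that a message contained one, which is often informative for spam detection.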
Frequently Asked Questions
Is this dataset suitable for training an SMS spam filter?
Yes, it is specifically designed for spam/ham classification of SMS messages.
What is the format of the data in this dataset?
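Messages are plain text. The original release is a single tab-separated file with one message per line: the label (ham or spam) followed by the message text. Redistributed copies are often CSV files with the same two columns: label and message text.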
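Some copies of the dataset are tab-separated rather than comma-separated. Assuming that layout (one message per line, label first, then a tab, then the text, no header row), a minimal parsing sketch could look like this. The function names and the file path passed to `load_sms_file` are hypothetical, and the sample lines are made up for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_sms_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'label<TAB>text' lines into (label, text) pairs."""
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the message survive.
        label, sep, text = line.partition("\t")
        if not sep or label not in {"ham", "spam"}:
            raise ValueError(f"unexpected line format: {line!r}")
        records.append((label, text))
    return records


def load_sms_file(path: str) -> List[Tuple[str, str]]:
    """Read a tab-separated copy of the dataset from disk (path is assumed)."""
    with open(path, encoding="utf-8") as handle:
        return parse_sms_lines(handle)


if __name__ == "__main__":
    sample = ["ham\tSee you at the gym tomorrow\n", "spam\tClaim your free prize now\n"]
    records = parse_sms_lines(sample)
    print(Counter(label for label, _ in records))
```

Splitting manually avoids treating quote characters inside messages as CSV quoting. For a comma-separated copy, the standard `csv` module would be the more natural choice.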
Can this dataset be used for multilingual projects?
No. The messages are mostly in English, so multilingual projects need to combine this dataset with other sources.