Web Page Phishing Detection Dataset
Balanced dataset of 11,430 annotated URLs (phishing vs legitimate), accompanied by 87 textual and structural characteristics extracted from the pages.
Description
The Web Page Phishing Detection Dataset is a resource designed for developing and evaluating machine-learning phishing detection systems. It includes 11,430 URLs divided evenly between phishing and legitimate ones. Each URL is associated with 87 features extracted from its structure, the HTML content of its page, and external services. It is a good training set for supervised classification algorithms.
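To give a sense of what features "extracted from the URL structure" can look like, here is a minimal sketch using only the Python standard library. The function name and the feature names are illustrative assumptions; they are not guaranteed to match the dataset's actual columns or the way its authors computed them.

```python
import re
from urllib.parse import urlparse

# Matches a dotted-quad host such as 192.168.0.1 (no range check on octets).
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_structure_features(url: str) -> dict:
    """Compute a few simple structural features for one URL.

    Feature names are hypothetical, chosen for illustration only.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "hostname_length": len(host),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        "host_is_ipv4": int(bool(IPV4_RE.match(host))),
        "uses_https": int(parsed.scheme == "https"),
    }


if __name__ == "__main__":
    print(url_structure_features("http://192.168.0.1/login-update/index.php?id=42"))
```

Features based on HTML content or external services (for example WHOIS data) would need the page itself or a third-party lookup, so they are left out of this sketch.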
What is this dataset for?
- Develop machine learning models to detect phishing sites
- Evaluate the robustness of web security systems in the face of modern threats
- Create tools to automatically analyze suspicious URLs in browsers or antivirus software
Can it be enriched or improved?
Yes. You can enrich the dataset by adding metadata (geolocation, WHOIS history), re-checking whether the URLs are still valid, or extending it with new classes such as spam or malware. You can also enrich the features with text embedding vectors computed from the HTML content.
🔎 In summary
🧠 Recommended for
- Cybersecurity analysts
- Applied NLP researchers
- Anti-phishing solution developers
🔧 Compatible tools
- Pandas
- Scikit-learn
- XGBoost
- LightGBM
- TensorFlow
💡 Tip
Use an ensemble model (random forest + XGBoost) for very good results right from the start, without complex tuning.
Frequently Asked Questions
Does this dataset include the HTML content of the pages?
No, only extracted characteristics are provided. However, it is possible to crawl the pages to extract more information.
Are the URLs still active?
The dataset does not guarantee the current validity of the links. It is recommended to check the URLs before operational use.
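As a rough illustration of such a check, the sketch below sends a HEAD request with the Python standard library and reports whether the URL still answers. The function name `is_reachable` and the timeout are assumptions made for this example. Phishing URLs can be dangerous, so run checks like this only from an isolated environment. Some servers also reject HEAD requests, so a negative result does not prove that the page is gone.

```python
import http.client
import urllib.error
import urllib.request
from urllib.parse import urlparse


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers a HEAD request with a status below 400.

    Hypothetical helper for illustration; only http and https are accepted.
    """
    try:
        if urlparse(url).scheme not in ("http", "https"):
            return False
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers DNS failures, refused connections, timeouts, HTTP error
        # statuses (4xx/5xx) and malformed URLs.
        return False


if __name__ == "__main__":
    for candidate in ["https://example.com/", "not a url"]:
        print(candidate, is_reachable(candidate))
```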
Can this dataset be used to train a detector in real time?
Yes, it is well suited to training detection models that run online or are embedded in a browser or a filtering proxy.




