Web Page Phishing Detection Dataset

Balanced dataset of 11,430 annotated URLs (phishing vs legitimate), accompanied by 87 features extracted from each URL's structure, page content, and external services.

Size

11,430 entries with 87 columns, tabular CSV format

Licence

CC BY 4.0

Description

The Web Page Phishing Detection Dataset is a resource designed for developing and evaluating phishing detection systems using machine learning. It includes 11,430 URLs divided evenly between phishing and legitimate ones. Each URL is associated with 87 features extracted from its structure, HTML content, and external services. It is well suited as a training set for supervised classification algorithms.
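
As a quick sanity check after download, the shape and class balance described above can be verified directly from the CSV. The following is a minimal sketch using only the Python standard library. The label column name `status`, the feature column names, and the inline sample rows are assumptions for illustration, not taken from the actual file.

```python
import csv
import io
from collections import Counter


def summarize(csv_file, label_column="status"):
    """Return (row_count, column_count, label_counts) for a CSV file object.

    `label_column` is an assumed name for the phishing/legitimate label;
    adjust it to match the header of the file you downloaded.
    """
    reader = csv.DictReader(csv_file)
    label_counts = Counter()
    row_count = 0
    for row in reader:
        row_count += 1
        label_counts[row[label_column]] += 1
    column_count = len(reader.fieldnames or [])
    return row_count, column_count, label_counts


# Tiny made-up sample standing in for the real file (not actual dataset rows).
SAMPLE = """url,length_url,nb_dots,status
http://example.com/,19,1,legitimate
http://example.org/login,24,1,phishing
http://example.net/a.b,22,2,legitimate
http://example.edu/x,20,1,phishing
"""

rows, cols, counts = summarize(io.StringIO(SAMPLE))
print(rows, cols, dict(counts))
```

To run the same check on the real file, pass an open file handle instead of the `io.StringIO` sample. Given the description above, the full dataset should report 11,430 rows split evenly between the two labels.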

What is this dataset for?

Develop machine learning models to detect phishing sites
Evaluate the robustness of web security systems in the face of modern threats
Create tools to automatically analyze suspicious URLs in browsers or antivirus software

Can it be enriched or improved?

Yes. The dataset can be enriched with additional metadata (geolocation, WHOIS history), refreshed by re-checking which URLs are still valid, or extended with new classes such as spam or malware. The feature set can also be augmented with text embeddings of the HTML content.
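
One way to attach extra metadata is to key it by hostname and merge it into each row. The sketch below assumes a `url` column and uses a hypothetical in-memory lookup table (`DOMAIN_METADATA`) in place of a real geolocation or WHOIS source. The field names `country` and `domain_age_days`, and all values shown, are illustrative only.

```python
from urllib.parse import urlsplit

# Hypothetical metadata keyed by hostname; in practice this would come from
# a geolocation or WHOIS source of your choice. Values here are made up.
DOMAIN_METADATA = {
    "example.com": {"country": "US", "domain_age_days": 9000},
    "example.org": {"country": "FR", "domain_age_days": 12},
}

# Defaults used when no metadata is known for a hostname.
MISSING = {"country": None, "domain_age_days": None}


def enrich_row(row, metadata=DOMAIN_METADATA, url_column="url"):
    """Return a copy of `row` with hostname and per-domain metadata added."""
    # urlsplit().hostname is lowercased, or None if the URL has no host part.
    hostname = urlsplit(row[url_column]).hostname or ""
    extra = metadata.get(hostname, MISSING)
    return {**row, "hostname": hostname, **extra}


sample_rows = [
    {"url": "http://example.com/index.html", "status": "legitimate"},
    {"url": "http://Example.org/login", "status": "phishing"},
    {"url": "http://unknown.example/", "status": "phishing"},
]
enriched = [enrich_row(r) for r in sample_rows]
for r in enriched:
    print(r["hostname"], r["country"], r["domain_age_days"])
```

Returning a new dictionary rather than mutating the input keeps the original rows intact, so the enriched and unenriched feature sets can be compared side by side.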

🔎 In summary

Criterion	Evaluation
🧩 Ease of use	⭐⭐⭐⭐✩ (Ready-to-use for supervised classification)
🧼 Need for cleaning	⭐⭐⭐⭐⭐ (Very low – clean and well-structured data)
🏷️ Annotation richness	⭐⭐⭐⭐✩ (87 features + binary label - phishing/legitimate)
📜 Commercial license	✅ Yes (CC BY 4.0)
👨‍💻 Beginner friendly	🌟 Good entry point for applied cybersecurity
🔁 Fine-tuning ready	🎯 Very good for training or evaluating existing models
🌍 Cultural diversity	⚠️ Varied URLs, but limited information on geographic origin