Web Page Phishing Detection Dataset

Balanced dataset of 11,430 annotated URLs (phishing vs legitimate), accompanied by 87 textual and structural characteristics extracted from the pages.

Size

11,430 entries with 87 columns, tabular CSV format
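
A quick sanity check after download is to confirm the row count, column count, and class balance. The sketch below is one way to do that with Pandas. The file name `dataset_phishing.csv`, the label column name `status`, and the feature column names in the sample are assumptions for illustration, not details confirmed by this page; adjust them to match the file you actually download.

```python
import io

import pandas as pd


def summarize(df: pd.DataFrame, label_col: str = "status") -> dict:
    """Return row count, column count, and per-class counts for a labelled table.

    `label_col` is an assumed column name; change it to match the real file.
    """
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "class_counts": df[label_col].value_counts().to_dict(),
    }


# Tiny in-memory stand-in so the sketch runs without the real file.
# Column names here are hypothetical.
sample_csv = io.StringIO(
    "url,length_url,nb_dots,status\n"
    "http://a.example/login,22,1,phishing\n"
    "https://b.example/,18,1,legitimate\n"
)
summary = summarize(pd.read_csv(sample_csv))
print(summary)

# With the downloaded file (hypothetical name):
# summary = summarize(pd.read_csv("dataset_phishing.csv"))
```

On the real file, the row count should match the 11,430 entries stated above, and the two class counts should be equal if the dataset is balanced as described.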

Licence

CC BY 4.0

Description

The Web Page Phishing Detection Dataset is a resource for developing and evaluating machine learning based phishing detection systems. It contains 11,430 URLs, split evenly between phishing and legitimate ones. Each URL comes with 87 features extracted from the URL structure, the page's HTML content, and external services. It is a good training set for supervised classification algorithms.
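
To give a sense of what URL-structure features can look like, here is a minimal sketch using only the Python standard library. The feature names and the choice of features are hypothetical; they are not the dataset's actual 87 columns, whose definitions are not given on this page.

```python
import ipaddress
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Compute a few illustrative structural features from a URL string."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # A raw IP address in place of a domain name is a common phishing signal.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "url_length": len(url),
        "host_length": len(host),
        "dot_count": url.count("."),
        "hyphen_count": url.count("-"),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        "host_is_ip": host_is_ip,
    }


features = url_features("http://192.168.0.1/secure-login.php")
print(features)
```

Features derived from HTML content or external services would need the page itself or third-party lookups, which this sketch does not attempt.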

What is this dataset for?

  • Develop machine learning models to detect phishing sites
  • Evaluate the robustness of web security systems in the face of modern threats
  • Create tools that automatically analyze suspicious URLs in browsers or antivirus software

Can it be enriched or improved?

Yes. You can enrich the dataset by adding metadata (geolocation, WHOIS history), re-checking whether the URLs are still live, or extending it with new classes such as spam or malware. You can also enrich the features with text embedding vectors computed from the HTML content.

🔎 In summary

Criterion | Evaluation
🧩 Ease of use | ⭐⭐⭐⭐✩ (Ready to use for supervised classification)
🧼 Need for cleaning | ⭐⭐⭐⭐⭐ (Very low: clean and well-structured data)
🏷️ Annotation richness | ⭐⭐⭐⭐✩ (87 features + binary label: phishing/legitimate)
📜 Commercial license | ✅ Yes (CC BY 4.0)
👨‍💻 Beginner friendly | 🌟 Good entry point for applied cybersecurity
🔁 Fine-tuning ready | 🎯 Very good for training or evaluating existing models
🌍 Cultural diversity | ⚠️ Various URLs, limited information on geographic origin

🧠 Recommended for

  • Cybersecurity analysts
  • Applied NLP researchers
  • Anti-phishing solution developers

🔧 Compatible tools

  • Pandas
  • Scikit-learn
  • XGBoost
  • LightGBM
  • TensorFlow

💡 Tip

Use an ensemble model (random forest + XGBoost) for a strong baseline right from the start, without complex tuning.
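
The tip above, combining a random forest with a gradient-boosted model, can be sketched with scikit-learn's `VotingClassifier`. Two assumptions to note: the data here is synthetic (generated with `make_classification`) rather than the real dataset, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so the example needs only one library. Swapping in `xgboost.XGBClassifier` should work the same way, since it follows the scikit-learn estimator interface.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data as a stand-in for the real features.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Soft voting averages the predicted class probabilities of both models.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),  # XGBoost stand-in
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

accuracy = ensemble.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Accuracy on synthetic data says nothing about performance on the phishing dataset; the point is only the shape of the pipeline.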

Frequently Asked Questions

Does this dataset include the HTML content of the pages?

No, only extracted characteristics are provided. However, it is possible to crawl the pages to extract more information.

Are the URLs still active?

The dataset does not guarantee the current validity of the links. It is recommended to check the URLs before operational use.
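
One way to check whether a URL still responds is a `HEAD` request with a short timeout, as in the sketch below. It uses only the standard library. The function name `is_reachable` and the `opener` parameter are introduced here for illustration. Since half of the URLs are labelled phishing, it is reasonable to assume such checks should run from an isolated environment.

```python
import urllib.request


def is_reachable(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if a HEAD request to `url` yields a 2xx or 3xx status.

    `opener` is injectable so the function can be exercised without
    network access; by default it is urllib.request.urlopen.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError, and timeouts;
        # ValueError covers strings that are not valid URLs.
        return False


# Usage (performs a real network request):
# print(is_reachable("https://example.com/"))
```

A reachable URL is not necessarily the same page that was originally labelled: phishing pages are often taken down or replaced, so a live response only shows that something still answers at that address.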

Can this dataset be used to train a detector in real time?

Yes. Models trained on it can be deployed online or embedded in a browser or filtering proxy, provided the same features can be extracted from each URL at inference time.
