Resume Dataset

A dataset of resumes collected via web scraping, provided as plain text, original HTML, and associated PDF files, and classified into 26 professional categories (e.g. IT, Health, Finance, Education).

Size

Approximately 2,485 resumes in text, HTML, and PDF formats, divided into professional categories, plus a metadata CSV

Licence

CC0: Public Domain

Description

The Resume Dataset brings together more than 2,400 resumes in text, HTML, and PDF formats, extracted from online sources. Each CV is associated with a specific professional category (e.g. HR, IT, Finance, Education), which supports text classification and NLP analysis.
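
As a first look at the category distribution, the metadata CSV could be read and tallied as in the minimal sketch below. The column names (`ID`, `Category`, `Resume_str`) and the sample rows are assumptions made for illustration; the actual file may use different headers.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the metadata CSV. The real column
# names may differ, so adjust "Category" to match the actual header.
SAMPLE_CSV = """ID,Category,Resume_str
1001,HR,Experienced HR manager
1002,IT,Python developer
1003,IT,Network administrator
1004,Finance,Financial analyst
"""

def count_by_category(csv_file, category_column="Category"):
    """Return a Counter mapping each category to its number of resumes."""
    reader = csv.DictReader(csv_file)
    return Counter(row[category_column] for row in reader)

# With the real dataset, pass open("metadata.csv", newline="") instead
# (the file name here is an assumption as well).
counts = count_by_category(io.StringIO(SAMPLE_CSV))
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

Checking these counts early helps reveal class imbalance before any model is trained.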

What is this dataset for?

Train automatic CV classification models by business sector
Analyze and extract structured information from professional documents
Create intelligent systems for managing applications or recommendations
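
The classification use case above can be sketched with a small multinomial Naive Bayes model written in plain Python. This is an illustrative baseline under stated assumptions, not a method prescribed by the dataset: the training texts are invented toy examples, and the class and function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy resumes; real training data would come from the dataset.
train_texts = [
    "python developer building backend services and databases",
    "network administrator managing servers and linux systems",
    "financial analyst preparing budgets and forecasts",
    "accountant handling audits tax and financial reports",
    "recruiter handling hiring onboarding and employee relations",
    "hr manager responsible for payroll benefits and recruitment",
]
train_labels = ["IT", "IT", "Finance", "Finance", "HR", "HR"]

model = NaiveBayesClassifier().fit(train_texts, train_labels)
print(model.predict("senior linux developer with python experience"))
```

On the full dataset, a library pipeline (for example TF-IDF features with a linear classifier) would likely be a stronger starting point; the sketch only shows the shape of the task.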

Can it be enriched or improved?

This dataset can be enriched by adding multilingual resumes, standardizing PDF formats, and providing additional annotations (e.g. skills, experience, degrees). Converting CVs into a structured format such as JSON would make the data easier to use.
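
A minimal sketch of the conversion to JSON, using only the Python standard library, might look like the following. The record fields (`id`, `category`, `text`) and the sample HTML are assumptions chosen for illustration; they do not reflect a schema defined by the dataset.

```python
import json
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def resume_to_record(resume_id, category, html):
    """Build a JSON-serializable record from one HTML resume."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"id": resume_id, "category": category, "text": " ".join(parser.chunks)}

# Invented sample markup standing in for one scraped resume.
sample_html = (
    "<div><h1>Data Analyst</h1><style>h1{color:red}</style>"
    "<p>Skills: SQL, Python</p></div>"
)
record = resume_to_record("1001", "IT", sample_html)
print(json.dumps(record, indent=2))
```

Finer-grained fields such as skills or degrees would require extra annotation or extraction rules, which this sketch does not attempt.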

🔎 In summary

Criterion	Evaluation
🧩 Ease of use	⭐⭐⭐✩✩ (Requires handling of different formats)
🧼 Need for cleaning	⭐⭐⭐✩✩ (Moderate – PDF and HTML need parsing)
🏷️ Annotation richness	⭐⭐⭐✩✩ (Precise categories, no detailed annotations)
📜 Commercial license	✅ Yes (CC0)
👨‍💻 Beginner friendly	⚠️ Medium – useful for intermediate NLP projects
🔁 Fine-tuning ready	🗂️ Suitable for classification and information extraction
🌍 Cultural diversity	⚠️ Primarily English CVs, limited diversity