Summary Dataset
Dataset of resumes collected via web scraping, including plain text, original HTML, and associated PDF files, classified into 26 professional categories (e.g. IT, Health, Finance, Education).
Approximately 2,485 resumes in text, HTML, and PDF formats, divided into professional categories, plus a metadata CSV file
CC0: Public Domain
Description
The Summary Dataset brings together more than 2,400 resumes in text, HTML, and PDF formats, extracted from online sources. Each CV is associated with a specific professional category (e.g. HR, IT, Finance, Education), which makes the data suitable for text classification and other NLP analysis.
What is this dataset for?
- Train models that automatically classify CVs by business sector
- Analyze and extract structured information from professional documents
- Create intelligent systems for managing applications or recommendations
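The first use case can be sketched with scikit-learn, one of the tools listed for this dataset. This is a minimal illustration rather than an official pipeline: the six training snippets and their labels are made-up stand-ins for real resume text, and the choice of TF-IDF features with logistic regression is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for resume text and category labels.
texts = [
    "python developer sql backend software",
    "java software engineer cloud devops",
    "accountant audit tax financial reporting",
    "analyst budgeting forecasting tax payroll",
    "teacher curriculum classroom lesson planning",
    "lecturer teaching students curriculum grading",
]
labels = ["IT", "IT", "Finance", "Finance", "Education", "Education"]


def train_classifier(texts, labels):
    """Fit a TF-IDF + logistic regression pipeline on (text, label) pairs."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    return model


model = train_classifier(texts, labels)
prediction = model.predict(["software engineer python cloud"])[0]
print(prediction)
```

With the real dataset, the toy lists would be replaced by the resume texts and their category column, and a held-out split would be needed to measure accuracy.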
Can it be enriched or improved?
This dataset can be enriched by adding multilingual resumes, standardizing the PDF formats, and providing additional annotations (e.g. skills, experience, degrees). Converting the CVs into a structured format such as JSON would make them easier to query and process.
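As a rough sketch of what a structured JSON record could look like, the example below uses only the Python standard library. The `ResumeRecord` class and its field names (`resume_id`, `skills`, `degrees`) are hypothetical and not part of the dataset; the sample values are invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ResumeRecord:
    """Hypothetical structured form of one resume (schema is an assumption)."""
    resume_id: str
    category: str
    text: str
    skills: List[str] = field(default_factory=list)
    degrees: List[str] = field(default_factory=list)


def to_json(record: ResumeRecord) -> str:
    """Serialize a record to a JSON string, keeping non-ASCII characters."""
    return json.dumps(asdict(record), ensure_ascii=False, indent=2)


# Invented sample values, for illustration only.
record = ResumeRecord(
    resume_id="0001",
    category="IT",
    text="Software engineer with backend experience.",
    skills=["Python", "SQL"],
)
payload = to_json(record)
print(payload)
```

Extra annotation fields such as skills or degrees start out empty here, so they can be filled in later without changing the record layout.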
🔎 In summary
🧠 Recommended for
- NLP developers
- Automated recruitment
- Document processing
🔧 Compatible tools
- PyPDF2
- Hugging Face Transformers
- Scikit-learn
💡 Tip
Convert PDFs to plain text and standardize formats before training to optimize results.
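One possible way to follow this tip is sketched below. The helper names (`html_to_text`, `normalize_text`, `pdf_to_text`) are illustrative, and the specific normalization steps (Unicode NFKC, whitespace collapsing, lowercasing) are assumptions about what "standardize" should mean here. The PDF helper assumes a PyPDF2 release that provides `PdfReader`; it is imported lazily so the rest of the code runs without it.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document (script/style not filtered)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip tags from an HTML string and return its text content."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def normalize_text(text: str) -> str:
    """Apply NFKC normalization, collapse whitespace, and lowercase."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


def pdf_to_text(path: str) -> str:
    """Extract text from a PDF file (assumes PyPDF2 with PdfReader is installed)."""
    from PyPDF2 import PdfReader  # lazy import: only needed for PDF input

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


cleaned = normalize_text(html_to_text("<p>Senior&nbsp;Data   Analyst</p><ul><li>SQL</li></ul>"))
print(cleaned)
```

Running every format through the same `normalize_text` step means the text, HTML, and PDF versions of a resume end up in a comparable form before training.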
Frequently Asked Questions
Can this dataset be used to automatically classify CVs by target profession?
Yes, each resume is annotated with a professional category that can be used as a label for classification models.
What file formats are included in this dataset?
The dataset contains resumes in plain text, HTML, and PDF formats, with a CSV metadata file.
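A quick way to inspect the metadata file is to count resumes per category. The sketch below uses an in-memory sample because the real column names are not documented here: the `ID` and `Category` headers and the four rows are assumptions made for illustration, so adjust `column` to match the actual CSV.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; the real metadata CSV may use different column names.
SAMPLE_CSV = """ID,Category
1001,IT
1002,Finance
1003,IT
1004,Education
"""


def count_by_category(csv_file, column="Category"):
    """Count rows per category from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


counts = count_by_category(io.StringIO(SAMPLE_CSV))
print(counts.most_common())
```

For the real file, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` sample.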
Does this dataset include resumes in multiple languages?
No. The resumes are mostly in English, and the dataset has no multilingual annotations.