Summary Dataset

A dataset of resumes collected via web scraping, including plain text, original HTML, and associated PDF files, classified into 26 professional categories (e.g. IT, Health, Finance, Education).

Size

Approximately 2,485 resumes in text, HTML, and PDF formats, divided into professional categories, plus a metadata CSV file

Licence

CC0: Public Domain

Description

The Summary Dataset brings together more than 2,400 resumes in text, HTML, and PDF formats, extracted from online sources. Each CV is associated with a specific professional category (e.g. HR, IT, Finance, Education), which makes the text suitable for classification and other NLP analysis.

What is this dataset for?

  • Train automatic CV classification models by business sector
  • Analyze and extract structured information from professional documents
  • Create intelligent systems for managing applications or recommendations
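Before training a classifier on the category labels, it usually helps to hold out a test set that contains every category. The following is a minimal sketch using only the Python standard library. The `(text, category)` record shape, the function name `stratified_split`, and the toy data are assumptions made for illustration and are not part of the dataset.

```python
import random
from collections import defaultdict


def stratified_split(records, test_fraction=0.2, seed=0):
    """Split (text, category) pairs so each category is represented in both sets."""
    by_category = defaultdict(list)
    for text, category in records:
        by_category[category].append((text, category))

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for category in sorted(by_category):
        items = by_category[category]
        rng.shuffle(items)
        # Reserve at least one test example when the category has 2+ items.
        n_test = max(1, round(len(items) * test_fraction)) if len(items) > 1 else 0
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Hypothetical toy records standing in for (resume_text, category) pairs.
records = [(f"resume {i}", "IT" if i % 2 else "Finance") for i in range(20)]
train, test = stratified_split(records)
```

Splitting per category rather than globally avoids a test set that misses small categories entirely, which matters when the labels are unevenly distributed.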

Can it be enriched or improved?

This dataset can be enriched by adding multilingual resumes, standardizing the PDF formats, and providing additional annotations (e.g. skills, experience, degrees). Converting the CVs into a structured format such as JSON would make them easier to use.
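As an illustration of what a structured JSON form could look like, here is a small sketch. The `ResumeRecord` class and all of its field names are hypothetical; the dataset itself does not define such a schema, and the annotation fields (`skills`, `degrees`) would have to be filled in by an annotation step that the dataset does not provide.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ResumeRecord:
    """Hypothetical structured form of one resume (not an official schema)."""
    resume_id: str
    category: str
    text: str
    skills: list = field(default_factory=list)   # extra annotation, empty by default
    degrees: list = field(default_factory=list)  # extra annotation, empty by default


# Made-up example values, for illustration only.
record = ResumeRecord(
    resume_id="0001",
    category="IT",
    text="Python developer with experience in data pipelines.",
    skills=["Python"],
)

# Serialize to JSON; ensure_ascii=False keeps non-ASCII characters readable.
as_json = json.dumps(asdict(record), ensure_ascii=False, indent=2)
```

Keeping the raw text next to the annotations means the same record can feed both classification and information extraction.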

🔎 In summary

Criterion: Evaluation
🧩 Ease of use: ⭐⭐⭐✩✩ (Requires handling of different formats)
🧼 Need for cleaning: ⭐⭐⭐✩✩ (Moderate; PDF and HTML need parsing)
🏷️ Annotation richness: ⭐⭐⭐✩✩ (Precise categories, no detailed annotations)
📜 Commercial license: ✅ Yes (CC0)
👨‍💻 Beginner friendly: ⚠️ Medium; useful for intermediate NLP projects
🔁 Fine-tuning ready: 🗂️ Suitable for classification and information extraction
🌍 Cultural diversity: ⚠️ Primarily English CVs, limited diversity

🧠 Recommended for

  • NLP developers
  • Automated recruitment
  • Document processing

🔧 Compatible tools

  • PyPDF2
  • Hugging Face Transformers
  • Scikit-learn

💡 Tip

Convert PDFs to plain text and standardize formats before training to optimize results.
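The same idea applies to the HTML copies. The sketch below, which uses only the Python standard library, strips markup from an HTML resume and collapses whitespace so that text from different sources looks alike. The helper names `html_to_text` and `normalize` are assumptions for illustration. PDF extraction is left out here because it needs a third-party library such as PyPDF2.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return " ".join(self._chunks)


def normalize(text):
    """Collapse any run of whitespace (including non-breaking spaces) to one space."""
    return re.sub(r"\s+", " ", text).strip()


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return normalize(parser.text())


sample = "<div><h1>Jane  Doe</h1><style>h1{color:red}</style><p>Data&nbsp;Analyst</p></div>"
plain = html_to_text(sample)
```

Running the plain-text files through the same `normalize` step keeps all three formats consistent before tokenization.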

Frequently Asked Questions

Can this dataset be used to classify CVs automatically by target profession?

Yes, each resume is annotated with a professional category that can be used as a label for classification models.
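A possible way to turn the category column into integer labels is sketched below. The column names `id` and `category` and the sample rows are assumptions; check the header of the actual metadata CSV before using them.

```python
import csv
import io

# Hypothetical metadata layout; the real CSV's column names may differ.
SAMPLE_CSV = """id,category
1,IT
2,Finance
3,IT
4,Education
"""


def build_label_index(csv_file, category_column="category"):
    """Map each distinct category name to a stable integer id (alphabetical order)."""
    reader = csv.DictReader(csv_file)
    categories = sorted({row[category_column] for row in reader})
    return {name: index for index, name in enumerate(categories)}


label_to_id = build_label_index(io.StringIO(SAMPLE_CSV))
```

Sorting the category names before numbering them keeps the mapping identical across runs, so saved models and evaluation scripts agree on which id means which profession.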

What file formats are included in this dataset?

The dataset contains resumes in plain text, HTML, and PDF formats, with a CSV metadata file.

Does this dataset include resumes in multiple languages?

No. The resumes are mostly in English, and there are no multilingual annotations.
