FineWeb-edu
A massive corpus filtered from CommonCrawl for educational quality, intended for training LLMs on learning- and comprehension-oriented tasks.
1.3T tokens in Parquet, filtered version of CommonCrawl, streaming available
ODC-by 1.0
Description
FineWeb-edu is a filtered version of the FineWeb web dataset, selected according to an educational utility score assigned by a classifier trained on Llama3-70B annotations. It comprises 1.3T tokens from educational web pages, stored as Parquet files, and is intended for training LLMs on informative and educational content.
What is this dataset for?
- Train LLMs on reliable, targeted educational content
- Improve performance on benchmarks such as MMLU, ARC, and OpenBookQA
- Build learning assistants or systems for answering complex questions
Can it be enriched or improved?
Yes, FineWeb-edu can be combined with other structured sources (e.g. Wikipedia, StackExchange) or specialized by discipline (math, physics, etc.). Deduplicated versions, or versions filtered for specific grade levels, can also be produced.
🔎 In summary
🧠 Recommended for
- Educational LLM developers
- NLP researchers
- Open-source educational institutions
🔧 Compatible tools
- Datatrove
- Hugging Face Datasets
- PyTorch
- Streaming Parquet
💡 Tip
For specific tasks, use the sampled versions (10B, 100B, or 350B tokens) to speed up your training iterations.
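For quick dry runs, a sampled config can additionally be capped at a fixed number of records. A sketch of the pattern with a stand-in stream (the `sample-10BT` config name in the comment is taken from the published subsets):

```python
from itertools import islice

def take_examples(stream, n):
    """Cap any (possibly very large) example stream at n records for a dry run."""
    return list(islice(stream, n))

# With the real dataset this would be:
#   fw = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
#                     split="train", streaming=True)
#   batch = take_examples(fw, 1000)
# Stand-in generator used here for illustration:
fake_stream = ({"text": f"doc {i}"} for i in range(10_000))
batch = take_examples(fake_stream, 3)
print(len(batch))  # 3
```

Capping the stream keeps iteration times short while you debug tokenization and training code.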
Frequently Asked Questions
Does FineWeb-edu only contain academic content?
No, it contains any type of content deemed “educational” by the classifier (e.g. practical guides, courses, encyclopedic explanations, etc.).
What is the difference between FineWeb and FineWeb-edu?
FineWeb-edu is a filtered subset of FineWeb containing only the pages that received a high educational-quality score from the Llama3-based classifier.
Can FineWeb-edu be used to train a multilingual model?
The content is mostly in English, although some multilingual pages may be included. It is recommended to supplement it with dedicated multilingual datasets.