FineWeb-edu
A massive corpus filtered from CommonCrawl for educational quality, intended for training LLMs on learning- and comprehension-oriented tasks.
1.3T tokens in Parquet, filtered version of CommonCrawl, streaming available
ODC-by 1.0
Description
FineWeb-edu is a filtered version of the FineWeb web dataset, selected according to an educational utility score assigned by a classifier trained on Llama3-70B annotations. It comprises 1.3T tokens from educational web pages, stored as Parquet files, and is intended for training LLMs on informative and educational content.
What is this dataset for?
- Train LLMs on reliable, targeted educational content
- Improve performance on benchmarks such as MMLU, ARC, and OpenBookQA
- Build learning assistants or systems for answering complex questions
Can it be enriched or improved?
Yes, FineWeb-edu can be combined with other structured sources (e.g. Wikipedia, StackExchange) or specialized by discipline (math, physics, etc.). Deduplicated versions, or versions filtered for specific grade levels, can also be produced.
🔎 In summary
🧠 Recommended for
- Educational LLM developers
- NLP researchers
- Open-source educational institutions
🔧 Compatible tools
- Datatrove
- Hugging Face Datasets
- PyTorch
- Streaming Parquet
💡 Tip
For specific tasks, use the sampled versions (10B, 100B, or 350B tokens) to speed up your training iterations.
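For quick dry runs, a sampled config can additionally be capped at a fixed number of records. A sketch of the pattern with a stand-in stream (the `sample-10BT` config name in the comment is taken from the published subsets):

```python
from itertools import islice

def take_examples(stream, n):
    """Cap any (possibly very large) example stream at n records for a dry run."""
    return list(islice(stream, n))

# With the real dataset this would be:
#   fw = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
#                     split="train", streaming=True)
#   batch = take_examples(fw, 1000)
# Stand-in generator used here for illustration:
fake_stream = ({"text": f"doc {i}"} for i in range(10_000))
batch = take_examples(fake_stream, 3)
print(len(batch))  # 3
```

Capping the stream keeps iteration times short while you debug tokenization and training code.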
Frequently Asked Questions
Does FineWeb-edu only contain academic content?
No, it contains any type of content deemed “educational” by the classifier (e.g. practical guides, courses, encyclopedic explanations, etc.).
What is the difference between FineWeb and FineWeb-edu?
FineWeb-edu is a filtered subset of FineWeb containing only the pages that received a high educational-quality score from the Llama3-based classifier.
Can FineWeb-edu be used to train a multilingual model?
The content is mostly in English, although some multilingual pages may be included. It is recommended to supplement it with dedicated multilingual datasets.