FineWeb-edu

Massive corpus derived from CommonCrawl and filtered for educational quality, intended for training LLMs on learning and comprehension-oriented tasks.

Size

1.3T tokens in Parquet, filtered version of CommonCrawl, streaming available

Licence

ODC-by 1.0

Description

FineWeb-edu is a filtered version of the FineWeb web dataset, selected according to an educational-value score assigned by a classifier trained on annotations generated by Llama3-70B-Instruct. It includes 1.3T tokens from educational web pages, stored in Parquet files, and is intended for training LLMs on informative and educational content.

What is this dataset for?

  • Train LLMs on targeted, quality-filtered educational content
  • Improve performance on benchmarks such as MMLU, ARC, and OpenBookQA
  • Build learning assistants or assistants that answer complex questions

Can it be enriched or improved?

Yes, FineWeb-edu can be combined with other structured sources (e.g. Wikipedia, StackExchange) or specialized for disciplines (math, physics, etc.). Versions that are deduplicated or filtered according to specific grade levels can also be produced.
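As a rough illustration of producing a stricter or deduplicated variant, the sketch below filters records by classifier score and drops exact duplicates after light text normalization. It assumes each record is a dictionary with a `text` field and a numeric `score` field; both field names and the threshold are assumptions for illustration, and a score cutoff is only a stand-in for grade-level filtering, not an equivalent of it.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def filter_by_score(records: Iterable[Dict], min_score: float) -> Iterator[Dict]:
    """Yield only records whose (assumed) `score` field is at least `min_score`."""
    for rec in records:
        if rec.get("score", 0.0) >= min_score:
            yield rec


def drop_exact_duplicates(records: Iterable[Dict]) -> Iterator[Dict]:
    """Yield the first record for each distinct text, ignoring case and whitespace runs."""
    seen = set()
    for rec in records:
        # Normalize so trivial formatting differences do not defeat the comparison.
        normalized = " ".join(rec["text"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield rec
```

Both helpers are generators, so they can be chained over a streamed dataset without loading it into memory, for example `drop_exact_duplicates(filter_by_score(stream, 3.5))`. Exact-match hashing only catches identical pages; near-duplicate detection would need a different technique such as MinHash.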

🔎 In summary

Criterion: Evaluation
🧩 Ease of use: ⭐⭐⭐⭐☆ (Parquet files with streaming support, optimized usage with datatrove)
🧼 Need for cleaning: ⭐⭐⭐⭐☆ (Low: pre-filtered for educational quality, already cleaned)
🏷️ Annotation richness: ⭐⭐☆☆☆ (Not manually annotated; pages are scored by an LLM-derived classifier)
📜 Commercial license: ✅ Yes (ODC-By 1.0)
👨‍💻 Beginner friendly: ⚠️ No. The dataset is large and requires suitable tools (streaming, Git LFS, datatrove)
🔁 Reusable for fine-tuning: 🔥 Well suited to pre-training and educational fine-tuning
🌍 Cultural diversity: 🌐 Strongly dependent on global web content, moderate biases

🧠 Recommended for

  • Educational LLM developers
  • NLP researchers
  • Open-source educational institutions

🔧 Compatible tools

  • Datatrove
  • Hugging Face Datasets
  • PyTorch
  • Streaming Parquet

💡 Tip

For quick experiments, use the sampled versions (10B, 100B, or 350B tokens) to speed up your training iterations.
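A minimal sketch of working with a sampled version through Hugging Face Datasets in streaming mode is shown below. The repository id `HuggingFaceFW/fineweb-edu`, the config name `sample-10BT`, and the `text` field are assumptions not stated on this page, so check the dataset card before relying on them. `open_sample` and `preview` are hypothetical helper names.

```python
from itertools import islice
from typing import Dict, Iterable, List


def open_sample(config_name: str = "sample-10BT") -> Iterable[Dict]:
    """Open a sampled version in streaming mode (needs `datasets` and network access)."""
    # Imported lazily so that `preview` below works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/fineweb-edu",  # assumed Hub repository id
        name=config_name,             # assumed config name for the 10B-token sample
        split="train",
        streaming=True,               # iterate without downloading every Parquet file first
    )


def preview(records: Iterable[Dict], limit: int = 3, max_chars: int = 80) -> List[str]:
    """Return the first `limit` texts, each truncated to `max_chars` characters."""
    return [rec["text"][:max_chars] for rec in islice(records, limit)]
```

Calling `preview(open_sample())` would fetch only the first few records, which is usually enough to sanity-check a pipeline before committing to a larger sample.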

Frequently Asked Questions

Does FineWeb-edu only contain academic content?

No, it contains any type of content deemed “educational” by the classifier (e.g. practical guides, courses, encyclopedic explanations, etc.).

What is the difference between FineWeb and FineWeb-edu?

FineWeb-edu is a filtered version of FineWeb containing only pages that received a high educational-quality score from a classifier trained on Llama3 annotations.

Can FineWeb-edu be used to train a multilingual model?

The content is mostly in English, though some multilingual pages may be included. For multilingual training, supplement it with dedicated multilingual datasets.
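One way to check the language mix before deciding how much multilingual data to add is to tally a language tag over a sample of records. The sketch below assumes each record carries a `language` field; that field name is an assumption, and `language_mix` is a hypothetical helper.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable


def language_mix(records: Iterable[Dict], sample_size: int = 1000) -> Dict[str, float]:
    """Return the share of each (assumed) `language` tag among the first `sample_size` records."""
    counts = Counter(
        rec.get("language", "unknown") for rec in islice(records, sample_size)
    )
    total = sum(counts.values())
    # An empty input yields an empty mapping rather than a division by zero.
    return {lang: n / total for lang, n in counts.items()} if total else {}
```

The result is only an estimate from the head of the stream, so a shuffled or larger sample gives a more reliable picture.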
