Common Crawl
Common Crawl is one of the largest freely available sources of plain text. It is a public archive of billions of web pages, crawled and collected at regular intervals by web crawlers. This massive corpus is an essential resource for training large language models (LLMs).
Several terabytes of text data, in WARC (Web ARchive) format
Open data under the Common Crawl Terms of Use. Rights verification required for commercial use, depending on the content
Description
The Common Crawl dataset includes:
- Several terabytes of plain text from the web
- WARC (Web ARchive) format, which stores metadata, raw HTML content, and full HTTP responses (a reading sketch follows this list)
- Very broad coverage: news, blogs, forums, encyclopedias, online stores, etc.
- Crawl snapshots published regularly (roughly monthly in recent years) since 2008
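To illustrate, the sketch below reads HTTP response records from a WARC file with the warcio library and yields the URL and raw HTML of each page. The library choice and the file name are assumptions made for this example; Common Crawl itself does not prescribe a particular reader.

```python
# Minimal sketch: iterating over HTTP responses in a Common Crawl WARC file.
# Assumes the `warcio` package is installed; the file name is a placeholder.
from warcio.archiveiterator import ArchiveIterator

def iter_html_pages(warc_path):
    """Yield (url, raw_html_bytes) for each HTTP response record."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request and metadata records
            url = record.rec_headers.get_header("WARC-Target-URI")
            payload = record.content_stream().read()
            yield url, payload

# Example usage (the path is illustrative):
# for url, html in iter_html_pages("CC-MAIN-example.warc.gz"):
#     print(url, len(html))
```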
What is this dataset for?
Common Crawl is used for:
- Training large language models (GPT, Falcon, LLaMA, etc.)
- Studying linguistic change, bias, and online representation
- Improving search engines and automatic indexing systems
- Building specialized corpora by applying thematic or linguistic filters to the data (see the filtering sketch after this list)
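As a rough illustration of the last point, the sketch below keeps only documents that are in a target language and match a small thematic keyword list. The langdetect package, the length threshold, and the keyword set are illustrative assumptions, not part of Common Crawl.

```python
# Minimal sketch: keeping documents in a target language that match a theme.
# `langdetect`, the length threshold, and the keyword list are illustrative
# assumptions, not part of Common Crawl itself.
from langdetect import detect

MEDICAL_KEYWORDS = {"patient", "diagnosis", "treatment", "clinical"}

def keep_document(text, lang="en", keywords=MEDICAL_KEYWORDS):
    """Return True if the text is in the target language and mentions the theme."""
    if len(text) < 200:           # drop very short or empty pages
        return False
    try:
        if detect(text) != lang:  # language filter
            return False
    except Exception:             # detection can fail on noisy input
        return False
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

# Example usage, assuming `documents` is an iterable of extracted texts:
# corpus = [doc for doc in documents if keep_document(doc)]
```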
Can it be enriched or improved?
Yes, the dataset can be adapted and filtered to:
- Remove duplicates, low-quality pages, and non-text content (see the deduplication sketch after this list)
- Extract specific domains (medical, legal, education, etc.)
- Create multilingual versions or versions focused on particular regions of the world
- Annotate texts for classification, summarization, extraction, or machine translation tasks
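For the deduplication step, a common first pass is exact-duplicate removal by hashing lightly normalized text. The sketch below uses only the standard library; real pipelines typically add near-duplicate detection (for example MinHash/LSH), which is beyond the scope of this example.

```python
# Minimal sketch: exact deduplication of extracted texts by content hash.
# Real pipelines usually add near-duplicate detection (e.g. MinHash/LSH).
import hashlib

def deduplicate(texts):
    """Return texts with exact duplicates removed (after light normalization)."""
    seen = set()
    unique = []
    for text in texts:
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

print(len(deduplicate(["Hello  world", "hello world", "Another page"])))  # -> 2
```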
🔗 Source: Common Crawl Dataset
Frequently Asked Questions
Can Common Crawl be used directly as it is?
No. Because of its volume and raw structure, it requires significant processing: cleaning, extraction of useful text, filtering by language or domain, etc.
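As an example of the "extraction of useful text" step, the sketch below converts a raw HTML payload into plain text with BeautifulSoup. This is a deliberately simplified assumption; production corpus construction usually relies on dedicated boilerplate-removal tools.

```python
# Minimal sketch: turning a raw HTML payload into plain text.
# BeautifulSoup is one possible choice; dedicated boilerplate removers
# are generally preferred for serious corpus construction.
from bs4 import BeautifulSoup

def html_to_text(html_bytes):
    """Extract visible text from an HTML page, dropping scripts and styles."""
    soup = BeautifulSoup(html_bytes, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

print(html_to_text(b"<html><body><p>Hello <b>world</b></p></body></html>"))
# -> "Hello world"
```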
Does Common Crawl contain personal information?
Since the corpus comes from the web, it may inadvertently include personal data. It is therefore essential to apply privacy filters before any sensitive or commercial use.
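As a hedged illustration of such filters, the sketch below redacts two obvious categories of personal data (e-mail addresses and phone-number-like strings) with regular expressions. Real privacy filtering needs far broader coverage and usually dedicated tooling; the patterns here are assumptions for the example only.

```python
# Minimal sketch: redacting a few obvious PII patterns with regular expressions.
# Illustrative only; serious privacy filtering needs much broader coverage
# (names, addresses, identifiers) and usually dedicated tools.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text):
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(redact_pii("Contact jane.doe@example.org or +33 6 12 34 56 78"))
# -> "Contact [EMAIL] or [PHONE]"
```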
Why is Common Crawl used for LLMs?
Its size, thematic diversity and accessibility make it an ideal basis for training models capable of generalizing to varied and complex contexts.