Common Crawl
Common Crawl is one of the largest freely available sources of plain text. It is a public archive of billions of web pages, crawled and collected at regular intervals by web crawlers. This massive corpus is an essential resource for training large language models (LLMs).
Several terabytes of text data, in WARC (Web ARchive) format
Open data under the Common Crawl Terms of Use. Rights verification required for commercial use, depending on the content
Description
The Common Crawl dataset includes:
- Several terabytes of plain text from the web
- WARC (Web ARchive) format, which stores metadata, raw HTML content, and full HTTP responses (a reading sketch follows this list)
- Very broad coverage: news, blogs, forums, encyclopedias, online stores, etc.
- Crawl snapshots published regularly (roughly monthly in recent years) since 2008
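To illustrate, the sketch below reads HTTP response records from a WARC file with the warcio library and yields the URL and raw HTML of each page. The library choice and the file name are assumptions made for this example; Common Crawl itself does not prescribe a particular reader.

```python
# Minimal sketch: iterating over HTTP responses in a Common Crawl WARC file.
# Assumes the `warcio` package is installed; the file name is a placeholder.
from warcio.archiveiterator import ArchiveIterator

def iter_html_pages(warc_path):
    """Yield (url, raw_html_bytes) for each HTTP response record."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue  # skip request and metadata records
            url = record.rec_headers.get_header("WARC-Target-URI")
            payload = record.content_stream().read()
            yield url, payload

# Example usage (the path is illustrative):
# for url, html in iter_html_pages("CC-MAIN-example.warc.gz"):
#     print(url, len(html))
```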
What is this dataset for?
Common Crawl is used for:
- Training large language models (GPT, Falcon, LLaMA, etc.)
- Studying linguistic change, bias, and online representation
- Improving search engines and automatic indexing systems
- Building specialized corpora by applying thematic or linguistic filters to the data (see the filtering sketch after this list)
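As a rough illustration of the last point, the sketch below keeps only documents that are in a target language and match a small thematic keyword list. The langdetect package, the length threshold, and the keyword set are illustrative assumptions, not part of Common Crawl.

```python
# Minimal sketch: keeping documents in a target language that match a theme.
# `langdetect`, the length threshold, and the keyword list are illustrative
# assumptions, not part of Common Crawl itself.
from langdetect import detect

MEDICAL_KEYWORDS = {"patient", "diagnosis", "treatment", "clinical"}

def keep_document(text, lang="en", keywords=MEDICAL_KEYWORDS):
    """Return True if the text is in the target language and mentions the theme."""
    if len(text) < 200:           # drop very short or empty pages
        return False
    try:
        if detect(text) != lang:  # language filter
            return False
    except Exception:             # detection can fail on noisy input
        return False
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

# Example usage, assuming `documents` is an iterable of extracted texts:
# corpus = [doc for doc in documents if keep_document(doc)]
```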
Can it be enriched or improved?
Yes, the dataset can be adapted and filtered to:
- Remove duplicates, low-quality pages, and non-text content (see the deduplication sketch after this list)
- Extract specific domains (medical, legal, education, etc.)
- Create multilingual versions or versions focused on particular regions of the world
- Annotate texts for classification, summarization, extraction, or machine translation tasks
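For the deduplication step, a common first pass is exact-duplicate removal by hashing lightly normalized text. The sketch below uses only the standard library; real pipelines typically add near-duplicate detection (for example MinHash/LSH), which is beyond the scope of this example.

```python
# Minimal sketch: exact deduplication of extracted texts by content hash.
# Real pipelines usually add near-duplicate detection (e.g. MinHash/LSH).
import hashlib

def deduplicate(texts):
    """Return texts with exact duplicates removed (after light normalization)."""
    seen = set()
    unique = []
    for text in texts:
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

print(len(deduplicate(["Hello  world", "hello world", "Another page"])))  # -> 2
```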
🔗 Source: Common Crawl Dataset
Frequently Asked Questions
Can Common Crawl be used directly as it is?
No. Because of its volume and raw structure, it requires significant processing: cleaning, extraction of useful text, filtering by language or domain, etc.
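As an example of the "extraction of useful text" step, the sketch below converts a raw HTML payload into plain text with BeautifulSoup. This is a deliberately simplified assumption; production corpus construction usually relies on dedicated boilerplate-removal tools.

```python
# Minimal sketch: turning a raw HTML payload into plain text.
# BeautifulSoup is one possible choice; dedicated boilerplate removers
# are generally preferred for serious corpus construction.
from bs4 import BeautifulSoup

def html_to_text(html_bytes):
    """Extract visible text from an HTML page, dropping scripts and styles."""
    soup = BeautifulSoup(html_bytes, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

print(html_to_text(b"<html><body><p>Hello <b>world</b></p></body></html>"))
# -> "Hello world"
```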
Does Common Crawl contain personal information?
Since the corpus comes from the web, it may inadvertently include personal data. It is therefore essential to apply privacy filters before any sensitive or commercial use.
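As a hedged illustration of such filters, the sketch below redacts two obvious categories of personal data (e-mail addresses and phone-number-like strings) with regular expressions. Real privacy filtering needs far broader coverage and usually dedicated tooling; the patterns here are assumptions for the example only.

```python
# Minimal sketch: redacting a few obvious PII patterns with regular expressions.
# Illustrative only; serious privacy filtering needs much broader coverage
# (names, addresses, identifiers) and usually dedicated tools.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text):
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(redact_pii("Contact jane.doe@example.org or +33 6 12 34 56 78"))
# -> "Contact [EMAIL] or [PHONE]"
```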
Why is Common Crawl used for LLMs?
Its size, thematic diversity and accessibility make it an ideal basis for training models capable of generalizing to varied and complex contexts.