
Common Crawl

Common Crawl is one of the largest freely available sources of plain text. It is a public archive of billions of web pages, crawled and collected at regular intervals by automated web crawlers. This massive corpus has become an essential resource for training large language models (LLMs).

Download dataset
Size

Several terabytes of text data, in WARC format (Web ARchive)

Licence

Open data under the Common Crawl terms of use. Commercial use may require verifying rights to the underlying content.

Description


The Common Crawl dataset includes:

  • Several terabytes of plain text from the web
  • WARC (Web ARchive) formats, used to store metadata, HTML content, and full HTTP responses
  • Very broad coverage: news, blogs, forums, encyclopedias, online stores, etc.
  • Monthly versions available since 2008
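To make the WARC layout mentioned above concrete, here is a minimal sketch of one record: a version line, a block of `Key: Value` headers, then a payload whose size is given by `Content-Length`. Real pipelines would use a dedicated library such as warcio rather than this hand-rolled parser; the sample record is invented for illustration.

```python
# Illustrative WARC record, built in memory (not taken from a real crawl).
SAMPLE_WARC = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: http://example.com/\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "Hello, world!"
    "\r\n\r\n"
)

def parse_warc_record(raw: str):
    """Split one WARC record into (version, headers dict, payload string)."""
    head, _, rest = raw.partition("\r\n\r\n")   # headers end at a blank line
    lines = head.split("\r\n")
    version = lines[0]                           # e.g. "WARC/1.0"
    headers = dict(line.split(": ", 1) for line in lines[1:])
    length = int(headers["Content-Length"])      # payload size in bytes
    payload = rest[:length]
    return version, headers, payload

version, headers, payload = parse_warc_record(SAMPLE_WARC)
print(version, headers["WARC-Target-URI"], payload)
```

In practice each Common Crawl WARC file concatenates millions of such records, gzip-compressed, which is why streaming readers are preferred over loading files whole.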

What is this dataset for?


Common Crawl is used for:

  • Training large language models (GPT, Falcon, Llama, etc.)
  • Studying linguistic change, biases, and online representations
  • Improving search engines and automatic indexing systems
  • Building specialized corpora by applying thematic or linguistic filters to the data

Can it be enriched or improved?


Yes, the dataset can be adapted and filtered to:

  • Clean up duplicates, low quality pages, or non-text content
  • Extract specific areas (medical, legal, education, etc.)
  • Create multilingual versions or versions focused on certain regions of the world
  • Annotate texts for classification, summary, extraction, or machine translation tasks
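The first bullet above, removing duplicates and low-quality pages, can be sketched as a small filtering pass. The thresholds below are illustrative placeholders, not the values used by any particular production pipeline.

```python
import hashlib

def clean_corpus(docs, min_chars=50, min_alpha_ratio=0.6):
    """Drop exact duplicates, very short pages, and mostly non-text pages."""
    seen = set()
    kept = []
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue                     # exact duplicate of an earlier page
        seen.add(digest)
        if len(text) < min_chars:
            continue                     # too short to be useful training text
        alpha = sum(c.isalpha() for c in text) / max(len(text), 1)
        if alpha < min_alpha_ratio:
            continue                     # mostly digits, markup, or noise
        kept.append(text)
    return kept

sample = [
    "The quick brown fox jumps over the lazy dog. " * 3,
    "The quick brown fox jumps over the lazy dog. " * 3,   # duplicate
    "short",                                               # too short
    "1234567890 " * 10,                                    # no letters
]
print(len(clean_corpus(sample)))
```

Real pipelines typically add near-duplicate detection (e.g. MinHash) on top of exact hashing, since web pages often differ only in boilerplate.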

🔗 Source: Common Crawl Dataset

Frequently Asked Questions

Can Common Crawl be used directly as it is?

No. Because of its volume and raw structure, it requires significant processing: cleaning, extraction of useful text, filtering by language or domain, etc.
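The language filtering mentioned here is usually done with a trained language identifier (e.g. fastText's lid model); as a crude illustration of the idea only, a stopword-ratio heuristic can flag likely-English pages. The stopword list and threshold below are arbitrary assumptions.

```python
# Hypothetical, very rough English detector: counts common English
# function words. A real pipeline would use a proper language-ID model.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "is", "it", "that", "a", "for"}

def looks_english(text, threshold=0.08):
    """Return True if enough tokens are common English stopwords."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(w in ENGLISH_STOPWORDS for w in words)
    return hits / len(words) >= threshold

print(looks_english("the cat sat on the mat and it is warm"))
print(looks_english("le chat est assis sur le tapis"))
```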

Does Common Crawl contain personal information?

Since the corpus comes from the web, it may inadvertently include personal information. It is therefore essential to apply privacy filters before any sensitive or commercial use.
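A privacy filter of the kind described above can be sketched with regex-based redaction of two easily spotted PII types, email addresses and phone-like numbers. This is only an illustration; real pipelines combine many more detectors (names, addresses, IDs) and the patterns below are simplified assumptions.

```python
import re

# Simplified illustrative patterns; real PII detection is far more thorough.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(redact_pii("Contact jane.doe@example.com or +1 555 123 4567."))
```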

Why is Common Crawl used for LLMs?

Its size, thematic diversity and accessibility make it an ideal basis for training models capable of generalizing to varied and complex contexts.
