GitHub Code Snippets

A very large corpus of code snippets collected from GitHub repositories with more than 10,000 stars. It covers multiple programming languages and is well suited to training code generation models.

Size

97 million snippets; text formats (JSON or plain text); multiple languages (Python, JavaScript, etc.)

License

CC BY 4.0

Description

The GitHub Code Snippets dataset includes over 97 million open-source code snippets from very popular GitHub projects (those with more than 10,000 stars). It covers numerous programming languages, including Python, JavaScript, Java, C++, Go, and Rust, among others. Each snippet is isolated, which makes it easy to process for NLP and code tasks. It is designed for training code completion, syntactic analysis, and code recommendation models.

What is this dataset for?

Train LLMs specialized in code generation or completion
Create intelligent development assistants (Copilot-style)
Analyze common code styles or syntactic structures
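
As a rough illustration of the last use case, the sketch below counts syntax-tree node types across a handful of Python snippets using the standard `ast` module. The `count_node_types` helper and the sample snippets are hypothetical and not part of the dataset; isolated snippets may not parse on their own, so the sketch skips those and reports how many were skipped.

```python
import ast
from collections import Counter


def count_node_types(snippets):
    """Count AST node types over Python snippets (hypothetical helper).

    Returns a Counter of node type names and the number of snippets
    that could not be parsed and were skipped.
    """
    counts = Counter()
    skipped = 0
    for source in snippets:
        try:
            tree = ast.parse(source)
        except (SyntaxError, ValueError):
            # Isolated snippets are often incomplete; skip rather than fail.
            skipped += 1
            continue
        counts.update(type(node).__name__ for node in ast.walk(tree))
    return counts, skipped


# Made-up sample snippets, standing in for records from the dataset.
samples = [
    "def add(a, b):\n    return a + b\n",
    "for i in range(3):\n    print(i)\n",
    "def broken(:\n",  # deliberately invalid
]
counts, skipped = count_node_types(samples)
print(counts.most_common(5), "skipped:", skipped)
```

This only handles Python; other languages in the corpus would need their own parsers.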

Can it be enriched or improved?

Yes, you can enrich it by associating each snippet with its detected language, adding the context of the source file or integrating metadata such as the repository name, the original license or the timestamp. It can also be cleaned to remove duplicates or filter out content that is too short or too simple.
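
The cleaning step described above could look something like the following minimal sketch. It assumes each record is a dictionary with a `code` field plus optional metadata such as `repo`; these field names, the `clean_snippets` helper, and the 40-character threshold are illustrative assumptions, not the dataset's documented schema.

```python
import hashlib

MIN_CHARS = 40  # assumed threshold for "too short"; tune for your use case


def normalize(code):
    """Strip trailing whitespace per line and blank lines at both ends."""
    lines = [line.rstrip() for line in code.splitlines()]
    return "\n".join(lines).strip("\n")


def clean_snippets(records, min_chars=MIN_CHARS):
    """Yield records that are long enough and not exact duplicates.

    Duplicates are detected by hashing the normalized code, so snippets
    that differ only in trailing whitespace count as the same snippet.
    """
    seen = set()
    for record in records:
        normalized = normalize(record["code"])
        if len(normalized) < min_chars:
            continue
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Keep existing metadata and attach the hash as an extra field.
        yield {**record, "code": normalized, "sha256": digest}


# Made-up records using the assumed schema.
records = [
    {"code": "def add(a, b):\n    return a + b\n\nprint(add(1, 2))\n", "repo": "example/repo"},
    {"code": "def add(a, b):   \n    return a + b\n\nprint(add(1, 2))", "repo": "example/fork"},
    {"code": "x = 1", "repo": "example/tiny"},
]
cleaned = list(clean_snippets(records))
print(len(cleaned), cleaned[0]["repo"])
```

Exact-match hashing only catches identical snippets after normalization; near-duplicates (renamed variables, reordered imports) would need a fuzzier technique. At 97 million snippets the `seen` set would also be large, so a streaming or sharded approach may be preferable.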

🔎 In summary

Criterion	Evaluation
🧩Ease of use	⭐⭐⭐☆☆ (Massive volume, requires appropriate tools)
🧼Need for cleaning	⭐⭐☆☆☆ (Moderate – Important to standardize formats)
🏷️Annotation richness	⭐☆☆☆☆ (Low – Mainly raw content)
📜Commercial license	✅ Yes (CC BY 4.0)
👨‍💻Beginner friendly	❌ No – complex and large-scale handling
🔁Reusable for fine-tuning	🔥 Excellent for code-oriented LLMs
🌍Cultural diversity	🌐 Good variety of programming languages, but biased toward popular GitHub projects