GitHub Code Snippets
Very large corpus of code snippets collected from GitHub repositories with more than 10,000 stars. Multi-language, ideal for training code generation models.
97 million snippets; text formats (JSON or plain text); multiple languages (Python, JavaScript, etc.)
CC BY 4.0
Description
The GitHub Code Snippets dataset includes over 97 million open-source code snippets from very popular GitHub projects (over 10,000 stars). It covers numerous programming languages, including Python, JavaScript, Java, C++, Go, and Rust. Each snippet is stored in isolation, without its surrounding file, which makes it easy to process for NLP and code tasks. The dataset is designed for training code completion, syntactic analysis, and code recommendation models.
What is this dataset for?
- Train LLMs specialized in code generation or completion
- Create intelligent development assistants ("Copilot" type)
- Analyze common code styles or syntactic structures
Can it be enriched or improved?
Yes. You can enrich it by tagging each snippet with its detected language, adding context from the source file, or attaching metadata such as the repository name, the original license, or a timestamp. It can also be cleaned by removing duplicates and filtering out snippets that are too short or too trivial.
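As a rough illustration of the cleaning step, the sketch below drops very short snippets and duplicates. It assumes snippets arrive as plain Python strings. The `MIN_CHARS` threshold, the whitespace normalization, and the function names are illustrative choices, not part of the dataset.

```python
import hashlib
import re

MIN_CHARS = 40  # arbitrary threshold chosen for illustration


def normalize(code: str) -> str:
    """Collapse whitespace runs so trivially reformatted copies hash the same."""
    return re.sub(r"\s+", " ", code).strip()


def clean_snippets(snippets):
    """Yield snippets that are long enough and have not been seen before."""
    seen = set()
    for code in snippets:
        normalized = normalize(code)
        if len(normalized) < MIN_CHARS:
            continue  # too short to carry much signal
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate after whitespace normalization
        seen.add(digest)
        yield code
```

Hashing the normalized text catches only exact and whitespace-level duplicates. Near-duplicates (renamed variables, small edits) would need a fuzzier technique such as MinHash.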
🔎 In summary
🧠 Recommended for
- Generative AI researchers
- LLM developers
- Copilot-like projects
🔧 Compatible tools
- Transformers (CodeT5, StarCoder)
- Jupyter
- Apache Arrow
- BigQuery
💡 Tip
Pre-filter by programming language and snippet size to improve training efficiency.
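A minimal sketch of such a pre-filter is shown below. It assumes each record is a dict with hypothetical `language` and `code` keys. The actual field names in the dataset may differ, and the line-count bounds are arbitrary defaults.

```python
def prefilter(records, languages, min_lines=3, max_lines=200):
    """Yield records in the wanted languages whose line count is within bounds.

    The "language" and "code" keys are assumed field names, not a documented schema.
    """
    wanted = {lang.lower() for lang in languages}
    for record in records:
        if record.get("language", "").lower() not in wanted:
            continue  # language not selected for this training run
        n_lines = len(record.get("code", "").splitlines())
        if min_lines <= n_lines <= max_lines:
            yield record


# Example usage with made-up records
sample = [
    {"language": "Python", "code": "def f():\n    x = 1\n    return x\n"},
    {"language": "Python", "code": "x = 1"},
    {"language": "Go", "code": "package main\n\nfunc main() {\n}\n"},
]
kept = list(prefilter(sample, ["python"]))
```

Because the function is a generator, it can be applied to a stream of records without loading the whole corpus into memory.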
Frequently Asked Questions
Does the dataset contain complete files or only extracts?
These are code snippets only, with no full file context.
Can this dataset be used to generate code in production?
Yes, provided it is supplemented with more contextualized examples so that the generated code reflects real-world coding practices.
Is it possible to automatically detect the language of each snippet?
Yes, tools like Pygments or GitHub Linguist can be used to detect and categorize snippets by language.
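For example, language detection with Pygments might look like the sketch below. It uses `guess_lexer` from `pygments.lexers`. The wrapper function `detect_language` is a hypothetical helper, and the fallback to `None` when Pygments is not installed is a choice made for this example. Guesses on very short snippets tend to be unreliable, so treat the result as a hint rather than a label.

```python
try:
    # Pygments is a third-party package (pip install Pygments)
    from pygments.lexers import guess_lexer
    from pygments.util import ClassNotFound
except ImportError:
    guess_lexer = None
    ClassNotFound = Exception  # placeholder so the except clause below stays valid


def detect_language(code):
    """Return Pygments' best-guess language name, or None if no guess is available."""
    if guess_lexer is None or not code.strip():
        return None
    try:
        return guess_lexer(code).name
    except ClassNotFound:
        return None


# Example usage: the shebang line gives the lexer a strong hint
language = detect_language("#!/usr/bin/env python\nprint('hello')\n")
```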