GitHub Code Snippets

Very large corpus of code snippets collected from GitHub repositories with more than 10,000 stars. Multi-language, ideal for training code generation models.

Size

97 million snippets in text formats (JSON or plain text), covering multiple languages (Python, JavaScript, etc.)

License

CC BY 4.0

Description

The GitHub Code Snippets dataset includes over 97 million open-source code snippets from very popular GitHub projects (over 10,000 stars). It covers numerous programming languages, including Python, JavaScript, Java, C++, Go, and Rust. Each snippet is isolated, which makes it easy to process for NLP and code tasks. The dataset is designed for training code completion, syntactic analysis, and code recommendation models.

What is this dataset for?

  • Train LLMs specialized in code generation or completion
  • Create intelligent development assistants ("Copilot" type)
  • Analyze common code styles or syntactic structures

Can it be enriched or improved?

Yes, you can enrich it by associating each snippet with its detected language, adding the context of the source file or integrating metadata such as the repository name, the original license or the timestamp. It can also be cleaned to remove duplicates or filter out content that is too short or too simple.
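The cleaning step mentioned above (removing duplicates and very short snippets) could be sketched roughly as follows. This is a minimal illustration, not the dataset's official tooling: the function names and the `min_lines` / `min_chars` thresholds are assumptions chosen for the example, and only exact duplicates (after whitespace normalization) are detected.

```python
import hashlib


def normalize(snippet: str) -> str:
    """Trim surrounding blank space and trailing whitespace on each line."""
    lines = [line.rstrip() for line in snippet.strip().splitlines()]
    return "\n".join(lines)


def clean_snippets(snippets, min_lines=2, min_chars=20):
    """Drop snippets that are too short, then drop exact duplicates.

    min_lines and min_chars are hypothetical thresholds; tune them
    to the corpus and the target task.
    """
    seen = set()
    kept = []
    for snippet in snippets:
        text = normalize(snippet)
        # Filter out content that is too short to be useful.
        if len(text) < min_chars or len(text.splitlines()) < min_lines:
            continue
        # Hash the normalized text so duplicates are found in O(1).
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


samples = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):   \n    return a + b",  # same code, different whitespace
    "x = 1",                                 # too short
]
cleaned = clean_snippets(samples)
```

Storing hashes rather than full snippet text keeps memory use lower on a corpus of this size, although near-duplicates (renamed variables, reordered imports) would need a fuzzier technique such as MinHash.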

🔎 In summary

Criterion: Evaluation
🧩 Ease of use: ⭐⭐⭐☆☆ (Massive volume, requires appropriate tools)
🧼 Need for cleaning: ⭐⭐☆☆☆ (Moderate – Important to standardize formats)
🏷️ Annotation richness: ⭐☆☆☆☆ (Low – Mainly raw content)
📜 Commercial license: ✅ Yes (CC BY 4.0)
👨‍💻 Beginner friendly: ❌ No – complex and large-scale handling
🔁 Reusable for fine-tuning: 🔥 Excellent for code-oriented LLMs
🌍 Cultural diversity: 🌐 Good variety of languages, but biased toward popular GitHub projects

🧠 Recommended for

  • Generative AI researchers
  • LLM developers
  • Copilot-like projects

🔧 Compatible tools

  • Transformers (CodeT5, StarCoder)
  • Jupyter
  • Apache Arrow
  • BigQuery

💡 Tip

Pre-filter by programming language and snippet size to improve training efficiency.
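As a rough sketch of this tip, the generator below keeps only records in the wanted languages whose code length falls within a size range. The record layout is an assumption: the field names `language` and `code` and the character thresholds are hypothetical, since the dataset's actual schema is not described here.

```python
def prefilter(records, languages, min_chars=50, max_chars=2000):
    """Yield records in the wanted languages and within the size range.

    Assumes each record is a dict with hypothetical "language" and
    "code" keys; adapt the key names to the real schema.
    """
    wanted = {lang.lower() for lang in languages}
    for record in records:
        lang = str(record.get("language", "")).lower()
        code = record.get("code", "")
        if lang in wanted and min_chars <= len(code) <= max_chars:
            yield record


records = [
    {"language": "Python", "code": "def square(x):\n    return x * x\n"},
    {"language": "Go", "code": "func square(x int) int { return x * x }"},
    {"language": "Python", "code": "x=1"},
]
# Small thresholds here only so the toy records can pass the filter.
selected = list(prefilter(records, ["python"], min_chars=10, max_chars=500))
```

Because `prefilter` is a generator, it can be chained onto a streaming reader so the 97 million snippets never have to be held in memory at once.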

Frequently Asked Questions

Does the dataset contain complete files or only extracts?

These are code snippets only, with no full file context.

Can this dataset be used to generate code in production?

Yes, as long as it is supplemented with more contextualized examples, in particular so that the generated code respects real-world coding practices.

Is it possible to automatically detect the language of each snippet?

Yes, tools like Pygments or GitHub Linguist can be used to detect and categorize snippets by language.
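For instance, a minimal sketch using Pygments' `guess_lexer` might look like this. The helper name `detect_language` is hypothetical, and the import is guarded so that the function simply returns `None` when Pygments (a third-party package) is not installed. The guess is heuristic and tends to be unreliable on very short snippets, so treat the label as a hint rather than ground truth.

```python
try:
    from pygments.lexers import guess_lexer
    from pygments.util import ClassNotFound
except ImportError:
    # Pygments is not installed; detection is disabled.
    guess_lexer = None
    ClassNotFound = Exception


def detect_language(snippet):
    """Return Pygments' best-guess lexer name for a snippet, or None.

    Returns None for blank input, when Pygments is unavailable,
    or when no lexer matches.
    """
    if guess_lexer is None or not snippet.strip():
        return None
    try:
        return guess_lexer(snippet).name
    except ClassNotFound:
        return None


label = detect_language("#!/usr/bin/env python\nimport os\nprint(os.getcwd())\n")
```

When the source file name is available as metadata, a lookup by file extension is usually more dependable than guessing from content alone.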
