SFT General Knowledge - LLM training dataset

A large corpus for supervised fine-tuning of language models on tasks such as QA, writing, and reasoning.

Size

1.63 million examples (2.19 GB), JSON/Parquet format

Licence

MIT

Description

SFT-Dataset-General-Knowledge is a dataset designed for the supervised training of large language models (LLMs). It includes over 1.6 million instruction-response entries covering a broad range of general knowledge. Its size and topical breadth make it suitable for fine-tuning across multiple domains.

What is this dataset for?

Train an LLM on varied and contextualized responses
Fine-tune a model for instruction tuning or question answering (QA)
Evaluate the performance of a model on general understanding tasks
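
As a rough illustration of preparing entries for fine-tuning, the sketch below turns instruction/response records into training strings. It assumes a JSON Lines layout with keys named "instruction" and "response"; the dataset's actual file layout and field names may differ, and the prompt template is only one common convention, not something the dataset prescribes.

```python
import json
from typing import Iterable, Iterator

# Hypothetical prompt template; adapt it to the format your model expects.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"


def iter_entries(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed records, skipping blank lines and incomplete entries.

    Assumes one JSON object per line with "instruction" and "response"
    keys (assumed names, not confirmed by the dataset card).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("instruction") and entry.get("response"):
            yield entry


def format_example(entry: dict) -> str:
    """Render one record as a single training string."""
    return PROMPT_TEMPLATE.format(
        instruction=entry["instruction"].strip(),
        response=entry["response"].strip(),
    )
```

Passing an open file handle to iter_entries and mapping format_example over the result gives a stream of strings that can be tokenized by whichever training framework is in use.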


Can it be enriched or improved?

Yes. The data can be filtered or grouped by theme (science, culture, tech...) to build a specialized subset. Additional annotations (difficulty level, style, sources) would also make it more useful. The dataset is large enough to subsample selectively rather than train on every example.
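
One possible way to group by theme and then subsample is sketched below. Since the entries carry no theme metadata, the sketch tags each one with a naive keyword match on the instruction text; the keyword lists, the "instruction" key, and both function names are illustrative assumptions, and a real pipeline would likely use a classifier or embeddings instead.

```python
import random
from collections import defaultdict

# Illustrative keyword lists only; not part of the dataset.
THEME_KEYWORDS = {
    "science": ("physics", "chemistry", "biology"),
    "tech": ("software", "computer", "algorithm"),
}


def assign_theme(entry: dict) -> str:
    """Return the first theme whose keyword appears in the instruction."""
    text = entry["instruction"].lower()
    for theme, keywords in THEME_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return theme
    return "other"


def sample_per_theme(entries, per_theme: int, seed: int = 0) -> dict:
    """Group entries by theme, then draw up to `per_theme` from each group."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    groups = defaultdict(list)
    for entry in entries:
        groups[assign_theme(entry)].append(entry)
    return {
        theme: rng.sample(items, min(per_theme, len(items)))
        for theme, items in groups.items()
    }
```

Capping each theme at the same count keeps a small fine-tuning sample from being dominated by whichever topic is most frequent in the full corpus.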


🔎 In summary

Criterion	Evaluation
🧩 Ease of use	⭐⭐⭐⭐⭐ (Very simple – classic instruction/response format)
🧼 Need for cleaning	⭐⭐⭐⭐⭐ (Low – clean structure, though some specific cases need filtering)
🏷️ Annotation richness	⭐⭐⭐✩✩ (Medium – each entry contains instruction and response, no additional metadata)
📜 Commercial license	✅ Yes (MIT)
👨‍💻 Beginner friendly	🌟 Yes – ideal for testing fine-tuning on small samples
🔁 Fine-tuning ready	🎯 Perfect for SFT training
🌍 Cultural diversity	⚠️ Medium – generalist content, mostly English