Medical Instruction 100K

This free dataset brings together dialogues between humans and AI models in a medical setting. It covers prescriptions, natural treatments, medications, and wellness advice.

Download dataset

Size

Around 100,000 examples in JSONL

Licence

MIT

Description

‍

Medical Instruction 100K is a corpus of textual data intended for training language models in medical contexts. It compiles around 100,000 sample dialogues containing instructions and answers related to health: medication names, breathing tips, yogic exercises, or natural remedies.

‍

What is this dataset for?

‍

Train LLM models specialized in text-based medical assistance
Simulate dialogues between patients and wellness practitioners or coaches
Test the medical understanding of generative models on various scenarios

‍

Can it be enriched or improved?

‍

Yes. This dataset can be improved by adding annotations (risk levels, disease categories, languages), by translating it or by adapting it to local use cases (traditional medicine, local nutrition, etc.). It can also be used as a base for RLHF or instruct-tuning projects in a medical setting.

‍

🔎 In summary

Criterion	Evaluation
🧩 Ease of use	⭐⭐⭐⭐⭐ (JSONL format, easy to load)
🧼 Need for cleaning	⭐⭐⭐⭐✩ (Light – check duplicates and consistency)
🏷️ Annotation richness	⭐⭐✩✩✩ (Low – no structured annotations)
📜 Commercial license	✅ Yes (MIT)
👨‍💻 Beginner friendly	✅ Yes – simple to use with minimal resources
🔁 Fine-tuning ready	🩺 Very suitable for specialized health models
🌍 Cultural diversity	⚠️ Medium – vocabulary mostly English, with natural/global elements