Medical Instruction 100K

This free dataset brings together dialogues between humans and AI models in a medical setting. It covers prescriptions, natural treatments, medications, and wellness advice.

Size

Around 100,000 examples in JSONL

License

MIT

Description

Medical Instruction 100K is a corpus of textual data intended for training language models in medical contexts. It compiles around 100,000 sample dialogues containing instructions and answers related to health: medication names, breathing tips, yogic exercises, or natural remedies.

What is this dataset for?

  • Train LLMs specialized in text-based medical assistance
  • Simulate dialogues between patients and wellness practitioners or coaches
  • Test the medical understanding of generative models on various scenarios

Can it be enriched or improved?

Yes. This dataset can be improved by adding annotations (risk levels, disease categories, languages), by translating it or by adapting it to local use cases (traditional medicine, local nutrition, etc.). It can also be used as a base for RLHF or instruct-tuning projects in a medical setting.

🔎 In summary

Criterion | Evaluation
🧩 Ease of use | ⭐⭐⭐⭐⭐ (JSONL format, easy to load)
🧼 Need for cleaning | ⭐⭐⭐⭐✩ (Light – check duplicates and consistency)
🏷️ Annotation richness | ⭐⭐✩✩✩ (Low – no structured annotations)
📜 Commercial license | ✅ Yes (MIT)
👨‍💻 Beginner friendly | ✅ Yes – simple to use with minimal resources
🔁 Fine-tuning ready | 🩺 Very suitable for specialized health models
🌍 Cultural diversity | ⚠️ Medium – vocabulary mostly English, with natural/global elements
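
The summary notes that the JSONL format is easy to load and that light cleaning (duplicates, consistency) is still advisable. A minimal sketch of both steps in Python follows. The field names `instruction` and `output` are assumptions, since this page does not document the schema; inspect one record and adjust the `keys` argument to match the actual file.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file: one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def normalize(text):
    """Lowercase and collapse whitespace so near-identical texts compare equal."""
    return " ".join(str(text).lower().split())


# NOTE: "instruction" and "output" are assumed field names, not confirmed by the source.
def drop_duplicates(records, keys=("instruction", "output")):
    """Keep the first record for each normalized combination of the given fields."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = tuple(normalize(record.get(key, "")) for key in keys)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


def find_incomplete(records, keys=("instruction", "output")):
    """Return the positions of records where any expected field is missing or empty."""
    return [
        index
        for index, record in enumerate(records)
        if any(not normalize(record.get(key, "")) for key in keys)
    ]
```

A typical pass would call `load_jsonl` on the downloaded file, then `drop_duplicates`, then review whatever `find_incomplete` reports before training.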

🧠 Recommended for

  • Creators of health chatbots
  • Wellness coaching projects
  • Specialized LLMs

🔧 Compatible tools

  • Hugging Face Transformers
  • QLoRA
  • PyTorch
  • vLLM

💡 Tip

For better impact, cross-reference this dataset with clinically validated or multilingual sources.

Frequently Asked Questions

Can this dataset be used in clinical applications?

No. It is intended for exploratory and assistive use only. Any clinical application requires validation by medical experts.

Is it possible to filter the dataset by content type?

Currently no, but you can add thematic filters (pharmacology, well-being...) by manually annotating the examples.
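
Keyword matching can give manual annotation a head start by proposing a theme for each example. The sketch below is one possible first pass, not a feature of the dataset: the theme names, the keyword lists, and the `instruction` and `output` field names are all illustrative assumptions, and substring matching is crude enough that the proposed tags should still be reviewed by hand.

```python
# Hypothetical themes and keywords, chosen for illustration only.
THEME_KEYWORDS = {
    "pharmacology": ("medication", "dose", "tablet", "prescription"),
    "well-being": ("yoga", "breathing", "meditation", "sleep"),
}


def tag_themes(text, theme_keywords=THEME_KEYWORDS):
    """Return the sorted list of themes whose keywords occur in the text."""
    lowered = text.lower()
    return sorted(
        theme
        for theme, keywords in theme_keywords.items()
        if any(keyword in lowered for keyword in keywords)
    )


# NOTE: "instruction" and "output" are assumed field names, not confirmed by the source.
def filter_by_theme(records, theme, text_keys=("instruction", "output")):
    """Keep the records whose combined text fields match the requested theme."""
    selected = []
    for record in records:
        combined = " ".join(str(record.get(key, "")) for key in text_keys)
        if theme in tag_themes(combined):
            selected.append(record)
    return selected
```

For example, `filter_by_theme(records, "pharmacology")` would return only the records that mention one of the pharmacology keywords, which an annotator can then confirm or correct.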

Does this dataset contain multilingual sources?

No, the data is mostly in English. A controlled translation is recommended for multilingual use.
