Medical Instruction 100K
This free dataset brings together dialogues between humans and AI models in a medical setting. It covers prescriptions, natural treatments, medications, and wellness advice.
Description
Medical Instruction 100K is a text corpus intended for training language models in medical contexts. It compiles around 100,000 sample dialogues made up of health-related instructions and answers, covering medication names, breathing tips, yoga exercises, and natural remedies.
What is this dataset for?
- Train LLMs specialized in text-based medical assistance
- Simulate dialogues between patients and wellness practitioners or coaches
- Test the medical understanding of generative models on various scenarios
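As a rough illustration of the training use case, the sketch below turns one record into a prompt/completion pair. The field names `instruction`, `input`, and `output`, the prompt template, and the sample record are all assumptions made for illustration; check the actual columns of the dataset before reusing it.

```python
from typing import Dict

# Hypothetical prompt layout; any instruction-style template could be used.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_training_example(record: Dict[str, str]) -> Dict[str, str]:
    """Turn one raw record into a prompt/completion pair for fine-tuning.

    The keys "instruction", "input" and "output" are assumed, not confirmed
    by the dataset description.
    """
    prompt = PROMPT_TEMPLATE.format(
        instruction=record.get("instruction", "").strip(),
        # Fall back to a placeholder when the record has no extra context.
        input=record.get("input", "").strip() or "N/A",
    )
    return {"prompt": prompt, "completion": record.get("output", "").strip()}


# Invented record, for illustration only (not taken from the dataset).
sample = {
    "instruction": "Suggest a simple breathing exercise for relaxation.",
    "input": "",
    "output": "Try slow diaphragmatic breathing for a few minutes.",
}

example = build_training_example(sample)
print(example["prompt"] + example["completion"])
```

The resulting pairs can then be tokenized with whichever training stack is in use.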
Can it be enriched or improved?
Yes. This dataset can be improved by adding annotations (risk levels, disease categories, languages), by translating it, or by adapting it to local use cases (traditional medicine, local nutrition, etc.). It can also serve as a base for RLHF or instruction-tuning projects in a medical setting.
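One possible starting point for such annotations is a keyword-based first pass that a human reviewer then corrects. The sketch below is only that: the keyword lists, the `categories` and `risk_level` field names, and the record keys are hypothetical, and rule-based labels of this kind would still need review by medical experts.

```python
from typing import Dict, List

# Hypothetical keyword lists, chosen for illustration only.
CATEGORY_KEYWORDS: Dict[str, List[str]] = {
    "pharmacology": ["medication", "dose", "tablet", "prescription"],
    "well-being": ["breathing", "yoga", "sleep", "relaxation"],
}
HIGH_RISK_TERMS: List[str] = ["overdose", "chest pain"]


def annotate(record: Dict[str, str]) -> Dict[str, object]:
    """Return a copy of the record with draft category and risk annotations."""
    # Search across every text field of the record, case-insensitively.
    text = " ".join(str(value) for value in record.values()).lower()
    categories = [
        name
        for name, words in CATEGORY_KEYWORDS.items()
        if any(word in text for word in words)
    ]
    risk_level = "high" if any(term in text for term in HIGH_RISK_TERMS) else "low"
    return {**record, "categories": categories, "risk_level": risk_level}


# Invented record, for illustration only (not taken from the dataset).
annotated = annotate(
    {
        "instruction": "What dose of this medication is usually taken?",
        "output": "Ask a pharmacist to confirm.",
    }
)
print(annotated["categories"], annotated["risk_level"])
```

Substring matching is deliberately crude here; it is meant to pre-sort examples for manual annotation, not to replace it.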
🔎 In summary
🧠 Recommended for
- Creators of health chatbots
- Wellness coaching projects
- Specialized LLMs
🔧 Compatible tools
- Hugging Face Transformers
- QLoRA
- PyTorch
- vLLM
💡 Tip
For better impact, cross-reference this dataset with clinically validated or multilingual sources.
Frequently Asked Questions
Can this dataset be used in clinical applications?
No. It is designed for exploratory or assistive uses. Any clinical application requires validation by medical experts.
Is it possible to filter the dataset by content type?
Not currently, but you can add thematic filters (pharmacology, well-being, and so on) by manually annotating the examples.
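Once examples carry a manual label, filtering becomes a one-liner. The sketch below assumes a hypothetical `theme` field added during annotation; the records shown are invented placeholders, not dataset content.

```python
from collections import Counter
from typing import Dict, List


def filter_by_theme(records: List[Dict[str, str]], theme: str) -> List[Dict[str, str]]:
    """Keep only the records whose manually assigned theme matches."""
    # Records without a "theme" key are skipped rather than raising an error.
    return [record for record in records if record.get("theme") == theme]


# Invented, manually labeled records (the "theme" key is an assumption).
records = [
    {"instruction": "Explain how to take a tablet with food.", "theme": "pharmacology"},
    {"instruction": "Suggest a breathing routine.", "theme": "well-being"},
    {"instruction": "Recommend a bedtime yoga sequence.", "theme": "well-being"},
    {"instruction": "Unlabeled example."},
]

well_being = filter_by_theme(records, "well-being")
theme_counts = Counter(record.get("theme", "unlabeled") for record in records)
print(len(well_being), dict(theme_counts))
```

Counting labels as shown also gives a quick view of how balanced the annotated themes are.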
Does this dataset contain multilingual sources?
No, the data is mostly in English. A quality-controlled translation is recommended for multilingual use.