Synthetic Clinical Notes Embedded
A large synthetic medical text dataset in instruction-response format, with a precomputed embedding for each example. Suitable for training medical LLMs.
Description
Synthetic Clinical Notes Embedded is a synthetic dataset of 158,000 examples of simulated clinical notes, derived from sources such as MIMIC-III and PubMed Central. The data follows the instruction/input/output (Alpaca-style) format, and each example is enriched with an embedding generated with the BAAI/bge-small-en-v1.5 model. It is particularly useful for training language models in the medical domain.
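To make the record layout concrete, here is a minimal sketch of one Alpaca-style example with an attached embedding. The field names (`instruction`, `input`, `output`, `embedding`) are assumptions based on the description above, not verified column names, and the vector is truncated for illustration.

```python
import json

# Hypothetical example record in instruction/input/output (Alpaca-style)
# format, with a precomputed embedding attached. Field names are
# assumptions, not verified against the actual dataset schema.
record = {
    "instruction": "Summarize the patient's hospital course.",
    "input": "Admission note: 67-year-old male admitted with chest pain...",
    "output": "The patient was admitted for chest pain and monitored...",
    # Truncated for display; bge-small-en-v1.5 produces 384-dimensional vectors.
    "embedding": [0.012, -0.034, 0.051],
}

# Round-trip through JSONL, a common on-disk format for such datasets.
line = json.dumps(record)
parsed = json.loads(line)
print(sorted(parsed.keys()))
```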
What is this dataset for?
- Fine-tuning language models for generating or understanding medical texts
- Training on information extraction or coreference resolution tasks over patient records
- Direct use in clinical embedding research
Can it be enriched or improved?
Yes. It can be extended with other types of synthetic clinical notes, adapted to other languages, or augmented with additional annotations (medical entities, ICD categories, temporality of events). The embeddings can also be recomputed with other models as needed.
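Recomputing embeddings boils down to re-encoding the text fields with a new model. The sketch below keeps the encoder pluggable; the `toy_encode` stand-in is purely illustrative, and in practice you would pass something like a SentenceTransformer model's `.encode` method instead.

```python
from typing import Callable, List

def recompute_embeddings(texts: List[str],
                         encode: Callable[[List[str]], List[List[float]]]
                         ) -> List[List[float]]:
    """Re-embed dataset texts with any encoder that maps a batch of
    strings to a batch of vectors (e.g. SentenceTransformer(...).encode)."""
    return encode(texts)

# Toy stand-in encoder for illustration only: 2-d length-based features.
# A real replacement would be e.g. a multilingual or larger BGE model.
def toy_encode(batch: List[str]) -> List[List[float]]:
    return [[float(len(t)), float(t.count(" "))] for t in batch]

vectors = recompute_embeddings(["chest pain", "fever and cough"], toy_encode)
print(vectors)  # → [[10.0, 1.0], [15.0, 2.0]]
```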
🔎 In summary
🧠 Recommended for
- Medical NLP researchers
- Health data scientists
- Clinical assistant projects
🔧 Compatible tools
- Hugging Face Transformers
- LangChain
- SentenceTransformers
💡 Tip
Use the precomputed embeddings to explore semantic diversity before fine-tuning, or to build intelligent clinical search engines.
Frequently Asked Questions
Is the data from real patients?
No. These are synthetic clinical notes generated from public data (PubMed Central, MIMIC-III) to avoid any breach of confidentiality.
Can this dataset be used to train multilingual models?
The dataset is currently English-only, but it can be translated or enriched for multilingual use via controlled approaches.
What are the embeddings integrated into the dataset used for?
They allow direct semantic analysis of inputs/outputs, and facilitate integration into search or clustering systems.
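Clustering over the embeddings follows the same pattern as search: group notes by their nearest centroid in vector space. The sketch below shows one assignment step with fixed centroids and toy 2-d vectors (real k-means would also iterate centroid updates; real vectors are 384-dimensional).

```python
import math

def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec by Euclidean distance."""
    dists = [math.dist(vec, c) for c in centroids]
    return dists.index(min(dists))

# Toy 2-d vectors standing in for real note embeddings (illustrative values).
embeddings = {"note_a": [0.1, 0.0], "note_b": [0.9, 1.0], "note_c": [0.2, 0.1]}
centroids = [[0.0, 0.0], [1.0, 1.0]]

# Assign each note to its nearest centroid.
clusters = {}
for note_id, vec in embeddings.items():
    clusters.setdefault(nearest_centroid(vec, centroids), []).append(note_id)
print(clusters)  # → {0: ['note_a', 'note_c'], 1: ['note_b']}
```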