Synthetic Clinical Notes Embedded

Large medical text dataset in instruction-response format, with precomputed embeddings for each example. Suitable for training medical LLMs.

Size

158,000 examples, Parquet format with embeddings, 648M tokens

Licence

MIT

Description

Synthetic Clinical Notes Embedded is a synthetic dataset of 158,000 examples derived from simulated clinical notes, based on sources such as MIMIC-III and PubMed Central. The data is structured in instruction/input/output format (Alpaca style) and enriched with embeddings generated by the BAAI/bge-small-en-v1.5 model. It is particularly useful for training language models in the medical field.
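To illustrate the instruction/input/output layout, the sketch below flattens one record into a single training string. The field names follow the Alpaca convention mentioned above; the section markers, the helper name build_prompt, and the sample record are assumptions invented for this example and are not taken from the dataset.

```python
# Hypothetical record in Alpaca-style layout; the values are invented for illustration.
record = {
    "instruction": "Summarize the clinical note.",
    "input": "Patient admitted with chest pain; ECG normal; discharged after observation.",
    "output": "Chest pain with normal ECG, discharged after observation.",
}


def build_prompt(example: dict) -> str:
    """Join instruction, optional input, and output into one training string."""
    parts = [f"### Instruction:\n{example['instruction']}"]
    # Alpaca-style data may leave "input" empty, so skip that section when it is.
    if example.get("input"):
        parts.append(f"### Input:\n{example['input']}")
    parts.append(f"### Response:\n{example['output']}")
    return "\n\n".join(parts)


prompt = build_prompt(record)
print(prompt)
```

The same function could be mapped over every row before tokenization; whether this exact template suits a given base model is a choice left to the user.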

What is this dataset for?

Can it be enriched or improved?

Yes, it can be extended with other types of synthetic clinical notes, adapted to other languages, or augmented with additional annotations (medical entities, ICD categories, temporality of events). The embeddings can also be recomputed with other models as required.

🔎 In summary

Criterion | Evaluation
🧩 Ease of use | ⭐⭐⭐⭐⭐ (Clean format, ready to use for medical NLP)
🧼 Need for cleaning | ⭐⭐⭐⭐⭐ (None, post-processed data)
🏷️ Annotation richness | ⭐⭐⭐⭐⭐ (Excellent: structured format + embeddings + thematic diversity)
📜 Commercial license | ✅ Yes (MIT)
👨‍💻 Beginner friendly | ⚠️ Moderate: a good foundation in medical NLP is required
🔁 Fine-tuning ready | 🩺 Excellent base for health LLMs
🌍 Cultural diversity | ⚠️ English only, but varied medical topics

🧠 Recommended for

  • Medical NLP researchers
  • Health data scientists
  • Clinical assistant projects

🔧 Compatible tools

  • Hugging Face Transformers
  • LangChain
  • SentenceTransformers

💡 Tip

Use the precomputed embeddings to explore semantic diversity before fine-tuning, or to build semantic search engines over clinical text.
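A minimal sketch of the search idea, assuming the stored embeddings have been loaded into a NumPy matrix with one row per example. The helper name top_k_similar and the toy 4-dimensional vectors are made up for illustration; real bge-small-en-v1.5 vectors have 384 dimensions, and the name of the embedding column in the Parquet file is not stated here.

```python
import numpy as np


def top_k_similar(query_vec, corpus_vecs, k=3):
    """Return indices and cosine scores of the k corpus rows closest to query_vec."""
    query = np.asarray(query_vec, dtype=np.float32)
    corpus = np.asarray(corpus_vecs, dtype=np.float32)
    # Normalize both sides so the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = corpus @ query
    # Sort by descending score and keep the first k positions.
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Toy vectors standing in for the precomputed embeddings.
corpus_vecs = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
query_vec = [1.0, 0.05, 0.0, 0.0]

indices, scores = top_k_similar(query_vec, corpus_vecs, k=2)
print(indices, scores)
```

For a real query, the query text would need to be embedded with the same model as the corpus so that the vectors are comparable.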

Frequently Asked Questions

Is the data from real patients?

No. These are synthetic clinical notes generated from public data (PubMed Central, MIMIC-III) to avoid any breach of confidentiality.

Can this dataset be used to train multilingual models?

The dataset is currently English only. It can, however, be translated or extended for multilingual use through controlled approaches.

What are the embeddings integrated into the dataset used for?

They allow direct semantic analysis of inputs/outputs, and facilitate integration into search or clustering systems.
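As one way to picture the clustering use case, the sketch below runs a plain k-means (Lloyd's algorithm) over embedding vectors with NumPy. This is an illustrative choice, not a method prescribed by the dataset; the function name kmeans, its parameters, and the toy 2-dimensional points are assumptions for this example.

```python
import numpy as np


def kmeans(vectors, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    x = np.asarray(vectors, dtype=np.float64)
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random (fancy indexing copies them).
    centroids = x[rng.choice(len(x), size=k, replace=False)]

    def assign(points, centers):
        # Squared Euclidean distance from every row to every centroid.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(x, centroids)
        for j in range(k):
            members = x[labels == j]
            # Keep the previous centroid if a cluster ends up empty.
            if len(members):
                centroids[j] = members.mean(axis=0)
    # Final assignment against the last centroid update.
    return assign(x, centroids), centroids


# Toy points standing in for embedding vectors: two well-separated groups.
points = [[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
          [5.0, 5.0], [5.0, 5.1], [5.1, 5.0]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

On the full dataset a library implementation would likely be faster; the point here is only that the stored vectors can be grouped without re-encoding the text.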
