Synthetic Clinical Notes Embedded

Large synthetic medical text dataset in instruction-response format, with a precomputed embedding for each example. Suitable for training medical LLMs.

Size

158,000 examples, Parquet format with embeddings, 648M tokens

Licence

MIT

Description

Synthetic Clinical Notes Embedded is a synthetic dataset of 158,000 examples drawn from simulated clinical notes, based on sources such as MIMIC-III and PubMed Central. The data follows the Alpaca-style instruction/input/output format, and each example is enriched with an embedding generated by the BAAI/bge-small-en-v1.5 model. It is particularly useful for training language models in the medical field.
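
To make the instruction/input/output layout concrete, here is a minimal sketch that renders one Alpaca-style record as a single training prompt. The field names `instruction`, `input`, and `output` follow the usual Alpaca convention and are assumed here; the actual column names in the Parquet file, the prompt template, and the sample record are hypothetical.

```python
# Minimal sketch: render an Alpaca-style record as one training string.
# Field names and the template are assumptions, not confirmed by the dataset.

def format_alpaca(record: dict) -> str:
    """Build a prompt from instruction / optional input / output fields."""
    parts = [f"### Instruction:\n{record['instruction'].strip()}"]
    # The input field is optional in Alpaca-style data; skip it when empty.
    context = (record.get("input") or "").strip()
    if context:
        parts.append(f"### Input:\n{context}")
    parts.append(f"### Response:\n{record['output'].strip()}")
    return "\n\n".join(parts)


# Hypothetical record, invented for illustration only.
example = {
    "instruction": "Summarize the patient's presenting complaint.",
    "input": "Patient reports three days of chest pain on exertion.",
    "output": "Exertional chest pain for three days.",
}

prompt = format_alpaca(example)
print(prompt)
```

Records with an empty `input` simply omit the input section, which keeps the prompt compact for instruction-only examples.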

What is this dataset for?

Fine-tuning language models to generate or understand medical text
Training models for information extraction or coreference resolution in patient records
Direct use for clinical embeddings research
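
One simple way to use the precomputed embeddings directly is nearest-neighbour retrieval by cosine similarity. The sketch below uses plain Python and toy 3-dimensional vectors that stand in for the real embedding column; the function names are hypothetical, and the real vectors would be much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def nearest(query, embeddings, k=2):
    """Return the indices of the k embeddings most similar to the query."""
    scored = [(cosine_similarity(query, e), i) for i, e in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy vectors, invented for illustration only.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]

top = nearest([1.0, 0.0, 0.0], toy_embeddings, k=2)
print(top)
```

For the full 158,000 examples, a vectorized or approximate nearest-neighbour library would be a more practical choice than this linear scan.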

Can it be enriched or improved?

Yes. It can be extended with other types of synthetic clinical notes, adapted to other languages, or augmented with additional annotations (medical entities, ICD categories, temporality of events). The embeddings can also be recomputed with other models as required.
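
Recomputing the embeddings could look roughly like the sketch below, which re-embeds records in batches through a pluggable `encode` function. Everything here is an assumption for illustration: the `output` and `embedding` field names, the `reembed` helper, and the `toy_encode` placeholder, which would be replaced by a real embedding model's batch-encoding call.

```python
from typing import Callable, Iterable, List


def reembed(
    records: Iterable[dict],
    encode: Callable[[List[str]], List[List[float]]],
    text_field: str = "output",        # assumed field name
    target_field: str = "embedding",   # assumed field name
    batch_size: int = 2,
) -> List[dict]:
    """Return copies of the records with freshly computed embeddings."""
    records = list(records)
    updated = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        vectors = encode([r[text_field] for r in batch])
        for record, vector in zip(batch, vectors):
            new_record = dict(record)  # leave the original record untouched
            new_record[target_field] = list(vector)
            updated.append(new_record)
    return updated


def toy_encode(texts: List[str]) -> List[List[float]]:
    """Placeholder encoder: [character count, word count] per text."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]


# Hypothetical rows, invented for illustration only.
rows = [
    {"output": "Stable vital signs."},
    {"output": "No acute distress."},
    {"output": "Follow up in two weeks."},
]

updated = reembed(rows, toy_encode)
print(updated[0]["embedding"])
```

Keeping the encoder as a parameter makes it straightforward to compare several embedding models on the same notes without changing the batching logic.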

🔎 In summary

Criterion	Evaluation
🧩 Ease of use	⭐⭐⭐⭐⭐ (Clean format, ready-to-use for medical NLP)
🧼 Need for cleaning	⭐⭐⭐⭐⭐ (None – post-processed data)
🏷️ Annotation richness	⭐⭐⭐⭐⭐ (Excellent – structured format + embeddings + thematic diversity)
📜 Commercial license	✅ Yes (MIT)
👨‍💻 Beginner friendly	⚠️ Moderate – good foundation in medical NLP required
🔁 Fine-tuning ready	🩺 Excellent base for health LLMs
🌍 Cultural diversity	⚠️ English only, but varied medical topics