ChatML Format Dolly 15K
A version of the well-known Dolly 15K dataset converted to the standard ChatML format, compatible with conversational assistant models.
Description
The ChatML-Databricks-Dolly-15K dataset is a restructured version of Dolly 15K, converted to the ChatML format. This format is widely used for training open-source conversational models that expect structured prompts (e.g. Llama, Mistral). Each example pairs an instruction (with optional context) and a response, represented as role-tagged messages (user and assistant).
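The conversion above can be sketched in a few lines. This is a minimal illustration, not the exact script used to build the dataset; the field names (`instruction`, `context`, `response`) follow the original Dolly 15K schema, and the keys in this converted dataset may differ.

```python
# Minimal sketch: turn one Dolly 15K record (instruction + optional
# context + response) into a ChatML-style list of role-tagged messages.

def dolly_to_chatml(record):
    """Convert a Dolly record into user/assistant messages."""
    user_content = record["instruction"]
    if record.get("context"):
        # Prepend the optional context to the user turn.
        user_content = f"{record['context']}\n\n{record['instruction']}"
    return [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": record["response"]},
    ]

example = {
    "instruction": "What is the capital of France?",
    "context": "",
    "response": "The capital of France is Paris.",
}
messages = dolly_to_chatml(example)
```

The resulting message list is the structure most chat-oriented training tools consume directly.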
What is this dataset for?
- Fine-tune an AI assistant model (chatbot)
- Test instruction tuning in a standardized format
- Experiment with the ChatML format for multi-role inference
Can it be enriched or improved?
Yes. You can enrich this dataset by adding metadata (difficulty, thematic category), translating the instructions, or combining it with other datasets in the same format. It can also be supplemented with data from real or simulated dialogues.
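As one example of the metadata enrichment mentioned above, here is a hypothetical sketch that tags each example with a coarse thematic category using keyword matching. The category names and keyword lists are illustrative assumptions, not part of the dataset:

```python
# Hypothetical enrichment: assign a coarse thematic category to an
# example based on keywords found in the user message. The categories
# and keywords below are illustrative only.

KEYWORDS = {
    "geography": ["capital", "country", "continent"],
    "science": ["physics", "chemistry", "biology"],
}

def tag_category(example):
    """Add a 'category' field derived from the user turn's text."""
    text = example["messages"][0]["content"].lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            example["category"] = category
            return example
    example["category"] = "other"
    return example

tagged = tag_category(
    {"messages": [{"role": "user", "content": "Name the capital of Japan."}]}
)
```

In practice a classifier or an LLM-based labeler would give better coverage than keyword lists, but the output shape would be the same.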
🔎 In summary
🧠 Recommended for
- Conversational agent developers
- Fine-tuning researchers
- Open-source LLM enthusiasts
🔧 Compatible tools
- Hugging Face Transformers
- vLLM
- Axolotl
- FastChat
- LoRA
💡 Tip
To maximize performance, adapt the messages to the exact structure expected by your target model (e.g. adding special tokens).
Frequently Asked Questions
Can this dataset be used with Mistral or LLama?
Yes, the ChatML format is broadly compatible with open-source models such as Llama and Mistral.
What is the difference with the original Dolly dataset?
It contains the same data, converted to the ChatML format, which makes it better suited to models with a conversational architecture.
Is it multilingual?
No, this dataset is mostly in English. For multilingual use, it can be supplemented with other datasets.