ChatML Format Dolly 15K
A version of the well-known Dolly 15K dataset converted to the standard ChatML format, compatible with conversational assistant models.
Description
The ChatML-Databricks-Dolly-15K dataset is a restructured version of Dolly 15K, converted to the ChatML format. This format is widely used for training open-source conversational models that expect structured prompts (e.g. Llama, Mistral). Each example pairs an instruction (with optional context) and a response, represented as role-tagged messages (user and assistant).
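The conversion above can be sketched in a few lines. This is a minimal illustration, not the exact script used to build the dataset; the field names (`instruction`, `context`, `response`) follow the original Dolly 15K schema, and the keys in this converted dataset may differ.

```python
# Minimal sketch: turn one Dolly 15K record (instruction + optional
# context + response) into a ChatML-style list of role-tagged messages.

def dolly_to_chatml(record):
    """Convert a Dolly record into user/assistant messages."""
    user_content = record["instruction"]
    if record.get("context"):
        # Prepend the optional context to the user turn.
        user_content = f"{record['context']}\n\n{record['instruction']}"
    return [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": record["response"]},
    ]

example = {
    "instruction": "What is the capital of France?",
    "context": "",
    "response": "The capital of France is Paris.",
}
messages = dolly_to_chatml(example)
```

The resulting message list is the structure most chat-oriented training tools consume directly.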
What is this dataset for?
- Fine-tune an AI assistant model (chatbot)
- Test instruction tuning in a standardized format
- Experiment with the ChatML format for multi-role inference
Can it be enriched or improved?
Yes. You can enrich this dataset by adding metadata (difficulty, thematic category), translating the instructions, or combining it with other datasets in the same format. It can also be supplemented with data from real or simulated dialogues.
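As one example of the metadata enrichment mentioned above, here is a hypothetical sketch that tags each example with a coarse thematic category using keyword matching. The category names and keyword lists are illustrative assumptions, not part of the dataset:

```python
# Hypothetical enrichment: assign a coarse thematic category to an
# example based on keywords found in the user message. The categories
# and keywords below are illustrative only.

KEYWORDS = {
    "geography": ["capital", "country", "continent"],
    "science": ["physics", "chemistry", "biology"],
}

def tag_category(example):
    """Add a 'category' field derived from the user turn's text."""
    text = example["messages"][0]["content"].lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            example["category"] = category
            return example
    example["category"] = "other"
    return example

tagged = tag_category(
    {"messages": [{"role": "user", "content": "Name the capital of Japan."}]}
)
```

In practice a classifier or an LLM-based labeler would give better coverage than keyword lists, but the output shape would be the same.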
🔎 In summary
🧠 Recommended for
- Conversational agent developers
- Fine-tuning researchers
- Open-source LLM enthusiasts
🔧 Compatible tools
- Hugging Face Transformers
- vLLM
- Axolotl
- FastChat
- LoRA
💡 Tip
To maximize performance, adapt the messages to the exact structure expected by your target model (e.g. adding special tokens).
Frequently Asked Questions
Can this dataset be used with Mistral or LLama?
Yes, the ChatML format is broadly compatible with open-source models such as Llama and Mistral.
What is the difference with the original Dolly dataset?
It contains the same data, converted to the ChatML format, which makes it better suited to models with a conversational architecture.
Is it multilingual?
No, this dataset is mostly in English. For multilingual use, it can be supplemented with other datasets.