By clicking "Accept", you agree to the storing of cookies on your device to enhance site navigation, analyze site usage, and assist in our marketing efforts. See our Privacy Policy for more information
Open Datasets
ChatML Format Dolly 15K
Text

ChatML Format Dolly 15K

Converted version of the famous Dolly 15K dataset into a standard ChatML format, compatible with conversational assistant models.

Download dataset
Size

15,000 dialogues, Structured Parquet format

Licence

CC-BY-SA 3.0

Description

The dataset ChatML-Databricks-Dolly-15K is a restructured version of the dataset Dolly 15K, converted to ChatML format. This format is widely used for training open-source conversational models compatible with structured prompts (e.g.: LLama, Mistral, etc.). Each example is a pair instruction + context followed by a response, represented as roiled messages (User and helper).

What is this dataset for?

  • Fine-tune an AI assistant model (chatbot)
  • Test the tuning instruction in a standardized format
  • Experimenting with the ChatML format for multirole inference

Can it be enriched or improved?

Yes, you can enrich this dataset by adding metadata (difficulty, thematic category), translating instructions or combining it with other similar formats. It is also possible to complete it with data from real or simulated dialogues.

🔎 In summary

Criterion Evaluation
🧩 Ease of use⭐⭐⭐⭐⭐ (Ready-to-use format for LLMs)
🧼 Need for cleaning⭐⭐⭐⭐⭐ (None – already restructured)
🏷️ Annotation richness⭐⭐✩✩✩ (Simple, but sufficient for instructive dialogue)
📜 Commercial license✅ Yes (CC-BY-SA 3.0)
👨‍💻 Beginner friendly⚡ Very good starting point for fine-tuning
🔁 Fine-tuning ready🤖 Optimal format for assistants
🌍 Cultural diversity⚠️ Mostly English

🧠 Recommended for

  • Conversational agent developers
  • Fine-tuning researchers
  • Open-source LLM enthusiasts

🔧 Compatible tools

  • Hugging Face Transformers
  • VLLM
  • Axolotl
  • FastChat
  • LoRa

💡 Tip

To maximize performance, adapt the messages to the exact structure expected by your target model (e.g. adding special tokens).

Frequently Asked Questions

Can this dataset be used with Mistral or LLama?

Yes, the ChatML format is largely compatible with open-source models like LLama, Mistral, etc.

What is the difference with the original Dolly dataset?

It is a version converted to ChatML format, better suited to models with a conversational architecture.

Is it multilingual?

No, this dataset is mostly in English. For multilingual purposes, it can be completed with other data sets.

Similar datasets

See more
Category

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique.

Category

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique.

Category

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique.