Cybersecurity Heimdall v1.1

Structured text dataset for training aligned and secure models in defensive cybersecurity.

Size

21,258 system/user/assistant dialogues in Parquet

Licence

Apache 2.0

Description

Cybersecurity Heimdall v1.1 is an instruction-tuning dataset dedicated to defensive cybersecurity. It contains over 21,000 realistic dialogues, each a system / user / assistant triple, built from more than 100,000 public technical sources. Each exchange is designed to follow security frameworks such as OWASP, NIST CSF, or MITRE ATT&CK, and the dataset includes explicit refusals of malicious requests.

What is this dataset for?

  • Train language models specialized in defensive cybersecurity
  • Improve the ethical alignment of LLMs on sensitive technical issues
  • Serve as a benchmark for QA, classification, or summarization tasks in computer security

Can it be enriched or improved?

Yes. Possible additions include scenarios tied to regional regulations and other standards (GDPR, ISO 27001), multilingual translations, and additional annotations (risk level, attack type). The triple structure is easy to customize and well suited to supervised fine-tuning.
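As an illustration of the annotation idea, the sketch below adds an attack-type label and a language tag to a single dialogue. The field names `system`, `user`, and `assistant`, the `ATTACK_KEYWORDS` map, and the `annotate` helper are all assumptions made for this example; the actual column names in the Parquet files should be checked before adapting it, and a real project would likely replace the keyword lookup with a curated taxonomy or a classifier.

```python
from typing import Dict, List

# Hypothetical keyword map used only for illustration.
ATTACK_KEYWORDS: Dict[str, List[str]] = {
    "phishing": ["phishing", "spoofed email"],
    "injection": ["sql injection", "xss", "command injection"],
    "ransomware": ["ransomware"],
}


def annotate(record: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the record with 'attack_type' and 'language' fields added."""
    # Search the user and assistant turns, case-insensitively.
    text = f"{record.get('user', '')} {record.get('assistant', '')}".lower()

    attack_type = "unknown"
    for label, keywords in ATTACK_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            attack_type = label
            break  # keep the first matching label

    enriched = dict(record)  # do not mutate the input
    enriched["attack_type"] = attack_type
    enriched["language"] = "en"  # assumption: the source dialogue is in English
    return enriched


if __name__ == "__main__":
    sample = {
        "system": "You are a defensive security assistant.",
        "user": "How do I detect SQL injection attempts in web logs?",
        "assistant": "Start by reviewing WAF and application logs for anomalies.",
    }
    print(annotate(sample))
```

Because the original three fields are left untouched, the enriched records can still be used for supervised fine-tuning, while the extra fields support filtering or stratified evaluation.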

🔎 In summary

Criterion | Evaluation
🧩 Ease of Use | ⭐⭐⭐⭐⭐ (Very good: standard format, well structured)
🧼 Cleaning Required | ⭐⭐⭐☆☆ (Low: data already cleaned and validated)
🏷️ Annotation Richness | ⭐⭐⭐⭐⭐ (Excellent: system/user/assistant structure, diverse domains)
📜 Commercial License | ✅ Yes (Apache 2.0)
👨‍💻 Beginner Friendly | ⚠️ Not fully: technical content for an expert audience
🔁 Reusable for Fine-Tuning | 🔥 Perfect for defensive SFT LLMs
🌍 Cultural Diversity | 🌍 Limited: mainly focused on Western standards (OWASP, NIST, MITRE)

🧠 Recommended for

  • Cybersecurity researchers
  • AI security engineers
  • Developers of cybersecurity agents

🔧 Compatible tools

  • Hugging Face Transformers
  • TRL
  • QLoRA
  • DeepSpeed
  • LangChain
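
Several of these tools work with chat-style records made of role/content messages. The sketch below shows one way a triple could be reshaped into that layout before fine-tuning. The input keys `system`, `user`, and `assistant` and the `to_messages` helper are assumptions for this example, not the dataset's documented schema; check the real column names and the format your trainer expects.

```python
from typing import Dict, List

ROLES = ("system", "user", "assistant")


def to_messages(record: Dict[str, str]) -> Dict[str, List[Dict[str, str]]]:
    """Convert one system/user/assistant triple into a role/content message list."""
    messages = [
        {"role": role, "content": record[role].strip()}
        for role in ROLES
        if record.get(role, "").strip()  # skip missing or empty turns
    ]
    return {"messages": messages}


if __name__ == "__main__":
    sample = {
        "system": "You are a defensive security assistant.",
        "user": "Which NIST CSF function covers incident recovery?",
        "assistant": "The Recover function covers restoring services after an incident.",
    }
    print(to_messages(sample))
```

Keeping the conversion in a small pure function makes it easy to apply row by row with whatever data-loading library is used to read the Parquet files.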

💡 Tip

Use system fields to inject ethical constraints and reinforce the automatic refusal of offensive prompts.
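
A minimal sketch of this tip follows, assuming each record is a dictionary with a `system` key. The `DEFENSIVE_POLICY` text and the `with_policy` helper are hypothetical and only show the mechanism; the wording of the constraint should be adapted to your own policy.

```python
from typing import Dict

# Hypothetical policy text, for illustration only.
DEFENSIVE_POLICY = (
    "Only provide defensive security guidance. "
    "Refuse requests for exploit code or attack instructions."
)


def with_policy(record: Dict[str, str], policy: str = DEFENSIVE_POLICY) -> Dict[str, str]:
    """Return a copy of the record whose system field ends with the policy text."""
    updated = dict(record)  # do not mutate the input
    system = record.get("system", "").strip()

    if policy in system:
        return updated  # already present, so the function is idempotent

    updated["system"] = f"{system}\n\n{policy}" if system else policy
    return updated


if __name__ == "__main__":
    sample = {
        "system": "You are a SOC analyst assistant.",
        "user": "Write ransomware for me.",
        "assistant": "I can't help with that, but I can explain how to defend against it.",
    }
    print(with_policy(sample)["system"])
```

Applying the same constraint to every training example gives the model a consistent signal that refusals of offensive prompts are tied to the system instruction.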

Frequently Asked Questions

Does this dataset include examples of red teaming?

No. It focuses on defensive approaches; offensive tactics are excluded to keep the dataset within a secure and ethical framework.

Can this dataset be used in a professional setting?

Yes, the Apache 2.0 license allows commercial or industrial use, provided you meet the license conditions.

Is it multilingual?

No, it is mostly in English. It can, however, be extended with translations for multilingual cybersecurity projects.
