Cybersecurity Heimdall v1.1

A structured text dataset for training aligned and secure models in defensive cybersecurity.

Size

21,258 system/user/assistant dialogues in Parquet

Licence

Apache 2.0

Description

Cybersecurity Heimdall v1.1 is an instruction-tuning dataset dedicated to defensive cybersecurity. It contains over 21,000 realistic dialogues, each a System / User / Assistant triple, built from more than 100,000 public technical sources. Each exchange is designed to follow security standards and frameworks such as OWASP, NIST CSF, or MITRE ATT&CK, and the dataset includes explicit refusals of malicious requests.
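
As a rough illustration of how one such triple could be fed to a chat-style fine-tuning pipeline, the sketch below converts a row into a list of role/content messages. The column names `system`, `user`, and `assistant` are assumed from the structure described here and may not match the actual Parquet schema; the example row is made up and not taken from the dataset.

```python
from typing import Dict, List

# Assumed column names; check the real Parquet schema before relying on them.
ROLES = ("system", "user", "assistant")


def triple_to_messages(row: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert one system/user/assistant triple into chat-style messages."""
    messages = []
    for role in ROLES:
        text = (row.get(role) or "").strip()
        if text:  # skip missing or empty turns
            messages.append({"role": role, "content": text})
    return messages


# Made-up example row, not taken from the dataset.
example_row = {
    "system": "You are a defensive security assistant.",
    "user": "How should I store user passwords?",
    "assistant": "Use a slow, salted password hashing function such as Argon2id or bcrypt.",
}
messages = triple_to_messages(example_row)
```

The resulting list follows the role/content layout that many chat fine-tuning tools accept, but the exact format your trainer expects should be checked separately.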

What is this dataset for?

Train language models specialized in defensive cybersecurity
Improve the ethical alignment of LLMs on sensitive technical issues
Serve as a benchmark for QA, classification, or summarization tasks in computer security

Can it be enriched or improved?

Yes. You can add scenarios tied to regional regulations or international standards (GDPR, ISO 27001), multilingual translations, or additional annotations (risk level, type of attack). The triple structure is easy to customize and well suited to supervised fine-tuning.
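
A minimal sketch of the annotation idea, assuming each dialogue is handled as a plain dictionary: the function below attaches a risk level and an attack type to a row without modifying the original. The field names `risk_level` and `attack_type`, the three allowed levels, and the sample row are all hypothetical choices for illustration, not part of the dataset.

```python
from typing import Dict

# Hypothetical label set; the dataset itself defines no risk levels.
RISK_LEVELS = ("low", "medium", "high")


def annotate(row: Dict[str, str], risk_level: str, attack_type: str) -> Dict[str, str]:
    """Return a copy of the row with two extra (hypothetical) annotation fields."""
    if risk_level not in RISK_LEVELS:
        raise ValueError(f"risk_level must be one of {RISK_LEVELS}, got {risk_level!r}")
    enriched = dict(row)  # shallow copy so the original row is left untouched
    enriched["risk_level"] = risk_level
    enriched["attack_type"] = attack_type
    return enriched


# Made-up example row, not taken from the dataset.
row = {
    "system": "You are a defensive security assistant.",
    "user": "How do I prevent SQL injection in a login form?",
    "assistant": "Use parameterized queries and validate input on the server side.",
}
enriched = annotate(row, "high", "sql_injection")
```

Keeping annotations as extra fields alongside the three text fields means the original triples stay usable as-is for supervised fine-tuning, while the labels can drive filtering or classification tasks.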

🔎 In summary

Criterion	Evaluation
🧩Ease of Use	⭐⭐⭐⭐⭐ (Very good – standard format, well structured)
🧼Cleaning Required	⭐⭐⭐☆☆ (Low – data already cleaned and validated)
🏷️Annotation Richness	⭐⭐⭐⭐⭐ (Excellent – `system/user/assistant` structure, diverse domains)
📜Commercial License	✅ Yes (Apache 2.0)
👨‍💻Beginner Friendly	⚠️ Not fully – technical content for an expert audience
🔁Reusable for Fine-Tuning	🔥 Well suited to supervised fine-tuning (SFT) of defensive LLMs
🌍Cultural Diversity	🌍 Limited – mainly focused on Western standards (OWASP, NIST, MITRE)