By clicking "Accept", you agree to the storing of cookies on your device to enhance site navigation, analyze site usage, and assist in our marketing efforts. See our Privacy Policy for more information
Open Datasets
Prompt Injections Data Set
Text

Prompt Injections Data Set

The Prompt Injections dataset contains examples of prompt injections designed to manipulate or bypass LLMs. It includes various techniques, such as prompt leaking, jailbreaking, and switching, in multiple languages.

Download dataset
Size

Over 1000 text examples, multilingual (7 languages), CSV file or similar

Licence

Apache 2.0

Description

This dataset brings together more than 1000 examples of prompt injections in several languages (English, French, German, Spanish, Italian, Portuguese, Romanian) in several languages. These examples illustrate techniques for bypassing and manipulating language models, making it possible to better understand and counter these attacks.

What is this dataset for?

  • Improving the robustness of LLMs in the face of malicious injections
  • Train models to detect and neutralize prompt injections
  • Study the different methods of attacking language models

Can it be enriched or improved?

Yes, this corpus can be supplemented by recent examples or examples specific to certain contexts of use. An additional annotation on the nature of the attacks can also improve its value.

🔎 In summary

Criterion Evaluation
🧩 Ease of use⭐⭐⭐⭐⭐ (Simple, clear format and text-only)
🧼 Need for cleaning⭐⭐⭐⭐⭐ (Very low – ready-to-use data)
🏷️ Annotation richness⭐⭐✩✩✩ (Basic – examples without complex annotation)
📜 Commercial license✅ Yes (Apache 2.0)
👨‍💻 Beginner friendly✅ Yes, accessible for researchers and developers
🔁 Fine-tuning ready🛡️ Useful for fine-tuning in model safety and control
🌍 Cultural diversity⚡ Multilingual – 7 languages represented

🧠 Recommended for

  • AI security researchers
  • LLM developers
  • NLP analysts

🔧 Compatible tools

  • Hugging Face
  • PyTorch
  • TensorFlow
  • Jupyter notebooks

💡 Tip

Treat this data carefully, avoiding its malicious use, to reinforce the security of the systems.

Frequently Asked Questions

What injection techniques are covered by this dataset?

Prompt leaking, jailbreaking, switching mode, and other LLM bypass methods.

Is this dataset only in English?

No, it is multilingual with 7 languages including French, English, English, German, Spanish, Italian, Portuguese and Romanian.

Can this dataset be used to train a business model?

Yes, the Apache 2.0 license allows commercial use under conditions.

Similar datasets

See more
Category

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique.

Category

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique.

Category

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique.