MAPS — Multilingual Agentic Performance & Security
MAPS is a benchmark for evaluating the performance and security of AI agents in multilingual contexts, built from tasks drawn from GAIA, MATH, SWE-Bench, and ASB.
Description
MAPS (Multilingual Agentic Performance & Security), a dataset by Fujitsu, is the first multilingual benchmark to assess both the performance and the security behavior of AI agents across a wide variety of tasks. It includes more than 8,800 tasks translated into 11 languages, covering reasoning, coding, web research, and security under adversarial scenarios. The benchmark is built from four sub-datasets, GAIA, MATH, SWE-Bench, and ASB, each targeting a specific skill set.
What is this dataset for?
- Compare the performance of different AI agents in multilingual contexts
- Test the robustness and security of agents against sensitive or adversarial inputs
- Evaluate cross-language generalization in reasoning, coding, and alignment
Can it be enriched or improved?
Yes. Other languages, additional tasks, or custom scenarios can be added. The JSON format makes it easy to integrate with other benchmarks or tools, and domain-specific metrics or evaluations (e.g. law, finance) can also be incorporated.
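As a minimal sketch of such an extension, a custom scenario can be appended as a JSON record. The field names below (`task_id`, `language`, `domain`, `prompt`, `expected_behavior`) are hypothetical and should be matched to the actual MAPS schema:

```python
import json

# Hypothetical record: field names are illustrative, not the official MAPS schema.
custom_task = {
    "task_id": "custom-0001",
    "language": "fr",
    "domain": "finance",
    "prompt": "Summarize the attached quarterly report in two sentences.",
    "expected_behavior": "complies_safely",
}

# Append the new scenario to a task file (JSON Lines style, one record per line).
with open("custom_tasks.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(custom_task, ensure_ascii=False) + "\n")
```

Keeping one JSON object per line makes the file trivial to load with Hugging Face Datasets or to merge with the existing task files.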
🔎 In summary
🧠 Recommended for
- Multilingual AI researchers
- LLM Agent Developers
- AI security laboratories
🔧 Compatible tools
- Python
- Jupyter
- Hugging Face Datasets
- OpenAI Evals
- LangChain
💡 Tip
Filter tasks by language and domain to identify specific agent pain points.
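A minimal sketch of that workflow, assuming each task carries `language` and `domain` fields (hypothetical names; check the actual schema). Plain Python is used here so the same logic works on a list of records loaded via Hugging Face Datasets:

```python
# Toy task list standing in for loaded MAPS records; field names are assumed.
tasks = [
    {"task_id": 1, "language": "ja", "domain": "math", "passed": False},
    {"task_id": 2, "language": "ja", "domain": "math", "passed": True},
    {"task_id": 3, "language": "fr", "domain": "security", "passed": False},
]

def filter_tasks(tasks, language=None, domain=None):
    """Keep only the tasks matching the given language and/or domain."""
    return [
        t for t in tasks
        if (language is None or t["language"] == language)
        and (domain is None or t["domain"] == domain)
    ]

# The failure rate of a (language, domain) slice highlights agent pain points.
ja_math = filter_tasks(tasks, language="ja", domain="math")
failure_rate = sum(not t["passed"] for t in ja_math) / len(ja_math)
```

Comparing `failure_rate` across slices shows which language/domain combinations an agent struggles with most.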
Frequently Asked Questions
Can this benchmark be used to evaluate non-English speaking agents?
Absolutely. It was designed to test agents in 11 languages, including Arabic, Japanese, Hindi, and French.
Is it suitable for fine-tuning?
The dataset is primarily intended for evaluation, but some tasks can serve as material for controlled fine-tuning.
Is it possible to add your own scenarios to the benchmark?
Yes. The JSON format makes it easy to add custom scenarios or languages, so the benchmark can be extended or adapted to fit your test objectives.