MAPS — Multilingual Agentic Performance & Security
MAPS is a benchmark for evaluating the performance and security of AI agents in multilingual contexts, built from tasks drawn from GAIA, MATH, SWE-Bench, and ASB.
Description
MAPS (Multilingual Agentic Performance & Security), a dataset by Fujitsu, is the first multilingual benchmark to assess both the performance and the security behavior of AI agents across a wide variety of tasks. It includes more than 8,800 tasks translated into 11 languages, covering reasoning, coding, web research, and security under adversarial scenarios. The benchmark is built from four sub-datasets, GAIA, MATH, SWE-Bench, and ASB, each targeting a specific skill set.
What is this dataset for?
- Compare the performance of different AI agents in multilingual contexts
- Test the robustness and security of agents against sensitive or adversarial inputs
- Evaluate cross-language generalization in reasoning, coding, and alignment
Can it be enriched or improved?
Yes. Other languages, additional tasks, or custom scenarios can be added. The JSON format makes it easy to integrate with other benchmarks or tools, and domain-specific metrics or evaluations (e.g. law, finance) can also be incorporated.
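As a minimal sketch of such an extension, a custom scenario can be appended as a JSON record. The field names below (`task_id`, `language`, `domain`, `prompt`, `expected_behavior`) are hypothetical and should be matched to the actual MAPS schema:

```python
import json

# Hypothetical record: field names are illustrative, not the official MAPS schema.
custom_task = {
    "task_id": "custom-0001",
    "language": "fr",
    "domain": "finance",
    "prompt": "Summarize the attached quarterly report in two sentences.",
    "expected_behavior": "complies_safely",
}

# Append the new scenario to a task file (JSON Lines style, one record per line).
with open("custom_tasks.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(custom_task, ensure_ascii=False) + "\n")
```

Keeping one JSON object per line makes the file trivial to load with Hugging Face Datasets or to merge with the existing task files.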
🔎 In summary
🧠 Recommended for
- Multilingual AI researchers
- LLM Agent Developers
- AI security laboratories
🔧 Compatible tools
- Python
- Jupyter
- Hugging Face Datasets
- OpenAI Evals
- LangChain
💡 Tip
Filter tasks by language and domain to identify specific agent pain points.
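A minimal sketch of that workflow, assuming each task carries `language` and `domain` fields (hypothetical names; check the actual schema). Plain Python is used here so the same logic works on a list of records loaded via Hugging Face Datasets:

```python
# Toy task list standing in for loaded MAPS records; field names are assumed.
tasks = [
    {"task_id": 1, "language": "ja", "domain": "math", "passed": False},
    {"task_id": 2, "language": "ja", "domain": "math", "passed": True},
    {"task_id": 3, "language": "fr", "domain": "security", "passed": False},
]

def filter_tasks(tasks, language=None, domain=None):
    """Keep only the tasks matching the given language and/or domain."""
    return [
        t for t in tasks
        if (language is None or t["language"] == language)
        and (domain is None or t["domain"] == domain)
    ]

# The failure rate of a (language, domain) slice highlights agent pain points.
ja_math = filter_tasks(tasks, language="ja", domain="math")
failure_rate = sum(not t["passed"] for t in ja_math) / len(ja_math)
```

Comparing `failure_rate` across slices shows which language/domain combinations an agent struggles with most.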
Frequently Asked Questions
Can this benchmark be used to evaluate non-English speaking agents?
Absolutely. It was designed to test agents in 11 languages, including Arabic, Japanese, Hindi, and French.
Is it suitable for fine-tuning?
The dataset is primarily intended for evaluation, but some tasks can serve as material for controlled fine-tuning.
Is it possible to add your own scenarios to the benchmark?
Yes. The JSON format makes it easy to add custom scenarios or languages, so the benchmark can be extended or adapted to fit your test objectives.