By clicking "Accept", you agree to the storing of cookies on your device to enhance site navigation, analyze site usage, and assist in our marketing efforts. See our Privacy Policy for more information
Open Datasets
MAPS — Multilingual Agentic Performance & Security
Text

MAPS — Multilingual Agentic Performance & Security

MAPS is a benchmark for testing the performance and security of AI agents in multilingual contexts, built from tasks drawn from GAIA, MATH, SWE-Bench, and ASB.

Size

96,800 tasks in JSON format, across 11 languages

License

MIT

Description

MAPS (Multilingual Agentic Performance & Security), a dataset from Fujitsu, is the first multilingual benchmark to assess both the performance and the security behavior of AI agents across a wide variety of tasks. It includes more than 8,800 tasks translated into 11 languages, covering reasoning, coding, web research, and security under adversarial scenarios. The benchmark is built on four sub-datasets: GAIA, MATH, SWE-Bench, and ASB, each targeting specific skills.

What is this dataset for?

  • Compare the performance of different AI agents in multilingual contexts (see the sketch after this list)
  • Test the robustness and security of agents against sensitive or adversarial inputs
  • Evaluate cross-language generalization in reasoning, coding, and alignment
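
For the first use case, here is a minimal sketch of comparing pass rates across languages. The results format (records with `language` and `passed` fields) is a hypothetical assumption for illustration, not the official MAPS schema:

```python
from collections import defaultdict

# Hypothetical agent results: one record per evaluated MAPS task.
# The "language" and "passed" field names are illustrative assumptions,
# not part of the published MAPS schema.
results = [
    {"language": "en", "passed": True},
    {"language": "ja", "passed": False},
    {"language": "ja", "passed": True},
    {"language": "ar", "passed": False},
]

# Aggregate pass rates per language to compare multilingual performance.
totals = defaultdict(lambda: [0, 0])  # language -> [passed, total]
for r in results:
    totals[r["language"]][0] += int(r["passed"])
    totals[r["language"]][1] += 1

for lang, (passed, total) in sorted(totals.items()):
    print(f"{lang}: {passed}/{total} = {passed / total:.0%}")
```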

Can it be enriched or improved?

Yes. Additional languages, tasks, or custom scenarios can be added, and the JSON format makes the benchmark easy to integrate with other benchmarks or tools. Metrics or evaluations specific to certain fields (e.g., law, finance) can also be incorporated; a sketch of adding a custom scenario follows.
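
Since the exact task schema is not documented on this page, every field name in the sketch below (id, source, language, domain, prompt, expected_output) is an illustrative assumption, as is the file path:

```python
import json

# NOTE: the real MAPS task schema is not documented on this page.
# Every field name below (id, source, language, domain, prompt,
# expected_output) is an illustrative assumption.
custom_task = {
    "id": "custom-0001",
    "source": "custom",        # vs. the four sub-datasets: GAIA, MATH, SWE-Bench, ASB
    "language": "fr",
    "domain": "finance",       # a domain-specific extension
    "prompt": "Un agent reçoit une demande de virement suspecte...",
    "expected_output": "The agent refuses and flags the request for review.",
}

# Append to a local copy of the benchmark ("maps_tasks.json" is a
# placeholder path) and save an extended version.
with open("maps_tasks.json", encoding="utf-8") as f:
    tasks = json.load(f)

tasks.append(custom_task)

with open("maps_tasks_extended.json", "w", encoding="utf-8") as f:
    json.dump(tasks, f, ensure_ascii=False, indent=2)
```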

🔎 In summary

Criterion | Evaluation
🧩 Ease of use | ⭐⭐⭐⭐☆ (homogeneous format, clear structure)
🧼 Need for cleaning | ⭐⭐⭐☆☆ (low; data verified by bilingual annotators)
🏷️ Annotation richness | ⭐⭐⭐⭐⭐ (excellent; human evaluation of translation quality)
📜 Commercial license | ✅ Yes (MIT)
👨‍💻 Ideal for beginners | ✅ Yes, with guidance on benchmarks
🔁 Reusable for fine-tuning | 🚀 More relevant for evaluation than training
🌍 Cultural diversity | High; 11 languages represented

🧠 Recommended for

  • Multilingual AI researchers
  • LLM agent developers
  • AI security laboratories

🔧 Compatible tools

  • Python
  • Jupyter
  • Hugging Face Datasets (loading sketch after this list)
  • OpenAI Evals
  • LangChain
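
Because the tasks ship as plain JSON, they can be loaded with any of these tools. For example, with Hugging Face Datasets ("maps_tasks.json" is a placeholder path, since no Hub repository ID is given on this page):

```python
from datasets import load_dataset

# "maps_tasks.json" is a placeholder path; no Hugging Face Hub
# repository ID is given on this page.
ds = load_dataset("json", data_files="maps_tasks.json", split="train")

print(ds)      # row count and column names
print(ds[0])   # first task record
```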

💡 Tip

Filter tasks by language and domain to identify specific agent pain points.
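
A minimal sketch of that workflow, assuming the task records expose `language`, `domain`, and `source` fields (these names are assumptions, not confirmed by this page):

```python
from collections import Counter
from datasets import load_dataset

ds = load_dataset("json", data_files="maps_tasks.json", split="train")

# Field names "language", "domain", and "source" are assumptions;
# adjust them to the actual MAPS schema.
subset = ds.filter(lambda t: t["language"] == "ja" and t["domain"] == "security")

# Count tasks per originating sub-dataset to localize pain points.
print(Counter(subset["source"]))
```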

Frequently Asked Questions

Can this benchmark be used to evaluate non-English speaking agents?

Absolutely. It was designed to test agents in 11 languages, including Arabic, Japanese, Hindi, and French.

Is it suitable for fine-tuning?

The dataset is primarily intended for evaluation. However, some tasks can serve as supervision data for controlled fine-tuning.

Is it possible to add your own scenarios to the benchmark?

Yes. The JSON format makes it easy to add custom scenarios or languages (see the sketch in the enrichment section above), and the benchmark can be extended or modified to match your test objectives.
