FLORES+: Multilingual Translation Benchmark
A multilingual benchmark for evaluating translation quality across more than 200 languages, built from sources such as Wikinews and Wikivoyage.
Approximately 2,000 sentences per language × 222 languages, structured text format
CC-BY-SA 4.0
Description
FLORES+ is a multilingual benchmark for evaluating machine translation quality across 222 languages. It contains sentences drawn from Wikinews, Wikivoyage, and Wikijunior, translated from English into a wide range of languages. The corpus is divided into standardized splits (dev, devtest), making comparisons between models straightforward.
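For quick orientation, here is a minimal loading sketch using the Hugging Face `datasets` library. The repository id `openlanguagedata/flores_plus`, the per-language config name `eng_Latn`, and the `text` field are assumptions based on common FLORES naming conventions; check the dataset card on the Hub for the exact identifiers.

```python
# Minimal sketch: load the FLORES+ dev/devtest splits for one language.
# The repo id, config name, and "text" field are assumptions; verify them
# against the actual dataset card on the Hugging Face Hub.
from datasets import load_dataset

flores_en = load_dataset("openlanguagedata/flores_plus", "eng_Latn")
print(flores_en)                    # DatasetDict with "dev" and "devtest" splits
print(flores_en["dev"][0]["text"])  # first English dev sentence
```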
What is this dataset for?
- Evaluating translation model performance on low- and high-resource languages (see the sketch after this list)
- Testing multilingual systems in a controlled setting
- Exploring the language coverage of LLMs or NMT systems
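To make the first use case concrete, here is a hedged end-to-end sketch: translate the English dev sentences with a MarianMT checkpoint and score them against the French references with corpus BLEU. The dataset and config ids, the `text` field, and the `Helsinki-NLP/opus-mt-en-fr` checkpoint are assumptions to verify before use.

```python
# Hedged sketch: evaluate an English→French MarianMT model on FLORES+ dev.
# Dataset/config ids, the "text" field, and the checkpoint are assumptions.
from datasets import load_dataset
from transformers import MarianMTModel, MarianTokenizer
import sacrebleu

src = load_dataset("openlanguagedata/flores_plus", "eng_Latn", split="dev")
ref = load_dataset("openlanguagedata/flores_plus", "fra_Latn", split="dev")

name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

hypotheses = []
for i in range(0, len(src), 16):  # translate in small batches
    batch = tokenizer(src["text"][i:i + 16], return_tensors="pt",
                      padding=True, truncation=True)
    out = model.generate(**batch)
    hypotheses += tokenizer.batch_decode(out, skip_special_tokens=True)

# Score against the French references (one reference stream).
print(sacrebleu.corpus_bleu(hypotheses, [ref["text"]]).score)
```

The same loop generalizes to any language pair: swap the source and reference configs and pick a matching checkpoint.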
Can it be enriched or improved?
Yes. You can add new language pairs, extend the set with additional human translations, or enrich the per-language metadata (language family, typology). It can also serve as a basis for building domain-specific benchmarks (legal, medical, etc.).
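As a small illustration of the metadata idea above, the sketch below attaches a language-family tag to each record. The mapping is a hypothetical stub; a real one could be derived from a resource such as Glottolog.

```python
# Hedged sketch: enrich FLORES+ records with a language-family column.
# The FAMILY mapping is a hypothetical stub, not part of the dataset.
from datasets import load_dataset

FAMILY = {"eng_Latn": "Indo-European", "swh_Latn": "Niger-Congo"}  # stub

def tag(example, code="eng_Latn"):
    example["family"] = FAMILY.get(code, "unknown")
    return example

ds = load_dataset("openlanguagedata/flores_plus", "eng_Latn", split="dev")
ds = ds.map(tag)  # adds a "family" column to every record
```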
🔎 In summary
🧠 Recommended for
- Translation researchers
- Low-resource language specialists
- Multilingual model developers
🔧 Compatible tools
- MarianMT
- Fairseq
- Hugging Face Transformers
- BLEU/METEOR
💡 Tip
Use complementary metrics (BLEU, COMET, chrF) suited to each language for a finer-grained evaluation.
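A minimal sketch of this tip using sacrebleu, which implements both BLEU and chrF (COMET requires a separate neural model, e.g. via the `unbabel-comet` package, and is omitted here); the strings are placeholders:

```python
# Hedged sketch: score the same hypotheses with BLEU and chrF via sacrebleu.
import sacrebleu

hypotheses = ["Das Haus ist klein."]     # placeholder model output
references = [["Das Haus ist winzig."]]  # one reference stream

print(sacrebleu.corpus_bleu(hypotheses, references).score)
print(sacrebleu.corpus_chrf(hypotheses, references).score)
```

chrF operates on character n-grams, which tends to make it more robust than word-level BLEU for morphologically rich or low-resource languages.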
Frequently Asked Questions
Can FLORES+ be used to evaluate models on rare languages?
Yes, that is one of its main strengths: its coverage includes many low-resource languages.
Does the dataset contain parallel texts for training?
No: it is designed for evaluation. Each source sentence is translated into many languages, but the dev/devtest splits are not intended as a training corpus.
Is this benchmark compatible with fine-tuned translation models?
Absolutely, it is frequently used to validate the quality of trained or adapted models.