MultiNLI (Multi-Genre Natural Language Inference Corpus)

MultiNLI (Multi-Genre Natural Language Inference) is a benchmark dataset for evaluating how well NLP models handle natural language inference. It was designed to test a model's ability to determine the relationship between two sentences, a premise and a hypothesis: entailment, contradiction, or neutrality.

Size

Approximately 400,000 sentence pairs, TSV format

Licence

Free for academic use. Restrictions may apply to commercial use.

Description

The MultiNLI dataset includes:

Approximately 400,000 manually annotated sentence pairs
Three logical relationships: entailment, contradiction, neutral
A diversity of textual sources covering formal and informal contexts
A TSV format that is easy to integrate into traditional NLP pipelines
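
As a rough illustration of the TSV layout and the three labels, the sketch below parses a few invented rows with Python's standard `csv` module and counts the labels. The column names (`gold_label`, `genre`, `sentence1`, `sentence2`) and the use of `-` for pairs without a consensus label are assumptions about the file layout and should be checked against the files you actually download.

```python
import csv
import io
from collections import Counter

LABELS = {"entailment", "contradiction", "neutral"}

# Invented rows standing in for a MultiNLI TSV file; the column names are assumptions.
SAMPLE_TSV = (
    "gold_label\tgenre\tsentence1\tsentence2\n"
    "entailment\tfiction\tShe handed him the letter.\tHe received a letter.\n"
    "contradiction\ttelephone\tI never went to college.\tI graduated from college.\n"
    "neutral\tgovernment\tThe agency published a report.\tThe report was long.\n"
    "-\ttravel\tThe museum opens at nine.\tThe museum is popular.\n"
)


def read_pairs(handle):
    """Parse TSV rows and keep only those carrying one of the three labels."""
    # QUOTE_NONE keeps stray quotation marks inside sentences from confusing the parser.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row["gold_label"] in LABELS]


pairs = read_pairs(io.StringIO(SAMPLE_TSV))
label_counts = Counter(row["gold_label"] for row in pairs)
print(label_counts)
```

To read a real file, the same function can be given a handle from `open(path, newline="", encoding="utf-8")`, where `path` points to a downloaded TSV file.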

What is this dataset for?

MultiNLI is mainly used for:

Training textual entailment recognition models
Assessing the ability of models to detect logical relationships between sentences
Fine-tuning language models on contextual comprehension tasks
Analyzing the robustness and logical coherence of model-generated responses
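
Assessment on this kind of task usually comes down to comparing predicted labels with gold labels. The following is a minimal sketch, not a reference evaluation script: it computes accuracy and compares against a majority-class baseline. The function names and the toy label lists are hypothetical and only serve the illustration.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def majority_baseline(train_labels, n_predictions):
    """Always predict the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return [most_common_label] * n_predictions


# Toy labels invented for the example.
train_labels = ["entailment", "neutral", "entailment", "contradiction"]
gold = ["entailment", "contradiction", "neutral", "entailment"]

baseline_predictions = majority_baseline(train_labels, len(gold))
baseline_accuracy = accuracy(gold, baseline_predictions)
print(baseline_accuracy)
```

A trained model is only informative if it clearly beats this kind of baseline, which is why it is a common first sanity check.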

Can it be enriched or improved?

Yes, MultiNLI can be enriched or adapted in several ways:

Creating multilingual versions to evaluate models in other languages
Adding metadata about genres or domains for finer filtering
Combining it with SNLI (Stanford NLI) for wider coverage
Automatically generating new pairs with paraphrase or contradiction models
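
Combining MultiNLI with SNLI could look roughly like the sketch below, which assumes both corpora have already been loaded as lists of dictionaries with `sentence1`, `sentence2`, and `gold_label` keys. The `combine_corpora` helper, the `source` field, and the sample rows are hypothetical; the deduplication rule (case-insensitive match on both sentences) is one possible choice among several.

```python
def combine_corpora(multinli_rows, snli_rows):
    """Merge two lists of NLI examples, tagging each row with its origin
    and skipping pairs whose two sentences were already seen."""
    combined = []
    seen = set()
    for source, rows in (("multinli", multinli_rows), ("snli", snli_rows)):
        for row in rows:
            key = (row["sentence1"].strip().lower(), row["sentence2"].strip().lower())
            if key in seen:
                continue
            seen.add(key)
            combined.append({**row, "source": source})
    return combined


# Invented rows for illustration only.
multinli_rows = [
    {"sentence1": "She handed him the letter.", "sentence2": "He received a letter.", "gold_label": "entailment"},
    {"sentence1": "The agency published a report.", "sentence2": "The report was long.", "gold_label": "neutral"},
]
snli_rows = [
    {"sentence1": "A dog runs on the beach.", "sentence2": "A cat sleeps indoors.", "gold_label": "contradiction"},
    {"sentence1": "She handed him the letter.", "sentence2": "He received a letter.", "gold_label": "entailment"},
]

combined = combine_corpora(multinli_rows, snli_rows)
print(len(combined), [row["source"] for row in combined])
```

Keeping the `source` tag makes it possible to filter or reweight the two corpora later, for example when SNLI examples outnumber those of a particular MultiNLI genre.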

🔗 Source: MultiNLI Dataset

Frequently Asked Questions

What is the difference between MultiNLI and SNLI?

SNLI is focused on a single domain (image descriptions), while MultiNLI covers multiple text genres, making it possible to better test the generalization of models across different language styles.
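
One way to make use of the genre coverage is to report accuracy per genre, so that a weakness on a particular style of text is not hidden by the overall average. The sketch below assumes each example carries `genre` and `gold_label` fields; the `accuracy_by_genre` helper, the sample examples, and the predictions are invented for the illustration.

```python
from collections import defaultdict


def accuracy_by_genre(examples, predictions):
    """Return a {genre: accuracy} mapping for aligned examples and predictions."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for example, predicted in zip(examples, predictions):
        genre = example["genre"]
        totals[genre] += 1
        if example["gold_label"] == predicted:
            correct[genre] += 1
    return {genre: correct[genre] / totals[genre] for genre in totals}


# Invented examples and predictions.
examples = [
    {"genre": "fiction", "gold_label": "entailment"},
    {"genre": "fiction", "gold_label": "neutral"},
    {"genre": "telephone", "gold_label": "contradiction"},
    {"genre": "telephone", "gold_label": "neutral"},
]
predictions = ["entailment", "neutral", "contradiction", "entailment"]

per_genre = accuracy_by_genre(examples, predictions)
print(per_genre)
```

A large gap between genres suggests the model has adapted to one language style rather than learned the inference task in general.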

Can MultiNLI be used for both evaluation and training?

Yes. It is frequently used both to fine-tune models and to evaluate how well they perform logical inference.

Why is MultiNLI important for generation models?

Although it is not a generation dataset, MultiNLI helps train models to maintain logical consistency in their responses, which is critical for applications such as chatbots and voice assistants.
