MultiBLiMP: A Massively Multilingual Benchmark of Linguistic Minimal Pairs

Task Description

MultiBLiMP is a massively multilingual benchmark of linguistic minimal pairs. It covers 101 languages and 6 linguistic phenomena, and contains more than 125,000 minimal pairs.

Paper: https://arxiv.org/abs/2504.02768
GitHub Repo: https://github.com/jumelet/multiblimp/
Hugging Face Dataset Repo: https://huggingface.co/datasets/jumelet/multiblimp

Implementation

multiblimp_{lang} runs MultiBLiMP for a given language, where {lang} must be replaced by the language's ISO 639-3 code (e.g., eng for English, abk for Abkhazian, wbp for Warlpiri).
The multiblimp tag runs MultiBLiMP for all languages.
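
As a rough illustration of the naming scheme above, the sketch below builds per-language task names from ISO 639-3 codes. The helper name multiblimp_task_name is hypothetical and not part of the harness. It only checks that a code has the three-letter shape of an ISO 639-3 code, not that MultiBLiMP actually covers that language.

```python
import re


def multiblimp_task_name(lang: str) -> str:
    """Build a per-language task name such as 'multiblimp_eng'.

    Hypothetical helper for illustration only; it is not part of the harness.
    """
    code = lang.strip().lower()
    # ISO 639-3 codes consist of exactly three ASCII letters.
    if not re.fullmatch(r"[a-z]{3}", code):
        raise ValueError(f"expected a three-letter ISO 639-3 code, got {lang!r}")
    return f"multiblimp_{code}"


# The three example languages named in this README.
task_names = [multiblimp_task_name(code) for code in ("eng", "abk", "wbp")]
print(task_names)
```

The resulting names are the ones to pass as task names when selecting individual languages instead of the multiblimp tag.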

Note: The original implementation is available separately, and the dataset repository also links to a more flexible version of it. This implementation follows both as closely as possible. However, the original implementations normalize sentence log-probability by the number of tokens, which the Language Model Evaluation Harness does not support (see [1], [2], [3]). For this reason, the implementation provided here reports two metrics: acc (accuracy based on comparing the unnormalized log-probabilities of the correct and incorrect versions of each sentence) and acc_norm (the same as acc, but with each sentence's log-probability normalized by its number of bytes).
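
To make the difference between the two metrics concrete, here is a minimal sketch of how acc and acc_norm could be computed from per-sentence log-probabilities. It is not the harness's code. The MinimalPair type, the function names, and the log-probability values are invented for illustration, and the sketch assumes that normalization divides each sentence's summed log-probability by the UTF-8 byte length of the whole sentence.

```python
from typing import NamedTuple, Sequence


class MinimalPair(NamedTuple):
    good: str            # grammatical sentence
    bad: str             # ungrammatical counterpart
    logprob_good: float  # summed log-probability of `good` (made-up value)
    logprob_bad: float   # summed log-probability of `bad` (made-up value)


def pair_correct(pair: MinimalPair, normalize: bool) -> bool:
    """Return True if the grammatical sentence gets the higher score."""
    good, bad = pair.logprob_good, pair.logprob_bad
    if normalize:
        # Assumption: normalize by the UTF-8 byte length of each full sentence.
        good /= len(pair.good.encode("utf-8"))
        bad /= len(pair.bad.encode("utf-8"))
    return good > bad


def accuracy(pairs: Sequence[MinimalPair], normalize: bool = False) -> float:
    """Fraction of pairs where the grammatical sentence scores higher."""
    return sum(pair_correct(p, normalize) for p in pairs) / len(pairs)


pairs = [
    # 14 vs. 15 bytes: the longer ungrammatical sentence wins after normalization.
    MinimalPair("The dogs bark.", "The dogs barks.", -20.0, -21.0),
    # 10 vs. 9 bytes: the grammatical sentence wins under both metrics.
    MinimalPair("She walks.", "She walk.", -12.0, -15.0),
]

print("acc:", accuracy(pairs))                       # unnormalized comparison
print("acc_norm:", accuracy(pairs, normalize=True))  # byte-normalized comparison
```

With these invented numbers the two metrics disagree on the first pair, which is why both are worth checking when the two sentences in a pair differ in length.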

Dataset Details

This table (from the Hugging Face Dataset Repo) lists the languages covered in MultiBLiMP and the number of items for each language.

ISO Code	Language	n
abk	Abkhazian	40
aqz	Akuntsu	14
sqi	Albanian	243
amh	Amharic	112
grc	Ancient Greek	3695
hbo	Ancient Hebrew	983
apu	Apurinã	28
hye	Armenian	1415
eus	Basque	273
bel	Belarusian	2570
ben	Bengali	21
bho	Bhojpuri	34
bor	Borôro	241
bre	Breton	260
bul	Bulgarian	2458
bua	Buriat	103
cat	Catalan	2284
chu	Church Slavonic	4166
xcl	Classical Armenian	1623
ces	Czech	4256
dan	Danish	50
nld	Dutch	2331
egy	Egyptian (Ancient)	22
eng	English	770
myv	Erzya	464
est	Estonian	2575
fao	Faroese	232
fin	Finnish	2570
fra	French	2548
glg	Galician	753
kat	Georgian	204
deu	German	2298
aln	Gheg Albanian	677
got	Gothic	1579
guj	Gujarati	7
heb	Hebrew	2330
azz	H-P Nahuatl	207
hin	Hindi	1447
hit	Hittite	50
hun	Hungarian	845
isl	Icelandic	2801
gle	Irish	28
ita	Italian	2999
quc	K'iche'	131
xnr	Kangri	86
krl	Karelian	260
kxh	Karo (Ethiopia)	120
kaz	Kazakh	173
kir	Kirghiz	185
koi	Komi-Permyak	43
kpv	Komi-Zyrian	320
lat	Latin	3149
lav	Latvian	3032
lij	Ligurian	254
lit	Lithuanian	1180
olo	Livvi	190
nds	Low German	1774
mkd	Macedonian	39
mar	Marathi	460
frm	Middle French	294
ell	Modern Greek	1096
mdf	Moksha	82
yrl	Nhengatu	720
pcm	Nigerian Pidgin	26
kmr	Northern Kurdish	544
sme	Northern Sami	2536
fro	Old French	1976
orv	Old Russian	4615
ota	Ottoman Turkish	99
fas	Persian	2553
xpg	Phrygian	50
pol	Polish	3272
por	Portuguese	3048
ron	Romanian	2056
rus	Russian	3832
san	Sanskrit	4442
gla	Scottish Gaelic	66
hbs	Serbo-Croatian	3286
sms	Skolt Sami	263
slk	Slovak	4145
slv	Slovenian	4483
spa	Spanish	2541
arb	Standard Arabic	1215
swe	Swedish	201
tam	Tamil	382
ttc	Tektiteko	69
tpn	Tupinambá	9
tur	Turkish	1742
uig	Uighur	758
ukr	Ukrainian	2744
hsb	Upper Sorbian	186
urd	Urdu	550
urb	Urubú-Kaapor	13
uzb	Uzbek	50
vep	Veps	187
wbp	Warlpiri	12
cym	Welsh	1120
hyw	Western Armenian	1153
wol	Wolof	705
sah	Yakut	144
nhi	Tenango Nahuatl	38

Citation

@misc{jumelet2025multiblimp10massivelymultilingual,
      title={MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs},
      author={Jaap Jumelet and Leonie Weissweiler and Arianna Bisazza},
      year={2025},
      eprint={2504.02768},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.02768},
}

New Task Checklist

Is the task an existing benchmark in the literature?
- Have you referenced the original paper that introduced the task?
- If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
