Skip to content

Latest commit

 

History

History
143 lines (131 loc) · 6.74 KB

File metadata and controls

143 lines (131 loc) · 6.74 KB

MultiBLiMP: A Massively Multilingual Benchmark of Linguistic Minimal Pairs

Task Description

MultiBLiMP is a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages, 6 linguistic phenomena and containing more than 125,000 minimal pairs.

Implementation

  • multiblimp_{lang} runs MultiBLiMP for a given language, where {lang} must be replaced by the language's ISO 639-3 code (e.g., eng for English, abk for Abkhazian, wbp for Warlpiri, etc.).
  • multiblimp tag runs MultiBLiMP for all languages

Note: The original implementation is provided here, and the dataset repository also includes a link to a more flexible version of the implementation here. This implementation follows these as closely as possible, but the original implementations normalize length by number of tokens, which is not supported by the Language Model Evaluation Harness (see [1], [2], [3]). For this reason, the implementation provided here includes both the acc (accuracy based on comparing the unnormalized log-probability of the correct and incorrect versions of each sentence) and acc_norm (the same as acc but with sentence log-probability normalized by number of bytes) metrics.

Dataset Details

This table (from the Hugging Face Dataset Repo) lists the languages covered in MultiBLiMP and the number of items for each language.

ISO Code Language n
abk Abkhazian 40
aqz Akuntsu 14
sqi Albanian 243
amh Amharic 112
grc Ancient Greek 3695
hbo Ancient Hebrew 983
apu Apurinã 28
hye Armenian 1415
eus Basque 273
bel Belarusian 2570
ben Bengali 21
bho Bhojpuri 34
bor Borôro 241
bre Breton 260
bul Bulgarian 2458
bua Buriat 103
cat Catalan 2284
chu Church Slavonic 4166
xcl Classical Armenian 1623
ces Czech 4256
dan Danish 50
nld Dutch 2331
egy Egyptian (Ancient) 22
eng English 770
myv Erzya 464
est Estonian 2575
fao Faroese 232
fin Finnish 2570
fra French 2548
glg Galician 753
kat Georgian 204
deu German 2298
aln Gheg Albanian 677
got Gothic 1579
guj Gujarati 7
heb Hebrew 2330
azz H-P Nahuatl 207
hin Hindi 1447
hit Hittite 50
hun Hungarian 845
isl Icelandic 2801
gle Irish 28
ita Italian 2999
quc K'iche' 131
xnr Kangri 86
krl Karelian 260
kxh Karo (Ethiopia) 120
kaz Kazakh 173
kir Kirghiz 185
koi Komi-Permyak 43
kpv Komi-Zyrian 320
lat Latin 3149
lav Latvian 3032
lij Ligurian 254
lit Lithuanian 1180
olo Livvi 190
nds Low German 1774
mkd Macedonian 39
mar Marathi 460
frm Middle French 294
ell Modern Greek 1096
mdf Moksha 82
yrl Nhengatu 720
pcm Nigerian Pidgin 26
kmr Northern Kurdish 544
sme Northern Sami 2536
fro Old French 1976
orv Old Russian 4615
ota Ottoman Turkish 99
fas Persian 2553
xpg Phrygian 50
pol Polish 3272
por Portuguese 3048
ron Romanian 2056
rus Russian 3832
san Sanskrit 4442
gla Scottish Gaelic 66
hbs Serbo-Croatian 3286
sms Skolt Sami 263
slk Slovak 4145
slv Slovenian 4483
spa Spanish 2541
arb Standard Arabic 1215
swe Swedish 201
tam Tamil 382
ttc Tektiteko 69
tpn Tupinambá 9
tur Turkish 1742
uig Uighur 758
ukr Ukrainian 2744
hsb Upper Sorbian 186
urd Urdu 550
urb Urubú-Kaapor 13
uzb Uzbek 50
vep Veps 187
wbp Warlpiri 12
cym Welsh 1120
hyw Western Armenian 1153
wol Wolof 705
sah Yakut 144
nhi Tenango Nahuatl 38

Citation

@misc{jumelet2025multiblimp10massivelymultilingual,
      title={MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs},
      author={Jaap Jumelet and Leonie Weissweiler and Arianna Bisazza},
      year={2025},
      eprint={2504.02768},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.02768},
}

New Task Checklist

  • Is the task an existing benchmark in the literature?
    • Have you referenced the original paper that introduced the task?
    • If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?