Title: TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs
Abstract:
TurBLiMP is the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models. The dataset covers 16 core grammatical phenomena in Turkish, with 1,000 minimal pairs per phenomenon.
Homepage: https://github.com/ezgibasar/TurBLiMP
Citation (BibTeX):
@misc{basar2025turblimpturkishbenchmarklinguistic,
title={TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs},
author={Ezgi Ba{\c{s}}ar and Francesca Padovani and Jaap Jumelet and Arianna Bisazza},
year={2025},
eprint={2506.13487},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.13487}
}
turblimp_core: Runs all 16 'core' grammatical subtasks of TurBLiMP. (The original release also includes additional experimental paradigms that have no single correct answer; these are not included here.)
turblimp_anaphor_agreement: Reflexive pronoun agreement violations
turblimp_argument_structure_transitive: Case marking errors with transitive verbs
turblimp_argument_structure_ditransitive: Case marking errors with ditransitive verbs
turblimp_binding: Principle B violations in binding theory
turblimp_determiners: Obligatory use of the indefinite article
turblimp_ellipsis: Backward gapping with non-parallel word orders
turblimp_irregular_forms: Incorrect aorist allomorph usage
turblimp_island_effects: Wh-adjunct extraction from complex NPs
turblimp_nominalization: Incorrect nominalization suffix selection
turblimp_npi_licensing: Negative polarity items in non-negative contexts
turblimp_passives: Unlicensed use of by-phrases in impersonal passives
turblimp_quantifiers: Quantifier usage with bare nouns
turblimp_relative_clauses: Incorrect case marking in relative clauses
turblimp_scrambling: Illicit postverbal scrambling from embedded clauses
turblimp_subject_agreement: Person/number agreement violations
turblimp_suspended_affixation: Improper tense suffix suspension
Implementation Note: The original implementation normalizes sentence log-probability by the number of tokens, which is not supported by the Language Model Evaluation Harness (see [1], [2], [3]). The implementation provided here therefore reports two metrics: acc (accuracy from comparing the unnormalized log-probabilities of the correct and incorrect version of each sentence) and acc_norm (the same comparison, but with each sentence's log-probability normalized by its length in bytes).
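The difference between the two scoring rules can be sketched as follows. This is a minimal illustration, not the harness's actual code; the function name and the log-probability values are hypothetical.

```python
# Minimal sketch of the two TurBLiMP scoring rules for one minimal pair.
# `logprob_good` / `logprob_bad` stand in for total sentence
# log-probabilities a model might assign (illustrative values only).

def score_pair(good: str, bad: str, logprob_good: float, logprob_bad: float):
    """Return (acc, acc_norm) for a single minimal pair."""
    # acc: compare the raw (unnormalized) sentence log-probabilities.
    acc = logprob_good > logprob_bad
    # acc_norm: divide each log-probability by the sentence's length in
    # bytes before comparing -- the harness's byte-level alternative to
    # the original paper's per-token normalization.
    norm_good = logprob_good / len(good.encode("utf-8"))
    norm_bad = logprob_bad / len(bad.encode("utf-8"))
    acc_norm = norm_good > norm_bad
    return acc, acc_norm

# A longer grammatical sentence can lose on raw log-probability but win
# after length normalization, which is why the two metrics can diverge:
acc, acc_norm = score_pair(
    "a much longer grammatical sentence", "short bad", -40.0, -38.0
)
# acc is False, acc_norm is True
```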
For adding novel benchmarks/datasets to the library:
- Is the task an existing benchmark in the literature?
- Have you referenced the original paper that introduced the task?
- If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?