Title: TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs
Abstract:
TurBLiMP is the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models. The dataset covers 16 core grammatical phenomena in Turkish, with 1,000 minimal pairs per phenomenon.
Homepage: https://github.com/ezgibasar/TurBLiMP
Citation (BibTeX):
@misc{basar2025turblimpturkishbenchmarklinguistic,
title={TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs},
author={Ezgi Ba{\c{s}}ar and Francesca Padovani and Jaap Jumelet and Arianna Bisazza},
year={2025},
eprint={2506.13487},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.13487}
}
turblimp_core: Runs all 16 'core' grammatical subtasks of TurBLiMP. (The original release also includes additional experimental paradigms that have no single correct answer; these are not included here.)
turblimp_anaphor_agreement: Reflexive pronoun agreement violations
turblimp_argument_structure_transitive: Case marking errors with transitive verbs
turblimp_argument_structure_ditransitive: Case marking errors with ditransitive verbs
turblimp_binding: Principle B violations in binding theory
turblimp_determiners: Obligatory use of the indefinite article
turblimp_ellipsis: Backward gapping with non-parallel word orders
turblimp_irregular_forms: Incorrect aorist allomorph usage
turblimp_island_effects: Wh-adjunct extraction from complex NPs
turblimp_nominalization: Incorrect nominalization suffix selection
turblimp_npi_licensing: Negative polarity items in non-negative contexts
turblimp_passives: Unlicensed use of by-phrases in impersonal passives
turblimp_quantifiers: Quantifier usage with bare nouns
turblimp_relative_clauses: Incorrect case marking in relative clauses
turblimp_scrambling: Illicit postverbal scrambling from embedded clauses
turblimp_subject_agreement: Person/number agreement violations
turblimp_suspended_affixation: Improper tense suffix suspension
Implementation Note: The original implementation normalizes sentence log-probability by the number of tokens, which is not supported by the Language Model Evaluation Harness (see [1], [2], [3]). The implementation provided here therefore reports two metrics: acc (accuracy from comparing the unnormalized log-probabilities of the correct and incorrect version of each sentence) and acc_norm (the same comparison, but with each sentence's log-probability normalized by its length in bytes).
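The difference between the two scoring rules can be sketched as follows. This is a minimal illustration, not the harness's actual code; the function name and the log-probability values are hypothetical.

```python
# Minimal sketch of the two TurBLiMP scoring rules for one minimal pair.
# `logprob_good` / `logprob_bad` stand in for total sentence
# log-probabilities a model might assign (illustrative values only).

def score_pair(good: str, bad: str, logprob_good: float, logprob_bad: float):
    """Return (acc, acc_norm) for a single minimal pair."""
    # acc: compare the raw (unnormalized) sentence log-probabilities.
    acc = logprob_good > logprob_bad
    # acc_norm: divide each log-probability by the sentence's length in
    # bytes before comparing -- the harness's byte-level alternative to
    # the original paper's per-token normalization.
    norm_good = logprob_good / len(good.encode("utf-8"))
    norm_bad = logprob_bad / len(bad.encode("utf-8"))
    acc_norm = norm_good > norm_bad
    return acc, acc_norm

# A longer grammatical sentence can lose on raw log-probability but win
# after length normalization, which is why the two metrics can diverge:
acc, acc_norm = score_pair(
    "a much longer grammatical sentence", "short bad", -40.0, -38.0
)
# acc is False, acc_norm is True
```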
For adding novel benchmarks/datasets to the library:
- Is the task an existing benchmark in the literature?
- Have you referenced the original paper that introduced the task?
- If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?