New NorEval tasks #3572

Open
davda54 wants to merge 3 commits into EleutherAI:main from davda54:new-noreval-tasks

@davda54 davda54 commented Feb 9, 2026

NorEval: 8 new tasks and corrected NorBelebele data source

Summary

This PR extends the NorEval benchmark suite with 8 new evaluation tasks and fixes the data source for 1 existing task:

  • Add NoCoLA (Norwegian Corpus of Linguistic Acceptability): a minimal-pair linguistic acceptability task.
  • Add NorOpenBookQA without fact (Bokmål and Nynorsk): closed-book variants of the existing NorOpenBookQA task that omit the supporting fact from the prompt.
  • Add NorSumm Translation (Bokmål-Nynorsk, both directions): paragraph-level translation between Norway's two written standards using parallel texts from NorSumm.
  • Add SLIDE (Scandinavian LID Evaluation): multi-label classification of text into Bokmål, Nynorsk, Danish, Swedish, or other languages.
  • Add Tatoeba Sámi (Bokmål-Northern Sámi, both directions): machine-translation evaluation between Norwegian Bokmål and Northern Sámi.
  • Add MultiBLiMP for Sámi: a detailed MultiBLiMP evaluation of linguistic acceptability for Northern Sámi. Unlike the multilingual implementation already included in the harness, it groups the samples by the linguistic phenomenon being tested, which gives more detailed results.
  • Fix NorBelebele data source to use the corrected Norwegian translations from ltg/norbelebele instead of the original facebook/belebele.
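
The per-phenomenon grouping mentioned for MultiBLiMP can be illustrated with a minimal sketch. It is not the code from this PR: the function name `accuracy_by_phenomenon`, the input format, and the phenomenon labels are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_phenomenon(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy separately for each linguistic phenomenon.

    Each result is (phenomenon_label, is_correct). The labels are hypothetical;
    the real dataset defines its own grouping.
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for phenomenon, is_correct in results:
        totals[phenomenon] += 1
        hits[phenomenon] += int(is_correct)
    return {name: hits[name] / totals[name] for name in totals}


# Hypothetical per-sample outcomes, for illustration only.
sample_results = [
    ("subject-verb number", True),
    ("subject-verb number", False),
    ("subject-verb person", True),
]
per_group = accuracy_by_phenomenon(sample_results)
print(per_group)
```

Reporting one score per group makes it visible when a model handles one phenomenon well and another poorly, which a single pooled accuracy would hide.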

Many of these tasks have already been used for evaluating Norwegian language models in Small Languages, Big Models: A Study of Continual Training on Languages of Norway.

Motivation

1. NoCoLA: linguistic acceptability judgment

The Norwegian Corpus of Linguistic Acceptability (Jentoft & Samuel, NoDaLiDa 2023) contains ~99,000 minimal pairs of grammatically correct and incorrect Norwegian Bokmål sentences. Each pair is annotated with one of 11 error categories (inflection, punctuation, word order, spelling, etc.).

The task is formulated as multiple choice: given a pair of sentences, the model must assign higher likelihood to the grammatically correct sentence. This provides a direct measure of a model's syntactic and grammatical knowledge of Norwegian, analogous to BLiMP for English. Since the evaluation relies solely on comparing log-likelihoods of correct vs. incorrect sentences, it is particularly well-suited for evaluating base language models.
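
The scoring rule described above can be sketched as follows. This is a minimal illustration rather than the harness implementation: the function name `minimal_pair_accuracy` and the input format (pairs of precomputed log-likelihoods) are assumptions made for the example.

```python
from typing import Sequence, Tuple


def minimal_pair_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the correct sentence gets the higher log-likelihood.

    Each pair is (loglik_correct, loglik_incorrect). Ties count as errors.
    """
    if not pairs:
        raise ValueError("at least one pair is required")
    hits = sum(1 for good, bad in pairs if good > bad)
    return hits / len(pairs)


# Hypothetical log-likelihoods for three minimal pairs.
example_pairs = [(-12.3, -15.8), (-20.1, -19.4), (-8.0, -9.5)]
print(minimal_pair_accuracy(example_pairs))
```

Because only the relative ranking of the two sentences matters, no generation or answer parsing is involved, which is why this setup suits base models.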

The evaluation is similar to the existing NorEval NCB task, but NoCoLA has many more samples and is not limited to punctuation errors.

Task configs: 1 config (nocola), using accuracy as the metric.

2. NorOpenBookQA without fact: closed-book science QA

The existing NorOpenBookQA task (Mikhailov et al., NoDaLiDa 2025) evaluates elementary science knowledge as 4-way multiple-choice QA. The original "open-book" variants include a supporting fact in the prompt that helps answer the question.

The new no-fact variants omit this supporting fact, turning the task into a closed-book QA evaluation, similar to how the original English OpenBookQA is typically evaluated. This tests whether the model has internalized the relevant world knowledge, rather than whether it can use a provided fact to select the correct answer, which mostly tests reading comprehension. The closed-book setting is a more demanding evaluation of a model's knowledge and is arguably more representative of how models are used in practice.

Both Bokmål and Nynorsk variants are provided, each with 5 prompt templates (p0-p4) covering different prompt formulations (bare question stem, explicit answer listing, A/B/C/D labeling, etc.).

Task configs: 10 configs (5 prompts x 2 languages: noropenbookqa_no_fact_nob_p{0..4}, noropenbookqa_no_fact_nno_p{0..4}), using accuracy and normalized accuracy.
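
The `p{0..4}` shorthand above stands for one config per prompt template. A small sketch of how the shorthand expands into the 10 task names, assuming the names are exactly as written in this description (the helper `expand_range` is hypothetical and not part of the harness):

```python
import re
from typing import List


def expand_range(pattern: str) -> List[str]:
    """Expand a single `{a..b}` integer range in a task-name pattern."""
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [f"{prefix}{i}{suffix}" for i in range(start, end + 1)]


# Patterns copied from the task config list above.
patterns = [
    "noropenbookqa_no_fact_nob_p{0..4}",
    "noropenbookqa_no_fact_nno_p{0..4}",
]
task_names = [name for pattern in patterns for name in expand_range(pattern)]
print(",".join(task_names))
```

The resulting comma-separated string has the shape the harness accepts for its `--tasks` option, so all prompt variants can be run in one invocation.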

3. NorSumm Translation: Bokmål-Nynorsk translation

The NorSumm Bokmål-Nynorsk Translation dataset provides 189 manually translated parallel paragraphs between Norwegian Bokmål and Nynorsk, extracted from the NorSumm summarization corpus (Touileb et al.).

Both directions are evaluated (Nynorsk-to-Bokmål and Bokmål-to-Nynorsk), each with 4 prompt templates ranging from minimal ("Nynorsk: ... Bokmål:") to explicit instruction-style prompts.

Task configs: 8 configs (4 prompts x 2 directions: norsumm_nno_nob_translation_p{0..3}, norsumm_nob_nno_translation_p{0..3}), using BLEU and chrF.

4. SLIDE: Scandinavian language identification

SLIDE (Fedorova et al., RESOURCEFUL 2025) is a multi-label Scandinavian language identification dataset with sentences sourced from Universal Dependencies and Tatoeba. Each sentence is labeled with one or more of 5 categories: Bokmål, Nynorsk, Danish, Swedish, or other. The validation and test sets are manually verified and annotated by native speakers.

Distinguishing between closely related Scandinavian languages is a challenging and linguistically interesting task. Short sentences can be valid in multiple languages simultaneously (e.g., "Det er interessant." is valid in Bokmål, Nynorsk, and Danish), which is why the task uses multi-label classification. The evaluation uses a custom process_multilabel function to handle cases where multiple labels are correct. We use the "loose accuracy" metric, which counts a prediction as correct if it matches any of the correct labels.
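
The "loose accuracy" rule can be sketched as follows. This is an illustration of the metric as described above, not the `process_multilabel` function from this PR; the function name `loose_accuracy` and the label strings are assumptions.

```python
from typing import Sequence, Set


def loose_accuracy(predictions: Sequence[str], gold_labels: Sequence[Set[str]]) -> float:
    """Count a prediction as correct if it matches any of the gold labels."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    if not predictions:
        raise ValueError("at least one example is required")
    hits = sum(1 for pred, gold in zip(predictions, gold_labels) if pred in gold)
    return hits / len(predictions)


# Hypothetical predictions and gold label sets, for illustration only.
predictions = ["Bokmål", "Swedish", "Danish"]
gold_labels = [{"Bokmål", "Nynorsk", "Danish"}, {"Swedish"}, {"Nynorsk"}]
print(loose_accuracy(predictions, gold_labels))
```

Under this rule a model is not penalized for picking only one of several valid languages for an ambiguous sentence, which fits a setup where the model produces a single answer per sentence.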

5 prompt templates are provided, covering different formulations of the language identification question.

Task configs: 5 configs (slide_p{0..4}), using accuracy.

5. Tatoeba Sámi: Northern Sámi-Bokmål translation

The Saami Tatoeba dataset contains 187 Northern Sámi sentences with parallel Norwegian Bokmål translations, collected from Tatoeba and manually verified or translated where necessary.

Northern Sámi is the most widely spoken Sámi language, with approximately 25,000 speakers, and is classified as an endangered language. Evaluating translation to and from Northern Sámi provides an important measure of how well language models handle low-resource indigenous languages of Norway, and is especially relevant given ongoing efforts to develop NLP resources for Sámi languages.

Both translation directions are evaluated (Bokmål-to-Sámi and Sámi-to-Bokmål), each with 5 prompt templates. Notably, prompt p4 uses Northern Sámi language names ("Dárogiella" for Norwegian, "Davvisámegiella" for Northern Sámi).

Task configs: 10 configs (5 prompts x 2 directions: tatoeba_nob_sme_p{0..4}, tatoeba_sme_nob_p{0..4}), using BLEU and chrF.

6. Fix NorBelebele data source

The existing NorBelebele reading comprehension task used facebook/belebele with the nob_Latn subset. This data source contains known Norwegian translation errors originating from the FLORES translations.

The corrected NorBelebele dataset (ltg/norbelebele) incorporates improved Norwegian Bokmål translations from the openlanguagedata/flores_plus project, as documented in Improved Norwegian Bokmål Translations for FLORES (Maehlum et al., WMT 2025). Using the corrected data source ensures that the reading comprehension evaluation is not confounded by translation artifacts.

Changes: Updated dataset_path from facebook/belebele to ltg/norbelebele and removed the now-unnecessary dataset_name: nob_Latn.
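
The effect of this change on the task config can be sketched with a plain dictionary standing in for the YAML file. Only the two keys named above are taken from this PR; the helper `switch_to_norbelebele` is hypothetical and the real config contains further fields that are left untouched.

```python
from typing import Any, Dict


def switch_to_norbelebele(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the config that points at the corrected dataset."""
    updated = dict(config)
    updated["dataset_path"] = "ltg/norbelebele"
    # The corrected dataset is addressed without a subset name.
    updated.pop("dataset_name", None)
    return updated


# Only the two fields affected by this PR are shown.
old_config = {"dataset_path": "facebook/belebele", "dataset_name": "nob_Latn"}
new_config = switch_to_norbelebele(old_config)
print(new_config)
```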

Files changed

| File | Change |
| --- | --- |
| lm_eval/tasks/noreval/nocola/nocola.yaml | New file: NoCoLA linguistic acceptability task config |
| lm_eval/tasks/noreval/noropenbookqa/nob_no_fact/noropenbookqa_nob_no_fact_p{0..4}.yaml | New files: NorOpenBookQA Bokmål without fact (5 prompt templates) |
| lm_eval/tasks/noreval/noropenbookqa/nno_no_fact/noropenbookqa_nno_no_fact_p{0..4}.yaml | New files: NorOpenBookQA Nynorsk without fact (5 prompt templates) |
| lm_eval/tasks/noreval/norsumm_translation/_norsumm_translation_yaml | New file: Base config for NorSumm translation tasks |
| lm_eval/tasks/noreval/norsumm_translation/norsumm_nno_nob/tatoeba_nno_nob_p{0..3}.yaml | New files: NorSumm Nynorsk-to-Bokmål translation (4 prompt templates) |
| lm_eval/tasks/noreval/norsumm_translation/norsumm_nob_nno/tatoeba_nob_nno_p{0..3}.yaml | New files: NorSumm Bokmål-to-Nynorsk translation (4 prompt templates) |
| lm_eval/tasks/noreval/slide/_slide_yaml | New file: Base config for SLIDE language identification |
| lm_eval/tasks/noreval/slide/slide_p{0..4}.yaml | New files: SLIDE Scandinavian language identification (5 prompt templates) |
| lm_eval/tasks/noreval/tatoeba/tatoeba_nob_sme/tatoeba_nob_sme_p{0..4}.yaml | New files: Tatoeba Bokmål-to-Northern Sámi translation (5 prompt templates) |
| lm_eval/tasks/noreval/tatoeba/tatoeba_sme_nob/tatoeba_sme_nob_p{0..4}.yaml | New files: Tatoeba Northern Sámi-to-Bokmål translation (5 prompt templates) |
| lm_eval/tasks/noreval/tatoeba/noreval_multiblimp/ | New files: Detailed MultiBLiMP evaluation of Sámi with specific configs for each grammatical group |
| lm_eval/tasks/noreval/norbelebele/_norbelebele_yaml | Fix data source: facebook/belebele -> ltg/norbelebele |

Verification of the code changes

We have run the updated NorEval benchmarks on 25 language models; the results are available at https://ltgoslo.github.io/noreval-stats/#models=all. They are consistent with expectations based on the dataset specifications and the nature of the tasks.

@davda54 davda54 requested a review from baberabb as a code owner February 9, 2026 12:35

davda54 commented Feb 13, 2026

Tagging @vmkhlv as the main maintainer of NorEval
