Tagging @vmkhlv as the main maintainer of NorEval |
NorEval: 8 new tasks and corrected NorBelebele data source
Summary
This PR extends the NorEval benchmark suite with 8 new evaluation tasks and fixes the data source for 1 existing task: `ltg/norbelebele` instead of the original `facebook/belebele`. Many of these tasks have already been used for evaluating Norwegian language models in Small Languages, Big Models: A Study of Continual Training on Languages of Norway.
Motivation
1. NoCoLA: linguistic acceptability judgment
The Norwegian Corpus of Linguistic Acceptability (Jentoft & Samuel, NoDaLiDa 2023) contains ~99,000 minimal pairs of grammatically correct and incorrect Norwegian Bokmål sentences. Each pair is annotated with one of 11 error categories (inflection, punctuation, word order, spelling, etc.).
The task is formulated as multiple choice: given a pair of sentences, the model must assign higher likelihood to the grammatically correct sentence. This provides a direct measure of a model's syntactic and grammatical knowledge of Norwegian, analogous to BLiMP for English. Since the evaluation relies solely on comparing log-likelihoods of correct vs. incorrect sentences, it is particularly well-suited for evaluating base language models.
The evaluation is similar to the existing NorEval NCB task, but NoCoLA has many more samples and is not limited to detecting correct punctuation.
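The minimal-pair comparison can be sketched as follows. Here `loglikelihood` stands in for any function that scores a sentence under the model (e.g. a summed token log-probability); the names are illustrative and not NorEval's actual implementation.

```python
def minimal_pair_accuracy(pairs, loglikelihood):
    """Fraction of (correct, incorrect) pairs where the model assigns a
    strictly higher log-likelihood to the grammatically correct sentence."""
    hits = sum(1 for good, bad in pairs if loglikelihood(good) > loglikelihood(bad))
    return hits / len(pairs)
```

This is what accuracy reduces to when each item is a two-way choice between the members of a minimal pair.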
Task configs: 1 config (`nocola`), using accuracy as the metric.

2. NorOpenBookQA without fact: closed-book science QA
The existing NorOpenBookQA task (Mikhailov et al., NoDaLiDa 2025) evaluates elementary science knowledge as 4-way multiple-choice QA. The original "open-book" variants include a supporting fact in the prompt that helps answer the question.
The new no-fact variants omit this supporting fact, turning the task into a closed-book QA evaluation, similar to how the original English OpenBookQA is typically evaluated. This tests whether the model has internalized the relevant world knowledge, rather than whether it can use a provided fact to select the correct answer (which largely tests reading comprehension). The closed-book setting is a more demanding evaluation of a model's knowledge and is arguably more representative of how models are used in practice.
Both Bokmål and Nynorsk variants are provided, each with 5 prompt templates (p0-p4) covering different prompt formulations (bare question stem, explicit answer listing, A/B/C/D labeling, etc.).
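One of the prompt formulations mentioned above (A/B/C/D labeling) can be sketched like this. This is an illustrative formatter only; the actual p0-p4 templates differ in wording and are defined in the task YAMLs.

```python
def format_mcqa_prompt(question, choices):
    """Build an illustrative A/B/C/D-labeled multiple-choice prompt.

    Not the exact NorEval template: the real prompts are specified in the
    per-task YAML configs."""
    labels = "ABCD"
    lines = [question]
    lines += [f"{label}. {choice}" for label, choice in zip(labels, choices)]
    lines.append("Svar:")  # "Answer:" in Norwegian
    return "\n".join(lines)
```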
Task configs: 10 configs (5 prompts x 2 languages: `noropenbookqa_no_fact_nob_p{0..4}`, `noropenbookqa_no_fact_nno_p{0..4}`), using accuracy and normalized accuracy.

3. NorSumm Translation: Bokmål-Nynorsk translation
The NorSumm Bokmål-Nynorsk Translation dataset provides 189 manually translated parallel paragraphs between Norwegian Bokmål and Nynorsk, extracted from the NorSumm summarization corpus (Touileb et al.).
Both directions are evaluated (Nynorsk-to-Bokmål and Bokmål-to-Nynorsk), each with 4 prompt templates ranging from minimal ("Nynorsk: ... Bokmål:") to explicit instruction-style prompts.
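Translation quality for these tasks is scored with BLEU and chrF; a simplified sentence-level chrF can be sketched as below. This is a pedagogical re-implementation (character n-grams only, no whitespace normalization or word n-grams), not the sacreBLEU implementation the harness relies on.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: average F-beta over character 1..max_n-grams,
    with recall weighted beta times as much as precision."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string shorter than n
        overlap = sum((hyp & ref).values())
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * precision * recall
                      / (beta**2 * precision + recall))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```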
Task configs: 8 configs (4 prompts x 2 directions: `norsumm_nno_nob_translation_p{0..3}`, `norsumm_nob_nno_translation_p{0..3}`), using BLEU and chrF.

4. SLIDE: Scandinavian language identification
SLIDE (Fedorova et al., RESOURCEFUL 2025) is a multi-label Scandinavian language identification dataset with sentences sourced from Universal Dependencies and Tatoeba. Each sentence is labeled with one or more of 5 categories: Bokmål, Nynorsk, Danish, Swedish, or other. The validation and test sets are manually verified and annotated by native speakers.
Distinguishing between closely related Scandinavian languages is a challenging and linguistically interesting task. Short sentences can be valid in multiple languages simultaneously (e.g., "Det er interessant." is valid in Bokmål, Nynorsk, and Danish), which is why the task uses multi-label classification. The evaluation uses a custom `process_multilabel` function to handle cases where multiple labels are correct. We use the "loose accuracy" metric, which counts a prediction as correct if it matches any of the correct labels.

5 prompt templates are provided, covering different formulations of the language identification question.
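The loose-accuracy idea can be sketched as follows; this is an illustrative re-implementation, not NorEval's `process_multilabel` itself.

```python
def loose_accuracy(predictions, gold_label_sets):
    """'Loose accuracy' for multi-label language ID: a prediction counts
    as correct if it matches ANY of the gold labels for that sentence."""
    hits = sum(1 for pred, gold in zip(predictions, gold_label_sets)
               if pred in gold)
    return hits / len(predictions)
```

For a sentence like "Det er interessant." with gold labels {Bokmål, Nynorsk, Danish}, predicting any one of the three is scored as correct.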
Task configs: 5 configs (`slide_p{0..4}`), using accuracy.

5. Tatoeba Sámi: Northern Sámi-Bokmål translation
The Saami Tatoeba dataset contains 187 Northern Sámi sentences with parallel Norwegian Bokmål translations, collected from Tatoeba and manually verified or translated where necessary.
Northern Sámi is the most widely spoken Sámi language, with approximately 25,000 speakers, and is classified as an endangered language. Evaluating translation to and from Northern Sámi provides an important measure of how well language models handle low-resource indigenous languages of Norway, and is especially relevant given ongoing efforts to develop NLP resources for Sámi languages.
Both translation directions are evaluated (Bokmål-to-Sámi and Sámi-to-Bokmål), each with 5 prompt templates. Notably, prompt p4 uses Northern Sámi language names ("Dárogiella" for Norwegian, "Davvisámegiella" for Northern Sámi).
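A minimal "Source-language: ... Target-language:" prompt in the style described above can be sketched as follows; the real p0-p4 templates are defined in the task YAMLs.

```python
def format_translation_prompt(source, src_name, tgt_name):
    """Illustrative minimal translation prompt: the source sentence labeled
    with its language name, followed by the target-language label as a cue
    for the model to complete."""
    return f"{src_name}: {source}\n{tgt_name}:"
```

A p4-style call would use the Northern Sámi language names, e.g. `format_translation_prompt(sent, "Davvisámegiella", "Dárogiella")` for the Sámi-to-Bokmål direction.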
Task configs: 10 configs (5 prompts x 2 directions: `tatoeba_nob_sme_p{0..4}`, `tatoeba_sme_nob_p{0..4}`), using BLEU and chrF.

6. Fix NorBelebele data source
The existing NorBelebele reading comprehension task was using `facebook/belebele` with the `nob_Latn` subset. This data source contains known Norwegian translation errors originating from the FLORES translations.

The corrected NorBelebele dataset (`ltg/norbelebele`) incorporates improved Norwegian Bokmål translations from the `openlanguagedata/flores_plus` project, as documented in Improved Norwegian Bokmål Translations for FLORES (Maehlum et al., WMT 2025). Using the corrected data source ensures that the reading comprehension evaluation is not confounded by translation artifacts.

Changes: Updated `dataset_path` from `facebook/belebele` to `ltg/norbelebele` and removed the now-unnecessary `dataset_name: nob_Latn`.

Files changed
- `lm_eval/tasks/noreval/nocola/nocola.yaml`
- `lm_eval/tasks/noreval/noropenbookqa/nob_no_fact/noropenbookqa_nob_no_fact_p{0..4}.yaml`
- `lm_eval/tasks/noreval/noropenbookqa/nno_no_fact/noropenbookqa_nno_no_fact_p{0..4}.yaml`
- `lm_eval/tasks/noreval/norsumm_translation/_norsumm_translation_yaml`
- `lm_eval/tasks/noreval/norsumm_translation/norsumm_nno_nob/tatoeba_nno_nob_p{0..3}.yaml`
- `lm_eval/tasks/noreval/norsumm_translation/norsumm_nob_nno/tatoeba_nob_nno_p{0..3}.yaml`
- `lm_eval/tasks/noreval/slide/_slide_yaml`
- `lm_eval/tasks/noreval/slide/slide_p{0..4}.yaml`
- `lm_eval/tasks/noreval/tatoeba/tatoeba_nob_sme/tatoeba_nob_sme_p{0..4}.yaml`
- `lm_eval/tasks/noreval/tatoeba/tatoeba_sme_nob/tatoeba_sme_nob_p{0..4}.yaml`
- `lm_eval/tasks/noreval/tatoeba/noreval_multiblimp/`
- `lm_eval/tasks/noreval/norbelebele/_norbelebele_yaml` (`facebook/belebele` -> `ltg/norbelebele`)

Verification of the code changes
We have run the updated NorEval benchmarks on 25 language models; the results are available at https://ltgoslo.github.io/noreval-stats/#models=all and are consistent with expectations based on the dataset specifications and the nature of the tasks.