
🇳🇴 NorEval

Paper

Figure: Overview of the NorEval design. 😼 denotes datasets used in NorBench, NLEBench, ScandEval, and SEB; 🚀 denotes datasets that have not been used in the existing Norwegian benchmarks; and 😎 denotes our novel datasets introduced as part of NorEval. EN = English; BM = Norwegian Bokmål; NN = Norwegian Nynorsk.

🇳🇴 NorEval is a multi-task Norwegian language understanding and generation evaluation benchmark that combines 19 existing peer-reviewed datasets with five datasets created from scratch. NorEval covers nine diverse task categories: sentiment analysis, Norwegian language knowledge, Norwegian-specific & world knowledge, machine reading comprehension, commonsense reasoning, machine translation, text summarization, instruction following, and truthfulness. Our main evaluation principles are:

  • 🌐 Linguistic diversity: support for both official written standards of Norwegian: Bokmål and Nynorsk (the minority variant).
  • 📊 Task diversity: coverage of tasks that have received little attention for Norwegian. In particular, only three of the 24 NorEval datasets appear in the existing Norwegian benchmarks to date: NorBench, NLEBench, ScandEval, and SEB.
  • 🧠 Data quality: focus on peer-reviewed, human-created datasets only, to ensure reliable evaluation in the context of the Norwegian language, culture, and values.
  • 📏 Prompt sensitivity: evaluation across 100+ human-written prompts to account for prompt sensitivity.
  • 👩🏻‍🔬 Standardized evaluation: integration of NorEval into LM Evaluation Harness for flexible and reproducible evaluation.
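Since NorEval is integrated into LM Evaluation Harness, tasks are run through its standard `lm_eval` interface. The sketch below assembles such an invocation as an argument list; the model name is a placeholder, and the exact flags (`--model`, `--model_args`, `--tasks`, `--num_fewshot`) are the harness's standard CLI options, not anything NorEval-specific.

```python
# Sketch: building an lm_eval command line for NorEval tasks.
# Assumes LM Evaluation Harness is installed; the model name is a placeholder.

def build_lm_eval_command(model: str, tasks: list[str], num_fewshot: int = 0) -> list[str]:
    """Assemble an lm_eval invocation as an argument list (e.g. for subprocess.run)."""
    cmd = [
        "lm_eval",
        "--model", "hf",                          # HuggingFace model backend
        "--model_args", f"pretrained={model}",
        "--tasks", ",".join(tasks),               # comma-separated task names
    ]
    if num_fewshot > 0:
        cmd += ["--num_fewshot", str(num_fewshot)]
    return cmd

# Zero-shot Bokmål sentiment analysis (placeholder model name):
print(" ".join(build_lm_eval_command("norallm/normistral-7b-warm", ["norec_sentence"])))
```

The task names passed to `--tasks` are the ones listed in the Bokmål and Nynorsk columns of the table below.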

Tasks

| Name | Bokmål | Nynorsk | k-shot | Task type | Task category |
|------|--------|---------|--------|-----------|---------------|
| NoReC Sentence | norec_sentence | | | Text classification | Sentiment analysis |
| NoReC Document | norec_document | | | Text classification | Sentiment analysis |
| NCB | ncb | | | Sentence ranking | Norwegian language knowledge |
| NorIdiom | noridiom_nob | noridiom_nno | | Sentence completion | Norwegian language knowledge |
| Belebele | norbelebele | | | Multiple-choice question answering | Machine reading comprehension |
| NRK-Quiz-QA | nrk_quiz_qa_nob | nrk_quiz_qa_nno | | Multiple-choice question answering | Norwegian-specific & world knowledge |
| NorOpenBookQA | noropenbookqa_nob | noropenbookqa_nno | | Multiple-choice question answering | Norwegian-specific & world knowledge |
| NorCommonsenseQA | norcommonsenseqa_nob | norcommonsenseqa_nno | | Multiple-choice question answering | Commonsense reasoning |
| NorTruthfulQA Multiple choice | nortruthfulqa_mc_nob | nortruthfulqa_mc_nno | | Multiple-choice question answering | Truthfulness |
| NorQuAD | norquad | | | Generative question answering | Machine reading comprehension |
| NorTruthfulQA Generation | nortruthfulqa_gen_nob | nortruthfulqa_gen_nno | | Generative question answering | Truthfulness |
| ASK-GEC | ask_gec | | | Sequence-to-sequence generation | Norwegian language knowledge |
| NorSumm | norsumm_nob | norsumm_nno | | Sequence-to-sequence generation | Text summarization |
| Tatoeba (English → Bokmål/Nynorsk) | tatoeba_eng_nob | tatoeba_eng_nno | | Sequence-to-sequence generation | Machine translation |
| Tatoeba (Bokmål/Nynorsk → English) | tatoeba_nob_eng | tatoeba_nno_eng | | Sequence-to-sequence generation | Machine translation |
| NorRewrite-Instruct | norrewrite_instruct | | | Sequence-to-sequence generation | Instruction following |
| NorSummarize-Instruct | norsummarize_instruct | | | Sequence-to-sequence generation | Instruction following |
Table description
  • Name: a dataset name with a HuggingFace link.
  • Bokmål: the LM Evaluation Harness task name for the Norwegian Bokmål dataset.
  • Nynorsk: the LM Evaluation Harness task name for the Norwegian Nynorsk dataset, if available.
  • k-shot: the support for k-shot evaluation regimes with k > 0. We follow the original datasets' design and focus mainly on zero-shot evaluation by default.
    • ✅ means that the user can run the evaluation in both zero-shot and k-shot regimes.
    • ❌ denotes that only the zero-shot regime is available because there is no training or validation set from which to sample demonstration examples. Technically, k-shot evaluation on the test set is still possible via sampling without replacement, provided the model is not proprietary and not accessed via an API.
  • Task type: the task type.
  • Task category: the task category.
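For datasets marked ❌, the sampling-without-replacement workaround mentioned above can be sketched as follows. This is an illustrative helper, not part of NorEval; the function and parameter names are our own.

```python
import random

def sample_demonstrations(test_set, current_idx, k, seed=42):
    """Draw k demonstration examples from the test set itself, sampling
    without replacement and excluding the example under evaluation."""
    pool = [i for i in range(len(test_set)) if i != current_idx]
    rng = random.Random(seed)          # fixed seed for reproducibility
    return [test_set[i] for i in rng.sample(pool, k)]
```

Because the demonstrations come from the test set, this regime leaks test data into the prompt and is therefore only meaningful for models that cannot have memorized the set via an API-side cache, as the ❌ note cautions.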
Comments on Belebele

Belebele for Norwegian Bokmål is already available in LM Evaluation Harness as belebele_nob_Latn. However, our version (norbelebele) supports five prompt templates written by native Norwegian speakers, which differ from Belebele's default prompt template.

Citation

```bibtex
@article{mikhailov2025noreval,
  title={NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark},
  author={Mikhailov, Vladislav and Enstad, Tita and Samuel, David and Farseth{\aa}s, Hans Christian and Kutuzov, Andrey and Velldal, Erik and {\O}vrelid, Lilja},
  journal={arXiv preprint arXiv:2504.07749},
  year={2025}
}
```

Checklist

  • Is the task an existing benchmark in the literature?
    • Have you referenced the original paper that introduced the task?
    • If yes, does the original paper provide a reference implementation?
      • Yes; the original implementation was contributed by an author of the benchmark.

If other tasks on this dataset are already supported:

  • Is the "Main" variant of this task clearly denoted?
  • Have you provided a short sentence in a README on what each new variant adds / evaluates?
  • Have you noted which, if any, published evaluation setups are matched by this variant?