
🇳🇴 NorEval

Paper

Figure: Overview of the NorEval design. 😼 denotes datasets used in NorBench, NLEBench, ScandEval, and SEB; 🚀 denotes datasets that have not been used in the existing Norwegian benchmarks; and 😎 denotes our novel datasets introduced as part of NorEval. EN = English; BM = Norwegian Bokmål; NN = Norwegian Nynorsk.

🇳🇴 NorEval is a multi-task Norwegian language understanding and generation evaluation benchmark that combines 19 existing peer-reviewed datasets with five datasets created from scratch. NorEval covers nine diverse task categories: sentiment analysis, Norwegian language knowledge, Norwegian-specific & world knowledge, machine reading comprehension, commonsense reasoning, machine translation, text summarization, instruction following, and truthfulness. Our main evaluation principles are:

  • 🌐 Linguistic diversity: support for both official written standards of Norwegian: Bokmål and Nynorsk (the minority variant).
  • 📊 Task diversity: coverage of tasks that have received little attention for Norwegian. In particular, only three of the 24 NorEval datasets appear in the existing Norwegian benchmarks to date: NorBench, NLEBench, ScandEval, and SEB.
  • 🧠 Data quality: focus on peer-reviewed, human-created datasets only, to ensure reliable evaluation in the context of the Norwegian language, culture, and values.
  • 📏 Prompt sensitivity: evaluation across 100+ human-written prompts to account for prompt sensitivity.
  • 👩🏻‍🔬 Standardized evaluation: integration of NorEval into LM Evaluation Harness for flexible and reproducible evaluation.
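Since NorEval is integrated into LM Evaluation Harness, tasks are run through its standard `lm_eval` interface. The sketch below assembles such an invocation as an argument list; the model name is a placeholder, and the exact flags (`--model`, `--model_args`, `--tasks`, `--num_fewshot`) are the harness's standard CLI options, not anything NorEval-specific.

```python
# Sketch: building an lm_eval command line for NorEval tasks.
# Assumes LM Evaluation Harness is installed; the model name is a placeholder.

def build_lm_eval_command(model: str, tasks: list[str], num_fewshot: int = 0) -> list[str]:
    """Assemble an lm_eval invocation as an argument list (e.g. for subprocess.run)."""
    cmd = [
        "lm_eval",
        "--model", "hf",                          # HuggingFace model backend
        "--model_args", f"pretrained={model}",
        "--tasks", ",".join(tasks),               # comma-separated task names
    ]
    if num_fewshot > 0:
        cmd += ["--num_fewshot", str(num_fewshot)]
    return cmd

# Zero-shot Bokmål sentiment analysis (placeholder model name):
print(" ".join(build_lm_eval_command("norallm/normistral-7b-warm", ["norec_sentence"])))
```

The task names passed to `--tasks` are the ones listed in the Bokmål and Nynorsk columns of the table below.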

Tasks

| Name | Bokmål | Nynorsk | k-shot | Task type | Task category |
|------|--------|---------|--------|-----------|---------------|
| NoReC Sentence | norec_sentence | | | Text classification | Sentiment analysis |
| NoReC Document | norec_document | | | Text classification | Sentiment analysis |
| NCB | ncb | | | Sentence ranking | Norwegian language knowledge |
| NorIdiom | noridiom_nob | noridiom_nno | | Sentence completion | Norwegian language knowledge |
| Belebele | norbelebele | | | Multiple-choice question answering | Machine reading comprehension |
| NRK-Quiz-QA | nrk_quiz_qa_nob | nrk_quiz_qa_nno | | Multiple-choice question answering | Norwegian-specific & world knowledge |
| NorOpenBookQA | noropenbookqa_nob | noropenbookqa_nno | | Multiple-choice question answering | Norwegian-specific & world knowledge |
| NorCommonsenseQA | norcommonsenseqa_nob | norcommonsenseqa_nno | | Multiple-choice question answering | Commonsense reasoning |
| NorTruthfulQA Multiple choice | nortruthfulqa_mc_nob | nortruthfulqa_mc_nno | | Multiple-choice question answering | Truthfulness |
| NorQuAD | norquad | | | Generative question answering | Machine reading comprehension |
| NorTruthfulQA Generation | nortruthfulqa_gen_nob | nortruthfulqa_gen_nno | | Generative question answering | Truthfulness |
| ASK-GEC | ask_gec | | | Sequence-to-sequence generation | Norwegian language knowledge |
| NorSumm | norsumm_nob | norsumm_nno | | Sequence-to-sequence generation | Text summarization |
| Tatoeba (English → Bokmål/Nynorsk) | tatoeba_eng_nob | tatoeba_eng_nno | | Sequence-to-sequence generation | Machine translation |
| Tatoeba (Bokmål/Nynorsk → English) | tatoeba_nob_eng | tatoeba_nno_eng | | Sequence-to-sequence generation | Machine translation |
| NorRewrite-Instruct | norrewrite_instruct | | | Sequence-to-sequence generation | Instruction following |
| NorSummarize-Instruct | norsummarize_instruct | | | Sequence-to-sequence generation | Instruction following |
Table description
  • Name: a dataset name with a HuggingFace link.
  • Bokmål: the LM Evaluation Harness task name for the Norwegian Bokmål dataset.
  • Nynorsk: the LM Evaluation Harness task name for the Norwegian Nynorsk dataset, if available.
  • k-shot: the support for k-shot evaluation regimes with k > 0. We follow the original datasets' design and focus mainly on zero-shot evaluation by default.
    • ✅ means that the user can run the evaluation in both zero-shot and k-shot regimes.
    • ❌ denotes that only the zero-shot regime is available because there is no training or validation set from which to sample demonstration examples. Technically, k-shot evaluation on the test set is still possible via sampling without replacement, provided the model is not proprietary and not accessed via an API.
  • Task type: the task type.
  • Task category: the task category.
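For datasets marked ❌, the sampling-without-replacement workaround mentioned above can be sketched as follows. This is an illustrative helper, not part of NorEval; the function and parameter names are our own.

```python
import random

def sample_demonstrations(test_set, current_idx, k, seed=42):
    """Draw k demonstration examples from the test set itself, sampling
    without replacement and excluding the example under evaluation."""
    pool = [i for i in range(len(test_set)) if i != current_idx]
    rng = random.Random(seed)          # fixed seed for reproducibility
    return [test_set[i] for i in rng.sample(pool, k)]
```

Because the demonstrations come from the test set, this regime leaks test data into the prompt and is therefore only meaningful for models that cannot have memorized the set via an API-side cache, as the ❌ note cautions.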
Comments on Belebele

Belebele for Norwegian Bokmål is already available in LM Evaluation Harness as belebele_nob_Latn. However, our version (norbelebele) supports five prompt templates written by native Norwegian speakers, which differ from Belebele's default prompt template.

Citation

```bibtex
@article{mikhailov2025noreval,
  title={NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark},
  author={Mikhailov, Vladislav and Enstad, Tita and Samuel, David and Farseth{\aa}s, Hans Christian and Kutuzov, Andrey and Velldal, Erik and {\O}vrelid, Lilja},
  journal={arXiv preprint arXiv:2504.07749},
  year={2025}
}
```

Checklist

  • Is the task an existing benchmark in the literature?
    • Have you referenced the original paper that introduced the task?
    • If yes, does the original paper provide a reference implementation?
      • Yes; the original implementation was contributed by an author of the benchmark.

If other tasks on this dataset are already supported:

  • Is the "Main" variant of this task clearly denoted?
  • Have you provided a short sentence in a README on what each new variant adds / evaluates?
  • Have you noted which, if any, published evaluation setups are matched by this variant?