CatalanBench

Paper

CatalanBench is a benchmark for evaluating language models on Catalan tasks. That is, it evaluates the ability of a language model to understand and generate Catalan text. CatalanBench offers a combination of pre-existing, open datasets and datasets developed exclusively for this benchmark. All the details of CatalanBench will be published in a paper soon.

The new evaluation datasets included in CatalanBench are:

| Task | Category | Homepage |
|---|---|---|
| ARC_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/arc_ca |
| MGSM_ca | Math | https://huggingface.co/datasets/projecte-aina/mgsm_ca |
| OpenBookQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/openbookqa_ca |
| Parafraseja | Paraphrasing | https://huggingface.co/datasets/projecte-aina/Parafraseja |
| PIQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/piqa_ca |
| SIQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/siqa_ca |
| XStoryCloze_ca | Commonsense Reasoning | https://huggingface.co/datasets/projecte-aina/xstorycloze_ca |

The datasets included in CatalanBench that were made public in previous publications are:

| Task | Category | Paper title | Homepage |
|---|---|---|---|
| Belebele_ca | Reading Comprehension | The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants | https://huggingface.co/datasets/facebook/belebele |
| caBREU | Summarization | Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan | https://huggingface.co/datasets/projecte-aina/caBreu |
| CatalanQA | Question Answering | Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan | https://huggingface.co/datasets/projecte-aina/catalanqa |
| CatCoLA | Linguistic Acceptability | CatCoLA: Catalan Corpus of Linguistic Acceptability | https://huggingface.co/datasets/nbel/CatCoLA |
| Cocoteros_va | Commonsense Reasoning | COCOTEROS_VA: Valencian translation of the COCOTEROS Spanish dataset | https://huggingface.co/datasets/gplsi/cocoteros_va |
| EsCoLA | Linguistic Acceptability | EsCoLA: Spanish Corpus of Linguistic Acceptability | |
| COPA-ca | Commonsense Reasoning | Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan | https://huggingface.co/datasets/projecte-aina/COPA-ca |
| CoQCat | Question Answering | Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan | https://huggingface.co/datasets/projecte-aina/CoQCat |
| FLORES_ca | Translation | The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation | https://huggingface.co/datasets/facebook/flores |
| PAWS-ca | Paraphrasing | Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan | https://huggingface.co/datasets/projecte-aina/PAWS-ca |
| TE-ca | Natural Language Inference | Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan | https://huggingface.co/datasets/projecte-aina/teca |
| TruthfulQA_va | Truthfulness | TruthfulQA: Measuring How Models Mimic Human Falsehoods: The case of Valencian | https://huggingface.co/datasets/gplsi/truthfulqa_va |
| VeritasQA_ca | Truthfulness | VeritasQA: A Truthfulness Benchmark Aimed at Multilingual Transferability | TBA |
| WNLI-ca | Natural Language Inference | Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan | https://huggingface.co/datasets/projecte-aina/wnli-ca |
| XNLI-ca | Natural Language Inference | Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan | https://huggingface.co/datasets/projecte-aina/xnli-ca |
| XNLI-va | Natural Language Inference | Building a Data Infrastructure for a Mid-Resource Language: The Case of Valencian | https://huggingface.co/datasets/gplsi/xnli_va |
| XQuAD-ca | Question Answering | Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan | https://huggingface.co/datasets/projecte-aina/xquad-ca |

Citation

Paper for CatalanBench coming soon.

@inproceedings{baucells-etal-2025-iberobench,
    title = "{I}bero{B}ench: A Benchmark for {LLM} Evaluation in {I}berian Languages",
    author = "Baucells, Irene  and
      Aula-Blasco, Javier  and
      de-Dios-Flores, Iria  and
      Paniagua Su{\'a}rez, Silvia  and
      Perez, Naiara  and
      Salles, Anna  and
      Sotelo Docio, Susana  and
      Falc{\~a}o, J{\'u}lia  and
      Saiz, Jose Javier  and
      Sepulveda Torres, Robiert  and
      Barnes, Jeremy  and
      Gamallo, Pablo  and
      Gonzalez-Agirre, Aitor  and
      Rigau, German  and
      Villegas, Marta",
    editor = "Rambow, Owen  and
      Wanner, Leo  and
      Apidianaki, Marianna  and
      Al-Khalifa, Hend  and
      Eugenio, Barbara Di  and
      Schockaert, Steven",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.699/",
    pages = "10491--10519",
}

Groups and Tasks

Groups

  • catalan_bench: All tasks included in CatalanBench.
  • flores_ca: All FLORES translation tasks from or to Catalan.
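
As a rough illustration of how a group name might be passed to the harness, the sketch below assembles an `lm_eval` command line for the `catalan_bench` group. The helper `build_lm_eval_command` and the model identifier `my-org/my-catalan-model` are hypothetical placeholders, and the batch size is an arbitrary choice; only the group name comes from this README.

```python
import shlex


def build_lm_eval_command(tasks, model_args, batch_size=8):
    """Assemble an `lm_eval` CLI invocation as an argument list.

    Hypothetical helper for illustration; it only builds the command
    and does not run any evaluation.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
    ]


# "my-org/my-catalan-model" is a placeholder, not a real checkpoint.
cmd = build_lm_eval_command(
    tasks=["catalan_bench"],
    model_args="pretrained=my-org/my-catalan-model",
)

# Print a shell-ready string that can be pasted into a terminal.
print(shlex.join(cmd))
```

Replacing `["catalan_bench"]` with `["flores_ca"]` or with a list of individual task names would restrict the run to that subset.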

Tags

  • cabreu: Three CaBREU tasks, one for each type of summary (extractive, abstractive and extreme).
  • phrases_va: Two Phrases_va tasks for language adaptation between Catalan and Valencian.

Tasks

The following tasks evaluate models on the CatalanBench datasets using various scoring methods.

  • arc_ca_challenge
  • arc_ca_easy
  • belebele_cat_Latn
  • cabreu
  • catalanqa
  • catcola
  • cocoteros_va
  • copa_ca
  • coqcat
  • flores_ca
  • flores_ca-de
  • flores_ca-en
  • flores_ca-es
  • flores_ca-eu
  • flores_ca-fr
  • flores_ca-gl
  • flores_ca-it
  • flores_ca-pt
  • flores_de-ca
  • flores_en-ca
  • flores_es-ca
  • flores_eu-ca
  • flores_fr-ca
  • flores_gl-ca
  • flores_it-ca
  • flores_pt-ca
  • mgsm_direct_ca
  • openbookqa_ca
  • parafraseja
  • paws_ca
  • phrases_ca
  • piqa_ca
  • siqa_ca
  • teca
  • truthfulqa_va
  • veritasqa_gen_ca
  • veritasqa_mc1_ca
  • veritasqa_mc2_ca
  • wnli_ca
  • xnli_ca
  • xnli_va
  • xquad_ca
  • xstorycloze_ca
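
The FLORES task names in the list above follow a regular `flores_<source>-<target>` pattern, with Catalan paired against eight other languages in both directions. A minimal sketch that regenerates those names, assuming the language codes are exactly the ones appearing in the list (the helper name `flores_ca_task_names` is hypothetical):

```python
# Language codes paired with Catalan in the task list above.
OTHER_LANGS = ["de", "en", "es", "eu", "fr", "gl", "it", "pt"]


def flores_ca_task_names(other_langs=OTHER_LANGS):
    """Return the directional FLORES task names involving Catalan."""
    # Catalan as the source language, e.g. "flores_ca-en".
    from_ca = [f"flores_ca-{lang}" for lang in other_langs]
    # Catalan as the target language, e.g. "flores_en-ca".
    to_ca = [f"flores_{lang}-ca" for lang in other_langs]
    return from_ca + to_ca


names = flores_ca_task_names()
for name in names:
    print(name)
```

The `flores_ca` group entry itself is not generated here, since it names the whole set rather than a single translation direction.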

Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:

  • belebele_cat_Latn: Belebele Catalan

Checklist

  • Is the task an existing benchmark in the literature?
    • Have you referenced the original paper that introduced the task?
    • If yes, does the original paper provide a reference implementation?
      • Yes, original implementation contributed by author of the benchmark

If other tasks on this dataset are already supported:

  • Is the "Main" variant of this task clearly denoted?
  • Have you provided a short sentence in a README on what each new variant adds / evaluates?
  • Have you noted which, if any, published evaluation setups are matched by this variant?

Changelog

  • version 2.0: (2025-Mar-18) add cocoteros_va task.
  • version 2.1: (2025-Jul-30) add xnli_va task.
  • version 2.2: (2025-Jul-30) add truthfulqa_va task.
  • version 2.3: (2026-Jan-16) exclude line breaks from stop criteria in mgsm_direct_ca.