CatalanBench is a benchmark for evaluating language models on Catalan tasks. That is, it evaluates the ability of a language model to understand and generate Catalan text. CatalanBench offers a combination of pre-existing, open datasets and datasets developed exclusively for this benchmark. All the details of CatalanBench will be published in a paper soon.
The new evaluation datasets included in CatalanBench are:
| Task | Category | Homepage |
|---|---|---|
| ARC_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/arc_ca |
| MGSM_ca | Math | https://huggingface.co/datasets/projecte-aina/mgsm_ca |
| OpenBookQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/openbookqa_ca |
| Parafraseja | Paraphrasing | https://huggingface.co/datasets/projecte-aina/Parafraseja |
| PIQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/piqa_ca |
| SIQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/siqa_ca |
| XStoryCloze_ca | Commonsense Reasoning | https://huggingface.co/datasets/projecte-aina/xstorycloze_ca |
The datasets included in CatalanBench that have been made public in previous publications are:
A dedicated paper for CatalanBench is forthcoming. In the meantime, the benchmark is described in the IberoBench paper:
```
@inproceedings{baucells-etal-2025-iberobench,
    title = "{I}bero{B}ench: A Benchmark for {LLM} Evaluation in {I}berian Languages",
    author = "Baucells, Irene and
      Aula-Blasco, Javier and
      de-Dios-Flores, Iria and
      Paniagua Su{\'a}rez, Silvia and
      Perez, Naiara and
      Salles, Anna and
      Sotelo Docio, Susana and
      Falc{\~a}o, J{\'u}lia and
      Saiz, Jose Javier and
      Sepulveda Torres, Robiert and
      Barnes, Jeremy and
      Gamallo, Pablo and
      Gonzalez-Agirre, Aitor and
      Rigau, German and
      Villegas, Marta",
    editor = "Rambow, Owen and
      Wanner, Leo and
      Apidianaki, Marianna and
      Al-Khalifa, Hend and
      Eugenio, Barbara Di and
      Schockaert, Steven",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.699/",
    pages = "10491--10519",
}
```
Groups:
- `catalan_bench`: All tasks included in CatalanBench.
- `flores_ca`: All FLORES translation tasks from or to Catalan.
Tags:
- `cabreu`: Three CaBREU tasks, one for each type of summary (extractive, abstractive, and extreme).
- `phrases_va`: Two Phrases_va tasks for language adaptation between Catalan and Valencian.
The following tasks evaluate datasets included in CatalanBench using various scoring methods:
- `arc_ca_challenge`
- `arc_ca_easy`
- `belebele_cat_Latn`
- `cabreu`
- `catalanqa`
- `catcola`
- `cocoteros_va`
- `copa_ca`
- `coqcat`
- `flores_ca`
- `flores_ca-de`
- `flores_ca-en`
- `flores_ca-es`
- `flores_ca-eu`
- `flores_ca-fr`
- `flores_ca-gl`
- `flores_ca-it`
- `flores_ca-pt`
- `flores_de-ca`
- `flores_en-ca`
- `flores_es-ca`
- `flores_eu-ca`
- `flores_fr-ca`
- `flores_gl-ca`
- `flores_it-ca`
- `flores_pt-ca`
- `mgsm_direct_ca`
- `openbookqa_ca`
- `parafraseja`
- `paws_ca`
- `phrases_ca`
- `piqa_ca`
- `siqa_ca`
- `teca`
- `truthfulqa_va`
- `veritasqa_gen_ca`
- `veritasqa_mc1_ca`
- `veritasqa_mc2_ca`
- `wnli_ca`
- `xnli_ca`
- `xnli_va`
- `xquad_ca`
- `xstorycloze_ca`
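The `flores_ca` group bundles every FLORES translation direction involving Catalan. As an illustrative sketch (the language-code list below is read off the task names above, not taken from any harness config file), the group's members can be enumerated programmatically:

```python
# Language codes paired with Catalan in the FLORES tasks listed above.
langs = ["de", "en", "es", "eu", "fr", "gl", "it", "pt"]

# Both directions: Catalan -> X ("flores_ca-xx") and X -> Catalan ("flores_xx-ca").
flores_ca_tasks = [f"flores_ca-{lang}" for lang in langs] + [
    f"flores_{lang}-ca" for lang in langs
]

print(len(flores_ca_tasks))  # 16 translation tasks in total
```

Selecting the `flores_ca` group at evaluation time runs all sixteen of these directions at once, rather than requiring each task to be named individually.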
Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:
- `belebele_cat_Latn`: Belebele (Catalan)
- Is the task an existing benchmark in the literature?
- Have you referenced the original paper that introduced the task?
- If yes, does the original paper provide a reference implementation?
- Yes, the original implementation was contributed by an author of the benchmark.
If other tasks on this dataset are already supported:
- Is the "Main" variant of this task clearly denoted?
- Have you provided a short sentence in a README on what each new variant adds / evaluates?
- Have you noted which, if any, published evaluation setups are matched by this variant?
version 2.0: (2025-Mar-18) add cocoteros_va task.
version 2.1: (2025-Jul-30) add xnli_va task.
version 2.2: (2025-Jul-30) add truthfulqa_va task.
version 2.3: (2026-Jan-16) exclude line breaks from stop criteria in mgsm_direct_ca