CatalanBench is a benchmark for evaluating language models on Catalan tasks. That is, it evaluates the ability of a language model to understand and generate Catalan text. CatalanBench offers a combination of pre-existing, open datasets and datasets developed exclusively for this benchmark. All the details of CatalanBench will be published in a paper soon.
The new evaluation datasets included in CatalanBench are:
| Task | Category | Homepage |
|---|---|---|
| ARC_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/arc_ca |
| MGSM_ca | Math | https://huggingface.co/datasets/projecte-aina/mgsm_ca |
| OpenBookQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/openbookqa_ca |
| Parafraseja | Paraphrasing | https://huggingface.co/datasets/projecte-aina/Parafraseja |
| PIQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/piqa_ca |
| SIQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/siqa_ca |
| XStoryCloze_ca | Commonsense Reasoning | https://huggingface.co/datasets/projecte-aina/xstorycloze_ca |
The datasets included in CatalanBench that have been made public in previous publications are:
A dedicated paper for CatalanBench is forthcoming. In the meantime, the benchmark is described in the IberoBench paper:
```
@inproceedings{baucells-etal-2025-iberobench,
    title = "{I}bero{B}ench: A Benchmark for {LLM} Evaluation in {I}berian Languages",
    author = "Baucells, Irene and
      Aula-Blasco, Javier and
      de-Dios-Flores, Iria and
      Paniagua Su{\'a}rez, Silvia and
      Perez, Naiara and
      Salles, Anna and
      Sotelo Docio, Susana and
      Falc{\~a}o, J{\'u}lia and
      Saiz, Jose Javier and
      Sepulveda Torres, Robiert and
      Barnes, Jeremy and
      Gamallo, Pablo and
      Gonzalez-Agirre, Aitor and
      Rigau, German and
      Villegas, Marta",
    editor = "Rambow, Owen and
      Wanner, Leo and
      Apidianaki, Marianna and
      Al-Khalifa, Hend and
      Eugenio, Barbara Di and
      Schockaert, Steven",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-main.699/",
    pages = "10491--10519",
}
```
Groups:
- `catalan_bench`: All tasks included in CatalanBench.
- `flores_ca`: All FLORES translation tasks from or to Catalan.
Tags:
- `cabreu`: Three CaBREU tasks, one for each type of summary (extractive, abstractive, and extreme).
- `phrases_va`: Two Phrases_va tasks for language adaptation between Catalan and Valencian.
The following tasks evaluate datasets included in CatalanBench using various scoring methods:
- `arc_ca_challenge`
- `arc_ca_easy`
- `belebele_cat_Latn`
- `cabreu`
- `catalanqa`
- `catcola`
- `cocoteros_va`
- `copa_ca`
- `coqcat`
- `flores_ca`
- `flores_ca-de`
- `flores_ca-en`
- `flores_ca-es`
- `flores_ca-eu`
- `flores_ca-fr`
- `flores_ca-gl`
- `flores_ca-it`
- `flores_ca-pt`
- `flores_de-ca`
- `flores_en-ca`
- `flores_es-ca`
- `flores_eu-ca`
- `flores_fr-ca`
- `flores_gl-ca`
- `flores_it-ca`
- `flores_pt-ca`
- `mgsm_direct_ca`
- `openbookqa_ca`
- `parafraseja`
- `paws_ca`
- `phrases_ca`
- `piqa_ca`
- `siqa_ca`
- `teca`
- `truthfulqa_va`
- `veritasqa_gen_ca`
- `veritasqa_mc1_ca`
- `veritasqa_mc2_ca`
- `wnli_ca`
- `xnli_ca`
- `xnli_va`
- `xquad_ca`
- `xstorycloze_ca`
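The `flores_ca` group bundles every FLORES translation direction involving Catalan. As an illustrative sketch (the language-code list below is read off the task names above, not taken from any harness config file), the group's members can be enumerated programmatically:

```python
# Language codes paired with Catalan in the FLORES tasks listed above.
langs = ["de", "en", "es", "eu", "fr", "gl", "it", "pt"]

# Both directions: Catalan -> X ("flores_ca-xx") and X -> Catalan ("flores_xx-ca").
flores_ca_tasks = [f"flores_ca-{lang}" for lang in langs] + [
    f"flores_{lang}-ca" for lang in langs
]

print(len(flores_ca_tasks))  # 16 translation tasks in total
```

Selecting the `flores_ca` group at evaluation time runs all sixteen of these directions at once, rather than requiring each task to be named individually.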
Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:
- `belebele_cat_Latn`: Belebele (Catalan)
- Is the task an existing benchmark in the literature?
- Have you referenced the original paper that introduced the task?
- If yes, does the original paper provide a reference implementation?
- Yes, the original implementation was contributed by an author of the benchmark.
If other tasks on this dataset are already supported:
- Is the "Main" variant of this task clearly denoted?
- Have you provided a short sentence in a README on what each new variant adds / evaluates?
- Have you noted which, if any, published evaluation setups are matched by this variant?
version 2.0: (2025-Mar-18) add cocoteros_va task.
version 2.1: (2025-Jul-30) add xnli_va task.
version 2.2: (2025-Jul-30) add truthfulqa_va task.
version 2.3: (2026-Jan-16) exclude line breaks from stop criteria in mgsm_direct_ca