Commit 8bc4aff

add arab_culture task (EleutherAI#3006)
* add arab_culture tasks * add target_delimeter and remove debugging code
1 parent 5a481f4 commit 8bc4aff


47 files changed: +1004 −1 lines changed

lm_eval/tasks/README.md

Lines changed: 2 additions & 1 deletion
@@ -16,7 +16,8 @@
 | [arabic_leaderboard_complete](arabic_leaderboard_complete/README.md) | A full version of the tasks in the Open Arabic LLM Leaderboard, focusing on the evaluation of models that reflect the characteristics of Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT) |
 | [arabic_leaderboard_light](arabic_leaderboard_light/README.md) | A light version of the tasks in the Open Arabic LLM Leaderboard (i.e., 10% samples of the test set in the original benchmarks), focusing on the evaluation of models that reflect the characteristics of Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated. | Arabic (Some MT) |
 | [arabicmmlu](arabicmmlu/README.md) | Localized Arabic version of MMLU with multiple-choice questions from 40 subjects. | Arabic |
-| [AraDICE](aradice/README.md) | A collection of multiple tasks carefully designed to evaluate dialectal and cultural capabilities in large language models (LLMs). | Arabic |
+| [ArabCulture](arab_culture/README.md) | Benchmark for evaluating models' commonsense cultural knowledge across 13 different Arab countries. | Arabic |
+| [AraDICE](aradice/README.md) | A collection of multiple tasks carefully designed to evaluate dialectal and cultural capabilities in large language models (LLMs). | Arabic |
 | [arc](arc/README.md) | Tasks involving complex reasoning over a diverse set of questions. | English |
 | [arithmetic](arithmetic/README.md) | Tasks involving numerical computations and arithmetic reasoning. | English |
 | [asdiv](asdiv/README.md) | Tasks involving arithmetic and mathematical reasoning challenges. | English |
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
# Arab Culture

### Paper

Title: Commonsense Reasoning in Arab Culture

Abstract: https://arxiv.org/abs/2502.12788

Despite progress in Arabic large language models, such as Jais and AceGPT, their evaluation on commonsense reasoning has largely relied on machine-translated datasets, which lack cultural depth and may introduce Anglocentric biases. Commonsense reasoning is shaped by geographical and cultural contexts, and existing English datasets fail to capture the diversity of the Arab world. To address this, we introduce ArabCulture, a commonsense reasoning dataset in Modern Standard Arabic (MSA), covering cultures of 13 countries across the Gulf, Levant, North Africa, and the Nile Valley. The dataset was built from scratch by engaging native speakers to write and validate culturally relevant questions for their respective countries. ArabCulture spans 12 daily life domains with 54 fine-grained subtopics, reflecting various aspects of social norms, traditions, and everyday experiences. Zero-shot evaluations show that open-weight language models with up to 32B parameters struggle to comprehend diverse Arab cultures, with performance varying across regions. These findings highlight the need for more culturally aware models and datasets tailored to the Arabic-speaking world.

Homepage: https://github.com/fajri91/ArabicCulture

### Citation

```
@misc{sadallah2025commonsensereasoningarabculture,
      title={Commonsense Reasoning in Arab Culture},
      author={Abdelrahman Sadallah and Junior Cedric Tonga and Khalid Almubarak and Saeed Almheiri and Farah Atif and Chatrine Qwaider and Karima Kadaoui and Sara Shatnawi and Yaser Alesh and Fajri Koto},
      year={2025},
      eprint={2502.12788},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.12788},
}
```

### There are two variants of this task: `arab_culture` and `arab_culture_completion`

- `arab_culture` is the standard MCQ evaluation: the answer choices are appended to the question, and the likelihood of the different choice markers (A, B, C or "أ", "ب", "ج") is measured. For more info, follow the MMLU-style [template](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mmlu/default/_default_template_yaml#L7-L8).
- `arab_culture_completion` does the evaluation in a sentence-completion manner: each answer is appended to the question separately, and the answer with the higher likelihood is chosen. See [this](https://github.com/EleutherAI/lm-evaluation-harness/blob/1f9bc88fe61f6bfa36f74e91ce3d59ab5685e4f1/lm_eval/tasks/arc/arc_easy.yaml#L10-L12) for more information.
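The difference between the two selection rules can be illustrated with toy log-probabilities (a hypothetical sketch, not the harness's actual API):

```python
# Toy illustration of how the two variants pick an answer from
# per-choice log-probabilities (hypothetical scores, not real API calls).

def pick_by_marker(marker_logprobs):
    # arab_culture: score only the choice markers ("A", "B", "C", ...)
    # appended after the question, and take the most likely marker.
    return max(marker_logprobs, key=marker_logprobs.get)

def pick_by_completion(completion_logprobs):
    # arab_culture_completion: score each full answer text as a
    # continuation of the question, and take the most likely continuation.
    return max(completion_logprobs, key=completion_logprobs.get)

print(pick_by_marker({"A": -2.1, "B": -0.4, "C": -3.0}))  # -> B
```

In both cases the higher (less negative) log-probability wins; the variants differ only in *what* string is scored.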

### Groups and Tasks

#### Groups

* `arab_culture`: evaluates all ArabCulture tasks.

* `arab_culture_gulf`: evaluates Gulf countries' ArabCulture tasks.
* `arab_culture_levant`: evaluates Levant countries' ArabCulture tasks.
* `arab_culture_nile_valley`: evaluates Nile Valley countries' ArabCulture tasks.
* `arab_culture_north_africa`: evaluates North Africa ArabCulture tasks.
### Evaluation modes

This benchmark supports different evaluation settings by adding extra context for the model.

We have three settings:
* without any information
```
COUNTRY=False
REGION=False
```
* with only region information
```
COUNTRY=False
REGION=True
```
* with region and country information
```
COUNTRY=True
REGION=True
```

**Please set these flags as environment variables.**

* We also allow prompting in English, which we found to achieve higher results on most of the evaluated models (please refer to our paper).

* To change the language of the prompt, define the `ARABIC` environment variable.
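As a minimal sketch of how such environment flags could be consumed, assuming a hypothetical helper (`context_prefix` is not part of the task code; only the `COUNTRY`/`REGION` variable names come from the settings above):

```python
import os

def context_prefix(country: str, region: str) -> str:
    # Hypothetical helper mirroring the three settings above: build an
    # optional context string from the COUNTRY/REGION environment flags.
    parts = []
    if os.getenv("REGION", "False") == "True":
        parts.append(f"Region: {region}")
    if os.getenv("COUNTRY", "False") == "True":
        parts.append(f"Country: {country}")
    return " ".join(parts)

os.environ["REGION"] = "True"
os.environ["COUNTRY"] = "False"
print(context_prefix("Egypt", "Nile Valley"))  # -> Region: Nile Valley
```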
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
aggregate_metric_list:
    metric: acc
    weight_by_size: true
group: arab_culture
metadata:
    description: Arab Culture tasks
    version: 0
task:
- arab_culture_gulf
- arab_culture_levant
- arab_culture_north_africa
- arab_culture_nile_valley
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
aggregate_metric_list:
    metric: acc
    weight_by_size: true
group: arab_culture_gulf
group_alias: Gulf
metadata:
    description: Arab Culture tasks
    version: 0
task:
- arab_culture_gulf_tasks
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
aggregate_metric_list:
    metric: acc
    weight_by_size: true
group: arab_culture_levant
group_alias: Levant
metadata:
    description: Arab Culture tasks
    version: 0
task:
- arab_culture_levant_tasks
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
aggregate_metric_list:
    metric: acc
    weight_by_size: true
group: arab_culture_nile_valley
group_alias: Nile Valley
metadata:
    description: Arab Culture tasks
    version: 0
task:
- arab_culture_nile_valley_tasks
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
aggregate_metric_list:
    metric: acc
    weight_by_size: true
group: arab_culture_north_africa
group_alias: North Africa
metadata:
    description: Arab Culture tasks
    version: 0
task:
- arab_culture_north_africa_tasks
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
dataset_path: MBZUAI/ArabCulture
test_split: test
fewshot_split: test
fewshot_config:
  sampler: first_n
output_type: multiple_choice
doc_to_text: !function utils_mcq.doc_to_text
doc_to_choice: !function utils_mcq.doc_to_choice
doc_to_target: !function utils_mcq.doc_to_target
target_delimiter: ""
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0
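The `utils_mcq` module itself is not part of this excerpt; the following is a plausible MMLU-style sketch of the three referenced helpers, in which the field names (`question`, `choices`, `answer_key`) and the marker set are assumptions, not the dataset's actual schema:

```python
# Hypothetical sketch of the utils_mcq helpers referenced above; the real
# module is not shown in this diff, and the document field names are
# assumptions rather than the dataset's actual schema.
MARKERS = ["A", "B", "C"]

def doc_to_text(doc):
    # MMLU-style prompt: question, lettered choices, then "Answer:".
    lines = [doc["question"]]
    lines += [f"{m}. {c}" for m, c in zip(MARKERS, doc["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

def doc_to_choice(doc):
    # Because target_delimiter is "", each scored string must carry its
    # own leading space after "Answer:".
    return [f" {m}" for m in MARKERS[: len(doc["choices"])]]

def doc_to_target(doc):
    # Index of the correct choice marker.
    return doc["answer_key"]

doc = {"question": "Example?", "choices": ["x", "y", "z"], "answer_key": 1}
print(doc_to_choice(doc))  # -> [' A', ' B', ' C']
```

Setting `target_delimiter: ""` and putting the space inside each choice string is one common way to control tokenization of the scored continuation.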
Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
"""
Take in a base YAML and generate the per-country and per-region task YAMLs from it.
"""

import argparse
import logging
import os

import yaml
from tqdm import tqdm


eval_logger = logging.getLogger("lm-eval")

countries = {
    "KSA": "Gulf",
    "UAE": "Gulf",
    "Yemen": "Gulf",
    "Lebanon": "Levant",
    "Syria": "Levant",
    "Palestine": "Levant",
    "Jordan": "Levant",
    "Tunisia": "North Africa",
    "Algeria": "North Africa",
    "Morocco": "North Africa",
    "Libya": "North Africa",
    "Egypt": "Nile Valley",
    "Sudan": "Nile Valley",
}

VERSION = 0


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--base_yaml_path", default="_default_arab_culture_mcq_template_yaml"
    )
    parser.add_argument("--save_prefix_path", default="arab_culture")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()

    # Get the filename of the base YAML so we can `"include": ` it in the
    # generated per-country YAMLs.
    base_yaml_name = os.path.split(args.base_yaml_path)[-1]

    # One task YAML per country, tagged with its region's task group.
    ALL_REGIONS = []
    for country, region in tqdm(countries.items()):
        if region not in ALL_REGIONS:
            ALL_REGIONS.append(region)

        yaml_dict = {
            "include": base_yaml_name,
            "tag": f"arab_culture_{region.lower().replace(' ', '_')}_tasks",
            "task": f"arab_culture_{country.lower().replace(' ', '_')}",
            "task_alias": country,
            "dataset_name": country,
        }

        file_save_path = (
            args.save_prefix_path
            + f"_{country.lower().replace(' ', '_').replace('(', '').replace(')', '')}.yaml"
        )
        eval_logger.info(f"Saving yaml for subset {country} to {file_save_path}")
        with open(file_save_path, "w", encoding="utf-8") as yaml_file:
            yaml.dump(
                yaml_dict,
                yaml_file,
                allow_unicode=True,
                default_style='"',
            )

    arab_culture_mcq_regions = [
        f"arab_culture_{region.lower().replace(' ', '_')}" for region in ALL_REGIONS
    ]

    # One group YAML per region, aggregating that region's country tasks.
    for region in ALL_REGIONS:
        file_save_path = (
            args.save_prefix_path + f"_{region.lower().replace(' ', '_')}.yaml"
        )
        eval_logger.info(f"Saving yaml for subset {region} to {file_save_path}")
        with open("_" + file_save_path, "w", encoding="utf-8") as yaml_file:
            yaml.dump(
                {
                    "group": f"arab_culture_{region.lower().replace(' ', '_')}",
                    "group_alias": region,
                    "task": [f"arab_culture_{region.lower().replace(' ', '_')}_tasks"],
                    "aggregate_metric_list": {"metric": "acc", "weight_by_size": True},
                    "metadata": {
                        "description": "Arab Culture tasks",
                        "version": VERSION,
                    },
                },
                yaml_file,
                indent=4,
                default_flow_style=False,
            )

    # Top-level group YAML aggregating all regions.
    file_save_path = args.save_prefix_path + ".yaml"
    eval_logger.info(f"Saving benchmark config to {file_save_path}")
    with open("_" + file_save_path, "w", encoding="utf-8") as yaml_file:
        yaml.dump(
            {
                "group": "arab_culture",
                "task": arab_culture_mcq_regions,
                "aggregate_metric_list": {"metric": "acc", "weight_by_size": True},
                "metadata": {"description": "Arab Culture tasks", "version": VERSION},
            },
            yaml_file,
            indent=4,
            default_flow_style=False,
        )
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
"dataset_name": "Algeria"
"include": "_default_arab_culture_mcq_template_yaml"
"tag": "arab_culture_north_africa_tasks"
"task": "arab_culture_algeria"
"task_alias": "Algeria"
