The BAREC Shared Task 2025 will take place at The Third Arabic Natural Language Processing Conference (ArabicNLP 2025) at EMNLP 2025.
Click here to register for the shared task!
The BAREC Shared Task 2025 focuses on fine-grained readability classification across 19 levels using the Balanced Arabic Readability Evaluation Corpus (BAREC), a dataset of over 1 million words. Participants will build models for both sentence- and document-level classification.
- The BAREC Corpus: The BAREC Corpus (Elmadani et al., 2025) consists of 1,922 documents and 69,441 sentences classified into 19 readability levels.
- The SAMER Corpus: The SAMER Corpus (Alhafni et al., 2024) consists of 4,289 documents and 20,358 fragments classified into three readability levels.
- The SAMER Lexicon: The SAMER Lexicon (Al Khalil et al., 2020) is a 40K-lemma leveled readability lexicon, consisting of lemma and part-of-speech pairs annotated with five readability levels.
Participants can compete in one or more of the following tracks, each imposing different resource constraints:
- Strict Track: Models must be trained exclusively on the BAREC Corpus.
  - Sentence-level Readability Assessment: CodaBench Link
  - Document-level Readability Assessment: CodaBench Link
- Constrained Track: Models may use the BAREC Corpus, SAMER Corpus (including document, fragment, and word-level annotations), and the SAMER Lexicon.
  - Sentence-level Readability Assessment: CodaBench Link
  - Document-level Readability Assessment: CodaBench Link
- Open Track: No restrictions on external resources, allowing the use of any publicly available data.
  - Sentence-level Readability Assessment: CodaBench Link
  - Document-level Readability Assessment: CodaBench Link
With two sub-tasks and three tracks, the task results in a total of six possible combinations. Participants are allowed to compete in multiple sub-tasks and tracks.
We define the Readability Assessment task as an ordinal classification task. The following metrics are used for evaluation:
- Accuracy (Acc19): The percentage of cases where reference and prediction classes match in the 19-level scheme.
- Accuracy (Acc7, Acc5, Acc3): The percentage of cases where reference and prediction classes match after collapsing the 19 levels into 7, 5, or 3 levels, respectively.
- Adjacent Accuracy (±1 Acc19): Also known as off-by-1 accuracy. The proportion of predictions that are either exactly correct or off by at most one level in the 19-level scheme.
- Average Distance (Dist): Also known as Mean Absolute Error (MAE). Measures the average absolute difference between predicted and true labels.
- Quadratic Weighted Kappa (QWK): An extension of Cohen’s Kappa that measures the agreement between predicted and true labels, applying a quadratic penalty to larger misclassifications (i.e., predictions farther from the true label are penalized more heavily).
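For intuition, here is a minimal sketch (not the official evaluation script) of how Acc19, adjacent accuracy, Dist/MAE, and QWK can be computed from gold and predicted labels, assuming integer levels from 1 to 19 and scikit-learn's quadratic-weighted kappa. The 19-to-7/5/3 level collapsing used for Acc7, Acc5, and Acc3 is defined by the shared task and is not reproduced here.

```python
# Minimal metric sketch (not the official eval script); assumes integer labels 1..19.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def readability_metrics(gold, pred):
    gold, pred = np.asarray(gold), np.asarray(pred)
    diff = np.abs(gold - pred)
    return {
        "Acc19": np.mean(diff == 0),       # exact match on the 19-level scheme
        "+/-1 Acc19": np.mean(diff <= 1),  # adjacent (off-by-one) accuracy
        "Dist (MAE)": np.mean(diff),       # average absolute distance
        # QWK penalizes predictions quadratically by their distance from the gold label.
        "QWK": cohen_kappa_score(gold, pred, weights="quadratic"),
    }

print(readability_metrics(gold=[3, 7, 12, 19, 5], pred=[3, 8, 12, 15, 5]))
```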
We provide instructions on how to run the evaluation script below.
You will need conda installed. To set up the environment, run:
git clone https://github.com/CAMeL-Lab/barec-shared-task-2025.git
cd barec-shared-task-2025
conda create -n barec python=3.9
conda activate barec
pip install -r requirements.txt
To evaluate your predictions, use the provided evaluation script. The script requires three arguments:
- `--output`: Path to your output CSV file containing predictions.
- `--split`: The data split to evaluate on (`Dev` or `Test`).
- `--task`: The task type (`Sent` for sentence-level or `Doc` for document-level readability).
To evaluate your system's output, run:
python scripts/eval.py --output /path/to/output_csv --split [Dev|Test] --task [Sent|Doc]
Example usage:
python scripts/eval.py --output examples/Dev_Sentence_Level.csv --split Dev --task Sent
Your output CSV file should have the following columns:
- For sentence-level tasks (`--task Sent`):
  - `Sentence ID`: The unique identifier for each sentence.
  - `Prediction`: Your predicted readability level for each sentence (integer from 1 to 19).
- For document-level tasks (`--task Doc`):
  - `Document ID`: The unique identifier for each document.
  - `Prediction`: Your predicted readability level for each document (integer from 1 to 19).
Example (Sentence-level):
| Sentence ID | Prediction |
|---|---|
| 1001 | 7 |
| 1002 | 12 |
| ... | ... |
Example (Document-level):
| Document ID | Prediction |
|---|---|
| 2001 | 5 |
| 2002 | 14 |
| ... | ... |
Make sure the IDs in your output file match exactly those in the provided split (Dev or Test) for the chosen task.
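As an illustration, the following sketch writes a sentence-level prediction file in the expected format using pandas. The IDs and predictions here are dummy placeholders; in practice the IDs must come from the official Dev or Test split.

```python
# Illustrative sketch: write a sentence-level prediction CSV in the expected format.
import pandas as pd

sentence_ids = [1001, 1002, 1003]  # placeholder IDs; use the IDs from the official split
predictions = [7, 12, 4]           # your model's predicted levels (integers 1..19)

df = pd.DataFrame({"Sentence ID": sentence_ids, "Prediction": predictions})
df.to_csv("Dev_Sentence_Level.csv", index=False)
```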
After running the evaluation script, you will see output similar to the following in your terminal:
Evaluating Sentence-level readability on Dev split using examples/Dev_Sentence_Level.csv
Accuracy: 56.6211%
Accuracy +/-1: 69.8632%
Average absolute distance: 1.143776
Quadratic Cohen's Kappa: 80.0040%
Accuracy (7 levels): 65.8687%
Accuracy (5 levels): 70.2736%
Accuracy (3 levels): 76.4569%
Evaluation completed successfully.
Each metric reflects the performance of your predictions on the selected split and task.
This repo is available under the MIT license. See the LICENSE for more info.
- A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment. Khalid N. Elmadani, Nizar Habash, and Hanada Taha-Thomure. 2025. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria.
- Guidelines for Fine-grained Sentence-level Arabic Readability Annotation. Nizar Habash, Hanada Taha-Thomure, Khalid N. Elmadani, Zeina Zeino, and Abdallah Abushmaes. 2025. In Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX), Vienna, Austria.
- The SAMER Arabic Text Simplification Corpus. Bashar Alhafni, Reem Hazim, Juan David Pineros Liberato, Muhamed Al Khalil, and Nizar Habash. 2024. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italia.
- A Large-Scale Leveled Readability Lexicon for Standard Arabic. Muhamed Al Khalil, Nizar Habash, and Zhengyang Jiang. 2020. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France.