Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding [🔥 EMNLP-2025 (Main)]
Wafa Alghallabi*, Ritesh Thawkar*, Sara Ghaboura*, Ketan More*, Omkar Thawakar*, Hisham Cholakkal, Salman Khan, Rao M. Anwer

*Equal Contribution
Fann or Flop is the first comprehensive benchmark designed to evaluate large language models (LLMs) on their ability to understand Arabic poetry. It contains nearly 7,000 poem-explanation pairs covering 12 poetic eras, 21 genres, and multiple meters, providing a culturally rich and linguistically challenging testbed for Arabic NLP.
🔥🔥 [20 Aug 2025] 🔥🔥 Fann or Flop accepted to the EMNLP 2025 main track.
🔥 [26 May 2025] Fann or Flop, the first benchmark for assessing LLMs' ability to comprehend and analyze Arabic poetry, is released.
🤗 [19 Feb 2025] Fann or Flop dataset available on Hugging Face.
- Expert-Annotated Explanations: Verse-level commentary verified by native Arabic scholars.
- 12 Historical Eras: From Pre-Islamic and Umayyad to Modern poetry.
- Multi-Dimensional Evaluation: Faithfulness, fluency, metaphor, historical context, and rhetorical awareness.
- Structured Taxonomy: Each poem is tagged with `meter`, `genre`, and `era`.
- QA-Style Format: Ideal for generative and comprehension-based evaluation of LLMs.
Each JSON entry is structured as follows:
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique poem identifier |
| `title` | string | Title of the poem |
| `author` | string | Name of the poet |
| `source` | string | URL to the poem source |
| `tags` | list[str] | List of meter, genre, and era |
| `meter` | string | Poetic meter (e.g., الكامل, الطويل) |
| `genre` | string | Genre label (e.g., مدح, رثاء) |
| `era` | string | Historical literary era (e.g., العصر العباسي) |
| `verse_count` | int | Number of verses |
| `poem_verses` | string | Full poem text, numbered and formatted |
| `explanation` | list[dict] | Verse-wise explanation with fields: `verse`, `explanation` |
| `raw_explanation` | string | Full explanation in paragraph format |
Sample entries are available in the samples/ folder.
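As an illustration of the schema above, here is a minimal sketch for loading the dataset and inspecting these fields (the `train` split name is an assumption; check the Hugging Face dataset card):

```python
from datasets import load_dataset

# Load the benchmark from the Hugging Face Hub
# (split name "train" is an assumption; see the dataset card).
ds = load_dataset("omkarthawakar/FannOrFlop", split="train")

# Inspect one record using the fields described in the table above.
sample = ds[0]
print(sample["id"], "-", sample["title"], "by", sample["author"])
print("Era:", sample["era"], "| Genre:", sample["genre"], "| Meter:", sample["meter"])
print("Verses:", sample["verse_count"])

# Verse-wise explanations are stored as a list of {"verse", "explanation"} dicts.
for item in sample["explanation"][:2]:
    print(item["verse"])
    print("->", item["explanation"])
```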
The dataset spans 12 major Arabic poetic eras:
| Era | Approx. Time Range | Example Poets |
|---|---|---|
| Pre-Islamic | ~6th Century | Imru' al-Qays, Antarah ibn Shaddad |
| Umayyad | 661โ750 CE | Jarir, Al-Farazdaq |
| Abbasid | 750โ1258 CE | Al-Mutanabbi, Abu Nuwas |
| Andalusian | 756โ1492 CE | Ibn Zaydun, Ibn Khafaja |
| Modern | 19th c. โ Present | Hafiz Ibrahim, Ahmad Shawqi |
| (+7 more eras...) | See paper for full list | - |
Each poem is assigned its literary context through expert-verified metadata.
Figure 2. Fann or Flop Pipeline. Fann or Flop is built through a multi-stage pipeline. It begins with scraping Arabic poems from a trusted online archive using a custom web scraper. Extracted poems are matched to an initial expert-verified taxonomy and filtered to remove duplicates, ambiguous metadata, and invalid entries. The filtered texts then undergo normalization (e.g., unifying diacritics, punctuation, and letter forms) and Arabic-specific tokenization, with non-poetic or irrelevant content excluded. Manual corrections are applied to fix OCR and encoding errors. In the final stage, linguistic experts verify each sample to ensure proper alignment with genre and era labels.
We provide an evaluation framework using:
- BLEU / chrF++ for lexical overlap
- BERTScore (Arabic transformer) for semantic similarity
- Textual Entailment using mDeBERTa (NLI)
- GPT-4o as LLM judge, used to evaluate:
- Faithfulness / Consistency
- Fluency / Grammaticality
- Interpretive Depth
- Rubric includes:
- Literal Meaning (0โ1)
- Thematic / Emotional Depth (0โ2)
- Cultural Context (0โ2)
- Literary Devices (0โ3)
- Expressiveness / Coherence (0โ2)
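The rubric sub-scores above sum to 10 points. The sketch below shows one hypothetical way such sub-scores could be aggregated into a single 0–10 total; the benchmark's judge scripts implement their own prompting and aggregation:

```python
# Hypothetical rubric aggregation sketch; the benchmark's own scripts may differ.
RUBRIC_MAX = {
    "literal_meaning": 1,
    "thematic_emotional_depth": 2,
    "cultural_context": 2,
    "literary_devices": 3,
    "expressiveness_coherence": 2,
}  # totals 10 points

def aggregate_rubric(scores: dict) -> float:
    """Clamp each sub-score to its rubric range and return the 0-10 total."""
    total = 0.0
    for criterion, max_points in RUBRIC_MAX.items():
        total += min(max(scores.get(criterion, 0.0), 0.0), max_points)
    return total

print(aggregate_rubric({
    "literal_meaning": 1,
    "thematic_emotional_depth": 1.5,
    "cultural_context": 2,
    "literary_devices": 2,
    "expressiveness_coherence": 1,
}))  # -> 7.5
```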
Load the dataset from the Hugging Face Hub:

```python
from datasets import load_dataset

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("omkarthawakar/FannOrFlop")
```

The `evaluation/` directory contains scripts to reproduce the benchmark results and evaluate your own models.
- Navigate to the evaluation directory:

  ```bash
  cd evaluation
  ```

- Dependencies: Ensure you have Python 3.x installed and install the necessary packages. It is recommended to use a virtual environment.

  ```bash
  pip install torch transformers evaluate scikit-learn numpy openai camel-tools tqdm
  ```

  (Note: `camel-tools` is crucial for Arabic text processing.)

- Ground Truth Data: The primary ground truth file is `FannOrFlop.json`. Most scripts expect this file to be present in the `evaluation/` directory or for its path to be configured within the script or via command-line arguments.

- Model Prediction Files: Your model's generated explanations should be in JSON format. Each file should contain a list of poem objects, and each poem object must include an `"id"` and a key containing a list of verse-explanation pairs (typically `"verse_explanations"`). A quick structural check is sketched after this list.

  Sample Model Prediction JSON (`your_model_explanations.json`):

  ```json
  [
    {
      "id": "poem_5123",
      "title": "<Arabic poem title>",   // Optional, but good for reference
      // Other metadata like genre, meter, author can be included
      "verse_explanations": [
        {
          "verse": "<Arabic text of verse 1>",
          "explanation": "Generated explanation for verse 1..."
        },
        {
          "verse": "<Arabic text of verse 2>",
          "explanation": "Generated explanation for verse 2..."
        }
        // ... more verses for this poem
      ]
    }
    // ... more poems
  ]
  ```
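Before running the metrics, it can help to sanity-check a prediction file against this structure. A minimal, illustrative sketch (not part of the repository's scripts):

```python
import json

def validate_predictions(path: str, pred_key: str = "verse_explanations") -> None:
    """Lightweight structural check of a model-prediction file (illustrative only)."""
    with open(path, encoding="utf-8") as f:
        poems = json.load(f)
    assert isinstance(poems, list), "top level must be a list of poem objects"
    for poem in poems:
        assert "id" in poem, "each poem object needs an 'id'"
        pairs = poem.get(pred_key, [])
        assert isinstance(pairs, list), f"'{pred_key}' must be a list of verse/explanation dicts"
        for pair in pairs:
            assert "verse" in pair and "explanation" in pair
    print(f"{path}: {len(poems)} poems look structurally valid.")

validate_predictions("your_model_explanations.json")
```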
All commands below assume you are in the evaluation/ directory.
1. BERTScore (bertscore.py)
- Purpose: Calculates BERTScore (Precision, Recall, F1) using AraBERT for semantic similarity.
- Configuration: Modify the `modeljsons` dictionary within `bertscore.py` to include your model's name and the path to its prediction JSON file. Ensure `gtjson` points to `FannOrFlop.json`.

  ```python
  # Example in bertscore.py
  modeljsons = {
      "YourModelName": "path/to/your_model_explanations.json",
  }
  gtjson = "FannOrFlop.json"  # Or correct path
  ```
- Usage:

  ```bash
  python bertscore.py
  ```
- Output: Prints macro-averaged Precision, Recall, and F1-score to the console.
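For a standalone illustration of this metric, here is a minimal BERTScore sketch using the `evaluate` library; the AraBERT checkpoint and `num_layers` below are assumptions, and `bertscore.py` configures its own model and aggregation:

```python
import evaluate

# Minimal BERTScore sketch; bertscore.py defines the benchmark's actual setup.
bertscore = evaluate.load("bertscore")

predictions = ["Generated explanation for verse 1..."]
references = ["Gold explanation for verse 1..."]

results = bertscore.compute(
    predictions=predictions,
    references=references,
    model_type="aubmindlab/bert-base-arabertv2",  # assumed Arabic encoder
    num_layers=12,
)
print(sum(results["f1"]) / len(results["f1"]))  # macro-averaged F1 over pairs
```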
2. BLEU (bleu.py)
- Purpose: Calculates BLEU, Coverage, and BLEU*Coverage for lexical overlap.
- Configuration: Inside `bleu.py`, update the `gtjson`, `predjson`, and `modelname` variables in the `if __name__ == "__main__":` block.

  ```python
  # Example in bleu.py
  gtjson = "FannOrFlop.json"
  predjson = "path/to/your_model_explanations.json"
  modelname = "YourModelName"
  ```
- Usage:

  ```bash
  python bleu.py
  ```
- Output: Prints macro-averaged BLEU, Coverage, and BLEU*Coverage to the console.
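As a rough illustration only: the sketch below computes corpus BLEU with sacrebleu and a hypothetical "coverage" term, here assumed to mean the fraction of gold verses that received a non-empty generated explanation; `bleu.py` defines the benchmark's actual BLEU and Coverage computation.

```python
from sacrebleu.metrics import BLEU

# Illustrative only; the repository's coverage definition may differ.
gold = ["Gold explanation for verse 1...", "Gold explanation for verse 2..."]
generated = ["Generated explanation for verse 1...", ""]

bleu = BLEU().corpus_score(generated, [gold]).score / 100.0
coverage = sum(1 for g in generated if g.strip()) / len(gold)  # assumed definition
print(f"BLEU={bleu:.4f}  Coverage={coverage:.2f}  BLEU*Coverage={bleu * coverage:.4f}")
```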
3. chrF Score (chrf_score.py)
- Purpose: Calculates chrF, Coverage, and chrF*Coverage (character n-gram metric).
- Configuration: Modify the `modeljsons` dictionary within `chrf_score.py` similarly to `bertscore.py`. Ensure `gtjson` points to `FannOrFlop.json`.
- Usage:

  ```bash
  python chrf_score.py
  ```
- Output: Prints macro-averaged chrF, Coverage, and chrF*Coverage to the console.
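For reference, a minimal chrF++ sketch with sacrebleu (`word_order=2` gives chrF++); `chrf_score.py` applies its own per-poem aggregation and Coverage weighting:

```python
from sacrebleu.metrics import CHRF

gold = ["Gold explanation for verse 1...", "Gold explanation for verse 2..."]
generated = ["Generated explanation for verse 1...", "Generated explanation for verse 2..."]

chrf = CHRF(word_order=2)  # character n-grams plus word bigrams (chrF++)
print(chrf.corpus_score(generated, [gold]).score / 100.0)
```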
4. LLM-as-Judge Evaluation (judge_eval.py)
- Purpose: Uses an LLM (e.g., GPT-4o) to evaluate Faithfulness, Fluency, and Overall scores.
- Prerequisites: Set your OpenAI API key as an environment variable:

  ```bash
  export OPENAI_API_KEY='your_api_key_here'
  ```

- Configuration: In `judge_eval.py`, modify the following variables at the top of the script:
  - `MODEL_NAME`: A name for your model.
  - `PREDICTIONS_FILE`: Path to your model's prediction JSON file.
  - `GROUND_TRUTH_FILE`: Path to `FannOrFlop.json` (default is `FannOrFlop.json`).
  - `LLM_JUDGE_MODEL`: The LLM to use for judging (e.g., "gpt-4o", "gpt-3.5-turbo").
- Usage:

  ```bash
  python judge_eval.py
  ```

- Output: Saves detailed scores to a JSON file in the `judge_results/` directory (e.g., `judge_results/YourModelName-results.json`) and prints progress.
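To illustrate the general LLM-as-judge call pattern, here is a minimal sketch; the actual prompt wording, rubric, and response parsing live in `judge_eval.py` and are not reproduced here:

```python
import json
import os
from openai import OpenAI

# Minimal LLM-as-judge sketch; judge_eval.py implements the benchmark's real prompt.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def judge_verse(verse: str, gold: str, generated: str, judge_model: str = "gpt-4o") -> dict:
    prompt = (
        "You are evaluating an explanation of an Arabic verse.\n"
        f"Verse: {verse}\nReference explanation: {gold}\nModel explanation: {generated}\n"
        'Return JSON: {"faithfulness": 1-5, "fluency": 1-5}.'
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```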
5. Average LLM Judge Scores (get_average_scores_for_llm_judge.py)
- Purpose: Calculates the average and standard deviation of the scores generated by `judge_eval.py`.
- Prerequisites: Run `judge_eval.py` first to generate result files in `judge_results/`.
- Usage:

  ```bash
  python get_average_scores_for_llm_judge.py
  ```

- Output: Prints average Faithfulness and Fluency scores (with SD) to the console for each model found in `judge_results/`.
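A sketch of the averaging step, assuming per-verse records with `faithfulness` and `fluency` fields; the actual result-file schema written by `judge_eval.py` may differ:

```python
import glob
import json
import statistics

# Illustrative averaging over judge result files; field names are assumptions.
for path in glob.glob("judge_results/*-results.json"):
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    faith = [r["faithfulness"] for r in records if "faithfulness" in r]
    fluency = [r["fluency"] for r in records if "fluency" in r]
    print(path)
    print(f"  Faithfulness: {statistics.mean(faith):.2f} (±{statistics.stdev(faith):.2f})")
    print(f"  Fluency:      {statistics.mean(fluency):.2f} (±{statistics.stdev(fluency):.2f})")
```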
6. Textual Entailment (text_entailment.py)
- Purpose: Calculates bidirectional textual entailment scores between ground truth and generated explanations.
- Configuration:
  - Edit `text_entailment.py` and update the `models_to_predictions` dictionary to map your model names to their prediction JSON file paths.

    ```python
    # Example in text_entailment.py
    models_to_predictions = {
        "YourModelName": "path/to/your_model_explanations.json",
        # Add other models if evaluating multiple
    }
    ```

  - The script uses command-line arguments for other configurations. Key arguments:
    - `--gt_file`: Path to the ground truth JSON (default: `FannOrFlop.json`).
    - `--gt_key`: Key in the ground truth for the explanations list (default: `explanation`).
    - `--pred_key`: Key in prediction files for the explanations list (default: `verse_explanations`).
    - `--base_output_dir`: Directory to save detailed results (default: `explanation_closeness_results`).
- Usage (example):

  ```bash
  python text_entailment.py --gt_file FannOrFlop.json --base_output_dir results/entailment_scores
  ```

- Output: Saves detailed JSON results per model in subdirectories of `base_output_dir`. Prints overall summary scores to the console.
| Model | BLEU | chrF(++) | BERTScore | Textual Entailment | Faithfulness / Consistency | Fluency / Grammaticality | Interpretive Depth |
|---|---|---|---|---|---|---|---|
| GPT-4o-2024-08-06 (OpenAI, 2024) | 0.0395 | 0.2882 | 0.6410 | 0.6775 | 3.92 (ยฑ 0.99) | 4.96 (ยฑ 0.20) | 7.52 |
| GPT-4o-mini-2024-07-18 (OpenAI, 2024) | 0.0395 | 0.2542 | 0.6124 | 0.4383 | 2.91 (ยฑ 0.75) | 4.28 (ยฑ 0.57) | 7.50 |
| Gemini-2.5-Flash (AI, 2025b) | 0.0153 | 0.2618 | 0.6319 | 0.7475 | 4.25 (ยฑ 1.00) | 4.98 (ยฑ 0.16) | 7.22 |
| Gemini-2.0-Flash (AI, 2025a) | 0.0395 | 0.2618 | 0.6393 | 0.7154 | 3.99 (ยฑ 1.04) | 4.95 (ยฑ 0.22) | 6.50 |
| Gemini-1.5-Pro (Reid et al., 2024) | 0.0395 | 0.2618 | 0.6333 | 0.6180 | 3.59 (ยฑ 1.00) | 4.80 (ยฑ 0.41) | 5.38 |
| Fanar-Star (Team et al., 2025) | 0.0138 | 0.1538 | 0.5677 | 0.6468 | 2.16 (ยฑ 0.92) | 3.40 (ยฑ 0.76) | 2.88 |
| Model | BLEU | chrF(++) | BERTScore | Textual Entailment | Faithfulness / Consistency | Fluency / Grammaticality | Interpretive Depth |
|---|---|---|---|---|---|---|---|
| Deepseek-V3 (Liu et al., 2024) | 0.0395 | 0.2771 | 0.6335 | 0.5117 | 3.36 (ยฑ 0.91) | 4.98 (ยฑ 0.16) | 4.75 |
| Deepseek-R1 (Guo et al., 2025) | 0.0395 | 0.2771 | 0.6335 | 0.5117 | 3.38 (ยฑ 0.92) | 4.98 (ยฑ 0.16) | 4.25 |
| Llama-3.3-70B (Meta AI, 2024) | 0.0153 | 0.2618 | 0.6393 | 0.5364 | 2.51 (ยฑ 0.90) | 3.37 (ยฑ 0.73) | 7.20 |
| Qwen-3 (Team, 2025) | 0.0296 | 0.2837 | 0.6158 | 0.6468 | 3.98 (ยฑ 0.90) | 4.73 (ยฑ 0.45) | 6.50 |
| Aya-Expanse (Dang et al., 2024) | 0.0329 | 0.2771 | 0.6328 | 0.6468 | 3.76 (ยฑ 0.90) | 4.68 (ยฑ 0.47) | 5.88 |
| Jais (Sengupta et al., 2023) | 0.0312 | 0.2698 | 0.6245 | 0.6023 | 3.21 (ยฑ 0.88) | 4.35 (ยฑ 0.52) | 5.35 |
| ALLaM-7B (Bari et al., 2024) | 0.0119 | 0.0463 | 0.5375 | 0.5997 | 1.32 (ยฑ 0.62) | 2.11 (ยฑ 0.89) | 3.12 |
| AceGPT-v2-70B-Chat (Huang et al., 2023) | 0.0402 | 0.0412 | 0.5759 | 0.6061 | 2.52 (ยฑ 0.91) | 3.46 (ยฑ 0.95) | 4.12 |
If you use the Fann or Flop benchmark in your research, please consider citing:
@misc{alghallabi2025fannflopmultigenremultiera,
title={Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs},
author={Wafa Alghallabi and Ritesh Thawkar and Sara Ghaboura and Ketan More and Omkar Thawakar and Hisham Cholakkal and Salman Khan and Rao Muhammad Anwer},
year={2025},
eprint={2505.18152},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.18152},
}
