Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding [🔥 EMNLP-2025 (Main)]
Wafa Alghallabi*, Ritesh Thawkar*, Sara Ghaboura*, Ketan More*, Omkar Thawakar*, Hisham Cholakkal, Salman Khan, Rao M. Anwer

*Equal Contribution
Fann or Flop is the first comprehensive benchmark designed to evaluate large language models (LLMs) on their ability to understand Arabic poetry. It contains nearly 7,000 poem-explanation pairs covering 12 poetic eras, 21 genres, and multiple meters, providing a culturally rich and linguistically challenging testbed for Arabic NLP.
🔥🔥 [20 Aug 2025] 🔥🔥 Fann or Flop accepted to the EMNLP 2025 main track.
🔥 [26 May 2025] Fann or Flop, the first benchmark for assessing LLMs' ability to comprehend and analyze Arabic poetry, is released.
🤗 [19 Feb 2025] Fann or Flop dataset available on Hugging Face.
- Expert-Annotated Explanations: Verse-level commentary verified by native Arabic scholars.
- 12 Historical Eras: From Pre-Islamic and Umayyad to Modern poetry.
- Multi-Dimensional Evaluation: Faithfulness, fluency, metaphor, historical context, and rhetorical awareness.
- Structured Taxonomy: Each poem is tagged with `meter`, `genre`, and `era`.
- QA-Style Format: Ideal for generative and comprehension-based evaluation of LLMs.
Each JSON entry is structured as follows:
| Field | Type | Description |
|---|---|---|
| `id` | string | Unique poem identifier |
| `title` | string | Title of the poem |
| `author` | string | Name of the poet |
| `source` | string | URL to the poem source |
| `tags` | list[str] | List of meter, genre, and era |
| `meter` | string | Poetic meter (e.g., الكامل, الطويل) |
| `genre` | string | Genre label (e.g., مدح, رثاء) |
| `era` | string | Historical literary era (e.g., العصر العباسي) |
| `verse_count` | int | Number of verses |
| `poem_verses` | string | Full poem text, numbered and formatted |
| `explanation` | list[dict] | Verse-wise explanation with fields: `verse`, `explanation` |
| `raw_explanation` | string | Full explanation in paragraph format |
Sample entries are available in the samples/ folder.
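As an illustration of the schema above, here is a minimal sketch for loading the dataset and inspecting these fields (the `train` split name is an assumption; check the Hugging Face dataset card):

```python
from datasets import load_dataset

# Load the benchmark from the Hugging Face Hub
# (split name "train" is an assumption; see the dataset card).
ds = load_dataset("omkarthawakar/FannOrFlop", split="train")

# Inspect one record using the fields described in the table above.
sample = ds[0]
print(sample["id"], "-", sample["title"], "by", sample["author"])
print("Era:", sample["era"], "| Genre:", sample["genre"], "| Meter:", sample["meter"])
print("Verses:", sample["verse_count"])

# Verse-wise explanations are stored as a list of {"verse", "explanation"} dicts.
for item in sample["explanation"][:2]:
    print(item["verse"])
    print("->", item["explanation"])
```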
The dataset spans 12 major Arabic poetic eras:
| Era | Approx. Time Range | Example Poets |
|---|---|---|
| Pre-Islamic | ~6th Century | Imru' al-Qays, Antarah ibn Shaddad |
| Umayyad | 661โ750 CE | Jarir, Al-Farazdaq |
| Abbasid | 750โ1258 CE | Al-Mutanabbi, Abu Nuwas |
| Andalusian | 756โ1492 CE | Ibn Zaydun, Ibn Khafaja |
| Modern | 19th c. โ Present | Hafiz Ibrahim, Ahmad Shawqi |
| (+7 more eras...) | See paper for full list | - |
Each poem is assigned its literary context through expert-verified metadata.
Figure 2. Fann or Flop Pipeline. Fann or Flop is built through a multi-stage pipeline. It begins with scraping Arabic poems from a trusted online archive using a custom web scraper. Extracted poems are matched to an initial expert-verified taxonomy and filtered to remove duplicates, ambiguous metadata, and invalid entries. The filtered texts then undergo normalization (e.g., unifying diacritics, punctuation, and letter forms) and Arabic-specific tokenization, with non-poetic or irrelevant content excluded. Manual corrections are applied to fix OCR and encoding errors. In the final stage, linguistic experts verify each sample to ensure proper alignment with genre and era labels.
We provide an evaluation framework using:
- BLEU / chrF++ for lexical overlap
- BERTScore (Arabic transformer) for semantic similarity
- Textual Entailment using mDeBERTa (NLI)
- GPT-4o as LLM judge, used to evaluate:
- Faithfulness / Consistency
- Fluency / Grammaticality
- Interpretive Depth
- Rubric includes:
- Literal Meaning (0โ1)
- Thematic / Emotional Depth (0โ2)
- Cultural Context (0โ2)
- Literary Devices (0โ3)
- Expressiveness / Coherence (0โ2)
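The rubric sub-scores above sum to 10 points. The sketch below shows one hypothetical way such sub-scores could be aggregated into a single 0–10 total; the benchmark's judge scripts implement their own prompting and aggregation:

```python
# Hypothetical rubric aggregation sketch; the benchmark's own scripts may differ.
RUBRIC_MAX = {
    "literal_meaning": 1,
    "thematic_emotional_depth": 2,
    "cultural_context": 2,
    "literary_devices": 3,
    "expressiveness_coherence": 2,
}  # totals 10 points

def aggregate_rubric(scores: dict) -> float:
    """Clamp each sub-score to its rubric range and return the 0-10 total."""
    total = 0.0
    for criterion, max_points in RUBRIC_MAX.items():
        total += min(max(scores.get(criterion, 0.0), 0.0), max_points)
    return total

print(aggregate_rubric({
    "literal_meaning": 1,
    "thematic_emotional_depth": 1.5,
    "cultural_context": 2,
    "literary_devices": 2,
    "expressiveness_coherence": 1,
}))  # -> 7.5
```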
Load the dataset from the Hugging Face Hub:

```python
from datasets import load_dataset

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("omkarthawakar/FannOrFlop")
```

The `evaluation/` directory contains scripts to reproduce the benchmark results and evaluate your own models.
- Navigate to the evaluation directory:

  ```bash
  cd evaluation
  ```

- Dependencies: Ensure you have Python 3.x installed and install the necessary packages. It is recommended to use a virtual environment.

  ```bash
  pip install torch transformers evaluate scikit-learn numpy openai camel-tools tqdm
  ```

  (Note: `camel-tools` is crucial for Arabic text processing.)

- Ground Truth Data: The primary ground truth file is `FannOrFlop.json`. Most scripts expect this file to be present in the `evaluation/` directory or for its path to be configured within the script or via command-line arguments.

- Model Prediction Files: Your model's generated explanations should be in JSON format. Each file should contain a list of poem objects, and each poem object must include an `"id"` and a key containing a list of verse-explanation pairs (typically `"verse_explanations"`). A quick structural check is sketched after this list.

  Sample Model Prediction JSON (`your_model_explanations.json`):

  ```json
  [
    {
      "id": "poem_5123",
      "title": "<Arabic poem title>",   // Optional, but good for reference
      // Other metadata like genre, meter, author can be included
      "verse_explanations": [
        {
          "verse": "<Arabic text of verse 1>",
          "explanation": "Generated explanation for verse 1..."
        },
        {
          "verse": "<Arabic text of verse 2>",
          "explanation": "Generated explanation for verse 2..."
        }
        // ... more verses for this poem
      ]
    }
    // ... more poems
  ]
  ```
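Before running the metrics, it can help to sanity-check a prediction file against this structure. A minimal, illustrative sketch (not part of the repository's scripts):

```python
import json

def validate_predictions(path: str, pred_key: str = "verse_explanations") -> None:
    """Lightweight structural check of a model-prediction file (illustrative only)."""
    with open(path, encoding="utf-8") as f:
        poems = json.load(f)
    assert isinstance(poems, list), "top level must be a list of poem objects"
    for poem in poems:
        assert "id" in poem, "each poem object needs an 'id'"
        pairs = poem.get(pred_key, [])
        assert isinstance(pairs, list), f"'{pred_key}' must be a list of verse/explanation dicts"
        for pair in pairs:
            assert "verse" in pair and "explanation" in pair
    print(f"{path}: {len(poems)} poems look structurally valid.")

validate_predictions("your_model_explanations.json")
```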
All commands below assume you are in the evaluation/ directory.
1. BERTScore (bertscore.py)
- Purpose: Calculates BERTScore (Precision, Recall, F1) using AraBERT for semantic similarity.
- Configuration: Modify the `modeljsons` dictionary within `bertscore.py` to include your model's name and the path to its prediction JSON file. Ensure `gtjson` points to `FannOrFlop.json`.

  ```python
  # Example in bertscore.py
  modeljsons = {
      "YourModelName": "path/to/your_model_explanations.json",
  }
  gtjson = "FannOrFlop.json"  # Or correct path
  ```
- Usage:

  ```bash
  python bertscore.py
  ```
- Output: Prints macro-averaged Precision, Recall, and F1-score to the console.
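For a standalone illustration of this metric, here is a minimal BERTScore sketch using the `evaluate` library; the AraBERT checkpoint and `num_layers` below are assumptions, and `bertscore.py` configures its own model and aggregation:

```python
import evaluate

# Minimal BERTScore sketch; bertscore.py defines the benchmark's actual setup.
bertscore = evaluate.load("bertscore")

predictions = ["Generated explanation for verse 1..."]
references = ["Gold explanation for verse 1..."]

results = bertscore.compute(
    predictions=predictions,
    references=references,
    model_type="aubmindlab/bert-base-arabertv2",  # assumed Arabic encoder
    num_layers=12,
)
print(sum(results["f1"]) / len(results["f1"]))  # macro-averaged F1 over pairs
```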
2. BLEU (bleu.py)
- Purpose: Calculates BLEU, Coverage, and BLEU*Coverage for lexical overlap.
- Configuration: Inside `bleu.py`, update the `gtjson`, `predjson`, and `modelname` variables in the `if __name__ == "__main__":` block.

  ```python
  # Example in bleu.py
  gtjson = "FannOrFlop.json"
  predjson = "path/to/your_model_explanations.json"
  modelname = "YourModelName"
  ```
- Usage:

  ```bash
  python bleu.py
  ```
- Output: Prints macro-averaged BLEU, Coverage, and BLEU*Coverage to the console.
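As a rough illustration only: the sketch below computes corpus BLEU with sacrebleu and a hypothetical "coverage" term, here assumed to mean the fraction of gold verses that received a non-empty generated explanation; `bleu.py` defines the benchmark's actual BLEU and Coverage computation.

```python
from sacrebleu.metrics import BLEU

# Illustrative only; the repository's coverage definition may differ.
gold = ["Gold explanation for verse 1...", "Gold explanation for verse 2..."]
generated = ["Generated explanation for verse 1...", ""]

bleu = BLEU().corpus_score(generated, [gold]).score / 100.0
coverage = sum(1 for g in generated if g.strip()) / len(gold)  # assumed definition
print(f"BLEU={bleu:.4f}  Coverage={coverage:.2f}  BLEU*Coverage={bleu * coverage:.4f}")
```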
3. chrF Score (chrf_score.py)
- Purpose: Calculates chrF, Coverage, and chrF*Coverage (character n-gram metric).
- Configuration: Modify the `modeljsons` dictionary within `chrf_score.py` similarly to `bertscore.py`. Ensure `gtjson` points to `FannOrFlop.json`.
- Usage:

  ```bash
  python chrf_score.py
  ```
- Output: Prints macro-averaged chrF, Coverage, and chrF*Coverage to the console.
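For reference, a minimal chrF++ sketch with sacrebleu (`word_order=2` gives chrF++); `chrf_score.py` applies its own per-poem aggregation and Coverage weighting:

```python
from sacrebleu.metrics import CHRF

gold = ["Gold explanation for verse 1...", "Gold explanation for verse 2..."]
generated = ["Generated explanation for verse 1...", "Generated explanation for verse 2..."]

chrf = CHRF(word_order=2)  # character n-grams plus word bigrams (chrF++)
print(chrf.corpus_score(generated, [gold]).score / 100.0)
```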
4. LLM-as-Judge Evaluation (judge_eval.py)
- Purpose: Uses an LLM (e.g., GPT-4o) to evaluate Faithfulness, Fluency, and Overall scores.
- Prerequisites: Set your OpenAI API key as an environment variable:

  ```bash
  export OPENAI_API_KEY='your_api_key_here'
  ```

- Configuration: In `judge_eval.py`, modify the following variables at the top of the script:
  - `MODEL_NAME`: A name for your model.
  - `PREDICTIONS_FILE`: Path to your model's prediction JSON file.
  - `GROUND_TRUTH_FILE`: Path to `FannOrFlop.json` (default is `FannOrFlop.json`).
  - `LLM_JUDGE_MODEL`: The LLM to use for judging (e.g., "gpt-4o", "gpt-3.5-turbo").
- Usage:

  ```bash
  python judge_eval.py
  ```

- Output: Saves detailed scores to a JSON file in the `judge_results/` directory (e.g., `judge_results/YourModelName-results.json`) and prints progress.
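To illustrate the general LLM-as-judge call pattern, here is a minimal sketch; the actual prompt wording, rubric, and response parsing live in `judge_eval.py` and are not reproduced here:

```python
import json
import os
from openai import OpenAI

# Minimal LLM-as-judge sketch; judge_eval.py implements the benchmark's real prompt.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def judge_verse(verse: str, gold: str, generated: str, judge_model: str = "gpt-4o") -> dict:
    prompt = (
        "You are evaluating an explanation of an Arabic verse.\n"
        f"Verse: {verse}\nReference explanation: {gold}\nModel explanation: {generated}\n"
        'Return JSON: {"faithfulness": 1-5, "fluency": 1-5}.'
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```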
5. Average LLM Judge Scores (get_average_scores_for_llm_judge.py)
- Purpose: Calculates the average and standard deviation of the scores generated by `judge_eval.py`.
- Prerequisites: Run `judge_eval.py` first to generate result files in `judge_results/`.
- Usage:

  ```bash
  python get_average_scores_for_llm_judge.py
  ```

- Output: Prints average Faithfulness and Fluency scores (with SD) to the console for each model found in `judge_results/`.
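A sketch of the averaging step, assuming per-verse records with `faithfulness` and `fluency` fields; the actual result-file schema written by `judge_eval.py` may differ:

```python
import glob
import json
import statistics

# Illustrative averaging over judge result files; field names are assumptions.
for path in glob.glob("judge_results/*-results.json"):
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    faith = [r["faithfulness"] for r in records if "faithfulness" in r]
    fluency = [r["fluency"] for r in records if "fluency" in r]
    print(path)
    print(f"  Faithfulness: {statistics.mean(faith):.2f} (±{statistics.stdev(faith):.2f})")
    print(f"  Fluency:      {statistics.mean(fluency):.2f} (±{statistics.stdev(fluency):.2f})")
```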
6. Textual Entailment (text_entailment.py)
- Purpose: Calculates bidirectional textual entailment scores between ground truth and generated explanations.
- Configuration:
  - Edit `text_entailment.py` and update the `models_to_predictions` dictionary to map your model names to their prediction JSON file paths.

    ```python
    # Example in text_entailment.py
    models_to_predictions = {
        "YourModelName": "path/to/your_model_explanations.json",
        # Add other models if evaluating multiple
    }
    ```

  - The script uses command-line arguments for other configurations. Key arguments:
    - `--gt_file`: Path to the ground truth JSON (default: `FannOrFlop.json`).
    - `--gt_key`: Key in the ground truth for the explanations list (default: `explanation`).
    - `--pred_key`: Key in prediction files for the explanations list (default: `verse_explanations`).
    - `--base_output_dir`: Directory to save detailed results (default: `explanation_closeness_results`).
- Usage (example):

  ```bash
  python text_entailment.py --gt_file FannOrFlop.json --base_output_dir results/entailment_scores
  ```

- Output: Saves detailed JSON results per model in subdirectories of `base_output_dir`. Prints overall summary scores to the console.
| Model | BLEU | chrF(++) | BERTScore | Textual Entailment | Faithfulness / Consistency | Fluency / Grammaticality | Interpretive Depth |
|---|---|---|---|---|---|---|---|
| GPT-4o-2024-08-06 (OpenAI, 2024) | 0.0395 | 0.2882 | 0.6410 | 0.6775 | 3.92 (ยฑ 0.99) | 4.96 (ยฑ 0.20) | 7.52 |
| GPT-4o-mini-2024-07-18 (OpenAI, 2024) | 0.0395 | 0.2542 | 0.6124 | 0.4383 | 2.91 (ยฑ 0.75) | 4.28 (ยฑ 0.57) | 7.50 |
| Gemini-2.5-Flash (AI, 2025b) | 0.0153 | 0.2618 | 0.6319 | 0.7475 | 4.25 (ยฑ 1.00) | 4.98 (ยฑ 0.16) | 7.22 |
| Gemini-2.0-Flash (AI, 2025a) | 0.0395 | 0.2618 | 0.6393 | 0.7154 | 3.99 (ยฑ 1.04) | 4.95 (ยฑ 0.22) | 6.50 |
| Gemini-1.5-Pro (Reid et al., 2024) | 0.0395 | 0.2618 | 0.6333 | 0.6180 | 3.59 (ยฑ 1.00) | 4.80 (ยฑ 0.41) | 5.38 |
| Fanar-Star (Team et al., 2025) | 0.0138 | 0.1538 | 0.5677 | 0.6468 | 2.16 (ยฑ 0.92) | 3.40 (ยฑ 0.76) | 2.88 |
| Model | BLEU | chrF(++) | BERTScore | Textual Entailment | Faithfulness / Consistency | Fluency / Grammaticality | Interpretive Depth |
|---|---|---|---|---|---|---|---|
| Deepseek-V3 (Liu et al., 2024) | 0.0395 | 0.2771 | 0.6335 | 0.5117 | 3.36 (ยฑ 0.91) | 4.98 (ยฑ 0.16) | 4.75 |
| Deepseek-R1 (Guo et al., 2025) | 0.0395 | 0.2771 | 0.6335 | 0.5117 | 3.38 (ยฑ 0.92) | 4.98 (ยฑ 0.16) | 4.25 |
| Llama-3.3-70B (Meta AI, 2024) | 0.0153 | 0.2618 | 0.6393 | 0.5364 | 2.51 (ยฑ 0.90) | 3.37 (ยฑ 0.73) | 7.20 |
| Qwen-3 (Team, 2025) | 0.0296 | 0.2837 | 0.6158 | 0.6468 | 3.98 (ยฑ 0.90) | 4.73 (ยฑ 0.45) | 6.50 |
| Aya-Expanse (Dang et al., 2024) | 0.0329 | 0.2771 | 0.6328 | 0.6468 | 3.76 (ยฑ 0.90) | 4.68 (ยฑ 0.47) | 5.88 |
| Jais (Sengupta et al., 2023) | 0.0312 | 0.2698 | 0.6245 | 0.6023 | 3.21 (ยฑ 0.88) | 4.35 (ยฑ 0.52) | 5.35 |
| ALLaM-7B (Bari et al., 2024) | 0.0119 | 0.0463 | 0.5375 | 0.5997 | 1.32 (ยฑ 0.62) | 2.11 (ยฑ 0.89) | 3.12 |
| AceGPT-v2-70B-Chat (Huang et al., 2023) | 0.0402 | 0.0412 | 0.5759 | 0.6061 | 2.52 (ยฑ 0.91) | 3.46 (ยฑ 0.95) | 4.12 |
If you use the Fann or Flop benchmark in your research, please consider citing:
@misc{alghallabi2025fannflopmultigenremultiera,
title={Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs},
author={Wafa Alghallabi and Ritesh Thawkar and Sara Ghaboura and Ketan More and Omkar Thawakar and Hisham Cholakkal and Salman Khan and Rao Muhammad Anwer},
year={2025},
eprint={2505.18152},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.18152},
}
