
Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding [🔥 EMNLP-2025 (Main)]

Wafa Alghallabi*   Ritesh Thawkar*   Sara Ghaboura*   Ketan More*   Omkar Thawakar*
Hisham Cholakkal   Salman Khan   Rao M. Anwer


*Equal Contribution


Fann or Flop is the first comprehensive benchmark designed to evaluate large language models (LLMs) on their ability to understand Arabic poetry. It contains nearly 7,000 poem-explanation pairs covering 12 poetic eras, 21 genres, and multiple meters, providing a culturally rich and linguistically challenging testbed for Arabic NLP.




Latest Updates

🔥🔥 [20 Aug 2025] 🔥🔥 Fann or Flop accepted to the EMNLP 2025 main track.
🔥 [26 May 2025] Fann or Flop, the first benchmark for assessing LLMs' ability to comprehend and analyze Arabic poetry, is released.
🤗 [19 Feb 2025] The Fann or Flop dataset is available on Hugging Face.



✨ Key Features

  • Expert-Annotated Explanations: Verse-level commentary verified by native Arabic scholars.
  • 12 Historical Eras: From Pre-Islamic and Umayyad to Modern poetry.
  • Multi-Dimensional Evaluation: Faithfulness, fluency, metaphor, historical context, and rhetorical awareness.
  • Structured Taxonomy: Each poem tagged with meter, genre, and era.
  • QA-Style Format: Ideal for generative and comprehension-based evaluation in LLMs.


Dataset Structure

Each JSON entry is structured as follows:

| Field | Type | Description |
|---|---|---|
| id | string | Unique poem identifier |
| title | string | Title of the poem |
| author | string | Name of the poet |
| source | string | URL to the poem source |
| tags | list[str] | List of meter, genre, and era |
| meter | string | Poetic meter (e.g., الكامل, الطويل) |
| genre | string | Genre label (e.g., مدح, رثاء) |
| era | string | Historical literary era (e.g., العصر العباسي) |
| verse_count | int | Number of verses |
| poem_verses | string | Full poem text, numbered and formatted |
| explanation | list[dict] | Verse-wise explanation with fields: verse, explanation |
| raw_explanation | string | Full explanation in paragraph format |

Sample entries are available in the samples/ folder.
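
A minimal sketch (Python, standard library only) of how one of these entries can be loaded and inspected; the file name below is a hypothetical placeholder for any entry in samples/, while the field names follow the table above.

    import json

    # Hypothetical sample path; use any JSON entry from samples/.
    with open("samples/example_poem.json", encoding="utf-8") as f:
        poem = json.load(f)

    print(poem["id"], "|", poem["title"], "|", poem["author"])
    print("meter:", poem["meter"], "| genre:", poem["genre"], "| era:", poem["era"])
    print("verses:", poem["verse_count"])

    # Verse-wise explanations: a list of {"verse": ..., "explanation": ...} dicts.
    for pair in poem["explanation"][:2]:
        print(pair["verse"])
        print("->", pair["explanation"])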



Taxonomy Overview

The dataset spans 12 major Arabic poetic eras:

| Era | Approx. Time Range | Example Poets |
|---|---|---|
| Pre-Islamic | ~6th Century | Imru' al-Qays, Antarah ibn Shaddad |
| Umayyad | 661–750 CE | Jarir, Al-Farazdaq |
| Abbasid | 750–1258 CE | Al-Mutanabbi, Abu Nuwas |
| Andalusian | 756–1492 CE | Ibn Zaydun, Ibn Khafaja |
| Modern | 19th c. – Present | Hafiz Ibrahim, Ahmad Shawqi |
| (+7 more eras) | See the paper for the full list | – |

Each poem is assigned its literary context through expert-verified metadata.



Fann Or Flop Pipeline


Figure 2. Fann or Flop Pipeline. Fann or Flop is built through a multi-stage pipeline. It begins with scraping Arabic poems from a trusted online archive using a custom web scraper. Extracted poems are matched to an initial expert-verified taxonomy and filtered to remove duplicates, ambiguous metadata, and invalid entries. The filtered texts then undergo normalization (e.g., unifying diacritics, punctuation, and letter forms) and Arabic-specific tokenization, with non-poetic or irrelevant content excluded. Manual corrections are applied to fix OCR and encoding errors. In the final stage, linguistic experts verify each sample to ensure proper alignment with genre and era labels.
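
To make the normalization/tokenization step concrete, here is a hedged sketch of the kind of Arabic preprocessing described above, using camel-tools (listed in the evaluation dependencies); the repository's exact normalization rules may differ.

    from camel_tools.utils.dediac import dediac_ar
    from camel_tools.utils.normalize import (
        normalize_alef_ar,           # unify alef variants (أ / إ / آ -> ا)
        normalize_alef_maksura_ar,   # ى -> ي
        normalize_teh_marbuta_ar,    # ة -> ه
    )
    from camel_tools.tokenizers.word import simple_word_tokenize

    def normalize_and_tokenize(verse: str) -> list:
        verse = dediac_ar(verse)                 # strip diacritics
        verse = normalize_alef_ar(verse)
        verse = normalize_alef_maksura_ar(verse)
        verse = normalize_teh_marbuta_ar(verse)
        return simple_word_tokenize(verse)       # simple word tokenization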


Evaluation Protocol

We provide an evaluation framework using:

🔹 Automatic Metrics

  • BLEU / chrF++ for lexical overlap
  • BERTScore (Arabic transformer) for semantic similarity
  • Textual Entailment using mDeBERTa (NLI)

🔹 LLM-as-Judge

  • GPT-4o used to evaluate:
    • Faithfulness / Consistency
    • Fluency / Grammaticality

🔹 Human Evaluation

  • Interpretive Depth
    • Rubric (component maxima sum to 10 points):
      • Literal Meaning (0โ€“1)
      • Thematic / Emotional Depth (0โ€“2)
      • Cultural Context (0โ€“2)
      • Literary Devices (0โ€“3)
      • Expressiveness / Coherence (0โ€“2)


Download

from datasets import load_dataset

# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("omkarthawakar/FannOrFlop")
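
Once loaded, the columns match the dataset structure described above. A quick way to peek at one entry without assuming a split name:

    split = next(iter(ds))          # first available split
    example = ds[split][0]
    print(example["title"], "|", example["era"], "|", example["meter"])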


Evaluation Suite

The evaluation/ directory contains scripts to reproduce the benchmark results and evaluate your own models.

General Setup

  1. Navigate to the evaluation directory:

    cd evaluation
  2. Dependencies: Ensure you have Python 3.x installed, then install the necessary packages (a virtual environment is recommended).

    pip install torch transformers evaluate scikit-learn numpy openai camel-tools tqdm

    (Note: camel-tools is crucial for Arabic text processing.)

  3. Ground Truth Data: The primary ground truth file is FannOrFlop.json. Most scripts expect this file to be present in the evaluation/ directory or for its path to be configured within the script or via command-line arguments.

  4. Model Prediction Files: Your model's generated explanations should be in JSON format. Each file should contain a list of poem objects, and each poem object must include an "id" and a key containing a list of verse-explanation pairs (typically "verse_explanations"). A quick format-check sketch is included after the sample below.

    Sample Model Prediction JSON (your_model_explanations.json):

    [
      {
        "id": "poem_5123",
        "title": "ุฎุงู†ูŽ ุนูŽู‡ุฏูŠ ู…ูุนุงูˆูุฏุงู‹ ุฎูŽูˆู†ูŽ ุนูŽู‡ุฏูŠ", // Optional, but good for reference
        // Other metadata like genre, meter, author can be included
        "verse_explanations": [
          {
            "verse": "ุฎู€ุงู†ูŽ ุนูŽู‡ู€ุฏูŠ ู…ูุนู€ุงูˆูุฏุงู‹ ุฎูŽู€ูˆู†ูŽ ุนูŽู‡ู€ุฏูŠ\nู…ูŽู€ู€ู† ู„ูŽู€ู€ู‡ู ุฎูู„ู‘ูŽู€ู€ุชูŠ ูˆูŽุฎู€ู€ุงู„ูุตู ูˆูุฏู‘ูŠ",
            "explanation": "Generated explanation for verse 1..."
          },
          {
            "verse": "ุจู€ุงู†ูŽ ุจูุงู„ุญูุณู€ู†ู ูˆูŽุญู€ุฏูŽู‡ู ู„ูŽู€ู… ูŠูู†ู€ุงุฒูุน\nู‡ู ุดู€ูŽุฑูŠูƒูŒ ูˆูŽุจูู†ู€ุชู ุจูู€ุงู„ุจูŽุซู‘ู ูˆูŽุญู€ุฏูŠ",
            "explanation": "Generated explanation for verse 2..."
          }
          // ... more verses for this poem
        ]
      }
      // ... more poems
    ]
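
Before running the metrics, it can help to sanity-check a prediction file against this format. The following is a minimal, hedged sketch (not part of the repository's scripts); it only verifies the structure described in step 4.

    import json

    def check_predictions(path, pred_key="verse_explanations"):
        """Lightweight structural check of a model prediction file."""
        with open(path, encoding="utf-8") as f:
            poems = json.load(f)
        assert isinstance(poems, list), "top level must be a list of poem objects"
        for poem in poems:
            assert "id" in poem, "every poem object needs an 'id'"
            for pair in poem.get(pred_key, []):
                assert "verse" in pair and "explanation" in pair, f"bad pair in poem {poem['id']}"
        print(f"{path}: {len(poems)} poems look structurally valid")

    check_predictions("your_model_explanations.json")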

Running Evaluation Scripts

All commands below assume you are in the evaluation/ directory.

1. BERTScore (bertscore.py)

  • Purpose: Calculates BERTScore (Precision, Recall, F1) using AraBERT for semantic similarity; a stand-alone toy example is shown after this section.
  • Configuration: Modify the modeljsons dictionary within bertscore.py to include your model's name and the path to its prediction JSON file. Ensure gtjson points to FannOrFlop.json.
    # Example in bertscore.py
    modeljsons = {
        "YourModelName": "path/to/your_model_explanations.json",
    }
    gtjson = "FannOrFlop.json" # Or correct path
  • Usage:
    python bertscore.py
  • Output: Prints macro-averaged Precision, Recall, and F1-score to the console.
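
For a quick sanity check outside the script, BERTScore can also be computed with the Hugging Face evaluate package. This toy example uses the default multilingual model selected by lang="ar", whereas bertscore.py uses AraBERT, so absolute numbers will differ.

    import evaluate

    bertscore = evaluate.load("bertscore")
    preds = ["شرح مولد للبيت الأول"]        # a generated explanation
    refs  = ["الشرح المرجعي للبيت الأول"]   # the ground-truth explanation
    result = bertscore.compute(predictions=preds, references=refs, lang="ar")
    print(result["precision"][0], result["recall"][0], result["f1"][0])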

2. BLEU (bleu.py)

  • Purpose: Calculates BLEU, Coverage, and BLEU*Coverage for lexical overlap; a stand-alone BLEU example is shown after this section.
  • Configuration: Inside bleu.py, update the gtjson, predjson, and modelname variables in the if __name__ == "__main__": block.
    # Example in bleu.py
    gtjson = "FannOrFlop.json"
    predjson = "path/to/your_model_explanations.json"
    modelname = "YourModelName"
  • Usage:
    python bleu.py
  • Output: Prints macro-averaged BLEU, Coverage, and BLEU*Coverage to the console.
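
As a stand-alone reference point (not the script's exact implementation, and without the repo-specific Coverage term), corpus BLEU for a hypothesis/reference pair can be computed with sacrebleu:

    import sacrebleu

    hyps = ["شرح مولد للبيت"]           # model explanation(s)
    refs = [["الشرح المرجعي للبيت"]]    # one reference stream
    bleu = sacrebleu.corpus_bleu(hyps, refs)
    print(bleu.score)                   # reported on a 0-100 scale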

3. chrF Score (chrf_score.py)

  • Purpose: Calculates chrF, Coverage, and chrF*Coverage (character n-gram metric); a stand-alone chrF++ example is shown after this section.
  • Configuration: Modify the modeljsons dictionary within chrf_score.py similarly to bertscore.py. Ensure gtjson points to FannOrFlop.json.
  • Usage:
    python chrf_score.py
  • Output: Prints macro-averaged chrF, Coverage, and chrF*Coverage to the console.
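
The corresponding stand-alone chrF++ computation with sacrebleu (again without the repo-specific Coverage term; word_order=2 enables the "++" variant):

    import sacrebleu

    hyps = ["شرح مولد للبيت"]
    refs = [["الشرح المرجعي للبيت"]]
    chrf = sacrebleu.corpus_chrf(hyps, refs, word_order=2)   # chrF++
    print(chrf.score)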

4. LLM-as-Judge Evaluation (judge_eval.py)

  • Purpose: Uses an LLM (e.g., GPT-4o) to evaluate Faithfulness, Fluency, and Overall scores; an illustrative judge call is sketched after this section.
  • Prerequisites: Set your OpenAI API key as an environment variable:
    export OPENAI_API_KEY='your_api_key_here'
  • Configuration: In judge_eval.py, modify the following variables at the top of the script:
    • MODEL_NAME: A name for your model.
    • PREDICTIONS_FILE: Path to your model's prediction JSON file.
    • GROUND_TRUTH_FILE: Path to FannOrFlop.json (default is FannOrFlop.json).
    • LLM_JUDGE_MODEL: The LLM to use for judging (e.g., "gpt-4o", "gpt-3.5-turbo").
  • Usage:
    python judge_eval.py
  • Output: Saves detailed scores to a JSON file in the judge_results/ directory (e.g., judge_results/YourModelName-results.json) and prints progress.
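
For illustration only, a judge call of this kind with the OpenAI Python SDK might look like the sketch below; judge_eval.py's actual prompt, scale, and parsing may differ (the 1-5 scale here is an assumption inferred from the reported scores).

    import os
    from openai import OpenAI

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    prompt = (
        "You are grading an explanation of an Arabic poem verse.\n"
        "Reference explanation:\n{ref}\n\nModel explanation:\n{pred}\n\n"
        "Rate Faithfulness/Consistency and Fluency/Grammaticality from 1 to 5 "
        "and answer as JSON: {{\"faithfulness\": x, \"fluency\": y}}."
    )

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt.format(ref="...", pred="...")}],
        temperature=0,
    )
    print(resp.choices[0].message.content)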

5. Average LLM Judge Scores (get_average_scores_for_llm_judge.py)

  • Purpose: Calculates average and standard deviation for scores generated by judge_eval.py.
  • Prerequisites: Run judge_eval.py first to generate result files in judge_results/.
  • Usage:
    python get_average_scores_for_llm_judge.py
  • Output: Prints average Faithfulness and Fluency scores (with SD) to the console for each model found in judge_results/.

6. Textual Entailment (text_entailment.py)

  • Purpose: Calculates bidirectional textual entailment scores between ground truth and generated explanations; an illustrative entailment sketch follows this section.
  • Configuration:
    1. Edit text_entailment.py and update the models_to_predictions dictionary to map your model names to their prediction JSON file paths.
      # Example in text_entailment.py
      models_to_predictions = {
          "YourModelName": "path/to/your_model_explanations.json",
          # Add other models if evaluating multiple
      }
    2. The script uses command-line arguments for other configurations. Key arguments:
      • --gt_file: Path to the ground truth JSON (default: FannOrFlop.json).
      • --gt_key: Key in ground truth for explanations list (default: explanation).
      • --pred_key: Key in prediction files for explanations list (default: verse_explanations).
      • --base_output_dir: Directory to save detailed results (default: explanation_closeness_results).
  • Usage (example):
    python text_entailment.py --gt_file FannOrFlop.json --base_output_dir results/entailment_scores
  • Output: Saves detailed JSON results per model in subdirectories of base_output_dir. Prints overall summary scores to the console.
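
A hedged sketch of bidirectional entailment scoring with a public mDeBERTa NLI checkpoint (the model name below is an assumption; text_entailment.py may use a different checkpoint and aggregation):

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "MoritzLaurer/mDeBERTa-v3-base-mnli-xnli"   # assumed public NLI model
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)
    ent_id = model.config.label2id["entailment"]       # read the label index from the config

    def entailment_prob(premise, hypothesis):
        inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(-1)[0]
        return probs[ent_id].item()

    gt_expl, model_expl = "الشرح المرجعي ...", "الشرح المولد ..."
    score = 0.5 * (entailment_prob(gt_expl, model_expl) + entailment_prob(model_expl, gt_expl))
    print(score)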

Leaderboard (Sample Results)

Closed-Source Models

| Model | BLEU | chrF(++) | BERTScore | Textual Entailment | Faithfulness / Consistency | Fluency / Grammaticality | Interpretive Depth |
|---|---|---|---|---|---|---|---|
| GPT-4o-2024-08-06 (OpenAI, 2024) | 0.0395 | 0.2882 | 0.6410 | 0.6775 | 3.92 (± 0.99) | 4.96 (± 0.20) | 7.52 |
| GPT-4o-mini-2024-07-18 (OpenAI, 2024) | 0.0395 | 0.2542 | 0.6124 | 0.4383 | 2.91 (± 0.75) | 4.28 (± 0.57) | 7.50 |
| Gemini-2.5-Flash (AI, 2025b) | 0.0153 | 0.2618 | 0.6319 | 0.7475 | 4.25 (± 1.00) | 4.98 (± 0.16) | 7.22 |
| Gemini-2.0-Flash (AI, 2025a) | 0.0395 | 0.2618 | 0.6393 | 0.7154 | 3.99 (± 1.04) | 4.95 (± 0.22) | 6.50 |
| Gemini-1.5-Pro (Reid et al., 2024) | 0.0395 | 0.2618 | 0.6333 | 0.6180 | 3.59 (± 1.00) | 4.80 (± 0.41) | 5.38 |
| Fanar-Star (Team et al., 2025) | 0.0138 | 0.1538 | 0.5677 | 0.6468 | 2.16 (± 0.92) | 3.40 (± 0.76) | 2.88 |

Open-Source Models

| Model | BLEU | chrF(++) | BERTScore | Textual Entailment | Faithfulness / Consistency | Fluency / Grammaticality | Interpretive Depth |
|---|---|---|---|---|---|---|---|
| Deepseek-V3 (Liu et al., 2024) | 0.0395 | 0.2771 | 0.6335 | 0.5117 | 3.36 (± 0.91) | 4.98 (± 0.16) | 4.75 |
| Deepseek-R1 (Guo et al., 2025) | 0.0395 | 0.2771 | 0.6335 | 0.5117 | 3.38 (± 0.92) | 4.98 (± 0.16) | 4.25 |
| Llama-3.3-70B (Meta AI, 2024) | 0.0153 | 0.2618 | 0.6393 | 0.5364 | 2.51 (± 0.90) | 3.37 (± 0.73) | 7.20 |
| Qwen-3 (Team, 2025) | 0.0296 | 0.2837 | 0.6158 | 0.6468 | 3.98 (± 0.90) | 4.73 (± 0.45) | 6.50 |
| Aya-Expanse (Dang et al., 2024) | 0.0329 | 0.2771 | 0.6328 | 0.6468 | 3.76 (± 0.90) | 4.68 (± 0.47) | 5.88 |
| Jais (Sengupta et al., 2023) | 0.0312 | 0.2698 | 0.6245 | 0.6023 | 3.21 (± 0.88) | 4.35 (± 0.52) | 5.35 |
| ALLaM-7B (Bari et al., 2024) | 0.0119 | 0.0463 | 0.5375 | 0.5997 | 1.32 (± 0.62) | 2.11 (± 0.89) | 3.12 |
| AceGPT-v2-70B-Chat (Huang et al., 2023) | 0.0402 | 0.0412 | 0.5759 | 0.6061 | 2.52 (± 0.91) | 3.46 (± 0.95) | 4.12 |


💬 Citation

If you use the Fann or Flop dataset in your research, please consider citing:

@misc{alghallabi2025fannflopmultigenremultiera,
      title={Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs}, 
      author={Wafa Alghallabi and Ritesh Thawkar and Sara Ghaboura and Ketan More and Omkar Thawakar and Hisham Cholakkal and Salman Khan and Rao Muhammad Anwer},
      year={2025},
      eprint={2505.18152},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.18152}, 
}
