[BUG] Incorrect Ground Truth annotations in arocrbench_khattparagraph lead to broken evaluation metrics #5

@soltkreig

Description: While evaluating models on the KITAB-Bench suite, we noticed abnormally poor performance (CER > 400%, WER > 400%, ED ~300) specifically on the khattparagraph dataset (ahmedheakl/arocrbench_khattparagraph), whereas results on the standard khatt dataset were completely normal.

Upon investigation, we discovered a critical flaw in the ground truth (GT) annotations of the ahmedheakl/arocrbench_khattparagraph dataset hosted on HuggingFace:

  1. The images in this dataset are full paragraphs at high resolution (e.g., 2400x1200 pixels).
  2. However, the answer column (Ground Truth) for these images is severely truncated. It only contains a single line of text (roughly 50-80 characters), rather than the full paragraph.
  3. Consequently, any competent VLM (e.g., Qwen2.5-VL) that correctly follows the prompt to extract all text from the image will predict the entire paragraph (~300-400 characters). The evaluation script then compares this full paragraph against the single-line GT, resulting in massive penalties and broken metrics.
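The arithmetic behind the inflated scores can be illustrated with a small sketch. It assumes the usual definition of CER (character-level edit distance divided by the reference length); the exact implementation in the KITAB-Bench evaluation script is not shown in this issue, and the strings below are synthetic placeholders rather than real dataset samples.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete, and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(prediction: str, reference: str) -> float:
    """Character error rate, normalized by the reference length (assumed definition)."""
    return levenshtein(prediction, reference) / len(reference)


# Synthetic stand-ins: a 70-character "single line" ground truth and a
# 350-character prediction that reproduces that line plus the rest of the paragraph.
truncated_gt = "a" * 70
full_prediction = truncated_gt + "b" * 280

print(cer(full_prediction, truncated_gt))  # 280 extra characters / 70 = 4.0, i.e. 400%
```

Even a prediction that reproduces the truncated line perfectly is charged one edit for every additional character, so the error rate scales with the ratio of paragraph length to ground-truth length rather than with model quality.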

To Reproduce:

Load any sample from datasets.load_dataset("ahmedheakl/arocrbench_khattparagraph", split="train") and compare the image with its answer string.
The image contains a full paragraph, but the answer string is only a single line of text.
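A rough sketch of that check follows. The answer column name comes from this report; the image column name, and the assumption that it decodes to a PIL-style object with a size attribute, are guesses that may need adjusting. The helper names are hypothetical.

```python
def summarize_sample(sample, image_key="image", answer_key="answer"):
    """Return image dimensions and ground-truth length for one sample.

    image_key is an assumed column name; answer_key is the column named in this issue.
    """
    width, height = sample[image_key].size
    answer = sample[answer_key]
    return {
        "width": width,
        "height": height,
        "answer_chars": len(answer),
        "answer_words": len(answer.split()),
    }


def inspect_dataset(limit=5):
    """Print a summary of the first few samples (requires the datasets package and network access)."""
    from datasets import load_dataset  # imported lazily so the helper above stays dependency-free

    ds = load_dataset("ahmedheakl/arocrbench_khattparagraph", split="train")
    for index in range(min(limit, len(ds))):
        print(index, summarize_sample(ds[index]))
```

Calling inspect_dataset() should show paragraph-sized image dimensions next to answer lengths of only a few dozen characters if the problem described here is present.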

Root Cause Verification: To confirm this, we programmatically matched (via pixel-level MD5 hashing) all 200 images from the broken arocrbench subset with the original a-alnaggar/khatt-paragraphs dataset. We found 100% exact image matches. In the original a-alnaggar dataset, the GT text correctly contains the full paragraph (e.g., ~376 characters), proving that the text was lost or incorrectly mapped during the creation of the arocrbench_khattparagraph subset.
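The matching step can be sketched as below. This is not the exact script we ran, only a minimal illustration of pixel-level MD5 matching. It assumes images expose PIL-style mode, size, and tobytes(); the column names image and text for the original dataset are hypothetical, and all helper names are illustrative.

```python
import hashlib


def pixel_md5(image) -> str:
    """MD5 over decoded pixel data, so differences in file encoding do not matter.

    Mode and size are mixed in so that images with identical raw bytes but a
    different shape do not collide.
    """
    digest = hashlib.md5()
    digest.update(f"{image.mode}:{image.size[0]}x{image.size[1]}".encode("ascii"))
    digest.update(image.tobytes())
    return digest.hexdigest()


def build_text_index(original_samples, image_key="image", text_key="text"):
    """Map pixel hash -> full paragraph text from the original dataset (column names assumed)."""
    return {pixel_md5(s[image_key]): s[text_key] for s in original_samples}


def match_ground_truth(bench_samples, text_index, image_key="image", answer_key="answer"):
    """Pair each benchmark sample with the full text of the identical original image.

    Returns (matches, unmatched), where matches holds
    (position, current_answer, full_text) tuples and unmatched holds positions
    whose image hash was not found in the index.
    """
    matches, unmatched = [], []
    for position, sample in enumerate(bench_samples):
        full_text = text_index.get(pixel_md5(sample[image_key]))
        if full_text is None:
            unmatched.append(position)
        else:
            matches.append((position, sample[answer_key], full_text))
    return matches, unmatched
```

With an empty unmatched list, the full_text values in matches are the candidates for replacing the truncated answer column.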

Suggested Fix:

  • Option 1 (Preferred): Update the ahmedheakl/arocrbench_khattparagraph HuggingFace repository by replacing the broken answer column with the correct full-paragraph texts mapped from the original a-alnaggar/khatt-paragraphs dataset.
  • Option 2: Temporarily remove khattparagraph from the default ds_ids list in eval.py to prevent it from skewing the overall KITAB-Bench benchmark scores until the upstream dataset is fixed.
