[BUG] Incorrect Ground Truth annotations in arocrbench_khattparagraph lead to broken evaluation metrics

**Description**: While evaluating models on the KITAB-Bench suite, we noticed abnormally poor performance (CER > 400%, WER > 400%, edit distance (ED) ~300) specifically on the `khattparagraph` dataset (`ahmedheakl/arocrbench_khattparagraph`), whereas results on the standard `khatt` dataset were completely normal.

Upon investigation, we discovered a critical flaw in the ground truth (GT) annotations of the `ahmedheakl/arocrbench_khattparagraph` dataset hosted on HuggingFace:

1. The images in this dataset are full paragraphs at high resolution (e.g., 2400x1200 pixels).
2. However, the `answer` column (Ground Truth) for these images is severely truncated. It only contains a single line of text (roughly 50-80 characters), rather than the full paragraph.
3. Consequently, any competent VLM (e.g., Qwen2.5-VL) that correctly follows the prompt to extract all text from the image will predict the entire paragraph (~300-400 characters). The evaluation script then compares this full paragraph against the single-line GT, resulting in massive penalties and broken metrics.
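To illustrate why a truncated GT pushes the error rate far above 100%, here is a minimal sketch. It assumes CER is computed as character-level Levenshtein distance divided by the GT length; the actual metric implementation in the evaluation script may differ. The strings are synthetic stand-ins sized to match the lengths reported above, not real dataset content.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete a character from a
                curr[j - 1] + 1,            # insert a character from b
                prev[j - 1] + (ca != cb),   # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, prediction: str) -> float:
    """Edit distance normalised by the reference (GT) length."""
    return levenshtein(reference, prediction) / len(reference)


# Synthetic stand-ins (assumption): a 376-character paragraph and a GT
# that keeps only its first 70 characters.
full_paragraph = "x" * 376
truncated_gt = full_paragraph[:70]

distance = levenshtein(truncated_gt, full_paragraph)
print(f"ED = {distance}, CER = {cer(truncated_gt, full_paragraph):.0%}")
```

Even a perfect transcription of the paragraph is charged one insertion for every character missing from the GT, so under this definition the error scales with the ratio of paragraph length to GT length rather than with model quality.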

**To Reproduce:**

Check any sample from `datasets.load_dataset("ahmedheakl/arocrbench_khattparagraph", split="train")`.
Notice that the image contains a full paragraph, but the `answer` string is only a few words long.
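The check can be scripted roughly as follows. This is a sketch that assumes the image column is named `image` and decodes to a PIL image (only the `answer` column name is confirmed above); `describe_sample` and `main` are hypothetical helper names.

```python
def describe_sample(sample: dict) -> dict:
    """Summarise one row: image dimensions and the length of the GT string."""
    image = sample["image"]    # assumed column name
    answer = sample["answer"]  # GT column named in this report
    return {
        "image_size": image.size,  # (width, height) for a PIL image
        "answer_chars": len(answer),
        "answer_words": len(answer.split()),
    }


def main(num_samples: int = 5) -> None:
    # Imported lazily so describe_sample stays usable without the library.
    from datasets import load_dataset

    ds = load_dataset("ahmedheakl/arocrbench_khattparagraph", split="train")
    for i in range(min(num_samples, len(ds))):
        print(i, describe_sample(ds[i]))
```

Calling `main()` downloads the dataset and prints a summary for the first few rows; a large image size next to a short `answer` is the mismatch described above.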

**Root Cause Verification:** To confirm this, we programmatically matched (via pixel-level MD5 hashing) all 200 images from the broken `arocrbench_khattparagraph` subset with the original `a-alnaggar/khatt-paragraphs` dataset. We found 100% exact image matches. In the original `a-alnaggar/khatt-paragraphs` dataset, the GT text correctly contains the full paragraph (e.g., ~376 characters), which shows that the text was lost or incorrectly mapped during the creation of the `arocrbench_khattparagraph` subset.
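The matching can be sketched along these lines. This is an illustration rather than the exact script we ran: the `image` column name and the `text` key for the original dataset are assumptions, and `pixel_md5` and `match_ground_truth` are hypothetical helper names. Images are expected to be PIL images.

```python
import hashlib


def pixel_md5(image) -> str:
    """MD5 over decoded pixels, so file encoding and metadata are ignored."""
    rgb = image.convert("RGB")
    # Include the dimensions so differently shaped images with the same
    # flattened bytes cannot collide.
    header = f"{rgb.size[0]}x{rgb.size[1]}:".encode()
    return hashlib.md5(header + rgb.tobytes()).hexdigest()


def match_ground_truth(subset_rows, original_rows, original_text_key="text"):
    """Pair each subset row with the original GT of the pixel-identical image.

    Returns (subset_index, subset_answer, original_text) tuples for every
    subset image whose hash is found in the original dataset.
    """
    index = {
        pixel_md5(row["image"]): row[original_text_key]
        for row in original_rows
    }
    matches = []
    for i, row in enumerate(subset_rows):
        full_text = index.get(pixel_md5(row["image"]))
        if full_text is not None:
            matches.append((i, row["answer"], full_text))
    return matches
```

Comparing `len(subset_answer)` with `len(original_text)` for each returned tuple exposes the truncation, and the same tuples give the mapping needed to rebuild the `answer` column.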

**Suggested Fix:**

- Option 1 (Preferred): Update the `ahmedheakl/arocrbench_khattparagraph` HuggingFace repository by replacing the broken `answer` column with the correct full-paragraph texts mapped from the original `a-alnaggar/khatt-paragraphs` dataset.
- Option 2: Temporarily remove `khattparagraph` from the default `ds_ids` list in `eval.py` to prevent it from skewing the overall KITAB-Bench benchmark scores until the upstream dataset is fixed.
