Description: While evaluating models on the KITAB-Bench suite, we noticed abnormally poor performance (CER > 400%, WER > 400%, ED ~300) specifically on the khattparagraph dataset (ahmedheakl/arocrbench_khattparagraph), whereas results on the standard khatt dataset were completely normal.
Upon investigation, we discovered a critical flaw in the ground truth (GT) annotations of the ahmedheakl/arocrbench_khattparagraph dataset hosted on HuggingFace:
- The images in this dataset are full paragraphs at high resolution (e.g., 2400x1200 pixels).
- However, the answer column (Ground Truth) for these images is severely truncated. It contains only a single line of text (roughly 50-80 characters) rather than the full paragraph.
- Consequently, any competent VLM (e.g., Qwen2.5-VL) that correctly follows the prompt to extract all text from the image will predict the entire paragraph (~300-400 characters). The evaluation script then compares this full paragraph against the single-line GT, resulting in massive penalties and broken metrics.
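To illustrate how a correct prediction can score above 400%, here is a minimal sketch that assumes CER is computed as character-level edit distance divided by the reference length (the usual definition; the benchmark's actual metric code is not reproduced here). The two strings are synthetic placeholders sized to match the lengths reported above, not real samples.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(prediction: str, reference: str) -> float:
    """Assumed definition: edit distance normalised by reference length."""
    return levenshtein(prediction, reference) / len(reference)


# Synthetic placeholders: a 70-character "line" as the truncated GT and a
# 376-character "paragraph" prediction that begins with that same line.
truncated_gt = "a" * 70
full_prediction = "a" * 70 + "b" * 306

print(levenshtein(full_prediction, truncated_gt))   # 306
print(f"{cer(full_prediction, truncated_gt):.0%}")  # 437%
```

Even a prediction whose first line matches the GT exactly is charged one edit for every extra character, which is consistent with the ED ~300 and CER > 400% figures observed.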
To Reproduce:
Check any sample from datasets.load_dataset("ahmedheakl/arocrbench_khattparagraph", split="train").
Notice that the image contains a full paragraph, but the answer string is only a few words long.
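The check above can also be done in bulk. The following is a rough sketch that summarises the length of every answer string; the helper names are ours, and it assumes the `datasets` package is installed and the Hub is reachable.

```python
def answer_length_stats(answers):
    """Summarise the character lengths of an iterable of GT strings."""
    lengths = [len(text) for text in answers]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }


def check_hub_dataset():
    """Download the subset and print answer-length statistics (needs network)."""
    from datasets import load_dataset  # imported lazily; third-party package

    ds = load_dataset("ahmedheakl/arocrbench_khattparagraph", split="train")
    # Reading only the text column avoids decoding the images.
    print(answer_length_stats(ds["answer"]))
```

Calling `check_hub_dataset()` should show a maximum length far below what a full paragraph would need if the truncation described above is present.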
Root Cause Verification: To confirm this, we programmatically matched (via pixel-level MD5 hashing) all 200 images from the broken arocrbench subset with the original a-alnaggar/khatt-paragraphs dataset. We found 100% exact image matches. In the original a-alnaggar dataset, the GT text correctly contains the full paragraph (e.g., ~376 characters), proving that the text was lost or incorrectly mapped during the creation of the arocrbench_khattparagraph subset.
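The matching step can be sketched roughly as follows. This is an illustration rather than the exact script we ran: it assumes each row exposes a PIL-style image (with `.mode`, `.size` and `.tobytes()`), and the column names `image` and `text` are placeholders that may differ in the actual datasets.

```python
import hashlib


def pixel_md5(image) -> str:
    """MD5 over decoded pixel data, so file-level re-encoding does not matter.

    `image` is expected to behave like a PIL.Image (mode, size, tobytes()).
    Mode and size are hashed too, so equal bytes with a different shape differ.
    """
    header = f"{image.mode}:{image.size[0]}x{image.size[1]}:".encode()
    return hashlib.md5(header + image.tobytes()).hexdigest()


def match_by_pixels(broken_rows, original_rows,
                    image_column="image", text_column="text"):
    """Map each broken-row index to the original text with identical pixels.

    Column names are assumptions; rows with no pixel-identical counterpart
    are simply absent from the returned dict.
    """
    text_by_hash = {
        pixel_md5(row[image_column]): row[text_column] for row in original_rows
    }
    matches = {}
    for index, row in enumerate(broken_rows):
        digest = pixel_md5(row[image_column])
        if digest in text_by_hash:
            matches[index] = text_by_hash[digest]
    return matches
```

A 100% match rate corresponds to `len(matches)` equalling the number of rows in the broken subset, and comparing `len(matches[i])` with the length of the broken answer for the same row exposes the truncation directly.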
Suggested Fix:
- Option 1 (Preferred): Update the ahmedheakl/arocrbench_khattparagraph HuggingFace repository by replacing the broken answer column with the correct full-paragraph texts mapped from the original a-alnaggar/khatt-paragraphs dataset.
- Option 2: Temporarily remove khattparagraph from the default ds_ids list in eval.py to prevent it from skewing the overall KITAB-Bench benchmark scores until the upstream dataset is fixed.
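For Option 2, one possible shape of the change is sketched below. The list contents and the helper are hypothetical, since the actual definition of ds_ids in eval.py is not reproduced here; only the two dataset names mentioned in this issue are used.

```python
# Hypothetical stand-in for the default list in eval.py; real entries may differ.
ds_ids = ["khatt", "khattparagraph"]

# Subsets with known-bad ground truth, skipped until fixed upstream.
KNOWN_BROKEN = {"khattparagraph"}


def active_dataset_ids(ids, excluded=KNOWN_BROKEN):
    """Return the ids to evaluate, preserving order and dropping excluded ones."""
    return [ds_id for ds_id in ids if ds_id not in excluded]


print(active_dataset_ids(ds_ids))  # ['khatt']
```

Keeping the exclusion in a separate set, rather than deleting the entry outright, makes it easy to re-enable the subset once the annotations are corrected.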