Add WISE Benchmark Task #1301
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Review: Add WISE Benchmark Task
I pushed a lint fix (black + isort formatting) in commit 159706b to get CI green.
Issues Found
1. Dataset loading config looks wrong
dataset_path: json
dataset_kwargs:
  data_files:
    test: Yuwei-Niu/WISE

The json dataset loader expects a file path or URL to a JSON file, not a HuggingFace dataset repo identifier. This will likely fail at runtime. Should be either:
- dataset_path: Yuwei-Niu/WISE (if it's a proper HF dataset), or
- Provide the actual URL to the JSON file (e.g., https://huggingface.co/datasets/Yuwei-Niu/WISE/resolve/main/final_data.json)
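This kind of mismatch can be caught before anything is loaded. The sketch below is a stdlib-only pre-flight check and is not part of lmms-eval or `datasets`: `looks_like_data_file` and `check_json_data_files` are hypothetical helpers, and the heuristic (the entry must end in a JSON-like file suffix) is an assumption about what the `json` builder can resolve.

```python
from urllib.parse import urlparse

# Suffixes treated as "an actual data file" (an assumption for this sketch).
DATA_FILE_SUFFIXES = (".json", ".jsonl", ".json.gz", ".jsonl.gz")


def looks_like_data_file(entry: str) -> bool:
    """Return True if entry is a URL or path ending in a JSON-like suffix."""
    # For URLs, only inspect the path component so query strings are ignored.
    path = urlparse(entry).path if "://" in entry else entry
    return path.lower().endswith(DATA_FILE_SUFFIXES)


def check_json_data_files(data_files: dict) -> list:
    """Return the splits whose data_files entry does not point at a file."""
    return [split for split, entry in data_files.items()
            if not looks_like_data_file(entry)]


# The config from this PR passes a repo id where a file is expected.
print(check_json_data_files({"test": "Yuwei-Niu/WISE"}))  # ['test']
```

With the resolved URL from the second option, the same check returns an empty list.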
2. output_type: generate_until vs text-to-image
This task evaluates generated images, but generate_until produces text. The code in _find_generated_image looks for images saved to WISE_RAW_OUTPUT_DIR by the model, but the connection between lmms-eval's generation pipeline and image output is unclear. A brief explanation in the README or PR description would help.
3. wise_process_results relies on doc_id from **kwargs
doc_id = kwargs.get("doc_id", 0)

Does lmms-eval's process_results callback actually pass doc_id in kwargs? If not, every sample will use doc_id=0 and fail to find its image. This needs verification.
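To make the concern concrete, here is a minimal sketch contrasting the silent default with a fail-fast lookup. Both function names are hypothetical, and whether lmms-eval forwards `doc_id` at all is exactly the open question, so nothing here reflects the real callback signature.

```python
def get_doc_id_lenient(**kwargs) -> int:
    """Pattern used in the PR: silently falls back to 0 when doc_id is absent."""
    return kwargs.get("doc_id", 0)


def get_doc_id_strict(**kwargs) -> int:
    """Fail-fast variant: a missing doc_id raises instead of aliasing to sample 0."""
    if "doc_id" not in kwargs:
        raise KeyError("process_results was called without doc_id")
    return int(kwargs["doc_id"])


# If the harness never passes doc_id, every sample collapses onto 0.
print([get_doc_id_lenient() for _ in range(3)])  # [0, 0, 0]
```

The strict variant turns a quiet mis-scoring into an immediate error, which is easier to notice in a first test run.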
4. Hardcoded local paths in README
cd /pfs/weiyang/Show/lmms-eval
and
WISE_RAW_OUTPUT_DIR="/pfs/weiyang/Show/lmms-eval/outputs/WISE_raw/bagel_umm"
These are internal paths. Please replace with generic placeholders.
5. _find_generated_image is model-specific
The function hardcodes naming patterns for bagel.py, bagel_unig2u.py, mmada.py. This is fragile — any new model integration would require modifying this function. Consider a more generic approach (e.g., glob for *{doc_id}* or let the model specify the naming convention).
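One caveat on the glob idea: a bare `*{doc_id}*` would let doc_id 1 match `WISE_10.png`, and doc_id 0 match the `_0` suffix in `WISE_5_0.png`. A possible model-agnostic lookup is sketched below. `find_image_for_doc`, the `prefix` parameter, and the suffix list are illustrative assumptions, not code from this PR.

```python
import os
import re
from typing import Optional

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".webp")


def find_image_for_doc(output_dir: str, doc_id: int, prefix: str = "WISE_") -> Optional[str]:
    """Search output_dir recursively for an image named <prefix><doc_id>[...].

    The negative lookahead keeps doc_id=1 from matching WISE_10.png, and
    anchoring at the prefix keeps doc_id=0 from matching WISE_5_0.png.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}{doc_id}(?!\d)")
    matches = []
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            stem, ext = os.path.splitext(name)
            if ext.lower() in IMAGE_SUFFIXES and pattern.match(stem):
                matches.append(os.path.join(root, name))
    if not matches:
        return None
    # Prefer the most recently written file when several runs share the directory.
    return max(matches, key=os.path.getmtime)
```

This covers both naming schemes listed in the docstring without naming any model, though reading the path from the model output would avoid the search entirely.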
Summary
The core evaluation logic (GPT-4o judge → 3 scores → WiScore) is sound and well-structured. However, there are several integration concerns (dataset loading, doc_id passing, image discovery) that need verification before merge.
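For readers unfamiliar with the last step of that pipeline, the aggregation can be sketched as a weighted mean of the three judge scores. The weights (0.7 consistency, 0.2 realism, 0.1 aesthetic quality) and the 0 to 2 score range are my recollection of the WISE paper, not something stated in this PR, so treat them as assumptions and check them against the task's own code.

```python
def wiscore(consistency: float, realism: float, aesthetic: float,
            weights=(0.7, 0.2, 0.1), max_score: float = 2.0) -> float:
    """Weighted mean of three judge scores, normalised to the range [0, 1].

    The default weights and max_score are assumptions (see the note above).
    """
    for score in (consistency, realism, aesthetic):
        if not 0 <= score <= max_score:
            raise ValueError(f"score {score} outside [0, {max_score}]")
    w_c, w_r, w_a = weights
    return (w_c * consistency + w_r * realism + w_a * aesthetic) / max_score
```

Validating the range up front matters here because the scores come from parsing a judge model's free-text reply.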
dataset_path: json
dataset_kwargs:
  data_files:
    test: Yuwei-Niu/WISE
task: WISE
test_split: test
If load_dataset can be used here, I think a cleaner format would be better. Maybe we can just use Yuwei-Niu/WISE?
Environment variables:
- WISE_RAW_OUTPUT_DIR: model output directory (e.g., /pfs/.../WISE_raw/bagel_umm).
  Required for finding generated images.
This does not look good in my opinion. Please refer to https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/lmms_eval/models/simple/bagel.py and https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/lmms_eval/models/chat/bagel_lmms_engine.py and get the image path directly from the answer. You can also refer to other tasks such as imgedit. Thanks!
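Reading the path from the answer could look roughly like the sketch below. It assumes the model returns a JSON string containing an `images` list of file paths. That schema is an assumption on my part and should be checked against bagel.py and the imgedit task. `extract_image_paths` is a hypothetical helper.

```python
import json
from typing import List


def extract_image_paths(result: str) -> List[str]:
    """Pull image paths out of a model answer.

    Assumes the answer is a JSON object with an "images" list of strings;
    returns an empty list for anything that does not fit that shape.
    """
    try:
        payload = json.loads(result)
    except (TypeError, ValueError):
        # Not JSON at all (plain text, None, ...).
        return []
    if not isinstance(payload, dict):
        return []
    images = payload.get("images", [])
    if not isinstance(images, list):
        return []
    return [path for path in images if isinstance(path, str)]
```

With something like this in process_results, neither an environment variable nor a directory search is needed to locate the image.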
def _find_generated_image(output_dir: str, doc_id: int) -> Optional[str]:
    """Find generated image for doc_id in output_dir.

    Supported formats:
    - bagel.py: output_dir/20260417_xxx/WISE_{doc_id}_0.png
    - bagel_unig2u.py: output_dir/WISE_{doc_id}.png
    - mmada.py: output_dir/WISE_{doc_id}.png
    """
    if not os.path.exists(output_dir):
        return None

    # 1. Check for timestamp subdirectories (bagel.py case)
    subdirs = [d for d in os.listdir(output_dir) if os.path.isdir(os.path.join(output_dir, d)) and re.match(r"\d{8}_\d{6}_[a-f0-9]+", d)]

    if subdirs:
        # Use the latest timestamp directory
        latest_subdir = sorted(subdirs, reverse=True)[0]
        search_dir = os.path.join(output_dir, latest_subdir)
    else:
        search_dir = output_dir

    # 2. Try common naming patterns
    possible_names = [
        f"WISE_{doc_id}_0.png",  # bagel.py
        f"WISE_{doc_id}.png",  # bagel_unig2u, mmada
    ]

    for name in possible_names:
        path = os.path.join(search_dir, name)
        if os.path.exists(path):
            return path

    return None
Once the WISE env var issue is addressed, I think this part will no longer be needed, since we get the image path directly from the output.
Thanks for the review. I addressed the WISE integration issues:
Hi, I think the GitHub content is not what we want to use here. Just like imgedit, we can use

dataset_path: kcz358/imgedit
dataset_kwargs:
  token: True
task: "imgedit"
test_split: test

so that is equivalent to
I have addressed it. Thx!
kcz358 left a comment
LGTM, thanks for the contributions!