Add WISE Benchmark Task by Purshow · Pull Request #1301 · EvolvingLMMs-Lab/lmms-eval

Purshow · 2026-04-17T12:22:41Z

Summary

In scope

Out of scope

Validation

Risk / Compatibility

Type of Change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mwxely

Review: Add WISE Benchmark Task

I pushed a lint fix (black + isort formatting) in commit 159706b to get CI green.

Issues Found

1. Dataset loading config looks wrong

dataset_path: json
dataset_kwargs:
  data_files:
    test: Yuwei-Niu/WISE

The json dataset loader expects a file path or URL to a JSON file, not a HuggingFace dataset repo identifier. This will likely fail at runtime. Should be either:

dataset_path: Yuwei-Niu/WISE (if it's a proper HF dataset), or
Provide the actual URL to the JSON file (e.g., https://huggingface.co/datasets/Yuwei-Niu/WISE/resolve/main/final_data.json)

2. output_type: generate_until vs text-to-image

This task evaluates generated images, but generate_until produces text. The code in _find_generated_image looks for images saved to WISE_RAW_OUTPUT_DIR by the model, but the connection between lmms-eval's generation pipeline and image output is unclear. A brief explanation in the README or PR description would help.

3. wise_process_results relies on doc_id from **kwargs

doc_id = kwargs.get("doc_id", 0)

Does lmms-eval's process_results callback actually pass doc_id in kwargs? If not, every sample will use doc_id=0 and fail to find its image. This needs verification.
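
The failure mode in question can be reproduced in isolation. The sketch below does not call lmms-eval; `process_results_sketch`, `process_results_strict`, and both call sites are hypothetical stand-ins, used only to show that `kwargs.get("doc_id", 0)` silently returns 0 whenever the caller omits the keyword.

```python
from typing import Any, Dict, List


def process_results_sketch(doc: Dict[str, Any], results: List[str], **kwargs: Any) -> int:
    """Hypothetical stand-in for a process_results callback.

    Returns the doc_id it would use to look up the generated image.
    """
    # Mirrors the pattern under review: a silent default hides a missing key.
    return kwargs.get("doc_id", 0)


def process_results_strict(doc: Dict[str, Any], results: List[str], **kwargs: Any) -> int:
    """Variant that fails loudly instead of defaulting to 0."""
    if "doc_id" not in kwargs:
        raise KeyError("doc_id was not passed to process_results")
    return kwargs["doc_id"]


docs = [{"prompt": "a"}, {"prompt": "b"}, {"prompt": "c"}]

# Caller that does NOT forward doc_id: every sample collapses to 0.
ids_without = [process_results_sketch(d, ["out"]) for d in docs]

# Caller that does forward doc_id: each sample keeps its own index.
ids_with = [process_results_sketch(d, ["out"], doc_id=i) for i, d in enumerate(docs)]
```

If the framework does not forward `doc_id`, the strict variant would surface the problem on the first sample instead of producing missing-image lookups for every sample after the first.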

4. Hardcoded local paths in README

cd /pfs/weiyang/Show/lmms-eval

and

WISE_RAW_OUTPUT_DIR="/pfs/weiyang/Show/lmms-eval/outputs/WISE_raw/bagel_umm"

These are internal paths. Please replace with generic placeholders.

5. _find_generated_image is model-specific

The function hardcodes naming patterns for bagel.py, bagel_unig2u.py, and mmada.py. This is fragile: any new model integration would require modifying this function. Consider a more generic approach (e.g., glob for *{doc_id}* or let the model specify the naming convention).
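
One possible shape for the generic approach is sketched below. `find_image_for_doc` is a hypothetical helper, not code from this PR, and it assumes file names of the form `<task>_<doc_id>.png` or `<task>_<doc_id>_<n>.png`, as listed in the docstring under review. Note that a bare `*{doc_id}*` glob would over-match (doc_id 1 also matches `WISE_10.png`), so the sketch anchors the id with a regex.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional


def find_image_for_doc(output_dir: str, task: str, doc_id: int) -> Optional[str]:
    """Return the newest PNG under output_dir whose name encodes doc_id, else None.

    Hypothetical helper: searches at any depth, so timestamped run
    subdirectories need no model-specific branch.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None
    # Anchor the id so that doc_id=1 does not also match WISE_10.png.
    pattern = re.compile(rf"^{re.escape(task)}_{doc_id}(?:_\d+)?\.png$")
    matches = [p for p in root.rglob("*.png") if pattern.match(p.name)]
    if not matches:
        return None
    # Prefer the most recently modified file when several runs share the directory.
    return str(max(matches, key=lambda p: p.stat().st_mtime))


# Small offline demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "20260417_000000_abc"
    run_dir.mkdir()
    (run_dir / "WISE_1_0.png").touch()
    (Path(tmp) / "WISE_10.png").touch()
    found_1 = Path(find_image_for_doc(tmp, "WISE", 1)).name
    found_10 = Path(find_image_for_doc(tmp, "WISE", 10)).name
    found_3 = find_image_for_doc(tmp, "WISE", 3)
```

Letting the model wrapper report the saved path itself would remove the need for any directory search.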

Summary

The core evaluation logic (GPT-4o judge → 3 scores → WiScore) is sound and well-structured. However, there are several integration concerns (dataset loading, doc_id passing, image discovery) that need verification before merge.

kcz358 · 2026-04-17T14:27:16Z

+dataset_path: json
+dataset_kwargs:
+  data_files:
+    test: Yuwei-Niu/WISE
+task: WISE
+test_split: test


If load_dataset can be used, a cleaner format is possible in my opinion. Maybe we can just use Yuwei-Niu/WISE as the dataset_path?

kcz358 · 2026-04-17T14:28:59Z

+Environment variables:
+  - WISE_RAW_OUTPUT_DIR: model output directory (e.g., /pfs/.../WISE_raw/bagel_umm).
+    Required for finding generated images.


This does not look good in my opinion. Please refer to https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/lmms_eval/models/simple/bagel.py and https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/lmms_eval/models/chat/bagel_lmms_engine.py and get the image path directly from the answer. You could also refer to other tasks such as imgedit. Thanks!

kcz358 · 2026-04-17T14:29:58Z

+def _find_generated_image(output_dir: str, doc_id: int) -> Optional[str]:
+    """Find generated image for doc_id in output_dir.
+
+    Supported formats:
+    - bagel.py: output_dir/20260417_xxx/WISE_{doc_id}_0.png
+    - bagel_unig2u.py: output_dir/WISE_{doc_id}.png
+    - mmada.py: output_dir/WISE_{doc_id}.png
+    """
+    if not os.path.exists(output_dir):
+        return None
+
+    # 1. Check for timestamp subdirectories (bagel.py case)
+    subdirs = [d for d in os.listdir(output_dir) if os.path.isdir(os.path.join(output_dir, d)) and re.match(r"\d{8}_\d{6}_[a-f0-9]+", d)]
+
+    if subdirs:
+        # Use the latest timestamp directory
+        latest_subdir = sorted(subdirs, reverse=True)[0]
+        search_dir = os.path.join(output_dir, latest_subdir)
+    else:
+        search_dir = output_dir
+
+    # 2. Try common naming patterns
+    possible_names = [
+        f"WISE_{doc_id}_0.png",  # bagel.py
+        f"WISE_{doc_id}.png",  # bagel_unig2u, mmada
+    ]
+
+    for name in possible_names:
+        path = os.path.join(search_dir, name)
+        if os.path.exists(path):
+            return path
+
+    return None


Once the WISE environment variable is removed, I think this part will no longer be needed, since we get the image path directly from the output.

Purshow · 2026-04-17T17:54:50Z

Thanks for the review. I addressed the WISE integration issues:

Replaced the Hugging Face json data_files entry with the actual official WISE JSON URLs.
Removed WISE_RAW_OUTPUT_DIR and model-specific image discovery.
WISE now follows the imgedit-style contract: model wrappers return JSON with an images list, and process_results reads images[0].
Removed doc_id dependency from process_results; prompt_id/category now come from the dataset doc.
Replaced internal README paths with generic placeholders.
Reduced sample logging duplication by keeping detailed per-sample info only under WISE_overall_wiscore and using compact scalar category metrics.
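
The imgedit-style contract described above can be sketched as follows. The field name `images` comes from the comment; the function name `first_image_path`, the `text` field, the example path, and the error handling are assumptions for illustration rather than the merged implementation.

```python
import json
from typing import Optional


def first_image_path(result: str) -> Optional[str]:
    """Extract images[0] from a model output string, or None if unavailable."""
    try:
        payload = json.loads(result)
    except (TypeError, json.JSONDecodeError):
        return None  # Output was not JSON, e.g. a plain-text generation.
    if not isinstance(payload, dict):
        return None
    images = payload.get("images")
    if not isinstance(images, list) or not images:
        return None
    first = images[0]
    return first if isinstance(first, str) else None


# Hypothetical wrapper output: a JSON string with a list of saved image paths.
model_output = json.dumps({"text": "", "images": ["outputs/WISE_7.png"]})
path = first_image_path(model_output)
```

Because the path travels with the model output, the scorer needs neither an environment variable nor the sample's doc_id to locate the image.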

kcz358 · 2026-04-18T12:32:58Z

Hi, I think the GitHub content is not what we want to use here. As with imgedit, we can use

dataset_path: kcz358/imgedit
dataset_kwargs:
  token: True
task: "imgedit"
test_split: test

which is equivalent to load_dataset("kcz358/imgedit", split="test"). Since you use https://huggingface.co/datasets/Yuwei-Niu/WISE, I think you can set dataset_path to Yuwei-Niu/WISE and test_split to train to use the split that Hugging Face generates automatically. If you want to configure it properly, see https://huggingface.co/docs/hub/datasets-manual-configuration for how to set the files and splits manually.
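
To make the equivalence concrete, the sketch below maps a task config onto the arguments of `datasets.load_dataset` without calling it, so it runs offline. `load_dataset_args` is a hypothetical helper, not lmms-eval's actual loader, and the WISE config shown is only the suggestion from this comment.

```python
from typing import Any, Dict, Tuple


def load_dataset_args(config: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return (path, kwargs) as they would be passed to datasets.load_dataset."""
    kwargs = dict(config.get("dataset_kwargs") or {})
    kwargs["split"] = config["test_split"]
    return config["dataset_path"], kwargs


imgedit_cfg = {
    "dataset_path": "kcz358/imgedit",
    "dataset_kwargs": {"token": True},
    "task": "imgedit",
    "test_split": "test",
}
# Suggested WISE config: repo id as the path, auto-generated "train" split.
wise_cfg = {"dataset_path": "Yuwei-Niu/WISE", "task": "WISE", "test_split": "train"}

imgedit_call = load_dataset_args(imgedit_cfg)
wise_call = load_dataset_args(wise_cfg)
# With the `datasets` package installed and network access, the real call would be:
#   path, kwargs = wise_call; datasets.load_dataset(path, **kwargs)
```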

Purshow · 2026-04-18T17:01:42Z

I have addressed it. Thx!

kcz358

LGTM, thanks for the contributions!

Add WISE Benchmark Task

e06c725

mwxely requested review from kcz358 and mwxely April 17, 2026 13:46

style: fix black and isort formatting for WISE utils

159706b

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mwxely reviewed Apr 17, 2026

kcz358 reviewed Apr 17, 2026

fix: address WISE task review feedback

6a8b1ae

Purshow added 2 commits April 18, 2026 23:06

fix: use Hugging Face WISE dataset

88b789d

update wise test

a663912

kcz358 approved these changes Apr 20, 2026

kcz358 merged commit b3d9dff into EvolvingLMMs-Lab:main Apr 20, 2026
3 checks passed

kcz358 merged 5 commits into EvolvingLMMs-Lab:main from Purshow:main
