
Add WISE Benchmark Task #1301

Merged
kcz358 merged 5 commits into EvolvingLMMs-Lab:main from Purshow:main
Apr 20, 2026

Conversation

@Purshow
Contributor

@Purshow Purshow commented Apr 17, 2026

Summary

In scope

Out of scope

Validation

Risk / Compatibility

Type of Change

  • Bug fix (non-breaking change)
  • New feature
  • [ ✅] New benchmark/task
  • New model integration
  • Breaking change
  • Documentation update
  • Refactoring (no functional changes)

@mwxely mwxely requested review from kcz358 and mwxely April 17, 2026 13:46
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@mwxely mwxely left a comment


Review: Add WISE Benchmark Task

I pushed a lint fix (black + isort formatting) in commit 159706b to get CI green.

Issues Found

1. Dataset loading config looks wrong

dataset_path: json
dataset_kwargs:
  data_files:
    test: Yuwei-Niu/WISE

The json dataset loader expects a file path or URL to a JSON file, not a HuggingFace dataset repo identifier. This will likely fail at runtime. Should be either:

  • dataset_path: Yuwei-Niu/WISE (if it's a proper HF dataset), or
  • Provide the actual URL to the JSON file (e.g., https://huggingface.co/datasets/Yuwei-Niu/WISE/resolve/main/final_data.json)
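A quick sanity check along these lines could catch this kind of misconfiguration before runtime. This is a minimal sketch, not lmms-eval code: `looks_like_data_file` is a hypothetical helper, and the heuristic (an entry must end in `.json` or `.jsonl`, whether it is a local path or an http(s) URL) is an assumption about what a usable `data_files` entry for the json loader looks like.

```python
from urllib.parse import urlparse

# Assumed set of suffixes that the json loader can read.
JSON_SUFFIXES = (".json", ".jsonl")


def looks_like_data_file(entry: str) -> bool:
    """Return True if `entry` plausibly points at a JSON file (hypothetical check)."""
    parsed = urlparse(entry)
    # For URLs, inspect only the path component so query strings are ignored.
    path = parsed.path if parsed.scheme in ("http", "https") else entry
    return path.lower().endswith(JSON_SUFFIXES)


repo_id = "Yuwei-Niu/WISE"
file_url = "https://huggingface.co/datasets/Yuwei-Niu/WISE/resolve/main/final_data.json"

# The repo identifier fails the check, while the resolved file URL passes it.
print(looks_like_data_file(repo_id), looks_like_data_file(file_url))
```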

2. output_type: generate_until vs text-to-image
This task evaluates generated images, but generate_until produces text. The code in _find_generated_image looks for images saved to WISE_RAW_OUTPUT_DIR by the model, but the connection between lmms-eval's generation pipeline and image output is unclear. A brief explanation in the README or PR description would help.

3. wise_process_results relies on doc_id from **kwargs

doc_id = kwargs.get("doc_id", 0)

Does lmms-eval's process_results callback actually pass doc_id in kwargs? If not, every sample will use doc_id=0 and fail to find its image. This needs verification.
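To make the failure mode concrete, here is a small sketch contrasting the lenient lookup with a strict one. Both function names are hypothetical and neither is taken from the PR; the point is only that a default of 0 hides a missing argument, whereas a strict lookup fails loudly.

```python
def doc_id_lenient(**kwargs) -> int:
    # Mirrors the pattern under review: a missing doc_id silently becomes 0.
    return kwargs.get("doc_id", 0)


def doc_id_strict(**kwargs) -> int:
    # Hypothetical alternative: surface the missing argument immediately.
    if "doc_id" not in kwargs:
        raise KeyError("doc_id was not passed to process_results")
    return kwargs["doc_id"]


# With no doc_id supplied, the lenient version cannot be told apart
# from a genuine doc_id of 0.
assert doc_id_lenient() == doc_id_lenient(doc_id=0)
```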

4. Hardcoded local paths in README

cd /pfs/weiyang/Show/lmms-eval

and

WISE_RAW_OUTPUT_DIR="/pfs/weiyang/Show/lmms-eval/outputs/WISE_raw/bagel_umm"

These are internal paths. Please replace with generic placeholders.

5. _find_generated_image is model-specific
The function hardcodes naming patterns for bagel.py, bagel_unig2u.py, mmada.py. This is fragile — any new model integration would require modifying this function. Consider a more generic approach (e.g., glob for *{doc_id}* or let the model specify the naming convention).
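One possible shape for a more generic lookup is sketched below. `find_image_for_doc` and its `task` parameter are hypothetical, and the accepted extensions are an assumption. Note that a bare `*{doc_id}*` glob would also match other ids (for example, doc_id 1 would match `WISE_10.png`), so the sketch uses an anchored regular expression instead. Unlike the original function, it does not prefer the latest timestamped subdirectory.

```python
import os
import re
from typing import Optional


def find_image_for_doc(output_dir: str, doc_id: int, task: str = "WISE") -> Optional[str]:
    """Return the first image under output_dir whose file name encodes doc_id."""
    # Accept <task>_<doc_id>.<ext> or <task>_<doc_id>_<n>.<ext>; anchoring the
    # pattern keeps doc_id=1 from matching WISE_10.png.
    pattern = re.compile(
        rf"^{re.escape(task)}_{doc_id}(?:_\d+)?\.(?:png|jpg|jpeg)$"
    )
    # os.walk yields nothing if output_dir does not exist, so we fall through to None.
    for root, _dirs, files in sorted(os.walk(output_dir)):
        for name in sorted(files):
            if pattern.match(name):
                return os.path.join(root, name)
    return None
```

Walking the tree makes the lookup independent of whether a model writes into a timestamped subdirectory, at the cost of scanning every file once per call.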

Summary

The core evaluation logic (GPT-4o judge → 3 scores → WiScore) is sound and well-structured. However, there are several integration concerns (dataset loading, doc_id passing, image discovery) that need verification before merge.

Comment thread lmms_eval/tasks/WISE/WISE.yaml Outdated
Comment on lines +9 to +14
dataset_path: json
dataset_kwargs:
  data_files:
    test: Yuwei-Niu/WISE
task: WISE
test_split: test
Collaborator


If we can use load_dataset here, I think a cleaner format is possible. Maybe we can just try using Yuwei-Niu/WISE?

Comment thread lmms_eval/tasks/WISE/utils.py Outdated
Comment on lines +12 to +14
Environment variables:
- WISE_RAW_OUTPUT_DIR: model output directory (e.g., /pfs/.../WISE_raw/bagel_umm).
Required for finding generated images.
Collaborator


This does not look good, in my opinion. Please refer to https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/lmms_eval/models/simple/bagel.py and https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/lmms_eval/models/chat/bagel_lmms_engine.py and get the image path directly from the answer. You could also refer to other tasks such as imgedit. Thanks!

Comment thread lmms_eval/tasks/WISE/utils.py Outdated
Comment on lines +369 to +401
def _find_generated_image(output_dir: str, doc_id: int) -> Optional[str]:
    """Find generated image for doc_id in output_dir.

    Supported formats:
    - bagel.py: output_dir/20260417_xxx/WISE_{doc_id}_0.png
    - bagel_unig2u.py: output_dir/WISE_{doc_id}.png
    - mmada.py: output_dir/WISE_{doc_id}.png
    """
    if not os.path.exists(output_dir):
        return None

    # 1. Check for timestamp subdirectories (bagel.py case)
    subdirs = [d for d in os.listdir(output_dir) if os.path.isdir(os.path.join(output_dir, d)) and re.match(r"\d{8}_\d{6}_[a-f0-9]+", d)]

    if subdirs:
        # Use the latest timestamp directory
        latest_subdir = sorted(subdirs, reverse=True)[0]
        search_dir = os.path.join(output_dir, latest_subdir)
    else:
        search_dir = output_dir

    # 2. Try common naming patterns
    possible_names = [
        f"WISE_{doc_id}_0.png",  # bagel.py
        f"WISE_{doc_id}.png",  # bagel_unig2u, mmada
    ]

    for name in possible_names:
        path = os.path.join(search_dir, name)
        if os.path.exists(path):
            return path

    return None
Collaborator


Once the WISE environment variable issue is addressed, I think this part will no longer be needed, since we can get the image path directly from the output.

@Purshow
Contributor Author

Purshow commented Apr 17, 2026

Thanks for the review. I addressed the WISE integration issues:

  • Replaced the huggingface json data_files entry with actual official WISE JSON URLs.
  • Removed WISE_RAW_OUTPUT_DIR and model-specific image discovery.
  • WISE now follows the imgedit-style contract: model wrappers return JSON with an images list, and process_results reads images[0].
  • Removed doc_id dependency from process_results; prompt_id/category now come from the dataset doc.
  • Replaced internal README paths with generic placeholders.
  • Reduced sample logging duplication by keeping detailed per-sample info only under WISE_overall_wiscore and using compact scalar category metrics.
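The imgedit-style contract described above might look roughly like the following on the task side. This is a sketch under stated assumptions rather than the merged code: `first_image_path` is a hypothetical helper, and only the `images` key comes from the description above; the `text` key in the example payload is assumed for illustration.

```python
import json
from typing import Optional


def first_image_path(model_output: str) -> Optional[str]:
    """Parse a JSON model output and return images[0], or None if unavailable."""
    try:
        payload = json.loads(model_output)
    except (json.JSONDecodeError, TypeError):
        # The wrapper returned something that is not JSON.
        return None
    if not isinstance(payload, dict):
        return None
    images = payload.get("images")
    if isinstance(images, list) and images:
        return images[0]
    return None


# Hypothetical wrapper output; the path is a placeholder.
example = json.dumps({"text": "", "images": ["/tmp/WISE_0.png"]})
print(first_image_path(example))
```

Because the path travels with the model output, the task no longer needs an environment variable or per-model file naming rules to locate the image.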

@kcz358
Collaborator

kcz358 commented Apr 18, 2026

Hi, I don't think the GitHub content is what we want to use here. We can follow imgedit, which uses:

dataset_path: kcz358/imgedit
dataset_kwargs:
  token: True
task: "imgedit"
test_split: test

which is equivalent to load_dataset("kcz358/imgedit", split="test"). Since you use https://huggingface.co/datasets/Yuwei-Niu/WISE, I think you can set dataset_path to Yuwei-Niu/WISE and test_split to train, so that the split Hugging Face generates automatically is used. If you want to set it up properly, check https://huggingface.co/docs/hub/datasets-manual-configuration for information on configuring the files and splits manually.
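The mapping from task config to loader arguments can be sketched as follows. `to_load_dataset_call` is a hypothetical helper written for illustration and does not claim to reflect how lmms-eval builds the call internally; it only restates the equivalence above in code, without touching the network.

```python
from typing import Any, Dict, Tuple


def to_load_dataset_call(cfg: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Translate a task config into (path, kwargs) for datasets.load_dataset."""
    # Copy dataset_kwargs so the original config is left untouched.
    kwargs = dict(cfg.get("dataset_kwargs") or {})
    kwargs["split"] = cfg["test_split"]
    return cfg["dataset_path"], kwargs


imgedit_cfg = {
    "dataset_path": "kcz358/imgedit",
    "dataset_kwargs": {"token": True},
    "test_split": "test",
}
wise_cfg = {"dataset_path": "Yuwei-Niu/WISE", "test_split": "train"}

# datasets.load_dataset(path, **kwargs) would then perform the actual load.
print(to_load_dataset_call(imgedit_cfg))
print(to_load_dataset_call(wise_cfg))
```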

@Purshow
Contributor Author

Purshow commented Apr 18, 2026

I have addressed it. Thx!

Collaborator

@kcz358 kcz358 left a comment


LGTM, thanks for the contributions!

@kcz358 kcz358 merged commit b3d9dff into EvolvingLMMs-Lab:main Apr 20, 2026
3 checks passed