
Reproduction Discrepancy on MM-SafetyBench – Request for Clarification on Experimental Setup #5

@zhaoyanshang

Hi authors,
Thank you for sharing this excellent work! I’ve conducted extensive reproduction experiments and consistently observed a significant performance gap on the MM-SafetyBench dataset across all three evaluated models, compared to the results reported in the paper (~92% ASR).

1. Reproduction Details

For each model, I performed five independent runs using the default configurations provided in config/mmsafetybench/. The results are summarized below, with H = HarmBench Judge ASR and L = LlamaGuard3 ASR:

| Model | Run 1 (H/L) | Run 2 (H/L) | Run 3 (H/L) | Run 4 (H/L) | Run 5 (H/L) | Mean ± Std (H/L) |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2-VL-7B | 59.17%/46.79% | 62.56%/45.65% | 59.05%/47.80% | 66.90%/47.74% | 71.13%/47.62% | 63.76% ± 4.57% / 47.06% ± 0.58% |
| InternVL2-8B | 50.00%/45.71% | 50.54%/45.48% | 53.27%/44.05% | 51.79%/48.33% | 49.23%/45.00% | 50.95% ± 1.42% / 45.71% ± 1.56% |
| MiniGPT4-13B | 50.48%/43.15% | 48.21%/44.35% | 51.96%/44.88% | 50.24%/45.36% | 54.88%/46.25% | 51.35% ± 1.86% / 44.80% ± 1.10% |
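
For anyone who wants to recheck the summary column, the per-run values in the table above can be aggregated with a few lines of Python. This is only a sketch: the `runs` dictionary and the `summarize` helper are illustrative names, not part of the repository. It reports both the population and the sample standard deviation, since the table does not state which convention was used, so the recomputed figures may not match the summary column exactly.

```python
from statistics import mean, pstdev, stdev

# Per-run ASR values (%) copied from the table: (HarmBench, LlamaGuard3).
runs = {
    "Qwen2-VL-7B": [(59.17, 46.79), (62.56, 45.65), (59.05, 47.80),
                    (66.90, 47.74), (71.13, 47.62)],
    "InternVL2-8B": [(50.00, 45.71), (50.54, 45.48), (53.27, 44.05),
                     (51.79, 48.33), (49.23, 45.00)],
    "MiniGPT4-13B": [(50.48, 43.15), (48.21, 44.35), (51.96, 44.88),
                     (50.24, 45.36), (54.88, 46.25)],
}


def summarize(values):
    """Return (mean, population std, sample std) for a list of ASR values."""
    return mean(values), pstdev(values), stdev(values)


summary = {}
for model, pairs in runs.items():
    harmbench = [h for h, _ in pairs]
    llamaguard = [l for _, l in pairs]
    summary[model] = {"H": summarize(harmbench), "L": summarize(llamaguard)}

for model, stats in summary.items():
    for judge, (avg, pop_sd, samp_sd) in stats.items():
        print(f"{model} {judge}: mean={avg:.2f}% "
              f"pop_std={pop_sd:.2f}% sample_std={samp_sd:.2f}%")
```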

2. Key Observations

  • High consistency across runs: Standard deviations are low (<2% for LlamaGuard3, <5% for HarmBench), indicating the discrepancy is systematic rather than due to random variance.
  • Consistent underperformance across models: All three models show significantly lower ASR than reported, especially under LlamaGuard3 evaluation (~44–47%), suggesting a potential issue specific to the MM-SafetyBench setup or evaluation protocol.
  • Validation with additional judges: Besides using the HarmBench model as the judge, I also ran the same GPT-4o-mini detector mentioned in the original paper for comparison. It yielded an ASR of only about 43%, which aligns closely with the LlamaGuard3 results. This further indicates that the gap is not specific to one evaluator but holds across different assessment methods.

3. What I’ve Verified

  • Used configuration files from config/mmsafetybench/{internvl2,minigpt4,qwen2vl}/
  • Confirmed correct dataset loading via dataset/mmsafetybench_loader.py
  • Reviewed attack implementations in attack/{model}_attack.py
  • Validated that HarmBench results match expected trends

4. Critical Questions

Given the consistent gap of roughly 30 to 50 percentage points relative to the reported ~92% ASR, across all models and evaluators, I suspect the discrepancy may stem from differences in:

  • Dataset version or subset: Which exact version or split of MM-SafetyBench was used in your experiments? (The dataset includes multiple subsets and annotation variants.)
  • Preprocessing pipeline: Were there any MM-SafetyBench-specific image/text preprocessing steps not included in the public codebase?
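
To make the dataset comparison concrete, a fingerprint like the one below could be computed on both sides and compared. This is a minimal sketch under my own assumptions: `dataset_fingerprint` is a hypothetical helper, not something in the repository, and it simply hashes every file under a local dataset directory in sorted order.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(root):
    """Return (file_count, sha256_hex) over all files under `root`.

    Files are visited in sorted order and each relative path is hashed
    together with the file contents, so renamed, added, removed, or
    modified files all change the digest.
    """
    root = Path(root)
    digest = hashlib.sha256()
    count = 0
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between path and contents
        digest.update(path.read_bytes())
        count += 1
    return count, digest.hexdigest()
```

Calling `dataset_fingerprint` on the local MM-SafetyBench directory returns the number of files and a single digest. If the two sides report different values, the dataset copies differ in version, subset, or preprocessing output.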

Would you be willing to share the exact experimental setup used for MM-SafetyBench in your paper—particularly:

  • The dataset version/subset
  • Any benchmark-specific parameters
  • Evaluation methodology (e.g., how H and L scores were combined)

This would greatly aid accurate reproduction and community validation.

Thank you very much for your time and support!
