Reproduction Discrepancy on MM-SafetyBench – Request for Clarification on Experimental Setup

Hi authors,
Thank you for sharing this excellent work! I’ve conducted extensive reproduction experiments and consistently observed a significant performance gap on the MM-SafetyBench dataset across all three evaluated models, compared to the results reported in the paper (~92% ASR).

1. Reproduction Details

For each model, I performed five independent runs using the default configurations provided in config/mmsafetybench/. The results are summarized below, with H = HarmBench Judge ASR and L = LlamaGuard3 ASR:

| Model | Run 1 (H/L) | Run 2 (H/L) | Run 3 (H/L) | Run 4 (H/L) | Run 5 (H/L) | Mean ± Std (H/L) |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2-VL-7B | 59.17%/46.79% | 62.56%/45.65% | 59.05%/47.80% | 66.90%/47.74% | 71.13%/47.62% | 63.76% ± 4.57% / 47.06% ± 0.58% |
| InternVL2-8B | 50.00%/45.71% | 50.54%/45.48% | 53.27%/44.05% | 51.79%/48.33% | 49.23%/45.00% | 50.95% ± 1.42% / 45.71% ± 1.56% |
| MiniGPT4-13B | 50.48%/43.15% | 48.21%/44.35% | 51.96%/44.88% | 50.24%/45.36% | 54.88%/46.25% | 51.35% ± 1.86% / 44.80% ± 1.10% |
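
For anyone who wants to recompute the summary column, here is a minimal sketch using only the Python standard library. The helper name `summarize` is illustrative, and since I am not asserting which standard deviation convention applies, it reports both the population and the sample value. Numbers recomputed from the rounded per-run percentages may differ slightly from the summary column.

```python
from statistics import mean, pstdev, stdev

# Per-run HarmBench ASR (%) for Qwen2-VL-7B, copied from the table above.
runs = [59.17, 62.56, 59.05, 66.90, 71.13]

def summarize(values):
    """Return (mean, population std, sample std) for a list of run results."""
    return mean(values), pstdev(values), stdev(values)

m, pop_sd, samp_sd = summarize(runs)
print(f"mean={m:.2f}%  population std={pop_sd:.2f}  sample std={samp_sd:.2f}")
```

Swapping in another row's values gives the corresponding statistics for that model and judge.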

2. Key Observations

- High consistency across runs: Standard deviations are low (<2% for LlamaGuard3, <5% for HarmBench), indicating that the discrepancy is systematic rather than due to random variance.
- Consistent underperformance across models: All three models show significantly lower ASR than reported, especially under LlamaGuard3 evaluation (~44–47%), suggesting a potential issue specific to the MM-SafetyBench setup or evaluation protocol.
- Validation with other judges: Beyond LlamaGuard3, I ran validation experiments with the HarmBench model as the judge and with the same chatgpt-4o-mini detector mentioned in the paper. The ASR obtained with chatgpt-4o-mini reached only about 43%, which closely matches the LlamaGuard3 results. This indicates that the performance gap is not isolated to one evaluator but is consistent across assessment methods.
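
To make the comparison between judges concrete, here is a small sketch of how I think about it. It assumes ASR is the share of responses a judge labels as harmful; the function names and the toy verdict lists are purely illustrative and are not taken from the repository or from my actual runs.

```python
def attack_success_rate(verdicts):
    """ASR in percent: share of responses a judge flagged as harmful."""
    return 100.0 * sum(verdicts) / len(verdicts)

def agreement(a, b):
    """Percent of samples on which two judges give the same verdict."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

# Toy verdicts (True = judged harmful); NOT real data from my runs.
harmbench = [True, True, False, True, False, False, True, False]
llamaguard = [True, False, False, True, False, False, False, False]

print(f"HarmBench ASR:   {attack_success_rate(harmbench):.1f}%")
print(f"LlamaGuard3 ASR: {attack_success_rate(llamaguard):.1f}%")
print(f"Agreement:       {agreement(harmbench, llamaguard):.1f}%")
```

A per-sample agreement figure like this would help tell apart a stricter judge from a pipeline that produces different responses altogether.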

3. What I’ve Verified

- Used configuration files from config/mmsafetybench/{internvl2,minigpt4,qwen2vl}/
- Confirmed correct dataset loading via dataset/mmsafetybench_loader.py
- Reviewed attack implementations in attack/{model}_attack.py
- Validated that HarmBench results match expected trends

4. Critical Questions

Given the consistent gap in ASR across all models and evaluators (about 45 points under LlamaGuard3), I suspect the discrepancy may stem from differences in:

- Dataset version or subset: Which exact version or split of MM-SafetyBench was used in your experiments? (The dataset includes multiple subsets and annotation variants.)
- Preprocessing pipeline: Were there any MM-SafetyBench-specific image/text preprocessing steps not included in the public codebase?
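
One way to rule out a dataset mismatch would be to compare a fingerprint of the local dataset directory on both sides. The sketch below is only a suggestion: `fingerprint` is a hypothetical helper, not part of the repository, and it assumes the dataset is a directory of files that fit in memory one at a time.

```python
import hashlib
from pathlib import Path

def fingerprint(root):
    """Return (file_count, sha256 hex digest) over all files under root.

    Files are visited in sorted path order, and each file's relative path
    is hashed along with its bytes, so the digest is stable across machines.
    """
    root = Path(root)
    digest = hashlib.sha256()
    count = 0
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
        count += 1
    return count, digest.hexdigest()
```

If the file count and digest from your copy of MM-SafetyBench match mine, the dataset version can be excluded as a cause.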

Would you be willing to share the exact experimental setup used for MM-SafetyBench in your paper—particularly:

- The dataset version/subset
- Any benchmark-specific parameters
- Evaluation methodology (e.g., how H and L scores were combined)

This would greatly aid accurate reproduction and community validation.

Thank you very much for your time and support!
