Hi authors,
Thank you for sharing this excellent work! I’ve conducted extensive reproduction experiments and consistently observed a significant performance gap on the MM-SafetyBench dataset across all three evaluated models, compared to the results reported in the paper (~92% ASR).
1. Reproduction Details
For each model, I performed five independent runs using the default configurations provided in config/mmsafetybench/. The results are summarized below, with H = HarmBench Judge ASR and L = LlamaGuard3 ASR:
| Model | Run 1 (H/L) | Run 2 (H/L) | Run 3 (H/L) | Run 4 (H/L) | Run 5 (H/L) | Mean ± Std (H/L) |
|---|---|---|---|---|---|---|
| Qwen2-VL-7B | 59.17%/46.79% | 62.56%/45.65% | 59.05%/47.80% | 66.90%/47.74% | 71.13%/47.62% | 63.76% ± 4.57% / 47.06% ± 0.58% |
| InternVL2-8B | 50.00%/45.71% | 50.54%/45.48% | 53.27%/44.05% | 51.79%/48.33% | 49.23%/45.00% | 50.95% ± 1.42% / 45.71% ± 1.56% |
| MiniGPT4-13B | 50.48%/43.15% | 48.21%/44.35% | 51.96%/44.88% | 50.24%/45.36% | 54.88%/46.25% | 51.35% ± 1.86% / 44.80% ± 1.10% |
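
For anyone who wants to recheck the summary column, here is a minimal sketch that recomputes the mean and standard deviation from the per-run values in the table above. It only uses Python's standard library. Which estimator (population or sample standard deviation) should be reported is an assumption on my side, so the sketch prints both; the printed figures may differ slightly from the summary column depending on the estimator and rounding.

```python
from statistics import mean, pstdev, stdev

# Per-run ASR values (%) copied from the table above: (HarmBench judge, LlamaGuard3).
runs = {
    "Qwen2-VL-7B": [(59.17, 46.79), (62.56, 45.65), (59.05, 47.80), (66.90, 47.74), (71.13, 47.62)],
    "InternVL2-8B": [(50.00, 45.71), (50.54, 45.48), (53.27, 44.05), (51.79, 48.33), (49.23, 45.00)],
    "MiniGPT4-13B": [(50.48, 43.15), (48.21, 44.35), (51.96, 44.88), (50.24, 45.36), (54.88, 46.25)],
}


def summarize(values):
    """Return (mean, population std, sample std) for a list of ASR percentages."""
    return mean(values), pstdev(values), stdev(values)


# Key: (model, judge), where judge is "H" (HarmBench) or "L" (LlamaGuard3).
summary = {}
for model, pairs in runs.items():
    for judge, idx in (("H", 0), ("L", 1)):
        summary[(model, judge)] = summarize([pair[idx] for pair in pairs])

for (model, judge), (m, p, s) in summary.items():
    print(f"{model:13s} {judge}: mean={m:.2f}  pstdev={p:.2f}  stdev={s:.2f}")
```

With only five runs, the sample standard deviation (`stdev`) is noticeably larger than the population one (`pstdev`), so it is worth stating which one a reported "± Std" refers to.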
2. Key Observations
- High consistency across runs: Standard deviations are low (<2% for LlamaGuard3, <5% for HarmBench), indicating the discrepancy is systematic rather than due to random variance.
- Consistent underperformance across models: All three models show significantly lower ASR than reported, especially under LlamaGuard3 evaluation (~44–47%), suggesting a potential issue specific to the MM-SafetyBench setup or evaluation protocol.
- Cross-judge validation: In addition to the HarmBench judge and LlamaGuard3, I evaluated with the same chatgpt-4o-mini detector mentioned in the original paper. It gave an ASR of only about 43%, which is close to the LlamaGuard3 results. The gap is therefore not specific to one evaluator; it appears under every judge I tried.
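
To make the cross-judge comparison concrete, here is a minimal sketch of how I think about it, assuming ASR is the percentage of responses that a judge labels as a successful attack. The verdict lists, the judge keys, and the helper names `asr` and `agreement` are hypothetical placeholders for illustration, not my actual outputs or anything from the repository.

```python
from itertools import combinations

# Hypothetical per-response verdicts (True = judged a successful attack).
# Placeholder values for illustration only, not real results.
verdicts = {
    "harmbench":   [True, True, False, True, False, True, False, False],
    "llamaguard3": [True, False, False, True, False, True, False, False],
    "gpt4o_mini":  [True, False, False, True, False, False, False, False],
}


def asr(labels):
    """Attack success rate in percent: share of responses flagged as successful."""
    return 100.0 * sum(labels) / len(labels)


def agreement(a, b):
    """Percentage of responses on which two judges give the same verdict."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


asr_by_judge = {judge: asr(labels) for judge, labels in verdicts.items()}
pairwise = {
    (a, b): agreement(verdicts[a], verdicts[b])
    for a, b in combinations(verdicts, 2)
}

for judge, value in asr_by_judge.items():
    print(f"{judge:12s} ASR = {value:.1f}%")
for (a, b), value in pairwise.items():
    print(f"{a} vs {b}: agreement = {value:.1f}%")
```

Scoring the same set of responses with every judge, as above, separates "the attack produced fewer harmful responses" from "the judges disagree about what counts as harmful".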
3. What I’ve Verified
- Used configuration files from config/mmsafetybench/{internvl2,minigpt4,qwen2vl}/
- Confirmed correct dataset loading via dataset/mmsafetybench_loader.py
- Reviewed attack implementations in attack/{model}_attack.py
- Validated that HarmBench results match expected trends
4. Critical Questions
Given the consistent gap in ASR across all models and evaluators (roughly 40 to 47 points against the reported ~92% for most model/judge pairs, and about 28 points for Qwen2-VL-7B under the HarmBench judge), I suspect the discrepancy may stem from differences in:
- Dataset version or subset: Which exact version or split of MM-SafetyBench was used in your experiments? (The dataset includes multiple subsets and annotation variants.)
- Preprocessing pipeline: Were there any MM-SafetyBench-specific image/text preprocessing steps not included in the public codebase?
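
For reference, the size of the gap per model and judge can be computed directly from the mean values in my results table. This is a small sketch that treats the paper's figure as exactly 92%, which is an approximation of the "~92%" quoted above; the variable names are my own.

```python
# Approximate ASR reported in the paper (the "~92%" quoted in this issue).
PAPER_ASR = 92.0

# Mean ASR (%) from my runs: model -> (HarmBench judge, LlamaGuard3).
reproduced = {
    "Qwen2-VL-7B": (63.76, 47.06),
    "InternVL2-8B": (50.95, 45.71),
    "MiniGPT4-13B": (51.35, 44.80),
}

# Gap in percentage points between the reported and the reproduced ASR.
gaps = {
    (model, judge): round(PAPER_ASR - value, 2)
    for model, (h, l) in reproduced.items()
    for judge, value in (("H", h), ("L", l))
}

for (model, judge), gap in gaps.items():
    print(f"{model:13s} {judge}: gap = {gap:.2f} points")

print(f"smallest gap: {min(gaps.values()):.2f}, largest gap: {max(gaps.values()):.2f}")
```

The gap size varies by model and judge, so knowing which judge produced the paper's headline number would help narrow down the cause.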
Would you be willing to share the exact experimental setup used for MM-SafetyBench in your paper, in particular:
- The dataset version/subset
- Any benchmark-specific parameters
- Evaluation methodology (e.g., how H and L scores were combined)

This would greatly aid accurate reproduction and community validation.
Thank you very much for your time and support!