Hi, thanks for the great work.
I am currently experimenting with the code and have a question about attack transferability.
Experiment Details: I replaced the initial image used to generate the adversarial images. Aside from this change, all other configurations remain consistent with the original repository.
Results:
AdvBench Subset: The Attack Success Rate (ASR) is 98%.
MMSafeBench: When the same attack is transferred to MMSafeBench, the ASR drops to 51%.
Evaluator: Both benchmarks were evaluated using HarmBench-Llama-2-13b-cls.
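For clarity, by ASR I mean the percentage of prompts whose response the judge labels as harmful. A minimal sketch of that calculation, assuming the judge's verdicts have already been collected as "yes"/"no" strings (the helper name and the input format are illustrative assumptions, not code from the repository):

```python
def attack_success_rate(verdicts):
    """Return the percentage of verdicts equal to 'yes' (case-insensitive)."""
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    # Count a prompt as a successful attack only when the judge says "yes".
    hits = sum(1 for v in verdicts if v.strip().lower() == "yes")
    return 100.0 * hits / len(verdicts)


# Hypothetical verdicts for four prompts, not real results.
example_verdicts = ["Yes", "no", "yes", "Yes "]
print(attack_success_rate(example_verdicts))  # 75.0
```

If the original paper counted successes differently (for example with a different judge or a keyword-based refusal check), the two numbers would not be directly comparable.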
Question: Could this performance gap be caused by the change of initial image? Or is it possible that my evaluator (HarmBench-Llama-2-13b-cls) differs from the judge or evaluation setting used in the original paper?
Any insights would be appreciated. Thanks!