Status: Open
Labels: enhancement (New feature or request)
Description
We would like to implement the LLM jailbreak attack outlined in "Attacking Large Language Models with Projected Gradient Descent" by Geisler et al. Evaluating this evasion attack in Armory Library requires the steps below.
- Implement the PGD attack described in Algorithms 1, 2, and 3 (see the sketch after this list)
- Select an open-source LLM target model
- Implement the flexible sequence-length relaxation, which requires modifying the attention layers
- Evaluate the PGD attack on a jailbreak dataset
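For orientation, here is a minimal PyTorch sketch of the core idea: the adversarial suffix is relaxed to a distribution over the vocabulary, optimized with signed-gradient PGD steps, and projected back onto the probability simplex. This is not the paper's exact Algorithms 1, 2, and 3 (the entropy projection and the flexible sequence-length relaxation are omitted), and `pgd_suffix_attack`, `project_simplex`, and all hyperparameters are placeholder names chosen here for illustration.

```python
# Hedged sketch, not the paper's exact Algorithms 1-3: the entropy projection
# and the flexible sequence-length relaxation are omitted, and all names and
# hyperparameters below are placeholders chosen for illustration.
import torch
import torch.nn.functional as F


def project_simplex(x: torch.Tensor) -> torch.Tensor:
    """Euclidean projection of each row of x onto the probability simplex."""
    sorted_x, _ = torch.sort(x, dim=-1, descending=True)
    cumsum = sorted_x.cumsum(dim=-1)
    k = torch.arange(1, x.size(-1) + 1, device=x.device)
    support = sorted_x + (1.0 - cumsum) / k > 0
    rho = support.float().cumsum(dim=-1).argmax(dim=-1, keepdim=True)
    theta = (1.0 - cumsum.gather(-1, rho)) / (rho + 1)
    return torch.clamp(x + theta, min=0.0)


def pgd_suffix_attack(model, embedding_matrix, prompt_embeds, target_ids,
                      suffix_len=20, steps=500, step_size=0.1):
    """Optimize a relaxed (one-hot-like) adversarial suffix with PGD."""
    vocab_size = embedding_matrix.size(0)
    device = prompt_embeds.device
    # Continuous relaxation: each suffix position is a distribution over tokens.
    suffix = torch.full((suffix_len, vocab_size), 1.0 / vocab_size,
                        device=device, requires_grad=True)
    target_embeds = embedding_matrix[target_ids].unsqueeze(0)

    for _ in range(steps):
        suffix_embeds = (suffix @ embedding_matrix).unsqueeze(0)  # soft embeddings
        inputs = torch.cat([prompt_embeds, suffix_embeds, target_embeds], dim=1)
        logits = model(inputs_embeds=inputs).logits
        # Cross-entropy on the positions that predict the target continuation.
        tgt_len = target_ids.size(0)
        loss = F.cross_entropy(logits[0, -tgt_len - 1:-1], target_ids)
        loss.backward()
        with torch.no_grad():
            suffix -= step_size * suffix.grad.sign()   # signed-gradient PGD step
            suffix.copy_(project_simplex(suffix))      # project back onto simplex
            suffix.grad.zero_()

    # Discretize: most likely token at each suffix position.
    return suffix.detach().argmax(dim=-1)
```

With a Hugging Face causal LM, `embedding_matrix` could come from `model.get_input_embeddings().weight` and `prompt_embeds` from embedding the tokenized prompt; the discrete suffix returned here would then need to be re-evaluated on the actual model.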
The authors have been contacted for source code but have not yet responded; an unverified implementation is available from Dreadnode.
The AdvBench dataset from "Universal and Transferable Adversarial Attacks on Aligned Language Models" (Zou et al.) may be used for evaluation.
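If AdvBench is used, a small loader along these lines may suffice. This sketch assumes the `harmful_behaviors.csv` layout (columns `goal` and `target`) from the llm-attacks repository and a local copy of the file; the path is a placeholder.

```python
# Hedged sketch: load AdvBench harmful behaviors as (goal, target) pairs.
# Assumes the harmful_behaviors.csv layout (columns "goal" and "target")
# from the llm-attacks repository; the path below is a placeholder.
import csv


def load_advbench(path="harmful_behaviors.csv"):
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    # Each row pairs a harmful request ("goal") with the affirmative
    # response prefix ("target") the attack tries to elicit.
    return [(row["goal"], row["target"]) for row in rows]


behaviors = load_advbench()
print(f"{len(behaviors)} behaviors loaded; first goal: {behaviors[0][0]!r}")
```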