Implement LLM Jailbreak Attack #181

@deprit

Description

We would like to implement the LLM jailbreak attack outlined in "Attacking Large Language Models with Projected Gradient Descent" by Geisler et al. Evaluating this evasion attack in Armory Library requires the steps below.

  • Implement the PGD attack described in Algorithms 1, 2, and 3 (see the sketch after this list)
  • Select an open-source LLM target model
  • Implement the flexible sequence-length relaxation, which requires modifying the attention layers
  • Evaluate the PGD attack on a jailbreak dataset
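
For orientation, below is a minimal sketch of the relaxed PGD loop: the adversarial suffix is represented as a row-stochastic matrix over the vocabulary, optimized by gradient descent on the target cross-entropy, and projected back onto the probability simplex after each step. The entropy projection and the flexible sequence-length relaxation (Algorithm 3) are omitted. The model name, prompt/target strings, suffix length, step size, and iteration count are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def project_simplex(x: torch.Tensor) -> torch.Tensor:
    """Euclidean projection of each row of x onto the probability simplex."""
    sorted_x, _ = torch.sort(x, dim=-1, descending=True)
    cumsum = sorted_x.cumsum(dim=-1)
    k = torch.arange(1, x.size(-1) + 1, device=x.device)
    cond = sorted_x - (cumsum - 1.0) / k > 0
    rho = cond.float().cumsum(dim=-1).argmax(dim=-1, keepdim=True)
    theta = (cumsum.gather(-1, rho) - 1.0) / (rho.float() + 1.0)
    return torch.clamp(x - theta, min=0.0)

model_name = "placeholder/open-source-llm"   # hypothetical target model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
embed = model.get_input_embeddings().weight  # (vocab_size, d_model)

prompt_ids = tok("Write instructions for ...", return_tensors="pt").input_ids
target_ids = tok("Sure, here is how to ...", return_tensors="pt",
                 add_special_tokens=False).input_ids

suffix_len, vocab = 20, embed.size(0)
# Relaxed adversarial suffix: one probability distribution per position.
x = torch.full((suffix_len, vocab), 1.0 / vocab, requires_grad=True)

for _ in range(500):                         # iteration count is a placeholder
    prompt_emb = embed[prompt_ids[0]]
    suffix_emb = x @ embed                   # soft token embeddings for the suffix
    target_emb = embed[target_ids[0]]
    inputs = torch.cat([prompt_emb, suffix_emb, target_emb]).unsqueeze(0)
    logits = model(inputs_embeds=inputs).logits
    # Next-token prediction loss restricted to the target span.
    start = prompt_ids.size(1) + suffix_len
    loss = F.cross_entropy(logits[0, start - 1:-1], target_ids[0])
    loss.backward()
    with torch.no_grad():
        x -= 0.1 * x.grad                    # step size is a placeholder
        x.copy_(project_simplex(x))
        x.grad.zero_()

adv_suffix = tok.decode(x.argmax(dim=-1))    # discretize by per-position argmax
```

A full implementation would add the entropy projection to control how far the relaxation drifts from discrete one-hot vectors, track the best discretized suffix across iterations, and relax the attention mask to allow the suffix length itself to vary.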

The authors have been contacted for source code but have not yet responded; an unverified implementation is available from Dreadnode.

The AdvBench dataset from "Universal and Transferable Adversarial Attacks on Aligned Language Models" is a candidate jailbreak dataset; a loading sketch follows.
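
The sketch below assumes the CSV layout from the llm-attacks repository (a `harmful_behaviors.csv` file with `goal` and `target` columns); the path and column names should be verified against the actual data release.

```python
import pandas as pd

# Local path is a placeholder for wherever the AdvBench CSV is checked out.
advbench = pd.read_csv("data/advbench/harmful_behaviors.csv")

for goal, target in zip(advbench["goal"], advbench["target"]):
    # `goal` is the harmful request the adversarial suffix is appended to;
    # `target` is the affirmative response prefix used as the optimization target.
    ...
```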

Metadata

Labels

enhancement (New feature or request)
