Implement LLM Jailbreak Attack #181

@deprit

Description

We would like to implement the LLM jailbreak attack outlined in "Attacking Large Language Models with Projected Gradient Descent" by Geisler et al. Evaluating this evasion attack in Armory Library requires the steps below.

  • Implement the PGD attack described in Algorithms 1, 2, and 3 (see the sketch after this list)
  • Select an open-source LLM target model
  • Implement the flexible sequence-length relaxation, which requires modifying the attention layers
  • Evaluate the PGD attack on a jailbreak dataset
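
For orientation, below is a minimal sketch of the relaxed PGD loop: the adversarial suffix is represented as a row-stochastic matrix over the vocabulary, optimized by gradient descent on the target cross-entropy, and projected back onto the probability simplex after each step. The entropy projection and the flexible sequence-length relaxation (Algorithm 3) are omitted. The model name, prompt/target strings, suffix length, step size, and iteration count are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def project_simplex(x: torch.Tensor) -> torch.Tensor:
    """Euclidean projection of each row of x onto the probability simplex."""
    sorted_x, _ = torch.sort(x, dim=-1, descending=True)
    cumsum = sorted_x.cumsum(dim=-1)
    k = torch.arange(1, x.size(-1) + 1, device=x.device)
    cond = sorted_x - (cumsum - 1.0) / k > 0
    rho = cond.float().cumsum(dim=-1).argmax(dim=-1, keepdim=True)
    theta = (cumsum.gather(-1, rho) - 1.0) / (rho.float() + 1.0)
    return torch.clamp(x - theta, min=0.0)

model_name = "placeholder/open-source-llm"   # hypothetical target model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
embed = model.get_input_embeddings().weight  # (vocab_size, d_model)

prompt_ids = tok("Write instructions for ...", return_tensors="pt").input_ids
target_ids = tok("Sure, here is how to ...", return_tensors="pt",
                 add_special_tokens=False).input_ids

suffix_len, vocab = 20, embed.size(0)
# Relaxed adversarial suffix: one probability distribution per position.
x = torch.full((suffix_len, vocab), 1.0 / vocab, requires_grad=True)

for _ in range(500):                         # iteration count is a placeholder
    prompt_emb = embed[prompt_ids[0]]
    suffix_emb = x @ embed                   # soft token embeddings for the suffix
    target_emb = embed[target_ids[0]]
    inputs = torch.cat([prompt_emb, suffix_emb, target_emb]).unsqueeze(0)
    logits = model(inputs_embeds=inputs).logits
    # Next-token prediction loss restricted to the target span.
    start = prompt_ids.size(1) + suffix_len
    loss = F.cross_entropy(logits[0, start - 1:-1], target_ids[0])
    loss.backward()
    with torch.no_grad():
        x -= 0.1 * x.grad                    # step size is a placeholder
        x.copy_(project_simplex(x))
        x.grad.zero_()

adv_suffix = tok.decode(x.argmax(dim=-1))    # discretize by per-position argmax
```

A full implementation would add the entropy projection to control how far the relaxation drifts from discrete one-hot vectors, track the best discretized suffix across iterations, and relax the attention mask to allow the suffix length itself to vary.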

The authors have been contacted for source code but have not yet responded; an unverified implementation is available from Dreadnode.

The AdvBench dataset from "Universal and Transferable Adversarial Attacks on Aligned Language Models" is a candidate jailbreak dataset; a loading sketch follows.
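
The sketch below assumes the CSV layout from the llm-attacks repository (a `harmful_behaviors.csv` file with `goal` and `target` columns); the path and column names should be verified against the actual data release.

```python
import pandas as pd

# Local path is a placeholder for wherever the AdvBench CSV is checked out.
advbench = pd.read_csv("data/advbench/harmful_behaviors.csv")

for goal, target in zip(advbench["goal"], advbench["target"]):
    # `goal` is the harmful request the adversarial suffix is appended to;
    # `target` is the affirmative response prefix used as the optimization target.
    ...
```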

Metadata

Labels

enhancement (New feature or request)
