Commit 26a3ed1

Move jailbreak guardrails into single pipeline class
We have three classes for the jailbreak guardrails. The logic is fairly simple, so it can live in a single class in the answer composition pipeline. I considered keeping JailbreakChecker as a separate class, but there didn't seem to be much value in doing so.

While merging these, I noticed that JailbreakChecker defined a ResponseError class that was never raised anywhere, so it was redundant and I've removed it. If we want, we could check that the LLM response matches either the pass value or the fail value, and create the object in the DB with the correct status if it matches neither. That would need additional prompt config for the fail value, so I've avoided it for now.

We will need follow-up work to update the evaluation repo. The rake task now returns the serialised answer, so the evaluation code will need to read the information it wants from that instead of from the result previously returned by JailbreakChecker.
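The stricter check described above could look roughly like the following. This is only a sketch under stated assumptions: `classify_jailbreak_response` and the `fail_value` argument are hypothetical (the commit message notes there is no fail value in the prompt config yet), and stripping whitespace before comparing is an extra choice that the committed code does not make.

```ruby
# Hypothetical helper, not part of this commit.
# Maps the raw LLM text onto :pass, :fail or :error instead of treating
# every non-pass response as a failure.
def classify_jailbreak_response(text, pass_value:, fail_value:)
  case text.to_s.strip
  when pass_value then :pass
  when fail_value then :fail
  else :error # unexpected output, e.g. the model ignored the prompt format
  end
end

# Example values only; the real pass value comes from prompt config.
puts classify_jailbreak_response("PASS", pass_value: "PASS", fail_value: "FAIL")
puts classify_jailbreak_response("I cannot help with that", pass_value: "PASS", fail_value: "FAIL")
```

With a three-way result like this, the pipeline step could record an error status for unexpected output rather than reporting it as a jailbreak attempt.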
1 parent d2482d5 commit 26a3ed1

10 files changed

Lines changed: 139 additions & 480 deletions

lib/answer_composition/composer.rb

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ def compose_answer
       case answer_strategy
       when "claude_structured_answer"
         PipelineRunner.call(question:, pipeline: [
-          Pipeline::JailbreakGuardrails.new(llm_provider: :claude),
+          Pipeline::JailbreakGuardrails,
           Pipeline::QuestionRephraser,
           Pipeline::QuestionRouter,
           Pipeline::QuestionRoutingGuardrails.new(llm_provider: :claude),
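The pipeline above now mixes a bare class (JailbreakGuardrails) with pre-configured instances (QuestionRoutingGuardrails). A minimal sketch of why both shapes can coexist, assuming the runner simply sends `call(context)` to each entry; `SketchContext`, `ClassStep`, `ConfiguredStep` and `run_pipeline` are hypothetical stand-ins, not the project's real PipelineRunner:

```ruby
# Hypothetical context object that just records which steps ran.
SketchContext = Struct.new(:log)

# Class-level step: the class itself is placed in the pipeline.
# `self.call` builds a fresh instance per run and delegates to #call.
class ClassStep
  def self.call(...) = new(...).call

  def initialize(context)
    @context = context
  end

  def call
    @context.log << "class step"
  end
end

# Instance-level step: configured up front, receives the context in #call.
class ConfiguredStep
  def initialize(llm_provider:)
    @llm_provider = llm_provider
  end

  def call(context)
    context.log << "configured step (#{@llm_provider})"
  end
end

# Assumed runner behaviour: every entry only needs to respond to call(context).
def run_pipeline(steps)
  context = SketchContext.new([])
  steps.each { |step| step.call(context) }
  context.log
end

puts run_pipeline([ClassStep, ConfiguredStep.new(llm_provider: :claude)])
```

Because the class-level form creates a new instance on every run, per-request state such as the context can be held in instance variables without leaking between requests.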

lib/answer_composition/pipeline/jailbreak_guardrails.rb

Lines changed: 72 additions & 21 deletions
@@ -1,45 +1,96 @@
 module AnswerComposition
   module Pipeline
     class JailbreakGuardrails
-      def initialize(llm_provider: :claude)
-        @llm_provider = llm_provider
+      SUPPORTED_MODELS = %i[claude_sonnet_4_0 claude_haiku_4_5].freeze
+      DEFAULT_MODEL = :claude_sonnet_4_0
+
+      def self.call(...) = new(...).call
+
+      def initialize(context)
+        @context = context
+        @model_id, @model_name = BedrockModels.determine_model(
+          ENV["BEDROCK_CLAUDE_JAILBREAK_GUARDRAILS_MODEL"],
+          DEFAULT_MODEL,
+          SUPPORTED_MODELS,
+        )
       end

-      def call(context)
+      def call
         start_time = Clock.monotonic_time
+        response = anthropic_bedrock_client.messages.create(
+          system: [{ type: "text", text: system_prompt }],
+          model: model_id,
+          messages:,
+          **inference_config,
+        )

-        response = Guardrails::JailbreakChecker.call(context.question.message, llm_provider)
-        context.answer.assign_attributes(jailbreak_guardrails_status: response.triggered ? :fail : :pass)
-        context.answer.assign_llm_response("jailbreak_guardrails", response.llm_response)
+        jailbreak_guardrails_status = response[:content][0][:text] == pass_value ? :pass : :fail
+
+        context.answer.assign_attributes(jailbreak_guardrails_status:)
+        context.answer.assign_llm_response("jailbreak_guardrails", response.to_h)
         context.answer.assign_metrics("jailbreak_guardrails", build_metrics(start_time, response))

-        if response.triggered
+        if jailbreak_guardrails_status == :fail
           context.abort_pipeline!(
             message: Answer::CannedResponses::JAILBREAK_GUARDRAILS_FAILED_MESSAGE,
             status: "guardrails_jailbreak",
           )
         end
-      rescue Guardrails::JailbreakChecker::ResponseError => e
-        context.abort_pipeline!(
-          message: Answer::CannedResponses::JAILBREAK_GUARDRAILS_FAILED_MESSAGE,
-          status: "error_jailbreak_guardrails",
-          jailbreak_guardrails_status: :error,
-          metrics: { "jailbreak_guardrails" => build_metrics(start_time, e) },
-          llm_response: { "jailbreak_guardrails" => e.llm_response },
-        )
       end

       private

-      attr_reader :llm_provider
+      attr_reader :context, :model_id, :model_name
+
+      def anthropic_bedrock_client
+        @anthropic_bedrock_client ||= Anthropic::BedrockClient.new(
+          aws_region: ENV["CLAUDE_AWS_REGION"],
+        )
+      end
+
+      def guardrails_llm_prompts
+        AnswerComposition::Pipeline::Claude.prompt_config(:jailbreak_guardrails, model_name)
+      end
+
+      # TODO: Move the common prompts into the claude config and use one set of prompts here.
+      def common_guardrails_llm_prompts
+        Rails.configuration.govuk_chat_private.llm_prompts.common.jailbreak_guardrails
+      end
+
+      def pass_value
+        common_guardrails_llm_prompts.fetch(:pass_value)
+      end
+
+      def max_tokens
+        guardrails_llm_prompts.fetch(:max_tokens)
+      end
+
+      def inference_config
+        {
+          max_tokens: max_tokens,
+          temperature: 0.0,
+        }
+      end
+
+      def messages
+        [{ role: "user", content: user_prompt }]
+      end
+
+      def user_prompt
+        guardrails_llm_prompts[:user_prompt].sub("{input}", context.question.message)
+      end
+
+      def system_prompt
+        guardrails_llm_prompts[:system_prompt]
+      end

-      def build_metrics(start_time, response_or_error)
+      def build_metrics(start_time, response)
         {
           duration: Clock.monotonic_time - start_time,
-          llm_prompt_tokens: response_or_error.llm_prompt_tokens,
-          llm_completion_tokens: response_or_error.llm_completion_tokens,
-          llm_cached_tokens: response_or_error.llm_cached_tokens,
-          model: response_or_error.model,
+          llm_prompt_tokens: response[:usage][:input_tokens],
+          llm_completion_tokens: response[:usage][:output_tokens],
+          llm_cached_tokens: nil,
+          model: response[:model],
         }
       end
     end
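A possible follow-up on `user_prompt` in the diff above: when `String#sub` is given a replacement string, Ruby interprets backslash sequences such as `\0` inside that replacement, even when the pattern is a plain string. A user message containing backslashes could therefore be altered before it reaches the model. The sketch below shows the block form, which inserts the message verbatim; `render_user_prompt` and the template text are hypothetical and not part of this commit.

```ruby
# Hypothetical helper: substitute the user's message into a prompt template.
# The block form of String#sub uses the block's return value as-is,
# so backslash sequences in the message are not treated as back-references.
def render_user_prompt(template, message)
  template.sub("{input}") { message }
end

template = "Classify this input: {input}"
message = 'see \0 here' # single-quoted, so it contains a literal backslash

# Replacement-string form: \0 is expanded to the matched text "{input}".
puts template.sub("{input}", message)

# Block form: the message is inserted unchanged.
puts render_user_prompt(template, message)
```

Whether this matters in practice depends on the kind of input the service receives, but the block form costs nothing and keeps the prompt faithful to what the user typed.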

lib/guardrails/claude/jailbreak_checker.rb

Lines changed: 0 additions & 72 deletions
This file was deleted.

lib/guardrails/jailbreak_checker.rb

Lines changed: 0 additions & 96 deletions
This file was deleted.

lib/tasks/evaluation.rake

Lines changed: 3 additions & 6 deletions
@@ -21,13 +21,10 @@ namespace :evaluation do
   task generate_jailbreak_guardrail_response: :environment do
     raise "Requires an INPUT env var" if ENV["INPUT"].blank?

-    begin
-      response = Guardrails::JailbreakChecker.call(ENV["INPUT"], :claude)
+    question = Question.new(message: ENV["INPUT"], conversation: Conversation.new)
+    answer = AnswerComposition::PipelineRunner.call(question:, pipeline: [AnswerComposition::Pipeline::JailbreakGuardrails])

-      puts({ success: response }.to_json)
-    rescue Guardrails::JailbreakChecker::ResponseError => e
-      puts({ response_error: e }.to_json)
-    end
+    puts(answer.serialize_for_evaluation.to_json)
   end

   desc "Produce the output guardrails response for a user input"

spec/lib/answer_composition/composer_spec.rb

Lines changed: 1 addition & 2 deletions
@@ -26,10 +26,9 @@ def stub_pipeline_initialize(klass, *args, **kwargs)
       it "calls PipelineRunner with the correct pipeline" do
         stub_pipeline_initialize(AnswerComposition::Pipeline::QuestionRoutingGuardrails, llm_provider: :claude)
         stub_pipeline_initialize(AnswerComposition::Pipeline::AnswerGuardrails, llm_provider: :claude)
-        stub_pipeline_initialize(AnswerComposition::Pipeline::JailbreakGuardrails, llm_provider: :claude)

         expected_pipeline = [
-          AnswerComposition::Pipeline::JailbreakGuardrails.new(llm_provider: :claude),
+          AnswerComposition::Pipeline::JailbreakGuardrails,
           AnswerComposition::Pipeline::QuestionRephraser,
           AnswerComposition::Pipeline::QuestionRouter,
           AnswerComposition::Pipeline::QuestionRoutingGuardrails.new(llm_provider: :claude),
