Merged
101 changes: 101 additions & 0 deletions lib/auto_evaluation/faithfulness.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
class AutoEvaluation::Faithfulness
  THRESHOLD = 0.5

  def self.call(...) = new(...).call

  def initialize(answer)
    @answer = answer
    @llm_responses = {}
    @metrics = {}
  end

  def call
    truths, llm_responses[:truths], metrics[:truths] = TruthsGenerator.call(retrieval_context:)

    if truths.empty?
      return build_maximum_score_result(
        reason: "No truths were extracted from the retrieval context.",
        llm_responses:,
        metrics:,
      )
    end

    claims, llm_responses[:claims], metrics[:claims] = ClaimsGenerator.call(answer_message:)

    if claims.empty?
      return build_maximum_score_result(
        reason: "No claims were extracted from the answer.",
        llm_responses:,
        metrics:,
      )
    end

    verdicts, llm_responses[:verdicts], metrics[:verdicts] = VerdictsGenerator.call(
      claims:, truths:,
    )

    if verdicts.empty?
      return build_maximum_score_result(
        reason: "No verdicts were generated for the extracted claims.",
        llm_responses:,
        metrics:,
      )
    end

    if verdicts.none? { |verdict| verdict["verdict"].strip.downcase == "no" }
      return build_maximum_score_result(
        reason: "The response is fully supported by the retrieval context.",
        llm_responses:,
        metrics:,
      )
    end

    score = calculate_score(verdicts)

    reason, llm_responses[:reason], metrics[:reason] = ReasonGenerator.call(
      score: score.round(2), verdicts:,
    )

    AutoEvaluation::ScoreResult.new(
      score:,
      reason:,
      success: score >= THRESHOLD,
      llm_responses:,
      metrics:,
    )
  end

  private

  attr_reader :answer
  attr_accessor :llm_responses, :metrics

  def answer_message
    answer.message
  end

  def retrieval_context
    used_sources.map(&:plain_content).join("\n\n")
  end

  def calculate_score(verdicts)
    return 1.0 if verdicts.empty?

    faithful_count = verdicts.count { |verdict| verdict["verdict"].strip.downcase != "no" }
    faithful_count.to_d / verdicts.count
  end

  def used_sources
    answer.sources.used
Contributor

I think there's a little bug here that I encountered and which took me a while to figure out.

Using the used scope here actually doesn't return any records when this class is called from the Rake task, because that task doesn't persist any records to the database.

The answer that is built in the Rake task (via AutoEvaluation::EvaluateAnswerFromQuestionMessage) is built by the pipeline runner and isn't saved to the database. When you call answer.sources.used it runs a DB query to fetch the sources, but because none are persisted, it always returns an empty relation.

If I run the Rake task as it stands on main, I get this:

$ INPUT="How do I start a new business?" rake evaluation:generate_faithfulness_evaluation
{"score":1.0,"reason":"No truths were extracted from the retrieval context.","success":true,"llm_responses":{"truths":{"choices":[{"finish_reason":"tool_calls","index":0,"logprobs":null,"message":{"content":"{\n    \"truths\": []\n}","refusal":null,"role":"assistant","tool_calls":[{"function":{"arguments":"{\n  \"truths\": []\n}","name":"extract_truths"},"id":"chatcmpl-tool-b63f81baf8e1bc87","type":"function"}]}}],"created":1767873892,"id":"chatcmpl-a0f5b69c-372c-4bb8-99d9-4e8015b32457","model":"openai.gpt-oss-120b-1:0","object":"chat.completion","service_tier":"default","usage":{"completion_tokens":150,"prompt_tokens":347,"prompt_tokens_details":{"audio_tokens":0,"cached_tokens":64},"total_tokens":497}}},"metrics":{"truths":{"duration":0.8339748330181465,"llm_prompt_tokens":347,"llm_completion_tokens":150,"llm_cached_tokens":null,"model":"openai.gpt-oss-120b-1:0"}}}

So no search results were passed in as the retrieval context.

Whereas if I change this line to answer.sources.select(&:used) (i.e. filtering the array rather than using the scope), I get the expected result:

$ INPUT="How do I start a new business?" rake evaluation:generate_faithfulness_evaluation
{"score":1.0,"reason":"The response is fully supported by the retrieval context.","success":true,"llm_responses":{"truths":{"choices":[{"finish_reason":"tool_calls","index":0,"logprobs":null,"message":{"content":"{\n  \"truths\": [\n    \"Starting your own business can be a thrilling and rewarding endeavor.\",\n    \"It is important to begin with a solid business plan and clear objectives when starting a business.\",\n    \"Researching your market thoroughly and understanding your competition is recommended.\",\n...
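
The difference between the two calls can be illustrated without a database. The sketch below is a plain-Ruby simulation, not ActiveRecord itself: Source, FakeSources and the persisted: argument are hypothetical stand-ins. It assumes, as described above, that the used scope only sees persisted rows, while enumerating the association yields the records that were built in memory.

```ruby
# Hypothetical stand-ins for illustration; these are not the app's real models.
Source = Struct.new(:plain_content, :used)

# Simulates an association on an unsaved record: records exist in memory,
# but nothing has been written to the "database".
class FakeSources
  include Enumerable

  def initialize(built, persisted: [])
    @built = built
    @persisted = persisted
  end

  # Enumerable methods such as select operate on the in-memory records.
  def each(&block)
    @built.each(&block)
  end

  # Scope-like method: it only sees rows that were persisted.
  def used
    @persisted.select(&:used)
  end
end

sources = FakeSources.new([
  Source.new("Guidance on starting a business", true),
  Source.new("Unrelated guidance", false),
])

scoped = sources.used              # nothing persisted, so nothing is returned
filtered = sources.select(&:used)  # filters the built records in memory

puts scoped.length
puts filtered.map(&:plain_content).inspect
```

With an empty result from the scope, retrieval_context becomes an empty string, which would explain the "No truths were extracted" output shown above.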

Member

Oh, great spot. It's a shame it's such a pain to get integration tests for the rake tasks, as this feels like something we'd want a test to catch.

I imagine we want the tests for the Faithfulness class (and other auto eval routes) to use FactoryBot.build(:answer) rather than FactoryBot.create(:answer), since the answer may not be persisted.

  end

  def build_maximum_score_result(reason:, llm_responses:, metrics:)
    AutoEvaluation::ScoreResult.new(
      score: 1.0,
      reason:,
      success: true,
      llm_responses:,
      metrics:,
    )
  end
end
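
For anyone checking the scoring logic, here is a minimal standalone sketch of how verdicts map to a score and a pass/fail result. It mirrors calculate_score but swaps to_d for Float division (fdiv) so it runs without Rails or BigDecimal. The method name faithfulness_score, the "unsure" verdict value and the sample reasons are made up for illustration.

```ruby
threshold = 0.5

# Any verdict other than "no" counts as faithful, after trimming and downcasing.
def faithfulness_score(verdicts)
  return 1.0 if verdicts.empty?

  faithful_count = verdicts.count { |verdict| verdict["verdict"].strip.downcase != "no" }
  faithful_count.fdiv(verdicts.count)
end

verdicts = [
  { "verdict" => "yes" },
  { "verdict" => " No ", "reason" => "Contradicts the retrieval context." },
  { "verdict" => "unsure" }, # hypothetical value; anything but "no" is counted
  { "verdict" => "no", "reason" => "Not supported by the retrieval context." },
]

score = faithfulness_score(verdicts)
puts score.round(2)
puts score >= threshold
```

Two of the four verdicts are "no", so the score lands exactly on the threshold and still counts as a success because the comparison is >=.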
35 changes: 35 additions & 0 deletions lib/auto_evaluation/faithfulness/claims_generator.rb
@@ -0,0 +1,35 @@
module AutoEvaluation
  class Faithfulness::ClaimsGenerator
    def self.call(...) = new(...).call

    def initialize(answer_message:)
      @answer_message = answer_message
    end

    def call
      result = BedrockOpenAIOssInvoke.call(user_prompt, tools)
      [result.evaluation_data.fetch("claims"), result.llm_response, result.metrics]
    end

  private

    attr_reader :answer_message

    def llm_prompts
      Prompts.config
             .faithfulness
             .fetch(:claims)
    end

    def user_prompt
      sprintf(
        llm_prompts.fetch(:user_prompt),
        answer: answer_message,
      )
    end

    def tools
      [llm_prompts.fetch(:tool_spec)]
    end
  end
end
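
The generators build their prompts by passing keyword arguments to sprintf, which relies on Ruby's named format references. The template text below is invented for illustration (the real templates are loaded from the prompts config and are not shown in this diff); the sketch only assumes they use %{name} placeholders.

```ruby
# Hypothetical template; the real one is loaded via Prompts.config.
template = "Extract the factual claims from this answer:\n%{answer}"

prompt = sprintf(template, answer: "Einstein won the Nobel Prize in 1968.")
puts prompt

# A placeholder with no matching key raises KeyError rather than
# silently producing an incomplete prompt.
begin
  sprintf(template, response: "wrong key")
rescue KeyError => e
  puts "KeyError: #{e.message}"
end
```

One consequence of using format strings is that any literal percent sign in a template has to be written as %%.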
42 changes: 42 additions & 0 deletions lib/auto_evaluation/faithfulness/reason_generator.rb
@@ -0,0 +1,42 @@
module AutoEvaluation
  class Faithfulness::ReasonGenerator
    def self.call(...) = new(...).call

    def initialize(score:, verdicts:)
      @score = score
      @verdicts = verdicts
    end

    def call
      result = BedrockOpenAIOssInvoke.call(user_prompt, tools)
      [result.evaluation_data.fetch("reason"), result.llm_response, result.metrics]
    end

  private

    attr_reader :score, :verdicts

    def llm_prompts
      Prompts.config
             .faithfulness
             .fetch(:reason)
    end

    def user_prompt
      sprintf(
        llm_prompts.fetch(:user_prompt),
        score:,
        contradictions:,
      )
    end

    def tools
      [llm_prompts.fetch(:tool_spec)]
    end

    def contradictions
      verdicts.select { |verdict| verdict["verdict"].strip.downcase == "no" }
              .map { |verdict| verdict["reason"] }
    end
  end
end
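
The contradictions passed to the reason prompt are the reason strings of the "no" verdicts. The sketch below reproduces that filter on the verdicts used in the spec. The to_s call is an addition for illustration and is not part of this PR: verdict["verdict"].strip would raise NoMethodError if an entry lacked the key, and whether that can happen depends on the tool spec, which is not shown here. The third entry is a hypothetical malformed verdict.

```ruby
# Standalone version of the contradictions filter, with a defensive to_s.
def contradictions(verdicts)
  verdicts
    .select { |verdict| verdict["verdict"].to_s.strip.downcase == "no" }
    .map { |verdict| verdict["reason"] }
end

verdicts = [
  { "verdict" => "no", "reason" => "The retrieval context states Einstein won in 1921, not 1968." },
  { "verdict" => "yes" },
  { "reason" => "Verdict key missing" }, # hypothetical malformed entry
]

puts contradictions(verdicts).inspect
```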
35 changes: 35 additions & 0 deletions lib/auto_evaluation/faithfulness/truths_generator.rb
@@ -0,0 +1,35 @@
module AutoEvaluation
  class Faithfulness::TruthsGenerator
    def self.call(...) = new(...).call

    def initialize(retrieval_context:)
      @retrieval_context = retrieval_context
    end

    def call
      result = BedrockOpenAIOssInvoke.call(user_prompt, tools)
      [result.evaluation_data.fetch("truths"), result.llm_response, result.metrics]
    end

  private

    attr_reader :retrieval_context

    def llm_prompts
      Prompts.config
             .faithfulness
             .fetch(:truths)
    end

    def user_prompt
      sprintf(
        llm_prompts.fetch(:user_prompt),
        retrieval_context:,
      )
    end

    def tools
      [llm_prompts.fetch(:tool_spec)]
    end
  end
end
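
All four generators share the same shape: a class-level call that forwards its arguments to new, and an instance call that returns a three-element array which the Faithfulness class destructures straight into its llm_responses and metrics hashes. The sketch below shows that pattern in isolation. SentenceGenerator is hypothetical and splits text locally instead of calling an LLM.

```ruby
# Hypothetical generator with the same shape as the classes in this PR.
class SentenceGenerator
  # Forwards every argument to new, then invokes the instance.
  def self.call(...) = new(...).call

  def initialize(text:)
    @text = text
  end

  # Returns [data, llm_response, metrics], like the real generators.
  def call
    sentences = @text.split(". ")
    [sentences, { "raw" => @text }, { duration: 0.0 }]
  end
end

llm_responses = {}
metrics = {}

# Multiple assignment can target hash elements directly.
truths, llm_responses[:truths], metrics[:truths] =
  SentenceGenerator.call(text: "A is true. B is true")

puts truths.inspect
puts llm_responses.keys.inspect
puts metrics.inspect
```

Because the hashes are filled in as each step completes, an early return can still report every LLM response and metric collected up to that point.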
37 changes: 37 additions & 0 deletions lib/auto_evaluation/faithfulness/verdicts_generator.rb
@@ -0,0 +1,37 @@
module AutoEvaluation
  class Faithfulness::VerdictsGenerator
    def self.call(...) = new(...).call

    def initialize(claims:, truths:)
      @claims = claims
      @truths = truths
    end

    def call
      result = BedrockOpenAIOssInvoke.call(user_prompt, tools)
      [result.evaluation_data.fetch("verdicts"), result.llm_response, result.metrics]
    end

  private

    attr_reader :claims, :truths

    def llm_prompts
      Prompts.config
             .faithfulness
             .fetch(:verdicts)
    end

    def user_prompt
      sprintf(
        llm_prompts.fetch(:user_prompt),
        claims:,
        retrieval_context: truths.join("\n\n"),
      )
    end

    def tools
      [llm_prompts.fetch(:tool_spec)]
    end
  end
end
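
One detail worth checking in the verdicts prompt: truths are joined into text before interpolation, whereas claims is passed as an array. With %{name} references, sprintf calls to_s on the value, so an array is rendered in its inspect form. The template below is invented for illustration; whether the real prompt is written to expect that form is not visible in this diff.

```ruby
# Hypothetical template standing in for the configured verdicts prompt.
template = "Claims:\n%{claims}\n\nContext:\n%{retrieval_context}"

claims = [
  "Einstein won the Nobel Prize in 1968.",
  "Einstein won the Nobel Prize for the photoelectric effect.",
]
truths = [
  "Einstein won the Nobel Prize in 1921.",
  "The prize was awarded for the photoelectric effect.",
]

prompt = sprintf(template, claims: claims, retrieval_context: truths.join("\n\n"))
puts prompt
```

The claims appear bracketed and quoted, as Array#to_s produces, while the truths appear as plain paragraphs separated by blank lines.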
16 changes: 16 additions & 0 deletions lib/tasks/evaluation.rake
@@ -204,4 +204,20 @@ namespace :evaluation do
      abort e.message
    end
  end

  desc "Run faithfulness evaluation for a user input"
  task generate_faithfulness_evaluation: :environment do
    raise "Requires an INPUT env var" if ENV["INPUT"].blank?

    begin
      result = AutoEvaluation::EvaluateAnswerFromQuestionMessage.call(
        evaluation_class: AutoEvaluation::Faithfulness,
        question_message: ENV["INPUT"],
      )

      puts result.to_json
    rescue AutoEvaluation::EvaluateAnswerFromQuestionMessage::TaskFailedError => e
      abort e.message
    end
  end
end
44 changes: 44 additions & 0 deletions spec/lib/auto_evaluation/faithfulness/claims_generator_spec.rb
@@ -0,0 +1,44 @@
RSpec.describe AutoEvaluation::Faithfulness::ClaimsGenerator, :aws_credentials_stubbed do
  describe ".call" do
    let(:answer_message) { "Einstein won the Nobel Prize in 1968 for the photoelectric effect." }
    let(:claims) { ["Einstein won the Nobel Prize in 1968.", "Einstein won the Nobel Prize for the photoelectric effect."] }
    let(:claims_json) do
      { claims: }.to_json
    end
    let(:prompts) { AutoEvaluation::Prompts.config.faithfulness.fetch(:claims) }
    let(:user_prompt) do
      sprintf(
        prompts.fetch(:user_prompt),
        answer: answer_message,
      )
    end
    let(:tools) { [prompts.fetch(:tool_spec)] }
    let!(:stub_bedrock) do
      stub_bedrock_invoke_model_openai_oss_tool_call(
        user_prompt,
        tools,
        claims_json,
      )
    end

    it "returns an array with the claims, llm_response, and metrics" do
      allow(Clock).to receive(:monotonic_time).and_return(200.0, 202.0)

      result = described_class.call(answer_message:)

      expected_llm_response = JSON.parse(stub_bedrock.response.body)
      expected_metrics = {
        duration: 2.0,
        model: AutoEvaluation::BedrockOpenAIOssInvoke::MODEL,
        llm_prompt_tokens: 25,
        llm_completion_tokens: 35,
        llm_cached_tokens: nil,
      }
      expect(result).to contain_exactly(
        claims,
        expected_llm_response,
        expected_metrics,
      )
    end
  end
end
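
The spec stubs Clock.monotonic_time to return 200.0 and then 202.0, which is why the expected duration is 2.0. The sketch below shows the general idea of timing with an injectable monotonic clock. It does not reproduce the app's Clock class or BedrockOpenAIOssInvoke, neither of which appears in this diff; the timed helper and its clock: parameter are hypothetical.

```ruby
# Hypothetical helper: measures how long a block takes using a monotonic clock.
def timed(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
  started_at = clock.call
  result = yield
  [result, clock.call - started_at]
end

# In a test, a fake clock makes the duration deterministic.
ticks = [200.0, 202.0].each
fake_clock = -> { ticks.next }

result, duration = timed(clock: fake_clock) { "llm response" }
puts result
puts duration
```

A monotonic clock is the usual choice for durations because, unlike wall-clock time, it is not affected by system clock adjustments.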
52 changes: 52 additions & 0 deletions spec/lib/auto_evaluation/faithfulness/reason_generator_spec.rb
@@ -0,0 +1,52 @@
RSpec.describe AutoEvaluation::Faithfulness::ReasonGenerator, :aws_credentials_stubbed do
  describe ".call" do
    let(:score) { 0.5 }
    let(:verdicts) do
      [
        { "verdict" => "no", "reason" => "The retrieval context states Einstein won in 1921, not 1968." },
        { "verdict" => "yes" },
      ]
    end
    let(:contradictions) { ["The retrieval context states Einstein won in 1921, not 1968."] }
    let(:reason) { "The score is 0.5 because the answer incorrectly stated the year Einstein won the Nobel Prize." }
    let(:reason_json) do
      { reason: }.to_json
    end
    let(:prompts) { AutoEvaluation::Prompts.config.faithfulness.fetch(:reason) }
    let(:user_prompt) do
      sprintf(
        prompts.fetch(:user_prompt),
        score:,
        contradictions:,
      )
    end
    let(:tools) { [prompts.fetch(:tool_spec)] }
    let!(:stub_bedrock) do
      stub_bedrock_invoke_model_openai_oss_tool_call(
        user_prompt,
        tools,
        reason_json,
      )
    end

    it "returns an array with the reason, llm_response, and metrics" do
      allow(Clock).to receive(:monotonic_time).and_return(200.0, 202.0)

      result = described_class.call(score:, verdicts:)

      expected_llm_response = JSON.parse(stub_bedrock.response.body)
      expected_metrics = {
        duration: 2.0,
        model: AutoEvaluation::BedrockOpenAIOssInvoke::MODEL,
        llm_prompt_tokens: 25,
        llm_completion_tokens: 35,
        llm_cached_tokens: nil,
      }
      expect(result).to contain_exactly(
        reason,
        expected_llm_response,
        expected_metrics,
      )
    end
  end
end