Add faithfulness metric #737
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,101 @@ | ||
| class AutoEvaluation::Faithfulness | ||
| THRESHOLD = 0.5 | ||
|
|
||
| def self.call(...) = new(...).call | ||
|
|
||
| def initialize(answer) | ||
| @answer = answer | ||
| @llm_responses = {} | ||
| @metrics = {} | ||
| end | ||
|
|
||
| def call | ||
| truths, llm_responses[:truths], metrics[:truths] = TruthsGenerator.call(retrieval_context:) | ||
|
|
||
| if truths.empty? | ||
| return build_maximum_score_result( | ||
| reason: "No truths were extracted from the retrieval context.", | ||
| llm_responses:, | ||
| metrics:, | ||
| ) | ||
| end | ||
|
|
||
| claims, llm_responses[:claims], metrics[:claims] = ClaimsGenerator.call(answer_message:) | ||
|
|
||
| if claims.empty? | ||
| return build_maximum_score_result( | ||
| reason: "No claims were extracted from the answer.", | ||
| llm_responses:, | ||
| metrics:, | ||
| ) | ||
| end | ||
|
|
||
| verdicts, llm_responses[:verdicts], metrics[:verdicts] = VerdictsGenerator.call( | ||
| claims:, truths:, | ||
| ) | ||
|
|
||
| if verdicts.empty? | ||
| return build_maximum_score_result( | ||
| reason: "No verdicts were generated for the extracted claims.", | ||
| llm_responses:, | ||
| metrics:, | ||
| ) | ||
| end | ||
|
|
||
| if verdicts.none? { |verdict| verdict["verdict"].strip.downcase == "no" } | ||
| return build_maximum_score_result( | ||
| reason: "The response is fully supported by the retrieval context.", | ||
| llm_responses:, | ||
| metrics:, | ||
| ) | ||
| end | ||
|
|
||
| score = calculate_score(verdicts) | ||
|
|
||
| reason, llm_responses[:reason], metrics[:reason] = ReasonGenerator.call( | ||
| score: score.round(2), verdicts:, | ||
| ) | ||
|
|
||
| AutoEvaluation::ScoreResult.new( | ||
| score:, | ||
| reason:, | ||
| success: score >= THRESHOLD, | ||
| llm_responses:, | ||
| metrics:, | ||
| ) | ||
| end | ||
|
|
||
| private | ||
|
|
||
| attr_reader :answer | ||
| attr_accessor :llm_responses, :metrics | ||
|
|
||
| def answer_message | ||
| answer.message | ||
| end | ||
|
|
||
| def retrieval_context | ||
| used_sources.map(&:plain_content).join("\n\n") | ||
| end | ||
|
|
||
| def calculate_score(verdicts) | ||
| return 1.0 if verdicts.empty? | ||
|
|
||
| faithful_count = verdicts.count { |verdict| verdict["verdict"].strip.downcase != "no" } | ||
| faithful_count.to_d / verdicts.count | ||
| end | ||
|
|
||
| def used_sources | ||
| answer.sources.used | ||
|
Contributor
I think there's a small bug here that I ran into, and it took me a while to figure out. Using the [...] The [...] If I run the Rake task as it stands on main, I get this: [...] So no search results were passed in as the retrieval context. Whereas if I change this line to [...]
Member
Oh, great spot. It's a shame it's such a pain to get integration tests for the Rake tasks, as this feels like something we'd want a test to catch. I imagine we want the tests for the Faithfulness class (and other auto eval routes) to use a FactoryBot.build(:answer) rather than a FactoryBot.create(:answer), since the answer may not be persisted.
||
| end | ||
|
|
||
| def build_maximum_score_result(reason:, llm_responses:, metrics:) | ||
| AutoEvaluation::ScoreResult.new( | ||
| score: 1.0, | ||
| reason:, | ||
| success: true, | ||
| llm_responses:, | ||
| metrics:, | ||
| ) | ||
| end | ||
| end | ||
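The review thread on `used_sources` reports that the retrieval context came out empty when the Rake task ran, and notes that the answer may not be persisted. The replacement line the contributor proposed is not preserved on this page, so what follows is only a sketch of one possible approach: filtering the sources already held in memory instead of going through a query. It assumes each source exposes a boolean `used` attribute. The `Source` and `Answer` structs are hypothetical stand-ins, not the project's models.

```ruby
# Hypothetical stand-ins for the project's models (assumptions for illustration).
Source = Struct.new(:plain_content, :used, keyword_init: true)
Answer = Struct.new(:message, :sources, keyword_init: true)

# Build the retrieval context from sources already held in memory,
# so nothing depends on the answer having been saved.
def retrieval_context_for(answer)
  answer.sources
        .select(&:used) # in-memory filter; assumes a boolean `used` attribute
        .map(&:plain_content)
        .join("\n\n")
end

answer = Answer.new(
  message: "Einstein won the Nobel Prize in 1968 for the photoelectric effect.",
  sources: [
    Source.new(plain_content: "Einstein won the Nobel Prize in 1921.", used: true),
    Source.new(plain_content: "Unrelated page.", used: false),
    Source.new(plain_content: "The prize was for the photoelectric effect.", used: true),
  ],
)

puts retrieval_context_for(answer)
```

Whether this matches the fix that was eventually merged cannot be confirmed from this page. The point is only that an in-memory filter behaves the same for built and persisted records.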
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| module AutoEvaluation | ||
| class Faithfulness::ClaimsGenerator | ||
| def self.call(...) = new(...).call | ||
|
|
||
| def initialize(answer_message:) | ||
| @answer_message = answer_message | ||
| end | ||
|
|
||
| def call | ||
| result = BedrockOpenAIOssInvoke.call(user_prompt, tools) | ||
| [result.evaluation_data.fetch("claims"), result.llm_response, result.metrics] | ||
| end | ||
|
|
||
| private | ||
|
|
||
| attr_reader :answer_message | ||
|
|
||
| def llm_prompts | ||
| Prompts.config | ||
| .faithfulness | ||
| .fetch(:claims) | ||
| end | ||
|
|
||
| def user_prompt | ||
| sprintf( | ||
| llm_prompts.fetch(:user_prompt), | ||
| answer: answer_message, | ||
| ) | ||
| end | ||
|
|
||
| def tools | ||
| [llm_prompts.fetch(:tool_spec)] | ||
| end | ||
| end | ||
| end |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,42 @@ | ||
| module AutoEvaluation | ||
| class Faithfulness::ReasonGenerator | ||
| def self.call(...) = new(...).call | ||
|
|
||
| def initialize(score:, verdicts:) | ||
| @score = score | ||
| @verdicts = verdicts | ||
| end | ||
|
|
||
| def call | ||
| result = BedrockOpenAIOssInvoke.call(user_prompt, tools) | ||
| [result.evaluation_data.fetch("reason"), result.llm_response, result.metrics] | ||
| end | ||
|
|
||
| private | ||
|
|
||
| attr_reader :score, :verdicts | ||
|
|
||
| def llm_prompts | ||
| Prompts.config | ||
| .faithfulness | ||
| .fetch(:reason) | ||
| end | ||
|
|
||
| def user_prompt | ||
| sprintf( | ||
| llm_prompts.fetch(:user_prompt), | ||
| score:, | ||
| contradictions:, | ||
| ) | ||
| end | ||
|
|
||
| def tools | ||
| [llm_prompts.fetch(:tool_spec)] | ||
| end | ||
|
|
||
| def contradictions | ||
| verdicts.select { |verdict| verdict["verdict"].strip.downcase == "no" } | ||
| .map { |verdict| verdict["reason"] } | ||
| end | ||
| end | ||
| end |
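The `contradictions` helper above can be exercised on its own. This is a minimal standalone sketch of the same select-then-map logic, using the verdict shape from the spec plus one hypothetical entry with untidy casing and whitespace to show the normalisation.

```ruby
# Keep only the reasons attached to "no" verdicts, ignoring case and
# surrounding whitespace in the verdict string.
def contradictions(verdicts)
  verdicts
    .select { |verdict| verdict["verdict"].strip.downcase == "no" }
    .map { |verdict| verdict["reason"] }
end

verdicts = [
  { "verdict" => "no", "reason" => "The retrieval context states Einstein won in 1921, not 1968." },
  { "verdict" => " No ", "reason" => "Hypothetical second contradiction." }, # illustrative entry
  { "verdict" => "yes" },
]

p contradictions(verdicts)
```

Note that a verdict hash with no `"verdict"` key would raise `NoMethodError` here, because `nil` does not respond to `strip`.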
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| module AutoEvaluation | ||
| class Faithfulness::TruthsGenerator | ||
| def self.call(...) = new(...).call | ||
|
|
||
| def initialize(retrieval_context:) | ||
| @retrieval_context = retrieval_context | ||
| end | ||
|
|
||
| def call | ||
| result = BedrockOpenAIOssInvoke.call(user_prompt, tools) | ||
| [result.evaluation_data.fetch("truths"), result.llm_response, result.metrics] | ||
| end | ||
|
|
||
| private | ||
|
|
||
| attr_reader :retrieval_context | ||
|
|
||
| def llm_prompts | ||
| Prompts.config | ||
| .faithfulness | ||
| .fetch(:truths) | ||
| end | ||
|
|
||
| def user_prompt | ||
| sprintf( | ||
| llm_prompts.fetch(:user_prompt), | ||
| retrieval_context:, | ||
| ) | ||
| end | ||
|
|
||
| def tools | ||
| [llm_prompts.fetch(:tool_spec)] | ||
| end | ||
| end | ||
| end |
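Each generator returns a three-element array, which `Faithfulness#call` unpacks with a single multiple assignment. The sketch below shows that pattern in isolation. `FAKE_GENERATOR` and `collect_truths` are hypothetical names standing in for `TruthsGenerator.call` and its caller, and the data is made up.

```ruby
# Stand-in for a generator such as TruthsGenerator.call (hypothetical data).
FAKE_GENERATOR = lambda do
  [
    ["Einstein won the Nobel Prize in 1921."], # evaluation data
    { "id" => "response-1" },                  # raw LLM response
    { duration: 2.0 },                         # metrics
  ]
end

def collect_truths(generator)
  llm_responses = {}
  metrics = {}

  # Multiple assignment: the first element goes to a local variable,
  # the other two are stored directly under a key in each hash.
  truths, llm_responses[:truths], metrics[:truths] = generator.call

  [truths, llm_responses, metrics]
end

truths, llm_responses, metrics = collect_truths(FAKE_GENERATOR)
p truths
p llm_responses
p metrics
```

Keying each step's response and metrics by step name keeps the early-return paths simple, since whatever has been collected so far can be handed straight to the result.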
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| module AutoEvaluation | ||
| class Faithfulness::VerdictsGenerator | ||
| def self.call(...) = new(...).call | ||
|
|
||
| def initialize(claims:, truths:) | ||
| @claims = claims | ||
| @truths = truths | ||
| end | ||
|
|
||
| def call | ||
| result = BedrockOpenAIOssInvoke.call(user_prompt, tools) | ||
| [result.evaluation_data.fetch("verdicts"), result.llm_response, result.metrics] | ||
| end | ||
|
|
||
| private | ||
|
|
||
| attr_reader :claims, :truths | ||
|
|
||
| def llm_prompts | ||
| Prompts.config | ||
| .faithfulness | ||
| .fetch(:verdicts) | ||
| end | ||
|
|
||
| def user_prompt | ||
| sprintf( | ||
| llm_prompts.fetch(:user_prompt), | ||
| claims:, | ||
| retrieval_context: truths.join("\n\n"), | ||
| ) | ||
| end | ||
|
|
||
| def tools | ||
| [llm_prompts.fetch(:tool_spec)] | ||
| end | ||
| end | ||
| end |
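The generators build their prompts with `sprintf` and keyword arguments, which relies on named references in the template. The prompt templates themselves are not shown in this diff, so the `TEMPLATE` string and the `build_prompt` method below are hypothetical. The sketch only illustrates how the named substitution behaves, in particular that an array such as `claims` is interpolated using its `to_s` form.

```ruby
# Hypothetical template; the real prompt text lives in the project's prompt config.
TEMPLATE = "Claims: %{claims}\n\nRetrieval context:\n%{retrieval_context}".freeze

def build_prompt(claims:, truths:)
  sprintf(
    TEMPLATE,
    claims:,                                 # Array, rendered via Array#to_s
    retrieval_context: truths.join("\n\n"),  # String, inserted as-is
  )
end

puts build_prompt(
  claims: ["Einstein won the Nobel Prize in 1968."],
  truths: ["Einstein won the Nobel Prize in 1921.", "The prize was for the photoelectric effect."],
)
```

Any literal percent sign in such a template needs to be written as `%%`, because `sprintf` would otherwise try to parse it as a format directive.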
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| RSpec.describe AutoEvaluation::Faithfulness::ClaimsGenerator, :aws_credentials_stubbed do | ||
| describe ".call" do | ||
| let(:answer_message) { "Einstein won the Nobel Prize in 1968 for the photoelectric effect." } | ||
| let(:claims) { ["Einstein won the Nobel Prize in 1968.", "Einstein won the Nobel Prize for the photoelectric effect."] } | ||
| let(:claims_json) do | ||
| { claims: }.to_json | ||
| end | ||
| let(:prompts) { AutoEvaluation::Prompts.config.faithfulness.fetch(:claims) } | ||
| let(:user_prompt) do | ||
| sprintf( | ||
| prompts.fetch(:user_prompt), | ||
| answer: answer_message, | ||
| ) | ||
| end | ||
| let(:tools) { [prompts.fetch(:tool_spec)] } | ||
| let!(:stub_bedrock) do | ||
| stub_bedrock_invoke_model_openai_oss_tool_call( | ||
| user_prompt, | ||
| tools, | ||
| claims_json, | ||
| ) | ||
| end | ||
|
|
||
| it "returns an array with the claims, llm_response, and metrics" do | ||
| allow(Clock).to receive(:monotonic_time).and_return(200.0, 202.0) | ||
|
|
||
| result = described_class.call(answer_message:) | ||
|
|
||
| expected_llm_response = JSON.parse(stub_bedrock.response.body) | ||
| expected_metrics = { | ||
| duration: 2.0, | ||
| model: AutoEvaluation::BedrockOpenAIOssInvoke::MODEL, | ||
| llm_prompt_tokens: 25, | ||
| llm_completion_tokens: 35, | ||
| llm_cached_tokens: nil, | ||
| } | ||
| expect(result).to contain_exactly( | ||
| claims, | ||
| expected_llm_response, | ||
| expected_metrics, | ||
| ) | ||
| end | ||
| end | ||
| end |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,52 @@ | ||
| RSpec.describe AutoEvaluation::Faithfulness::ReasonGenerator, :aws_credentials_stubbed do | ||
| describe ".call" do | ||
| let(:score) { 0.5 } | ||
| let(:verdicts) do | ||
| [ | ||
| { "verdict" => "no", "reason" => "The retrieval context states Einstein won in 1921, not 1968." }, | ||
| { "verdict" => "yes" }, | ||
| ] | ||
| end | ||
| let(:contradictions) { ["The retrieval context states Einstein won in 1921, not 1968."] } | ||
| let(:reason) { "The score is 0.5 because the answer incorrectly stated the year Einstein won the Nobel Prize." } | ||
| let(:reason_json) do | ||
| { reason: }.to_json | ||
| end | ||
| let(:prompts) { AutoEvaluation::Prompts.config.faithfulness.fetch(:reason) } | ||
| let(:user_prompt) do | ||
| sprintf( | ||
| prompts.fetch(:user_prompt), | ||
| score:, | ||
| contradictions:, | ||
| ) | ||
| end | ||
| let(:tools) { [prompts.fetch(:tool_spec)] } | ||
| let!(:stub_bedrock) do | ||
| stub_bedrock_invoke_model_openai_oss_tool_call( | ||
| user_prompt, | ||
| tools, | ||
| reason_json, | ||
| ) | ||
| end | ||
|
|
||
| it "returns an array with the reason, llm_response, and metrics" do | ||
| allow(Clock).to receive(:monotonic_time).and_return(200.0, 202.0) | ||
|
|
||
| result = described_class.call(score:, verdicts:) | ||
|
|
||
| expected_llm_response = JSON.parse(stub_bedrock.response.body) | ||
| expected_metrics = { | ||
| duration: 2.0, | ||
| model: AutoEvaluation::BedrockOpenAIOssInvoke::MODEL, | ||
| llm_prompt_tokens: 25, | ||
| llm_completion_tokens: 35, | ||
| llm_cached_tokens: nil, | ||
| } | ||
| expect(result).to contain_exactly( | ||
| reason, | ||
| expected_llm_response, | ||
| expected_metrics, | ||
| ) | ||
| end | ||
| end | ||
| end |
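The spec above passes a score of 0.5 alongside one "no" and one "yes" verdict. The sketch below shows how `calculate_score` in `Faithfulness` would arrive at that figure for those verdicts. To stay dependency-free it uses Float division (`fdiv`) in place of the PR's `to_d`, so it is an approximation of the PR's method rather than a copy, and `faithfulness_score` is a hypothetical name.

```ruby
THRESHOLD = 0.5

# Fraction of verdicts that are not "no". An empty list counts as fully faithful.
def faithfulness_score(verdicts)
  return 1.0 if verdicts.empty?

  faithful_count = verdicts.count { |verdict| verdict["verdict"].strip.downcase != "no" }
  faithful_count.fdiv(verdicts.count) # Float division; the PR uses to_d instead
end

verdicts = [
  { "verdict" => "no", "reason" => "The retrieval context states Einstein won in 1921, not 1968." },
  { "verdict" => "yes" },
]

score = faithfulness_score(verdicts)
puts score              # 0.5
puts score >= THRESHOLD # true
```

Because the comparison is `>=`, a score exactly at the threshold still counts as a success.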