Add faithfulness metric #737
kevindew
left a comment
Looks good. I've added a few comments; I'm mostly interested in the situations where we have empty arrays and proceed anyway with LLM calls, which feels a bit dubious.
| verdicts = [] | ||
| else | ||
| verdicts, llm_responses[:verdicts], metrics[:verdicts] = VerdictsGenerator.call( | ||
| claims:, retrieval_context: truths.join("\n\n"), |
Are we concerned if truths is empty?
I think it might also be better to pass an array of truths into VerdictsGenerator rather than retrieval_context, as this use of retrieval_context means something different from the other usage (aside: it also seems a bit unbalanced with claims being an argument).
Yeah, that sounds better, I'll switch to a truths array.
Re: empty truths, this also matches DeepEval. When truths is empty, VerdictsGenerator is called with empty context and the LLM returns "idk" for all claims, resulting in score 1.0.
The alternative, skipping VerdictsGenerator, would yield the same score. So as above, it seems to me that it's worth deviating from DeepEval here.
As above, I've added a guard to short-circuit if truths is empty.
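A minimal sketch of what such a guard could look like, assuming the verdict step receives claims and truths as arrays. The `verdicts_for` method and the `CountingGenerator` class are hypothetical stand-ins for illustration, not the PR's actual code; the fake generator only counts calls so the short-circuit is visible.

```ruby
# Hypothetical stand-in for the LLM-backed verdict generator.
# It records how many times it is called and returns "yes" for every claim.
class CountingGenerator
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def call(claims:, truths:)
    @calls += 1
    claims.map { "yes" }
  end
end

# Hypothetical helper: returns no verdicts, and makes no LLM call,
# when there is nothing to compare.
def verdicts_for(claims:, truths:, generator:)
  return [] if claims.empty? || truths.empty?

  generator.call(claims: claims, truths: truths)
end

generator = CountingGenerator.new
verdicts_for(claims: ["A claim"], truths: [], generator: generator)
generator.calls # => 0
```

With an empty verdict list the score still comes out as 1.0, so the guard only removes an LLM round trip rather than changing the result.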
This adds the Faithfulness metric to the auto-evaluation module. It follows the established Ruby patterns from the AnswerRelevancy metric, using BedrockOpenAIOssInvoke to make tool calls to the LLM.
The metric evaluates whether the AI's answer is faithful to the retrieval context through a multi-step process:
1. Extract truths from the retrieval context
2. Extract claims from the answer
3. Generate verdicts comparing claims against truths
4. Calculate score and generate a reason
The score is calculated as the proportion of claims that don't contradict the retrieval context. Verdicts of "yes" and "idk" are treated as faithful (non-contradictory), while only "no" verdicts count against the score. This follows the DeepEval implementation.
The metric returns early with a perfect score (1.0) when:
- No claims are extracted from the answer
- No truths are extracted from the retrieval context
- No verdicts are generated
- All verdicts are "yes" (no contradictions found)
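The scoring rule described above can be sketched in a few lines. This is an illustration only: `faithfulness_score` is a hypothetical helper, and verdicts are assumed to be plain strings rather than whatever objects the PR actually uses.

```ruby
# Hypothetical helper, not the PR's code: verdicts are plain strings here.
def faithfulness_score(verdicts)
  # No verdicts means nothing contradicted the context, so the score is perfect.
  return 1.0 if verdicts.empty?

  # Only "no" counts against the score; "yes" and "idk" are treated as faithful.
  faithful = verdicts.count { |verdict| verdict != "no" }
  faithful.to_f / verdicts.size
end

faithfulness_score(%w[yes idk no no]) # => 0.5
```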
This adds a new Rake task to generate faithfulness evaluation for a
given question. Like the answer relevancy and coherence tasks it:
1. generates an answer for the input question using the existing
answer composition pipeline
2. evaluates the faithfulness of the generated answer against the
retrieval context using AutoEvaluation::Faithfulness
3. outputs the result json to stdout
4. handles error answers appropriately
The key difference from the other metrics is that faithfulness evaluates
the answer against the retrieval context (the sources used to generate
the answer) rather than the original question. The retrieval context is
extracted from the answer's used sources joined with double newlines,
matching the DeepEval approach.
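As a rough sketch of the joining step described above: `UsedSource` and its `content` attribute are assumptions made for illustration, since the commit message does not say which attribute of a source is used.

```ruby
# Hypothetical value object; the real source model and attribute name may differ.
UsedSource = Struct.new(:content, keyword_init: true)

# Joins the text of each used source with a blank line between entries.
def build_retrieval_context(used_sources)
  used_sources.map(&:content).join("\n\n")
end

build_retrieval_context([
  UsedSource.new(content: "First source text."),
  UsedSource.new(content: "Second source text."),
])
# => "First source text.\n\nSecond source text."
```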
@kevindew this is ready for a re-review
kevindew
left a comment
Looks good to me. It will be good to get Data Science to check it.
| end | ||
|
|
||
| def used_sources | ||
| answer.sources.used |
I think there's a little bug here that I encountered and which took me a while to figure out.
Using the used scope here actually doesn't return any records when this class is called from the Rake task, because that task doesn't persist any records to the database.
The answer that is built in the Rake task (via AutoEvaluation::EvaluateAnswerFromQuestionMessage) is built by the pipeline runner and isn't saved to the database. When you call answer.sources.used it runs a DB query to grab out the sources, but because there aren't any persisted, it always returns an empty relation.
If I run the Rake task as it stands on main, I get this:
$ INPUT="How do I start a new business?" rake evaluation:generate_faithfulness_evaluation
{"score":1.0,"reason":"No truths were extracted from the retrieval context.","success":true,"llm_responses":{"truths":{"choices":[{"finish_reason":"tool_calls","index":0,"logprobs":null,"message":{"content":"{\n \"truths\": []\n}","refusal":null,"role":"assistant","tool_calls":[{"function":{"arguments":"{\n \"truths\": []\n}","name":"extract_truths"},"id":"chatcmpl-tool-b63f81baf8e1bc87","type":"function"}]}}],"created":1767873892,"id":"chatcmpl-a0f5b69c-372c-4bb8-99d9-4e8015b32457","model":"openai.gpt-oss-120b-1:0","object":"chat.completion","service_tier":"default","usage":{"completion_tokens":150,"prompt_tokens":347,"prompt_tokens_details":{"audio_tokens":0,"cached_tokens":64},"total_tokens":497}}},"metrics":{"truths":{"duration":0.8339748330181465,"llm_prompt_tokens":347,"llm_completion_tokens":150,"llm_cached_tokens":null,"model":"openai.gpt-oss-120b-1:0"}}}
So no search results were passed in as the retrieval context.
Whereas if I change this line to answer.sources.select(&:used) (i.e. filtering the array rather than using the scope), I get the expected result:
$ INPUT="How do I start a new business?" rake evaluation:generate_faithfulness_evaluation
{"score":1.0,"reason":"The response is fully supported by the retrieval context.","success":true,"llm_responses":{"truths":{"choices":[{"finish_reason":"tool_calls","index":0,"logprobs":null,"message":{"content":"{\n \"truths\": [\n \"Starting your own business can be a thrilling and rewarding endeavor.\",\n \"It is important to begin with a solid business plan and clear objectives when starting a business.\",\n \"Researching your market thoroughly and understanding your competition is recommended.\",\n...
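The difference between the two calls can be illustrated without Rails. The sketch below is a plain-Ruby simulation, not ActiveRecord: `FakeSource` and `FakeSources` are hypothetical, and `FakeSources#used` merely mimics a scope that only sees saved rows, while `select(&:used)` filters the objects already in memory.

```ruby
# Plain-Ruby stand-in for the behaviour described above; no Rails involved.
FakeSource = Struct.new(:title, :used, :persisted, keyword_init: true)

class FakeSources
  include Enumerable

  def initialize(records)
    @records = records
  end

  def each(&block)
    @records.each(&block)
  end

  # Mimics a database-backed scope: unsaved records are invisible to it.
  def used
    @records.select { |record| record.persisted && record.used }
  end
end

sources = FakeSources.new([
  FakeSource.new(title: "Set up a business", used: true, persisted: false),
  FakeSource.new(title: "Unrelated page", used: false, persisted: false),
])

sources.used.map(&:title)           # => []
sources.select(&:used).map(&:title) # => ["Set up a business"]
```

Because the Rake task builds the answer without saving it, every source behaves like the unpersisted records here, which is why the scope returned nothing.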
Oh, great spot. It's a shame it's such a pain to get integration tests for the rake tasks, as this feels like something we'd want a test to catch.
I imagine we want the tests for the Faithfulness class (and other auto eval routes) to use FactoryBot.build(:answer) rather than FactoryBot.create(:answer), since the answer may not be persisted.
This PR adds the Faithfulness metric to the auto-evaluation module. It follows the established Ruby patterns from the AnswerRelevancy metric, using BedrockOpenAIOssInvoke to make tool calls to the LLM.
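The metric evaluates whether the AI's answer is faithful to the retrieval context through a multi-step process:
1. Extract truths from the retrieval context
2. Extract claims from the answer
3. Generate verdicts comparing claims against truths
4. Calculate score and generate a reason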
The score is calculated as the proportion of claims that don't contradict the retrieval context. Verdicts of "yes" and "idk" are treated as faithful (non-contradictory), while only "no" verdicts count against the score. This follows the DeepEval implementation.
The score is 1.0 (perfect) when no claims are extracted or all verdicts are "yes" or "idk". In all cases, a reason is generated via the LLM explaining the score.
A rake task has been added to generate faithfulness evaluations using the used sources as the retrieval context.
Trello: https://trello.com/c/SZkhqPRO/2992-ruby-auto-eval-for-faithfulness
Example rake task and output