Add faithfulness metric #737

Merged
chaecramb merged 2 commits into main from add-faithfulness-metric on Jan 8, 2026

Conversation

Contributor

@chaecramb chaecramb commented Dec 29, 2025

This PR adds the Faithfulness metric to the auto-evaluation module. It follows the established Ruby patterns from the AnswerRelevancy metric, using BedrockOpenAIOssInvoke to make tool calls to the LLM.

The metric evaluates whether the AI's answer is faithful to the retrieval context through a multi-step process:

  1. Extract truths from the retrieval context
  2. Extract claims from the answer
  3. Generate verdicts comparing claims against truths
  4. Calculate score and generate a reason

The score is calculated as the proportion of claims that don't contradict the retrieval context. Verdicts of "yes" and "idk" are treated as faithful (non-contradictory), while only "no" verdicts count against the score. This follows the DeepEval implementation.

The score is 1.0 (perfect) when no claims are extracted or all verdicts are "yes" or "idk". In all cases, a reason is generated via the LLM explaining the score.
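
The scoring rule described above can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the code from this PR: the method name `faithfulness_score` and the shape of the verdict hashes (a `"verdict"` key holding "yes", "no" or "idk") are hypothetical.

```ruby
# Hypothetical helper illustrating the scoring rule; not the PR's actual code.
# Each verdict is assumed to be a hash with a "verdict" key of "yes", "no" or "idk".
def faithfulness_score(verdicts)
  # With nothing to contradict, the answer is treated as fully faithful.
  return 1.0 if verdicts.empty?

  # Only "no" verdicts count against the score; "yes" and "idk" are faithful.
  faithful = verdicts.count { |v| v["verdict"].to_s.strip.downcase != "no" }
  faithful.to_f / verdicts.size
end

verdicts = [
  { "verdict" => "yes" },
  { "verdict" => "idk" },
  { "verdict" => "no" },
  { "verdict" => "yes" },
]
puts faithfulness_score(verdicts) # 3 of the 4 verdicts are not "no"
```

Treating "idk" as faithful means an answer is only penalised for claims that actively contradict the context, not for claims the context simply does not cover.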

A rake task has been added to generate faithfulness evaluations using the used sources as the retrieval context.

Trello: https://trello.com/c/SZkhqPRO/2992-ruby-auto-eval-for-faithfulness

Example rake task and output
❯ INPUT="How do I apply for a UK passport?" bundle exec rake evaluation:generate_faithfulness_evaluation
{
  "score": 1,
  "reason": "The score is 1.0 because there are no contradictions; the actual output fully aligns with the retrieval context.",
  "success": true,
  "llm_responses": {
    "truths": {
      "choices": [
        {
          "finish_reason": "tool_calls",
          "index": 0,
          "logprobs": null,
          "message": {
            "content": "{\n    \"truths\": []\n}",
            "refusal": null,
            "role": "assistant",
            "tool_calls": [
              {
                "function": {
                  "arguments": "{\n  \"truths\": []\n}",
                  "name": "extract_truths"
                },
                "id": "chatcmpl-tool-ad80ecb06bc9d080",
                "type": "function"
              }
            ]
          }
        }
      ],
      "created": 1767787220,
      "id": "chatcmpl-fa810f91-6cc1-483e-8f4d-200de4110984",
      "model": "openai.gpt-oss-120b-1:0",
      "object": "chat.completion",
      "service_tier": "default",
      "usage": {
        "completion_tokens": 149,
        "prompt_tokens": 347,
        "total_tokens": 496
      }
    },
    "claims": {
      "choices": [
        {
          "finish_reason": "tool_calls",
          "index": 0,
          "logprobs": null,
          "message": {
            "content": "{\n  \"claims\": [\n    \"You can apply for a UK passport in several ways: apply online at gov.uk/apply-renew-passport - this is £12.50 cheaper than applying by post.\",\n    \"You can pick up a paper application form from your local Post Office and apply by post.\",\n    \"You can get help with your application using the Check and Send service at selected Post Office branches.\",\n    \"You will need a debit or credit card to pay.\",\n    \"You will need 2 recent identical passport photos.\",\n    \"You will need your supporting documents (what you need depends on your circumstances).\",\n    \"Post Office staff can help you with both online and paper applications through their Check and Send service.\",\n    \"Post Office staff can take your digital photo and help fill in your online application, or check your paper form is completed correctly.\",\n    \"The Check and Send service costs extra.\",\n    \"Apply for your UK passport online to get started.\"\n  ]\n}",
            "refusal": null,
            "role": "assistant",
            "tool_calls": [
              {
                "function": {
                  "arguments": "{\n  \"claims\": [\n    \"You can apply for a UK passport in several ways: apply online at gov.uk/apply-renew-passport - this is £12.50 cheaper than applying by post.\",\n    \"You can pick up a paper application form from your local Post Office and apply by post.\",\n    \"You can get help with your application using the Check and Send service at selected Post Office branches.\",\n    \"You will need a debit or credit card to pay.\",\n    \"You will need 2 recent identical passport photos.\",\n    \"You will need your supporting documents (what you need depends on your circumstances).\",\n    \"Post Office staff can help you with both online and paper applications through their Check and Send service.\",\n    \"Post Office staff can take your digital photo and help fill in your online application, or check your paper form is completed correctly.\",\n    \"The Check and Send service costs extra.\",\n    \"Apply for your UK passport online to get started.\"\n  ]\n}",
                  "name": "extract_claims"
                },
                "id": "chatcmpl-tool-99f736167d0061ef",
                "type": "function"
              }
            ]
          }
        }
      ],
      "created": 1767787221,
      "id": "chatcmpl-9286fb85-2ec3-46dd-bc41-be85450ea1dd",
      "model": "openai.gpt-oss-120b-1:0",
      "object": "chat.completion",
      "service_tier": "default",
      "usage": {
        "completion_tokens": 690,
        "prompt_tokens": 644,
        "total_tokens": 1334
      }
    },
    "verdicts": {
      "choices": [
        {
          "finish_reason": "tool_calls",
          "index": 0,
          "logprobs": null,
          "message": {
            "content": "{\n    \"verdicts\": [\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        },\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        },\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        },\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        },\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        },\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        },\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        },\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        },\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        }\n    ]\n}",
            "refusal": null,
            "role": "assistant",
            "tool_calls": [
              {
                "function": {
                  "arguments": "{\n  \"verdicts\": [\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    },\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    },\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    },\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    },\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    },\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    },\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    },\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    },\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    }\n  ]\n}",
                  "name": "evaluate_claims"
                },
                "id": "chatcmpl-tool-93e5e2c7857a53e2",
                "type": "function"
              }
            ]
          }
        }
      ],
      "created": 1767787223,
      "id": "chatcmpl-5deb97cb-12bd-40df-afea-6d094c03752c",
      "model": "openai.gpt-oss-120b-1:0",
      "object": "chat.completion",
      "service_tier": "default",
      "usage": {
        "completion_tokens": 683,
        "prompt_tokens": 669,
        "total_tokens": 1352
      }
    },
    "reason": {
      "choices": [
        {
          "finish_reason": "tool_calls",
          "index": 0,
          "logprobs": null,
          "message": {
            "content": "{\n    \"reason\": \"The score is 1.0 because there are no contradictions; the actual output fully aligns with the retrieval context.\"\n}",
            "refusal": null,
            "role": "assistant",
            "tool_calls": [
              {
                "function": {
                  "arguments": "{\n    \"reason\": \"The score is 1.0 because there are no contradictions; the actual output fully aligns with the retrieval context.\"\n}",
                  "name": "generate_reason"
                },
                "id": "chatcmpl-tool-b7a832d113220dc7",
                "type": "function"
              }
            ]
          }
        }
      ],
      "created": 1767787225,
      "id": "chatcmpl-3c3e506e-2466-4a27-9a46-f5c43d051fed",
      "model": "openai.gpt-oss-120b-1:0",
      "object": "chat.completion",
      "service_tier": "default",
      "usage": {
        "completion_tokens": 127,
        "prompt_tokens": 343,
        "total_tokens": 470
      }
    }
  },
  "metrics": {
    "truths": {
      "duration": 1.1160260000033304,
      "llm_prompt_tokens": 347,
      "llm_completion_tokens": 149,
      "llm_cached_tokens": null,
      "model": "openai.gpt-oss-120b-1:0"
    },
    "claims": {
      "duration": 2.439254000026267,
      "llm_prompt_tokens": 644,
      "llm_completion_tokens": 690,
      "llm_cached_tokens": null,
      "model": "openai.gpt-oss-120b-1:0"
    },
    "verdicts": {
      "duration": 2.1723169999895617,
      "llm_prompt_tokens": 669,
      "llm_completion_tokens": 683,
      "llm_cached_tokens": null,
      "model": "openai.gpt-oss-120b-1:0"
    },
    "reason": {
      "duration": 0.5770500000216998,
      "llm_prompt_tokens": 343,
      "llm_completion_tokens": 127,
      "llm_cached_tokens": null,
      "model": "openai.gpt-oss-120b-1:0"
    }
  }
}
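
In the output above, each tool call's `arguments` value is a JSON-encoded string nested inside the JSON response, so reading the verdicts takes two parsing steps. The following is a minimal sketch of that, assuming responses shaped like the entries under `llm_responses`; the helper name `tool_call_arguments` is hypothetical and not taken from the PR.

```ruby
require "json"

# Hypothetical helper: pull the parsed tool-call arguments out of one
# chat.completion response shaped like the entries under "llm_responses".
def tool_call_arguments(llm_response)
  tool_call = llm_response.dig("choices", 0, "message", "tool_calls", 0)
  return {} if tool_call.nil?

  # "arguments" is itself a JSON-encoded string, so it needs its own parse.
  JSON.parse(tool_call.dig("function", "arguments"))
end

# A trimmed-down response using the same keys as the output above.
response = {
  "choices" => [
    {
      "message" => {
        "tool_calls" => [
          {
            "function" => {
              "name" => "evaluate_claims",
              "arguments" => JSON.generate(
                { "verdicts" => [{ "verdict" => "idk", "reason" => "No retrieval context provided to verify the claim." }] },
              ),
            },
          },
        ],
      },
    },
  ],
}

verdicts = tool_call_arguments(response).fetch("verdicts")
puts verdicts.map { |v| v["verdict"] }.inspect
```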

@chaecramb chaecramb force-pushed the add-faithfulness-metric branch from 6668c5e to 07c51df Compare December 30, 2025 13:36
@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-faithful-63ed36 December 30, 2025 13:39 Inactive
@chaecramb chaecramb force-pushed the add-faithfulness-metric branch 4 times, most recently from b097e2f to a372199 Compare January 6, 2026 16:38
@chaecramb chaecramb marked this pull request as ready for review January 7, 2026 10:27
@chaecramb chaecramb force-pushed the add-faithfulness-metric branch 3 times, most recently from 3dc3ce0 to d0b65c3 Compare January 7, 2026 13:20
Member

@kevindew kevindew left a comment

Looks good. I've added a few comments; I'm mostly interested in the situations where we have empty arrays and proceed with LLM calls anyway, which feels a bit dubious.

Comment thread lib/auto_evaluation/faithfulness.rb Outdated
Comment thread lib/auto_evaluation/faithfulness.rb
Comment thread lib/auto_evaluation/faithfulness.rb Outdated
  verdicts = []
else
  verdicts, llm_responses[:verdicts], metrics[:verdicts] = VerdictsGenerator.call(
    claims:, retrieval_context: truths.join("\n\n"),
Member

Are we concerned if truths is empty?

I think it also might be better to pass an array of truths into VerdictsGenerator rather than retrieval_context, as this use of retrieval_context means something different to the other usage (aside: it also seems a bit unbalanced with claims being an argument).

Contributor Author

Yeah, that sounds better, I'll switch to a truths array.

Re: empty truths, this also matches DeepEval. When truths is empty, VerdictsGenerator is called with empty context and the LLM returns "idk" for all claims, resulting in score 1.0.

The alternative, skipping VerdictsGenerator, would yield the same score. So as above, it seems to me that it's worth deviating from DeepEval here.

Contributor Author

As above, I've added a guard to short-circuit if truths is empty.
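
A short-circuit of this kind can be sketched as follows. This is an illustrative outline only, not the code in `lib/auto_evaluation/faithfulness.rb`: the `Result` struct, the method name `evaluate_faithfulness`, and the lambdas standing in for the LLM-backed steps are all hypothetical, as is every reason string except the "No truths were extracted" one.

```ruby
# Hypothetical sketch of the early-return flow; the lambdas stand in for the
# LLM-backed steps and are not the PR's real generator classes.
Result = Struct.new(:score, :reason, keyword_init: true)

def evaluate_faithfulness(extract_truths:, extract_claims:, generate_verdicts:)
  truths = extract_truths.call
  return Result.new(score: 1.0, reason: "No truths were extracted from the retrieval context.") if truths.empty?

  claims = extract_claims.call
  return Result.new(score: 1.0, reason: "No claims were extracted from the answer.") if claims.empty?

  verdicts = generate_verdicts.call(claims, truths)
  return Result.new(score: 1.0, reason: "No verdicts were generated.") if verdicts.empty?

  # Only "no" verdicts count as contradictions.
  contradictions = verdicts.count { |v| v == "no" }
  score = (verdicts.size - contradictions).to_f / verdicts.size
  Result.new(score: score, reason: "#{contradictions} of #{verdicts.size} claims contradict the context.")
end

# With no truths, the later steps are never invoked.
verdicts_called = false
result = evaluate_faithfulness(
  extract_truths: -> { [] },
  extract_claims: -> { ["You will need 2 recent identical passport photos."] },
  generate_verdicts: ->(_claims, _truths) { verdicts_called = true; ["idk"] },
)
puts result.score
puts result.reason
puts verdicts_called
```

Returning early gives the same score as asking the LLM to judge claims against an empty context, while avoiding the extra calls.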

Comment thread spec/lib/auto_evaluation/faithfulness_spec.rb
Comment thread spec/lib/auto_evaluation/faithfulness_spec.rb
Comment thread spec/lib/auto_evaluation/faithfulness_spec.rb Outdated
Comment thread spec/lib/auto_evaluation/faithfulness_spec.rb Outdated
@chaecramb chaecramb force-pushed the add-faithfulness-metric branch 3 times, most recently from f28ca79 to 4885b2f Compare January 7, 2026 15:20
This adds the Faithfulness metric to the auto-evaluation module. It
follows the established Ruby patterns from the AnswerRelevancy metric,
using BedrockOpenAIOssInvoke to make tool calls to the LLM.

The metric evaluates whether the AI's answer is faithful to the
retrieval context through a multi-step process:

1. Extract truths from the retrieval context
2. Extract claims from the answer
3. Generate verdicts comparing claims against truths
4. Calculate score and generate a reason

The score is calculated as the proportion of claims that don't
contradict the retrieval context. Verdicts of "yes" and "idk" are
treated as faithful (non-contradictory), while only "no" verdicts
count against the score. This follows the DeepEval implementation.

The metric returns early with a perfect score (1.0) when:
- No claims are extracted from the answer
- No truths are extracted from the retrieval context
- No verdicts are generated
- All verdicts are "yes" (no contradictions found)

This adds a new Rake task to generate faithfulness evaluation for a
given question. Like the answer relevancy and coherence tasks it:

1. generates an answer for the input question using the existing
    answer composition pipeline
2. evaluates the faithfulness of the generated answer against the
    retrieval context using AutoEvaluation::Faithfulness
3. outputs the result json to stdout
4. handles error answers appropriately

The key difference from the other metrics is that faithfulness evaluates
the answer against the retrieval context (the sources used to generate
the answer) rather than the original question. The retrieval context is
extracted from the answer's used sources joined with double newlines,
matching the DeepEval approach.
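
Building the retrieval context as described above can be sketched like this. It is a minimal illustration under assumptions: `SourceStub` is a hypothetical stand-in for the real source records, and the `content` field name is assumed rather than taken from the PR.

```ruby
# Hypothetical stand-in for a source record; only the fields needed here.
SourceStub = Struct.new(:content, :used, keyword_init: true)

# Join the content of the used sources, separated by double newlines.
def retrieval_context(sources)
  sources.select(&:used).map(&:content).join("\n\n")
end

sources = [
  SourceStub.new(content: "First used source.", used: true),
  SourceStub.new(content: "Unused source.", used: false),
  SourceStub.new(content: "Second used source.", used: true),
]
puts retrieval_context(sources)
```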
@chaecramb chaecramb force-pushed the add-faithfulness-metric branch from 4885b2f to e75c28f Compare January 7, 2026 15:24
@chaecramb
Contributor Author

@kevindew this is ready for a re-review

Member

@kevindew kevindew left a comment

Looks good to me. Will be good to get Data Science to check it

Comment thread lib/auto_evaluation/faithfulness.rb
@chaecramb chaecramb merged commit e1a9044 into main Jan 8, 2026
12 checks passed
@chaecramb chaecramb deleted the add-faithfulness-metric branch January 8, 2026 11:11
end

def used_sources
  answer.sources.used
Contributor

I think there's a little bug here that I encountered and which took me a while to figure out.

Using the used scope here actually doesn't return any records when this class is called from the Rake task, because that task doesn't persist any records to the database.

The answer that is built in the Rake task (via AutoEvaluation::EvaluateAnswerFromQuestionMessage) is built by the pipeline runner and isn't saved to the database. When you call answer.sources.used it runs a DB query to grab out the sources, but because there aren't any persisted, it always returns an empty relation.

If I run the Rake task as it stands on main, I get this:

$ INPUT="How do I start a new business?" rake evaluation:generate_faithfulness_evaluation
{"score":1.0,"reason":"No truths were extracted from the retrieval context.","success":true,"llm_responses":{"truths":{"choices":[{"finish_reason":"tool_calls","index":0,"logprobs":null,"message":{"content":"{\n    \"truths\": []\n}","refusal":null,"role":"assistant","tool_calls":[{"function":{"arguments":"{\n  \"truths\": []\n}","name":"extract_truths"},"id":"chatcmpl-tool-b63f81baf8e1bc87","type":"function"}]}}],"created":1767873892,"id":"chatcmpl-a0f5b69c-372c-4bb8-99d9-4e8015b32457","model":"openai.gpt-oss-120b-1:0","object":"chat.completion","service_tier":"default","usage":{"completion_tokens":150,"prompt_tokens":347,"prompt_tokens_details":{"audio_tokens":0,"cached_tokens":64},"total_tokens":497}}},"metrics":{"truths":{"duration":0.8339748330181465,"llm_prompt_tokens":347,"llm_completion_tokens":150,"llm_cached_tokens":null,"model":"openai.gpt-oss-120b-1:0"}}}

So no search results were passed in as the retrieval context.

Whereas if I change this line to answer.sources.select(&:used) (i.e. filtering the array rather than using the scope), I get the expected result:

$ INPUT="How do I start a new business?" rake evaluation:generate_faithfulness_evaluation
{"score":1.0,"reason":"The response is fully supported by the retrieval context.","success":true,"llm_responses":{"truths":{"choices":[{"finish_reason":"tool_calls","index":0,"logprobs":null,"message":{"content":"{\n  \"truths\": [\n    \"Starting your own business can be a thrilling and rewarding endeavor.\",\n    \"It is important to begin with a solid business plan and clear objectives when starting a business.\",\n    \"Researching your market thoroughly and understanding your competition is recommended.\",\n...
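
The difference between the two calls can be illustrated with a plain-Ruby simulation. This sketch does not use ActiveRecord: `FakeSources` and `UnsavedSource` are hypothetical classes that only mimic the behaviour described above, where a scope queries saved rows while a block-based `select` filters the records already in memory.

```ruby
# Plain-Ruby simulation of the behaviour described above; no ActiveRecord.
# All class, field and title names here are hypothetical.
UnsavedSource = Struct.new(:title, :used, :persisted, keyword_init: true)

class FakeSources
  include Enumerable

  def initialize(records)
    @records = records
  end

  # Enumerable methods such as `select` iterate over the in-memory records.
  def each(&block)
    @records.each(&block)
  end

  # Stand-in for a database-backed scope: unsaved records are invisible to it.
  def used
    @records.select { |record| record.persisted && record.used }
  end
end

sources = FakeSources.new([
  UnsavedSource.new(title: "Set up a business", used: true, persisted: false),
  UnsavedSource.new(title: "Unrelated page", used: false, persisted: false),
])

puts sources.used.size           # scope-style lookup finds nothing saved
puts sources.select(&:used).size # in-memory filter finds the used source
```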

Member

Oh, great spot. It's a shame it's such a pain to get integration tests for the rake tasks, as this feels like something we'd want a test to catch.

I imagine we want the tests for the Faithfulness class (and other auto eval routes) to use FactoryBot.build(:answer) rather than FactoryBot.create(:answer), since the answer may not be persisted.

5 participants