Add faithfulness metric by chaecramb · Pull Request #737 · alphagov/govuk-chat

chaecramb · 2025-12-29T16:56:02Z

This PR adds the Faithfulness metric to the auto-evaluation module. It follows the established Ruby patterns from the AnswerRelevancy metric, using BedrockOpenAIOssInvoke to make tool calls to the LLM.

The metric evaluates whether the AI's answer is faithful to the retrieval context through a multi-step process:

Extract truths from the retrieval context
Extract claims from the answer
Generate verdicts comparing claims against truths
Calculate score and generate a reason

The score is calculated as the proportion of claims that don't contradict the retrieval context. Verdicts of "yes" and "idk" are treated as faithful (non-contradictory), while only "no" verdicts count against the score. This follows the DeepEval implementation.

The score is 1.0 (perfect) when no claims are extracted or all verdicts are "yes" or "idk". In all cases, a reason is generated via the LLM explaining the score.

A rake task has been added to generate faithfulness evaluations using the used sources as the retrieval context.

Trello: https://trello.com/c/SZkhqPRO/2992-ruby-auto-eval-for-faithfulness

Example rake task and output

❯ INPUT="How do I apply for a UK passport?" bundle exec rake evaluation:generate_faithfulness_evaluation
{
  "score": 1,
  "reason": "The score is 1.0 because there are no contradictions; the actual output fully aligns with the retrieval context.",
  "success": true,
  "llm_responses": {
    "truths": {
      "choices": [
        {
          "finish_reason": "tool_calls",
          "index": 0,
          "logprobs": null,
          "message": {
            "content": "{\n    \"truths\": []\n}",
            "refusal": null,
            "role": "assistant",
            "tool_calls": [
              {
                "function": {
                  "arguments": "{\n  \"truths\": []\n}",
                  "name": "extract_truths"
                },
                "id": "chatcmpl-tool-ad80ecb06bc9d080",
                "type": "function"
              }
            ]
          }
        }
      ],
      "created": 1767787220,
      "id": "chatcmpl-fa810f91-6cc1-483e-8f4d-200de4110984",
      "model": "openai.gpt-oss-120b-1:0",
      "object": "chat.completion",
      "service_tier": "default",
      "usage": {
        "completion_tokens": 149,
        "prompt_tokens": 347,
        "total_tokens": 496
      }
    },
    "claims": {
      "choices": [
        {
          "finish_reason": "tool_calls",
          "index": 0,
          "logprobs": null,
          "message": {
            "content": "{\n  \"claims\": [\n    \"You can apply for a UK passport in several ways: apply online at gov.uk/apply-renew-passport - this is £12.50 cheaper than applying by post.\",\n    \"You can pick up a paper application form from your local Post Office and apply by post.\",\n    \"You can get help with your application using the Check and Send service at selected Post Office branches.\",\n    \"You will need a debit or credit card to pay.\",\n    \"You will need 2 recent identical passport photos.\",\n    \"You will need your supporting documents (what you need depends on your circumstances).\",\n    \"Post Office staff can help you with both online and paper applications through their Check and Send service.\",\n    \"Post Office staff can take your digital photo and help fill in your online application, or check your paper form is completed correctly.\",\n    \"The Check and Send service costs extra.\",\n    \"Apply for your UK passport online to get started.\"\n  ]\n}",
            "refusal": null,
            "role": "assistant",
            "tool_calls": [
              {
                "function": {
                  "arguments": "{\n  \"claims\": [\n    \"You can apply for a UK passport in several ways: apply online at gov.uk/apply-renew-passport - this is £12.50 cheaper than applying by post.\",\n    \"You can pick up a paper application form from your local Post Office and apply by post.\",\n    \"You can get help with your application using the Check and Send service at selected Post Office branches.\",\n    \"You will need a debit or credit card to pay.\",\n    \"You will need 2 recent identical passport photos.\",\n    \"You will need your supporting documents (what you need depends on your circumstances).\",\n    \"Post Office staff can help you with both online and paper applications through their Check and Send service.\",\n    \"Post Office staff can take your digital photo and help fill in your online application, or check your paper form is completed correctly.\",\n    \"The Check and Send service costs extra.\",\n    \"Apply for your UK passport online to get started.\"\n  ]\n}",
                  "name": "extract_claims"
                },
                "id": "chatcmpl-tool-99f736167d0061ef",
                "type": "function"
              }
            ]
          }
        }
      ],
      "created": 1767787221,
      "id": "chatcmpl-9286fb85-2ec3-46dd-bc41-be85450ea1dd",
      "model": "openai.gpt-oss-120b-1:0",
      "object": "chat.completion",
      "service_tier": "default",
      "usage": {
        "completion_tokens": 690,
        "prompt_tokens": 644,
        "total_tokens": 1334
      }
    },
    "verdicts": {
      "choices": [
        {
          "finish_reason": "tool_calls",
          "index": 0,
          "logprobs": null,
          "message": {
            "content": "{\n    \"verdicts\": [\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        },\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        },\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        },\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        },\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        },\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        },\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        },\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        },\n        {\n            \"verdict\": \"idk\",\n            \"reason\": \"No retrieval context provided to verify the claim.\"\n        }\n    ]\n}",
            "refusal": null,
            "role": "assistant",
            "tool_calls": [
              {
                "function": {
                  "arguments": "{\n  \"verdicts\": [\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    },\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    },\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    },\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    },\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    },\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    },\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    },\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    },\n    {\n      \"verdict\": \"idk\",\n      \"reason\": \"No retrieval context provided to verify the claim.\"\n    }\n  ]\n}",
                  "name": "evaluate_claims"
                },
                "id": "chatcmpl-tool-93e5e2c7857a53e2",
                "type": "function"
              }
            ]
          }
        }
      ],
      "created": 1767787223,
      "id": "chatcmpl-5deb97cb-12bd-40df-afea-6d094c03752c",
      "model": "openai.gpt-oss-120b-1:0",
      "object": "chat.completion",
      "service_tier": "default",
      "usage": {
        "completion_tokens": 683,
        "prompt_tokens": 669,
        "total_tokens": 1352
      }
    },
    "reason": {
      "choices": [
        {
          "finish_reason": "tool_calls",
          "index": 0,
          "logprobs": null,
          "message": {
            "content": "{\n    \"reason\": \"The score is 1.0 because there are no contradictions; the actual output fully aligns with the retrieval context.\"\n}",
            "refusal": null,
            "role": "assistant",
            "tool_calls": [
              {
                "function": {
                  "arguments": "{\n    \"reason\": \"The score is 1.0 because there are no contradictions; the actual output fully aligns with the retrieval context.\"\n}",
                  "name": "generate_reason"
                },
                "id": "chatcmpl-tool-b7a832d113220dc7",
                "type": "function"
              }
            ]
          }
        }
      ],
      "created": 1767787225,
      "id": "chatcmpl-3c3e506e-2466-4a27-9a46-f5c43d051fed",
      "model": "openai.gpt-oss-120b-1:0",
      "object": "chat.completion",
      "service_tier": "default",
      "usage": {
        "completion_tokens": 127,
        "prompt_tokens": 343,
        "total_tokens": 470
      }
    }
  },
  "metrics": {
    "truths": {
      "duration": 1.1160260000033304,
      "llm_prompt_tokens": 347,
      "llm_completion_tokens": 149,
      "llm_cached_tokens": null,
      "model": "openai.gpt-oss-120b-1:0"
    },
    "claims": {
      "duration": 2.439254000026267,
      "llm_prompt_tokens": 644,
      "llm_completion_tokens": 690,
      "llm_cached_tokens": null,
      "model": "openai.gpt-oss-120b-1:0"
    },
    "verdicts": {
      "duration": 2.1723169999895617,
      "llm_prompt_tokens": 669,
      "llm_completion_tokens": 683,
      "llm_cached_tokens": null,
      "model": "openai.gpt-oss-120b-1:0"
    },
    "reason": {
      "duration": 0.5770500000216998,
      "llm_prompt_tokens": 343,
      "llm_completion_tokens": 127,
      "llm_cached_tokens": null,
      "model": "openai.gpt-oss-120b-1:0"
    }
  }
}

kevindew

Looks good, I've added a few comments - mostly interested in the situations where we have empty arrays and proceed anyway with LLM calls which feels a bit dubious

kevindew · 2026-01-07T12:59:25Z

+      verdicts = []
+    else
+      verdicts, llm_responses[:verdicts], metrics[:verdicts] = VerdictsGenerator.call(
+        claims:, retrieval_context: truths.join("\n\n"),


Are we concerned if truths is empty?

I think it also might be better to pass in an array of truths into VerdictGenerator rather than retrieval_context as this use of retrieval_context means something different to the other usage (aside: also seems a bit misbalanced with claims being an argument)

Yeah, that sounds better, I'll switch to a truths array.

Re: empty truths, this also matches DeepEval. When truths is empty, VerdictsGenerator is called with empty context and the LLM returns "idk" for all claims, resulting in score 1.0.

The alternative, skipping VerdictsGenerator, would yield the same score. So as above, it seems to me that it's worth deviating from DeepEval here.

As above, I've added a guard to short-circuit if truths is empty.

This adds the Faithfulness metric to the auto-evaluation module. It follows the established Ruby patterns from the AnswerRelevancy metric, using BedrockOpenAIOssInvoke to make tool calls to the LLM. The metric evaluates whether the AI's answer is faithful to the retrieval context through a multi-step process: 1. Extract truths from the retrieval context 2. Extract claims from the answer 3. Generate verdicts comparing claims against truths 4. Calculate score and generate a reason The score is calculated as the proportion of claims that don't contradict the retrieval context. Verdicts of "yes" and "idk" are treated as faithful (non-contradictory), while only "no" verdicts count against the score. This follows the DeepEval implementation. The metric returns early with a perfect score (1.0) when: - No claims are extracted from the answer - No truths are extracted from the retrieval context - No verdicts are generated - All verdicts are "yes" (no contradictions found)

This adds a new Rake task to generate faithfulness evaluation for a given question. Like the answer relevancy and coherence tasks it: 1. generates an answer for the input question using the existing answer composition pipeline 2. evaluates the faithfulness of the generated answer against the retrieval context using AutoEvaluation::Faithfulness 3. outputs the result json to stdout 4. handles error answers appropriately The key difference from the other metrics is that faithfulness evaluates the answer against the retrieval context (the sources used to generate the answer) rather than the original question. The retrieval context is extracted from the answer's used sources joined with double newlines, matching the DeepEval approach.

chaecramb · 2026-01-07T15:27:59Z

@kevindew this is ready for a re-review

kevindew

Looks good to me. Will be good to get Data Science to check it

jackbot · 2026-01-08T12:07:20Z

+  end
+
+  def used_sources
+    answer.sources.used


I think there's a little bug here that I encountered and which took me a while to figure out.

Using the used scope here actually doesn't return any records when this class is called from the Rake task, because that task doesn't persist any records to the database.

The answer that is built in the Rake task (via AutoEvaluation::EvaluateAnswerFromQuestionMessage) is built by the pipeline runner and isn't saved to the database. When you call answer.sources.used it runs a DB query to grab out the sources, but because there aren't any persisted, it always returns an empty relation.

If I run the Rake task as it stands on main, I get this:

$ INPUT="How do I start a new business?" rake evaluation:generate_faithfulness_evaluation {"score":1.0,"reason":"No truths were extracted from the retrieval context.","success":true,"llm_responses":{"truths":{"choices":[{"finish_reason":"tool_calls","index":0,"logprobs":null,"message":{"content":"{\n \"truths\": []\n}","refusal":null,"role":"assistant","tool_calls":[{"function":{"arguments":"{\n \"truths\": []\n}","name":"extract_truths"},"id":"chatcmpl-tool-b63f81baf8e1bc87","type":"function"}]}}],"created":1767873892,"id":"chatcmpl-a0f5b69c-372c-4bb8-99d9-4e8015b32457","model":"openai.gpt-oss-120b-1:0","object":"chat.completion","service_tier":"default","usage":{"completion_tokens":150,"prompt_tokens":347,"prompt_tokens_details":{"audio_tokens":0,"cached_tokens":64},"total_tokens":497}}},"metrics":{"truths":{"duration":0.8339748330181465,"llm_prompt_tokens":347,"llm_completion_tokens":150,"llm_cached_tokens":null,"model":"openai.gpt-oss-120b-1:0"}}}

So no search results were passed in as the retrieval context.

Whereas if I change this line to answer.sources.select(&:used) (i.e. filtering the array rather than using the scope), I get the expected result:

$ INPUT="How do I start a new business?" rake evaluation:generate_faithfulness_evaluation {"score":1.0,"reason":"The response is fully supported by the retrieval context.","success":true,"llm_responses":{"truths":{"choices":[{"finish_reason":"tool_calls","index":0,"logprobs":null,"message":{"content":"{\n \"truths\": [\n \"Starting your own business can be a thrilling and rewarding endeavor.\",\n \"It is important to begin with a solid business plan and clear objectives when starting a business.\",\n \"Researching your market thoroughly and understanding your competition is recommended.\",\n...

Oh great spot. It's a shame it's such a pain to get integration tests for the rake tasks as this feel something where we'd want a test to catch.

I imagine we want the tests for the Faithfulness class (and other auto eval routes) to use a FactoryBot.build(:answer) rather than a FactoryBot.create(:answer) since the answer may not be persisted

chaecramb force-pushed the add-faithfulness-metric branch from 6668c5e to 07c51df Compare December 30, 2025 13:36

govuk-ci temporarily deployed to govuk-chat-add-faithful-63ed36 December 30, 2025 13:39 Inactive

chaecramb force-pushed the add-faithfulness-metric branch 4 times, most recently from b097e2f to a372199 Compare January 6, 2026 16:38

chaecramb marked this pull request as ready for review January 7, 2026 10:27

chaecramb force-pushed the add-faithfulness-metric branch 3 times, most recently from 3dc3ce0 to d0b65c3 Compare January 7, 2026 13:20

kevindew reviewed Jan 7, 2026

View reviewed changes

chaecramb force-pushed the add-faithfulness-metric branch 3 times, most recently from f28ca79 to 4885b2f Compare January 7, 2026 15:20

chaecramb added 2 commits January 7, 2026 15:24

chaecramb force-pushed the add-faithfulness-metric branch from 4885b2f to e75c28f Compare January 7, 2026 15:24

kevindew approved these changes Jan 7, 2026

View reviewed changes

Comment thread lib/auto_evaluation/faithfulness.rb

chaecramb merged commit e1a9044 into main Jan 8, 2026
12 checks passed

chaecramb deleted the add-faithfulness-metric branch January 8, 2026 11:11

jackbot reviewed Jan 8, 2026

View reviewed changes

kevindew mentioned this pull request Jan 8, 2026

Auto eval: context relevancy #758

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add faithfulness metric#737

Add faithfulness metric#737
chaecramb merged 2 commits into
mainfrom
add-faithfulness-metric

chaecramb commented Dec 29, 2025 •

edited

Loading

Uh oh!

kevindew left a comment

Uh oh!

Uh oh!

Uh oh!

kevindew Jan 7, 2026

Uh oh!

chaecramb Jan 7, 2026

Uh oh!

chaecramb Jan 7, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chaecramb commented Jan 7, 2026

Uh oh!

kevindew left a comment

Uh oh!

Uh oh!

Uh oh!

jackbot Jan 8, 2026

Uh oh!

kevindew Jan 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

chaecramb commented Dec 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kevindew left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

kevindew Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

chaecramb Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

chaecramb Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chaecramb commented Jan 7, 2026

Uh oh!

kevindew left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jackbot Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

kevindew Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

chaecramb commented Dec 29, 2025 •

edited

Loading