Add Coherence metric by davidgisbey · Pull Request #726 · alphagov/govuk-chat

davidgisbey · 2025-12-22T13:58:11Z

Description

This ports the Coherence metric from the evaluation repo to our Ruby codebase.

It:

adds the metric to lib/auto_evaluation
adds a rake task that outputs a coherence evaluation result for a given input
adds RunMetricEvaluation, which handles building an answer and running a metric on that answer
adds schema validation against the tool output

Rake task

Result

Trello card

https://trello.com/c/cUIagBUx/2996-ruby-auto-eval-for-coherence-metric

kevindew

Looks great, mostly minor comments but with a potentially bigger suggestion on an abstraction for the Rake task

davidgisbey · 2026-01-05T09:25:11Z

Thanks for the review @kevindew. I made those changes. The last 2 commits implement the rake task and schema validation.

kevindew

Looks great, just a bit of tweaking with the code to run the task. If it's easier we could do that on a separate PR, but I think it should be ok.

This adds the Coherence metric to the auto-evaluation module. Much like the answer relevancy metric, the Coherence metric uses BedrockOpenAIOssInvoke to make a tool call to the LLM. It also returns a result object with the same attributes as the answer relevancy metric result object. I'll move this into a shared data class in a follow-up commit.

One difference is that the rubric score has to be normalised to a 0-1 scale. This has been ported directly from our eval codebase and follows these mappings:

1 - 0.0
2 - 0.25
3 - 0.5
4 - 0.75
5 - 1.0
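The mapping above is linear, so it can be expressed as (score - 1) / 4. The following is a minimal sketch of that arithmetic only; the method name `normalise_rubric_score` and the `RUBRIC_RANGE` constant are hypothetical and are not taken from the PR, which may implement the mapping differently (for example as a lookup table or with decimals).

```ruby
# Hypothetical helper, not the PR's code: maps a 1-5 rubric score
# onto the 0-1 scale listed in the commit message.
RUBRIC_RANGE = (1..5)

def normalise_rubric_score(rubric_score)
  unless rubric_score.is_a?(Integer) && RUBRIC_RANGE.cover?(rubric_score)
    raise ArgumentError, "rubric score must be an integer from 1 to 5, got #{rubric_score.inspect}"
  end

  # (score - min) / (max - min); the float division keeps the fractional part.
  (rubric_score - RUBRIC_RANGE.min) / (RUBRIC_RANGE.max - RUBRIC_RANGE.min).to_f
end

RUBRIC_RANGE.each do |score|
  puts "#{score} - #{normalise_rubric_score(score)}"
end
```

Rejecting out-of-range scores rather than clamping them is a design choice made for this sketch, so that an unexpected LLM output surfaces as an error instead of a plausible-looking score.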

Since the Result data class should be returned in the same format across multiple auto-evaluation metrics, it makes sense to define it in a common location and have the other classes use it.

This adds a new Rake task to generate a coherence evaluation for a given question. Much like the answer relevancy task it:

1. generates an answer for the input question using the existing answer composition pipeline
2. evaluates the coherence of the generated answer against the question using AutoEvaluation::Coherence
3. outputs the result JSON to stdout
4. handles error answers appropriately

Because there's so much shared functionality, I've added a shared example to the existing evaluation_spec to reduce duplication between the two tasks.

Once all the metrics are ported, we might want to consider updating this to have a single rake task that takes the metric as an argument rather than separate tasks for each metric. I've held off on doing this for now, just to make sure all the rake tasks do have shared logic with the exception of the metric called. I'm pretty sure they will though.

davidgisbey · 2026-01-05T15:21:09Z

@kevindew I made those changes. I renamed the new class to AutoEvaluation::RunEvaluation but I think the naming is too generic. I wonder if coupling it to ScoreResults is the way to go. Something like:

AutoEvaluation::ScoreResultEvaluation
AutoEvaluation::AnswerScoreResultEvaluation
AutoEvaluation::GenerateAutoEvaluationScoreResult
AutoEvaluation::GenerateAutoEvaluationFromInput

Or we could go down a slightly more explicit route that describes what it's actually doing:

AutoEvaluation::GenerateAndEvaluateAnswer
AutoEvaluation::EvaluateGeneratedAnswer

Another option is to just add a metrics directory as you previously suggested.

Then AutoEvaluation::EvaluateMetric or similar works fine.

Wdyt?

kevindew · 2026-01-05T16:02:30Z

Thanks, just dropping in between meetings to reply to the question.

Yeah, agree it sounds quite generic. I'm not sure it needs to reference score though, as I don't think there's anything coupled to ScoreResults here (unless I've missed something). Instead, I think the identifying aspect of this is that you have an input of a question, which gets turned into a basic answer, which is then evaluated.

So I think it's something like EvaluateAnswerFromQuestionMessage ?

davidgisbey · 2026-01-05T16:17:34Z

Great, I've gone with EvaluateAnswerFromQuestionMessage.

kevindew

Just a few super minor things then good to merge.

davidgisbey · 2026-01-06T09:48:28Z

@kevindew thanks for that. I've made those changes in this commit.

kevindew · 2026-01-06T09:58:04Z

+    llm_responses { {} }
+    metrics { {} }
+
+    initialize_with { attributes }


You're missing the new on this, it should be new(**attributes), so as it stands this will just be returning a hash.
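To illustrate the difference in plain Ruby, here is a small sketch that does not use FactoryBot itself. `ScoreResult` is a stand-in Struct and the `score` attribute is an assumption; only `llm_responses` and `metrics` appear in the diff above.

```ruby
# Stand-in for the real result class; the `score` attribute is assumed.
ScoreResult = Struct.new(:score, :llm_responses, :metrics, keyword_init: true)

attributes = { score: 0.75, llm_responses: {}, metrics: {} }

# Roughly what `initialize_with { attributes }` yields: the attributes hash itself.
built_without_new = attributes

# Roughly what `initialize_with { new(**attributes) }` yields: an instance of the class.
built_with_new = ScoreResult.new(**attributes)

puts built_without_new.class # Hash
puts built_with_new.class    # ScoreResult
puts built_with_new.score    # 0.75
```

A spec that only compares the returned object with `eq` would pass in either case, which is consistent with no test failing here.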

While this is just minor it did make me wonder should something have failed for this to be creating the wrong class?

Oh yes, sorry. Yeah, I would've expected some tests to fail 🤔

Btw using to_d does this

Not a big deal, but we might still want to round in the UI if it causes issues when rendering.

Ah the test just checks that the evaluation result is returned

    let(:score_result) { build(:auto_evaluation_score_result) }

    before do
      allow(evaluation_class)
        .to receive(:call)
        .and_return(score_result)
    end

    it "returns the AutoEvaluation::ScoreResult generated by the evaluation class" do
      result = described_class.call(
        evaluation_class: evaluation_klass,
        question_message:,
      )

      expect(result).to eq(score_result)
    end

It would've blown up on any integration test. I've got some in the next PR that would've failed.

The coherence class uses a stubbed response to build the runs, not an evaluation result.

Great thanks for digging into that.

For the decimal concern, I think when we render them they'll only come from the DB, and I'd have thought (but I haven't confirmed it) that as we have the decimal type in the DB they'll also be decimals, not floats. So I think we'd have this problem either way? Unless I've missed something.

We're adding a few metrics. Each of these requires a basic rake task that can be called with an INPUT environment variable. The task will generate a ScoreResult for the given metric and print it as JSON.

This adds the AutoEvaluation::EvaluateAnswerFromQuestionMessage class which:

- takes a question_message and an evaluation class as arguments
- generates an answer using the question_message
- calls the evaluation class with the question and answer to get a ScoreResult
- returns the ScoreResult

If the generated answer has an error status, it raises a TaskFailedError and the rake task handles outputting the error message to stderr and aborting.
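The flow described above can be sketched as follows. This is an illustration under stated assumptions, not the merged class: `Answer`, `ScoreResult`, `StubbedCoherence`, the injected `answer_builder`, the keyword names passed to the evaluation class, and the rule that an error status starts with "error" are all hypothetical stand-ins so the example runs on its own. The real class presumably calls the existing answer composition pipeline rather than taking a builder.

```ruby
require "json"

# Stand-ins so the sketch is self-contained; none of these are the app's real classes.
Answer = Struct.new(:message, :status, keyword_init: true)
ScoreResult = Struct.new(:score, :reason, keyword_init: true)

class TaskFailedError < StandardError; end

class EvaluateAnswerFromQuestionMessage
  def self.call(**kwargs)
    new(**kwargs).call
  end

  # answer_builder is injected here only so the example needs no pipeline.
  def initialize(evaluation_class:, question_message:, answer_builder:)
    @evaluation_class = evaluation_class
    @question_message = question_message
    @answer_builder = answer_builder
  end

  def call
    answer = @answer_builder.call(@question_message)

    # Assumption: error statuses are identified by an "error" prefix.
    if answer.status.start_with?("error")
      raise TaskFailedError, "Answer has an error status: #{answer.status}"
    end

    @evaluation_class.call(question_message: @question_message, answer_message: answer.message)
  end
end

# Hypothetical metric that returns a fixed result instead of calling an LLM.
class StubbedCoherence
  def self.call(question_message:, answer_message:)
    ScoreResult.new(score: 0.75, reason: "Stubbed reason for #{question_message.inspect}")
  end
end

build_answer = ->(question) { Answer.new(message: "Answer to: #{question}", status: "answered") }

begin
  result = EvaluateAnswerFromQuestionMessage.call(
    evaluation_class: StubbedCoherence,
    question_message: ENV.fetch("INPUT", "How do I renew my passport?"),
    answer_builder: build_answer,
  )
  puts JSON.generate(result.to_h)
rescue TaskFailedError => e
  # The rake task described above aborts at this point; the sketch only warns.
  warn e.message
end
```

Passing the evaluation class in as an argument is what lets each metric's rake task share this code and differ only in the metric it names.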

This updates the BedrockOpenAIOssInvoke class to include JSON schema validation for the structured output received from the LLM. It uses the first tool's schema, since all of our auto-eval metrics use a single tool. If this changes, we could consider passing the schema in via the method parameters down the line.

While doing this I noticed that I'd not been consistent in adhering to the schemas in some test cases, so I've updated those.
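As a rough illustration of validating against the first tool's schema, here is a deliberately simplified sketch. It hand-rolls checks for required properties and top-level types only and is a stand-in for a real JSON Schema library, which the PR may well use instead. The `TOOLS` structure, `validate_tool_output!`, and `SchemaValidationError` are hypothetical names and not the BedrockOpenAIOssInvoke API.

```ruby
# Hypothetical tool list; only the shape (tools, each carrying a JSON schema) matters.
TOOLS = [
  {
    name: "output_schema",
    schema: {
      "type" => "object",
      "required" => %w[score reason],
      "properties" => {
        "score" => { "type" => "integer" },
        "reason" => { "type" => "string" },
      },
    },
  },
].freeze

# Ruby classes accepted for each JSON Schema type name covered by this sketch.
JSON_TYPES = {
  "integer" => Integer,
  "number" => Numeric,
  "string" => String,
  "boolean" => [TrueClass, FalseClass],
  "object" => Hash,
  "array" => Array,
}.freeze

class SchemaValidationError < StandardError; end

def validate_tool_output!(tools, output)
  # Only the first tool's schema is used, mirroring the single-tool assumption.
  schema = tools.first.fetch(:schema)
  raise SchemaValidationError, "output should be an object" unless output.is_a?(Hash)

  errors = (schema.fetch("required", []) - output.keys).map do |name|
    "missing required property '#{name}'"
  end

  schema.fetch("properties", {}).each do |name, property_schema|
    next unless output.key?(name)

    expected = Array(JSON_TYPES[property_schema["type"]])
    next if expected.empty? || expected.any? { |klass| output[name].is_a?(klass) }

    errors << "property '#{name}' should be of type #{property_schema['type']}"
  end

  raise SchemaValidationError, errors.join("; ") if errors.any?

  output
end

puts validate_tool_output!(TOOLS, { "score" => 4, "reason" => "Clear and well ordered" }).inspect
```

Taking the schema from `tools.first` keeps the call site unchanged while every metric uses a single tool; passing the schema in explicitly would be the natural change if that stops being true.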

kevindew

Nice one - sorry this dragged on

govuk-ci temporarily deployed to govuk-chat-2996-add-coh-hzdvog December 22, 2025 14:03 Inactive

davidgisbey force-pushed the 2996-add-coherence-metric branch from cd44ef2 to 1a7c165 Compare December 22, 2025 14:06

govuk-ci temporarily deployed to govuk-chat-2996-add-coh-hzdvog December 22, 2025 14:06 Inactive

govuk-ci temporarily deployed to govuk-chat-2996-add-coh-hzdvog December 22, 2025 14:07 Inactive

davidgisbey force-pushed the 2996-add-coherence-metric branch from 10fcd94 to c8797ff Compare December 22, 2025 14:09

govuk-ci temporarily deployed to govuk-chat-2996-add-coh-hzdvog December 22, 2025 14:09 Inactive

davidgisbey force-pushed the 2996-add-coherence-metric branch from c8797ff to 9a95c26 Compare December 22, 2025 14:43

govuk-ci temporarily deployed to govuk-chat-2996-add-coh-hzdvog December 22, 2025 14:44 Inactive

davidgisbey force-pushed the 2996-add-coherence-metric branch from 9a95c26 to da1feec Compare December 22, 2025 14:51

govuk-ci temporarily deployed to govuk-chat-2996-add-coh-hzdvog December 22, 2025 14:51 Inactive

chaecramb reviewed Dec 22, 2025

Comment thread lib/auto_evaluation/coherence.rb Outdated

Comment thread lib/auto_evaluation/coherence.rb Outdated

Comment thread lib/auto_evaluation/coherence.rb Outdated

davidgisbey force-pushed the 2996-add-coherence-metric branch from da1feec to b1b2142 Compare December 23, 2025 09:37

govuk-ci temporarily deployed to govuk-chat-2996-add-coh-hzdvog December 23, 2025 09:37 Inactive

chaecramb mentioned this pull request Dec 29, 2025

Add faithfulness metric #737

Merged

davidgisbey force-pushed the 2996-add-coherence-metric branch 2 times, most recently from ea631f8 to 437b1d1 Compare December 30, 2025 11:02

kevindew reviewed Dec 30, 2025

Comment thread lib/auto_evaluation.rb Outdated

Comment thread lib/auto_evaluation/coherence.rb Outdated

Comment thread lib/auto_evaluation/coherence.rb Outdated

Comment thread spec/lib/tasks/evaluation_spec.rb Outdated

Comment thread lib/tasks/evaluation.rake Outdated

davidgisbey force-pushed the 2996-add-coherence-metric branch 8 times, most recently from c05b0a6 to 1caf3c2 Compare January 5, 2026 09:18

kevindew reviewed Jan 5, 2026

Comment thread lib/auto_evaluation/run_metric_evaluation.rb Outdated

Comment thread lib/auto_evaluation/run_metric_evaluation.rb Outdated

kevindew mentioned this pull request Jan 5, 2026

Add answer relevancy models and integrate into analysis workflow #713

Merged

davidgisbey force-pushed the 2996-add-coherence-metric branch from 1caf3c2 to 58e4105 Compare January 5, 2026 13:20

davidgisbey added 2 commits January 5, 2026 13:56

Add AutoEvaluation::Result data class

600147e

davidgisbey force-pushed the 2996-add-coherence-metric branch from 58e4105 to 6a023fd Compare January 5, 2026 13:56

davidgisbey force-pushed the 2996-add-coherence-metric branch from 6a023fd to 29c51e1 Compare January 5, 2026 16:13

kevindew reviewed Jan 5, 2026

davidgisbey force-pushed the 2996-add-coherence-metric branch from 29c51e1 to 9924db6 Compare January 6, 2026 09:42

kevindew reviewed Jan 6, 2026

Comment thread spec/lib/auto_evaluation/evaluate_answer_from_question_message_spec.rb Outdated

davidgisbey force-pushed the 2996-add-coherence-metric branch from 9924db6 to 6e93298 Compare January 6, 2026 10:11

davidgisbey added 2 commits January 6, 2026 10:16

davidgisbey force-pushed the 2996-add-coherence-metric branch from 6e93298 to 7ab136b Compare January 6, 2026 10:16

kevindew approved these changes Jan 6, 2026

davidgisbey merged commit 4a878b4 into main Jan 6, 2026
12 checks passed

davidgisbey deleted the 2996-add-coherence-metric branch January 6, 2026 10:22

4 participants
