Add Coherence metric#726

Merged
davidgisbey merged 5 commits into
mainfrom
2996-add-coherence-metric
Jan 6, 2026

Conversation

@davidgisbey
Contributor

@davidgisbey davidgisbey commented Dec 22, 2025

Description

This ports the Coherence metric from the evaluation repo to our Ruby codebase.

It:

  • adds the metric to lib/auto_evaluation
  • adds a rake task that outputs a coherence evaluation result for a given input
  • adds RunMetricEvaluation, which handles building an answer and running a metric on that answer
  • adds schema validation against the tool output

Rake task

image

Result

image

Trello card

https://trello.com/c/cUIagBUx/2996-ruby-auto-eval-for-coherence-metric

@govuk-ci govuk-ci temporarily deployed to govuk-chat-2996-add-coh-hzdvog December 22, 2025 14:03 Inactive
@davidgisbey davidgisbey force-pushed the 2996-add-coherence-metric branch from cd44ef2 to 1a7c165 Compare December 22, 2025 14:06
@govuk-ci govuk-ci temporarily deployed to govuk-chat-2996-add-coh-hzdvog December 22, 2025 14:06 Inactive
@govuk-ci govuk-ci temporarily deployed to govuk-chat-2996-add-coh-hzdvog December 22, 2025 14:07 Inactive
@davidgisbey davidgisbey force-pushed the 2996-add-coherence-metric branch from 10fcd94 to c8797ff Compare December 22, 2025 14:09
@govuk-ci govuk-ci temporarily deployed to govuk-chat-2996-add-coh-hzdvog December 22, 2025 14:09 Inactive
@davidgisbey davidgisbey force-pushed the 2996-add-coherence-metric branch from c8797ff to 9a95c26 Compare December 22, 2025 14:43
@govuk-ci govuk-ci temporarily deployed to govuk-chat-2996-add-coh-hzdvog December 22, 2025 14:44 Inactive
@davidgisbey davidgisbey force-pushed the 2996-add-coherence-metric branch from 9a95c26 to da1feec Compare December 22, 2025 14:51
@govuk-ci govuk-ci temporarily deployed to govuk-chat-2996-add-coh-hzdvog December 22, 2025 14:51 Inactive
Comment thread lib/auto_evaluation/coherence.rb Outdated
Comment thread lib/auto_evaluation/coherence.rb Outdated
Comment thread lib/auto_evaluation/coherence.rb Outdated
@davidgisbey davidgisbey force-pushed the 2996-add-coherence-metric branch from da1feec to b1b2142 Compare December 23, 2025 09:37
@govuk-ci govuk-ci temporarily deployed to govuk-chat-2996-add-coh-hzdvog December 23, 2025 09:37 Inactive
@davidgisbey davidgisbey force-pushed the 2996-add-coherence-metric branch 2 times, most recently from ea631f8 to 437b1d1 Compare December 30, 2025 11:02
Member

@kevindew kevindew left a comment

Looks great, mostly minor comments but with a potentially bigger suggestion on an abstraction for the Rake task

Comment thread lib/auto_evaluation.rb Outdated
Comment thread lib/auto_evaluation/coherence.rb Outdated
Comment thread lib/auto_evaluation/coherence.rb Outdated
Comment thread spec/lib/tasks/evaluation_spec.rb Outdated
Comment thread lib/tasks/evaluation.rake Outdated
@davidgisbey davidgisbey force-pushed the 2996-add-coherence-metric branch 8 times, most recently from c05b0a6 to 1caf3c2 Compare January 5, 2026 09:18
@davidgisbey
Contributor Author

Thanks for the review @kevindew. I made those changes. The last 2 commits implement the rake task and schema validation.

Member

@kevindew kevindew left a comment

Looks great, just a bit of tweaking with the code to run the task. If easier we could do that on a separate PR but I think it should be ok.

Comment thread lib/auto_evaluation/run_metric_evaluation.rb Outdated
Comment thread lib/auto_evaluation/run_metric_evaluation.rb Outdated
This adds the Coherence metric to the auto-evaluation module. Much like
the answer relevancy metric, the Coherence metric uses
BedrockOpenAIOssInvoke to make a tool call to the LLM.

It also returns a result object with the same attributes as the answer
relevancy metric result object. I'll move this into a shared data class
in a follow up commit.

One difference is that the rubric score has to be normalised to a
0-1 scale. This has been directly ported from our eval codebase
and follows these mappings:

1 - 0.0
2 - 0.25
3 - 0.5
4 - 0.75
5 - 1.0
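
The mapping above is linear, so it can be sketched as a single expression. This is a minimal illustration, not the code from the PR: the helper name `normalise_rubric_score` and the `RUBRIC_RANGE` constant are assumptions.

```ruby
# Hypothetical helper: maps a 1-5 rubric score onto the 0-1 scale
# listed in the commit message (1 -> 0.0, 2 -> 0.25, ... 5 -> 1.0).
RUBRIC_RANGE = (1..5)

def normalise_rubric_score(rubric_score)
  unless rubric_score.is_a?(Integer) && RUBRIC_RANGE.cover?(rubric_score)
    raise ArgumentError, "rubric score must be an integer from 1 to 5, got #{rubric_score.inspect}"
  end

  # Shift so the lowest score is 0, then divide by the width of the range.
  (rubric_score - RUBRIC_RANGE.min) / (RUBRIC_RANGE.max - RUBRIC_RANGE.min).to_f
end

(1..5).each { |score| puts "#{score} - #{normalise_rubric_score(score)}" }
```

Rejecting out-of-range values up front is a design choice in this sketch; the ported code may handle unexpected LLM output differently.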
Since the Result data class should be returned in the same format
across multiple auto-evaluation metrics, it makes sense to define it
in a common location and have the other classes use it.
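
As a rough sketch of the idea, a shared result type lets every metric return the same shape. Everything here is an assumption for illustration: the module name `AutoEvaluationSketch`, the attribute names `score` and `reason`, and the stubbed metric bodies. A `Struct` is used so the snippet runs on older Rubies; the real data class may be defined differently.

```ruby
module AutoEvaluationSketch
  # Hypothetical shared result type. llm_responses and metrics appear in the
  # PR's factory; score and reason are assumed attribute names.
  ScoreResult = Struct.new(:score, :reason, :llm_responses, :metrics, keyword_init: true)

  # Stand-in metric classes: both return the shared type instead of
  # defining their own near-identical result classes.
  class Coherence
    def self.call(_question, _answer)
      ScoreResult.new(score: 0.75, reason: "Mostly coherent", llm_responses: {}, metrics: {})
    end
  end

  class AnswerRelevancy
    def self.call(_question, _answer)
      ScoreResult.new(score: 1.0, reason: "Fully relevant", llm_responses: {}, metrics: {})
    end
  end
end

results = [AutoEvaluationSketch::Coherence, AutoEvaluationSketch::AnswerRelevancy].map do |metric|
  metric.call("What is VAT?", "VAT is a tax on goods and services.")
end

results.each { |result| puts "#{result.score}: #{result.reason}" }
```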
This adds a new Rake task to generate a coherence evaluation for a given question.
Much like the answer relevancy task it:

1. generates an answer for the input question using the existing
   answer composition pipeline
2. evaluates the coherence of the generated answer against the question
   using AutoEvaluation::Coherence
3. outputs the result json to stdout
4. handles error answers appropriately

Because there's so much shared functionality I've added a shared example
to the existing evaluation_spec to reduce duplication between the two tasks.

Once all the metrics are ported, we might want to consider updating this
so we have a single rake task that takes the metric as an argument rather
than separate tasks for each metric.

I've held off on doing this for now just to make sure all the rake tasks
do have shared logic with the exception of the metric called. I'm pretty
sure they will though.
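
To illustrate the single-task idea floated above, the metric could be looked up by name and everything else shared. This is only a sketch in plain Ruby: the `METRICS` table, the `run_metric` helper, the stand-in metric classes and the stubbed answer are all hypothetical, and in a real Rakefile the helper would be called from a task that receives the metric name as an argument.

```ruby
require "json"

# Stand-in metric classes; the real ones call an LLM.
module SketchMetrics
  class Coherence
    def self.call(question:, answer:)
      { metric: "coherence", score: 0.75 }
    end
  end

  class AnswerRelevancy
    def self.call(question:, answer:)
      { metric: "answer_relevancy", score: 1.0 }
    end
  end
end

# Hypothetical lookup table from a task argument to a metric class.
METRICS = {
  "coherence" => SketchMetrics::Coherence,
  "answer_relevancy" => SketchMetrics::AnswerRelevancy,
}.freeze

def run_metric(metric_name, input)
  metric = METRICS.fetch(metric_name) do
    raise ArgumentError, "unknown metric #{metric_name.inspect}, expected one of: #{METRICS.keys.join(', ')}"
  end

  # The real task would build the answer with the answer composition pipeline.
  answer = "stubbed answer to: #{input}"
  JSON.generate(metric.call(question: input, answer: answer))
end

puts run_metric("coherence", "How do I renew my passport?")
```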
@davidgisbey davidgisbey force-pushed the 2996-add-coherence-metric branch from 58e4105 to 6a023fd Compare January 5, 2026 13:56
@davidgisbey
Contributor Author

davidgisbey commented Jan 5, 2026

@kevindew i made those changes. I renamed the new class to AutoEvaluation::RunEvaluation but i think the naming is too generic. I wonder if coupling it to ScoreResults is the way to go. Something like:

AutoEvaluation::ScoreResultEvaluation
AutoEvaluation::AnswerScoreResultEvaluation
AutoEvaluation::GenerateAutoEvaluationScoreResult
AutoEvaluation::GenerateAutoEvaluationFromInput

Or we could go down a slightly more explicit route for what it's actually doing

AutoEvaluation::GenerateAndEvaluateAnswer
AutoEvaluation::EvaluateGeneratedAnswer

Another option is to just add a metrics directory as you previously suggested.

Then AutoEvaluation::EvaluateMetric or similar works fine.

Wdyt?

@kevindew
Member

kevindew commented Jan 5, 2026

Thanks, just dropping in between meetings to reply to the question.

Yeah agree it sounds quite generic. I'm not sure it needs to reference score though, as I don't think there's something coupled to ScoreResults here (unless I've missed something). Instead I think the identifying aspect of this is that you have an input of a question, which gets turned into a basic answer, and that answer is then evaluated.

So I think it's something like EvaluateAnswerFromQuestionMessage ?

@davidgisbey davidgisbey force-pushed the 2996-add-coherence-metric branch from 6a023fd to 29c51e1 Compare January 5, 2026 16:13
@davidgisbey
Contributor Author

Great i've gone with EvaluateAnswerFromQuestionMessage

Member

@kevindew kevindew left a comment

Just a few super minor things then good to merge.

Comment thread spec/factories/auto_evaluation_score_result.rb Outdated
Comment thread spec/factories/auto_evaluation_score_result.rb
Comment thread spec/factories/auto_evaluation_score_result.rb Outdated
Comment thread spec/lib/auto_evaluation/evaluate_answer_from_question_message_spec.rb Outdated
@davidgisbey davidgisbey force-pushed the 2996-add-coherence-metric branch from 29c51e1 to 9924db6 Compare January 6, 2026 09:42
@davidgisbey
Contributor Author

@kevindew thanks for that. I've made those changes in this commit.

llm_responses { {} }
metrics { {} }

initialize_with { attributes }
Member

You're missing the new on this new(**attributes) so this will be just returning a hash.

While this is just minor it did make me wonder should something have failed for this to be creating the wrong class?

Contributor Author

@davidgisbey davidgisbey Jan 6, 2026

oh yes sorry. Yea i would've expected some tests to fail 🤔

Btw using to_d does this

image

Not a big deal, but we might still want to round in the UI if it causes issues when rendering.

Contributor Author

@davidgisbey davidgisbey Jan 6, 2026

Ah the test just checks that the evaluation result is returned

let(:score_result) { build(:auto_evaluation_score_result) }

before do
  allow(evaluation_class)
    .to receive(:call)
    .and_return(score_result)
end

it "returns the AutoEvaluation::ScoreResult generated by the evaluation class" do
  result = described_class.call(
    evaluation_class: evaluation_klass,
    question_message:,
  )
  expect(result).to eq(score_result)
end


It would've blown up on any integration test. I've got some in the next PR that would've failed.

The coherence class uses a stubbed response to build the runs, not an evaluation result.

Member

Great thanks for digging into that.

For the decimal concern I think when we render them they'll only come from the DB and I'd have thought (but I haven't confirmed it) that as we have the decimal type in the DB they'll also be decimals not floats, so I think we'd have this problem either way? Unless I've missed something

Comment thread spec/lib/auto_evaluation/evaluate_answer_from_question_message_spec.rb Outdated
@davidgisbey davidgisbey force-pushed the 2996-add-coherence-metric branch from 9924db6 to 6e93298 Compare January 6, 2026 10:11
We're adding a few metrics. Each of these requires a basic rake task
that can be called with an INPUT environment variable. The task will
generate a ScoreResult for the given metric and print it as JSON.

This adds the AutoEvaluation::EvaluateAnswerFromQuestionMessage class which:

- takes a question_message and an evaluation class as arguments
- generates an answer using the question_message
- calls the evaluation class with the question and answer to get a ScoreResult
- returns the ScoreResult

If the generated answer has an error status, it raises a TaskFailedError
and the rake task handles outputting the error message to stderr and aborting.
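
The flow described in this commit message could look roughly like the sketch below. It is not the PR's implementation: the `answer_generator:` argument is an assumption added so the snippet runs without the real answer composition pipeline, and the `Answer` struct, the status values and the evaluation class's keyword arguments are likewise hypothetical.

```ruby
module AutoEvaluationSketch
  class TaskFailedError < StandardError; end

  # Hypothetical stand-in for a generated answer.
  Answer = Struct.new(:message, :status, keyword_init: true) do
    def error?
      status.to_s.start_with?("error")
    end
  end

  class EvaluateAnswerFromQuestionMessage
    # answer_generator is injected here purely so the sketch is self-contained.
    def self.call(evaluation_class:, question_message:, answer_generator:)
      answer = answer_generator.call(question_message)

      if answer.error?
        raise TaskFailedError, "Answer generation failed with status: #{answer.status}"
      end

      evaluation_class.call(question_message: question_message, answer_message: answer.message)
    end
  end
end

# Stubbed collaborators for the usage example.
class StubCoherence
  def self.call(question_message:, answer_message:)
    { score: 0.75, question: question_message, answer: answer_message }
  end
end

generator = lambda do |question|
  AutoEvaluationSketch::Answer.new(message: "Answer to #{question}", status: "answered")
end

begin
  result = AutoEvaluationSketch::EvaluateAnswerFromQuestionMessage.call(
    evaluation_class: StubCoherence,
    question_message: "How do I register to vote?",
    answer_generator: generator,
  )
  puts result
rescue AutoEvaluationSketch::TaskFailedError => e
  # The rake task described above would print this and abort.
  warn e.message
end
```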
This updates the BedrockOpenAIOssInvoke class to include JSON schema validation
for the structured output received from the LLM.

It uses the first tool's schema since all of our auto-eval metrics
use a single tool. If this is to change we could consider passing the
schema in via the method parameters down the line.

While doing this I noticed that I'd not been consistent in adhering to
the schemas in some test cases, so I've updated those.
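
For illustration, validating the LLM's structured output against the first tool's schema might look like the sketch below. It makes several assumptions: the tool definition follows the OpenAI-style `function.parameters` layout, the `validate_tool_output!` helper and `SchemaValidationError` are hypothetical names, and the hand-rolled check of `required` and `type` is only a stand-in for a proper JSON Schema validator, which the PR may well use instead.

```ruby
require "json"

class SchemaValidationError < StandardError; end

# Minimal type checks, a stand-in for a full JSON Schema library.
TYPE_CHECKS = {
  "string" => ->(value) { value.is_a?(String) },
  "integer" => ->(value) { value.is_a?(Integer) },
  "number" => ->(value) { value.is_a?(Numeric) },
  "boolean" => ->(value) { value == true || value == false },
  "object" => ->(value) { value.is_a?(Hash) },
  "array" => ->(value) { value.is_a?(Array) },
}.freeze

def validate_tool_output!(tools, raw_arguments)
  # Only the first tool's schema is used, mirroring the commit message.
  schema = tools.first.dig("function", "parameters")
  output = JSON.parse(raw_arguments)
  raise SchemaValidationError, "output is not a JSON object" unless output.is_a?(Hash)

  errors = schema.fetch("required", []).reject { |key| output.key?(key) }
                 .map { |key| "missing required property '#{key}'" }

  schema.fetch("properties", {}).each do |key, definition|
    next unless output.key?(key)

    check = TYPE_CHECKS[definition["type"]]
    errors << "property '#{key}' is not of type #{definition['type']}" if check && !check.call(output[key])
  end

  raise SchemaValidationError, errors.join("; ") if errors.any?

  output
end

# Hypothetical tool definition for a coherence metric.
TOOLS = [
  {
    "type" => "function",
    "function" => {
      "name" => "output_schema",
      "parameters" => {
        "type" => "object",
        "properties" => { "score" => { "type" => "integer" }, "reason" => { "type" => "string" } },
        "required" => %w[score reason],
      },
    },
  },
].freeze

puts validate_tool_output!(TOOLS, '{"score": 4, "reason": "Clear and well ordered"}')
```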
@davidgisbey davidgisbey force-pushed the 2996-add-coherence-metric branch from 6e93298 to 7ab136b Compare January 6, 2026 10:16
Member

@kevindew kevindew left a comment

Nice one - sorry this dragged on

@davidgisbey davidgisbey merged commit 4a878b4 into main Jan 6, 2026
12 checks passed
@davidgisbey davidgisbey deleted the 2996-add-coherence-metric branch January 6, 2026 10:22
4 participants