Add answer relevancy models and integrate into analysis workflow by davidgisbey · Pull Request #713 · alphagov/govuk-chat

davidgisbey · 2025-12-16T15:04:44Z

Description

This spikes out integrating answer relevancy into the analysis workflow. It adds:

the AnswerRelevancyAggregate model, which captures the aggregate score
the AnswerRelevancyRun model, which captures the score, reason, LLM responses and metrics for a single metric run
the AnswerRelevancyJob, which makes multiple calls to the AnswerRelevancy class, compiles the results, persists the AnswerRelevancyAggregate record with the mean score, and calls the AnswerRelevancyAggregate#create_run_from_result method to persist individual runs
the BaseMetricJob, which provides shared functionality between metrics
an additional Bedrock stub that handles stubbing all 3 LLM calls, which we can reuse across our specs
a new answer relevancy tab on the question show page
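
As a rough illustration of the flow described in the list above (run the metric several times, compile the results, then store one aggregate with a mean score plus one record per run), here is a minimal plain-Ruby sketch. The `ScoreResult`, `Run` and `Aggregate` structs, the `evaluate_answer` method and the stubbed evaluator are all hypothetical stand-ins, not the PR's ActiveRecord models or jobs; only the mean calculation mirrors the diff quoted later in the review.

```ruby
# Hypothetical stand-ins: the real PR uses ActiveRecord models and a job class.
ScoreResult = Struct.new(:score, :reason, keyword_init: true)
Run = Struct.new(:score, :reason, keyword_init: true)
Aggregate = Struct.new(:answer_id, :mean_score, :runs, keyword_init: true)

# The description mentions 3 runs for each metric.
RUNS_PER_METRIC = 3

# evaluator is any callable that returns a ScoreResult for a single run.
def evaluate_answer(answer_id, evaluator, runs: RUNS_PER_METRIC)
  results = Array.new(runs) { evaluator.call(answer_id) }
  # Same shape as the calculation in the reviewed diff.
  mean_score = results.sum(&:score) / results.size.to_f

  Aggregate.new(
    answer_id: answer_id,
    mean_score: mean_score,
    runs: results.map { |result| Run.new(score: result.score, reason: result.reason) },
  )
end

# Stubbed evaluator that returns a different score on each call.
scores = [0.5, 1.0, 0.75].each
evaluator = ->(_answer_id) { ScoreResult.new(score: scores.next, reason: "stubbed") }

aggregate = evaluate_answer(42, evaluator)
puts aggregate.mean_score
puts aggregate.runs.size
```

Passing the evaluator in as a callable keeps the sketch free of any LLM client, which is the same reason the PR adds a reusable Bedrock stub for its specs.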

I've renamed the analysis tab to topics, since it now only contains topics. The reason I added another tab for the metric is that, with 3 runs for each metric, a single analysis tab would get incredibly noisy: there'd be an absolute tonne of metrics and LLM responses. If we want to keep analysis in one tab we could split the topics and each metric into its own summary list, but it's a bit easier to navigate with many tabs in my opinion.

Finally it updates the ComposeAnswerJob to call the AnswerRelevancyJob once an answer has been composed and persisted.

Screenshots

Trello card

https://trello.com/c/cUaZjHSb/3042-integrate-answerrelevancy-into-analysis-workflow

kevindew

Great work David, I think my most significant suggestion is having metrics have individual tables

kevindew

Oops sorry I hadn't submitted this

davidgisbey · 2025-12-18T17:13:48Z

@kevindew thanks i made those changes. Have a good break!

kevindew

Looks good, I've put in a few comments before running out of time

kevindew · 2025-12-18T17:46:30Z

+        return true
+      end
+
+      false


You could probably simplify this method down to:

    eligible = Answer.status_answered.exists?(id: answer_id)
    if !eligible
      logger.warn("Couldn't find an answer #{answer_id} that was eligible for auto-evaluation")
    end
    eligible

it'd save the whole hydration

I've gone with this. The only downside is that we warn on ineligibility rather than log it as info. It's not really a warning, since we call it on all answers and know some won't be eligible. Not a massive deal though.
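
For comparison, a minimal sketch of the info-level alternative mentioned in the reply above, using Ruby's standard `Logger`. The `ANSWERED_IDS` constant and the `eligible_for_auto_evaluation?` method name are hypothetical; the constant stands in for the `Answer.status_answered.exists?(id: answer_id)` query so the example runs without Rails.

```ruby
require "logger"
require "stringio"

# Hypothetical stand-in for Answer.status_answered.exists?(id: answer_id).
ANSWERED_IDS = [1, 2, 3].freeze

def eligible_for_auto_evaluation?(answer_id, logger)
  eligible = ANSWERED_IDS.include?(answer_id)
  unless eligible
    # Info rather than warn: ineligible answers are expected, not exceptional.
    logger.info("Couldn't find an answer #{answer_id} that was eligible for auto-evaluation")
  end
  eligible
end

log_output = StringIO.new
logger = Logger.new(log_output)

puts eligible_for_auto_evaluation?(1, logger)
puts eligible_for_auto_evaluation?(99, logger)
puts log_output.string
```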

kevindew · 2025-12-18T17:52:31Z

+  included do
+    def self.create_mean_aggregate_and_score_runs(answer, results)
+      mean_score = results.sum(&:score) / results.size.to_f
+      aggregate = create!(answer:, mean_score:)


This should probably be a `new`, since we're doing the save! a few lines below.

We should also consider the risk of a partial write here. I assume, but am not sure, that a save! that saves multiple records does not run in a transaction, so we may need to wrap that save! in a transaction block to ensure it can't be a partial write.

Great. I just tested the behaviour and we don't need to worry about using a transaction.

kevindew

Looks great, does look like there's a few small things to sort

kevindew · 2025-12-30T17:03:00Z

+      llm_responses: {
+        "response_1" => { "content" => "LLM response content 1" },
+        "response_2" => { "content" => "LLM response content 2" },
+      },
+      metrics: {
+        "metric_1" => { "detail" => "Metric detail 1" },
+        "metric_2" => { "detail" => "Metric detail 2" },
+      },


I think we should consider having these handled in the factory since it's quite verbose. I imagine you could have a transient attribute that is a sequence, which can define LLM responses with data unique to that instance, if it's good that they're different each time.

You could probably use that sequence for reason too so you don't need to set that and then you might get down to a one-liner.

Same as above. I'll tackle that when the other PR is merged.

Looking back at this with fresh eyes i wonder if we should just update the factory to use a sequence for reason, llm_responses and metrics i.e.

    FactoryBot.define do
      factory :auto_evaluation_score_result, class: "AutoEvaluation::ScoreResult" do
        score { 0.85 }
        sequence(:reason) { |n| "Reason #{n}" }
        success { true }
        sequence(:llm_responses) { |n| { "llm_response" => { "reason" => "Reason #{n}" } } }
        sequence(:metrics) { |n| { "llm_response" => { "duration" => n } } }

        initialize_with do
          new(
            score:,
            reason:,
            success:,
            llm_responses:,
            metrics:,
          )
        end
      end
    end

It feels a bit simpler than worrying about transient attrs. Wdyt?

kevindew

I've run out of time before lunch, but I expect the mean decimal thing is worth submitting now before looking further, in case it's worth chatting over whether that is actually the best thing to do or whether we should have more rounding in the code.

davidgisbey · 2026-01-06T11:10:51Z

@kevindew I think this is good for a re-review. I've broken that giant commit that did all the integration up into a few smaller ones for ease of review. It's now 3 commits.

the first adds the concern that handles persisting the records
the second adds the answer relevancy and base jobs
the third calls the above jobs from the compose answer job and stubs the request in the various system specs

kevindew

Great job - thanks for breaking up the commits a bit

This adds a migration for the two new tables needed to store answer relevancy metrics. It also adds the corresponding models and factories. We will need to record multiple LLM responses and metrics for each run, so I've included the LlmCallsRecordable module in the AnswerRelevancyRun model.

We're going to need to stub out these calls in multiple places so it makes sense to have a single method that does all the stubbing for us. I've also prepended stub_ to bedrock_invoke_model_openai_oss_tool_call. All other stubs have this so it makes sense to be consistent.

This adds a concern to encapsulate the logic for creating aggregate and run records for metrics. It will be called from the various evaluation jobs that require wisdom of the crowd.

This adds the BaseMetricJob and AnswerRelevancyJob. The AnswerRelevancyJob handles:
- making calls to the AnswerRelevancy class
- compiling the results
- calling the AnswerRelevancyAggregate#create_run_from_result method to delegate record creation to the AutoEvaluationMetricRun model

The base job is used to store shared functionality for future metric jobs. The next commit will integrate this job into the analysis workflow.

As part of this commit I've updated the ScoreResult factory to use a sequence to build unique attributes for the reason, llm_responses and metrics fields. This ensures that we are correctly persisting all the attributes returned from the evaluation classes.

I've also updated the answer relevancy scoring method to use BigDecimal as part of this commit. Without this I was forced to use round(2) in the tests to avoid rounding issues caused by floats.
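
To illustrate the float rounding issue mentioned in the commit message above, here is a small sketch comparing float arithmetic with a BigDecimal mean. The `mean_score` helper is hypothetical and only shows the general technique; how the PR's scoring method actually applies BigDecimal is not shown in this thread.

```ruby
require "bigdecimal"

# Hypothetical helper: convert each score to BigDecimal via its string form
# so that 0.1 becomes exactly 0.1 rather than its binary approximation.
def mean_score(scores)
  decimals = scores.map { |score| BigDecimal(score.to_s) }
  decimals.sum / decimals.size
end

# The classic float artefact that makes exact equality in tests unreliable.
puts 0.1 + 0.2 == 0.3 # false

# With BigDecimal the mean compares equal to the expected decimal value.
puts mean_score([0.1, 0.2, 0.3]) == BigDecimal("0.2") # true
```

With exact decimal values, specs can compare against the expected mean directly instead of rounding both sides first.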

This updates the compose answer job to call the answer relevancy job after an answer has been successfully composed and persisted.

I've added an additional tab for answer relevancy metrics in the admin interface on the question show page. My thoughts for this are if we don't split out the metrics into their own tabs then the page will get incredibly noisy. This makes it easier to navigate. Due to this, i've renamed the analysis tab to topics.

davidgisbey · 2026-01-07T09:40:32Z

Thanks for the review @kevindew. I've made those changes.

We've got a few places in our codebase where we want to use the rephrased question if it exists, otherwise fall back to the original question message in our LLM calls. This adds the Answer#question_used method to encapsulate that logic, and updates all relevant places to use this new method. I've removed the tests that were specifically checking for the rephrased question logic in the metrics, since that is now covered by the new method.
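
A minimal sketch of the fallback that the commit message above describes, written with plain Ruby structs rather than the real ActiveRecord models. The attribute names `rephrased_question` and `message`, and the use of `||`, are assumptions made for illustration; the actual `Answer#question_used` implementation may differ (for example in how it treats blank strings).

```ruby
# Hypothetical stand-ins for the Question and Answer models.
Question = Struct.new(:message, keyword_init: true)

Answer = Struct.new(:question, :rephrased_question, keyword_init: true) do
  # Prefer the rephrased question when present, otherwise fall back to
  # the original question message.
  def question_used
    rephrased_question || question.message
  end
end

question = Question.new(message: "How do I renew it?")

with_rephrase = Answer.new(question: question, rephrased_question: "How do I renew my passport?")
without_rephrase = Answer.new(question: question, rephrased_question: nil)

puts with_rephrase.question_used
puts without_rephrase.question_used
```

Keeping the fallback in one method means callers building LLM prompts do not each need their own nil check.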

kevindew

Looks good to me 👍

davidgisbey · 2026-01-07T10:30:53Z

Thanks for the review!

govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 16, 2025 15:09 Inactive

davidgisbey force-pushed the migrate-analysis-to-answer-topics branch 2 times, most recently from 6a24efb to eeef206 Compare December 16, 2025 16:36

davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 78dbeb7 to 0b90024 Compare December 16, 2025 16:46

govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 16, 2025 16:47 Inactive

davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 0b90024 to a9c890b Compare December 16, 2025 16:53

govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 16, 2025 16:53 Inactive

kevindew reviewed Dec 16, 2025

Comment thread app/jobs/auto_evaluation_metric_job.rb Outdated

Comment thread app/jobs/auto_evaluation_metric_job.rb Outdated

Comment thread app/jobs/auto_evaluation_metric_job.rb Outdated

Comment thread db/schema.rb Outdated

davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from a9c890b to 28829ab Compare December 17, 2025 13:47

govuk-ci had a problem deploying to govuk-chat-add-metrics--dggzud December 17, 2025 13:48 Failure

davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 28829ab to 162525a Compare December 17, 2025 13:53

govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 17, 2025 13:53 Inactive

davidgisbey force-pushed the migrate-analysis-to-answer-topics branch from eeef206 to 1a43792 Compare December 17, 2025 14:26

davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 162525a to bf6f62c Compare December 17, 2025 14:26

govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 17, 2025 14:26 Inactive

davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from bf6f62c to 1373e1b Compare December 17, 2025 14:32

govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 17, 2025 14:33 Inactive

davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 1373e1b to ce6e49a Compare December 17, 2025 15:07

govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 17, 2025 15:07 Inactive

davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from ce6e49a to 33cc17f Compare December 17, 2025 15:18

govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 17, 2025 15:18 Inactive

govuk-ci requested a deployment to govuk-chat-add-metrics--dggzud December 17, 2025 15:18 Abandoned

davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 33cc17f to d240b06 Compare December 17, 2025 15:24

govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 17, 2025 15:24 Inactive

govuk-ci requested a deployment to govuk-chat-add-metrics--dggzud December 17, 2025 15:24 Abandoned

davidgisbey changed the title ~~Add metrics data models and integrate into workflow~~ Add answer relevancy models and integrate into analysis workflow Dec 17, 2025

govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 17, 2025 16:36 Inactive

davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from e1b06d6 to 09b2a6f Compare December 17, 2025 17:12

govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 17, 2025 17:12 Inactive

govuk-ci requested a deployment to govuk-chat-add-metrics--dggzud December 17, 2025 17:12 Abandoned

govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 18, 2025 09:51 Inactive

kevindew reviewed Dec 18, 2025

davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from c5cc167 to 7a69206 Compare December 18, 2025 16:27

govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 18, 2025 16:27 Inactive

davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 7a69206 to 3315cbf Compare December 18, 2025 16:50

govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 18, 2025 16:50 Inactive

davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 3315cbf to 4b608fa Compare December 18, 2025 16:57

govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 18, 2025 16:57 Inactive

davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 4b608fa to 5791f27 Compare December 18, 2025 17:04

govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 18, 2025 17:04 Inactive

kevindew reviewed Dec 18, 2025

davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 5791f27 to f78d8b9 Compare December 19, 2025 13:13

kevindew mentioned this pull request Dec 23, 2025

Rename AnswerAnalysis to AnswerAnalysis::Topics #710

Merged

kevindew reviewed Dec 30, 2025

kevindew mentioned this pull request Jan 5, 2026

Add Coherence metric #726

Merged

kevindew reviewed Jan 5, 2026

Comment thread app/views/admin/questions/_generic_aggregate_auto_evaluation.html.erb

Comment thread spec/jobs/answer_analysis/answer_relevancy_job_spec.rb Outdated

Comment thread spec/jobs/answer_analysis/answer_relevancy_job_spec.rb Outdated

kevindew reviewed Jan 6, 2026

davidgisbey added 6 commits January 7, 2026 09:19

Add auto_evaluation_results_creatable concern

f8fb85e

This adds a concern to encapsulate the logic for creating aggregate and run records for metrics. It will be called from the various evaluation jobs that require wisdom of the crowd.

Integraate Answer Relevancy Analysis into analysis workflow

9ace316

This updates the compose answer job to call the answer relevancy job after an answer has been successfully composed and persisted.

davidgisbey commented Jan 7, 2026

Comment thread spec/factories/answer_relevancy_aggregate_factory.rb

kevindew approved these changes Jan 7, 2026

Comment thread spec/factories/answer_relevancy_aggregate_factory.rb

Comment thread spec/support/stub_bedrock.rb
