Add answer relevancy models and integrate into analysis workflow#713

Merged
davidgisbey merged 7 commits into
mainfrom
add-metrics-data-models-and-integrate-into-workflow
Jan 7, 2026

Conversation

@davidgisbey
Contributor

@davidgisbey davidgisbey commented Dec 16, 2025

Description

This spikes out integrating answer relevancy into the analysis workflow. It adds:

  • the AnswerRelevancyAggregate model which captures the aggregate score
  • the AnswerRelevancyRun model which captures the score, reason, llm response and metrics for a single metric run
  • the AnswerRelevancyJob, which makes multiple calls to the AnswerRelevancy class, compiles the results, persists the AnswerRelevancyAggregate record with the mean score and calls the AnswerRelevancyAggregate#create_run_from_result method to persist individual runs
  • the BaseMetricJob which provides shared functionality between metrics
  • an additional bedrock stub to handle stubbing all 3 llm calls which we can reuse across our specs
  • a new answer relevancy tab on the question show page

I've renamed the analysis tab to topics, since it now only contains topics. The reason I added another metric tab is that with 3 runs for each metric, one tab for analysis would get incredibly noisy. There'd be an absolute tonne of metrics and LLM responses. If we want to keep analysis in one tab we could split the topics and each metric into its own summary list, but it's a bit easier to navigate with many tabs in my opinion.

Finally it updates the ComposeAnswerJob to call the AnswerRelevancyJob once an answer has been composed and persisted.
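
The flow described above (several runs per metric, then a mean score alongside the individual runs) could be sketched roughly as follows. This is a plain-Ruby illustration rather than the code in this PR: `ScoreResult` and `evaluate_with_runs` are hypothetical names, the scores are canned rather than coming from an LLM, and nothing is persisted.

```ruby
# Hypothetical stand-in for the result object returned by one metric run.
ScoreResult = Struct.new(:score, :reason, keyword_init: true)

# Calls the given block `runs` times and summarises the results.
# The default of 3 mirrors the "3 runs for each metric" mentioned above.
def evaluate_with_runs(runs: 3)
  results = Array.new(runs) { yield }
  mean_score = results.sum(&:score) / results.size.to_f
  { mean_score: mean_score, runs: results }
end

# Canned scores standing in for real LLM-backed evaluations.
canned_scores = [0.5, 1.0, 0.75].each
summary = evaluate_with_runs do
  ScoreResult.new(score: canned_scores.next, reason: "stubbed reason")
end

puts summary[:mean_score]
puts summary[:runs].size
```

In the real job the aggregate and its runs would be persisted as records; the hash here only shows the shape of the data.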

Screenshots

image image image

Trello card

https://trello.com/c/cUaZjHSb/3042-integrate-answerrelevancy-into-analysis-workflow

@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 16, 2025 15:09 Inactive
@davidgisbey davidgisbey force-pushed the migrate-analysis-to-answer-topics branch 2 times, most recently from 6a24efb to eeef206 Compare December 16, 2025 16:36
@davidgisbey davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 78dbeb7 to 0b90024 Compare December 16, 2025 16:46
@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 16, 2025 16:47 Inactive
@davidgisbey davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 0b90024 to a9c890b Compare December 16, 2025 16:53
@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 16, 2025 16:53 Inactive
Member

@kevindew kevindew left a comment

Great work David, I think my most significant suggestion is having metrics have individual tables

Comment thread app/jobs/auto_evaluation_metric_job.rb Outdated
Comment thread app/jobs/auto_evaluation_metric_job.rb Outdated
Comment thread app/jobs/auto_evaluation_metric_job.rb Outdated
Comment thread db/schema.rb Outdated
@davidgisbey davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from a9c890b to 28829ab Compare December 17, 2025 13:47
@govuk-ci govuk-ci had a problem deploying to govuk-chat-add-metrics--dggzud December 17, 2025 13:48 Failure
@davidgisbey davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 28829ab to 162525a Compare December 17, 2025 13:53
@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 17, 2025 13:53 Inactive
@davidgisbey davidgisbey force-pushed the migrate-analysis-to-answer-topics branch from eeef206 to 1a43792 Compare December 17, 2025 14:26
@davidgisbey davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 162525a to bf6f62c Compare December 17, 2025 14:26
@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 17, 2025 14:26 Inactive
@davidgisbey davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from bf6f62c to 1373e1b Compare December 17, 2025 14:32
@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 17, 2025 14:33 Inactive
@davidgisbey davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 1373e1b to ce6e49a Compare December 17, 2025 15:07
@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 17, 2025 15:07 Inactive
@davidgisbey davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from ce6e49a to 33cc17f Compare December 17, 2025 15:18
@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 17, 2025 15:18 Inactive
@govuk-ci govuk-ci requested a deployment to govuk-chat-add-metrics--dggzud December 17, 2025 15:18 Abandoned
@davidgisbey davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 33cc17f to d240b06 Compare December 17, 2025 15:24
@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 17, 2025 15:24 Inactive
@govuk-ci govuk-ci requested a deployment to govuk-chat-add-metrics--dggzud December 17, 2025 15:24 Abandoned
@davidgisbey davidgisbey changed the title Add metrics data models and integrate into workflow Add answer relevancy models and integrate into analysis workflow Dec 17, 2025
@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 17, 2025 16:36 Inactive
@davidgisbey davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from e1b06d6 to 09b2a6f Compare December 17, 2025 17:12
@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 17, 2025 17:12 Inactive
@govuk-ci govuk-ci requested a deployment to govuk-chat-add-metrics--dggzud December 17, 2025 17:12 Abandoned
@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 18, 2025 09:51 Inactive
Member

@kevindew kevindew left a comment

Oops sorry I hadn't submitted this

Comment thread app/jobs/answer_analysis/answer_relevancy_job.rb Outdated
Comment thread app/jobs/answer_analysis/answer_relevancy_job.rb Outdated
Comment thread app/jobs/answer_analysis/answer_relevancy_job.rb Outdated
Comment thread app/jobs/compose_answer_job.rb
Comment thread app/models/concerns/analysis_results_creatable.rb Outdated
Comment thread app/models/concerns/analysis_results_creatable.rb Outdated
Comment thread app/views/admin/questions/show.html.erb Outdated
Comment thread db/migrate/20251216092915_add_answer_relevancy_tables.rb Outdated
Comment thread db/schema.rb
@davidgisbey davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from c5cc167 to 7a69206 Compare December 18, 2025 16:27
@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 18, 2025 16:27 Inactive
@davidgisbey davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 7a69206 to 3315cbf Compare December 18, 2025 16:50
@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 18, 2025 16:50 Inactive
@davidgisbey davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 3315cbf to 4b608fa Compare December 18, 2025 16:57
@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 18, 2025 16:57 Inactive
@davidgisbey davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 4b608fa to 5791f27 Compare December 18, 2025 17:04
@govuk-ci govuk-ci temporarily deployed to govuk-chat-add-metrics--dggzud December 18, 2025 17:04 Inactive
@davidgisbey
Contributor Author

@kevindew thanks, I've made those changes. Have a good break!

Member

@kevindew kevindew left a comment

Looks good, I've put in a few comments before running out of time

Comment thread app/jobs/answer_analysis/answer_relevancy_job.rb Outdated
return true
end

false
Member

You could probably simplify this method down to:

eligible = Answer.status_answered.exists?(id: answer_id)

unless eligible
  logger.warn("Couldn't find an answer #{answer_id} that was eligible for auto-evaluation")
end

eligible

It'd save hydrating the whole record.

Contributor Author

I've gone with this. The only downside is that we warn on ineligibility rather than log it as info. It's not really a warning, since we call it on all answers and know some won't be eligible. Not a massive deal though.
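
For illustration, an info-level variant of the check might look something like the sketch below. It is plain Ruby so it runs outside Rails: `eligible_for_evaluation?` is a hypothetical name, and the `exists` lambda is an assumed stand-in for a cheap existence query such as `Answer.status_answered.exists?(id: answer_id)`.

```ruby
require "logger"
require "stringio"

# Hypothetical helper. `exists` stands in for a cheap existence query, so no
# full record needs to be loaded just to decide eligibility.
def eligible_for_evaluation?(answer_id, logger:, exists:)
  eligible = exists.call(answer_id)
  # Logged at info rather than warn: ineligible answers are expected here.
  logger.info("Answer #{answer_id} is not eligible for auto-evaluation") unless eligible
  eligible
end

log_output = StringIO.new
logger = Logger.new(log_output)
answered_ids = [1, 2] # pretend these are the ids of answered answers
lookup = ->(id) { answered_ids.include?(id) }

first = eligible_for_evaluation?(1, logger: logger, exists: lookup)
second = eligible_for_evaluation?(99, logger: logger, exists: lookup)
puts log_output.string
```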

Comment thread app/jobs/answer_analysis/answer_relevancy_job.rb Outdated
included do
def self.create_mean_aggregate_and_score_runs(answer, results)
mean_score = results.sum(&:score) / results.size.to_f
aggregate = create!(answer:, mean_score:)
Member

This should probably be a new since we're doing the save! a few lines below.

We should also consider the risk of a partial write here. I assume, but am not sure, that a save! that saves multiple items does not run in a transaction, so I expect we may need to run that save! in a transaction block to ensure that it can't be a partial write.

Contributor Author

@davidgisbey davidgisbey Dec 19, 2025

Great. I just tested the behaviour and we don't need to worry about using a transaction.

image

@davidgisbey davidgisbey force-pushed the add-metrics-data-models-and-integrate-into-workflow branch from 5791f27 to f78d8b9 Compare December 19, 2025 13:13
Member

@kevindew kevindew left a comment

Looks great, does look like there's a few small things to sort

Comment thread app/views/admin/questions/show.html.erb Outdated
Comment on lines +11 to +18
llm_responses: {
"response_1" => { "content" => "LLM response content 1" },
"response_2" => { "content" => "LLM response content 2" },
},
metrics: {
"metric_1" => { "detail" => "Metric detail 1" },
"metric_2" => { "detail" => "Metric detail 2" },
},
Member

I think we should consider having these handled in the factory since it's quite verbose. I imagine you could have a transient attribute that is a sequence, which could define LLM responses with data unique to that instance, if it's good that they're different each time.

You could probably use that sequence for reason too so you don't need to set that and then you might get down to a one-liner.

Contributor Author

Same as above. I'll tackle that when the other PR is merged.

Contributor Author

@davidgisbey davidgisbey Jan 5, 2026

Looking back at this with fresh eyes, I wonder if we should just update the factory to use a sequence for reason, llm_responses and metrics, i.e.

FactoryBot.define do
  factory :auto_evaluation_score_result, class: "AutoEvaluation::ScoreResult" do
    score { 0.85 }
    sequence(:reason) { |n| "Reason #{n}" }
    success { true }
    sequence(:llm_responses) { |n| { "llm_response" => { "reason" => "Reason #{n}" } } }
    sequence(:metrics) { |n| { "llm_response" => { "duration" => n } } }


    initialize_with do
      new(
        score:,
        reason:,
        success:,
        llm_responses:,
        metrics:,
      )
    end
  end
end

It feels a bit simpler than worrying about transient attrs. Wdyt?

Comment thread spec/jobs/answer_analysis/answer_relevancy_job_spec.rb Outdated
Comment thread spec/jobs/answer_analysis/answer_relevancy_job_spec.rb Outdated
Comment thread spec/jobs/answer_analysis/answer_relevancy_job_spec.rb Outdated
Comment thread spec/jobs/answer_analysis/answer_relevancy_job_spec.rb Outdated
@kevindew kevindew mentioned this pull request Jan 5, 2026
Member

@kevindew kevindew left a comment

I've run out of time before lunch, but I expect the mean decimal thing is worth submitting now before looking further, in case it's worth chatting over whether that is actually the best thing to do or whether we should have more rounding in the code.

Comment thread spec/jobs/answer_analysis/answer_relevancy_job_spec.rb Outdated
Comment thread spec/jobs/answer_analysis/answer_relevancy_job_spec.rb Outdated
@davidgisbey
Contributor Author

@kevindew I think this is good for a re-review. I've broken that giant commit that did all the integration up into a few smaller ones for ease of review. It's now 3 commits.

  • the first adds the concern that handles persisting the records
  • the second adds the answer relevancy and base jobs
  • the third calls the above jobs from the compose answer job and stubs the request in the various system specs

Member

@kevindew kevindew left a comment

Great job - thanks for breaking up the commits a bit

Comment thread app/jobs/answer_analysis/answer_relevancy_job.rb Outdated
Comment thread app/models/concerns/auto_evaluation_results_creatable.rb Outdated
Comment thread app/models/answer_analysis/answer_relevancy_aggregate.rb
Comment thread spec/support/analysis_results_creatable_examples.rb Outdated
Comment thread spec/support/auto_evaluation_results_creatable_examples.rb
Comment thread spec/requests/admin/questions_spec.rb Outdated
Comment thread spec/jobs/answer_analysis/answer_relevancy_job_spec.rb Outdated
Comment thread spec/jobs/answer_analysis/answer_relevancy_job_spec.rb Outdated
Comment thread spec/jobs/answer_analysis/answer_relevancy_job_spec.rb Outdated
Comment thread lib/auto_evaluation/answer_relevancy.rb Outdated
This adds a migration for the two new tables needed to store answer relevancy
metrics. It also adds the corresponding models and factories.

We will need to record multiple LLM responses and metrics for each
run, so I've included the LlmCallsRecordable module in the AnswerRelevancyRun
model.
We're going to need to stub out these calls in multiple places so it makes
sense to have a single method that does all the stubbing for us.

I've also prepended stub_ to bedrock_invoke_model_openai_oss_tool_call.
All other stubs have this so it makes sense to be consistent.
This adds a concern to encapsulate the logic for creating aggregate and run
records for metrics. It will be called from the various evaluation jobs
that require wisdom of the crowd.
This adds the BaseMetricJob and AnswerRelevancyJob. The AnswerRelevancyJob
handles:
- making calls to the AnswerRelevancy class
- compiling the results
- calling the AnswerRelevancyAggregate#create_run_from_result method to
  delegate record creation to the AutoEvaluationMetricRun model

The BaseJob is used to store shared functionality for future metric jobs.
The next commit will integrate this job into the analysis workflow.

As part of this commit I've updated the ScoreResult factory to use a
sequence to build unique attributes for the reason, llm_responses and
metrics fields. This ensures that we are correctly persisting all the
attributes returned from the evaluation classes.

I've also updated the answer relevancy scoring method to use BigDecimal
as part of this commit. Without this I was forced to use round(2) in
the tests to avoid rounding issues caused by floats.
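
The kind of float rounding issue mentioned above can be reproduced with a small standalone sketch. The scores are made-up values and the conversion via `to_s` is just one assumed way of getting from a Float to a BigDecimal; the scoring method in this PR may do it differently.

```ruby
require "bigdecimal"

# Made-up scores chosen because they are not exactly representable as floats.
scores = [0.1, 0.2, 0.3]

# Float arithmetic: the mean ends up very close to, but not equal to, 0.2.
float_mean = scores.sum / scores.size

# BigDecimal arithmetic: convert via the decimal string, then average.
decimal_scores = scores.map { |score| BigDecimal(score.to_s) }
decimal_mean = decimal_scores.sum / decimal_scores.size

puts float_mean == 0.2
puts decimal_mean == BigDecimal("0.2")
```

With decimal values a spec can compare against an exact expected mean instead of wrapping both sides in round(2).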
This updates the compose answer job to call the answer relevancy job
after an answer has been successfully composed and persisted.
I've added an additional tab for answer relevancy metrics in the admin
interface on the question show page.

My thinking is that if we don't split out the metrics into their own
tabs, the page will get incredibly noisy. Separate tabs make it easier to
navigate.

Due to this, I've renamed the analysis tab to topics.
@davidgisbey
Contributor Author

Thanks for the review @kevindew. I've made those changes.

We've got a few places in our codebase where we want to use the rephrased
question if it exists, otherwise fall back to the original question message
in our LLM calls.

This adds the Answer#question_used method to encapsulate that logic, and updates
all relevant places to use this new method.

I've removed the tests that were specifically checking for the rephrased
question logic in the metrics, since that is now covered by the new method.
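
A minimal sketch of the fallback described above, using plain Structs instead of the real models. The attribute names `rephrased_question` and `message` are assumptions made for illustration and may not match the actual schema.

```ruby
# Hypothetical stand-ins for the Question and Answer models.
Question = Struct.new(:message, keyword_init: true)

Answer = Struct.new(:question, :rephrased_question, keyword_init: true) do
  # Prefer the rephrased question; fall back to the original message when
  # there is no rephrasing (nil or blank).
  def question_used
    return question.message if rephrased_question.nil? || rephrased_question.strip.empty?

    rephrased_question
  end
end

question = Question.new(message: "how do i pay tax?")
plain = Answer.new(question: question, rephrased_question: nil)
rephrased = Answer.new(question: question, rephrased_question: "How do I pay income tax?")

puts plain.question_used
puts rephrased.question_used
```

Keeping the fallback in one method means callers that build LLM prompts don't each need their own nil check.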
Comment thread spec/factories/answer_relevancy_aggregate_factory.rb
Member

@kevindew kevindew left a comment

Looks good to me 👍

Comment thread spec/factories/answer_relevancy_aggregate_factory.rb
Comment thread spec/support/stub_bedrock.rb
@davidgisbey
Contributor Author

Thanks for the review!

3 participants