Add answer relevancy models and integrate into analysis workflow#713
Conversation
6a24efb to
eeef206
Compare
78dbeb7 to
0b90024
Compare
0b90024 to
a9c890b
Compare
kevindew
left a comment
There was a problem hiding this comment.
Great work David, I think my most significant suggestion is having metrics have individual tables
a9c890b to
28829ab
Compare
28829ab to
162525a
Compare
eeef206 to
1a43792
Compare
162525a to
bf6f62c
Compare
bf6f62c to
1373e1b
Compare
1373e1b to
ce6e49a
Compare
ce6e49a to
33cc17f
Compare
33cc17f to
d240b06
Compare
e1b06d6 to
09b2a6f
Compare
kevindew
left a comment
There was a problem hiding this comment.
Oops sorry I hadn't submitted this
c5cc167 to
7a69206
Compare
7a69206 to
3315cbf
Compare
3315cbf to
4b608fa
Compare
4b608fa to
5791f27
Compare
|
@kevindew thanks i made those changes. Have a good break! |
kevindew
left a comment
There was a problem hiding this comment.
Looks good, I've put in a few comments before running out of time
| return true | ||
| end | ||
|
|
||
| false |
There was a problem hiding this comment.
You could probably simplify this method down to:
Answer.status_answered.exists?(id: answer_id)
if !eligble
logger.warn("Couldn't find an answer #{answer_id} that was eligible for auto-evaluation")
end
eligible
it'd save the whole hydration
There was a problem hiding this comment.
I've gone with this. The only only downside is that we warn on ineligibility rather than log it as info. It's not really a warning since we call it on all answers and know some won't be eligible. Not a massive deal though.
| included do | ||
| def self.create_mean_aggregate_and_score_runs(answer, results) | ||
| mean_score = results.sum(&:score) / results.size.to_f | ||
| aggregate = create!(answer:, mean_score:) |
There was a problem hiding this comment.
This should probably be a new since we're doing the save! a few lines below.
We should also consider the risk of a partial write here. I assume, but aren't sure, that a save! that saves multiple items does not run in a transaction, so I expect we may need to run that save! in a transaction block to ensure that it can't be a partial write.
5791f27 to
f78d8b9
Compare
kevindew
left a comment
There was a problem hiding this comment.
Looks great, does look like there's a few small things to sort
| llm_responses: { | ||
| "response_1" => { "content" => "LLM response content 1" }, | ||
| "response_2" => { "content" => "LLM response content 2" }, | ||
| }, | ||
| metrics: { | ||
| "metric_1" => { "detail" => "Metric detail 1" }, | ||
| "metric_2" => { "detail" => "Metric detail 2" }, | ||
| }, |
There was a problem hiding this comment.
I think we should consider having these handled in the factory since it's quite verbose. I imagine you could have a transient attribute that is a sequence that can define a llm responses that have unique aspects of data for that instance if it's good that they're different each time.
You could probably use that sequence for reason too so you don't need to set that and then you might get down to a one-liner.
There was a problem hiding this comment.
Same as above. I'll tackle that when the other PR is merged.
There was a problem hiding this comment.
Looking back at this with fresh eyes i wonder if we should just update the factory to use a sequence for reason, llm_responses and metrics i.e.
FactoryBot.define do
factory :auto_evaluation_score_result, class: "AutoEvaluation::ScoreResult" do
score { 0.85 }
sequence(:reason) { |n| "Reason #{n}" }
success { true }
sequence(:llm_responses) { |n| { "llm_response" => { "reason" => "Reason #{n}" } } }
sequence(:metrics) { |n| { "llm_response" => { "duration" => n } } }
initialize_with do
new(
score:,
reason:,
success:,
llm_responses:,
metrics:,
)
end
end
end
It feels a bit simpler than worrying about transient attrs. Wdyt?
kevindew
left a comment
There was a problem hiding this comment.
I've ran out of time before lunch but I expect the mean decimal thing is worth submitting now before looking further anyway in case it's worth chatting over whether that is actually the best thing to do or if we should have more rounding in the code
|
@kevindew i think this good for a re-review. I've broken that giant commit that did all the integration up into a few smaller ones for ease of review. It's no 3 commits.
|
kevindew
left a comment
There was a problem hiding this comment.
Great job - thanks for breaking up the commits a bit
This adds a migration to the two new tables needed to store answer relevancy metrics. It also adds the corresponding models and factories. We will need to record llm multiple llm responses and metrics for each run so i've included the LlmCallsRecordable module in the AnswerRelevancyRun model.
We're going to need to stub out these calls in multiple places so it makes sense to have a single method that does all the stubbing for us. I've also prepended stub_ to bedrock_invoke_model_openai_oss_tool_call. All other stubs have this so it makes sense to be consistent.
This adds a concern to encapsulate the logic for creating aggregate and run records for metrics. It will be called from the various evaluation jobs that require wisdom of the crowd.
This adds the BaseMetricjob and AnswerRelevancyJob. The AnswerRelevancyJobs handles: - making calls to the AnswerRelevancy class - compiling the results - calling the AnswerRelevancyAggregate#create_run_from_result method to delegate record creation to the AutoEvaluationMetricRun model The BaseJob is used to store shard functionality for future metric jobs. The next commit will integrate this job into the analysis workflow. As part of this commit i've updated the ScoreResult factory to use a sequence to build unique attributes for the reason, llm_responses and metrics fields. This ensures that we are correctly persiting all the attributes returned from the evaluation classes correctly. I've also updated the answer relevancy scoring method to use BigDecimal as part of this commit. Without this I was forced to use round(2) in the tests to avoid rounding issues caused by floats.
This updates the compose answer job to call the answer relevancy job after an answer has been successfully composed and persisted.
I've added an additional tab for answer relevancy metrics in the admin interface on the question show page. My thoughts for this are if we don't split out the metrics into their own tabs then the page will get incredibly noisy. This makes it easier to navigate. Due to this, i've renamed the analysis tab to topics.
|
Thanks for the review @kevindew. I've made those changes. |
We've got a few places in our codebase where we want to use the rephrased question if it exists, otherwise fall back to the original question message in our LLM calls. This adds the Answer#question_used method to encapsulate that logic, and updates all relevant places to use this new method. I've removed the tests that were specifically checking for the rephrased question logic in the metrics, since that is now covered by the new method.
|
Thanks for the review! |

Description
This spikes out integrating answer relevancy into the analysis workflow. It adds:
AnswerRelevancyAggregatemodel which captures the aggregate scoreAnswerRelevancyRunmodel which captures the score, reason, llm response and metrics for a single metric runAnswerRelevancyJobhandles making multiple calls to the AnswerRelevancy class, compiles the results, persists the AnswerRelevancyAggregate record with the mean score and calls the AnswerRelevancyAggregate#create_run_from_result method to persist individual runsBaseMetricJobwhich provides shared functionality between metricsI've updated the analysis tabs to topics since that now only contains topics. The reason i added another metric tab is that with 3 runs for each metric, one tab for analysis would get incredibly noisy. There'd be an absolute tonne of metrics and llm responses. If we want to keep analysis in one tab we could split the topics and each metric into it's own summary list, but it's a bit easier to navigate with many tabs in my opinion.
Finally it updates the
ComposeAnswerJobto call theAnswerRelevancyJobonce an answer has been composed and persisted.Screenshots
Trello card
https://trello.com/c/cUaZjHSb/3042-integrate-answerrelevancy-into-analysis-workflow