Add Coherence metric#726
Conversation
cd44ef2 to
1a7c165
Compare
10fcd94 to
c8797ff
Compare
c8797ff to
9a95c26
Compare
9a95c26 to
da1feec
Compare
da1feec to
b1b2142
Compare
ea631f8 to
437b1d1
Compare
kevindew
left a comment
There was a problem hiding this comment.
Looks great, mostly minor comments but with a potentially bigger suggestion on an abstraction for the Rake task
c05b0a6 to
1caf3c2
Compare
|
Thanks for the review @kevindew. I made those changes. The last 2 commits implement the rake task and schema validation. |
kevindew
left a comment
There was a problem hiding this comment.
Looks great, just a bit of tweaking with the code to run the task. If easier we could do that on separate PR but I think it should be ok.
1caf3c2 to
58e4105
Compare
This adds the Coherence metric to the auto-evaluation module. Much like the answer relevancy metric, the Coherence metric uses BedrockOpenAIOssInvoke to make a tool call the LLM. It also returns a result object with the same attributes as the answer relevancy metric result object. I'll move this into a shared data class in a follow up commit. There is a difference in score required to normalise the rubric score to a 0-1 scale. This has been directly ported from our eval codebase and follows these mappings: 1 - 0.0 2 - 0.25 3 - 0.5 4 - 0.75 5 - 1.0
Since the Result data class should be returned in the same format across multiple auto-evaluation metrics, it makes sense to define it in a common location and have the other classes use it.
This adds a new Rake task to generate coherence for a given question. Much like the answer relevancy task it: 1. generates an answer for the input question using the existing answer composition pipeline 2. evaluates the coherence of the generated answer against the question using AutoEvaluation::Coherence 3. outputs the result json to stdout 4. handles errors answers appropriately Because there's so much shared functionality i've added a shared example to the existing evaluation_spec to reduce duplication between the two tasks. Once all the metrics are ported, we might want to consider updating this so have a single rake task that takes the metric as an argument rather than separate tasks for each metric. I've held off on doing this for now just to make sure all the rake tasks do have shared logic with the exception of the metric called. I'm pretty sure they will though.
58e4105 to
6a023fd
Compare
|
@kevindew i made those changes. I renamed the new class to Or we could go down a slightly more explicit route for what it's actually doing Another option is to just add a metrics directory as you previously suggested. Then Wdyt? |
|
Thanks, just dropping in between meetings to reply to the question. Yeah agree it sounds quite generic. I'm not sure it needs to reference score though as I don't think there's something coupled to ScoreResults here (unless I've missed something) instead I think the identifying aspect of this is that you have an input of a question, which gets turned into a basic answer then that evaluated. So I think it's something like EvaluateAnswerFromQuestionMessage ? |
6a023fd to
29c51e1
Compare
|
Great i've gone with |
kevindew
left a comment
There was a problem hiding this comment.
Just a few super minor things then good to merge.
29c51e1 to
9924db6
Compare
|
@kevindew thanks for that. I've made those changes in this commit. |
| llm_responses { {} } | ||
| metrics { {} } | ||
|
|
||
| initialize_with { attributes } |
There was a problem hiding this comment.
You're missing the new on this new(**attributes) so this will be just returning a hash.
While this is just minor it did make me wonder should something have failed for this to be creating the wrong class?
There was a problem hiding this comment.
Ah the test just checks that the evaluation result is returned
let(:score_result) { build(:auto_evaluation_score_result) }
before do
allow(evaluation_class)
.to receive(:call)
.and_return(score_result)
end
it "returns the AutoEvaluation::ScoreResult generated by the evaluation class" do
result = described_class.call(
evaluation_class: evaluation_klass,
question_message:,
)
expect(result).to eq(score_result)
end
It would've blown up on any integration test. I've got some in the next PR that would've failed.
The coherence class uses a stubbed response to build the runs not an evaluation result .
There was a problem hiding this comment.
Great thanks for digging into that.
For the decimal concern I think when we render them they'll only come from the DB and I'd have thought (but I haven't confirmed it) that as we have the decimal type in the DB they'll also be decimals not floats, so I think we'd have this problem either way? Unless I've missed something
9924db6 to
6e93298
Compare
We're adding a few metrics. Each of these requires a basic rake task that can be called with an INPUT environment variable. The task will generate a ScoreResult for the given metric and print it as JSON. This adds the AutoEvaluation::EvaluateAnswerFromQuestionMessage class which: - takes a question_message and a evaluation class as arguments - generates an answer using the question_message, - calls the evaluation class with the question and answer to get a ScoreResult - returns the ScoreResult If the generated answer has an error status, it raises a TaskFailedError and the rake task handles outputting the error message to stderr and aborting.
This updates the BedrockOpenAIOssInvoke class to include JSON schema validation for the structured output received from the LLM. It uses the first tool's schema since all of our auto-eval metrics use a single tool. If this is to change we could consider passing the schema in via the method parameters down the line. While doing this I noticed that i'd not been consistent with adhering to the schemas it some test cases so i've updated those.
6e93298 to
7ab136b
Compare
kevindew
left a comment
There was a problem hiding this comment.
Nice one - sorry this dragged on

Description
This ports the Coherence metric from the evaluation repo to our Ruby codebase.
It:
Rake task
Result
Trello card
https://trello.com/c/cUIagBUx/2996-ruby-auto-eval-for-coherence-metric