Commit 0e71381

SrdjanLL and sorenlouv authored
[Obs AI Assistant] Evaluation: Add fallback score when judge misses evaluating a criterion (elastic#228827)
## Summary

Add fallback score when judge misses evaluating a criterion:

- The score is `0` and reasoning: `No score returned by LLM judge, defaulting to 0.`
- While the issue of inconsistent evaluation score was mitigated by elastic#226983, I still found that very rarely, the judge misses a criterion. With this change scoring has a fallback that will return the results with 100% consistency in terms of what was evaluated.

### Testing

- Since this inconsistency happens rarely, it is really hard to reproduce without tweaking the judge prompt to intentionally fail, by updating the system prompt of the judge ([here](https://github.com/elastic/kibana/blob/main/x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/kibana_client.ts#L534)) with something like:

  ```
  ### Scoring Contract
  * You MUST call the function "scores" exactly once.
  * Only and only evaluate the second criterion (reject all others).`,
  ```

Then you can see the fallback scores populating in the evaluation and keeping the `total` consistent regardless of how well the `score` works.

Example from intentionally failed scoring with the prompt change above:

<img width="994" height="278" alt="image" src="https://github.com/user-attachments/assets/d4bb94bc-4f7e-4982-95ca-cae2159d5ff7" />

---------

Co-authored-by: Søren Louv-Jansen <sorenlouv@gmail.com>
1 parent 7137d65 commit 0e71381

1 file changed

Lines changed: 24 additions & 17 deletions

x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/kibana_client.ts

@@ -574,7 +574,7 @@ export class KibanaClient {
                 properties: {
                   index: {
                     type: 'number',
-                    description: 'The number of the criterion',
+                    description: 'The index number of the criterion',
                   },
                   score: {
                     type: 'number',
@@ -605,28 +605,35 @@ export class KibanaClient {
           }
         ).criteria;
 
-        const scores = scoredCriteria
-          .map(({ index, score, reasoning }) => {
-            return {
-              criterion: criteria[index],
-              score,
-              reasoning,
-            };
-          })
-          .concat({
-            score: errors.length === 0 ? 1 : 0,
-            criterion: 'The conversation did not encounter any errors',
-            reasoning: errors.length
-              ? `The following errors occurred: ${errors.map((error) => error.error.message)}`
-              : 'No errors occurred',
-          });
+        const scoredMap = new Map(scoredCriteria.map((c) => [c.index, c] as const));
+
+        // Although very rare, the LLM judge can sometimes skip evaluation of certain criteria.
+        // The fallback default score is 0, with self-explanatory reasoning.
+        const scores = criteria.map((criterion, idx) => {
+          const criterionScore = scoredMap.get(idx);
+          return {
+            criterion,
+            score: criterionScore?.score ?? 0,
+            reasoning: criterionScore
+              ? criterionScore.reasoning
+              : 'No score returned by LLM judge, defaulting to 0.',
+          };
+        });
+
+        scores.push({
+          score: errors.length === 0 ? 1 : 0,
+          criterion: 'The conversation did not encounter any errors',
+          reasoning: errors.length
+            ? `The following errors occurred: ${errors.map((error) => error.error.message)}`
+            : 'No errors occurred',
+        });
 
         const result: EvaluationResult = {
           name: currentTitle,
           category: firstSuiteName,
           conversationId,
           messages,
-          passed: scoredCriteria.every(({ score }) => score >= 1),
+          passed: scores.every(({ score }) => score === 1),
           scores,
           errors,
         };
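
The fallback behavior in this change can be tried in isolation with a minimal sketch like the one below. It is not the code from `kibana_client.ts`: the `JudgeScore` and `CriterionResult` types, the `withFallbackScores` helper, and the sample criteria are hypothetical names introduced only for illustration, and the sketch assumes the judge reports zero-based criterion indices, as the `criteria[index]` lookup in the diff suggests.

```typescript
// Hypothetical, simplified shapes; the real types live in kibana_client.ts.
interface JudgeScore {
  index: number;
  score: number;
  reasoning: string;
}

interface CriterionResult {
  criterion: string;
  score: number;
  reasoning: string;
}

const FALLBACK_REASONING = 'No score returned by LLM judge, defaulting to 0.';

// Returns exactly one result per criterion, in criterion order,
// no matter how many entries the judge actually returned.
function withFallbackScores(criteria: string[], judged: JudgeScore[]): CriterionResult[] {
  const byIndex = new Map<number, JudgeScore>();
  for (const entry of judged) {
    byIndex.set(entry.index, entry);
  }

  return criteria.map((criterion, idx) => {
    const entry = byIndex.get(idx);
    return {
      criterion,
      // A criterion the judge skipped gets score 0 and the fallback reasoning.
      score: entry?.score ?? 0,
      reasoning: entry ? entry.reasoning : FALLBACK_REASONING,
    };
  });
}

// Made-up example: the judge only evaluated the second criterion (index 1).
const criteria = ['Mentions the service name', 'Returns a valid query', 'Explains the result'];
const judged: JudgeScore[] = [{ index: 1, score: 1, reasoning: 'Query is valid.' }];

const results = withFallbackScores(criteria, judged);
const passed = results.every(({ score }) => score === 1);

console.log(results.map((r) => r.score).join(', ')); // 0, 1, 0
console.log(passed); // false
```

Because the results are derived from `criteria` rather than from the judge's output, the number of scored entries always matches the number of criteria, and a skipped criterion makes the evaluation fail instead of silently passing.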
