[Obs AI Assistant] Evaluation: Add fallback score when judge misses evaluating a criterion (elastic#228827)

SrdjanLL · sorenlouv · web-flow · commit 0e7138111a5a · 2025-07-22T12:26:14.000+02:00
## Summary

Add fallback score when judge misses evaluating a criterion:

- The fallback score is `0`, with the reasoning `No score returned by LLM judge, defaulting to 0.`
- While the issue of inconsistent evaluation scores was mitigated by elastic#226983, I still found that, very rarely, the judge misses a criterion. With this change, scoring has a fallback, so the returned results are 100% consistent in terms of which criteria were evaluated.

### Testing

- Since this inconsistency happens rarely, it is really hard to reproduce without making the judge fail intentionally. To do that, update the system prompt of the judge ([here](https://github.com/elastic/kibana/blob/main/x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/kibana_client.ts#L534)) with something like:

  ```
  ### Scoring Contract
  * You MUST call the function "scores" exactly once.
  * Only and only evaluate the second criterion (reject all others).`,
  ```

  Then you can see the fallback scores populating in the evaluation and keeping the `total` consistent regardless of how well the `score` works.

Example from intentionally failed scoring with the prompt change above:

<img width="994" height="278" alt="image" src="https://github.com/user-attachments/assets/d4bb94bc-4f7e-4982-95ca-cae2159d5ff7" />

---------

Co-authored-by: Søren Louv-Jansen <sorenlouv@gmail.com>
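
For readers who want to try the fallback in isolation, here is a minimal sketch of the same idea outside of `KibanaClient`. The `ScoredCriterion` and `CriterionScore` interfaces, the `withFallbackScores` helper, and the sample criteria are hypothetical names introduced only for this illustration; the actual change is the diff in `kibana_client.ts`.

```typescript
// Hypothetical shapes for illustration only; they are not the types used in kibana_client.ts.
interface ScoredCriterion {
  index: number;
  score: number;
  reasoning: string;
}

interface CriterionScore {
  criterion: string;
  score: number;
  reasoning: string;
}

const FALLBACK_REASONING = 'No score returned by LLM judge, defaulting to 0.';

// Hypothetical helper: returns exactly one entry per criterion, in criteria order.
// Criteria the judge skipped get a score of 0 and the fallback reasoning.
function withFallbackScores(
  criteria: string[],
  scoredCriteria: ScoredCriterion[]
): CriterionScore[] {
  // Index the judge's answers by criterion index for direct lookup.
  const scoredMap = new Map(scoredCriteria.map((c) => [c.index, c] as const));

  return criteria.map((criterion, idx) => {
    const judged = scoredMap.get(idx);
    return {
      criterion,
      score: judged?.score ?? 0,
      reasoning: judged ? judged.reasoning : FALLBACK_REASONING,
    };
  });
}

// Sample data (made up): the judge only evaluated the second criterion.
const criteria = [
  'Mentions the service name',
  'Returns a valid query',
  'Explains the result',
];
const judged: ScoredCriterion[] = [
  { index: 1, score: 1, reasoning: 'The query is valid.' },
];

const scores = withFallbackScores(criteria, judged);
const passed = scores.every(({ score }) => score === 1);
```

In this sketch `scores` always has as many entries as `criteria`, so the total stays stable even when the judge skips criteria, and `passed` is `false` whenever any criterion fell back to `0`.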
diff --git a/x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/kibana_client.ts b/x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/kibana_client.ts
@@ -574,7 +574,7 @@ export class KibanaClient {
                       properties: {
                         index: {
                           type: 'number',
-                          description: 'The number of the criterion',
+                          description: 'The index number of the criterion',
                         },
                         score: {
                           type: 'number',
@@ -605,28 +605,35 @@ export class KibanaClient {
           }
         ).criteria;
 
-        const scores = scoredCriteria
-          .map(({ index, score, reasoning }) => {
-            return {
-              criterion: criteria[index],
-              score,
-              reasoning,
-            };
-          })
-          .concat({
-            score: errors.length === 0 ? 1 : 0,
-            criterion: 'The conversation did not encounter any errors',
-            reasoning: errors.length
-              ? `The following errors occurred: ${errors.map((error) => error.error.message)}`
-              : 'No errors occurred',
-          });
+        const scoredMap = new Map(scoredCriteria.map((c) => [c.index, c] as const));
+
+        // Although very rare, the LLM judge can sometimes skip evaluation of certain criteria.
+        // The fallback default score is 0, with self-explanatory reasoning.
+        const scores = criteria.map((criterion, idx) => {
+          const criterionScore = scoredMap.get(idx);
+          return {
+            criterion,
+            score: criterionScore?.score ?? 0,
+            reasoning: criterionScore
+              ? criterionScore.reasoning
+              : 'No score returned by LLM judge, defaulting to 0.',
+          };
+        });
+
+        scores.push({
+          score: errors.length === 0 ? 1 : 0,
+          criterion: 'The conversation did not encounter any errors',
+          reasoning: errors.length
+            ? `The following errors occurred: ${errors.map((error) => error.error.message)}`
+            : 'No errors occurred',
+        });
 
         const result: EvaluationResult = {
           name: currentTitle,
           category: firstSuiteName,
           conversationId,
           messages,
-          passed: scoredCriteria.every(({ score }) => score >= 1),
+          passed: scores.every(({ score }) => score === 1),
           scores,
           errors,
         };