Commit 0e71381
[Obs AI Assistant] Evaluation: Add fallback score when judge misses evaluating a criterion (elastic#228827)
## Summary
Add a fallback score for any criterion the judge fails to evaluate:
- The fallback score is `0`, with the reasoning `No score returned by LLM judge, defaulting to 0.`
- elastic#226983 mitigated the inconsistent evaluation scores, but I
still found that, very rarely, the judge skips a criterion. With this
change, the results always contain an entry for every criterion, so
what was evaluated is fully consistent from run to run.
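
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the `JudgeScore` and `CriterionResult` shapes, the `withFallbackScores` helper, and the sample criteria are all hypothetical names chosen for the example. Only the fallback score of `0` and the reasoning string come from the description.

```typescript
// Hypothetical shapes for illustration; the real types in kibana_client.ts may differ.
interface JudgeScore {
  index: number; // position of the criterion the judge scored
  score: number;
  reasoning: string;
}

interface CriterionResult {
  criterion: string;
  score: number;
  reasoning: string;
}

const FALLBACK_REASONING = 'No score returned by LLM judge, defaulting to 0.';

// Return exactly one result per criterion, filling in the fallback
// for any criterion the judge did not score.
function withFallbackScores(criteria: string[], judgeScores: JudgeScore[]): CriterionResult[] {
  const byIndex = new Map<number, JudgeScore>();
  for (const s of judgeScores) {
    byIndex.set(s.index, s);
  }

  return criteria.map((criterion, index) => {
    const judged = byIndex.get(index);
    return judged
      ? { criterion, score: judged.score, reasoning: judged.reasoning }
      : { criterion, score: 0, reasoning: FALLBACK_REASONING };
  });
}

// Usage: the judge only scored the second criterion (index 1).
const criteria = ['Mentions the alert name', 'Cites a time range', 'Suggests a next step'];
const results = withFallbackScores(criteria, [
  { index: 1, score: 1, reasoning: 'Time range is stated.' },
]);

const total = results.length; // always equals criteria.length
const passed = results.reduce((sum, r) => sum + r.score, 0);
console.log(`${passed}/${total}`); // prints "1/3"
```

Because the output is built by mapping over the criteria rather than over the judge's answers, the number of results depends only on the criteria list, which is what keeps the total stable when the judge skips an item.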
### Testing
- Because this inconsistency is rare, it is hard to reproduce without
intentionally making the judge fail. To do that, update the judge's
system prompt
([here](https://github.com/elastic/kibana/blob/main/x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/kibana_client.ts#L534))
with something like:
```
### Scoring Contract
* You MUST call the function "scores" exactly once.
* Only and only evaluate the second criterion (reject all others).
```
The fallback scores then appear in the evaluation results, and the
`total` stays consistent regardless of how many criteria the judge scores.
Example output from an intentionally failed scoring run with the prompt change above:
<img width="994" height="278" alt="image"
src="https://github.com/user-attachments/assets/d4bb94bc-4f7e-4982-95ca-cae2159d5ff7"
/>
---------
Co-authored-by: Søren Louv-Jansen <sorenlouv@gmail.com>

1 parent 7137d65 commit 0e71381
1 file changed
Lines changed: 24 additions & 17 deletions
File tree
- x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation