Improve entity analytics skill evals infrastructure #30
Draft
patrykkopycinski wants to merge 2 commits into
- Extract and deduplicate ToolUsageOnly evaluator into shared @kbn/evals-suite-agent-builder package with full test coverage
- Add custom evaluators (ToolOutputESQL, TokenUsage) with unit tests
- Add optional HTML eval reporter (EVAL_HTML_REPORT env flag)
- Split risk_score.spec.ts into engine_off / engine_on spec files
- Fix bulk indexing duplicate IDs by including model.id in doc ID
- Improve risk score inline tool description for better tool selection
- Strengthen skill discovery instructions in research agent prompt
- Simplify chat client, wire up modelUsage and traceId
- Robustly enable agentBuilder experimental features in siemSetup
- Update README with multi-model eval and HTML report usage
Move ToolUsageOnly, ToolOutputESQL, TokenUsage evaluators and model pricing utilities from kbn-evals-suite-agent-builder and security_solution_evals into @kbn/evals under dedicated subdirectories (tool_usage/, esql/, token_usage/) matching the existing convention used by correctness/, groundedness/, rag/, and trace_based/.
Summary
Improvements to the entity analytics skill evaluation infrastructure, including shared evaluator deduplication, custom evaluators with tests, enhanced HTML reporting, spec file organization, and several bug fixes.
Evaluation Scores
Multi-model evaluation run with 3 repetitions per example, 6 examples (3 engine-off + 3 engine-on).
Judge model: Claude 4.5 Sonnet (us.anthropic.claude-sonnet-4-5-20250929-v1:0)
Overall Mean Scores (per model)
Per-Dataset Breakdown (from cumulative reports)
Risk Engine OFF (3 examples × 3 reps each):
Risk Engine ON (3 examples × 3 reps each):
1. Shared evaluator deduplication (@kbn/evals-suite-agent-builder)
The ToolUsageOnly evaluator existed in both kbn-evals-suite-agent-builder and security_solution_evals with diverging implementations. The security version was more advanced (handles invoke_skill indirection, acceptable alternatives like platform.core.search, auxiliary discovery tool filtering).
Changes:
- src/evaluators/ directory in kbn-evals-suite-agent-builder with the canonical ToolUsageOnly evaluator and shared helpers (getToolCallStepsWithParams, AUXILIARY_DISCOVERY_TOOLS, ToolCallStep, getStringMeta)
- index.ts package entry point for external consumers
- kibana.jsonc visibility from "private" to "shared" so cross-group packages can import
- evaluate_dataset.ts with createToolUsageOnlyEvaluator() from the new evaluators module
- security_solution_evals to import createToolUsageOnlyEvaluator from @kbn/evals-suite-agent-builder instead of maintaining a local copy
2. Custom evaluators with unit tests (security_solution_evals)
Extracted custom evaluators from evaluate_dataset.ts into a structured src/evaluators/ directory:
- @kbn/evals-suite-agent-builder (shared)
- extractEsqlQueries, normalizeWhitespace, matchClause
Added Jest test infrastructure (jest.config.js) and comprehensive unit tests (32 tests across 3 suites).
3. Optional HTML eval reporter
Added a custom Playwright reporter (eval_html_reporter.ts) that generates a detailed tabular HTML report for debugging evaluation results.
Controlled via the EVAL_HTML_REPORT=true environment variable (disabled by default).
4. Risk score spec split
Split risk_score.spec.ts into two focused spec files:
- risk_score_engine_off.spec.ts: tests agent behavior when the risk engine is disabled (3 examples)
- risk_score_engine_on.spec.ts: tests agent behavior with active risk score data (3 examples)
5. Fix bulk indexing duplicate document IDs
In multi-model evaluations, all models within a single Playwright worker shared the same TEST_RUN_ID, causing duplicate Elasticsearch document IDs and bulk indexing failures. Fixed by including doc.task.model.id in the generated document ID in score_repository.ts.
6. Improve risk score inline tool description
Updated the risk score inline tool description to be more explicit about its capabilities, improving tool selection accuracy during evaluations.
7. Strengthen skill discovery in research agent prompt
When experimentalFeatures.skills is enabled:
8. Chat client and evaluate fixture improvements
- chat_client.ts: removed unused agentId/Options interface, wired up modelUsage and traceId from API response
- evaluate.ts (siemSetup fixture) to robustly enable agentBuilder:experimentalFeatures by checking isOverridden before attempting API updates
- playwright.config.ts onExperimentComplete callback support for attaching evaluation metadata to Playwright test reports
9. README updates
Updated README with multi-model evaluation instructions, HTML report usage, and EVAL_HTML_REPORT environment variable documentation.
Test plan
- kbn-evals-suite-agent-builder Jest tests pass (18 tests, 2 suites)
- security_solution_evals Jest tests pass (32 tests, 3 suites)
- kbn-evals-suite-agent-builder type check passes
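
For reviewers, the duplicate-ID fix from item 5 can be illustrated with a small sketch. This is not the code in score_repository.ts: only the shared run ID (TEST_RUN_ID) and the task.model.id field come from the description above, while the ScoreDoc shape, the other fields, and both function names are hypothetical and exist purely for illustration.

```typescript
// Hypothetical document shape. Only `task.model.id` and the idea of a
// shared run ID are taken from the PR description; the rest is assumed.
interface ScoreDoc {
  runId: string; // stands in for TEST_RUN_ID
  exampleId: string; // hypothetical field
  evaluatorName: string; // hypothetical field
  repetition: number; // hypothetical field
  task: { model: { id: string } };
}

// Without the model ID: two models in the same worker share a run ID,
// so their documents map to the same ID.
function docIdWithoutModel(doc: ScoreDoc): string {
  return [doc.runId, doc.exampleId, doc.evaluatorName, doc.repetition].join('-');
}

// With the model ID: the same example scored for different models now
// yields distinct IDs.
function docIdWithModel(doc: ScoreDoc): string {
  return [
    doc.runId,
    doc.task.model.id,
    doc.exampleId,
    doc.evaluatorName,
    doc.repetition,
  ].join('-');
}

const base = {
  runId: 'run-1',
  exampleId: 'ex-1',
  evaluatorName: 'ToolUsageOnly',
  repetition: 0,
};
const docA: ScoreDoc = { ...base, task: { model: { id: 'model-a' } } };
const docB: ScoreDoc = { ...base, task: { model: { id: 'model-b' } } };

// True: the IDs collide when the model is left out.
const collides = docIdWithoutModel(docA) === docIdWithoutModel(docB);
// True: the IDs differ once the model ID is part of the key.
const distinct = docIdWithModel(docA) !== docIdWithModel(docB);
```

The position of the model ID within the key is arbitrary in this sketch; what matters is that every dimension that varies within one worker run contributes to the document ID.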