Improve entity analytics skill evals infrastructure by patrykkopycinski · Pull Request #30 · ymao1/kibana

patrykkopycinski · 2026-02-13T12:24:01Z

Summary

Improvements to the entity analytics skill evaluation infrastructure, including shared evaluator deduplication, custom evaluators with tests, enhanced HTML reporting, spec file organization, and several bug fixes.

Evaluation Scores

Multi-model evaluation run with 3 repetitions per example, 6 examples (3 engine-off + 3 engine-on).

Judge model: Claude 4.5 Sonnet (us.anthropic.claude-sonnet-4-5-20250929-v1:0)

Overall Mean Scores (per model)

Model	Groundedness	Relevance	Seq. Accuracy	ToolOutputESQL	ToolUsageOnly
Claude 4.5 Sonnet	0.60	0.53	1.00	1.00	1.00
Gemini 2.5 Pro	0.94	0.77	1.00	0.70	1.00
GPT-4.1	0.56	0.74	1.00	0.94	1.00

Per-Dataset Breakdown (from cumulative reports)

Risk Engine OFF (3 examples × 3 reps each):

Model	Groundedness	Relevance	Seq. Accuracy	ToolOutputESQL	ToolUsageOnly
Claude 4.5 Sonnet	0.97	0.58	1.00	1.00	1.00
Gemini 2.5 Pro	~1.00	~0.96	1.00	1.00	1.00
GPT-4.1	~0.99	~0.95	1.00	1.00	1.00

Risk Engine ON (3 examples × 3 reps each):

Model	Groundedness	Relevance	Seq. Accuracy	ToolOutputESQL	ToolUsageOnly
Claude 4.5 Sonnet	0.22	0.48	1.00	1.00	1.00
Gemini 2.5 Pro	~0.90	~0.58	1.00	~0.40	1.00
GPT-4.1	~0.11	~0.56	1.00	~0.88	1.00

Note: Gemini 2.5 Pro and GPT-4.1 per-dataset scores marked ~ are approximate, because they are derived from cumulative worker reports. Claude 4.5 Sonnet scores are exact, because its worker was the first to report.
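As a quick consistency check on the tables above, each overall mean should be close to the run-weighted average of the engine-off and engine-on scores, since both datasets contribute the same number of runs (3 examples × 3 repetitions). The sketch below assumes that weighting; the `overallMean` helper and the `DatasetScore` shape are hypothetical and not part of this PR.

```typescript
// Hypothetical helper (not from the PR): combine per-dataset means,
// weighting each dataset by its number of runs.
interface DatasetScore {
  mean: number; // mean evaluator score for the dataset
  runs: number; // number of runs (examples × repetitions)
}

function overallMean(datasets: DatasetScore[]): number {
  const totalRuns = datasets.reduce((sum, d) => sum + d.runs, 0);
  if (totalRuns === 0) {
    return NaN; // nothing to average
  }
  const weighted = datasets.reduce((sum, d) => sum + d.mean * d.runs, 0);
  return weighted / totalRuns;
}

// Claude 4.5 Sonnet, Groundedness: 0.97 (engine off) and 0.22 (engine on).
const sonnetGroundedness = overallMean([
  { mean: 0.97, runs: 9 },
  { mean: 0.22, runs: 9 },
]);
console.log(sonnetGroundedness.toFixed(3));
```

The result is approximately 0.595, in line with the 0.60 reported in the overall table. Rows marked ~ may deviate slightly because their inputs are themselves approximate.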

1. Shared evaluator deduplication (`@kbn/evals-suite-agent-builder`)

The ToolUsageOnly evaluator existed in both kbn-evals-suite-agent-builder and security_solution_evals with diverging implementations. The security version was more advanced (handles invoke_skill indirection, acceptable alternatives like platform.core.search, auxiliary discovery tool filtering).

Changes:

Created src/evaluators/ directory in kbn-evals-suite-agent-builder with the canonical ToolUsageOnly evaluator and shared helpers (getToolCallStepsWithParams, AUXILIARY_DISCOVERY_TOOLS, ToolCallStep, getStringMeta)
Added index.ts package entry point for external consumers
Changed kibana.jsonc visibility from "private" to "shared" so cross-group packages can import
Replaced the inline ToolUsageOnly in evaluate_dataset.ts with createToolUsageOnlyEvaluator() from the new evaluators module
Added comprehensive unit tests (18 tests across 2 suites)
Updated security_solution_evals to import createToolUsageOnlyEvaluator from @kbn/evals-suite-agent-builder instead of maintaining a local copy
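To illustrate the three behaviors the canonical evaluator is described as handling (invoke_skill indirection, acceptable alternatives, and auxiliary discovery tool filtering), here is a minimal sketch. It is not the actual implementation: `resolveToolId`, `scoreToolUsage`, the `params.toolId` field, the contents of `AUXILIARY_DISCOVERY_TOOLS`, and the tool ids other than `platform.core.search` are all assumptions made for illustration.

```typescript
// Sketch only. ToolCallStep and AUXILIARY_DISCOVERY_TOOLS are named in the PR,
// but their shape and contents here are hypothetical placeholders.
interface ToolCallStep {
  toolId: string;
  params?: Record<string, unknown>;
}

const AUXILIARY_DISCOVERY_TOOLS = new Set<string>(['list_skills', 'read_skill']);

// Resolve `invoke_skill` indirection. Assumption: the wrapped tool id is
// passed as `params.toolId`; the real parameter name may differ.
function resolveToolId(step: ToolCallStep): string {
  const inner = step.params?.['toolId'];
  if (step.toolId === 'invoke_skill' && typeof inner === 'string') {
    return inner;
  }
  return step.toolId;
}

// Score 1 if any non-auxiliary tool call hit the expected tool or an
// acceptable alternative, otherwise 0.
function scoreToolUsage(
  steps: ToolCallStep[],
  expectedToolId: string,
  acceptableAlternatives: string[] = []
): { score: number; toolsUsed: string[] } {
  const toolsUsed = steps
    .map(resolveToolId)
    .filter((id) => !AUXILIARY_DISCOVERY_TOOLS.has(id));
  const accepted = new Set<string>([expectedToolId, ...acceptableAlternatives]);
  const score = toolsUsed.some((id) => accepted.has(id)) ? 1 : 0;
  return { score, toolsUsed };
}

const viaSkill = scoreToolUsage(
  [
    { toolId: 'list_skills' },
    { toolId: 'invoke_skill', params: { toolId: 'risk_score_tool' } },
  ],
  'risk_score_tool'
);
const viaAlternative = scoreToolUsage(
  [{ toolId: 'platform.core.search' }],
  'risk_score_tool',
  ['platform.core.search']
);
const wrongTool = scoreToolUsage([{ toolId: 'some_other_tool' }], 'risk_score_tool');
console.log(viaSkill.score, viaAlternative.score, wrongTool.score); // prints: 1 1 0
```

Keeping the resolution and filtering steps separate from the scoring step makes each rule testable on its own, which matches the PR's emphasis on unit-tested shared helpers.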

2. Custom evaluators with unit tests (`security_solution_evals`)

Extracted custom evaluators from evaluate_dataset.ts into a structured src/evaluators/ directory:

ToolUsageOnly — re-exported from @kbn/evals-suite-agent-builder (shared)
ToolOutputESQL — checks whether the agent produced the expected ES|QL query using clause-level matching with order-independent KEEP field matching
TokenUsage — reports token-usage statistics and estimated cost per evaluation example
Helpers — ESQL-specific utilities (extractEsqlQueries, normalizeWhitespace, matchClause)

Added Jest test infrastructure (jest.config.js) and comprehensive unit tests (32 tests across 3 suites).
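The clause-level matching with order-independent KEEP fields described for ToolOutputESQL can be sketched roughly as follows. The helper names `normalizeWhitespace` and `matchClause` come from the PR, but their signatures and bodies here are assumptions, as are `splitClauses`, `splitCommand`, `esqlClauseScore`, and the example queries.

```typescript
// Collapse runs of whitespace so formatting differences do not affect matching.
function normalizeWhitespace(s: string): string {
  return s.replace(/\s+/g, ' ').trim();
}

// Split a query into pipe-separated clauses.
// Naive: a "|" inside a string literal would be split too.
function splitClauses(query: string): string[] {
  return query
    .split('|')
    .map(normalizeWhitespace)
    .filter((clause) => clause.length > 0);
}

// Separate the command keyword (upper-cased) from the rest of the clause.
function splitCommand(clause: string): { command: string; rest: string } {
  const normalized = normalizeWhitespace(clause);
  const spaceIdx = normalized.indexOf(' ');
  if (spaceIdx === -1) {
    return { command: normalized.toUpperCase(), rest: '' };
  }
  return {
    command: normalized.slice(0, spaceIdx).toUpperCase(),
    rest: normalized.slice(spaceIdx + 1),
  };
}

// Clauses match when the command and body agree; KEEP ignores field order.
function matchClause(expected: string, actual: string): boolean {
  const e = splitCommand(expected);
  const a = splitCommand(actual);
  if (e.command !== a.command) {
    return false;
  }
  if (e.command === 'KEEP') {
    const fields = (rest: string): string =>
      rest.split(',').map((f) => f.trim()).filter((f) => f.length > 0).sort().join(',');
    return fields(e.rest) === fields(a.rest);
  }
  return e.rest === a.rest;
}

// Fraction of expected clauses found anywhere in the actual query.
function esqlClauseScore(expectedQuery: string, actualQuery: string): number {
  const expected = splitClauses(expectedQuery);
  const actual = splitClauses(actualQuery);
  if (expected.length === 0) {
    return 0;
  }
  const matched = expected.filter((e) => actual.some((a) => matchClause(e, a))).length;
  return matched / expected.length;
}

const expectedQuery = 'FROM logs | WHERE host.name == "a" | KEEP host.name, user.name';
const reordered = 'from logs\n| WHERE host.name == "a"\n| KEEP user.name,  host.name';
console.log(esqlClauseScore(expectedQuery, reordered)); // prints: 1
```

Scoring per clause rather than per query gives partial credit when the agent gets most of the pipeline right, which is one plausible reason to prefer it over exact string comparison.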

3. Optional HTML eval reporter

Added a custom Playwright reporter (eval_html_reporter.ts) that generates a detailed tabular HTML report for debugging evaluation results:

Aggregate scores per evaluator
Per-example summary table with all evaluator scores
Detailed drill-down for each example: question, expected output, actual response, tools called, evaluator scores with explanations and metadata

Controlled via EVAL_HTML_REPORT=true environment variable (disabled by default).
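One plausible way to gate the reporter on the environment variable is to build the Playwright `reporter` list conditionally. This is a sketch under assumptions: the `buildReporters` helper, the `list` default reporter, and the relative path to `eval_html_reporter.ts` are hypothetical, and the PR may wire this differently.

```typescript
// A reporter entry in Playwright config form: [nameOrPath] or [nameOrPath, options].
type ReporterEntry = [string] | [string, Record<string, unknown>];

// Hypothetical helper: include the custom HTML reporter only when
// EVAL_HTML_REPORT is exactly "true", so it stays disabled by default.
function buildReporters(env: Record<string, string | undefined>): ReporterEntry[] {
  const reporters: ReporterEntry[] = [['list']];
  if (env['EVAL_HTML_REPORT'] === 'true') {
    // Path is an assumption; adjust to where the reporter file actually lives.
    reporters.push(['./eval_html_reporter.ts']);
  }
  return reporters;
}

console.log(buildReporters({}).length); // prints: 1
console.log(buildReporters({ EVAL_HTML_REPORT: 'true' }).length); // prints: 2
```

In a real `playwright.config.ts`, the result of `buildReporters(process.env)` would be passed as the `reporter` option.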

4. Risk score spec split

Split risk_score.spec.ts into two focused spec files:

risk_score_engine_off.spec.ts — tests agent behavior when the risk engine is disabled (3 examples)
risk_score_engine_on.spec.ts — tests agent behavior with active risk score data (3 examples)

5. Fix bulk indexing duplicate document IDs

In multi-model evaluations, all models within a single Playwright worker shared the same TEST_RUN_ID, causing duplicate Elasticsearch document IDs and bulk indexing failures. Fixed by including doc.task.model.id in the generated document ID in score_repository.ts.
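The collision and its fix can be illustrated with a small sketch. Only `doc.task.model.id` is taken from the PR; the `ScoreDoc` shape, the other fields, both helper names, and the model ids are hypothetical, and the real ID format in score_repository.ts may differ.

```typescript
// Hypothetical document shape; only task.model.id is confirmed by the PR text.
interface ScoreDoc {
  runId: string; // shared TEST_RUN_ID for the Playwright worker
  task: { model: { id: string } };
  exampleId: string;
  evaluatorName: string;
  repetition: number;
}

// ID without the model: identical for every model in the same worker run.
function legacyDocId(doc: ScoreDoc): string {
  return [doc.runId, doc.exampleId, doc.evaluatorName, doc.repetition].join('-');
}

// ID including the model: unique per model, so bulk indexing no longer collides.
function buildScoreDocId(doc: ScoreDoc): string {
  return [doc.runId, doc.task.model.id, doc.exampleId, doc.evaluatorName, doc.repetition].join('-');
}

const base = { runId: 'run-1', exampleId: 'ex-1', evaluatorName: 'Groundedness', repetition: 0 };
const docA: ScoreDoc = { ...base, task: { model: { id: 'model-a' } } };
const docB: ScoreDoc = { ...base, task: { model: { id: 'model-b' } } };

console.log(legacyDocId(docA) === legacyDocId(docB)); // prints: true (collision)
console.log(buildScoreDocId(docA) === buildScoreDocId(docB)); // prints: false
```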

6. Improve risk score inline tool description

Updated the risk score inline tool description to be more explicit about its capabilities, improving tool selection accuracy during evaluations.

7. Strengthen skill discovery in research agent prompt

When experimentalFeatures.skills is enabled:

Added mandatory skill discovery step (step 2) in the Tool Selection Policy
Updated skill instructions preamble to make skill loading a protocol requirement before using general-purpose tools
Added skill check reminder in Step 2 (Plan Research) of the execution workflow

8. Chat client and evaluate fixture improvements

Simplified chat_client.ts: removed unused agentId/Options interface, wired up modelUsage and traceId from API response
Updated evaluate.ts (siemSetup fixture) to robustly enable agentBuilder:experimentalFeatures by checking isOverridden before attempting API updates
Increased default repetitions from 1 to 3 in playwright.config.ts
Added onExperimentComplete callback support for attaching evaluation metadata to Playwright test reports

9. README updates

Updated README with multi-model evaluation instructions, HTML report usage, and EVAL_HTML_REPORT environment variable documentation.

Test plan

kbn-evals-suite-agent-builder Jest tests pass (18 tests, 2 suites)
security_solution_evals Jest tests pass (32 tests, 3 suites)
kbn-evals-suite-agent-builder type check passes
Multi-model evals (GPT-4.1, Claude 4.5 Sonnet, Gemini 2.5 Pro) run successfully with Claude 4.5 Sonnet as judge
CI pipeline validation

- Extract and deduplicate ToolUsageOnly evaluator into shared @kbn/evals-suite-agent-builder package with full test coverage
- Add custom evaluators (ToolOutputESQL, TokenUsage) with unit tests
- Add optional HTML eval reporter (EVAL_HTML_REPORT env flag)
- Split risk_score.spec.ts into engine_off / engine_on spec files
- Fix bulk indexing duplicate IDs by including model.id in doc ID
- Improve risk score inline tool description for better tool selection
- Strengthen skill discovery instructions in research agent prompt
- Simplify chat client, wire up modelUsage and traceId
- Robustly enable agentBuilder experimental features in siemSetup
- Update README with multi-model eval and HTML report usage

Move ToolUsageOnly, ToolOutputESQL, TokenUsage evaluators and model pricing utilities from kbn-evals-suite-agent-builder and security_solution_evals into @kbn/evals under dedicated subdirectories (tool_usage/, esql/, token_usage/) matching the existing convention used by correctness/, groundedness/, rag/, and trace_based/.

patrykkopycinski added 2 commits February 13, 2026 13:23

patrykkopycinski wants to merge 2 commits into ymao1:entity-analytics-skill from patrykkopycinski:entity-analytics-skill
