Improve entity analytics skill evals infrastructure #30

Draft
patrykkopycinski wants to merge 2 commits into ymao1:entity-analytics-skill from patrykkopycinski:entity-analytics-skill
patrykkopycinski commented Feb 13, 2026

Summary

Improvements to the entity analytics skill evaluation infrastructure, including shared evaluator deduplication, custom evaluators with tests, enhanced HTML reporting, spec file organization, and several bug fixes.


Evaluation Scores

Multi-model evaluation run with 3 repetitions per example, 6 examples (3 engine-off + 3 engine-on).

Judge model: Claude 4.5 Sonnet (us.anthropic.claude-sonnet-4-5-20250929-v1:0)

Overall Mean Scores (per model)

| Model | Groundedness | Relevance | Seq. Accuracy | ToolOutputESQL | ToolUsageOnly |
|---|---|---|---|---|---|
| Claude 4.5 Sonnet | 0.60 | 0.53 | 1.00 | 1.00 | 1.00 |
| Gemini 2.5 Pro | 0.94 | 0.77 | 1.00 | 0.70 | 1.00 |
| GPT-4.1 | 0.56 | 0.74 | 1.00 | 0.94 | 1.00 |

Per-Dataset Breakdown (from cumulative reports)

Risk Engine OFF (3 examples × 3 reps each):

| Model | Groundedness | Relevance | Seq. Accuracy | ToolOutputESQL | ToolUsageOnly |
|---|---|---|---|---|---|
| Claude 4.5 Sonnet | 0.97 | 0.58 | 1.00 | 1.00 | 1.00 |
| Gemini 2.5 Pro | ~1.00 | ~0.96 | 1.00 | 1.00 | 1.00 |
| GPT-4.1 | ~0.99 | ~0.95 | 1.00 | 1.00 | 1.00 |

Risk Engine ON (3 examples × 3 reps each):

| Model | Groundedness | Relevance | Seq. Accuracy | ToolOutputESQL | ToolUsageOnly |
|---|---|---|---|---|---|
| Claude 4.5 Sonnet | 0.22 | 0.48 | 1.00 | 1.00 | 1.00 |
| Gemini 2.5 Pro | ~0.90 | ~0.58 | 1.00 | ~0.40 | 1.00 |
| GPT-4.1 | ~0.11 | ~0.56 | 1.00 | ~0.88 | 1.00 |

Note: Gemini and GPT per-dataset scores (marked ~) are approximate because they are derived from cumulative worker reports. Sonnet scores are exact because its worker was the first to report.


1. Shared evaluator deduplication (@kbn/evals-suite-agent-builder)

The ToolUsageOnly evaluator existed in both kbn-evals-suite-agent-builder and security_solution_evals with diverging implementations. The security version was more advanced (handles invoke_skill indirection, acceptable alternatives like platform.core.search, auxiliary discovery tool filtering).

Changes:

  • Created src/evaluators/ directory in kbn-evals-suite-agent-builder with the canonical ToolUsageOnly evaluator and shared helpers (getToolCallStepsWithParams, AUXILIARY_DISCOVERY_TOOLS, ToolCallStep, getStringMeta)
  • Added index.ts package entry point for external consumers
  • Changed kibana.jsonc visibility from "private" to "shared" so cross-group packages can import
  • Replaced the inline ToolUsageOnly in evaluate_dataset.ts with createToolUsageOnlyEvaluator() from the new evaluators module
  • Added comprehensive unit tests (18 tests across 2 suites)
  • Updated security_solution_evals to import createToolUsageOnlyEvaluator from @kbn/evals-suite-agent-builder instead of maintaining a local copy
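
The behaviors listed for the shared evaluator (invoke_skill indirection, acceptable alternatives, auxiliary discovery tool filtering) could be combined roughly as in the sketch below. This is an illustration only, not the code from this PR: the `ToolCallStep` shape, the `params.tool` field, the members of `AUXILIARY_DISCOVERY_TOOLS`, and the `scoreToolUsage` function are all assumptions made for the example.

```typescript
// Hypothetical shape; the real ToolCallStep type in the PR may differ.
interface ToolCallStep {
  toolId: string;
  params?: Record<string, unknown>;
}

// Assumed members; the PR names this constant but does not list its contents.
const AUXILIARY_DISCOVERY_TOOLS = new Set<string>(['list_skills', 'read_skill']);

// Unwrap invoke_skill so the evaluator sees the tool that actually ran.
// Assumption: the wrapped tool name is carried in params.tool.
function resolveToolId(step: ToolCallStep): string {
  const inner = step.params?.['tool'];
  if (step.toolId === 'invoke_skill' && typeof inner === 'string') {
    return inner;
  }
  return step.toolId;
}

// Returns 1 when the expected tool (or an accepted alternative) was called,
// ignoring auxiliary discovery tools; returns 0 otherwise.
function scoreToolUsage(
  steps: ToolCallStep[],
  expectedTool: string,
  alternatives: string[] = []
): number {
  const accepted = new Set<string>([expectedTool, ...alternatives]);
  const called = steps
    .map(resolveToolId)
    .filter((id) => !AUXILIARY_DISCOVERY_TOOLS.has(id));
  return called.some((id) => accepted.has(id)) ? 1 : 0;
}
```

Under these assumptions, a run that only calls discovery tools scores 0, while a run that reaches the expected tool through invoke_skill, or calls an accepted alternative such as platform.core.search, scores 1.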

2. Custom evaluators with unit tests (security_solution_evals)

Extracted custom evaluators from evaluate_dataset.ts into a structured src/evaluators/ directory:

  • ToolUsageOnly — re-exported from @kbn/evals-suite-agent-builder (shared)
  • ToolOutputESQL — checks whether the agent produced the expected ES|QL query using clause-level matching with order-independent KEEP field matching
  • TokenUsage — reports token-usage statistics and estimated cost per evaluation example
  • Helpers — ESQL-specific utilities (extractEsqlQueries, normalizeWhitespace, matchClause)

Added Jest test infrastructure (jest.config.js) and comprehensive unit tests (32 tests across 3 suites).
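
As a rough illustration of clause-level matching with order-independent KEEP fields, the sketch below splits each query on pipes, compares clauses after whitespace normalization, and treats KEEP field lists as sets. The helper names `normalizeWhitespace` and `matchClause` come from the list above, but their bodies here are assumptions, as are `splitClauses`, `keepFields`, `scoreEsql`, and the fractional scoring scheme.

```typescript
// Collapse runs of whitespace so formatting differences do not affect matching.
const normalizeWhitespace = (s: string): string => s.replace(/\s+/g, ' ').trim();

// Split a query into pipe-separated clauses.
// Simplification: a '|' inside a string literal would be split incorrectly.
const splitClauses = (query: string): string[] =>
  query
    .split('|')
    .map(normalizeWhitespace)
    .filter((clause) => clause.length > 0);

const isKeep = (clause: string): boolean => /^keep\s/i.test(clause);

// Extract the field list of a KEEP clause, sorted so order does not matter.
const keepFields = (clause: string): string[] =>
  clause
    .slice('KEEP'.length)
    .split(',')
    .map((field) => field.trim())
    .sort();

// KEEP clauses match when they name the same fields in any order;
// every other clause must match exactly after whitespace normalization.
function matchClause(expected: string, actual: string): boolean {
  if (isKeep(expected) && isKeep(actual)) {
    return keepFields(expected).join(',') === keepFields(actual).join(',');
  }
  return expected === actual;
}

// Assumed scoring: fraction of expected clauses found in the actual query.
function scoreEsql(expectedQuery: string, actualQuery: string): number {
  const expected = splitClauses(expectedQuery);
  const actual = splitClauses(actualQuery);
  if (expected.length === 0) return 0;
  const matched = expected.filter((e) => actual.some((a) => matchClause(e, a)));
  return matched.length / expected.length;
}
```

With this scheme a query that differs from the expected one only in line breaks and KEEP field order scores 1, and a query missing one of three expected clauses scores 2/3.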

3. Optional HTML eval reporter

Added a custom Playwright reporter (eval_html_reporter.ts) that generates a detailed tabular HTML report for debugging evaluation results:

  • Aggregate scores per evaluator
  • Per-example summary table with all evaluator scores
  • Detailed drill-down for each example: question, expected output, actual response, tools called, evaluator scores with explanations and metadata

Controlled via EVAL_HTML_REPORT=true environment variable (disabled by default).
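
One way such an opt-in flag could be wired into a Playwright config is sketched below. The `buildReporters` helper and the reporter path are hypothetical; only the `EVAL_HTML_REPORT` variable and the `eval_html_reporter.ts` file name come from this description.

```typescript
// Playwright accepts reporters as [name] or [name, options] tuples,
// where name may be a built-in reporter or a path to a custom one.
type ReporterEntry = [string] | [string, Record<string, unknown>];

// Build the reporter list, adding the HTML eval reporter only when opted in.
function buildReporters(env: Record<string, string | undefined>): ReporterEntry[] {
  const reporters: ReporterEntry[] = [['list']];
  if (env['EVAL_HTML_REPORT'] === 'true') {
    // Hypothetical path; the actual location of the reporter file may differ.
    reporters.push(['./eval_html_reporter.ts']);
  }
  return reporters;
}
```

In a `playwright.config.ts` this would be used as `reporter: buildReporters(process.env)`, so any value other than the literal string `true` leaves the extra reporter off.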

4. Risk score spec split

Split risk_score.spec.ts into two focused spec files:

  • risk_score_engine_off.spec.ts — tests agent behavior when the risk engine is disabled (3 examples)
  • risk_score_engine_on.spec.ts — tests agent behavior with active risk score data (3 examples)

5. Fix bulk indexing duplicate document IDs

In multi-model evaluations, all models within a single Playwright worker shared the same TEST_RUN_ID, causing duplicate Elasticsearch document IDs and bulk indexing failures. Fixed by including doc.task.model.id in the generated document ID in score_repository.ts.
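
The collision and the fix can be illustrated with a minimal sketch. Only `doc.task.model.id` is taken from this description; the other fields of `ScoreDoc`, the `buildDocId` helper, and the ID layout are assumptions, and the actual code in score_repository.ts may look different.

```typescript
// Hypothetical document shape; only task.model.id is named in the PR description.
interface ScoreDoc {
  runId: string;
  task: { model: { id: string } };
  exampleId: string;
  evaluator: string;
  repetition: number;
}

// Including the model ID keeps documents from different models distinct
// even when they share the same run ID, example, evaluator, and repetition.
const buildDocId = (doc: ScoreDoc): string =>
  [doc.runId, doc.task.model.id, doc.exampleId, doc.evaluator, String(doc.repetition)].join('-');
```

Without the model ID component, two models evaluated in the same worker would produce identical IDs for the same example and evaluator, which is the duplicate-ID failure described above.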

6. Improve risk score inline tool description

Updated the risk score inline tool description to be more explicit about its capabilities, improving tool selection accuracy during evaluations.

7. Strengthen skill discovery in research agent prompt

When experimentalFeatures.skills is enabled:

  • Added mandatory skill discovery step (step 2) in the Tool Selection Policy
  • Updated skill instructions preamble to make skill loading a protocol requirement before using general-purpose tools
  • Added skill check reminder in Step 2 (Plan Research) of the execution workflow

8. Chat client and evaluate fixture improvements

  • Simplified chat_client.ts: removed unused agentId/Options interface, wired up modelUsage and traceId from API response
  • Updated evaluate.ts (siemSetup fixture) to robustly enable agentBuilder:experimentalFeatures by checking isOverridden before attempting API updates
  • Increased default repetitions from 1 to 3 in playwright.config.ts
  • Added onExperimentComplete callback support for attaching evaluation metadata to Playwright test reports

9. README updates

Updated README with multi-model evaluation instructions, HTML report usage, and EVAL_HTML_REPORT environment variable documentation.

Test plan

  • kbn-evals-suite-agent-builder Jest tests pass (18 tests, 2 suites)
  • security_solution_evals Jest tests pass (32 tests, 3 suites)
  • kbn-evals-suite-agent-builder type check passes
  • Multi-model evals (GPT-4.1, Claude 4.5 Sonnet, Gemini 2.5 Pro) run successfully with Claude 4.5 Sonnet as judge
  • CI pipeline validation

- Extract and deduplicate ToolUsageOnly evaluator into shared
  @kbn/evals-suite-agent-builder package with full test coverage
- Add custom evaluators (ToolOutputESQL, TokenUsage) with unit tests
- Add optional HTML eval reporter (EVAL_HTML_REPORT env flag)
- Split risk_score.spec.ts into engine_off / engine_on spec files
- Fix bulk indexing duplicate IDs by including model.id in doc ID
- Improve risk score inline tool description for better tool selection
- Strengthen skill discovery instructions in research agent prompt
- Simplify chat client, wire up modelUsage and traceId
- Robustly enable agentBuilder experimental features in siemSetup
- Update README with multi-model eval and HTML report usage

Move ToolUsageOnly, ToolOutputESQL, TokenUsage evaluators and model
pricing utilities from kbn-evals-suite-agent-builder and
security_solution_evals into @kbn/evals under dedicated subdirectories
(tool_usage/, esql/, token_usage/) matching the existing convention
used by correctness/, groundedness/, rag/, and trace_based/.