Commit aa589ec

[Dashboards in Chat] Add skill selection eval suite (#269819)
## Summary

Adds a new `@kbn/evals` suite for Agent Builder Dashboards / Dashboards in Chat.

The suite checks that the assistant chooses the right dashboard-related behavior for different user intents:

- When the user asks for a dashboard, it loads dashboard management and creates a dashboard attachment.
- When the user asks for a standalone visualization, it creates only the visualization and does not create a dashboard.
- When the user asks for data exploration help, like ES|QL query writing or field discovery, it does not incorrectly use dashboard management.
- When the user requests dashboard sections, it creates the requested sections with relevant panels.
- When a dashboard is created, the attachment has a title, expected panel count, and valid grid layout.
- KPI/metric panels are laid out in compact rows, and mixed layouts place full-width trend panels correctly.
- The assistant follows the expected internal tool path for each scenario.

This PR also adds reusable code evaluators for dashboard attachment shape, skill selection, panel counts, sections, grid bounds, row layout, and tool trajectory checks.

The new suite is registered in eval metadata and added to the weekly LLM evals pipeline for Agent Builder Dashboards.

## How to run

See the package README:
`x-pack/platform/packages/shared/agent-builder-dashboards/kbn-evals-suite-agent-builder-dashboards/README.md`

Example:

```bash
node scripts/evals start \
  --suite agent-builder-dashboards \
  --project eis-anthropic-claude-4-5-sonnet \
  --evaluation-connector-id eis-anthropic-claude-4-5-sonnet
```

<img width="1468" height="339" alt="Screenshot 2026-05-28 at 13 12 48" src="https://github.com/user-attachments/assets/ea445653-4f8c-4346-b2ca-9b72c0b9ac7e" />

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
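
The layout checks listed in the summary (grid bounds and compact KPI rows) can be illustrated with a small sketch. This is not the evaluator code added by the PR: the `PanelGrid` shape, the function names, and the 48-column grid width are assumptions chosen for illustration.

```ts
// Hypothetical panel grid shape; the real attachment schema may differ.
interface PanelGrid {
  x: number; // column offset
  y: number; // row offset
  w: number; // width in columns
  h: number; // height in rows
}

// Assumed grid width, used only for this illustration.
const GRID_COLUMNS = 48;

// Returns one message per rule a panel breaks: non-positive size,
// negative offset, or extending past the right edge of the grid.
function findGridViolations(panels: PanelGrid[], columns: number = GRID_COLUMNS): string[] {
  const violations: string[] = [];
  panels.forEach((p, i) => {
    if (p.w <= 0 || p.h <= 0) violations.push(`panel ${i}: non-positive size`);
    if (p.x < 0 || p.y < 0) violations.push(`panel ${i}: negative offset`);
    if (p.x + p.w > columns) violations.push(`panel ${i}: exceeds ${columns} columns`);
  });
  return violations;
}

// Groups panels by their y offset, so a compact KPI row shows up
// as several panels sharing the same key.
function groupByRow(panels: PanelGrid[]): { [y: number]: PanelGrid[] } {
  const rows: { [y: number]: PanelGrid[] } = {};
  for (const p of panels) {
    (rows[p.y] = rows[p.y] || []).push(p);
  }
  return rows;
}

// Example layout: four KPI panels on one row, then a full-width trend panel.
const examplePanels: PanelGrid[] = [
  { x: 0, y: 0, w: 12, h: 5 },
  { x: 12, y: 0, w: 12, h: 5 },
  { x: 24, y: 0, w: 12, h: 5 },
  { x: 36, y: 0, w: 12, h: 5 },
  { x: 0, y: 5, w: 48, h: 10 },
];
```

With this layout, `findGridViolations(examplePanels)` reports nothing, and `groupByRow(examplePanels)` places four panels in row `0` and one in row `5`.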
1 parent 17b8f14 commit aa589ec

22 files changed

Lines changed: 2144 additions & 0 deletions

.buildkite/pipelines/evals/evals.suites.json

Lines changed: 19 additions & 0 deletions
@@ -232,6 +232,25 @@
       "configPath": "x-pack/solutions/security/packages/kbn-evals-suite-security-esql-generation-regression/playwright.config.ts",
       "tags": ["security", "esql-generation"],
       "ciLabels": ["evals:security-esql-generation-regression"]
+    },
+    {
+      "id": "agent-builder-dashboards",
+      "name": "Agent Builder Dashboards",
+      "slackChannel": "#kibana-presentation-reminders",
+      "configPath": "x-pack/platform/packages/shared/agent-builder-dashboards/kbn-evals-suite-agent-builder-dashboards/playwright.config.ts",
+      "tags": [
+        "platform",
+        "agent-builder-dashboards"
+      ],
+      "ciLabels": [
+        "evals:agent-builder-dashboards"
+      ],
+      "weeklyEisModelGroups": [
+        "eis/anthropic-claude-4.6-sonnet",
+        "eis/anthropic-claude-4.6-opus",
+        "eis/openai-gpt-5.2",
+        "eis/openai-gpt-5.4"
+      ]
     }
   ]
 }
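
The new entry uses the same fields as its neighbour in the hunk, plus `weeklyEisModelGroups`. As a sketch, that shape and the apparent `evals:<id>` label convention could be checked as below. The `EvalSuiteEntry` type is inferred only from the fields visible in this diff, treating `weeklyEisModelGroups` as optional is an assumption, and `findEntryProblems` is a hypothetical helper, not part of the repository.

```ts
// Shape inferred from the fields visible in this hunk; the real schema may have more.
interface EvalSuiteEntry {
  id: string;
  name: string;
  slackChannel: string;
  configPath: string;
  tags: string[];
  ciLabels: string[];
  weeklyEisModelGroups?: string[]; // assumed optional
}

// Hypothetical sanity check for a single suite entry.
function findEntryProblems(entry: EvalSuiteEntry): string[] {
  const problems: string[] = [];
  if (!/playwright\.config\.ts$/.test(entry.configPath)) {
    problems.push('configPath does not point at a playwright.config.ts file');
  }
  if (entry.ciLabels.indexOf(`evals:${entry.id}`) === -1) {
    problems.push(`ciLabels is missing "evals:${entry.id}"`);
  }
  return problems;
}

// The entry added by this commit, copied from the hunk above.
const dashboardsEntry: EvalSuiteEntry = {
  id: 'agent-builder-dashboards',
  name: 'Agent Builder Dashboards',
  slackChannel: '#kibana-presentation-reminders',
  configPath:
    'x-pack/platform/packages/shared/agent-builder-dashboards/kbn-evals-suite-agent-builder-dashboards/playwright.config.ts',
  tags: ['platform', 'agent-builder-dashboards'],
  ciLabels: ['evals:agent-builder-dashboards'],
  weeklyEisModelGroups: [
    'eis/anthropic-claude-4.6-sonnet',
    'eis/anthropic-claude-4.6-opus',
    'eis/openai-gpt-5.2',
    'eis/openai-gpt-5.4',
  ],
};
```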

.buildkite/pipelines/evals/llm_evals.yml

Lines changed: 22 additions & 0 deletions
@@ -86,6 +86,28 @@ steps:
         - exit_status: '-1'
           limit: 3

+  - label: 'Evals: Agent Builder Dashboards'
+    key: kbn-evals-weekly-agent-builder-dashboards
+    command: bash .buildkite/scripts/steps/evals/run_suite.sh
+    env:
+      KBN_EVALS: '1'
+      FTR_EIS_CCM: '1'
+      EVAL_SUITE_ID: 'agent-builder-dashboards'
+      EVAL_FANOUT: '1'
+      EVAL_INCLUDE_EIS_MODELS: '1'
+      EVAL_MODEL_GROUPS: *weekly_eis_core_models
+    timeout_in_minutes: 60
+    agents:
+      image: family/kibana-ubuntu-2404
+      imageProject: elastic-images-prod
+      provider: gcp
+      machineType: n2-standard-8
+      preemptible: true
+    retry:
+      automatic:
+        - exit_status: '-1'
+          limit: 3
+
   - label: 'Evals: ES|QL Generation Evaluations'
     key: kbn-evals-weekly-esql-generation
     command: bash .buildkite/scripts/steps/evals/run_suite.sh

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
@@ -993,6 +993,7 @@ x-pack/platform/packages/private/upgrade-assistant/common @elastic/kibana-manage
 x-pack/platform/packages/private/upgrade-assistant/public @elastic/kibana-management
 x-pack/platform/packages/private/upgrade-assistant/server @elastic/kibana-management
 x-pack/platform/packages/shared/agent-builder-dashboards/agent-builder-dashboards-common @elastic/appex-ai-infra
+x-pack/platform/packages/shared/agent-builder-dashboards/kbn-evals-suite-agent-builder-dashboards @elastic/appex-ai-infra
 x-pack/platform/packages/shared/agent-builder/agent-builder-browser @elastic/workchat-eng
 x-pack/platform/packages/shared/agent-builder/agent-builder-common @elastic/workchat-eng
 x-pack/platform/packages/shared/agent-builder/agent-builder-genai-utils @elastic/workchat-eng

package.json

Lines changed: 1 addition & 0 deletions
@@ -1742,6 +1742,7 @@
     "@kbn/evals-extensions": "link:x-pack/platform/packages/shared/kbn-evals-extensions",
     "@kbn/evals-phoenix-executor": "link:x-pack/platform/packages/shared/kbn-evals-phoenix-executor",
     "@kbn/evals-suite-agent-builder": "link:x-pack/platform/packages/shared/agent-builder/kbn-evals-suite-agent-builder",
+    "@kbn/evals-suite-agent-builder-dashboards": "link:x-pack/platform/packages/shared/agent-builder-dashboards/kbn-evals-suite-agent-builder-dashboards",
     "@kbn/evals-suite-alerts-rag": "link:x-pack/solutions/security/packages/kbn-evals-suite-alerts-rag",
     "@kbn/evals-suite-attack-discovery": "link:x-pack/solutions/security/packages/kbn-evals-suite-attack-discovery",
     "@kbn/evals-suite-endpoint": "link:x-pack/solutions/security/packages/kbn-evals-suite-endpoint",

tsconfig.base.json

Lines changed: 2 additions & 0 deletions
@@ -1218,6 +1218,8 @@
       "@kbn/evals-plugin/*": ["x-pack/platform/plugins/shared/evals/*"],
       "@kbn/evals-suite-agent-builder": ["x-pack/platform/packages/shared/agent-builder/kbn-evals-suite-agent-builder"],
       "@kbn/evals-suite-agent-builder/*": ["x-pack/platform/packages/shared/agent-builder/kbn-evals-suite-agent-builder/*"],
+      "@kbn/evals-suite-agent-builder-dashboards": ["x-pack/platform/packages/shared/agent-builder-dashboards/kbn-evals-suite-agent-builder-dashboards"],
+      "@kbn/evals-suite-agent-builder-dashboards/*": ["x-pack/platform/packages/shared/agent-builder-dashboards/kbn-evals-suite-agent-builder-dashboards/*"],
       "@kbn/evals-suite-alerts-rag": ["x-pack/solutions/security/packages/kbn-evals-suite-alerts-rag"],
       "@kbn/evals-suite-alerts-rag/*": ["x-pack/solutions/security/packages/kbn-evals-suite-alerts-rag/*"],
       "@kbn/evals-suite-attack-discovery": ["x-pack/solutions/security/packages/kbn-evals-suite-attack-discovery"],

x-pack/platform/packages/shared/agent-builder-dashboards/kbn-evals-suite-agent-builder-dashboards/README.md

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
# @kbn/evals-suite-agent-builder-dashboards

Evaluation test suite for Agent Builder Dashboards behavior, built on top of [`@kbn/evals`](../../kbn-evals/README.md).

## Overview

This package contains in-code evaluation datasets for Agent Builder Dashboards behavior. The initial coverage focuses on skill selection and intent routing:

- Dashboard requests should load dashboard management.
- Standalone visualization requests should load visualization creation without creating a dashboard.
- ES|QL query-writing requests should not use dashboard management.

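The three routing expectations above amount to a per-intent list of skills that must or must not be loaded, plus whether a dashboard should result. A minimal sketch of such a check follows. The skill identifiers (`dashboard-management`, `visualization-creation`), the `IntentExpectation` and `RunObservation` shapes, and `checkRun` are hypothetical and are not taken from the suite.

```ts
type Intent = 'dashboard' | 'visualization' | 'esql';

// Hypothetical description of what one intent should trigger.
interface IntentExpectation {
  requiredSkills: string[];
  forbiddenSkills: string[];
  expectDashboard: boolean;
}

// Hypothetical summary of what the assistant actually did in one run.
interface RunObservation {
  loadedSkills: string[];
  createdDashboard: boolean;
}

// Illustrative expectations mirroring the three bullets above.
const expectations: Record<Intent, IntentExpectation> = {
  dashboard: { requiredSkills: ['dashboard-management'], forbiddenSkills: [], expectDashboard: true },
  visualization: { requiredSkills: ['visualization-creation'], forbiddenSkills: [], expectDashboard: false },
  esql: { requiredSkills: [], forbiddenSkills: ['dashboard-management'], expectDashboard: false },
};

// Returns a list of failures; an empty list means the run matched the expectation.
function checkRun(observed: RunObservation, expected: IntentExpectation): string[] {
  const failures: string[] = [];
  for (const skill of expected.requiredSkills) {
    if (observed.loadedSkills.indexOf(skill) === -1) failures.push(`missing skill: ${skill}`);
  }
  for (const skill of expected.forbiddenSkills) {
    if (observed.loadedSkills.indexOf(skill) !== -1) failures.push(`unexpected skill: ${skill}`);
  }
  if (observed.createdDashboard !== expected.expectDashboard) {
    failures.push(expected.expectDashboard ? 'dashboard was not created' : 'dashboard was created');
  }
  return failures;
}
```

Returning a list of failures rather than a boolean keeps the reason for a failed example visible in the eval report.
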
For general information about writing evaluation tests, configuration, reporting, and comparison, see the main [`@kbn/evals` documentation](../../kbn-evals/README.md).

## Prerequisites

### Configure EIS Connectors

For local EIS-backed model runs, run the eval setup wizard:

```bash
node scripts/evals init
```

When `node scripts/evals init` finishes, copy the printed connector export into the same shell where you will run evals:

```bash
export KIBANA_TESTING_AI_CONNECTORS="..."
```

This makes EIS connector IDs available as Playwright projects, for example `eis-anthropic-claude-4-5-sonnet`.

### Optional: Configure Phoenix and Tracing

`node scripts/evals start` starts EDOT and Scout for you. If you want to export traces to Phoenix or a shared tracing cluster, configure the eval profiles with:

```bash
node scripts/evals init config
```

See [`@kbn/evals` documentation](../../kbn-evals/README.md) for `TRACING_EXPORTERS`, `TRACING_ES_URL`, and Phoenix executor details.

## Running Evaluations

### Managed Stack

Use `node scripts/evals start` when you want the CLI to start or reuse EDOT and Scout, enable EIS Cloud Connected Mode, and then run the suite:

```bash
node scripts/evals start \
  --suite agent-builder-dashboards \
  --project eis-anthropic-claude-4-5-sonnet \
  --evaluation-connector-id eis-anthropic-claude-4-5-sonnet
```

The Scout Kibana instance is usually available at <http://localhost:5620>, and Elasticsearch at <http://localhost:9220>.

### Run a Single Eval

Filter by Playwright test title with `--grep`:

```bash
node scripts/evals start \
  --suite agent-builder-dashboards \
  --grep "dashboards in chat smokescreen" \
  --project eis-anthropic-claude-4-5-sonnet \
  --evaluation-connector-id eis-anthropic-claude-4-5-sonnet
```

Available skill-selection test titles:

- `dashboards in chat smokescreen`
- `visualization request does not create dashboard`
- `esql query help does not create dashboard`

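Playwright treats the `--grep` value as a regular expression, so a shorter pattern can select more than one of the titles above. The sketch below shows that selection in isolation. It is simplified: Playwright matches against a longer string that also includes, roughly, the project, file, and describe names, and its flag handling is not modelled here. `selectTitles` is a hypothetical helper.

```ts
// The test titles listed above.
const titles: string[] = [
  'dashboards in chat smokescreen',
  'visualization request does not create dashboard',
  'esql query help does not create dashboard',
];

// Returns the titles that a grep-style pattern would match.
function selectTitles(pattern: string, allTitles: string[]): string[] {
  const re = new RegExp(pattern); // no "g" flag, so test() keeps no state between calls
  return allTitles.filter((title) => re.test(title));
}

// Selects the two negative cases.
const negativeCases = selectTitles('does not create dashboard', titles);

// Selects only the smoke test.
const smokeOnly = selectTitles('dashboards in chat', titles);
```

Passing a full title, as in the commands in this README, keeps the selection to a single test.
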
After the eval stack is already running, use `run` for faster iteration:

```bash
node scripts/evals run \
  --suite agent-builder-dashboards \
  --grep "visualization request does not create dashboard" \
  --project eis-anthropic-claude-4-5-sonnet \
  --evaluation-connector-id eis-anthropic-claude-4-5-sonnet
```

### Repetitions

By default, each dataset example runs once. To run each example multiple times, pass `--repetitions`:

```bash
node scripts/evals start \
  --suite agent-builder-dashboards \
  --grep "dashboards in chat smokescreen" \
  --project eis-anthropic-claude-4-5-sonnet \
  --evaluation-connector-id eis-anthropic-claude-4-5-sonnet \
  --repetitions 3
```

Equivalent environment variable:

```bash
EVALUATION_REPETITIONS=3 node scripts/evals run \
  --suite agent-builder-dashboards \
  --grep "dashboards in chat smokescreen" \
  --project eis-anthropic-claude-4-5-sonnet \
  --evaluation-connector-id eis-anthropic-claude-4-5-sonnet
```

### Direct Playwright

For lower-level debugging, run Playwright directly:

```bash
EVALUATION_CONNECTOR_ID=eis-anthropic-claude-4-5-sonnet \
  node scripts/playwright test \
  --config x-pack/platform/packages/shared/agent-builder-dashboards/kbn-evals-suite-agent-builder-dashboards/playwright.config.ts \
  evals/skill_selection/skill_selection.spec.ts \
  --project eis-anthropic-claude-4-5-sonnet \
  --grep "esql query help does not create dashboard"
```

Use `--list` to check what Playwright can discover:

```bash
EVALUATION_CONNECTOR_ID=eis-anthropic-claude-4-5-sonnet \
  node scripts/playwright test \
  --config x-pack/platform/packages/shared/agent-builder-dashboards/kbn-evals-suite-agent-builder-dashboards/playwright.config.ts \
  --project eis-anthropic-claude-4-5-sonnet \
  --list
```

## Sample Data

The skill-selection spec loads Kibana logs sample data before running:

```ts
await fetch('/api/sample_data/logs', {
  method: 'POST',
  version: '2023-10-31',
});
```

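The `version` option in the snippet above is not part of the standard Fetch API, so the spec presumably relies on a Kibana test helper. For reference, here is a sketch of what an equivalent raw HTTP request could look like. It assumes the helper maps `version` to the `elastic-api-version` header and that Kibana is reachable at the Scout URL mentioned earlier. `buildSampleDataRequest` is a hypothetical name, and authentication is left out.

```ts
// Plain description of an HTTP request, independent of any HTTP client.
interface RequestSpec {
  url: string;
  method: string;
  headers: { [name: string]: string };
}

// Builds the request that installs a Kibana sample data set.
function buildSampleDataRequest(kibanaUrl: string, dataset: string, apiVersion: string): RequestSpec {
  const base = kibanaUrl.replace(/\/+$/, ''); // drop trailing slashes
  return {
    url: `${base}/api/sample_data/${encodeURIComponent(dataset)}`,
    method: 'POST',
    headers: {
      'kbn-xsrf': 'true', // Kibana rejects non-GET API calls without this header
      'elastic-api-version': apiVersion, // assumed mapping of the `version` option
    },
  };
}

// Values taken from this README: Scout Kibana URL, data set, and API version.
const sampleDataRequest = buildSampleDataRequest('http://localhost:5620', 'logs', '2023-10-31');
```

The returned object can be passed to any HTTP client once credentials are added.
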
To verify the index exists in the Scout Elasticsearch cluster:

```bash
curl -u elastic:changeme "http://localhost:9220/_cat/indices/kibana_sample_data_logs?v"
curl -u elastic:changeme "http://localhost:9220/kibana_sample_data_logs/_count?pretty"
```

## Stopping the Stack

When you are done:

```bash
node scripts/evals stop
```
