Commit 19abd49

[kbn-evals] Add on-demand evals - Buildkite (elastic#269807)
Closes elastic/obs-ai-team#606

## Summary

Adds a dedicated on-demand Buildkite pipeline (`bk-kibana-evals-on-demand`) so any Elastic org member can run a single eval suite + models on any branch without opening a PR or waiting for the full Kibana PR pipeline. Reuses the existing eval runner (`run_suite.sh`) and golden-cluster result export.

### Problem

Today, eval CI is Buildkite-only and PR-triggered via labels (`evals:*` + `models:*`). Engineers must wait for the full Kibana PR pipeline before eval results are available, and there is no lightweight way to run one suite on an arbitrary branch without that overhead.

#### Trigger flow

`kibana-evals-on-demand` → **New build** → pick branch/commit → set env vars (`EVAL_SUITE_ID`, `EVAL_MODEL_GROUPS`, etc.)

#### Results

Golden cluster evals UI, filter by branch or run id `bk-<buildkite_build_id>`.

#### Access

Everyone has `BUILD_AND_READ` (any org member can start builds); `kibana-operations` and `obs-ai-team` retain `MANAGE_BUILD_AND_READ`.

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
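The trigger flow in the commit message goes through the Buildkite UI. As a rough sketch only, the same kind of build could presumably be requested through Buildkite's REST API, whose create-build endpoint accepts `commit`, `branch`, `message`, and `env` fields. The helper name `build_request`, the `<BUILDKITE_API_TOKEN>` placeholder, the message text, and the assumption that API-created builds are allowed for this pipeline are illustrative and not part of the commit; the org and pipeline slugs are taken from the pipeline URL. The request is only prepared here, not sent.

```python
import json
import urllib.request

API_ROOT = "https://api.buildkite.com/v2"


def build_request(token, branch, env, commit="HEAD",
                  org="elastic", pipeline="kibana-evals-on-demand"):
    """Prepare (but do not send) a create-build request.

    Hypothetical helper: the slugs come from the pipeline URL in this commit,
    everything else is an assumption made for illustration.
    """
    url = f"{API_ROOT}/organizations/{org}/pipelines/{pipeline}/builds"
    payload = {
        "commit": commit,
        "branch": branch,
        "message": f"On-demand eval: {env.get('EVAL_SUITE_ID', 'unknown suite')}",
        # Build-level env vars, the same ones the UI flow asks for.
        "env": env,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request(
    token="<BUILDKITE_API_TOKEN>",  # placeholder, supply a real token to send
    branch="main",
    env={
        "EVAL_SUITE_ID": "agent-builder",
        "EVAL_MODEL_GROUPS": "eis/openai-gpt-5.4",
        "EVAL_INCLUDE_EIS_MODELS": "1",
    },
)
print(request.get_method(), request.full_url)
# Sending is left to the caller, e.g. urllib.request.urlopen(request)
```

Keeping the request construction separate from sending makes the payload easy to inspect before anything reaches Buildkite.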
1 parent 07b8ef1 commit 19abd49

4 files changed

Lines changed: 135 additions & 12 deletions

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+# yaml-language-server: $schema=https://gist.githubusercontent.com/elasticmachine/988b80dae436cafea07d9a4a460a011d/raw/rre.schema.json
+apiVersion: backstage.io/v1alpha1
+kind: Resource
+metadata:
+  name: bk-kibana-evals-on-demand
+  description: 'Runs a single @kbn/evals suite and model on demand'
+  links:
+    - url: 'https://buildkite.com/elastic/kibana-evals-on-demand'
+      title: Pipeline link
+spec:
+  type: buildkite-pipeline
+  owner: 'group:obs-ai-team'
+  system: buildkite
+  implementation:
+    apiVersion: buildkite.elastic.dev/v1
+    kind: Pipeline
+    metadata:
+      name: 'Kibana / Evals / On-demand LLM Evals'
+      description: 'Runs one @kbn/evals suite against one model on a chosen branch (manual Buildkite trigger)'
+    spec:
+      env:
+        KBN_EVALS: '1'
+      allow_rebuilds: true
+      branch_configuration: ''
+      cancel_intermediate_builds: false
+      default_branch: main
+      repository: elastic/kibana
+      pipeline_file: .buildkite/pipelines/evals/on_demand_evals.yml
+      provider_settings:
+        build_branches: false
+        build_pull_requests: false
+        publish_commit_status: false
+        trigger_mode: none
+        prefix_pull_request_fork_branch_names: false
+        skip_pull_request_builds_for_existing_commits: false
+        build_tags: false
+      teams:
+        kibana-operations:
+          access_level: MANAGE_BUILD_AND_READ
+        obs-ai-team:
+          access_level: MANAGE_BUILD_AND_READ
+        everyone:
+          access_level: BUILD_AND_READ
+      tags:
+        - kibana
+        - kbn-evals

.buildkite/pipeline-resource-definitions/locations.yml

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ spec:
   type: url
   targets:
     - https://github.com/elastic/kibana/blob/main/.buildkite/pipeline-resource-definitions/cloud-security-posture/cspm-agentless-scout.yml
+    - https://github.com/elastic/kibana/blob/main/.buildkite/pipeline-resource-definitions/evals/kibana-evals-on-demand.yml
     - https://github.com/elastic/kibana/blob/main/.buildkite/pipeline-resource-definitions/evals/kibana-evals.yml
     - https://github.com/elastic/kibana/blob/main/.buildkite/pipeline-resource-definitions/kibana-agent-builder-smoke-tests-daily.yml
     - https://github.com/elastic/kibana/blob/main/.buildkite/pipeline-resource-definitions/kibana-apis-capacity-testing-daily.yml
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+env:
+  KBN_EVALS: '1'
+steps:
+  - label: '👨‍🔧 Pre-Build'
+    command: .buildkite/scripts/lifecycle/pre_build.sh
+    agents:
+      image: family/kibana-ubuntu-2404
+      imageProject: elastic-images-prod
+      provider: gcp
+      machineType: n2-standard-2
+
+  - wait
+
+  - label: '🧑‍🏭 Build Kibana Distribution'
+    command: .buildkite/scripts/steps/build_kibana.sh
+    agents:
+      image: family/kibana-ubuntu-2404
+      imageProject: elastic-images-prod
+      provider: gcp
+      machineType: n2-standard-8
+    key: build
+    if: "build.env('KIBANA_BUILD_ID') == null || build.env('KIBANA_BUILD_ID') == ''"
+
+  - wait
+
+  - label: 'LLM Evals: On-demand'
+    key: kbn-evals-on-demand
+    command: bash .buildkite/scripts/steps/evals/run_suite.sh
+    env:
+      FTR_EIS_CCM: '1'
+      EVAL_FANOUT: '1'
+    timeout_in_minutes: 120
+    agents:
+      image: family/kibana-ubuntu-2404
+      imageProject: elastic-images-prod
+      provider: gcp
+      machineType: n2-standard-8
+      preemptible: true
+    retry:
+      automatic:
+        - exit_status: '-1'
+          limit: 3
+        - exit_status: '*'
+          limit: 1
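The `if` condition on the "Build Kibana Distribution" step is what lets `KIBANA_BUILD_ID` skip the build: the step runs only when that build-level variable is missing or empty. A minimal sketch of the same decision is shown below. The function name `should_build_kibana` is hypothetical, and this only mirrors the expression in the pipeline file, not Buildkite's own conditional evaluator.

```python
def should_build_kibana(build_env):
    """Mirror the step condition:
    build.env('KIBANA_BUILD_ID') == null || build.env('KIBANA_BUILD_ID') == ''
    """
    build_id = build_env.get("KIBANA_BUILD_ID")
    # Build only when no existing Kibana build was supplied for reuse.
    return build_id is None or build_id == ""


# No KIBANA_BUILD_ID set: the build step would run.
print(should_build_kibana({"EVAL_SUITE_ID": "agent-builder"}))
# KIBANA_BUILD_ID supplied: the build step would be skipped.
print(should_build_kibana({"KIBANA_BUILD_ID": "some-build-id"}))
```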

x-pack/platform/packages/shared/kbn-evals/README.md

Lines changed: 44 additions & 12 deletions
@@ -122,6 +122,7 @@ node scripts/evals start --suite agent-builder
 `evals init` walks you through EIS (Cloud Connected Mode) connector discovery or validates existing connectors in `kibana.dev.yml`. It outputs an `export KIBANA_TESTING_AI_CONNECTORS="..."` command to paste into your shell.
 
 `evals start` orchestrates the full stack in one terminal:
+
 1. Starts the EDOT collector (Docker) for trace capture -- exports traces to the configured tracing Elasticsearch cluster (via `TRACING_ES_URL`)
 2. Starts Scout (ES + Kibana with `evals_tracing` config)
 3. Enables EIS CCM on the Scout ES cluster (if using EIS connectors)
@@ -161,6 +162,7 @@ node scripts/evals start --suite attack-discovery --export-profile local
 ```
 
 Notes:
+
 - `--datasets-profile <name>` loads `EVALUATIONS_KBN_URL` / `EVALUATIONS_KBN_API_KEY` from `config.<name>.json`
 - `--export-profile <name>` loads `EVALUATIONS_ES_URL`, `TRACING_ES_URL`, and `TRACING_EXPORTERS` from `config.<name>.json`
 
@@ -210,6 +212,35 @@ The CLI uses suite metadata from:
 .buildkite/pipelines/evals/evals.suites.json
 ```
 
+### On-demand evals (Buildkite)
+
+Run a single suite and model on any branch without opening a PR or waiting for the full Kibana PR pipeline:
+
+1. Open [kibana-evals-on-demand](https://buildkite.com/elastic/kibana-evals-on-demand) on Buildkite
+2. Click **New build**, select the branch (or commit) to evaluate
+3. Under **Environment variables** (in New build options), add the required variables below — one `KEY=value` per line. These are build-level env vars read by `run_suite.sh`.
+
+Pipeline registration: [`.buildkite/pipeline-resource-definitions/evals/kibana-evals-on-demand.yml`](../../../../../.buildkite/pipeline-resource-definitions/evals/kibana-evals-on-demand.yml).
+
+| Variable                  | Required           | Description                                                                                             |
+| ------------------------- | ------------------ | ------------------------------------------------------------------------------------------------------- |
+| `EVAL_SUITE_ID`           | yes                | Suite id from `evals.suites.json`, e.g. `agent-builder`                                                 |
+| `EVAL_MODEL_GROUPS`       | yes                | Model group, e.g. `eis/openai-gpt-5.4`                                                                  |
+| `EVAL_INCLUDE_EIS_MODELS` | for `eis/*` models | Set to `1` when `EVAL_MODEL_GROUPS` uses `eis/...`                                                      |
+| `EVALUATION_CONNECTOR_ID` | no                 | LLM-as-judge connector id override (connector id, not `eis/...` model group)                            |
+| `EVAL_SERVER_CONFIG_SET`  | some suites        | From `serverConfigSet` on the suite entry in `evals.suites.json` (e.g. `evals_endpoint` for `endpoint`) |
+| `KIBANA_BUILD_ID`         | no                 | Reuse a Kibana build from another Buildkite job (skips the build step)                                  |
+
+The eval pipeline step sets `FTR_EIS_CCM=1` and `EVAL_FANOUT=1`; `KBN_EVALS=1` is set on the pipeline.
+
+Example environment variables for Agent Builder + one EIS model:
+
+```text
+EVAL_SUITE_ID=agent-builder
+EVAL_MODEL_GROUPS=eis/openai-gpt-5.4
+EVAL_INCLUDE_EIS_MODELS=1
+```
+
 ### CI labels
 
 Eval suites can be triggered in PR CI by adding GitHub labels:
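The variable table added to the README in this hunk can be turned into a quick pre-flight check before pasting values into the **New build** dialog. The sketch below is illustrative only: `parse_env_lines` and `validate_on_demand_env` are hypothetical helpers, not part of `run_suite.sh`, and detecting EIS models with a plain `eis/` substring match is an assumption, since the README does not say how multiple model groups are separated.

```python
def parse_env_lines(text):
    """Parse 'KEY=value' lines (one per line) into a dict, skipping blank lines."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def validate_on_demand_env(env):
    """Return a list of problems; an empty list means the required rows are satisfied."""
    problems = []
    # The two variables the README table marks as required.
    for name in ("EVAL_SUITE_ID", "EVAL_MODEL_GROUPS"):
        if not env.get(name, "").strip():
            problems.append(f"{name} is required")
    # Assumption: any 'eis/' occurrence means an EIS model group is requested.
    if "eis/" in env.get("EVAL_MODEL_GROUPS", "") and env.get("EVAL_INCLUDE_EIS_MODELS") != "1":
        problems.append("EVAL_INCLUDE_EIS_MODELS=1 is needed for eis/* model groups")
    return problems


EXAMPLE = """
EVAL_SUITE_ID=agent-builder
EVAL_MODEL_GROUPS=eis/openai-gpt-5.4
EVAL_INCLUDE_EIS_MODELS=1
"""

print(validate_on_demand_env(parse_env_lines(EXAMPLE)))
```

The check deliberately ignores `EVAL_SERVER_CONFIG_SET`, because whether it is needed depends on the suite entry in `evals.suites.json`, which this sketch does not read.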
@@ -240,6 +271,7 @@ The `models:*` and `models:judge:*` labels are automatically synced from LiteLLM
 - **On demand**: Add the `ci:sync-model-labels` label to any PR to trigger label sync in PR CI.
 
 The sync step:
+
 1. Discovers available models from both **LiteLLM** (`GET /v1/models`) and **EIS** (via `discover_eis_models.js`)
 2. Creates/updates labels for all discovered models
 3. Marks stale labels as deprecated (renamed from `models:*` to `deprecated:models:*`)
@@ -384,7 +416,7 @@ node scripts/evals dataplex sync --dry-run
 node scripts/evals dataplex sync --print-commands
 ```
 
-Note: The **Dataplex "Aspect types"** console page lists *schemas*. Snapshot datasets themselves show up under Dataplex **Entries**.
+Note: The **Dataplex "Aspect types"** console page lists _schemas_. Snapshot datasets themselves show up under Dataplex **Entries**.
 
 <details>
 <summary>Manual flow (if you prefer full control)</summary>
@@ -667,11 +699,11 @@ This creates a dedicated `evaluationsEsClient` that connects to your evaluations
 
 Use these settings when traces, evaluation results, or managed datasets live outside the default Scout Kibana/Elasticsearch pair.
 
-| Variable | CLI flag | Purpose |
-| --- | --- | --- |
-| `TRACING_ES_URL` | `--trace-es-url` | Sends trace-based evaluator queries to a separate monitoring Elasticsearch cluster. |
-| `EVALUATIONS_ES_URL` | `--evaluations-es-url` | Exports evaluation scores to a separate Elasticsearch cluster. |
-| `EVALUATIONS_KBN_URL` | `--evaluations-kbn-url` | Routes dataset upsert and dataset lookup operations to a separate Kibana instance. |
+| Variable                  | CLI flag                    | Purpose                                                                                |
+| ------------------------- | --------------------------- | -------------------------------------------------------------------------------------- |
+| `TRACING_ES_URL`          | `--trace-es-url`            | Sends trace-based evaluator queries to a separate monitoring Elasticsearch cluster.    |
+| `EVALUATIONS_ES_URL`      | `--evaluations-es-url`      | Exports evaluation scores to a separate Elasticsearch cluster.                         |
+| `EVALUATIONS_KBN_URL`     | `--evaluations-kbn-url`     | Routes dataset upsert and dataset lookup operations to a separate Kibana instance.     |
 | `EVALUATIONS_KBN_API_KEY` | `--evaluations-kbn-api-key` | Optional API key used for dataset Kibana operations when `EVALUATIONS_KBN_URL` is set. |
 
 #### Using a Separate Kibana for Dataset Operations
@@ -765,12 +797,12 @@ This grants:
 
 Copy the returned `encoded` value and use it for all four secret fields in your vault config:
 
-| Config field | Env variable | Value |
-| --- | --- | --- |
-| `evaluationsEs.apiKey` | `EVALUATIONS_ES_API_KEY` | `<encoded>` |
-| `tracingEs.apiKey` | `TRACING_ES_API_KEY` | `<encoded>` |
-| `evaluationsKbn.apiKey` | `EVALUATIONS_KBN_API_KEY` | `<encoded>` |
-| `tracingExporters[0].http.headers.Authorization` | via `TRACING_EXPORTERS` | `ApiKey <encoded>` |
+| Config field                                     | Env variable              | Value              |
+| ------------------------------------------------ | ------------------------- | ------------------ |
+| `evaluationsEs.apiKey`                           | `EVALUATIONS_ES_API_KEY`  | `<encoded>`        |
+| `tracingEs.apiKey`                               | `TRACING_ES_API_KEY`      | `<encoded>`        |
+| `evaluationsKbn.apiKey`                          | `EVALUATIONS_KBN_API_KEY` | `<encoded>`        |
+| `tracingExporters[0].http.headers.Authorization` | via `TRACING_EXPORTERS`   | `ApiKey <encoded>` |
 
 ### Exporting to a separate Elasticsearch cluster
 