[kbn-evals] Add on-demand evals - Buildkite (elastic#269807)

arturoliduena · kibanamachine · web-flow · commit 19abd49f71f7 · 2026-05-21T07:38:11.000+02:00
Closes elastic/obs-ai-team#606

## Summary

Adds a dedicated on-demand Buildkite pipeline (`bk-kibana-evals-on-demand`) so any Elastic org member can run a single eval suite + models on any branch without opening a PR or waiting for the full Kibana PR pipeline. Reuses the existing eval runner (`run_suite.sh`) and golden-cluster result export.

### Problem

Today, eval CI is Buildkite-only and PR-triggered via labels (`evals:*` + `models:*`). Engineers must wait for the full Kibana PR pipeline before eval results are available, and there is no lightweight way to run one suite on an arbitrary branch without that overhead.

#### Trigger flow

`kibana-evals-on-demand` → **New build** → pick branch/commit → set env vars (`EVAL_SUITE_ID`, `EVAL_MODEL_GROUPS`, etc.)

#### Results

Golden cluster evals UI, filter by branch or run id `bk-<buildkite_build_id>`.

#### Access

Everyone has `BUILD_AND_READ` (any org member can start builds); `kibana-operations` and `obs-ai-team` retain `MANAGE_BUILD_AND_READ`.

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
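
#### Example: checking env vars before triggering (sketch)

The following is a minimal, illustrative sketch and not part of this change. It encodes the env-var rules from the README table in this PR (`EVAL_SUITE_ID` and `EVAL_MODEL_GROUPS` are required; `EVAL_INCLUDE_EIS_MODELS=1` is needed for `eis/*` model groups) and builds a request body for Buildkite's REST "create build" endpoint. The helper names `validate_env` and `build_payload` are hypothetical, the `"eis/" in value` check is a loose assumption about how `EVAL_MODEL_GROUPS` is written, and the URL assumes the org and pipeline slugs from the pipeline link. Nothing is sent over the network.

```python
# Assumption: Buildkite REST API "create build" endpoint, with the org and
# pipeline slugs taken from the pipeline link in this change.
CREATE_BUILD_URL = (
    "https://api.buildkite.com/v2/organizations/elastic"
    "/pipelines/kibana-evals-on-demand/builds"
)

# Variables the README table in this PR marks as required.
REQUIRED = ("EVAL_SUITE_ID", "EVAL_MODEL_GROUPS")


def validate_env(env):
    """Return a list of problems with the build-level env vars (empty if OK)."""
    problems = [f"{name} is required" for name in REQUIRED if not env.get(name)]
    # Loose check (assumption): any "eis/" substring counts as an EIS model group.
    uses_eis = "eis/" in env.get("EVAL_MODEL_GROUPS", "")
    if uses_eis and env.get("EVAL_INCLUDE_EIS_MODELS") != "1":
        problems.append("EVAL_INCLUDE_EIS_MODELS=1 is required for eis/* model groups")
    return problems


def build_payload(branch, env, commit="HEAD", message="On-demand eval run"):
    """Build the JSON body for a create-build request; raise if env is invalid."""
    problems = validate_env(env)
    if problems:
        raise ValueError("; ".join(problems))
    return {"commit": commit, "branch": branch, "message": message, "env": dict(env)}
```

A caller would serialize the returned dict as JSON and POST it to `CREATE_BUILD_URL` with an API token that is allowed to create builds; the same variables can instead be pasted one `KEY=value` per line into the **New build** dialog.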
diff --git a/.buildkite/pipeline-resource-definitions/evals/kibana-evals-on-demand.yml b/.buildkite/pipeline-resource-definitions/evals/kibana-evals-on-demand.yml
@@ -0,0 +1,46 @@
+# yaml-language-server: $schema=https://gist.githubusercontent.com/elasticmachine/988b80dae436cafea07d9a4a460a011d/raw/rre.schema.json
+apiVersion: backstage.io/v1alpha1
+kind: Resource
+metadata:
+  name: bk-kibana-evals-on-demand
+  description: 'Runs a single @kbn/evals suite and model on demand'
+  links:
+    - url: 'https://buildkite.com/elastic/kibana-evals-on-demand'
+      title: Pipeline link
+spec:
+  type: buildkite-pipeline
+  owner: 'group:obs-ai-team'
+  system: buildkite
+  implementation:
+    apiVersion: buildkite.elastic.dev/v1
+    kind: Pipeline
+    metadata:
+      name: 'Kibana / Evals / On-demand LLM Evals'
+      description: 'Runs one @kbn/evals suite against one model on a chosen branch (manual Buildkite trigger)'
+    spec:
+      env:
+        KBN_EVALS: '1'
+      allow_rebuilds: true
+      branch_configuration: ''
+      cancel_intermediate_builds: false
+      default_branch: main
+      repository: elastic/kibana
+      pipeline_file: .buildkite/pipelines/evals/on_demand_evals.yml
+      provider_settings:
+        build_branches: false
+        build_pull_requests: false
+        publish_commit_status: false
+        trigger_mode: none
+        prefix_pull_request_fork_branch_names: false
+        skip_pull_request_builds_for_existing_commits: false
+        build_tags: false
+      teams:
+        kibana-operations:
+          access_level: MANAGE_BUILD_AND_READ
+        obs-ai-team:
+          access_level: MANAGE_BUILD_AND_READ
+        everyone:
+          access_level: BUILD_AND_READ
+      tags:
+        - kibana
+        - kbn-evals
diff --git a/.buildkite/pipeline-resource-definitions/locations.yml b/.buildkite/pipeline-resource-definitions/locations.yml
@@ -8,6 +8,7 @@ spec:
   type: url
   targets:
     - https://github.com/elastic/kibana/blob/main/.buildkite/pipeline-resource-definitions/cloud-security-posture/cspm-agentless-scout.yml
+    - https://github.com/elastic/kibana/blob/main/.buildkite/pipeline-resource-definitions/evals/kibana-evals-on-demand.yml
     - https://github.com/elastic/kibana/blob/main/.buildkite/pipeline-resource-definitions/evals/kibana-evals.yml
     - https://github.com/elastic/kibana/blob/main/.buildkite/pipeline-resource-definitions/kibana-agent-builder-smoke-tests-daily.yml
     - https://github.com/elastic/kibana/blob/main/.buildkite/pipeline-resource-definitions/kibana-apis-capacity-testing-daily.yml
diff --git a/.buildkite/pipelines/evals/on_demand_evals.yml b/.buildkite/pipelines/evals/on_demand_evals.yml
@@ -0,0 +1,44 @@
+env:
+  KBN_EVALS: '1'
+steps:
+  - label: '👨‍🔧 Pre-Build'
+    command: .buildkite/scripts/lifecycle/pre_build.sh
+    agents:
+      image: family/kibana-ubuntu-2404
+      imageProject: elastic-images-prod
+      provider: gcp
+      machineType: n2-standard-2
+
+  - wait
+
+  - label: '🧑‍🏭 Build Kibana Distribution'
+    command: .buildkite/scripts/steps/build_kibana.sh
+    agents:
+      image: family/kibana-ubuntu-2404
+      imageProject: elastic-images-prod
+      provider: gcp
+      machineType: n2-standard-8
+    key: build
+    if: "build.env('KIBANA_BUILD_ID') == null || build.env('KIBANA_BUILD_ID') == ''"
+
+  - wait
+
+  - label: 'LLM Evals: On-demand'
+    key: kbn-evals-on-demand
+    command: bash .buildkite/scripts/steps/evals/run_suite.sh
+    env:
+      FTR_EIS_CCM: '1'
+      EVAL_FANOUT: '1'
+    timeout_in_minutes: 120
+    agents:
+      image: family/kibana-ubuntu-2404
+      imageProject: elastic-images-prod
+      provider: gcp
+      machineType: n2-standard-8
+      preemptible: true
+    retry:
+      automatic:
+        - exit_status: '-1'
+          limit: 3
+        - exit_status: '*'
+          limit: 1
diff --git a/x-pack/platform/packages/shared/kbn-evals/README.md b/x-pack/platform/packages/shared/kbn-evals/README.md
@@ -122,6 +122,7 @@ node scripts/evals start --suite agent-builder
 `evals init` walks you through EIS (Cloud Connected Mode) connector discovery or validates existing connectors in `kibana.dev.yml`. It outputs an `export KIBANA_TESTING_AI_CONNECTORS="..."` command to paste into your shell.
 
 `evals start` orchestrates the full stack in one terminal:
+
 1. Starts the EDOT collector (Docker) for trace capture -- exports traces to the configured tracing Elasticsearch cluster (via `TRACING_ES_URL`)
 2. Starts Scout (ES + Kibana with `evals_tracing` config)
 3. Enables EIS CCM on the Scout ES cluster (if using EIS connectors)
@@ -161,6 +162,7 @@ node scripts/evals start --suite attack-discovery --export-profile local
 ```
 
 Notes:
+
 - `--datasets-profile <name>` loads `EVALUATIONS_KBN_URL` / `EVALUATIONS_KBN_API_KEY` from `config.<name>.json`
 - `--export-profile <name>` loads `EVALUATIONS_ES_URL`, `TRACING_ES_URL`, and `TRACING_EXPORTERS` from `config.<name>.json`
 
@@ -210,6 +212,35 @@ The CLI uses suite metadata from:
 .buildkite/pipelines/evals/evals.suites.json
 ```
 
+### On-demand evals (Buildkite)
+
+Run a single suite and model on any branch without opening a PR or waiting for the full Kibana PR pipeline:
+
+1. Open [kibana-evals-on-demand](https://buildkite.com/elastic/kibana-evals-on-demand) on Buildkite
+2. Click **New build**, select the branch (or commit) to evaluate
+3. Under **Environment variables** (in New build options), add the required variables below — one `KEY=value` per line. These are build-level env vars read by `run_suite.sh`.
+
+Pipeline registration: [`.buildkite/pipeline-resource-definitions/evals/kibana-evals-on-demand.yml`](../../../../../.buildkite/pipeline-resource-definitions/evals/kibana-evals-on-demand.yml).
+
+| Variable                  | Required           | Description                                                                                             |
+| ------------------------- | ------------------ | ------------------------------------------------------------------------------------------------------- |
+| `EVAL_SUITE_ID`           | yes                | Suite id from `evals.suites.json`, e.g. `agent-builder`                                                 |
+| `EVAL_MODEL_GROUPS`       | yes                | Model group, e.g. `eis/openai-gpt-5.4`                                                                  |
+| `EVAL_INCLUDE_EIS_MODELS` | for `eis/*` models | Set to `1` when `EVAL_MODEL_GROUPS` uses `eis/...`                                                      |
+| `EVALUATION_CONNECTOR_ID` | no                 | LLM-as-judge connector id override (connector id, not `eis/...` model group)                            |
+| `EVAL_SERVER_CONFIG_SET`  | some suites        | From `serverConfigSet` on the suite entry in `evals.suites.json` (e.g. `evals_endpoint` for `endpoint`) |
+| `KIBANA_BUILD_ID`         | no                 | Reuse a Kibana build from another Buildkite job (skips the build step)                                  |
+
+The eval pipeline step sets `FTR_EIS_CCM=1` and `EVAL_FANOUT=1`; `KBN_EVALS=1` is set on the pipeline.
+
+Example environment variables for Agent Builder + one EIS model:
+
+```text
+EVAL_SUITE_ID=agent-builder
+EVAL_MODEL_GROUPS=eis/openai-gpt-5.4
+EVAL_INCLUDE_EIS_MODELS=1
+```
+
 ### CI labels
 
 Eval suites can be triggered in PR CI by adding GitHub labels:
@@ -240,6 +271,7 @@ The `models:*` and `models:judge:*` labels are automatically synced from LiteLLM
 - **On demand**: Add the `ci:sync-model-labels` label to any PR to trigger label sync in PR CI.
 
 The sync step:
+
 1. Discovers available models from both **LiteLLM** (`GET /v1/models`) and **EIS** (via `discover_eis_models.js`)
 2. Creates/updates labels for all discovered models
 3. Marks stale labels as deprecated (renamed from `models:*` to `deprecated:models:*`)
@@ -384,7 +416,7 @@ node scripts/evals dataplex sync --dry-run
 node scripts/evals dataplex sync --print-commands
 ```
 
-Note: The **Dataplex "Aspect types"** console page lists *schemas*. Snapshot datasets themselves show up under Dataplex **Entries**.
+Note: The **Dataplex "Aspect types"** console page lists _schemas_. Snapshot datasets themselves show up under Dataplex **Entries**.
 
 <details>
 <summary>Manual flow (if you prefer full control)</summary>
@@ -667,11 +699,11 @@ This creates a dedicated `evaluationsEsClient` that connects to your evaluations
 
 Use these settings when traces, evaluation results, or managed datasets live outside the default Scout Kibana/Elasticsearch pair.
 
-| Variable | CLI flag | Purpose |
-| --- | --- | --- |
-| `TRACING_ES_URL` | `--trace-es-url` | Sends trace-based evaluator queries to a separate monitoring Elasticsearch cluster. |
-| `EVALUATIONS_ES_URL` | `--evaluations-es-url` | Exports evaluation scores to a separate Elasticsearch cluster. |
-| `EVALUATIONS_KBN_URL` | `--evaluations-kbn-url` | Routes dataset upsert and dataset lookup operations to a separate Kibana instance. |
+| Variable                  | CLI flag                    | Purpose                                                                                |
+| ------------------------- | --------------------------- | -------------------------------------------------------------------------------------- |
+| `TRACING_ES_URL`          | `--trace-es-url`            | Sends trace-based evaluator queries to a separate monitoring Elasticsearch cluster.    |
+| `EVALUATIONS_ES_URL`      | `--evaluations-es-url`      | Exports evaluation scores to a separate Elasticsearch cluster.                         |
+| `EVALUATIONS_KBN_URL`     | `--evaluations-kbn-url`     | Routes dataset upsert and dataset lookup operations to a separate Kibana instance.     |
 | `EVALUATIONS_KBN_API_KEY` | `--evaluations-kbn-api-key` | Optional API key used for dataset Kibana operations when `EVALUATIONS_KBN_URL` is set. |
 
 #### Using a Separate Kibana for Dataset Operations
@@ -765,12 +797,12 @@ This grants:
 
 Copy the returned `encoded` value and use it for all four secret fields in your vault config:
 
-| Config field | Env variable | Value |
-| --- | --- | --- |
-| `evaluationsEs.apiKey` | `EVALUATIONS_ES_API_KEY` | `<encoded>` |
-| `tracingEs.apiKey` | `TRACING_ES_API_KEY` | `<encoded>` |
-| `evaluationsKbn.apiKey` | `EVALUATIONS_KBN_API_KEY` | `<encoded>` |
-| `tracingExporters[0].http.headers.Authorization` | via `TRACING_EXPORTERS` | `ApiKey <encoded>` |
+| Config field                                     | Env variable              | Value              |
+| ------------------------------------------------ | ------------------------- | ------------------ |
+| `evaluationsEs.apiKey`                           | `EVALUATIONS_ES_API_KEY`  | `<encoded>`        |
+| `tracingEs.apiKey`                               | `TRACING_ES_API_KEY`      | `<encoded>`        |
+| `evaluationsKbn.apiKey`                          | `EVALUATIONS_KBN_API_KEY` | `<encoded>`        |
+| `tracingExporters[0].http.headers.Authorization` | via `TRACING_EXPORTERS`   | `ApiKey <encoded>` |
 
 ### Exporting to a separate Elasticsearch cluster