[AlertingV2] Per-tick dispatcher telemetry, event watermark, and ES|QL IN-clause chunking by kdelemme · Pull Request #271990 · elastic/kibana

kdelemme · 2026-05-29T19:29:37Z

Summary

Telemetry submodule: new dispatcher/telemetry/ package with a structured tick-summary log (dispatcher tick complete) emitted after every run, per-stage timings on a shared monotonic clock, and an APM span per step. Step exceptions are caught and converted to step_error halts so the summary is always emitted — including on crashes.
Event watermark + bounded window: fetch_episodes now maintains a per-tick watermark so each run queries only new events (gte last observed timestamp). TICK_LOOKBACK_CAP_MINUTES (1 min) bounds the query window to keep ES|QL row counts under the sub-plan buffer limit; SETTLE_BUFFER_SECONDS (5 s) absorbs refresh-lag. The watermark is persisted in Task Manager state and only advances on clean outcomes — failed ticks re-read their window.
Parallel fetch_episodes ∥ fetch_policies: the two I/O-bound steps are now run concurrently in a parallel group, removing fetch_policies p95 latency (~158 ms) from the serial tick budget. The equivalence test proves the regrouping preserves observable behavior.
ES|QL IN-clause chunking: fetch_suppressions and apply_throttling split their IN (…) literals into ≤ 600 KB chunks to stay under the ES|QL 1 MB statement cap. Without this a single high-cardinality tick would stall the dispatcher permanently (watermark is not advanced on step_error).

Test plan

node scripts/jest x-pack/platform/plugins/shared/alerting_v2/server/lib/dispatcher/
node scripts/jest_integration x-pack/platform/plugins/shared/alerting_v2/server/lib/dispatcher/integration_tests/
Verify dispatcher tick complete log appears in Kibana server logs after a dispatcher tick
Verify kibana.alerting_v2.dispatcher.tick fields are present and correct in structured log output

🤖 Generated with Claude Code

Adds `dispatcher/telemetry/` with: - `types.ts` — `DispatcherTickSummary`, `DispatcherStageTiming`, count schema - `clock.ts` — monotonic `startHrtime`/`elapsedMs` helpers (immune to NTP jumps) - `tick_summary.ts` — `buildTickSummary`/`emitTickSummary` structured-log builder - `state_counts.ts` — derive per-stage count snapshot from `DispatcherPipelineState` - `stage_error.ts` — typed `StageError` wrapper for pipeline step exceptions Also updates `with_dispatcher_span.ts` to drop the direct `elastic-apm-node` import and use the Kibana APM shim instead, and extends `LoggerService` with the structured error/warn overloads needed by `emitTickSummary`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- `DispatcherService.run` now emits a structured 'dispatcher tick complete' info log after every tick and always emits on pipeline-level throws (so outages are visible even when the pipeline itself explodes). - `DispatcherPipeline` catches step exceptions and converts them to `step_error` halts rather than propagating raw throws, giving the tick summary a chance to record stage timings up to the failure. - Tick and per-stage durations use a shared monotonic clock so operators can reason about them together (sum-of-stages ≤ tick total). - `DispatcherHaltReason` splits into `DispatcherStepHaltReason` (safe, step-controlled) and the superset `DispatcherHaltReason` (adds `step_error`), preventing steps from accidentally returning it. - `DispatcherExecutionResult` now carries `tick: DispatcherTickSummary` and `nextEventWatermark` for the task runner to persist. - `execution_pipeline` gains opt-in parallel group support; parallel groups are an optimization, not a default. - `DISPATCHER_TASK_TYPE`/`DISPATCHER_TASK_ID` moved to `constants.ts` so test helpers can import them without pulling in inversify decorators. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…_policies concurrently fetch_episodes now maintains a per-tick watermark so each run queries only the events since the last processed timestamp rather than re-scanning the full lookback window: - `TICK_LOOKBACK_CAP_MINUTES` (1 min) caps the per-tick query window so a single tick never buffers more rows than the ES|QL sub-plan limit. After an outage the dispatcher drains the backlog at one cap per tick. - `SETTLE_BUFFER_SECONDS` (5 s) stops the window just short of `now` to absorb ES refresh-interval lag and clock skew. - Watermark is anchored to the last observed episode timestamp, not the query boundary, so the window never skips an event that arrived late. - `DispatcherPipelineInput` gains `eventWatermark?: Date`; task_runner reads/writes it from task state so it survives restarts. `fetch_episodes` and `fetch_policies` are now run concurrently in a parallel group (verified safe via the equivalence test), pulling `fetch_policies` off the critical path (~158 ms p95 in benchmarks). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…fer limit ES|QL's INLINE STATS sub-plan buffers the full IN-list in memory. Beyond ~16.8 MB the query fails. Under high cardinality (many distinct rule/group combinations) the (rule_id, group_hash) IN-list passed to fetch_suppressions and the action_group_id IN-list passed to apply_throttling can exceed this threshold. Both steps now split their IN-lists into chunks (each ≤ the per-literal budget) and issue one ES|QL query per chunk, merging the results in memory before continuing. The per-literal budget accounts for quoted-string overhead + commas, not just raw byte length, matching ES|QL's actual tokenisation. Also drops `_source` from the fetch_episodes query after JSON_EXTRACT; the field is redundant once the EVAL is in place and was adding unnecessary network transfer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- tick_summary.ts: use stages.at(-1)?.counts ?? ZERO_COUNTS instead of manual length-guard + index arithmetic - state_counts.ts: replace Object.fromEntries(entries.map(…)) in toSpanLabels with a direct property-assignment loop (no intermediate arrays) - queries.ts: fuse the two alertEpisodes traversals in getAlertEpisodeSuppressionsQueries into a single for...of loop that builds minLastEventTimestamp and the uniquePairKeySet together - types.ts: export CLEAN_HALT_REASONS next to DispatcherHaltReason so the watermark-advancement policy lives at the type boundary - dispatcher.ts: use CLEAN_HALT_REASONS.has() in extractAdvanceableWatermark; adding a new controlled halt reason now requires only one change Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

infra-vault-gh-plugin-prod · 2026-05-29T19:29:51Z

🤖 Jobs for this PR can be triggered through checkboxes. 🚧

ℹ️ To trigger the CI, please tick the checkbox below 👇

Click to trigger kibana-pull-request for this PR!
Click to trigger kibana-deploy-project-from-pr for this PR!
Click to trigger kibana-deploy-cloud-from-pr for this PR!
Click to trigger kibana-entity-store-performance-from-pr for this PR!
Click to trigger kibana-storybooks-from-pr for this PR!

…ling policy lookup Two critical issues found during review: 1. ApplyMaintenanceWindowStep was imported but never bound to DispatcherExecutionStepsToken after the parallel-group refactor silently dropped it. Episodes during active maintenance windows were no longer suppressed and would fire when they should be silent. Re-bound as a serial entry between FetchRulesStep and EvaluateMatchersStep (reads dispatchable + rules, which are both ready at that point). Updated dispatcher.test.ts and integration_tests/dispatcher.test.ts to include the step in the manually-constructed pipeline, and updated the README diagram and step table. 2. applyThrottling used a non-null assertion (policies.get(group.policyId)!) without a guard. A policy deleted between FetchPoliciesStep and ApplyThrottlingStep would cause the step to throw, produce a step_error halt, and permanently stall the event watermark (watermark not advanced on step_error). Replaced with an explicit guard that skips the group. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…test The Jest integration test at integration_tests/dispatcher.test.ts imported helpers (./helpers/wait, ./setup_test_servers) that were never created, making it non-compilable and unable to run in CI. All 12 test scenarios it covered are fully subsumed by the Scout API spec at test/scout_alerting_v2/api/tests/dispatcher.spec.ts, which is functional, runs against a real deployment, and has 21 test cases including maintenance window suppression, single-rule policy, multi- policy dispatch, lookback window, snoozed policy, missing destination, and throttled-policy suppress records — scenarios not present in the Jest test. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kdelemme and others added 5 commits May 29, 2026 15:13

github-actions Bot added the author:actionable-obs PRs authored by the actionable obs team label May 29, 2026

kdelemme and others added 2 commits May 29, 2026 16:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AlertingV2] Per-tick dispatcher telemetry, event watermark, and ES|QL IN-clause chunking#271990

[AlertingV2] Per-tick dispatcher telemetry, event watermark, and ES|QL IN-clause chunking#271990
kdelemme wants to merge 7 commits into
elastic:mainfrom
kdelemme:alerting-v2/dispatcher-per-tick-telemetry-clean

kdelemme commented May 29, 2026

Uh oh!

infra-vault-gh-plugin-prod Bot commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kdelemme commented May 29, 2026

Summary

Test plan

Uh oh!

infra-vault-gh-plugin-prod Bot commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant