Skip to content

[AlertingV2] Per-tick dispatcher telemetry, event watermark, and ES|QL IN-clause chunking#271990

Draft
kdelemme wants to merge 7 commits into
elastic:mainfrom
kdelemme:alerting-v2/dispatcher-per-tick-telemetry-clean
Draft

[AlertingV2] Per-tick dispatcher telemetry, event watermark, and ES|QL IN-clause chunking#271990
kdelemme wants to merge 7 commits into
elastic:mainfrom
kdelemme:alerting-v2/dispatcher-per-tick-telemetry-clean

Conversation

@kdelemme
Copy link
Copy Markdown
Contributor

Summary

  • Telemetry submodule: new dispatcher/telemetry/ package with a structured tick-summary log (dispatcher tick complete) emitted after every run, per-stage timings on a shared monotonic clock, and an APM span per step. Step exceptions are caught and converted to step_error halts so the summary is always emitted — including on crashes.
  • Event watermark + bounded window: fetch_episodes now maintains a per-tick watermark so each run queries only new events (gte last observed timestamp). TICK_LOOKBACK_CAP_MINUTES (1 min) bounds the query window to keep ES|QL row counts under the sub-plan buffer limit; SETTLE_BUFFER_SECONDS (5 s) absorbs refresh-lag. The watermark is persisted in Task Manager state and only advances on clean outcomes — failed ticks re-read their window.
  • Parallel fetch_episodes ∥ fetch_policies: the two I/O-bound steps are now run concurrently in a parallel group, removing fetch_policies p95 latency (~158 ms) from the serial tick budget. The equivalence test proves the regrouping preserves observable behavior.
  • ES|QL IN-clause chunking: fetch_suppressions and apply_throttling split their IN (…) literals into ≤ 600 KB chunks to stay under the ES|QL 1 MB statement cap. Without this a single high-cardinality tick would stall the dispatcher permanently (watermark is not advanced on step_error).

Test plan

  • node scripts/jest x-pack/platform/plugins/shared/alerting_v2/server/lib/dispatcher/
  • node scripts/jest_integration x-pack/platform/plugins/shared/alerting_v2/server/lib/dispatcher/integration_tests/
  • Verify dispatcher tick complete log appears in Kibana server logs after a dispatcher tick
  • Verify kibana.alerting_v2.dispatcher.tick fields are present and correct in structured log output

🤖 Generated with Claude Code

kdelemme and others added 5 commits May 29, 2026 15:13
Adds `dispatcher/telemetry/` with:
- `types.ts` — `DispatcherTickSummary`, `DispatcherStageTiming`, count schema
- `clock.ts` — monotonic `startHrtime`/`elapsedMs` helpers (immune to NTP jumps)
- `tick_summary.ts` — `buildTickSummary`/`emitTickSummary` structured-log builder
- `state_counts.ts` — derive per-stage count snapshot from `DispatcherPipelineState`
- `stage_error.ts` — typed `StageError` wrapper for pipeline step exceptions

Also updates `with_dispatcher_span.ts` to drop the direct `elastic-apm-node`
import and use the Kibana APM shim instead, and extends `LoggerService` with
the structured error/warn overloads needed by `emitTickSummary`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- `DispatcherService.run` now emits a structured 'dispatcher tick complete'
  info log after every tick and always emits on pipeline-level throws
  (so outages are visible even when the pipeline itself explodes).
- `DispatcherPipeline` catches step exceptions and converts them to
  `step_error` halts rather than propagating raw throws, giving the
  tick summary a chance to record stage timings up to the failure.
- Tick and per-stage durations use a shared monotonic clock so operators
  can reason about them together (sum-of-stages ≤ tick total).
- `DispatcherHaltReason` splits into `DispatcherStepHaltReason` (safe,
  step-controlled) and the superset `DispatcherHaltReason` (adds
  `step_error`), preventing steps from accidentally returning it.
- `DispatcherExecutionResult` now carries `tick: DispatcherTickSummary`
  and `nextEventWatermark` for the task runner to persist.
- `execution_pipeline` gains opt-in parallel group support; parallel
  groups are an optimization, not a default.
- `DISPATCHER_TASK_TYPE`/`DISPATCHER_TASK_ID` moved to `constants.ts`
  so test helpers can import them without pulling in inversify decorators.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…_policies concurrently

fetch_episodes now maintains a per-tick watermark so each run queries only
the events since the last processed timestamp rather than re-scanning the
full lookback window:

- `TICK_LOOKBACK_CAP_MINUTES` (1 min) caps the per-tick query window so a
  single tick never buffers more rows than the ES|QL sub-plan limit.
  After an outage the dispatcher drains the backlog at one cap per tick.
- `SETTLE_BUFFER_SECONDS` (5 s) stops the window just short of `now` to
  absorb ES refresh-interval lag and clock skew.
- Watermark is anchored to the last observed episode timestamp, not the
  query boundary, so the window never skips an event that arrived late.
- `DispatcherPipelineInput` gains `eventWatermark?: Date`; task_runner
  reads/writes it from task state so it survives restarts.

`fetch_episodes` and `fetch_policies` are now run concurrently in a
parallel group (verified safe via the equivalence test), pulling
`fetch_policies` off the critical path (~158 ms p95 in benchmarks).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fer limit

ES|QL's INLINE STATS sub-plan buffers the full IN-list in memory. Beyond
~16.8 MB the query fails. Under high cardinality (many distinct rule/group
combinations) the (rule_id, group_hash) IN-list passed to
fetch_suppressions and the action_group_id IN-list passed to
apply_throttling can exceed this threshold.

Both steps now split their IN-lists into chunks (each ≤ the per-literal
budget) and issue one ES|QL query per chunk, merging the results in memory
before continuing. The per-literal budget accounts for quoted-string
overhead + commas, not just raw byte length, matching ES|QL's actual
tokenisation.

Also drops `_source` from the fetch_episodes query after JSON_EXTRACT;
the field is redundant once the EVAL is in place and was adding
unnecessary network transfer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tick_summary.ts: use stages.at(-1)?.counts ?? ZERO_COUNTS instead of
  manual length-guard + index arithmetic
- state_counts.ts: replace Object.fromEntries(entries.map(…)) in
  toSpanLabels with a direct property-assignment loop (no intermediate
  arrays)
- queries.ts: fuse the two alertEpisodes traversals in
  getAlertEpisodeSuppressionsQueries into a single for...of loop that
  builds minLastEventTimestamp and the uniquePairKeySet together
- types.ts: export CLEAN_HALT_REASONS next to DispatcherHaltReason so
  the watermark-advancement policy lives at the type boundary
- dispatcher.ts: use CLEAN_HALT_REASONS.has() in
  extractAdvanceableWatermark; adding a new controlled halt reason now
  requires only one change

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added the author:actionable-obs PRs authored by the actionable obs team label May 29, 2026
@infra-vault-gh-plugin-prod
Copy link
Copy Markdown

🤖 Jobs for this PR can be triggered through checkboxes. 🚧

ℹ️ To trigger the CI, please tick the checkbox below 👇

  • Click to trigger kibana-pull-request for this PR!
  • Click to trigger kibana-deploy-project-from-pr for this PR!
  • Click to trigger kibana-deploy-cloud-from-pr for this PR!
  • Click to trigger kibana-entity-store-performance-from-pr for this PR!
  • Click to trigger kibana-storybooks-from-pr for this PR!

kdelemme and others added 2 commits May 29, 2026 16:17
…ling policy lookup

Two critical issues found during review:

1. ApplyMaintenanceWindowStep was imported but never bound to
   DispatcherExecutionStepsToken after the parallel-group refactor silently
   dropped it. Episodes during active maintenance windows were no longer
   suppressed and would fire when they should be silent. Re-bound as a
   serial entry between FetchRulesStep and EvaluateMatchersStep (reads
   dispatchable + rules, which are both ready at that point). Updated
   dispatcher.test.ts and integration_tests/dispatcher.test.ts to include
   the step in the manually-constructed pipeline, and updated the README
   diagram and step table.

2. applyThrottling used a non-null assertion (policies.get(group.policyId)!)
   without a guard. A policy deleted between FetchPoliciesStep and
   ApplyThrottlingStep would cause the step to throw, produce a step_error
   halt, and permanently stall the event watermark (watermark not advanced
   on step_error). Replaced with an explicit guard that skips the group.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…test

The Jest integration test at integration_tests/dispatcher.test.ts
imported helpers (./helpers/wait, ./setup_test_servers) that were never
created, making it non-compilable and unable to run in CI.

All 12 test scenarios it covered are fully subsumed by the Scout API
spec at test/scout_alerting_v2/api/tests/dispatcher.spec.ts, which is
functional, runs against a real deployment, and has 21 test cases
including maintenance window suppression, single-rule policy, multi-
policy dispatch, lookback window, snoozed policy, missing destination,
and throttled-policy suppress records — scenarios not present in the
Jest test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

author:actionable-obs PRs authored by the actionable obs team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant