[AlertingV2] Per-tick dispatcher telemetry, event watermark, and ES|QL IN-clause chunking#271990
Draft
kdelemme wants to merge 7 commits into
Draft
[AlertingV2] Per-tick dispatcher telemetry, event watermark, and ES|QL IN-clause chunking#271990kdelemme wants to merge 7 commits into
kdelemme wants to merge 7 commits into
Conversation
Adds `dispatcher/telemetry/` with: - `types.ts` — `DispatcherTickSummary`, `DispatcherStageTiming`, count schema - `clock.ts` — monotonic `startHrtime`/`elapsedMs` helpers (immune to NTP jumps) - `tick_summary.ts` — `buildTickSummary`/`emitTickSummary` structured-log builder - `state_counts.ts` — derive per-stage count snapshot from `DispatcherPipelineState` - `stage_error.ts` — typed `StageError` wrapper for pipeline step exceptions Also updates `with_dispatcher_span.ts` to drop the direct `elastic-apm-node` import and use the Kibana APM shim instead, and extends `LoggerService` with the structured error/warn overloads needed by `emitTickSummary`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- `DispatcherService.run` now emits a structured 'dispatcher tick complete' info log after every tick and always emits on pipeline-level throws (so outages are visible even when the pipeline itself explodes). - `DispatcherPipeline` catches step exceptions and converts them to `step_error` halts rather than propagating raw throws, giving the tick summary a chance to record stage timings up to the failure. - Tick and per-stage durations use a shared monotonic clock so operators can reason about them together (sum-of-stages ≤ tick total). - `DispatcherHaltReason` splits into `DispatcherStepHaltReason` (safe, step-controlled) and the superset `DispatcherHaltReason` (adds `step_error`), preventing steps from accidentally returning it. - `DispatcherExecutionResult` now carries `tick: DispatcherTickSummary` and `nextEventWatermark` for the task runner to persist. - `execution_pipeline` gains opt-in parallel group support; parallel groups are an optimization, not a default. - `DISPATCHER_TASK_TYPE`/`DISPATCHER_TASK_ID` moved to `constants.ts` so test helpers can import them without pulling in inversify decorators. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…_policies concurrently fetch_episodes now maintains a per-tick watermark so each run queries only the events since the last processed timestamp rather than re-scanning the full lookback window: - `TICK_LOOKBACK_CAP_MINUTES` (1 min) caps the per-tick query window so a single tick never buffers more rows than the ES|QL sub-plan limit. After an outage the dispatcher drains the backlog at one cap per tick. - `SETTLE_BUFFER_SECONDS` (5 s) stops the window just short of `now` to absorb ES refresh-interval lag and clock skew. - Watermark is anchored to the last observed episode timestamp, not the query boundary, so the window never skips an event that arrived late. - `DispatcherPipelineInput` gains `eventWatermark?: Date`; task_runner reads/writes it from task state so it survives restarts. `fetch_episodes` and `fetch_policies` are now run concurrently in a parallel group (verified safe via the equivalence test), pulling `fetch_policies` off the critical path (~158 ms p95 in benchmarks). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fer limit ES|QL's INLINE STATS sub-plan buffers the full IN-list in memory. Beyond ~16.8 MB the query fails. Under high cardinality (many distinct rule/group combinations) the (rule_id, group_hash) IN-list passed to fetch_suppressions and the action_group_id IN-list passed to apply_throttling can exceed this threshold. Both steps now split their IN-lists into chunks (each ≤ the per-literal budget) and issue one ES|QL query per chunk, merging the results in memory before continuing. The per-literal budget accounts for quoted-string overhead + commas, not just raw byte length, matching ES|QL's actual tokenisation. Also drops `_source` from the fetch_episodes query after JSON_EXTRACT; the field is redundant once the EVAL is in place and was adding unnecessary network transfer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tick_summary.ts: use stages.at(-1)?.counts ?? ZERO_COUNTS instead of manual length-guard + index arithmetic - state_counts.ts: replace Object.fromEntries(entries.map(…)) in toSpanLabels with a direct property-assignment loop (no intermediate arrays) - queries.ts: fuse the two alertEpisodes traversals in getAlertEpisodeSuppressionsQueries into a single for...of loop that builds minLastEventTimestamp and the uniquePairKeySet together - types.ts: export CLEAN_HALT_REASONS next to DispatcherHaltReason so the watermark-advancement policy lives at the type boundary - dispatcher.ts: use CLEAN_HALT_REASONS.has() in extractAdvanceableWatermark; adding a new controlled halt reason now requires only one change Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
🤖 Jobs for this PR can be triggered through checkboxes. 🚧
ℹ️ To trigger the CI, please tick the checkbox below 👇
|
…ling policy lookup Two critical issues found during review: 1. ApplyMaintenanceWindowStep was imported but never bound to DispatcherExecutionStepsToken after the parallel-group refactor silently dropped it. Episodes during active maintenance windows were no longer suppressed and would fire when they should be silent. Re-bound as a serial entry between FetchRulesStep and EvaluateMatchersStep (reads dispatchable + rules, which are both ready at that point). Updated dispatcher.test.ts and integration_tests/dispatcher.test.ts to include the step in the manually-constructed pipeline, and updated the README diagram and step table. 2. applyThrottling used a non-null assertion (policies.get(group.policyId)!) without a guard. A policy deleted between FetchPoliciesStep and ApplyThrottlingStep would cause the step to throw, produce a step_error halt, and permanently stall the event watermark (watermark not advanced on step_error). Replaced with an explicit guard that skips the group. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…test The Jest integration test at integration_tests/dispatcher.test.ts imported helpers (./helpers/wait, ./setup_test_servers) that were never created, making it non-compilable and unable to run in CI. All 12 test scenarios it covered are fully subsumed by the Scout API spec at test/scout_alerting_v2/api/tests/dispatcher.spec.ts, which is functional, runs against a real deployment, and has 21 test cases including maintenance window suppression, single-rule policy, multi- policy dispatch, lookback window, snoozed policy, missing destination, and throttled-policy suppress records — scenarios not present in the Jest test. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
dispatcher/telemetry/package with a structured tick-summary log (dispatcher tick complete) emitted after every run, per-stage timings on a shared monotonic clock, and an APM span per step. Step exceptions are caught and converted tostep_errorhalts so the summary is always emitted — including on crashes.fetch_episodesnow maintains a per-tick watermark so each run queries only new events (gtelast observed timestamp).TICK_LOOKBACK_CAP_MINUTES(1 min) bounds the query window to keep ES|QL row counts under the sub-plan buffer limit;SETTLE_BUFFER_SECONDS(5 s) absorbs refresh-lag. The watermark is persisted in Task Manager state and only advances on clean outcomes — failed ticks re-read their window.fetch_episodes ∥ fetch_policies: the two I/O-bound steps are now run concurrently in a parallel group, removingfetch_policiesp95 latency (~158 ms) from the serial tick budget. The equivalence test proves the regrouping preserves observable behavior.fetch_suppressionsandapply_throttlingsplit theirIN (…)literals into ≤ 600 KB chunks to stay under the ES|QL 1 MB statement cap. Without this a single high-cardinality tick would stall the dispatcher permanently (watermark is not advanced onstep_error).Test plan
node scripts/jest x-pack/platform/plugins/shared/alerting_v2/server/lib/dispatcher/node scripts/jest_integration x-pack/platform/plugins/shared/alerting_v2/server/lib/dispatcher/integration_tests/dispatcher tick completelog appears in Kibana server logs after a dispatcher tickkibana.alerting_v2.dispatcher.tickfields are present and correct in structured log output🤖 Generated with Claude Code