This is the main architecture document for the alerting_v2 plugin. Read this file first if you want to understand the big picture before diving into one subsystem.
The goal of the plugin is simple:
- Let users define ES|QL-based rules and notification policies.
- Evaluate rules on a schedule and persist immutable rule events.
- Derive alert lifecycle state for alert-type rules.
- Dispatch notifications asynchronously based on policies and action history.
- Expose APIs and UI for managing rules, alert actions, episodes, and notification policies.
- You are new to the plugin and need a mental model.
- You want to know which subsystem owns a behavior.
- You need to decide where a change belongs before editing code.
If you already know the high-level flow and want subsystem detail, jump to:
| Concern | Document |
|---|---|
| Rule execution pipeline, middleware, streaming steps | lib/rule_executor/README.md |
| Episode lifecycle and transition strategies | lib/director/README.md |
| Notification matching, grouping, throttling, dispatch | lib/dispatcher/README.md |
| Data streams, mappings, and ES|QL views | resources/README.md |
The plugin is easiest to understand as five cooperating layers:
| Layer | What it owns | Main code |
|---|---|---|
| Control plane | User-facing APIs, saved objects, privileges, UI, startup wiring | public/, server/routes/, server/saved_objects/, server/setup/ |
| Evaluation plane | Running rules and turning query results into rule events | lib/rule_executor/ |
| Lifecycle plane | Turning alert rule events into episode state transitions | lib/director/ |
| Delivery plane | Turning episodes into notification work and recording outcomes | lib/dispatcher/ |
| Persistence plane | Data streams and ES|QL views used by the other layers | resources/ |
Those layers deliberately do different jobs:
- The rule executor answers "what happened when this rule ran?"
- The director answers "what episode state should this alert series be in now?"
- The dispatcher answers "should anything notify right now, and what should we record about that decision?"
- The resources layer answers "where is all of that stored, and what is the durable document shape?"
A rule is a saved object that defines:
- what ES|QL to run
- how often to run it
- how results are grouped into a stable series identity
- whether it behaves as a
signalrule or analertrule - optional recovery and no-data behavior
- optional alert lifecycle thresholds for
alertrules
Rules are persisted under saved_objects/ and managed through routes/ plus lib/rules_client/.
A rule event is the immutable output document for one evaluated row or derived outcome from one rule run. Rule events are written to .rule-events and never updated in place.
Each event has:
type:signaloralertstatus:breached,recovered, orno_datagroup_hash: the series identifier inside that ruledata: the flattened ES|QL row payloadepisode.*fields for persisted alert-type rules
signal events stop here. Persisted alert events continue into lifecycle and notification processing and are expected to carry episode.* after DirectorStep enriches them.
A series is the per-rule stream of events sharing the same group_hash. It is the stable identity used to compare runs, detect recoveries, scope suppressions, and correlate lifecycle state over time.
An episode is the lifecycle container for an alert series. Episodes exist only for type: alert events.
- A single series can have many episodes over time.
- A single episode covers one breach-to-recovery lifecycle.
- The director assigns
episode.id,episode.status, andepisode.status_count.
Episode statuses are:
| Status | Meaning |
|---|---|
inactive |
The series is not currently active. |
pending |
The series breached but has not yet met activation criteria. |
active |
The series is actively alerting. |
recovering |
The series stopped breaching but has not yet fully returned to inactive. |
A notification policy is a saved object that is evaluated by the dispatcher, not embedded in the rule. Policies define:
- an optional KQL matcher
- grouping behavior
- throttling behavior
- destinations
- optional snooze / API key state
One policy can match episodes from many rules in the same space.
An alert action is an append-only document in .alert-actions describing something that happened to an alert or notification group, for example:
- user actions such as
ack,snooze,activate,deactivate - dispatcher outcomes such as fired, suppressed, throttled, unmatched, or notified
This stream is the dispatcher's durable memory.
The whole plugin revolves around two append-only streams:
.rule-events: what rule execution produced.alert-actions: what users and the dispatcher did with those events
- The UI in
public/lets users manage rules, alerts/episodes, and notification policies. - HTTP routes in
server/routes/expose the same capabilities on the server. - Saved object services and clients persist rules and notification policies.
- Each enabled rule gets its own Task Manager task.
- The task runner hands the rule id, space id, and scheduling metadata to the rule executor pipeline.
- The executor waits for resources to be ready.
- It loads and validates the rule.
- It builds and runs ES|QL.
- It materializes breached events and optional recovery / no-data events.
- For alert rules, it invokes the director to attach episode metadata.
- It writes the final event batch to
.rule-events.
- The director reads the latest known alert state for each
group_hash. - It selects a transition strategy for the rule.
- It calculates the next episode state and episode id for each alert event.
- It does not dispatch notifications and does not schedule work itself.
- The dispatcher runs on its own schedule, separate from per-rule execution.
- It reads candidate episodes from
.rule-events. - It reads suppression / throttle / notification history from
.alert-actions. - It loads rules and enabled notification policies for the relevant space.
- It evaluates policy matchers, builds groups, applies throttling, dispatches eligible groups, and stores outcomes back into
.alert-actions.
- Future rule executions compare against
.rule-eventsfor recovery and lifecycle state. - Future dispatcher runs compare against
.alert-actionsfor suppression and throttling semantics.
┌──────────────────────────────┐
│ Management UI / APIs │
│ public/ + routes/ + clients │
└──────────────┬───────────────┘
│
▼
┌──────────────────────────────┐
│ Saved objects + services │
│ rules / notification │
│ policies / privileges │
└──────────────┬───────────────┘
│
┌────────────────────┴────────────────────┐
▼ ▼
┌──────────────────────────┐ ┌──────────────────────────┐
│ Rule executor task │ │ Dispatcher task │
│ one task per enabled │ │ fixed interval │
│ rule │ │ │
└─────────────┬────────────┘ └─────────────┬────────────┘
│ │
▼ ▼
┌──────────────────────────┐ ┌──────────────────────────┐
│ Rule executor pipeline │ │ Dispatcher pipeline │
│ ES|QL -> rule events │ │ episodes -> groups -> │
│ -> director -> store │ │ dispatch -> action log │
└─────────────┬────────────┘ └─────────────┬────────────┘
│ │
▼ ▼
`.rule-events` `.alert-actions`
▲ ▲
└─────────────── reads and writes ────────┘
These boundaries keep the architecture understandable. If a change crosses one of them, that is a signal to slow down and verify the design.
- Turning a rule run into rule events
- Recovery and no-data semantics
- Streaming large ES|QL result sets through the execution pipeline
- Persisting final events to
.rule-events
- Converting alert events into episode state
- Choosing transition strategies based on rule configuration
- Generating or reusing episode ids
- Matching episodes to notification policies
- Grouping and throttling
- Delivery to destinations
- Recording durable dispatch decisions in
.alert-actions
- It does not re-run the rule query
- It does not infer recoveries
- It does not calculate episode state transitions
- Datastream definitions and mappings
- ES|QL views
- Versioned schema evolution for stored documents
The plugin uses Inversify-based dependency injection and startup hooks:
| Path | Role |
|---|---|
server/index.ts |
Registers setup/start hooks, services, tasks, and subsystem bindings. |
setup/bind_on_setup.ts |
Registers privileges, saved objects, UI settings, telemetry setup, and task definitions. |
setup/bind_on_start.ts |
Initializes Elasticsearch resources and schedules background tasks. |
setup/bind_routes.ts |
Registers HTTP routes. |
setup/bind_services.ts |
Binds shared services, clients, dispatcher enablement, and director strategies. |
setup/bind_rule_executor.ts |
Registers rule executor middleware and steps in execution order. |
setup/bind_dispatcher_executor.ts |
Registers dispatcher steps in execution order. |
setup/bind_tasks.ts |
Registers Task Manager definitions for rule execution, dispatcher, and support tasks. |
| Path | Role |
|---|---|
public/ |
UI applications, hooks, API wrappers, and app mounting. |
lib/rule_executor/ |
Per-rule execution pipeline. |
lib/director/ |
Alert episode state engine. |
lib/dispatcher/ |
Notification pipeline. |
lib/services/ |
Shared ES, storage, logging, retry, user, and saved object services. |
resources/ |
Datastreams and ES|QL views. |
routes/ |
HTTP API surface. |
saved_objects/ |
Rule and notification policy persistence models. |
If your change is about...
- query construction, batching, recovery, or executor middleware: start in
lib/rule_executor/README.md - episode state transitions or new lifecycle semantics: start in
lib/director/README.md - matchers, grouping, throttling, suppression, or destinations: start in
lib/dispatcher/README.md - stored fields, mappings, or ES|QL views: start in
resources/README.md - routes, request validation, or client-facing contracts: inspect
routes/,saved_objects/, and the relevant subsystem together - UI behavior: inspect
public/and confirm whether server route or saved object changes are needed too
- Prefer changing the smallest owning subsystem instead of threading behavior through multiple layers.
- If you add fields to stored documents, update the resources definitions and any readers/writers that depend on them.
- If you add pipeline state, add it to the relevant
types.ts, update the step bindings, and add tests at the step and pipeline level. - If you are tempted to make the dispatcher understand rule execution internals, that logic probably belongs earlier in the rule executor.
- If you are tempted to make the director send notifications, that logic probably belongs later in the dispatcher.