Alerting V2 Architecture

This is the main architecture document for the alerting_v2 plugin. Read this file first if you want to understand the big picture before diving into one subsystem.

The goal of the plugin is simple:

Let users define ES|QL-based rules and notification policies.
Evaluate rules on a schedule and persist immutable rule events.
Derive alert lifecycle state for alert-type rules.
Dispatch notifications asynchronously based on policies and action history.
Expose APIs and UI for managing rules, alert actions, episodes, and notification policies.

Read this document when

You are new to the plugin and need a mental model.
You want to know which subsystem owns a behavior.
You need to decide where a change belongs before editing code.

If you already know the high-level flow and want subsystem detail, jump to:

Concern	Document
Rule execution pipeline, middleware, streaming steps	`lib/rule_executor/README.md`
Episode lifecycle and transition strategies	`lib/director/README.md`
Notification matching, grouping, throttling, dispatch	`lib/dispatcher/README.md`
Data streams, mappings, and ES\|QL views	`resources/README.md`

The mental model

The plugin is easiest to understand as five cooperating layers:

Layer	What it owns	Main code
Control plane	User-facing APIs, saved objects, privileges, UI, startup wiring	`public/`, `server/routes/`, `server/saved_objects/`, `server/setup/`
Evaluation plane	Running rules and turning query results into rule events	`lib/rule_executor/`
Lifecycle plane	Turning alert rule events into episode state transitions	`lib/director/`
Delivery plane	Turning episodes into notification work and recording outcomes	`lib/dispatcher/`
Persistence plane	Data streams and ES\|QL views used by the other layers	`resources/`

Those layers deliberately do different jobs:

The rule executor answers "what happened when this rule ran?"
The director answers "what episode state should this alert series be in now?"
The dispatcher answers "should anything notify right now, and what should we record about that decision?"
The resources layer answers "where is all of that stored, and what is the durable document shape?"

Core domain concepts

Rule

A rule is a saved object that defines:

what ES|QL to run
how often to run it
how results are grouped into a stable series identity
whether it behaves as a signal rule or an alert rule
optional recovery and no-data behavior
optional alert lifecycle thresholds for alert rules

Rules are persisted under saved_objects/ and managed through routes/ plus lib/rules_client/.

Rule event

A rule event is the immutable output document for one evaluated row or derived outcome from one rule run. Rule events are written to .rule-events and never updated in place.

Each event has:

type: signal or alert
status: breached, recovered, or no_data
group_hash: the series identifier inside that rule
data: the flattened ES|QL row payload
episode.* fields for persisted alert-type rules

signal events stop here. Persisted alert events continue into lifecycle and notification processing and are expected to carry episode.* after DirectorStep enriches them.

Series

A series is the per-rule stream of events sharing the same group_hash. It is the stable identity used to compare runs, detect recoveries, scope suppressions, and correlate lifecycle state over time.

Episode

An episode is the lifecycle container for an alert series. Episodes exist only for type: alert events.

A single series can have many episodes over time.
A single episode covers one breach-to-recovery lifecycle.
The director assigns episode.id, episode.status, and episode.status_count.

Episode statuses are:

Status	Meaning
`inactive`	The series is not currently active.
`pending`	The series breached but has not yet met activation criteria.
`active`	The series is actively alerting.
`recovering`	The series stopped breaching but has not yet fully returned to inactive.

Notification policy

A notification policy is a saved object that is evaluated by the dispatcher, not embedded in the rule. Policies define:

an optional KQL matcher
grouping behavior
throttling behavior
destinations
optional snooze / API key state

One policy can match episodes from many rules in the same space.

Alert action

An alert action is an append-only document in .alert-actions describing something that happened to an alert or notification group, for example:

user actions such as ack, snooze, activate, deactivate
dispatcher outcomes such as fired, suppressed, throttled, unmatched, or notified

This stream is the dispatcher's durable memory.

End-to-end flow

The whole plugin revolves around two append-only streams:

.rule-events: what rule execution produced
.alert-actions: what users and the dispatcher did with those events

1. Configuration and management

The UI in public/ lets users manage rules, alerts/episodes, and notification policies.
HTTP routes in server/routes/ expose the same capabilities on the server.
Saved object services and clients persist rules and notification policies.

2. Rule scheduling

Each enabled rule gets its own Task Manager task.
The task runner hands the rule id, space id, and scheduling metadata to the rule executor pipeline.

3. Rule execution

The executor waits for resources to be ready.
It loads and validates the rule.
It builds and runs ES|QL.
It materializes breached events and optional recovery / no-data events.
For alert rules, it invokes the director to attach episode metadata.
It writes the final event batch to .rule-events.

4. Episode lifecycle enrichment

The director reads the latest known alert state for each group_hash.
It selects a transition strategy for the rule.
It calculates the next episode state and episode id for each alert event.
It does not dispatch notifications and does not schedule work itself.

5. Asynchronous notification dispatch

The dispatcher runs on its own schedule, separate from per-rule execution.
It reads candidate episodes from .rule-events.
It reads suppression / throttle / notification history from .alert-actions.
It loads rules and enabled notification policies for the relevant space.
It evaluates policy matchers, builds groups, applies throttling, dispatches eligible groups, and stores outcomes back into .alert-actions.

6. Subsequent runs reuse durable history

Future rule executions compare against .rule-events for recovery and lifecycle state.
Future dispatcher runs compare against .alert-actions for suppression and throttling semantics.

Architecture diagram

                        ┌──────────────────────────────┐
                        │     Management UI / APIs     │
                        │ public/ + routes/ + clients  │
                        └──────────────┬───────────────┘
                                       │
                                       ▼
                        ┌──────────────────────────────┐
                        │   Saved objects + services   │
                        │   rules / notification       │
                        │   policies / privileges      │
                        └──────────────┬───────────────┘
                                       │
                  ┌────────────────────┴────────────────────┐
                  ▼                                         ▼
      ┌──────────────────────────┐              ┌──────────────────────────┐
      │  Rule executor task      │              │    Dispatcher task       │
      │  one task per enabled    │              │    fixed interval        │
      │  rule                    │              │                          │
      └─────────────┬────────────┘              └─────────────┬────────────┘
                    │                                         │
                    ▼                                         ▼
      ┌──────────────────────────┐              ┌──────────────────────────┐
      │ Rule executor pipeline   │              │ Dispatcher pipeline      │
      │ ES|QL -> rule events     │              │ episodes -> groups ->    │
      │ -> director -> store     │              │ dispatch -> action log   │
      └─────────────┬────────────┘              └─────────────┬────────────┘
                    │                                         │
                    ▼                                         ▼
              `.rule-events`                         `.alert-actions`
                    ▲                                         ▲
                    └─────────────── reads and writes ────────┘

Important ownership boundaries

These boundaries keep the architecture understandable. If a change crosses one of them, that is a signal to slow down and verify the design.

What the rule executor owns

Turning a rule run into rule events
Recovery and no-data semantics
Streaming large ES|QL result sets through the execution pipeline
Persisting final events to .rule-events

What the director owns

Converting alert events into episode state
Choosing transition strategies based on rule configuration
Generating or reusing episode ids

What the dispatcher owns

Matching episodes to notification policies
Grouping and throttling
Delivery to destinations
Recording durable dispatch decisions in .alert-actions

What the dispatcher does not own

It does not re-run the rule query
It does not infer recoveries
It does not calculate episode state transitions

What the resources layer owns

Datastream definitions and mappings
ES|QL views
Versioned schema evolution for stored documents

Plugin startup and wiring

The plugin uses Inversify-based dependency injection and startup hooks:

Path	Role
`server/index.ts`	Registers setup/start hooks, services, tasks, and subsystem bindings.
`setup/bind_on_setup.ts`	Registers privileges, saved objects, UI settings, telemetry setup, and task definitions.
`setup/bind_on_start.ts`	Initializes Elasticsearch resources and schedules background tasks.
`setup/bind_routes.ts`	Registers HTTP routes.
`setup/bind_services.ts`	Binds shared services, clients, dispatcher enablement, and director strategies.
`setup/bind_rule_executor.ts`	Registers rule executor middleware and steps in execution order.
`setup/bind_dispatcher_executor.ts`	Registers dispatcher steps in execution order.
`setup/bind_tasks.ts`	Registers Task Manager definitions for rule execution, dispatcher, and support tasks.

Repository map

Path	Role
`public/`	UI applications, hooks, API wrappers, and app mounting.
`lib/rule_executor/`	Per-rule execution pipeline.
`lib/director/`	Alert episode state engine.
`lib/dispatcher/`	Notification pipeline.
`lib/services/`	Shared ES, storage, logging, retry, user, and saved object services.
`resources/`	Datastreams and ES\|QL views.
`routes/`	HTTP API surface.
`saved_objects/`	Rule and notification policy persistence models.

Where to make changes

If your change is about...

query construction, batching, recovery, or executor middleware: start in lib/rule_executor/README.md
episode state transitions or new lifecycle semantics: start in lib/director/README.md
matchers, grouping, throttling, suppression, or destinations: start in lib/dispatcher/README.md
stored fields, mappings, or ES|QL views: start in resources/README.md
routes, request validation, or client-facing contracts: inspect routes/, saved_objects/, and the relevant subsystem together
UI behavior: inspect public/ and confirm whether server route or saved object changes are needed too

Contribution rules of thumb

Prefer changing the smallest owning subsystem instead of threading behavior through multiple layers.
If you add fields to stored documents, update the resources definitions and any readers/writers that depend on them.
If you add pipeline state, add it to the relevant types.ts, update the step bindings, and add tests at the step and pipeline level.
If you are tempted to make the dispatcher understand rule execution internals, that logic probably belongs earlier in the rule executor.
If you are tempted to make the director send notifications, that logic probably belongs later in the dispatcher.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alerting V2 Architecture

Read this document when

The mental model

Core domain concepts

Rule

Rule event

Series

Episode

Notification policy

Alert action

End-to-end flow

1. Configuration and management

2. Rule scheduling

3. Rule execution

4. Episode lifecycle enrichment

5. Asynchronous notification dispatch

6. Subsequent runs reuse durable history

Architecture diagram

Important ownership boundaries

What the rule executor owns

What the director owns

What the dispatcher owns

What the dispatcher does not own

What the resources layer owns

Plugin startup and wiring

Repository map

Where to make changes

Contribution rules of thumb

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Alerting V2 Architecture

Read this document when

The mental model

Core domain concepts

Rule

Rule event

Series

Episode

Notification policy

Alert action

End-to-end flow

1. Configuration and management

2. Rule scheduling

3. Rule execution

4. Episode lifecycle enrichment

5. Asynchronous notification dispatch

6. Subsequent runs reuse durable history

Architecture diagram

Important ownership boundaries

What the rule executor owns

What the director owns

What the dispatcher owns

What the dispatcher does not own

What the resources layer owns

Plugin startup and wiring

Repository map

Where to make changes

Contribution rules of thumb