This document describes the architecture of alert-manager, a Kubernetes operator that enables management of monitoring alerts as custom resources within Kubernetes clusters.
alert-manager follows the Kubernetes operator pattern, watching for custom resources that define alerts and reconciling them with monitoring systems like Wavefront. It enables a GitOps approach to alert management, allowing alerts to be version-controlled and deployed alongside applications.
```mermaid
graph TD
    %% Define styles
    classDef k8s fill:#326ce5,color:white,stroke:white,stroke-width:2px
    classDef wavefront fill:#00ACEE,color:white,stroke:white,stroke-width:2px
    classDef controller fill:#7D559C,color:white,stroke:white,stroke-width:2px
    classDef user fill:#767676,color:white,stroke:white,stroke-width:2px

    %% User interactions
    User([DevOps/User]):::user
    User -->|Creates/Updates| WavefrontAlert[WavefrontAlert CR]:::k8s
    User -->|Creates/Updates| AlertsConfig[AlertsConfig CR]:::k8s

    %% Kubernetes components
    subgraph Kubernetes Cluster
        WavefrontAlert
        AlertsConfig
        APIServer[Kubernetes API Server]:::k8s
        ControllerManager[Alert Manager Controller]:::controller
        Secret[Credentials Secret]:::k8s
        ConfigMap[AlertManager ConfigMap]:::k8s
        WavefrontAlert -->|Submitted to| APIServer
        AlertsConfig -->|Submitted to| APIServer
        APIServer -->|Watched by| ControllerManager
        ControllerManager -->|Updates status| APIServer
        ConfigMap -->|Configuration| ControllerManager
        Secret -->|API Credentials| ControllerManager
    end

    %% Monitoring systems
    subgraph Monitoring Systems
        Wavefront[Wavefront]:::wavefront
        Splunk[Splunk]:::wavefront
        ControllerManager -->|Creates/Updates Alerts| Wavefront
        ControllerManager -->|Creates/Updates Alerts| Splunk
        Wavefront -->|Alert Status| ControllerManager
    end

    %% CR relationships
    AlertsConfig -.->|References| WavefrontAlert
    WavefrontAlert -.->|Status includes| AlertsConfig
```
```mermaid
sequenceDiagram
    actor User
    participant WavefrontAlert as WavefrontAlert CR
    participant Controller as Alert Manager Controller
    participant Wavefront as Wavefront API
    User->>WavefrontAlert: Create/Update CR
    WavefrontAlert->>Controller: Notify of change
    Controller->>WavefrontAlert: Process alert definition
    Controller->>Wavefront: Create/Update alert
    Wavefront-->>Controller: Return alert ID & status
    Controller->>WavefrontAlert: Update status with alert ID & link
    Note over Controller,Wavefront: Periodic reconciliation
    Controller->>Wavefront: Check alert status
    Wavefront-->>Controller: Return current status
    Controller->>WavefrontAlert: Update CR status
```
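The create/update/status steps in this flow can be sketched as a single idempotent reconcile function. This is a minimal illustration, not the operator's actual code: `FakeWavefront`, `WavefrontAlertCR`, the `Ready` state, and the `wavefront.example.com` link are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AlertStatus:
    alert_id: Optional[str] = None
    link: Optional[str] = None
    state: str = "Unknown"


@dataclass
class WavefrontAlertCR:
    name: str
    spec: Dict[str, str]
    status: AlertStatus = field(default_factory=AlertStatus)


class FakeWavefront:
    """In-memory stand-in for the Wavefront API (hypothetical interface)."""

    def __init__(self) -> None:
        self.alerts: Dict[str, Dict[str, str]] = {}

    def create_alert(self, spec: Dict[str, str]) -> str:
        alert_id = f"alert-{len(self.alerts) + 1}"
        self.alerts[alert_id] = dict(spec)
        return alert_id

    def update_alert(self, alert_id: str, spec: Dict[str, str]) -> None:
        self.alerts[alert_id] = dict(spec)


def reconcile(cr: WavefrontAlertCR, api: FakeWavefront) -> None:
    """Bring the remote alert in line with the CR, then record the result."""
    if cr.status.alert_id is None:
        # First reconciliation: no remote alert exists yet.
        cr.status.alert_id = api.create_alert(cr.spec)
    elif api.alerts.get(cr.status.alert_id) != cr.spec:
        # Remote definition is missing or has drifted from the desired state.
        api.update_alert(cr.status.alert_id, cr.spec)
    # Hypothetical link format, used only for illustration.
    cr.status.link = f"https://wavefront.example.com/alerts/{cr.status.alert_id}"
    cr.status.state = "Ready"
```

Calling `reconcile` repeatedly with an unchanged spec leaves the remote alert untouched, which is what makes periodic reconciliation safe.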
```mermaid
sequenceDiagram
    actor User
    participant AlertsConfig as AlertsConfig CR
    participant WavefrontAlert as WavefrontAlert Templates
    participant Controller as Alert Manager Controller
    participant Wavefront as Wavefront API
    User->>AlertsConfig: Create/Update CR with parameters
    AlertsConfig->>Controller: Notify of change
    Controller->>WavefrontAlert: Get alert templates
    Controller->>Controller: Process templates with parameters
    loop For each alert in config
        Controller->>Wavefront: Create/Update alert
        Wavefront-->>Controller: Return alert ID & status
    end
    Controller->>AlertsConfig: Update status with alerts info
    Controller->>WavefrontAlert: Update referenced templates status
```
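The "process templates with parameters" step and the per-alert loop might look roughly like the sketch below. The `${name}` placeholder syntax, the `TEMPLATES` mapping, and the field names are assumptions made for illustration; the operator's real template format may differ.

```python
from string import Template
from typing import Dict, List

# Hypothetical templates; in the flow above these would be WavefrontAlert CRs.
TEMPLATES: Dict[str, Dict[str, str]] = {
    "high-cpu": {
        "alertName": "High CPU on ${app}",
        "condition": "ts(cpu.usage, app=${app}) > ${threshold}",
    },
}


def render_alert(template: Dict[str, str], params: Dict[str, str]) -> Dict[str, str]:
    """Substitute parameters into every string field of one template.

    Raises KeyError if a placeholder has no matching parameter.
    """
    return {key: Template(value).substitute(params) for key, value in template.items()}


def render_config(alerts: Dict[str, Dict[str, str]]) -> List[Dict[str, str]]:
    """Render each alert listed in an AlertsConfig-like mapping."""
    return [render_alert(TEMPLATES[name], params) for name, params in alerts.items()]
```

Using `substitute` rather than `safe_substitute` makes a missing parameter fail loudly instead of sending a half-rendered query to the monitoring system.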
A WavefrontAlert CR defines a specific alert in Wavefront with:
- Alert name and conditions
- Notification targets
- Severity
- Display expressions
- Tags
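As a rough illustration of these fields, the sketch below models an alert definition and checks it before it would be sent anywhere. The class, its field names, and the set of accepted severities are assumptions for this example rather than the CRD's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed severity values for this sketch.
SEVERITIES = {"info", "smoke", "warn", "severe"}


@dataclass
class WavefrontAlertSpec:
    alert_name: str
    condition: str
    display_expression: str
    severity: str
    targets: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; empty means valid."""
        problems = []
        if not self.alert_name.strip():
            problems.append("alert_name must not be empty")
        if not self.condition.strip():
            problems.append("condition must not be empty")
        if self.severity not in SEVERITIES:
            problems.append(f"severity must be one of {sorted(SEVERITIES)}")
        return problems
```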
An AlertsConfig CR allows efficient management of multiple similar alerts by:
- Referencing alert templates (WavefrontAlert CRs)
- Providing parameters to customize the templates
- Enabling/disabling specific alerts
- Overriding default template values
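One way to picture overrides and enable/disable flags is as a layered merge. In this sketch, `AlertEntry` and the precedence order (template defaults, then config-wide parameters, then per-alert parameters) are assumptions, not confirmed behavior of the operator.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AlertEntry:
    template: str  # name of the referenced template
    params: Dict[str, str] = field(default_factory=dict)
    enabled: bool = True


def resolve(defaults: Dict[str, str], shared: Dict[str, str], entry: AlertEntry) -> Dict[str, str]:
    """Merge parameter layers; later layers win.

    Assumed order: template defaults < config-wide params < per-alert params.
    """
    return {**defaults, **shared, **entry.params}


def active_entries(entries: List[AlertEntry]) -> List[AlertEntry]:
    """Keep only the alerts that have not been disabled."""
    return [entry for entry in entries if entry.enabled]
```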
The controller:
- Watches for changes to alert-related CRs
- Reconciles the desired state (CRs) with the actual state (monitoring systems)
- Manages the lifecycle of alerts in monitoring systems
- Updates CR status with current alert information
- Handles error conditions and retries
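Retry handling is commonly implemented with exponential backoff. The sketch below shows that general technique; `TransientError`, the default attempt count, and the delays are hypothetical and not taken from the operator's configuration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. timeouts)."""


def with_retries(
    call: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the backoff schedule can be checked in tests without actually waiting.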
alert-manager currently supports these monitoring systems:
- Wavefront: Complete implementation
- Splunk: Planned for future release
The AlertsConfig approach addresses a key scalability concern, the number of CRs needed per application:
- Traditional approach: 100 alert types × 1 application = 100 CRs
- Alert-manager approach: 1 AlertsConfig per application = 1 CR
For large environments (e.g., 450 applications with 100 alert types):
- Traditional approach: 45,000 CRs (risk of etcd overload)
- Alert-manager approach: 450 AlertsConfig CRs, plus the shared WavefrontAlert templates they reference (manageable)
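The arithmetic behind these figures can be checked with a few lines. The `templates` count assumes one shared WavefrontAlert template per alert type, which the text implies but does not state outright.

```python
from typing import Dict


def cr_counts(applications: int, alert_types: int) -> Dict[str, int]:
    """Compare CR counts for per-alert CRs versus one AlertsConfig per application."""
    return {
        "traditional": applications * alert_types,  # one CR per alert per application
        "alerts_config": applications,              # one AlertsConfig per application
        "templates": alert_types,                   # assumed: one shared template per alert type
    }


print(cr_counts(applications=450, alert_types=100))
```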
Security considerations:
- The controller requires API credentials for the monitoring systems
- Credentials stored as Kubernetes Secrets
- RBAC controls who can create/modify alert resources
- Namespace-scoped resources allow isolation between teams
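Values under a Kubernetes Secret's `data` field are base64-encoded, so a controller reading credentials has to decode them first. A minimal sketch follows; the key name `wavefront-api-token` used in the usage note is hypothetical.

```python
import base64
from typing import Dict


def decode_secret(data: Dict[str, str]) -> Dict[str, str]:
    """Decode the base64-encoded values found under a Secret's `data` field."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}
```

For example, `decode_secret({"wavefront-api-token": "czNjcjN0"})` yields the plain token string. Decoded values should stay in memory only and never be written to CR status or logs.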
alert-manager is configured through a ConfigMap, which defines:
- Monitoring system endpoints
- Default settings
- Retry parameters
- Logging levels
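Because ConfigMap data arrives as string key/value pairs, a controller typically converts it into typed settings and fills in defaults. The keys below (`wavefront.url`, `retry.max`, `log.level`) and their default values are invented for this sketch and may not match the real ConfigMap.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ControllerConfig:
    wavefront_url: str
    max_retries: int = 5
    log_level: str = "info"


def load_config(data: Mapping[str, str]) -> ControllerConfig:
    """Build a typed config from ConfigMap-style string data, applying defaults."""
    if "wavefront.url" not in data:
        raise ValueError("wavefront.url is required")
    return ControllerConfig(
        wavefront_url=data["wavefront.url"].rstrip("/"),
        max_retries=int(data.get("retry.max", "5")),
        log_level=data.get("log.level", "info").lower(),
    )
```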