Commit 121f209

Merge pull request #6058 from ClickHouse/drew/sql-alerting-clickstack
Add docs for SQL-based alerting in ClickStack
2 parents 72914b4 + 28c13e9 commit 121f209

3 files changed

Lines changed: 156 additions & 1 deletion

File tree

docs/use-cases/observability/clickstack/alerts.md

Lines changed: 156 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,8 @@ import add_webhook_dialog from '@site/static/images/use-cases/observability/add_
2020
import manage_alerts from '@site/static/images/use-cases/observability/manage_alerts.png';
2121
import alerts_view from '@site/static/images/use-cases/observability/alerts_view.png';
2222
import multiple_search_alerts from '@site/static/images/use-cases/observability/multiple_search_alerts.png';
23+
import add_raw_sql_alert from '@site/static/images/use-cases/observability/add_raw_sql_alert.png';
24+
import open_sql_chart_mode from '@site/static/images/use-cases/observability/open_sql_chart_mode.png';
2325
import remove_chart_alert from '@site/static/images/use-cases/observability/remove_chart_alert.png';
2426
import Tabs from '@theme/Tabs';
2527
import TabItem from '@theme/TabItem';
@@ -103,7 +105,7 @@ Select **Add Alert**.
103105

104106
#### Define the alert conditions {#define-alert-conditions}
105107

106-
Define the condition (`>=`, `<`), threshold, duration, and webhook. The duration here will also dictate how often the alert is triggered.
108+
Define the condition (`>=`, `>`, `<=`, `<`, `=`, `!=`, `<= x >=`, `> or <`), threshold, duration, and webhook. The duration here will also dictate how often the alert is triggered.
107109

108110
<Image img={create_chart_alert} alt="Create alert for chart" size="lg"/>
109111

@@ -189,6 +191,159 @@ In the example below, the `Remove Alert` button will remove the alert from the c
189191

190192
<Image img={remove_chart_alert} alt="Remove chart alert" size="lg"/>
191193

194+
## SQL-based chart alerts {#sql-based-alerts}
195+
196+
SQL-based chart alerts let you define alert conditions with arbitrary ClickHouse SQL, giving you full control over filtering, aggregation, and math. Any query whose result can be compared against a threshold can drive an alert.
197+
198+
### Supported chart types {#supported-chart-types}
199+
200+
SQL-based alerts are supported on three chart display types:
201+
202+
| Chart type | Behavior |
203+
|---|---|
204+
| **Line** | Time-series alert. The query must produce time-bucketed rows. Each bucket is evaluated independently against the threshold. |
205+
| **Stacked Bar** | Time-series alert. Same behavior as Line. |
206+
| **Number** | Single-value alert. The query returns a single numeric result which is compared against the threshold once per evaluation. |
207+
208+
Other SQL-based chart types (Table, Pie, Heatmap, etc.) do not support alerts.
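
As an illustration of the single-value case, a Number chart alert query could look like the sketch below. It assumes the source table exposes `Timestamp` and `StatusCode` columns, as the example queries elsewhere in this document do; adjust the names to match your schema.

```sql
-- Sketch of a Number chart alert: the query returns exactly one numeric value.
-- Column names (Timestamp, StatusCode) are assumptions about the source table.
SELECT count() AS error_count
FROM $__sourceTable
WHERE $__timeFilter(Timestamp)
    AND StatusCode = 'Error'
```

Because there is no time bucket and no grouping column, the result is a single row, which is compared against the threshold once per evaluation.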
209+
210+
### Creating a SQL alert {#create-sql-based-alert}
211+
212+
To create an alert on a SQL-based chart:
213+
214+
<VerticalStepper headerLevel="h4">
215+
216+
#### Create or open a SQL-based chart on a dashboard {#open-sql-chart}
217+
218+
From a saved dashboard, either [create a new chart with the **SQL** chart mode](./dashboards/sql-visualizations.md), or open an existing SQL-based chart for editing.
219+
220+
Choose **Line**, **Stacked Bar**, or **Number** as the display type.
221+
222+
<Image img={open_sql_chart_mode} alt="Create SQL chart" size="lg"/>
223+
224+
#### Add the alert {#add-sql-alert}
225+
226+
Select **Add Alert** from the alert section of the chart editor. Configure:
227+
228+
- **Threshold type**: `>=` (greater than or equal), `>` (greater than), `<=` (less than or equal), `<` (less than), `=` (equal), `!=` (not equal), `<= x >=` (between), or `> or <` (outside)
229+
- **Threshold value**: The numeric value to compare against
230+
- **Interval**: How often the alert is evaluated (1m, 5m, 15m, 30m, 1h, 6h, 12h, or 1d). This also defines the time window for each evaluation.
231+
- **Webhook**: The notification channel to use when the alert fires. See [Adding a webhook](#add-webhook).
232+
233+
<Image img={add_raw_sql_alert} alt="Edit chart alert" size="lg"/>
234+
235+
:::warning Alert Time Range
236+
Alert queries normally run once per interval. If one or more intervals are skipped because of errors or slow queries, the next execution uses a time range that also covers the missed intervals. In that case, the query's interval parameters are still set to the alert's configured period, but the time range parameters reflect the longer range.
237+
:::
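
To make the skipped-interval case concrete, the standalone ClickHouse query below works through hypothetical numbers: a 5-minute alert whose previous run was skipped, so the next run covers 10 minutes. The timestamps are invented for illustration, and the exact range boundaries ClickStack chooses may differ.

```sql
-- Hypothetical values only; nothing here is read from ClickStack.
SELECT
    300 AS interval_seconds,                           -- configured 5m period (unchanged)
    toDateTime('2025-01-01 10:00:00') AS range_start,  -- start of the missed interval
    toDateTime('2025-01-01 10:10:00') AS range_end,    -- end of the current interval
    dateDiff('second', range_start, range_end) AS range_seconds,
    intDiv(range_seconds, interval_seconds) AS buckets_in_range
```

Here `range_seconds` is 600 and `buckets_in_range` is 2, so a time-series query run over this range would return two buckets instead of one, each evaluated against the threshold.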
238+
239+
#### Save the dashboard {#save-sql-dashboard}
240+
241+
Save the dashboard to activate the alert. The alert will begin evaluating on the configured interval.
242+
243+
</VerticalStepper>
244+
245+
### How query results are interpreted {#sql-result-interpretation}
246+
247+
The alert system inspects the columns returned by your SQL query to determine what to compare against the threshold.
248+
249+
- **Value column**: The **last numeric column** in your `SELECT` clause is used as the alert value. If your query returns multiple numeric columns (e.g., `count, avg_latency, p99_latency`), only the last one (`p99_latency`) is compared to the threshold.
250+
- **Timestamp column**: For time-series charts (Line and Stacked Bar), the system identifies the Date/DateTime column in your results as the time bucket (i.e. the x-axis on a time-series chart). The value column for each time bucket is evaluated against the threshold independently, and if the value for any time bucket breaches the configured threshold, the alert will trigger.
251+
- **Group columns**: Any non-numeric, non-timestamp columns (e.g., `ServiceName`, `Environment`) are treated as grouping dimensions. When groups are present, each unique combination of group values is tracked and alerted on separately. ClickStack will send an alert for each group with a value that breaches the configured threshold. Groups are only available for time-series charts.
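
The rules above can be read off a query's `SELECT` list. The sketch below annotates each column with the role it would play under those rules; the column names (`Timestamp`, `ServiceName`, `Duration`) are assumptions about the source table, not something the alert system requires.

```sql
-- Sketch: how each returned column would be interpreted for a Line or Stacked Bar alert.
-- Column names are assumptions about the source table.
SELECT
    $__timeInterval(Timestamp) AS ts,         -- DateTime: timestamp column (time bucket)
    ServiceName,                              -- non-numeric, non-timestamp: group column
    count() AS request_count,                 -- numeric, but not last: not compared
    quantile(0.99)(Duration) AS p99_latency   -- last numeric column: alert value
FROM $__sourceTable
WHERE $__timeFilter(Timestamp)
GROUP BY ts, ServiceName
```

With this result shape, only `p99_latency` is compared against the threshold, and each `ServiceName` is tracked and alerted on separately.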
252+
253+
### Query parameters and macros {#query-params}
254+
255+
SQL alert queries support template parameters and macros that are automatically replaced at evaluation time. These are the same parameters and macros available when [building a SQL-based chart](./dashboards/sql-visualizations.md).
256+
257+
#### Required and recommended parameters {#required-alert-parameters}
258+
259+
Queries used for line or stacked bar chart alerts **must** include an interval parameter or macro (`{intervalSeconds:Int64}`, `{intervalMilliseconds:Int64}`, `$__timeInterval(col)`, or `$__timeInterval_ms(col)`). During alert execution, it will be replaced with the alert's configured period.
260+
261+
Queries used for alerts **should** include a time range filter (`{startDateMilliseconds:Int64}` and `{endDateMilliseconds:Int64}`, or `$__timeFilter(col)`, etc.). The alert query runs on the alert's configured period whether or not a time range filter is present. Without one, each execution reads the entire time range available in the source table.
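
As a sketch of how these pieces fit together without macros, the query below uses the interval and time range parameters directly. The `Timestamp` column is an assumption about the source table; the parameter names are the ones listed above.

```sql
-- Sketch: time-series alert query built from parameters instead of macros.
-- The Timestamp column is an assumption about the source table.
SELECT
    toStartOfInterval(Timestamp, toIntervalSecond({intervalSeconds:Int64})) AS ts,
    count() AS row_count
FROM $__sourceTable
WHERE Timestamp >= fromUnixTimestamp64Milli({startDateMilliseconds:Int64})
    AND Timestamp < fromUnixTimestamp64Milli({endDateMilliseconds:Int64})
GROUP BY ts
ORDER BY ts
```

The interval parameter sets the bucket width, and the two time range parameters keep each execution from scanning the whole table.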
262+
266+
267+
### Example alert queries {#example-queries}
268+
269+
#### Error rate per service (time-series) {#example-error-rate}
270+
271+
Alert when any service has an error rate of 5% or higher, considering only services with more than 10 requests in the alert period to avoid noisy alerts on low-traffic services.
272+
273+
```sql
WITH error_rates AS (
    SELECT
        $__timeInterval(Timestamp) AS ts,
        ServiceName,
        countIf(SpanKind = 'Server') AS request_count,
        countIf(
            SpanKind = 'Server'
            AND StatusCode = 'Error'
        ) AS error_count,
        error_count / request_count * 100 AS error_percent
    FROM $__sourceTable
    WHERE $__timeFilter(Timestamp)
    GROUP BY ts, ServiceName
)
SELECT ts, ServiceName, error_percent
FROM error_rates
WHERE request_count > 10
```
292+
293+
**Display type**: Line or Stacked Bar
294+
**Threshold**: `>= 5` (fires when error rate reaches 5%)
295+
296+
In this query, `ServiceName` is a non-numeric, non-timestamp column, so each service is tracked as a separate alert group. The alert fires independently per service.
297+
298+
#### Anomaly detection with lagging average (time-series) {#example-anomaly-detection}
299+
300+
Alert when the error count exceeds the rolling average of the previous 30 intervals by more than two standard deviations. This catches spikes relative to the recent baseline rather than a fixed threshold.
301+
302+
```sql
WITH buckets AS (
    SELECT
        $__timeInterval(Timestamp) AS ts,
        count() AS bucket_count
    FROM $__sourceTable
    WHERE TimestampTime >= fromUnixTimestamp64Milli({startDateMilliseconds:Int64})
            - toIntervalSecond($__interval_s * 30) -- Fetch 30 intervals back
        AND TimestampTime < fromUnixTimestamp64Milli({endDateMilliseconds:Int64})
        AND SeverityText = 'error'
    GROUP BY ts
    ORDER BY ts
    WITH FILL
        FROM toDateTime(fromUnixTimestamp64Milli({startDateMilliseconds:Int64}))
        TO toDateTime(fromUnixTimestamp64Milli({endDateMilliseconds:Int64}))
        STEP toIntervalSecond($__interval_s)
),

anomaly_detection AS (
    SELECT
        ts,
        bucket_count,
        avg(bucket_count) OVER (
            ORDER BY ts ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
        ) AS previous_30_avg,
        stddevPop(bucket_count) OVER (
            ORDER BY ts ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
        ) AS previous_30_stddev,
        greatest(
            bucket_count - (previous_30_avg + 2 * previous_30_stddev), 0
        ) AS excess_error_count
    FROM buckets
)

SELECT ts, excess_error_count
FROM anomaly_detection
WHERE ts >= fromUnixTimestamp64Milli({startDateMilliseconds:Int64})
    AND ts < fromUnixTimestamp64Milli({endDateMilliseconds:Int64})
```
341+
342+
**Display type**: Line
343+
**Threshold**: `> 0` (fires when excess errors above the rolling baseline are detected)
344+
345+
Note that the query fetches 30 intervals *before* the start of the date range to seed the rolling window calculations, then filters the final output to only the evaluation window.
346+
192347
## Common alert scenarios {#common-alert-scenarios}
193348

194349
Here are a few common alert scenarios you can use HyperDX for: