Open
Description
Currently, the alerting rules are the defaults generated by Sloth. Sloth creates one alert rule for the paging severity and one for the ticket severity. However, the paging rule combines two conditions joined by a single OR. Here's an example:
(
max(slo:sli_error:ratio_rate5m{sloth_id="autometrics-success-rate-95", sloth_service="autometrics", sloth_slo="success-rate-95"} > (14.4 * 0.05)) without (sloth_window)
and
max(slo:sli_error:ratio_rate1h{sloth_id="autometrics-success-rate-95", sloth_service="autometrics", sloth_slo="success-rate-95"} > (14.4 * 0.05)) without (sloth_window)
)
or
(
max(slo:sli_error:ratio_rate30m{sloth_id="autometrics-success-rate-95", sloth_service="autometrics", sloth_slo="success-rate-95"} > (6 * 0.05)) without (sloth_window)
and
max(slo:sli_error:ratio_rate6h{sloth_id="autometrics-success-rate-95", sloth_service="autometrics", sloth_slo="success-rate-95"} > (6 * 0.05)) without (sloth_window)
)
When this alert fires, you cannot tell whether it was triggered by the 1h/5m window pair (burn rate 14.4) or by the 6h/30m window pair (burn rate 6). This information isn't exposed in any label either. It could be worth splitting this rule into two separate rules so the user can see which time windows and burn rate generated the alert. We could also include this information in a label on the alert so it can be better understood and displayed in explorer.
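As a rough sketch of what the split could look like, the paging rule above might become two Prometheus alerting rules, one per window pair, each carrying labels that describe its burn rate and windows. The group name, the alert names, and the `burn_rate`, `long_window`, and `short_window` labels are hypothetical and chosen only for illustration. The `severity: page` label is likewise an assumption about how the paging severity is marked. Only the expressions are taken from the example above.

```yaml
groups:
  - name: autometrics-success-rate-95-page-alerts  # hypothetical group name
    rules:
      # Fast burn: 14.4x burn rate, 1h long window confirmed by the 5m short window.
      - alert: SuccessRate95FastBurn  # hypothetical alert name
        expr: |
          max(slo:sli_error:ratio_rate5m{sloth_id="autometrics-success-rate-95", sloth_service="autometrics", sloth_slo="success-rate-95"} > (14.4 * 0.05)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate1h{sloth_id="autometrics-success-rate-95", sloth_service="autometrics", sloth_slo="success-rate-95"} > (14.4 * 0.05)) without (sloth_window)
        labels:
          severity: page       # assumed label for the paging severity
          burn_rate: "14.4"    # hypothetical label
          long_window: 1h      # hypothetical label
          short_window: 5m     # hypothetical label

      # Slower burn: 6x burn rate, 6h long window confirmed by the 30m short window.
      - alert: SuccessRate95SlowBurn  # hypothetical alert name
        expr: |
          max(slo:sli_error:ratio_rate30m{sloth_id="autometrics-success-rate-95", sloth_service="autometrics", sloth_slo="success-rate-95"} > (6 * 0.05)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate6h{sloth_id="autometrics-success-rate-95", sloth_service="autometrics", sloth_slo="success-rate-95"} > (6 * 0.05)) without (sloth_window)
        labels:
          severity: page       # assumed label for the paging severity
          burn_rate: "6"       # hypothetical label
          long_window: 6h      # hypothetical label
          short_window: 30m    # hypothetical label
```

One trade-off with this layout: when both conditions hold at the same time, two alerts fire where the combined rule fired only one, so any grouping or deduplication on the receiving side may need to account for that.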