Change alert config to have a rule for each alerting window #82

Open
@IvanMerrill

Currently, the alerting rules are just the defaults generated by Sloth. Sloth creates one alert rule for the paging severity and one for the ticket severity. However, the paging-severity rule actually evaluates two conditions joined by a big OR. Here's an example:

    (
        max(slo:sli_error:ratio_rate5m{sloth_id="autometrics-success-rate-95", sloth_service="autometrics", sloth_slo="success-rate-95"} > (14.4 * 0.05)) without (sloth_window)
        and
        max(slo:sli_error:ratio_rate1h{sloth_id="autometrics-success-rate-95", sloth_service="autometrics", sloth_slo="success-rate-95"} > (14.4 * 0.05)) without (sloth_window)
    )
    or
    (
        max(slo:sli_error:ratio_rate30m{sloth_id="autometrics-success-rate-95", sloth_service="autometrics", sloth_slo="success-rate-95"} > (6 * 0.05)) without (sloth_window)
        and
        max(slo:sli_error:ratio_rate6h{sloth_id="autometrics-success-rate-95", sloth_service="autometrics", sloth_slo="success-rate-95"} > (6 * 0.05)) without (sloth_window)
    )

When this alert fires, you cannot tell whether it was triggered by the 1h/5m window pair (burn rate 14.4) or by the 6h/30m window pair (burn rate 6). This information isn't included in any label either. It could be worth splitting this rule into two separate rules, so the user can see which time windows and burn rate generated the alert. We could also include this information in a label on each alert, so it can be better understood and displayed in explorer.
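As a rough sketch of what the split could look like, here is the paging rule above written as two Prometheus alerting rules, one per window pair. The PromQL is copied from the example; the group name, the alert names, and the `short_window`, `long_window` and `burn_rate` labels are hypothetical placeholders chosen for illustration, not anything Sloth generates today.

```yaml
groups:
  - name: autometrics-success-rate-95-page-alerts  # hypothetical group name
    rules:
      # Fast burn: 5m and 1h windows, burn rate 14.4
      - alert: AutometricsSuccessRate95FastBurn  # hypothetical alert name
        expr: |
          max(slo:sli_error:ratio_rate5m{sloth_id="autometrics-success-rate-95", sloth_service="autometrics", sloth_slo="success-rate-95"} > (14.4 * 0.05)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate1h{sloth_id="autometrics-success-rate-95", sloth_service="autometrics", sloth_slo="success-rate-95"} > (14.4 * 0.05)) without (sloth_window)
        labels:
          # Hypothetical labels describing which condition fired.
          # Any labels the existing page rule carries would be kept alongside these.
          short_window: 5m
          long_window: 1h
          burn_rate: "14.4"

      # Slow burn: 30m and 6h windows, burn rate 6
      - alert: AutometricsSuccessRate95SlowBurn  # hypothetical alert name
        expr: |
          max(slo:sli_error:ratio_rate30m{sloth_id="autometrics-success-rate-95", sloth_service="autometrics", sloth_slo="success-rate-95"} > (6 * 0.05)) without (sloth_window)
          and
          max(slo:sli_error:ratio_rate6h{sloth_id="autometrics-success-rate-95", sloth_service="autometrics", sloth_slo="success-rate-95"} > (6 * 0.05)) without (sloth_window)
        labels:
          short_window: 30m
          long_window: 6h
          burn_rate: "6"
```

One consequence of this layout is that both alerts can fire at the same time when both conditions hold, whereas the combined rule produces a single alert. Whether that is acceptable, or whether the two should be grouped in Alertmanager, is a design choice to settle as part of this change.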
