
Commit 47a1b13

LiuTianyou and tomsun28 authored
[doc] add doc for alarm grouping and alarm inhibit (#3206)
Co-authored-by: tomsun28 <tomsun28@outlook.com>
1 parent d5fc492 commit 47a1b13

9 files changed

Lines changed: 159 additions & 1 deletion

File tree

home/docs/help/alarm_group.md

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,24 @@
1+
---
2+
id: alarm_group
3+
title: Alarm Grouping
4+
sidebar_label: Alarm Grouping
5+
keywords: [Open source monitoring system, alarm reduce, alarm grouping]
6+
---
7+
8+
> Alarm grouping merges alarms that share the specified group labels and deduplicates identical alarms that repeat within a time period. When a threshold rule triggers an alarm, or an external alarm is reported, the alarm enters alarm grouping, where it is grouped and deduplicated so that a flood of alarm messages does not cause an alarm storm.
9+
10+
## Grouping Policy Parameter Configuration
11+
12+
- Strategy Name: The name that uniquely identifies the grouping policy
13+
- Group Labels: The labels used to group alarms. Up to 10 labels are supported.
14+
15+
> Label sources: monitors, threshold rules, and labels carried by external alarms
16+
17+
- Wait Time: How long to wait after a new alarm is generated. Identical alarms received during this time are grouped together. Default: 30 seconds.
18+
19+
> When a new alarm is generated that cannot join an existing group, alarm grouping waits for the `wait time`. During that window, identical alarms and alarms that meet the grouping conditions are added to the group. Once the time elapsed since the first alarm in the group exceeds the `wait time`, the grouped alarms are sent to the alarm inhibition module for further processing.
20+
21+
- Interval time: The minimum interval between group alarm notifications, which keeps notifications from being sent too often. Default: 5 minutes.
22+
- Repeat interval: The minimum interval between notifications for a repeated alarm, which avoids repeated notifications for an alarm that keeps firing. Default: 4 hours.
23+
24+
**Note**: Only grouped alarms can be suppressed using suppression rules.
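
The wait-time behaviour described in this file can be sketched as follows. This is an illustrative sketch only, not HertzBeat's implementation: the `Grouper` class, its method names, and the `alertname` label used in the usage note are hypothetical, and alarms are modelled as plain label dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Alarms that share the same values for the configured group labels."""
    first_seen: float                      # arrival time of the first alarm, in seconds
    alarms: list = field(default_factory=list)


def group_key(labels: dict, group_labels: list) -> tuple:
    """Build a grouping key from the values of the configured group labels."""
    return tuple((name, labels.get(name)) for name in group_labels)


class Grouper:
    """Hypothetical grouper: collects alarms per key and flushes after wait_time."""

    def __init__(self, group_labels, wait_time=30.0):
        self.group_labels = group_labels
        self.wait_time = wait_time         # documented default: 30 seconds
        self.pending = {}

    def add(self, labels: dict, now: float) -> None:
        """Add an alarm to its group, opening a new group if none matches."""
        key = group_key(labels, self.group_labels)
        group = self.pending.setdefault(key, Group(first_seen=now))
        # Deduplicate: an alarm with identical labels is kept only once.
        if labels not in group.alarms:
            group.alarms.append(labels)

    def flush(self, now: float) -> list:
        """Return and drop the groups whose first alarm is older than wait_time."""
        ready = [key for key, group in self.pending.items()
                 if now - group.first_seen > self.wait_time]
        return [self.pending.pop(key).alarms for key in ready]
```

With `Grouper(["instancehost"])`, two identical alarms for the same host collapse into one entry, and `flush` returns that host's group only once more than 30 seconds have passed since its first alarm. The interval time and repeat interval settings are not modelled here.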

home/docs/help/alarm_inhibit.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
1+
---
2+
id: alarm_inhibit
3+
title: Alarm Inhibition
4+
sidebar_label: Alarm Inhibition
5+
keywords: [ Open Source Monitoring System, Alarm Convergence, Alarm Inhibition ]
6+
---
7+
8+
> Alarm inhibition configures inhibition relationships between alarms: while one alarm is active, other alarms can be suppressed. It can be understood as "important" alarms suppressing
9+
> "unimportant" alarms. For example, the alarm raised when a server goes down suppresses the alarms raised by other services running on that server.
10+
11+
## Prerequisites
12+
13+
- Correctly configure the alarm grouping rule
14+
15+
## Inhibit rule configuration
16+
17+
- Inhibit Rule Name: The name that uniquely identifies the suppression rule
18+
19+
- Source Labels: When an alarm contains these labels, the target alarms are suppressed. Multiple labels can be added.
20+
21+
> These labels identify the "important" alarm. An alarm must contain all source labels to suppress the alarms matched by the target labels.
22+
23+
- Target Labels: Alarms matching these labels are suppressed.
24+
25+
> These labels identify the "unimportant" alarms. An alarm must contain all target labels to be suppressed.
26+
27+
- Equal Labels: Labels used to determine whether alarms are related. Up to 10 labels are supported.
28+
- Enabled: Enable or disable this inhibit rule
29+
30+
## Example
31+
32+
Scenario: HertzBeat monitors two CentOS servers, 192.168.1.1 and 192.168.1.2, and the Redis services Redis-1 and Redis-2 deployed on them.
33+
The following threshold rules are configured:
34+
35+
- Monitor Centos Linux / Monitor availability. Bind label `server-status:down`
36+
- Monitor Redis database / Monitor availability. Bind label `redis-status:down`
37+
38+
To stop Redis alarms from being generated once the CentOS downtime alarm has fired, configure the following alarm inhibit rule:
39+
40+
- Source label: `server-status:down`
41+
- Target label: `redis-status:down`
42+
- Equal label: `instancehost`
43+
44+
When the downtime alarm for CentOS 192.168.1.1 is generated, the Redis-1 unavailable alarm is no longer generated. Meanwhile, if CentOS 192.168.1.2 is running normally and Redis-2 is
45+
unavailable, the Redis-2 unavailable alarm is still generated as usual.
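
The example above can be sketched in code. This is a minimal sketch under stated assumptions, not HertzBeat's implementation: the function names are hypothetical, a label such as `server-status:down` is modelled as a key/value pair, each alarm is assumed to carry an `instancehost` label holding its server address, and "equal labels" is assumed to mean that source and target alarms must share the same value for each listed label.

```python
def contains(labels: dict, required: dict) -> bool:
    """True when the alarm carries every required label with the same value."""
    return all(labels.get(key) == value for key, value in required.items())


def is_inhibited(target, active_alarms, source_labels, target_labels, equal_labels):
    """Return True when some active source alarm suppresses the target alarm."""
    if not contains(target, target_labels):
        return False
    for source in active_alarms:
        if source is target or not contains(source, source_labels):
            continue
        # Assumption: source and target must agree on every equal label.
        if all(key in source and source.get(key) == target.get(key)
               for key in equal_labels):
            return True
    return False


# The inhibit rule from the example.
rule = dict(
    source_labels={"server-status": "down"},
    target_labels={"redis-status": "down"},
    equal_labels=["instancehost"],
)

# Hypothetical active alarms: server 192.168.1.1 is down, both Redis services are unavailable.
server_1_down = {"server-status": "down", "instancehost": "192.168.1.1"}
redis_1_down = {"redis-status": "down", "instancehost": "192.168.1.1"}
redis_2_down = {"redis-status": "down", "instancehost": "192.168.1.2"}
active = [server_1_down, redis_1_down, redis_2_down]

print(is_inhibited(redis_1_down, active, **rule))  # True: same host as the down server
print(is_inhibited(redis_2_down, active, **rule))  # False: its server has no downtime alarm
```

Under these assumptions the Redis-1 alarm is suppressed because it shares `instancehost` with the server downtime alarm, while the Redis-2 alarm passes through.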

home/docs/help/guide.md

Lines changed: 7 additions & 0 deletions
@@ -116,6 +116,13 @@ More details see&emsp;&#x1F449;&emsp;[Alarm center](alarm_center)
116116
More details see&emsp;&#x1F449;&emsp;[Threshold alarm](alert_threshold) <br />
117117
&emsp;&emsp;&emsp;&#x1F449;&emsp;[Threshold expression](alert_threshold_expr)
118118

119+
### Alarm reduce
120+
121+
> Alarm grouping, alarm inhibition, and related features merge related alarms, which reduces alarm storms caused by a single event, cuts alarm noise, and improves alarm response efficiency.
122+
123+
More details see&emsp;&#x1F449;&emsp;[Alarm grouping](alarm_group) <br />
124+
&emsp;&emsp;&emsp;&#x1F449;&emsp;[Alarm inhibit](alarm_inhibit)
125+
119126
### Alarm notification
120127

121128
> After triggering the alarm information, in addition to being displayed in the alarm center list, it can also be notified to the designated recipient in a specified way (e-mail, wechat and FeiShu etc.)

home/i18n/en/docusaurus-plugin-content-docs/current.json

Lines changed: 4 additions & 0 deletions
@@ -51,6 +51,10 @@
5151
"message": "Threshold Alarm Setting",
5252
"description": "The label for category threshold in sidebar docs"
5353
},
54+
"sidebar.docs.category.reduce": {
55+
"message": "Alarm Reduce Setting",
56+
"description": "The label for category reduce in sidebar docs"
57+
},
5458
"sidebar.docs.category.notice": {
5559
"message": "Alarm Notice Setting",
5660
"description": "The label for category notice in sidebar docs"

home/i18n/zh-cn/docusaurus-plugin-content-docs/current.json

Lines changed: 4 additions & 0 deletions
@@ -55,6 +55,10 @@
5555
"message": "阈值告警配置",
5656
"description": "The label for category threshold in sidebar docs"
5757
},
58+
"sidebar.docs.category.reduce": {
59+
"message": "告警收敛配置",
60+
"description": "The label for category reduce in sidebar docs"
61+
},
5862
"sidebar.docs.category.notice": {
5963
"message": "告警通知配置",
6064
"description": "The label for category notice in sidebar docs"
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
1+
---
2+
id: alarm_group
3+
title: Alarm Grouping
4+
sidebar_label: Alarm Grouping
5+
keywords: [ Open source monitoring system, alarm convergence, alarm grouping ]
6+
---
7+
8+
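> Alarm grouping merges alarms that share the specified group labels and deduplicates identical alarms that repeat within a time period. After a threshold rule triggers an alarm, or an external alarm is reported, the alarm enters alarm grouping, where it is grouped and deduplicated so that a flood of alarm messages does not cause an alarm storm.
9+
10+
## Grouping Policy Parameter Configuration
11+
12+
- Strategy Name: The name that uniquely identifies the grouping policy
13+
- Group Labels: The labels used to group alarms. Up to 10 labels can be added
14+
15+
> Label sources: monitors, threshold rules, and labels carried by external alarms
16+
17+
- Wait Time: How long to wait after a new alarm is generated. Identical alarms received during this time are grouped together. Default: 30 seconds
18+
19+
> When a new alarm is generated that cannot join an existing group, alarm grouping waits for the `wait time`. During that window, identical alarms and alarms that meet the grouping conditions are added to the group. Only once the time elapsed since the first alarm in the group exceeds the `wait time` are the grouped alarms sent to the alarm inhibition module for further processing.
20+
21+
- Interval time: The minimum interval between group alarm notifications, which keeps notifications from being sent too often. Default: 5 minutes
22+
- Repeat interval: The minimum interval between notifications for a repeated alarm, which avoids repeated notifications for an alarm that keeps firing. Default: 4 hours
23+
24+
**Note**: Only grouped alarms can be suppressed using inhibit rules.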
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
1+
---
2+
id: alarm_inhibit
3+
title: Alarm Inhibition
4+
sidebar_label: Alarm Inhibition
5+
keywords: [ Open source monitoring system, alarm convergence, alarm inhibition ]
6+
---
7+
8+
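> Alarm inhibition configures inhibition relationships between alarms. While one alarm is active, it can suppress other alarms; in other words, "important" alarms suppress "unimportant" ones. For example, the alarm raised when a server goes down suppresses the alarms raised by other services on that server.
9+
10+
## Prerequisites
11+
12+
- The alarm grouping rule is configured correctly
13+
14+
## Inhibit rule configuration
15+
16+
- Inhibit Rule Name: The name that uniquely identifies the inhibit rule;
17+
- Source Labels: When an alarm contains these labels, the target alarms are suppressed. Multiple labels can be added;
18+
> These labels identify the "important" alarm. An alarm must contain all source labels to suppress the alarms matched by the target labels.
19+
- Target Labels: Alarms matching these labels are suppressed;
20+
> These labels identify the "unimportant" alarms. An alarm must contain all target labels to be suppressed.
21+
- Equal Labels: Labels used to determine whether alarms are related. Up to 10 labels are supported;
22+
- Enabled: Enable or disable this inhibit rule.
23+
24+
## Example
25+
26+
Scenario: HertzBeat monitors two CentOS servers, 192.168.1.1 and 192.168.1.2, and the Redis services Redis-1 and Redis-2 deployed on them.
27+
The following threshold rules are configured:
28+
29+
- Monitor Centos Linux / Monitor availability. Bind label `server-status:down`
30+
- Monitor Redis database / Monitor availability. Bind label `redis-status:down`
31+
32+
To stop Redis alarms from being generated once the CentOS downtime alarm has fired, configure the following alarm inhibit rule:
33+
34+
- Source label: `server-status:down`
35+
- Target label: `redis-status:down`
36+
- Equal label: `instancehost`
37+
38+
When the downtime alarm for CentOS 192.168.1.1 is generated, the Redis-1 unavailable alarm is no longer generated. Meanwhile, if CentOS 192.168.1.2 is running normally and Redis-2 is unavailable, the Redis-2 unavailable alarm is still generated as usual.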

home/i18n/zh-cn/docusaurus-plugin-content-docs/current/help/guide.md

Lines changed: 7 additions & 0 deletions
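@@ -116,6 +116,13 @@ sidebar_label: Getting Started
116116
More details see&emsp;&#x1F449;&emsp;[Threshold alarm](alert_threshold) <br />
117117
&emsp;&emsp;&emsp;&#x1F449;&emsp;[Threshold expression](alert_threshold_expr)
118118

119+
### Alarm convergence
120+
121+
> Alarm grouping, alarm inhibition, and related features merge related alarms, which reduces alarm storms caused by a single event, cuts alarm noise, and improves alarm response efficiency.
122+
123+
More details see&emsp;&#x1F449;&emsp;[Alarm grouping](alarm_group) <br />
124+
&emsp;&emsp;&emsp;&#x1F449;&emsp;[Alarm inhibition](alarm_inhibit)
125+
119126
### Alarm notification
120127

121128
> After an alarm is triggered, in addition to appearing in the alarm center list, it can also be sent to designated recipients through a specified channel (email, DingTalk, WeChat, Feishu, etc.).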

home/sidebars.json

Lines changed: 6 additions & 1 deletion
@@ -227,6 +227,11 @@
227227
"label": "threshold",
228228
"items": ["help/alert_threshold", "help/alert_threshold_expr"]
229229
},
230+
{
231+
"type": "category",
232+
"label": "reduce",
233+
"items": ["help/alarm_group","help/alarm_inhibit"]
234+
},
230235
{
231236
"type": "category",
232237
"label": "notice",
@@ -344,7 +349,7 @@
344349
}
345350
]
346351
},
347-
352+
348353
{
349354
"type": "category",
350355
"label": "Others",
