Commit ba48234

Merge pull request #134 from GabrielSalla/add-heartbeat-monitoring
Add heartbeat monitoring
2 parents b712422 + 30092cb commit ba48234

12 files changed, with 142 additions and 6 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -88,7 +88,7 @@ Common use cases:
 State machine-related issues often require several data checks and conditional logic to identify. These issues are typically difficult to capture using standard logs and metrics but can be easily addressed using Sentinela Monitoring.

 # Dashboard
-Sentinela provides a web dashboard with 2 sections:
+Sentinela provides a web dashboard, by default at port `8000`, with 2 sections:
 1. an overview of the monitors and their alerts and issues
 2. a monitor editor, where you can create and edit monitors directly from the browser

configs/configs-scalable.yaml

Lines changed: 2 additions & 0 deletions
@@ -38,6 +38,8 @@ http_server:

 time_zone: America/Sao_Paulo

+heartbeat_time: 2
+
 controller_process_schedule: "* * * * *"
 controller_concurrency: 5
 controller_procedures:

configs/configs.yaml

Lines changed: 2 additions & 0 deletions
@@ -33,6 +33,8 @@ http_server:

 time_zone: America/Sao_Paulo

+heartbeat_time: 2
+
 controller_process_schedule: "* * * * *"
 controller_concurrency: 5
 controller_procedures:

docs/configuration_file.md

Lines changed: 10 additions & 5 deletions
@@ -65,22 +65,27 @@ application_queue:
 ## Time Zone
 - `time_zone`: String. Time zone to use for cron scheduling and notification messages.

+## Heartbeat
+- `heartbeat_time`: Integer. Time, in seconds, between each heartbeat. This heartbeat is used to identify when a task is not yielding the control back to the event loop for too much time, generating a Warning log.
+
 ## Controller Settings
 - `controller_process_schedule`: String using Cron format. Schedule to check if monitors need to be processed.
 - `controller_concurrency`: Integer. Number of monitors that can be processed at the same time by the Controller.
 - `controller_procedures`: Map. Procedures to be executed by the Controller and their settings.
-- `controller_procedures.monitors_stuck`: Map. Settings for the procedure to fix monitors stuck in "queued" or "running" status.
-- `controller_procedures.monitors_stuck.schedule`: String using Cron format. Schedule to execute the `monitors_stuck` procedure.
-- `controller_procedures.monitors_stuck.params.time_tolerance`: Integer. Time tolerance in seconds for a monitor to be considered as stuck. This parameter is directly impacted by the `executor_monitor_heartbeat_time` setting and the recommended value is 2 times the heartbeat time.
-- `controller_procedures.notifications_alert_solved.schedule`: String using Cron format. Schedule to execute the `notifications_alert_solved` procedure.
+    - `monitors_stuck`: Map. Settings for the procedure to fix monitors stuck in "queued" or "running" status.
+        - `schedule`: String using Cron format. Schedule to execute the `monitors_stuck` procedure.
+        - `params`: Map. Configuration parameters for the `monitors_stuck` procedure.
+            - `time_tolerance`: Integer. Time tolerance in seconds for a monitor to be considered as stuck. This parameter is directly impacted by the `executor_monitor_heartbeat_time` setting and the recommended value is 2 times the heartbeat time.
+    - `notifications_alert_solved`: Map. Settings for the procedure to identify and fix active notifications linked to alerts that have already been solved.
+        - `schedule`: String using Cron format. Schedule to execute the `notifications_alert_solved` procedure.

 ## Executor Settings
 - `executor_concurrency`: Integer. Number of tasks that can be executed at the same time by each Executor.
 - `executor_sleep`: Integer. Time, in seconds, the Executor will sleep when there are no tasks in the queue before trying again.
 - `executor_monitor_timeout`: Integer. Timeout, in seconds, for monitor execution.
 - `executor_reaction_timeout`: Integer. Timeout, in seconds, for reactions execution.
 - `executor_request_timeout`: Integer. Timeout, in seconds, for requests execution.
-- `executor_monitor_heartbeat_time`: Integer. Time, in seconds, between each monitor heartbeat. This parameter impacts the controller procedure `monitors_stuck.time_tolerance` parameter.
+- `executor_monitor_heartbeat_time`: Integer. Time, in seconds, between each executor heartbeat during monitor execution. This parameter impacts the controller procedure `monitors_stuck.time_tolerance` parameter.

 ## Issues Creation
 - `max_issues_creation`: Integer. Maximum number of issues that can be created by each monitor in a single search. Can be overridden by the monitors' configuration.
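The documented rule that `time_tolerance` should be 2 times `executor_monitor_heartbeat_time` can be written as a small helper. This is only a sketch: the function name `recommended_time_tolerance` and the example value of 5 seconds are illustrative assumptions and are not part of Sentinela.

```python
def recommended_time_tolerance(heartbeat_time: int, factor: int = 2) -> int:
    """Return a suggested `time_tolerance`, in seconds, for a given heartbeat time.

    The default factor of 2 follows the recommendation in the configuration docs.
    """
    if heartbeat_time <= 0:
        raise ValueError("heartbeat_time must be a positive number of seconds")
    return heartbeat_time * factor


# Hypothetical value for `executor_monitor_heartbeat_time`, chosen for illustration
executor_monitor_heartbeat_time = 5
time_tolerance = recommended_time_tolerance(executor_monitor_heartbeat_time)
print(f"time_tolerance: {time_tolerance}")
```

A tolerance below the heartbeat time would presumably flag healthy monitors as stuck, since a monitor that is still running could not have refreshed its heartbeat yet.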

docs/http_server.md

Lines changed: 3 additions & 0 deletions
@@ -1,6 +1,9 @@
 # HTTP server
 The HTTP server provides an API to interact with Sentinela. The available routes are organized into two main categories, based on the deployment setup.

+> [!IMPORTANT]
+> By default the API is served at port `8000`. The docker compose files also expose the port `8000`, so if the port for the server changes, the compose files should be updated accordingly. Another option is to keep the server port at `8000` and change only the compose files. Using the mapping `8080:8000`, for example, keeps the server running at port `8000` inside the container while making it accessible through host port `8080`.
+
 If the container is deployed with the **Controller** (either standalone or alongside the Executor in the same container), all routes are available, allowing interactions with Monitors, Issues, Alerts and the dashboard.

 If the container is deployed with only the **Executor**, only base routes are available.

docs/monitoring_sentinela.md

Lines changed: 1 addition & 0 deletions
@@ -48,5 +48,6 @@ The Prometheus metrics provided by Sentinela are:
     - Labels: `action_name`
 - `executor_request_execution_seconds`: Summary - Time to run the request
     - Labels: `action_name`
+- `heartbeat_average_time`: Gauge - Average time between heartbeats in seconds
 - `registry_monitors_ready_timeout_count`: Counter - Count of times the application timed out waiting for monitors to be ready
 - `registry_monitor_not_registered_count`: Counter - Count of times a monitor is not registered after a load attempt

resources/kubernetes_template/config_map.yaml

Lines changed: 2 additions & 0 deletions
@@ -44,6 +44,8 @@ data:

 time_zone: America/Sao_Paulo

+heartbeat_time: 2
+
 controller_process_schedule: "* * * * *"
 controller_concurrency: 5
 controller_procedures:
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+from .heartbeat import run
+
+__all__ = [
+    "run",
+]
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+import logging
+import time
+from collections import deque
+from itertools import pairwise
+
+import prometheus_client
+
+import utils.app as app
+from configs import configs
+
+_logger = logging.getLogger("heartbeat")
+
+prometheus_heartbeat_average_time = prometheus_client.Gauge(
+    "heartbeat_average_time", "Average time between heartbeats in seconds"
+)
+
+
+def _is_heartbeat_delayed(timestamps: deque[float], threshold: float) -> bool:
+    """Determine if the heartbeat is delayed based on the average latency between timestamps"""
+    if len(timestamps) < 2:
+        return False
+
+    latencies = [b - a for a, b in pairwise(timestamps)]
+    average_latency = sum(latencies) / len(latencies)
+    prometheus_heartbeat_average_time.set(average_latency)
+    return average_latency > threshold
+
+
+async def run() -> None:
+    """Create a heartbeat for the application to detect when some tasks are not yielding control
+    back to the event loop. If the heartbeat is delayed, a warning message is logged."""
+    timestamps = deque[float](maxlen=10)
+    last_warning_timestamp = 0.0
+
+    while app.running():
+        timestamp = time.time()
+        timestamps.append(timestamp)
+        heartbeat_delayed = _is_heartbeat_delayed(timestamps, configs.heartbeat_time * 1.05)
+
+        # Prevent warning messages from being sent too frequently
+        can_warn = timestamp - last_warning_timestamp > 10
+        if can_warn and heartbeat_delayed:
+            _logger.warning(
+                "High average heartbeat interval. "
+                "Blocking operations are preventing tasks from executing"
+            )
+            last_warning_timestamp = timestamp
+
+        await app.sleep(configs.heartbeat_time)

src/configs/configs_loader.py

Lines changed: 2 additions & 0 deletions
@@ -62,6 +62,8 @@ class Configs:

     time_zone: str

+    heartbeat_time: int
+
     controller_process_schedule: str
     controller_concurrency: int
     controller_procedures: dict[str, ControllerProcedureConfig]
