Merged
7 changes: 4 additions & 3 deletions README.md
@@ -69,9 +69,10 @@ State machine-related issues often require several data checks and conditional l
 3. [Querying data from databases](./docs/querying.md)
 4. [Registering a monitor](./docs/monitor_registering.md)
 5. [How to run](./docs/how_to_run.md)
-6. [Plugins](./docs/plugins.md)
+6. [Monitoring Sentinela](./docs/monitoring_sentinela.md)
+7. [Plugins](./docs/plugins.md)
     1. [Slack](./docs/plugin_slack.md)
-7. Interacting with Sentinela
+8. Interacting with Sentinela
     1. [HTTP server](./docs/http_server.md)
-8. Special cases
+9. Special cases
     1. [Dropping issues](./docs/dropping_issues.md)
44 changes: 44 additions & 0 deletions docs/monitoring_sentinela.md
@@ -0,0 +1,44 @@
# Monitoring Sentinela
Sentinela provides logs and Prometheus metrics that can be used to monitor its status.

## Logs
Logs are separated into three levels: **informational**, **warning**, and **error**.
- **Informational logs**: These logs provide details about the normal operation and execution progress of Sentinela. They are helpful for tracking routine activities and verifying that the system is functioning as expected.
- **Warning logs**: Warning logs indicate potential issues or suboptimal behaviors that may not immediately impact the system’s operation but should be addressed to prevent future problems. While the system may continue to operate, these logs may point to areas that require attention or further investigation.
- **Error logs**: Error logs signal critical issues that require immediate attention. They typically indicate a failure in Sentinela, an external service, or a monitor. These logs should be prioritized as they often point to significant problems that can affect the stability or performance of the system.
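
As an illustration of the three levels, the sketch below uses Python's standard `logging` module to count records per level, which is one way to surface error logs for alerting. The `LevelCounter` handler and the `sentinela_example` logger name are hypothetical and are not part of Sentinela.

```python
import logging
from collections import Counter


class LevelCounter(logging.Handler):
    """Hypothetical handler that counts log records by level name."""

    def __init__(self) -> None:
        super().__init__()
        self.counts = Counter()

    def emit(self, record: logging.LogRecord) -> None:
        # Increment the counter for the record's level (INFO, WARNING, ERROR, ...)
        self.counts[record.levelname] += 1


# Assumed logger name, used only for this example
logger = logging.getLogger("sentinela_example")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example's records out of the root logger
level_counter = LevelCounter()
logger.addHandler(level_counter)

logger.info("Monitor processed")
logger.warning("Monitor took longer than expected")
logger.error("Monitor execution failed")
logger.debug("Not counted: below the INFO threshold")

print(dict(level_counter.counts))
```

A non-zero `ERROR` count is the signal to investigate first, while `WARNING` counts can be reviewed on a slower cadence.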

## Metrics
The Prometheus metrics provided by Sentinela are:
- `controller_monitors_processed_count`: Counter - Count of monitors processed by the controller
- `controller_monitor_not_registered_count`: Counter - Count of times the controller tries to process a monitor that isn't registered
- `controller_task_queue_error_count`: Counter - Count of times the controller fails to queue a task
- `executor_message_count`: Counter - Count of messages consumed by the executors
  - Labels: `message_type`
- `executor_message_error_count`: Counter - Count of errors when processing messages
  - Labels: `message_type`
- `executor_message_processing_count`: Gauge - Count of messages being processed by the executors
  - Labels: `message_type`
- `monitor_execution_error`: Counter - Error count for monitors
  - Labels: `monitor_id`, `monitor_name`
- `monitor_execution_timeout`: Counter - Timeout count for monitors
  - Labels: `monitor_id`, `monitor_name`
- `monitor_running`: Gauge - Flag indicating if the monitor is running
  - Labels: `monitor_id`, `monitor_name`
- `monitor_execution_seconds`: Summary - Time to run the monitor
  - Labels: `monitor_id`, `monitor_name`
- `monitor_execution_search_seconds`: Summary - Time to run the monitor's 'search' routine
  - Labels: `monitor_id`, `monitor_name`
- `monitor_execution_update_seconds`: Summary - Time to run the monitor's 'update' routine
  - Labels: `monitor_id`, `monitor_name`
- `monitor_execution_solve_seconds`: Summary - Time to run the monitor's 'solve' routine
  - Labels: `monitor_id`, `monitor_name`
- `monitor_execution_alert_seconds`: Summary - Time to run the monitor's 'alert' routine
  - Labels: `monitor_id`, `monitor_name`
- `reaction_execution_error`: Counter - Error count for reactions
  - Labels: `monitor_id`, `monitor_name`, `event_name`
- `reaction_execution_timeout`: Counter - Timeout count for reactions
  - Labels: `monitor_id`, `monitor_name`, `event_name`
- `reaction_execution_seconds`: Summary - Time to run the reaction
  - Labels: `monitor_id`, `monitor_name`, `event_name`
- `monitors_ready_timeout_count`: Counter - Count of times the application timed out waiting for monitors to be ready
- `monitor_not_registered_count`: Counter - Count of times a monitor is not registered after a load attempt
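
To show how these metrics might be consumed, the sketch below parses a sample of the Prometheus text exposition format and derives a mean execution time from a Summary. The sample values, the `parse_samples` helper, and the monitor name `example` are made up for illustration. It assumes the Python `prometheus_client` conventions, where Counters are exposed with a `_total` suffix and Summaries as `_count` and `_sum` series.

```python
import re

# Made-up scrape output; real values would come from Sentinela's metrics endpoint
SAMPLE = """\
# HELP monitor_execution_error_total Error count for monitors
# TYPE monitor_execution_error_total counter
monitor_execution_error_total{monitor_id="1",monitor_name="example"} 3.0
# TYPE monitor_execution_seconds summary
monitor_execution_seconds_count{monitor_id="1",monitor_name="example"} 4.0
monitor_execution_seconds_sum{monitor_id="1",monitor_name="example"} 10.0
"""

# Simplified patterns: optional timestamps after the value are not handled
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Parse exposition text into {(metric name, sorted label pairs): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines this simplified parser does not understand
        labels = tuple(sorted(LABEL_RE.findall(match.group("labels") or "")))
        samples[(match.group("name"), labels)] = float(match.group("value"))
    return samples


samples = parse_samples(SAMPLE)
labels = (("monitor_id", "1"), ("monitor_name", "example"))

# Mean run time = total observed seconds / number of observations
mean_seconds = (
    samples[("monitor_execution_seconds_sum", labels)]
    / samples[("monitor_execution_seconds_count", labels)]
)
errors = samples[("monitor_execution_error_total", labels)]
print(f"mean execution time: {mean_seconds:.2f}s, errors: {errors:.0f}")
```

In practice a Prometheus server would scrape and store these series, and the same ratio could be computed with a query over the `_sum` and `_count` series rather than by hand.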
4 changes: 2 additions & 2 deletions src/components/controller/controller.py
@@ -24,14 +24,14 @@
 last_monitor_processed_at: datetime
 
 prometheus_monitors_processed_count = prometheus_client.Counter(
-    "monitors_processed_count", "Count of monitors processed"
+    "controller_monitors_processed_count", "Count of monitors processed by the controller"
 )
 prometheus_monitor_not_registered_count = prometheus_client.Counter(
     "controller_monitor_not_registered_count",
     "Count of times the controller tries to process a monitor that isn't registered",
 )
 prometheus_task_queue_error_count = prometheus_client.Counter(
-    "task_queue_error_count",
+    "controller_task_queue_error_count",
     "Count of times the controller fails to queue a task",
 )
