diff --git a/README.md b/README.md
index 111f3a53..6441b132 100644
--- a/README.md
+++ b/README.md
@@ -69,9 +69,10 @@ State machine-related issues often require several data checks and conditional l
 3. [Querying data from databases](./docs/querying.md)
 4. [Registering a monitor](./docs/monitor_registering.md)
 5. [How to run](./docs/how_to_run.md)
-6. [Plugins](./docs/plugins.md)
+6. [Monitoring Sentinela](./docs/monitoring_sentinela.md)
+7. [Plugins](./docs/plugins.md)
     1. [Slack](./docs/plugin_slack.md)
-7. Interacting with Sentinela
+8. Interacting with Sentinela
     1. [HTTP server](./docs/http_server.md)
-8. Special cases
+9. Special cases
     1. [Dropping issues](./docs/dropping_issues.md)
diff --git a/docs/monitoring_sentinela.md b/docs/monitoring_sentinela.md
new file mode 100644
index 00000000..a06ab421
--- /dev/null
+++ b/docs/monitoring_sentinela.md
@@ -0,0 +1,44 @@
+# Monitoring Sentinela
+Sentinela provides logs and Prometheus metrics that can be used to monitor its status.
+
+## Logs
+Logs are separated into **informational**, **warning** and **error** logs.
+- **Informational logs**: These logs provide details about the normal operation and execution progress of Sentinela. They are helpful for tracking routine activities and verifying that the system is functioning as expected.
+- **Warning logs**: Warning logs indicate potential issues or suboptimal behaviors that may not immediately impact the system’s operation but should be addressed to prevent future problems. While the system may continue to operate, these logs may point to areas that require attention or further investigation.
+- **Error logs**: Error logs signal critical issues that require immediate attention. They typically indicate a failure in Sentinela, an external service, or a monitor. These logs should be prioritized as they often point to significant problems that can affect the stability or performance of the system.
+
+## Metrics
+The Prometheus metrics provided by Sentinela are:
+- `controller_monitors_processed_count`: Counter - Count of monitors processed by the controller
+- `controller_monitor_not_registered_count`: Counter - Count of times the controller tries to process a monitor that isn't registered
+- `controller_task_queue_error_count`: Counter - Count of times the controller fails to queue a task
+- `executor_message_count`: Counter - Count of messages consumed by the executors
+    - Labels: `message_type`
+- `executor_message_error_count`: Counter - Count of errors when processing messages
+    - Labels: `message_type`
+- `executor_message_processing_count`: Gauge - Count of messages being processed by the executors
+    - Labels: `message_type`
+- `monitor_execution_error`: Counter - Error count for monitors
+    - Labels: `monitor_id`, `monitor_name`
+- `monitor_execution_timeout`: Counter - Timeout count for monitors
+    - Labels: `monitor_id`, `monitor_name`
+- `monitor_running`: Gauge - Flag indicating if the monitor is running
+    - Labels: `monitor_id`, `monitor_name`
+- `monitor_execution_seconds`: Summary - Time to run the monitor
+    - Labels: `monitor_id`, `monitor_name`
+- `monitor_execution_search_seconds`: Summary - Time to run the monitor's 'search' routine
+    - Labels: `monitor_id`, `monitor_name`
+- `monitor_execution_update_seconds`: Summary - Time to run the monitor's 'update' routine
+    - Labels: `monitor_id`, `monitor_name`
+- `monitor_execution_solve_seconds`: Summary - Time to run the monitor's 'solve' routine
+    - Labels: `monitor_id`, `monitor_name`
+- `monitor_execution_alert_seconds`: Summary - Time to run the monitor's 'alert' routine
+    - Labels: `monitor_id`, `monitor_name`
+- `reaction_execution_error`: Counter - Error count for reactions
+    - Labels: `monitor_id`, `monitor_name`, `event_name`
+- `reaction_execution_timeout`: Counter - Timeout count for reactions
+    - Labels: `monitor_id`, `monitor_name`, `event_name`
+- `reaction_execution_seconds`: Summary - Time to run the reaction
+    - Labels: `monitor_id`, `monitor_name`, `event_name`
+- `monitors_ready_timeout_count`: Counter - Count of times the application timed out waiting for monitors to be ready
+- `monitor_not_registered_count`: Counter - Count of times a monitor is not registered after a load attempt
\ No newline at end of file
diff --git a/src/components/controller/controller.py b/src/components/controller/controller.py
index 4f0f05d1..7de471fd 100644
--- a/src/components/controller/controller.py
+++ b/src/components/controller/controller.py
@@ -24,14 +24,14 @@ last_monitor_processed_at: datetime
 prometheus_monitors_processed_count = prometheus_client.Counter(
-    "monitors_processed_count", "Count of monitors processed"
+    "controller_monitors_processed_count", "Count of monitors processed by the controller"
 )
 prometheus_monitor_not_registered_count = prometheus_client.Counter(
     "controller_monitor_not_registered_count",
     "Count of times the controller tries to process a monitor that isn't registered",
 )
 prometheus_task_queue_error_count = prometheus_client.Counter(
-    "task_queue_error_count",
+    "controller_task_queue_error_count",
     "Count of times the controller fails to queue a task",
 )
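
Review note: one way to consume the documented executor metrics is to compute an error ratio per `message_type` from a scrape of the metrics endpoint. The sketch below is only an illustration and is not part of this change. It assumes the counters appear in the text exposition with a `_total` suffix, which is how the Python `prometheus_client` library normally exposes counters. The sample payload, its label values (`process_monitor`, `request`) and its numbers are invented, and the helper names `parse_samples` and `error_ratio_by_message_type` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical scrape output: label values and numbers are invented for illustration.
SAMPLE_SCRAPE = """\
# HELP executor_message_count_total Count of messages consumed by the executors
# TYPE executor_message_count_total counter
executor_message_count_total{message_type="process_monitor"} 200.0
executor_message_count_total{message_type="request"} 50.0
executor_message_error_count_total{message_type="process_monitor"} 4.0
"""

# Matches `name{labels} value`; the label block is optional.
SAMPLE_LINE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)"
)
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return {metric_name: {sorted label tuple: value}} for a text exposition payload."""
    samples = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        labels = tuple(sorted(LABEL_PAIR.findall(match.group("labels") or "")))
        samples[match.group("name")][labels] = float(match.group("value"))
    return dict(samples)


def error_ratio_by_message_type(samples):
    """Divide the error counter by the consumed counter for each message type."""
    totals = samples.get("executor_message_count_total", {})
    errors = samples.get("executor_message_error_count_total", {})
    ratios = {}
    for labels, count in totals.items():
        message_type = dict(labels).get("message_type", "")
        # A label set with no error sample is treated as zero errors.
        ratios[message_type] = errors.get(labels, 0.0) / count if count else 0.0
    return ratios


ratios = error_ratio_by_message_type(parse_samples(SAMPLE_SCRAPE))
```

In a real deployment the same ratio would more likely be computed in a Prometheus query or alert rule. The parser here is deliberately minimal: it ignores timestamps and does not handle escaped quotes or braces inside label values.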