Proposal
Use recording rules for postgres_mixin/alerts
Explanation
Evaluating complex Prometheus alerts directly on raw metrics without recording rules can be very resource-intensive, especially in production environments with multiple PostgreSQL instances. This can lead to slow queries in Prometheus itself, delayed alerts, and unnecessary load on the exporter and database.
Benefits of Recording Rules:
- Precompute expensive queries once, store results, and reference them in multiple alerts.
- Reduce Prometheus CPU and memory usage.
- Enable more stable and reliable alerting.
- Simplify complex alert expressions and reduce duplication.
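The deduplication benefit is easiest to see with the connection-usage check, which more than one alert needs (a minimal sketch; metric names are assumed to be those exposed by postgres_exporter):

```yaml
# Without a recording rule, every alert repeats the full expression:
- alert: PostgreSQLHighConnections
  expr: |
    sum by (instance) (pg_stat_activity_count{job="postgres"})
    /
    sum by (instance) (pg_settings_max_connections{job="postgres"}) > 0.8

# With a recording rule, the ratio is computed once per evaluation
# interval and every alert references the precomputed series:
- record: pg:connections:used_ratio
  expr: |
    sum by (instance) (pg_stat_activity_count{job="postgres"})
    /
    sum by (instance) (pg_settings_max_connections{job="postgres"})
- alert: PostgreSQLHighConnections
  expr: pg:connections:used_ratio > 0.8
```

The recorded series is also queryable on dashboards, so Grafana panels and alerts share one evaluation instead of each running the division themselves.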
An example PrometheusRule manifest, with one group of recording rules and one group of alerts that reference them:
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: postgres-alerts
  namespace: monitoring
  labels:
    app: postgres
spec:
  groups:
    - name: postgres-recording
      interval: 30s
      rules:
        # Transactions per second
        - record: pg:transactions:rate5m
          expr: |
            sum by (instance, datname) (
              rate(pg_stat_database_xact_commit{job="postgres"}[5m]) +
              rate(pg_stat_database_xact_rollback{job="postgres"}[5m])
            )
        # Cache hit ratio
        - record: pg:cache_hit_ratio:5m
          expr: |
            sum by (instance, datname) (
              rate(pg_stat_database_blks_hit{job="postgres"}[5m])
            )
            /
            sum by (instance, datname) (
              rate(pg_stat_database_blks_hit{job="postgres"}[5m]) +
              rate(pg_stat_database_blks_read{job="postgres"}[5m])
            )
        # Connection usage
        - record: pg:connections:used_ratio
          expr: |
            sum by (instance) (pg_stat_activity_count{job="postgres"})
            /
            sum by (instance) (pg_settings_max_connections{job="postgres"})
        # Rollback ratio
        - record: pg:rollback_ratio:5m
          expr: |
            sum by (instance, datname) (
              rate(pg_stat_database_xact_rollback{job="postgres"}[5m])
            )
            /
            sum by (instance, datname) (
              rate(pg_stat_database_xact_commit{job="postgres"}[5m]) +
              rate(pg_stat_database_xact_rollback{job="postgres"}[5m])
            )
        # Deadlocks per minute (5m rate scaled to per-minute)
        - record: pg:deadlocks:rate5m
          expr: |
            sum by (instance, datname) (
              rate(pg_stat_database_deadlocks{job="postgres"}[5m]) * 60
            )
        # Replication lag (seconds)
        - record: pg:replication_lag
          expr: |
            max by (instance) (
              pg_replication_lag{job="postgres"}
            )
        # Share of checkpoints that were timed (scheduled) rather than requested
        - record: pg:checkpoints_timed_ratio:5m
          expr: |
            sum by (instance) (
              rate(pg_stat_bgwriter_checkpoints_timed_total{job="postgres"}[5m])
            )
            /
            sum by (instance) (
              rate(pg_stat_bgwriter_checkpoints_timed_total{job="postgres"}[5m]) +
              rate(pg_stat_bgwriter_checkpoints_req_total{job="postgres"}[5m])
            )
    - name: postgres-alerts
      rules:
        - alert: PostgreSQLDown
          expr: max by (instance) (pg_up{job="postgres"}) != 1
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "PostgreSQL is down"
            description: "{{ $labels.instance }} is not responding"
        - alert: PostgreSQLMaxConnectionsReached
          expr: pg:connections:used_ratio > 0.95
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "PostgreSQL connection usage is close to max_connections"
            description: "{{ $labels.instance }} is using over 95% of max_connections"
        - alert: PostgreSQLHighConnections
          expr: pg:connections:used_ratio > 0.8
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High connection usage"
            description: "{{ $labels.instance }} is using over 80% of max_connections"
        - alert: PostgreSQLQPS
          expr: pg:transactions:rate5m > 10000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High transaction rate"
            description: "{{ $labels.instance }} DB {{ $labels.datname }} is processing over 10k transactions/s"
        - alert: PostgreSQLCacheHitRatioLow
          expr: pg:cache_hit_ratio:5m < 0.98
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Low cache hit ratio"
            description: "{{ $labels.instance }} DB {{ $labels.datname }} cache hit ratio is below 98%"
        - alert: PostgresTooManyRollbacks
          expr: pg:rollback_ratio:5m > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Too many rollbacks"
            description: "{{ $labels.instance }} is rolling back more than 10% of transactions"
        - alert: PostgresHighDeadlocks
          expr: pg:deadlocks:rate5m > 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Deadlocks detected"
            description: "{{ $labels.instance }} DB {{ $labels.datname }} exceeds 5 deadlocks per minute"
        - alert: PostgresReplicationLag
          expr: pg:replication_lag > 300
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Replication lag"
            description: "{{ $labels.instance }} replication lag exceeds 5 minutes"
        - alert: PostgresExporterErrors
          expr: pg_exporter_last_scrape_error{job="postgres"} > 0
          for: 30m
          labels:
            severity: critical
          annotations:
            summary: "Exporter errors"
            description: "{{ $labels.instance }} postgres_exporter scrapes have been failing for 30 minutes"
        - alert: PostgresTooManyCheckpointsRequested
          expr: pg:checkpoints_timed_ratio:5m < 0.5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Too many checkpoints requested"
            description: "{{ $labels.instance }} fewer than half of checkpoints are timed; checkpoint pressure is high"
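The manifest above is written for the Prometheus Operator. In a plain Prometheus or mixin deployment, the same groups can be shipped as an ordinary rule file and loaded via rule_files — a sketch, with an assumed file path:

```yaml
# prometheus.yml (fragment). evaluation_interval applies to any group
# that does not set its own interval; the recording group above
# overrides it with interval: 30s.
global:
  evaluation_interval: 30s
rule_files:
  - /etc/prometheus/rules/postgres_rules.yml  # hypothetical path
```

Rule files can be validated before deployment with `promtool check rules <file>`, which catches PromQL syntax errors and malformed group definitions.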