Skip to content

Use recordings for postgres_mixin/alerts #1278

@Leaced

Description

@Leaced

Proposal

Use recording rules for postgres_mixin/alerts

Explanation

Evaluating complex Prometheus alerts directly on raw metrics without recording rules can be very resource-intensive, especially in production environments with multiple PostgreSQL instances. This can lead to slow queries in Prometheus itself, delayed alerts, and unnecessary load on the exporter and database.

Benefits of Recording Rules:

  • Precompute expensive queries once, store results, and reference them in multiple alerts.
  • Reduce Prometheus CPU and memory usage.
  • Enable more stable and reliable alerting.
  • Simplify complex alert expressions and reduce duplication.

An example:

---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: postgres-alerts
  namespace: monitoring
  labels:
    app: postgres
spec:
  groups:
  - name: postgres-recording
    interval: 30s
    rules:

      # Transactions per second
      - record: pg:transactions:rate5m
        expr: |
          sum by (instance, datname) (
            irate(pg_stat_database_xact_commit{job="postgres"}[5m]) +
            irate(pg_stat_database_xact_rollback{job="postgres"}[5m])
          )

      # Cache hit ratio
      - record: pg:cache_hit_ratio:5m
        expr: |
          sum by (instance, datname) (
            rate(pg_stat_database_blks_hit{job="postgres"}[5m])
          )
          /
          sum by (instance, datname) (
            rate(pg_stat_database_blks_hit{job="postgres"}[5m]) +
            rate(pg_stat_database_blks_read{job="postgres"}[5m])
          )

      # Connection usage
      - record: pg:connections:used_ratio
        expr: |
          sum by (instance) (pg_stat_activity_count{job="postgres"})
          /
          sum by (instance) (pg_settings_max_connections{job="postgres"})

      # Rollback ratio
      - record: pg:rollback_ratio:5m
        expr: |
          sum by (instance, datname) (
            rate(pg_stat_database_xact_rollback{job="postgres"}[5m])
          )
          /
          sum by (instance, datname) (
            rate(pg_stat_database_xact_commit{job="postgres"}[5m]) +
            rate(pg_stat_database_xact_rollback{job="postgres"}[5m])
          )

      # Deadlocks per minute
      - record: pg:deadlocks:rate5m
        expr: |
          sum by (instance, datname) (
            rate(pg_stat_database_deadlocks{job="postgres"}[5m]) * 60
          )

      # Replication lag
      - record: pg:replication_lag
        expr: |
          max by (instance) (
            pg_replication_lag{job="postgres"}
          )

      # Timed Checkpoints
      - record: pg:checkpoints_timed_ratio:5m
        expr: |
          sum by (instance) (
            rate(pg_stat_bgwriter_checkpoints_timed_total{job="postgres"}[5m])
          )
          /
          sum by (instance) (
            rate(pg_stat_bgwriter_checkpoints_timed_total{job="postgres"}[5m]) +
            rate(pg_stat_bgwriter_checkpoints_req_total{job="postgres"}[5m])
          )

  - name: postgres-alerts
    rules:

    - alert: PostgreSQLDown
      expr: max by (instance) (pg_up{job="postgres"}) != 1
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "PostgreSQL is down"
        description: "{{ $labels.instance }} is not responding"

    - alert: PostgreSQLMaxConnectionsReached
      expr: pg:connections:used_ratio > 0.95
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Postgres connections count is over the maximum amount."
        description: "{{ $labels.instance }} exceeds max connections"

    - alert: PostgreSQLHighConnections
      expr: pg:connections:used_ratio > 0.8
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High connection usage"
        description: "{{ $labels.instance }} > 80% connections used"

    - alert: PostgreSQLQPS
      expr: pg:transactions:rate5m > 10000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High QPS"
        description: "{{ $labels.instance }} DB {{ $labels.datname }} high throughput"

    - alert: PostgreSQLCacheHitRatioLow
      expr: pg:cache_hit_ratio:5m < 0.98
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Low cache hit ratio"
        description: "{{ $labels.instance }} DB {{ $labels.datname }} cache hit ratio low"

    - alert: PostgresTooManyRollbacks
      expr: pg:rollback_ratio:5m > 0.1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Too many rollbacks"
        description: "{{ $labels.instance }} high rollback ratio"

    - alert: PostgresHighDeadlocks
      expr: pg:deadlocks:rate5m > 5
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Deadlocks detected"
        description: "{{ $labels.instance }} DB {{ $labels.datname }} high deadlock rate"

    - alert: PostgresReplicationLag
      expr: pg:replication_lag > 300
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Replication lag"
        description: "{{ $labels.instance }} replication lag > 5 min"

    - alert: PostgresExporterErrors
      expr: pg_exporter_last_scrape_error{job="postgres"} > 0
      for: 30m
      labels:
        severity: critical
      annotations:
        summary: "Exporter errors"
        description: "{{ $labels.instance }} exporter failing"

    - alert: PostgresTooManyCheckpointsRequested
      expr: pg:checkpoints_timed_ratio:5m < 0.5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Too many checkpoints requested"
        description: "{{ $labels.instance }} checkpoint pressure high"

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions