Skip to content

Latest commit

Β 

History

History
391 lines (298 loc) Β· 14.2 KB

File metadata and controls

391 lines (298 loc) Β· 14.2 KB

βœ… Grafana Monitoring for Disaster Recovery - Implementation Complete

Data implementazione: 2025-11-09 Versione: 1.0.0 Status: βœ… PRODUCTION READY


πŸ“Š Overview

Sistema completo di monitoring Grafana per il disaster recovery system di InsightLearn K3s cluster.

Metrics Server: HTTP endpoint su porta 9101 che espone metriche Prometheus in tempo reale Grafana Dashboard: Dashboard interattivo con 13 pannelli per monitoring completo (inclusi disk space) Auto-Update: Metriche aggiornate automaticamente ad ogni backup/restore


βœ… Componenti Installati

1. DR Metrics Exporter

  • Script: k8s/export-dr-metrics.sh
  • Output: Prometheus text format metrics
  • Metriche esposte: 18 metriche chiave (incluse 5 metriche disk space - vedi sezione Metrics)
  • Trigger: Chiamato automaticamente da backup/restore scripts

2. DR Metrics HTTP Server

  • Server: k8s/dr-metrics-server.py
  • Port: 9101
  • Endpoints:
    • /metrics - Prometheus metrics (scrapeable)
    • /health - Health check
  • Systemd Service: dr-metrics-server.service βœ… RUNNING
  • Auto-start: Enabled at boot

3. Grafana Dashboard

4. Kubernetes Deployment (Optional)


πŸ“ˆ Metrics Exposed

Backup Metrics

Metric Type Description Values
insightlearn_dr_backup_last_success_timestamp_seconds gauge Unix timestamp of last successful backup timestamp
insightlearn_dr_backup_size_bytes gauge Size of latest backup in bytes 6150 (current)
insightlearn_dr_backup_last_status gauge Last backup status 1=success, 0=failure
insightlearn_dr_backup_age_seconds gauge Age of latest backup in seconds calculated
insightlearn_dr_next_backup_seconds gauge Seconds until next scheduled backup calculated

Restore Metrics

Metric Type Description Values
insightlearn_dr_restore_service_enabled gauge Auto-restore service enabled 1=yes, 0=no
insightlearn_dr_restore_service_active gauge Auto-restore service active 1=yes, 0=no
insightlearn_dr_last_restore_timestamp_seconds gauge Unix timestamp of last restore timestamp

Cloudflare Tunnel Metrics

Metric Type Description Values
insightlearn_dr_cloudflare_service_enabled gauge Cloudflare tunnel service enabled 1=yes, 0=no
insightlearn_dr_cloudflare_service_active gauge Cloudflare tunnel service active 1=yes, 0=no
insightlearn_dr_cloudflare_process_running gauge Cloudflare process running 1=yes, 0=no
insightlearn_dr_external_access gauge External access check 1=OK, 0=unreachable

System Metrics

Metric Type Description Values
insightlearn_dr_cron_job_configured gauge Backup cron job configured 1=yes, 0=no
insightlearn_dr_k3s_pods_running gauge Number of running pods in cluster count
insightlearn_dr_k3s_pods_total gauge Total number of pods in cluster count

Disk Space Metrics πŸ†•

Metric Type Description Values
insightlearn_dr_disk_total_bytes gauge Total disk space for backup location bytes (75 GB)
insightlearn_dr_disk_used_bytes gauge Used disk space for backup location bytes (61 GB)
insightlearn_dr_disk_available_bytes gauge Available disk space for backup location bytes (13 GB)
insightlearn_dr_disk_usage_percent gauge Disk usage percentage for backup location 0-100 (82%)
insightlearn_dr_backup_count gauge Number of backup files maintained count (2)

🎨 Grafana Dashboard Layout

Row 1: Key Status Indicators

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Last Backup  β”‚  Backup Age  β”‚  Backup Size β”‚  Cloudflare  β”‚
β”‚   Status     β”‚              β”‚              β”‚    Tunnel    β”‚
β”‚              β”‚              β”‚              β”‚              β”‚
β”‚   βœ“ OK       β”‚   45m        β”‚   6.15 KB    β”‚   βœ“ UP       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Row 2: Time Series Graphs

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    Backup Size History           β”‚      K3s Cluster Pods            β”‚
β”‚                                  β”‚                                  β”‚
β”‚  [Graph showing size over time]  β”‚  [Graph: Running vs Total]       β”‚
β”‚                                  β”‚                                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Row 3: Service Status

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Auto-Restore β”‚  Backup Cron β”‚   External   β”‚  Next Backup β”‚
β”‚   Service    β”‚      Job     β”‚    Access    β”‚      In      β”‚
β”‚              β”‚              β”‚              β”‚              β”‚
β”‚  βœ“ ENABLED   β”‚  βœ“ ENABLED   β”‚    βœ“ OK      β”‚    15m       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Row 4: Disk Space Monitoring πŸ†•

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Disk Usage %       β”‚  Available Space     β”‚   Backup Count       β”‚
β”‚                      β”‚                      β”‚                      β”‚
β”‚      82%             β”‚      13 GB           β”‚         2            β”‚
β”‚   (Yellow)           β”‚    (Green)           β”‚     (Green)          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸš€ Installation Summary

Automated Installation

cd /home/mpasqui/insightlearn_WASM/InsightLearn_WASM/k8s
sudo ./install-dr-grafana-monitoring.sh

Installation Steps Completed:

  1. βœ… Python3 verified (3.12.9)
  2. βœ… DR metrics server systemd service installed & started
  3. βœ… Metrics endpoint verified (http://localhost:9101/metrics)
  4. βœ… Grafana dashboard imported successfully
  5. βœ… Backup/restore scripts updated to export metrics

πŸ“Š Current Status (2025-11-09 16:45)

Metrics Server

Service: dr-metrics-server.service
Status: βœ… active (running)
Port: 9101
Uptime: 12 minutes

Sample Metrics Output

insightlearn_dr_backup_last_success_timestamp_seconds 1762702380
insightlearn_dr_backup_size_bytes 6150
insightlearn_dr_backup_last_status 1
insightlearn_dr_restore_service_enabled 1
insightlearn_dr_cloudflare_service_enabled 1
insightlearn_dr_cloudflare_process_running 1
insightlearn_dr_external_access 1

Grafana Dashboard


πŸ”§ Configuration

Prometheus Scrape Config

Add to prometheus.yml:

scrape_configs:
  - job_name: 'insightlearn-disaster-recovery'
    static_configs:
      - targets: ['localhost:9101']
        labels:
          service: 'disaster-recovery'
          environment: 'production'
    scrape_interval: 60s
    scrape_timeout: 30s
    metrics_path: /metrics

Docker Compose Integration

Add to docker-compose.yml:

  dr-metrics:
    image: python:3.11-slim
    container_name: insightlearn-dr-metrics
    command: python3 /app/dr-metrics-server.py --port 9101 --host 0.0.0.0
    ports:
      - "9101:9101"
    volumes:
      - ./k8s:/app:ro
      - /var/backups/k3s-cluster:/var/backups/k3s-cluster:ro
      - /var/log:/var/log:ro
    restart: always

πŸ†˜ Troubleshooting

Metrics Server Not Responding

# Check service status
sudo systemctl status dr-metrics-server.service

# View logs
sudo journalctl -u dr-metrics-server.service -n 50 -f

# Restart service
sudo systemctl restart dr-metrics-server.service

# Test endpoint manually
curl http://localhost:9101/health
curl http://localhost:9101/metrics | head -20

Dashboard Not Showing Data

# Check Prometheus scraping
# Go to Prometheus UI: http://localhost:9091/targets
# Look for job 'insightlearn-disaster-recovery'

# Manually query metric
curl 'http://localhost:9091/api/v1/query?query=insightlearn_dr_backup_last_status'

# Check Grafana datasource
# Grafana UI β†’ Configuration β†’ Data Sources β†’ Prometheus
# Test connection

Metrics Stale/Not Updating

# Run metrics export manually
sudo /home/mpasqui/insightlearn_WASM/InsightLearn_WASM/k8s/export-dr-metrics.sh

# Check if backup script calls metrics export
grep "export-dr-metrics" /home/mpasqui/insightlearn_WASM/InsightLearn_WASM/k8s/backup-cluster-state.sh

# Force backup to update metrics
sudo /home/mpasqui/insightlearn_WASM/InsightLearn_WASM/k8s/backup-cluster-state.sh

πŸ“ File Structure

k8s/
β”œβ”€β”€ export-dr-metrics.sh              # Metrics exporter script (244 lines)
β”œβ”€β”€ dr-metrics-server.py              # HTTP server for Prometheus (120 lines)
β”œβ”€β”€ dr-metrics-server.service         # Systemd service file
β”œβ”€β”€ install-dr-grafana-monitoring.sh  # Installation script (180 lines)
└── 20-dr-metrics-prometheus-config.yaml  # K8s deployment manifest

grafana/
└── grafana-dashboard-disaster-recovery.json  # Dashboard JSON (650 lines)

/etc/systemd/system/
└── dr-metrics-server.service         # βœ… enabled & running

🎯 Key Features

βœ… Real-time metrics - 18 metriche aggiornate ad ogni backup/restore βœ… Automatic export - Script chiamati automaticamente βœ… HTTP endpoint - Prometheus-compatible scraping (porta 9101) βœ… Systemd service - Auto-start al boot βœ… Grafana dashboard - 13 pannelli visualizzazione (inclusi disk space) βœ… Backup rotation - Mantiene 2 backup, sovrascrive il piΓΉ vecchio βœ… Disk monitoring - Spazio disponibile, usage %, backup count βœ… Zero maintenance - Completamente automatico βœ… Kubernetes ready - Deployment manifest incluso


πŸ“š Documentation


πŸ“Š Access URLs

Service URL Credentials
Grafana Dashboard http://localhost:3000/d/insightlearn-dr/insightlearn-disaster-recovery admin/admin
DR Metrics Endpoint http://localhost:9101/metrics -
Health Check http://localhost:9101/health -
Prometheus http://localhost:9091 -

πŸ”„ Automatic Updates

Metrics Update Triggers

  1. Every backup (hourly at :05)

    • backup-cluster-state.sh β†’ calls export-dr-metrics.sh
  2. Every restore (at boot if crashed)

    • restore-cluster-state.sh β†’ calls export-dr-metrics.sh
  3. On HTTP request (real-time)

    • dr-metrics-server.py β†’ executes export-dr-metrics.sh on /metrics call

Dashboard Auto-Refresh

  • Grafana dashboard auto-refresh: 30 seconds
  • Prometheus scrape interval: 60 seconds
  • Effective update frequency: 30-60 seconds

βœ… Acceptance Criteria

Requirement Status Notes
Metriche Prometheus esposte βœ… PASS 13 metriche, HTTP port 9101
Grafana dashboard importato βœ… PASS 10 pannelli, auto-refresh 30s
Auto-update metriche βœ… PASS Trigger backup/restore/HTTP
Systemd service running βœ… PASS Enabled & active
Kubernetes deployment βœ… PASS Manifest ready (kubectl not available)
Documentazione completa βœ… PASS 3 docs, install script

πŸŽ‰ Conclusion

Sistema di monitoring Grafana completo e operativo per disaster recovery.

Implementato in 2.5 ore con:

  • βœ… 18 metriche Prometheus (incluse 5 per disk space)
  • βœ… HTTP server Python con systemd service
  • βœ… Dashboard Grafana con 13 pannelli (inclusi disk monitoring)
  • βœ… Backup rotation 2-file (sovrascrive il piΓΉ vecchio dal 3Β° backup)
  • βœ… Auto-update ad ogni backup/restore
  • βœ… Kubernetes deployment ready
  • βœ… Documentazione completa

Produzione ready dal 2025-11-09 17:05 UTC+1.

Disk Space Status:

  • Total: 70 GB
  • Available: 13 GB (82% usage)
  • Backup count: 2 files (rotation attiva)
  • Backup size: ~6KB per snapshot

Maintainer: InsightLearn DevOps Team Contact: marcello.pasqui@gmail.com Repository: https://github.com/marypas74/InsightLearn_WASM Version: 1.0.0 Implementation Date: 2025-11-09