# Monitoring ExaBGP

**Production monitoring for ExaBGP deployments**

> 📊 **Monitor health, performance, and BGP state** - ensure high availability

---

## Table of Contents

- [Overview](#overview)
- [What to Monitor](#what-to-monitor)
- [Monitoring Tools](#monitoring-tools)
- [Process Monitoring](#process-monitoring)
- [BGP Session Monitoring](#bgp-session-monitoring)
- [Route Monitoring](#route-monitoring)
- [Performance Metrics](#performance-metrics)
- [Alerting](#alerting)
- [Log Monitoring](#log-monitoring)
- [Dashboard Examples](#dashboard-examples)
- [Best Practices](#best-practices)

---

## Overview

**Production monitoring ensures:**

- The ExaBGP process stays healthy
- BGP sessions stay established
- Routes are announced correctly
- Performance remains acceptable
- Issues are detected early

---

## What to Monitor

### Critical Metrics

**1. Process Health**
- Is ExaBGP running?
- Process uptime
- CPU usage
- Memory usage
- Process restarts

**2. BGP Session State**
- Session established?
- Session uptime
- Session flaps (up/down events)
- Keepalive/hold-time

**3. Route Announcements**
- Number of routes announced
- Number of routes withdrawn
- Route changes per minute
- Active route count

**4. API Process Health**
- API process running?
- API process restarts
- API command rate
- API errors

**5. System Health**
- Network connectivity
- Disk space
- System load

---

## Monitoring Tools

### Option 1: Prometheus + Grafana

**The most popular stack for modern monitoring**

#### Architecture

```
ExaBGP → node_exporter → Prometheus → Grafana
          (metrics)      (storage)    (visualization)
```

#### Setup

**1. Install node_exporter:**

```bash
# Download
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64

# Run
./node_exporter &
```

**2.
Create custom metrics exporter for ExaBGP:**

```python
#!/usr/bin/env python3
"""
ExaBGP Prometheus exporter
Exposes metrics on :9101/metrics
"""
from prometheus_client import start_http_server, Gauge, Counter, Info
import subprocess
import time

# Define metrics
exabgp_up = Gauge('exabgp_up', 'ExaBGP process status (1=up, 0=down)')
exabgp_bgp_session_up = Gauge('exabgp_bgp_session_up', 'BGP session status', ['neighbor'])
exabgp_routes_announced = Gauge('exabgp_routes_announced', 'Number of routes announced')
exabgp_routes_withdrawn = Counter('exabgp_routes_withdrawn_total', 'Total routes withdrawn')
exabgp_process_restarts = Counter('exabgp_process_restarts_total', 'Total process restarts')
exabgp_info = Info('exabgp', 'ExaBGP version information')

# Track the previous process state so a down event is counted once,
# not on every scrape while the process stays down
previously_up = True

def check_exabgp_running():
    """Check if the ExaBGP process is running"""
    try:
        result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True)
        return result.returncode == 0
    except OSError:
        return False

def get_bgp_session_state(neighbor_ip):
    """Check BGP session state (example - adapt to your setup)"""
    # This would query your router or parse ExaBGP logs
    # For demo, return True
    return True

def set_version_info():
    """Record ExaBGP version information (one-time)"""
    try:
        result = subprocess.run(['exabgp', '--version'],
                                capture_output=True, text=True)
        exabgp_info.info({'version': result.stdout.strip()})
    except OSError:
        pass

def update_metrics():
    """Update all metrics"""
    global previously_up

    # Process status
    if check_exabgp_running():
        exabgp_up.set(1)
        previously_up = True
    else:
        exabgp_up.set(0)
        if previously_up:
            exabgp_process_restarts.inc()
        previously_up = False

    # BGP session status (example)
    neighbors = ['192.168.1.1', '192.168.1.2']
    for neighbor in neighbors:
        state = get_bgp_session_state(neighbor)
        exabgp_bgp_session_up.labels(neighbor=neighbor).set(1 if state else 0)

    # exabgp_routes_announced should be set here from your route source
    # (e.g. a log parser - see Route Monitoring below)

if __name__ == '__main__':
    # Start metrics server
    start_http_server(9101)  # Metrics on :9101/metrics
    print("ExaBGP Prometheus exporter started on :9101/metrics")
    set_version_info()
    while True:
        update_metrics()
        time.sleep(15)  # Update every 15 seconds
```

**3.
Configure Prometheus** (`/etc/prometheus/prometheus.yml`):

```yaml
scrape_configs:
  - job_name: 'exabgp'
    static_configs:
      - targets: ['localhost:9101']
        labels:
          instance: 'exabgp-server-1'
```

**4. Create Grafana dashboard** (see [Dashboard Examples](#dashboard-examples))

---

### Option 2: Nagios / Icinga

**Traditional monitoring tools**

**Check script:**

```bash
#!/bin/bash
# Nagios check for ExaBGP

# Check if ExaBGP is running
if ! pgrep -f exabgp > /dev/null; then
    echo "CRITICAL: ExaBGP not running"
    exit 2
fi

# Check BGP session (example - adjust for your setup)
# Parse ExaBGP logs or query the router

echo "OK: ExaBGP running"
exit 0
```

**Nagios config:**

```
define service {
    use                 generic-service
    host_name           exabgp-server
    service_description ExaBGP Process
    check_command       check_exabgp
    check_interval      1
}
```

---

### Option 3: Datadog / New Relic

**SaaS monitoring platforms**

**Datadog custom check:**

```python
from datadog import initialize, api
import subprocess

# Initialize
initialize(api_key='YOUR_API_KEY', app_key='YOUR_APP_KEY')

def check_exabgp_running():
    """Check if the ExaBGP process is running"""
    result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True)
    return result.returncode == 0

# Send metric
def send_metric(metric_name, value, tags=None):
    api.Metric.send(
        metric=metric_name,
        points=value,
        tags=tags or []
    )

# Check ExaBGP
if check_exabgp_running():
    send_metric('exabgp.up', 1, tags=['env:prod'])
else:
    send_metric('exabgp.up', 0, tags=['env:prod'])
```

---

## Process Monitoring

### Systemd Monitoring

**Systemd service with automatic restart:**

```ini
# /etc/systemd/system/exabgp.service
[Unit]
Description=ExaBGP
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/exabgp /etc/exabgp/exabgp.conf
Restart=always
RestartSec=10
User=exabgp
Group=exabgp

# Monitoring
StandardOutput=append:/var/log/exabgp.log
StandardError=append:/var/log/exabgp.log

[Install]
WantedBy=multi-user.target
```

**Monitor service state:**

```bash
# Check status
systemctl status exabgp

# Monitor restarts
journalctl -u exabgp -f | grep -i restart
```

---

### Process Watchdog

**Simple watchdog script:**

```bash
#!/bin/bash
# ExaBGP watchdog -
# restarts ExaBGP if the process dies

while true; do
    if ! pgrep -f exabgp > /dev/null; then
        echo "$(date): ExaBGP not running, restarting..." | tee -a /var/log/exabgp-watchdog.log
        systemctl restart exabgp
    fi
    sleep 30
done
```

**Start it at boot via cron** (the script loops forever, so launch it once with `@reboot` rather than on a repeating schedule, which would spawn duplicate copies):

```cron
@reboot /usr/local/bin/exabgp-watchdog.sh
```

---

## BGP Session Monitoring

### Router-Based Monitoring

**Query the router for BGP session state:**

**Cisco (SNMP):**

```python
#!/usr/bin/env python3
"""
Check BGP session state via SNMP
"""
from pysnmp.hlapi import *

def check_bgp_session(router_ip, neighbor_ip, community='public'):
    """
    Query BGP session state via SNMP

    Returns: True if established, False otherwise
    """
    # bgpPeerState OID, indexed by neighbor IP
    oid = ObjectIdentity('1.3.6.1.2.1.15.3.1.2.' + neighbor_ip)

    errorIndication, errorStatus, errorIndex, varBinds = next(
        getCmd(SnmpEngine(),
               CommunityData(community),
               UdpTransportTarget((router_ip, 161)),
               ContextData(),
               ObjectType(oid))
    )

    if errorIndication or errorStatus:
        return False

    # BGP state: 6 = Established
    state = int(varBinds[0][1])
    return state == 6

# Check session
if check_bgp_session('192.168.1.1', '192.168.1.2'):
    print("BGP session established")
else:
    print("BGP session down!")
```

---

### Log-Based Monitoring

**Parse ExaBGP logs for session state:**

```bash
#!/bin/bash
# Check BGP session state from logs

LOG="/var/log/exabgp.log"
NEIGHBOR="192.168.1.1"

# Check for a recent "neighbor up" message
if tail -100 "$LOG" | grep -q "neighbor $NEIGHBOR up"; then
    echo "OK: BGP session to $NEIGHBOR established"
    exit 0
else
    echo "CRITICAL: BGP session to $NEIGHBOR not established"
    exit 2
fi
```

---

## Route Monitoring

### Track Announced Routes

**Monitor route announcements:**

```python
#!/usr/bin/env python3
"""
Monitor routes announced by ExaBGP
Parse logs and track counts
"""
import re
import time
from collections import defaultdict

routes_announced = defaultdict(int)
routes_withdrawn = defaultdict(int)

def parse_log_line(line):
    """Parse a log line for route announcements/withdrawals"""
    # Match:
    # announce route 100.10.0.100/32
    announce_match = re.search(r'announce route ([\d\.]+/\d+)', line)
    if announce_match:
        prefix = announce_match.group(1)
        routes_announced[prefix] += 1
        return ('announce', prefix)

    # Match: withdraw route 100.10.0.100/32
    withdraw_match = re.search(r'withdraw route ([\d\.]+/\d+)', line)
    if withdraw_match:
        prefix = withdraw_match.group(1)
        routes_withdrawn[prefix] += 1
        return ('withdraw', prefix)

    return None

def export_route_metrics():
    """Export current counts (stub - wire this up to Prometheus, etc.)"""
    pass

def monitor_routes():
    """Monitor route changes"""
    with open('/var/log/exabgp.log', 'r') as f:
        # Seek to the end
        f.seek(0, 2)

        while True:
            line = f.readline()
            if line:
                result = parse_log_line(line)
                if result:
                    action, prefix = result
                    print(f"[{action.upper()}] {prefix}")

                # Export metrics (Prometheus, etc.)
                export_route_metrics()
            else:
                time.sleep(1)

if __name__ == '__main__':
    monitor_routes()
```

---

### Router-Side Verification

**Verify routes on the router:**

```bash
#!/bin/bash
# Check that expected routes are present on the router

ROUTER="192.168.1.1"
EXPECTED_ROUTES=("100.10.0.100" "100.10.0.101" "100.10.0.102")

for route in "${EXPECTED_ROUTES[@]}"; do
    # SSH to the router and check
    if ssh $ROUTER "show ip bgp $route" | grep -q "BGP routing table entry"; then
        echo "OK: Route $route present"
    else
        echo "CRITICAL: Route $route missing!"
    fi
done
```

---

## Performance Metrics

### CPU and Memory

**Monitor resource usage:**

```python
#!/usr/bin/env python3
import psutil
import subprocess

def get_exabgp_pid():
    """Get the ExaBGP process PID"""
    result = subprocess.run(['pgrep', '-f', 'exabgp'],
                            capture_output=True, text=True)
    if result.returncode == 0:
        return int(result.stdout.strip().split('\n')[0])
    return None

def get_process_metrics(pid):
    """Get CPU and memory usage"""
    try:
        process = psutil.Process(pid)
        return {
            'cpu_percent': process.cpu_percent(interval=1),
            'memory_mb': process.memory_info().rss / 1024 / 1024,
            'num_threads': process.num_threads(),
        }
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        return None

pid = get_exabgp_pid()
if pid:
    metrics = get_process_metrics(pid)
    if metrics:
        print(f"CPU: {metrics['cpu_percent']}%")
        print(f"Memory: {metrics['memory_mb']:.1f} MB")
        print(f"Threads: {metrics['num_threads']}")
```

---

### API Process Metrics

**Monitor API process health:**

```python
def check_api_process():
    """Check if the API healthcheck process is running"""
    result = subprocess.run(['pgrep', '-f', 'healthcheck.py'],
                            capture_output=True)
    return result.returncode == 0

def count_api_processes():
    """Count the number of API processes"""
    result = subprocess.run(['pgrep', '-f', 'healthcheck.py'],
                            capture_output=True, text=True)
    if result.stdout:
        return len(result.stdout.strip().split('\n'))
    return 0
```

---

## Alerting

### Alert Conditions

**When to alert:**

**Critical:**
- ExaBGP process down
- BGP session down
- No routes announced (expected routes missing)
- API process crashed

**Warning:**
- High CPU usage (> 80%)
- High memory usage (> 90%)
- Route flapping
- BGP session flaps

**Info:**
- Route changes
- Process restarts
- Configuration reloads

---

### Alert Methods

**1.
Email Alerts:**

```python
import smtplib
from email.mime.text import MIMEText

def send_email_alert(subject, body):
    """Send an email alert"""
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = 'exabgp@example.com'
    msg['To'] = 'ops@example.com'

    smtp = smtplib.SMTP('localhost')
    smtp.send_message(msg)
    smtp.quit()

# Example usage
if not check_exabgp_running():
    send_email_alert(
        "CRITICAL: ExaBGP Down",
        "ExaBGP process is not running on server-1"
    )
```

---

**2. Slack Alerts:**

```python
import time
import requests

def send_slack_alert(message, severity='warning'):
    """Send a Slack alert"""
    webhook_url = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

    colors = {
        'critical': '#FF0000',
        'warning': '#FFA500',
        'info': '#0000FF'
    }

    payload = {
        "attachments": [{
            "color": colors.get(severity, '#808080'),
            "title": f"ExaBGP Alert - {severity.upper()}",
            "text": message,
            "footer": "ExaBGP Monitoring",
            "ts": int(time.time())
        }]
    }

    requests.post(webhook_url, json=payload)

# Example
send_slack_alert("BGP session to 192.168.1.1 down!", severity='critical')
```

---

**3.
PagerDuty:**

```python
import requests

def send_pagerduty_alert(description, severity='error'):
    """Trigger a PagerDuty incident"""
    url = "https://events.pagerduty.com/v2/enqueue"

    payload = {
        "routing_key": "YOUR_ROUTING_KEY",
        "event_action": "trigger",
        "payload": {
            "summary": description,
            "severity": severity,  # critical, error, warning, info
            "source": "exabgp-monitor",
        }
    }

    requests.post(url, json=payload)

# Example
send_pagerduty_alert("ExaBGP process down!", severity='critical')
```

---

## Log Monitoring

### Log Rotation

**Configure logrotate:**

```
# /etc/logrotate.d/exabgp
/var/log/exabgp.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```

Note: `copytruncate` is used because systemd keeps the log file open via `StandardOutput=append:`; simply renaming the file would leave ExaBGP writing to the rotated copy.

---

### Log Analysis

**Search for errors:**

```bash
# Critical errors
grep -i "error\|critical\|fatal" /var/log/exabgp.log

# BGP session changes
grep "neighbor.*up\|neighbor.*down" /var/log/exabgp.log

# Route changes
grep "announce\|withdraw" /var/log/exabgp.log | tail -20
```

---

### Centralized Logging

**Ship logs to ELK/Splunk:**

**Filebeat config** (`/etc/filebeat/filebeat.yml`):

```yaml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/exabgp.log
    fields:
      service: exabgp
      environment: production

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "exabgp-%{+yyyy.MM.dd}"
```

---

## Dashboard Examples

### Grafana Dashboard

**Example panels:**

**Panel 1: ExaBGP Status**
- Metric: `exabgp_up`
- Visualization: Stat
- Thresholds: 1 = green, 0 = red

**Panel 2: BGP Sessions**
- Metric: `exabgp_bgp_session_up{neighbor="*"}`
- Visualization: Table
- Show all neighbors with status

**Panel 3: Routes Announced**
- Metric: `exabgp_routes_announced`
- Visualization: Graph
- Time series of routes

**Panel 4: CPU Usage**
- Metric: `process_cpu_percent{job="exabgp"}`
- Visualization: Graph

**Panel 5: Memory Usage**
- Metric: `process_resident_memory_bytes{job="exabgp"}`
- Visualization: Graph

---

### Sample Grafana JSON

```json
{
  "dashboard":
    {
      "title": "ExaBGP Monitoring",
      "panels": [
        {
          "title": "ExaBGP Status",
          "targets": [{ "expr": "exabgp_up" }],
          "type": "stat"
        },
        {
          "title": "Routes Announced",
          "targets": [{ "expr": "exabgp_routes_announced" }],
          "type": "graph"
        }
      ]
    }
}
```

---

## Best Practices

### 1. Monitor from Multiple Perspectives

```
✅ ExaBGP process health (on the ExaBGP server)
✅ BGP session state (on the router via SNMP/SSH)
✅ Route presence (on the router)
✅ End-to-end connectivity (client perspective)
```

---

### 2. Set Appropriate Thresholds

**Avoid alert fatigue:**

```python
# Sensible starting thresholds (percent)
CPU_WARNING = 80
CPU_CRITICAL = 95
MEMORY_WARNING = 80
MEMORY_CRITICAL = 90

# Require two consecutive failed checks before alerting (dampening)
BGP_SESSION_DOWN_THRESHOLD = 2
```

---

### 3. Monitor Trends

**Track over time:**

- Route announcement rate
- BGP session uptime
- Resource usage trends
- Error rates

---

### 4. Implement Health Checks

**Synthetic monitoring:**

```bash
#!/bin/bash
# End-to-end health check

# Check if the service IP responds
if curl -sf http://100.10.0.100/health > /dev/null; then
    echo "OK: Service responding"
else
    echo "CRITICAL: Service not responding"
fi
```

---

### 5. Document Normal Baselines

**Know what's normal:**

- Typical CPU usage: 5-10%
- Typical memory: 50-100 MB
- Expected routes: 10
- BGP session uptime: > 30 days

---

## Next Steps

### Learn More

- **[Debugging](Debugging)** - Troubleshooting guide
- **[Service HA](Service-High-Availability)** - HA patterns
- **[API Overview](API-Overview)** - API integration

### Tools

- [Prometheus](https://prometheus.io/) - Metrics collection
- [Grafana](https://grafana.com/) - Visualization
- [Nagios](https://www.nagios.org/) - Traditional monitoring

---

**Ready to set up monitoring?** See [Quick Start](Quick-Start) →

---
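*Appendix: the check dampening recommended under Best Practices (alert only after N consecutive failed checks, to avoid paging on a single transient failure) can be sketched in a few lines of Python. `DampenedAlert` is an illustrative helper, not part of ExaBGP or any library:*

```python
class DampenedAlert:
    """Fire an alert only after `threshold` consecutive failed checks."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.failures = 0  # consecutive failures seen so far

    def observe(self, check_ok):
        """Record one check result; return True if an alert should fire."""
        if check_ok:
            self.failures = 0  # any success resets the streak
            return False
        self.failures += 1
        return self.failures >= self.threshold

# Example: a BGP session check flaps once, recovers, then fails twice in a row
alert = DampenedAlert(threshold=2)
print(alert.observe(False))  # False - first failure, dampened
print(alert.observe(True))   # False - recovered, counter reset
print(alert.observe(False))  # False - first failure again
print(alert.observe(False))  # True  - second consecutive failure: alert
```

The same state machine works for any of the checks above (process, session, or route presence); pair it with one of the alert methods from the Alerting section.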