# Monitoring ExaBGP

**Production monitoring for ExaBGP deployments**

> 📊 **Monitor health, performance, and BGP state** - ensure high availability

---

## Table of Contents

- [Overview](#overview)
- [What to Monitor](#what-to-monitor)
- [Monitoring Tools](#monitoring-tools)
- [Process Monitoring](#process-monitoring)
- [BGP Session Monitoring](#bgp-session-monitoring)
- [Route Monitoring](#route-monitoring)
- [Performance Metrics](#performance-metrics)
- [Alerting](#alerting)
- [Log Monitoring](#log-monitoring)
- [Dashboard Examples](#dashboard-examples)
- [Best Practices](#best-practices)

---

## Overview

**Production monitoring ensures:**

- The ExaBGP process stays healthy
- BGP sessions stay established
- Routes are announced correctly
- Performance remains acceptable
- Issues are detected early

---

## What to Monitor

### Critical Metrics

**1. Process Health**
- Is ExaBGP running?
- Process uptime
- CPU usage
- Memory usage
- Process restarts

**2. BGP Session State**
- Session established?
- Session uptime
- Session flaps (up/down events)
- Keepalive/hold-time

**3. Route Announcements**
- Number of routes announced
- Number of routes withdrawn
- Route changes per minute
- Active route count

**4. API Process Health**
- API process running?
- API process restarts
- API command rate
- API errors

**5. System Health**
- Network connectivity
- Disk space
- System load

---

## Monitoring Tools

### Option 1: Prometheus + Grafana

**The most popular stack for modern monitoring**

#### Architecture

```
ExaBGP → node_exporter → Prometheus → Grafana
          (metrics)      (storage)    (visualization)
```

#### Setup

**1. Install node_exporter:**

```bash
# Download
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64

# Run
./node_exporter &
```

**2.
Create custom metrics exporter for ExaBGP:**

```python
#!/usr/bin/env python3
"""
ExaBGP Prometheus exporter
Exposes metrics on :9101/metrics
"""
from prometheus_client import start_http_server, Gauge, Counter, Info
import subprocess
import time

# Define metrics
exabgp_up = Gauge('exabgp_up', 'ExaBGP process status (1=up, 0=down)')
exabgp_bgp_session_up = Gauge('exabgp_bgp_session_up', 'BGP session status', ['neighbor'])
exabgp_routes_announced = Gauge('exabgp_routes_announced', 'Number of routes announced')
exabgp_routes_withdrawn = Counter('exabgp_routes_withdrawn_total', 'Total routes withdrawn')
exabgp_process_restarts = Counter('exabgp_process_restarts_total', 'Total process restarts')
exabgp_info = Info('exabgp', 'ExaBGP version information')

# Track the previous process state so a down event is counted once,
# not on every scrape while the process stays down
previously_up = True

def check_exabgp_running():
    """Check if the ExaBGP process is running"""
    try:
        result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True)
        return result.returncode == 0
    except OSError:
        return False

def get_bgp_session_state(neighbor_ip):
    """Check BGP session state (example - adapt to your setup)"""
    # This would query your router or parse ExaBGP logs
    # For demo, return True
    return True

def set_version_info():
    """Record ExaBGP version information (one-time)"""
    try:
        result = subprocess.run(['exabgp', '--version'],
                                capture_output=True, text=True)
        exabgp_info.info({'version': result.stdout.strip()})
    except OSError:
        pass

def update_metrics():
    """Update all metrics"""
    global previously_up

    # Process status
    if check_exabgp_running():
        exabgp_up.set(1)
        previously_up = True
    else:
        exabgp_up.set(0)
        if previously_up:
            exabgp_process_restarts.inc()
        previously_up = False

    # BGP session status (example)
    neighbors = ['192.168.1.1', '192.168.1.2']
    for neighbor in neighbors:
        state = get_bgp_session_state(neighbor)
        exabgp_bgp_session_up.labels(neighbor=neighbor).set(1 if state else 0)

    # exabgp_routes_announced should be set here from your route source
    # (e.g. a log parser - see Route Monitoring below)

if __name__ == '__main__':
    # Start metrics server
    start_http_server(9101)  # Metrics on :9101/metrics
    print("ExaBGP Prometheus exporter started on :9101/metrics")
    set_version_info()
    while True:
        update_metrics()
        time.sleep(15)  # Update every 15 seconds
```

**3.
Configure Prometheus** (`/etc/prometheus/prometheus.yml`):

```yaml
scrape_configs:
  - job_name: 'exabgp'
    static_configs:
      - targets: ['localhost:9101']
        labels:
          instance: 'exabgp-server-1'
```

**4. Create Grafana dashboard** (see [Dashboard Examples](#dashboard-examples))

---

### Option 2: Nagios / Icinga

**Traditional monitoring tools**

**Check script:**

```bash
#!/bin/bash
# Nagios check for ExaBGP

# Check if ExaBGP is running
if ! pgrep -f exabgp > /dev/null; then
    echo "CRITICAL: ExaBGP not running"
    exit 2
fi

# Check BGP session (example - adjust for your setup)
# Parse ExaBGP logs or query the router

echo "OK: ExaBGP running"
exit 0
```

**Nagios config:**

```
define service {
    use                 generic-service
    host_name           exabgp-server
    service_description ExaBGP Process
    check_command       check_exabgp
    check_interval      1
}
```

---

### Option 3: Datadog / New Relic

**SaaS monitoring platforms**

**Datadog custom check:**

```python
from datadog import initialize, api
import subprocess

# Initialize
initialize(api_key='YOUR_API_KEY', app_key='YOUR_APP_KEY')

def check_exabgp_running():
    """Check if the ExaBGP process is running"""
    result = subprocess.run(['pgrep', '-f', 'exabgp'], capture_output=True)
    return result.returncode == 0

# Send metric
def send_metric(metric_name, value, tags=None):
    api.Metric.send(
        metric=metric_name,
        points=value,
        tags=tags or []
    )

# Check ExaBGP
if check_exabgp_running():
    send_metric('exabgp.up', 1, tags=['env:prod'])
else:
    send_metric('exabgp.up', 0, tags=['env:prod'])
```

---

## Process Monitoring

### Systemd Monitoring

**Systemd service with automatic restart:**

```ini
# /etc/systemd/system/exabgp.service
[Unit]
Description=ExaBGP
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/exabgp /etc/exabgp/exabgp.conf
Restart=always
RestartSec=10
User=exabgp
Group=exabgp

# Monitoring
StandardOutput=append:/var/log/exabgp.log
StandardError=append:/var/log/exabgp.log

[Install]
WantedBy=multi-user.target
```

**Monitor service state:**

```bash
# Check status
systemctl status exabgp

# Monitor restarts
journalctl -u exabgp -f | grep -i restart
```

---

### Process Watchdog

**Simple watchdog script:**

```bash
#!/bin/bash
# ExaBGP watchdog -
# restarts ExaBGP if the process dies

while true; do
    if ! pgrep -f exabgp > /dev/null; then
        echo "$(date): ExaBGP not running, restarting..." | tee -a /var/log/exabgp-watchdog.log
        systemctl restart exabgp
    fi
    sleep 30
done
```

**Start it at boot via cron** (the script loops forever, so launch it once with `@reboot` rather than on a repeating schedule, which would spawn duplicate copies):

```cron
@reboot /usr/local/bin/exabgp-watchdog.sh
```

---

## BGP Session Monitoring

### Router-Based Monitoring

**Query the router for BGP session state:**

**Cisco (SNMP):**

```python
#!/usr/bin/env python3
"""
Check BGP session state via SNMP
"""
from pysnmp.hlapi import *

def check_bgp_session(router_ip, neighbor_ip, community='public'):
    """
    Query BGP session state via SNMP

    Returns: True if established, False otherwise
    """
    # bgpPeerState OID, indexed by neighbor IP
    oid = ObjectIdentity('1.3.6.1.2.1.15.3.1.2.' + neighbor_ip)

    errorIndication, errorStatus, errorIndex, varBinds = next(
        getCmd(SnmpEngine(),
               CommunityData(community),
               UdpTransportTarget((router_ip, 161)),
               ContextData(),
               ObjectType(oid))
    )

    if errorIndication or errorStatus:
        return False

    # BGP state: 6 = Established
    state = int(varBinds[0][1])
    return state == 6

# Check session
if check_bgp_session('192.168.1.1', '192.168.1.2'):
    print("BGP session established")
else:
    print("BGP session down!")
```

---

### Log-Based Monitoring

**Parse ExaBGP logs for session state:**

```bash
#!/bin/bash
# Check BGP session state from logs

LOG="/var/log/exabgp.log"
NEIGHBOR="192.168.1.1"

# Check for a recent "neighbor up" message
if tail -100 "$LOG" | grep -q "neighbor $NEIGHBOR up"; then
    echo "OK: BGP session to $NEIGHBOR established"
    exit 0
else
    echo "CRITICAL: BGP session to $NEIGHBOR not established"
    exit 2
fi
```

---

## Route Monitoring

### Track Announced Routes

**Monitor route announcements:**

```python
#!/usr/bin/env python3
"""
Monitor routes announced by ExaBGP
Parse logs and track counts
"""
import re
import time
from collections import defaultdict

routes_announced = defaultdict(int)
routes_withdrawn = defaultdict(int)

def parse_log_line(line):
    """Parse a log line for route announcements/withdrawals"""
    # Match:
    # announce route 100.10.0.100/32
    announce_match = re.search(r'announce route ([\d\.]+/\d+)', line)
    if announce_match:
        prefix = announce_match.group(1)
        routes_announced[prefix] += 1
        return ('announce', prefix)

    # Match: withdraw route 100.10.0.100/32
    withdraw_match = re.search(r'withdraw route ([\d\.]+/\d+)', line)
    if withdraw_match:
        prefix = withdraw_match.group(1)
        routes_withdrawn[prefix] += 1
        return ('withdraw', prefix)

    return None

def export_route_metrics():
    """Export current counts (stub - wire this up to Prometheus, etc.)"""
    pass

def monitor_routes():
    """Monitor route changes"""
    with open('/var/log/exabgp.log', 'r') as f:
        # Seek to the end
        f.seek(0, 2)

        while True:
            line = f.readline()
            if line:
                result = parse_log_line(line)
                if result:
                    action, prefix = result
                    print(f"[{action.upper()}] {prefix}")

                # Export metrics (Prometheus, etc.)
                export_route_metrics()
            else:
                time.sleep(1)

if __name__ == '__main__':
    monitor_routes()
```

---

### Router-Side Verification

**Verify routes on the router:**

```bash
#!/bin/bash
# Check that expected routes are present on the router

ROUTER="192.168.1.1"
EXPECTED_ROUTES=("100.10.0.100" "100.10.0.101" "100.10.0.102")

for route in "${EXPECTED_ROUTES[@]}"; do
    # SSH to the router and check
    if ssh $ROUTER "show ip bgp $route" | grep -q "BGP routing table entry"; then
        echo "OK: Route $route present"
    else
        echo "CRITICAL: Route $route missing!"
    fi
done
```

---

## Performance Metrics

### CPU and Memory

**Monitor resource usage:**

```python
#!/usr/bin/env python3
import psutil
import subprocess

def get_exabgp_pid():
    """Get the ExaBGP process PID"""
    result = subprocess.run(['pgrep', '-f', 'exabgp'],
                            capture_output=True, text=True)
    if result.returncode == 0:
        return int(result.stdout.strip().split('\n')[0])
    return None

def get_process_metrics(pid):
    """Get CPU and memory usage"""
    try:
        process = psutil.Process(pid)
        return {
            'cpu_percent': process.cpu_percent(interval=1),
            'memory_mb': process.memory_info().rss / 1024 / 1024,
            'num_threads': process.num_threads(),
        }
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        return None

pid = get_exabgp_pid()
if pid:
    metrics = get_process_metrics(pid)
    if metrics:
        print(f"CPU: {metrics['cpu_percent']}%")
        print(f"Memory: {metrics['memory_mb']:.1f} MB")
        print(f"Threads: {metrics['num_threads']}")
```

---

### API Process Metrics

**Monitor API process health:**

```python
def check_api_process():
    """Check if the API healthcheck process is running"""
    result = subprocess.run(['pgrep', '-f', 'healthcheck.py'],
                            capture_output=True)
    return result.returncode == 0

def count_api_processes():
    """Count the number of API processes"""
    result = subprocess.run(['pgrep', '-f', 'healthcheck.py'],
                            capture_output=True, text=True)
    if result.stdout:
        return len(result.stdout.strip().split('\n'))
    return 0
```

---

## Alerting

### Alert Conditions

**When to alert:**

**Critical:**
- ExaBGP process down
- BGP session down
- No routes announced (expected routes missing)
- API process crashed

**Warning:**
- High CPU usage (> 80%)
- High memory usage (> 90%)
- Route flapping
- BGP session flaps

**Info:**
- Route changes
- Process restarts
- Configuration reloads

---

### Alert Methods

**1.
Email Alerts:**

```python
import smtplib
from email.mime.text import MIMEText

def send_email_alert(subject, body):
    """Send an email alert"""
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = 'exabgp@example.com'
    msg['To'] = 'ops@example.com'

    smtp = smtplib.SMTP('localhost')
    smtp.send_message(msg)
    smtp.quit()

# Example usage
if not check_exabgp_running():
    send_email_alert(
        "CRITICAL: ExaBGP Down",
        "ExaBGP process is not running on server-1"
    )
```

---

**2. Slack Alerts:**

```python
import time
import requests

def send_slack_alert(message, severity='warning'):
    """Send a Slack alert"""
    webhook_url = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

    colors = {
        'critical': '#FF0000',
        'warning': '#FFA500',
        'info': '#0000FF'
    }

    payload = {
        "attachments": [{
            "color": colors.get(severity, '#808080'),
            "title": f"ExaBGP Alert - {severity.upper()}",
            "text": message,
            "footer": "ExaBGP Monitoring",
            "ts": int(time.time())
        }]
    }

    requests.post(webhook_url, json=payload)

# Example
send_slack_alert("BGP session to 192.168.1.1 down!", severity='critical')
```

---

**3.
PagerDuty:**

```python
import requests

def send_pagerduty_alert(description, severity='error'):
    """Trigger a PagerDuty incident"""
    url = "https://events.pagerduty.com/v2/enqueue"

    payload = {
        "routing_key": "YOUR_ROUTING_KEY",
        "event_action": "trigger",
        "payload": {
            "summary": description,
            "severity": severity,  # critical, error, warning, info
            "source": "exabgp-monitor",
        }
    }

    requests.post(url, json=payload)

# Example
send_pagerduty_alert("ExaBGP process down!", severity='critical')
```

---

## Log Monitoring

### Log Rotation

**Configure logrotate:**

```
# /etc/logrotate.d/exabgp
/var/log/exabgp.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```

Note: `copytruncate` is used because systemd keeps the log file open via `StandardOutput=append:`; simply renaming the file would leave ExaBGP writing to the rotated copy.

---

### Log Analysis

**Search for errors:**

```bash
# Critical errors
grep -i "error\|critical\|fatal" /var/log/exabgp.log

# BGP session changes
grep "neighbor.*up\|neighbor.*down" /var/log/exabgp.log

# Route changes
grep "announce\|withdraw" /var/log/exabgp.log | tail -20
```

---

### Centralized Logging

**Ship logs to ELK/Splunk:**

**Filebeat config** (`/etc/filebeat/filebeat.yml`):

```yaml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/exabgp.log
    fields:
      service: exabgp
      environment: production

output.elasticsearch:
  hosts: ["localhost:9200"]
  index: "exabgp-%{+yyyy.MM.dd}"
```

---

## Dashboard Examples

### Grafana Dashboard

**Example panels:**

**Panel 1: ExaBGP Status**
- Metric: `exabgp_up`
- Visualization: Stat
- Thresholds: 1 = green, 0 = red

**Panel 2: BGP Sessions**
- Metric: `exabgp_bgp_session_up{neighbor="*"}`
- Visualization: Table
- Show all neighbors with status

**Panel 3: Routes Announced**
- Metric: `exabgp_routes_announced`
- Visualization: Graph
- Time series of routes

**Panel 4: CPU Usage**
- Metric: `process_cpu_percent{job="exabgp"}`
- Visualization: Graph

**Panel 5: Memory Usage**
- Metric: `process_resident_memory_bytes{job="exabgp"}`
- Visualization: Graph

---

### Sample Grafana JSON

```json
{
  "dashboard":
    {
      "title": "ExaBGP Monitoring",
      "panels": [
        {
          "title": "ExaBGP Status",
          "targets": [{ "expr": "exabgp_up" }],
          "type": "stat"
        },
        {
          "title": "Routes Announced",
          "targets": [{ "expr": "exabgp_routes_announced" }],
          "type": "graph"
        }
      ]
    }
}
```

---

## Best Practices

### 1. Monitor from Multiple Perspectives

```
✅ ExaBGP process health (on the ExaBGP server)
✅ BGP session state (on the router via SNMP/SSH)
✅ Route presence (on the router)
✅ End-to-end connectivity (client perspective)
```

---

### 2. Set Appropriate Thresholds

**Avoid alert fatigue:**

```python
# Sensible starting thresholds (percent)
CPU_WARNING = 80
CPU_CRITICAL = 95
MEMORY_WARNING = 80
MEMORY_CRITICAL = 90

# Require two consecutive failed checks before alerting (dampening)
BGP_SESSION_DOWN_THRESHOLD = 2
```

---

### 3. Monitor Trends

**Track over time:**

- Route announcement rate
- BGP session uptime
- Resource usage trends
- Error rates

---

### 4. Implement Health Checks

**Synthetic monitoring:**

```bash
#!/bin/bash
# End-to-end health check

# Check if the service IP responds
if curl -sf http://100.10.0.100/health > /dev/null; then
    echo "OK: Service responding"
else
    echo "CRITICAL: Service not responding"
fi
```

---

### 5. Document Normal Baselines

**Know what's normal:**

- Typical CPU usage: 5-10%
- Typical memory: 50-100 MB
- Expected routes: 10
- BGP session uptime: > 30 days

---

## Next Steps

### Learn More

- **[Debugging](Debugging)** - Troubleshooting guide
- **[Service HA](Service-High-Availability)** - HA patterns
- **[API Overview](API-Overview)** - API integration

### Tools

- [Prometheus](https://prometheus.io/) - Metrics collection
- [Grafana](https://grafana.com/) - Visualization
- [Nagios](https://www.nagios.org/) - Traditional monitoring

---

**Ready to set up monitoring?** See [Quick Start](Quick-Start) →

---
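*Appendix: the check dampening recommended under Best Practices (alert only after N consecutive failed checks, to avoid paging on a single transient failure) can be sketched in a few lines of Python. `DampenedAlert` is an illustrative helper, not part of ExaBGP or any library:*

```python
class DampenedAlert:
    """Fire an alert only after `threshold` consecutive failed checks."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.failures = 0  # consecutive failures seen so far

    def observe(self, check_ok):
        """Record one check result; return True if an alert should fire."""
        if check_ok:
            self.failures = 0  # any success resets the streak
            return False
        self.failures += 1
        return self.failures >= self.threshold

# Example: a BGP session check flaps once, recovers, then fails twice in a row
alert = DampenedAlert(threshold=2)
print(alert.observe(False))  # False - first failure, dampened
print(alert.observe(True))   # False - recovered, counter reset
print(alert.observe(False))  # False - first failure again
print(alert.observe(False))  # True  - second consecutive failure: alert
```

The same state machine works for any of the checks above (process, session, or route presence); pair it with one of the alert methods from the Alerting section.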