# Production Best Practices

**Comprehensive guide to deploying ExaBGP in production environments**

---

## Table of Contents

- [Introduction](#introduction)
- [Security Hardening](#security-hardening)
- [Monitoring and Observability](#monitoring-and-observability)
- [High Availability Architecture](#high-availability-architecture)
- [Performance Tuning](#performance-tuning)
- [Logging and Alerting](#logging-and-alerting)
- [Disaster Recovery](#disaster-recovery)
- [Configuration Management](#configuration-management)
- [Real-World Deployment Patterns](#real-world-deployment-patterns)
- [Testing Strategies](#testing-strategies)
- [Capacity Planning](#capacity-planning)
- [Operational Procedures](#operational-procedures)
- [Troubleshooting](#troubleshooting)

---

## Introduction

Deploying ExaBGP in production requires careful attention to security, reliability, and operational excellence. This guide provides battle-tested patterns from real-world deployments.

### Production Readiness Checklist

**Before going to production:**
- ✅ Security hardening applied
- ✅ Monitoring and alerting configured
- ✅ HA architecture tested
- ✅ Disaster recovery plan documented
- ✅ Logging centralized
- ✅ Runbooks created
- ✅ Load testing completed
- ✅ Rollback procedures tested

**Critical reminder:**
> 🔴 **ExaBGP does NOT manipulate RIB/FIB** - ExaBGP is a pure BGP protocol speaker. When your API programs announce/withdraw routes, ExaBGP sends BGP messages. The **router** installs/removes routes in its RIB/FIB. ExaBGP never touches routing tables directly.

---

## Security Hardening

### Process Isolation

**Run ExaBGP as dedicated user:**

```bash
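# Create exabgp user
sudo useradd -r -s /bin/false -d /var/lib/exabgp exabgp

# Create the directories the service needs (they may not exist yet)
sudo mkdir -p /etc/exabgp /var/log/exabgp /var/lib/exabgp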

# Set ownership
sudo chown -R exabgp:exabgp /etc/exabgp
sudo chown -R exabgp:exabgp /var/log/exabgp
sudo chown -R exabgp:exabgp /var/lib/exabgp

# Restrict permissions
sudo chmod 750 /etc/exabgp
sudo chmod 640 /etc/exabgp/*.conf
```

---

### Systemd Hardening

**Systemd service with security restrictions:**

```ini
[Unit]
Description=ExaBGP BGP Speaker
After=network.target
Documentation=https://github.com/Exa-Networks/exabgp/wiki

[Service]
Type=simple
User=exabgp
Group=exabgp
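ExecStart=/usr/local/bin/exabgp /etc/exabgp/exabgp.conf
# ExaBGP reloads its configuration on SIGUSR1; required for `systemctl reload exabgp`
ExecReload=/bin/kill -USR1 $MAINPID
Restart=on-failure
RestartSec=5s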

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/log/exabgp /var/lib/exabgp
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictRealtime=true
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources
SystemCallErrorNumber=EPERM

# Resource limits
LimitNOFILE=65536
LimitNPROC=100
# MemoryMax= replaces the deprecated MemoryLimit=
MemoryMax=512M
CPUQuota=100%

[Install]
WantedBy=multi-user.target
```

**Enable and start:**

```bash
sudo systemctl daemon-reload
sudo systemctl enable exabgp
sudo systemctl start exabgp
```

---

### BGP Authentication

**MD5 authentication (recommended for production):**

```ini
neighbor 192.168.1.1 {
    router-id 192.168.1.2;
    local-address 192.168.1.2;
    local-as 65001;
    peer-as 65000;

    # MD5 authentication
    md5-password "your-strong-password-here";

    # TTL security (GTSM)
    incoming-ttl 255;  # Only accept packets with TTL=255

    family {
        ipv4 unicast;
        ipv4 flow;
    }
}
```

**Generate strong passwords:**

```bash
# Generate random 32-character password
openssl rand -base64 24
```
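
When passwords are generated by provisioning tooling rather than by hand, the same result can be produced from Python. This is a minimal sketch that assumes the standard-library `secrets` module is an acceptable randomness source; `generate_bgp_password` is a hypothetical helper name, not part of ExaBGP.

```python
import secrets


def generate_bgp_password(num_bytes: int = 24) -> str:
    """Return a random URL-safe password.

    24 random bytes encode to exactly 32 base64 characters.
    """
    return secrets.token_urlsafe(num_bytes)


# Usage: print(generate_bgp_password())
```

Check your router's documentation for its maximum MD5 password length before increasing `num_bytes`.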

**Store passwords securely:**

```bash
# Use environment variables
export BGP_MD5_PASSWORD="$(cat /etc/exabgp/secrets/bgp_password)"

# Reference in config (ExaBGP 4.x+)
neighbor 192.168.1.1 {
    md5-password env['BGP_MD5_PASSWORD'];
}
```

---

### API Security

**Restrict API programs:**

```ini
process healthcheck {
    # Use absolute paths
    run /usr/local/bin/exabgp-healthcheck.py;

    # Set working directory
    working-directory /var/lib/exabgp;

    # Limit environment
    env {
        SERVICE_IP = '100.10.0.100';
        CHECK_INTERVAL = '5';
    }

    encoder text;
}
```

**Validate API program permissions:**

```bash
# API programs should be owned by root, not writable by exabgp
sudo chown root:root /usr/local/bin/exabgp-healthcheck.py
sudo chmod 755 /usr/local/bin/exabgp-healthcheck.py

# Prevent tampering
sudo chattr +i /usr/local/bin/exabgp-healthcheck.py  # Make immutable
```
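
The ownership and mode rules above can also be checked automatically, for example from a deployment pipeline or a monitoring job. The sketch below is an illustration only: `audit_api_program` is a hypothetical helper, and the three checks it performs (owner, group/world write bits, owner execute bit) are assumptions about what "not writable by exabgp" should mean in your environment.

```python
import os
import stat
from typing import List


def audit_api_program(path: str, expected_uid: int = 0) -> List[str]:
    """Return a list of permission problems found for an API program file."""
    problems = []
    info = os.stat(path)

    # The file should belong to root (uid 0) unless told otherwise
    if info.st_uid != expected_uid:
        problems.append(f"owner uid is {info.st_uid}, expected {expected_uid}")

    # Nobody except the owner should be able to modify the program
    if info.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
        problems.append("file is group- or world-writable")

    # ExaBGP has to be able to execute it
    if not info.st_mode & stat.S_IXUSR:
        problems.append("file is not executable by its owner")

    return problems
```

Calling `audit_api_program("/usr/local/bin/exabgp-healthcheck.py")` returns an empty list when the file matches the commands above; any returned strings can be logged or turned into an alert.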

---

### Network Security

**Firewall rules (iptables):**

```bash
# Allow established connections (add this before the DROP rule)
sudo iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Allow BGP from specific peers only
sudo iptables -A INPUT -p tcp --dport 179 -s 192.168.1.1 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 179 -j DROP

# Save rules (the redirect must run as root, so wrap it in a root shell)
sudo sh -c 'iptables-save > /etc/iptables/rules.v4'
```

**Firewall rules (nftables):**

```bash
# /etc/nftables.conf
table inet filter {
    chain input {
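        # NOTE: policy drop blocks all inbound traffic not explicitly accepted
        # below (including SSH); add rules for management access before loading.
        type filter hook input priority 0; policy drop;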

        # Allow established connections
        ct state established,related accept

        # Allow BGP from specific peer
        ip saddr 192.168.1.1 tcp dport 179 accept

        # Drop other BGP
        tcp dport 179 drop
    }
}
```

---

## Monitoring and Observability

### Prometheus Metrics Exporter

**Complete metrics exporter:**

```python
#!/usr/bin/env python3
"""
exabgp_exporter.py - Prometheus metrics for ExaBGP
"""
import sys
import json
import time
from prometheus_client import start_http_server, Counter, Gauge, Histogram

# Metrics
bgp_session_up = Gauge('exabgp_bgp_session_up',
                       'BGP session state (1=up, 0=down)',
                       ['peer', 'local_as', 'peer_as'])

bgp_routes_announced = Counter('exabgp_routes_announced_total',
                                'Total routes announced',
                                ['peer', 'afi', 'safi'])

bgp_routes_withdrawn = Counter('exabgp_routes_withdrawn_total',
                                'Total routes withdrawn',
                                ['peer', 'afi', 'safi'])

bgp_active_routes = Gauge('exabgp_active_routes',
                          'Number of active routes',
                          ['peer', 'afi', 'safi'])

bgp_notifications = Counter('exabgp_notifications_total',
                            'BGP NOTIFICATION messages',
                            ['peer', 'code', 'subcode'])

bgp_update_processing_time = Histogram('exabgp_update_processing_seconds',
                                       'Time to process UPDATE messages',
                                       ['peer'])

# State tracking
route_counts = {}

def log(message):
    """Log to STDERR"""
    sys.stderr.write(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] {message}\n")
    sys.stderr.flush()

def handle_state(msg):
    """Process STATE messages"""
    peer = msg['neighbor']['address']['peer']
    local_as = msg['neighbor']['asn']['local']
    peer_as = msg['neighbor']['asn']['peer']
    state = msg['neighbor']['state']

    # Update metric
    value = 1 if state == 'up' else 0
    bgp_session_up.labels(peer=peer, local_as=local_as, peer_as=peer_as).set(value)

    log(f"Session {peer}: {state}")

def handle_update(msg):
    """Process UPDATE messages"""
    start_time = time.time()

    peer = msg['neighbor']['address']['peer']
    update = msg['neighbor']['message']['update']

    # Process announcements
    if 'announce' in update:
        for family, routes in update['announce'].items():
            afi, safi = family.split()[:2]  # e.g., "ipv4 unicast"

            # ExaBGP 4.x JSON groups announced NLRIs by next-hop; count the
            # NLRIs rather than the next-hop keys (assumes that layout).
            if isinstance(routes, dict):
                count = sum(len(nlris) for nlris in routes.values())
            else:
                count = len(routes)

            bgp_routes_announced.labels(peer=peer, afi=afi, safi=safi).inc(count)

            # Update active count
            key = (peer, family)
            route_counts[key] = route_counts.get(key, 0) + count
            bgp_active_routes.labels(peer=peer, afi=afi, safi=safi).set(route_counts[key])

    # Process withdrawals
    if 'withdraw' in update:
        for family, routes in update['withdraw'].items():
            count = len(routes)

            bgp_routes_withdrawn.labels(peer=peer, afi=family.split()[0],
                                       safi=family.split()[1]).inc(count)

            # Update active count
            key = (peer, family)
            route_counts[key] = max(0, route_counts.get(key, 0) - count)
            bgp_active_routes.labels(peer=peer, afi=family.split()[0],
                                    safi=family.split()[1]).set(route_counts[key])

    # Record processing time
    duration = time.time() - start_time
    bgp_update_processing_time.labels(peer=peer).observe(duration)

def handle_notification(msg):
    """Process NOTIFICATION messages"""
    peer = msg['neighbor']['address']['peer']
    notification = msg['neighbor']['message']['notification']

    code = notification.get('code', 0)
    subcode = notification.get('subcode', 0)

    bgp_notifications.labels(peer=peer, code=code, subcode=subcode).inc()

    log(f"NOTIFICATION from {peer}: code={code} subcode={subcode}")

def main():
    """Main metrics exporter"""
    # Start Prometheus HTTP server
    port = 9101
    start_http_server(port)
    log(f"Prometheus metrics server started on port {port}")

    # Process BGP messages
    while True:
        line = sys.stdin.readline()
        if not line:
            break

        try:
            msg = json.loads(line.strip())
            msg_type = msg.get('type')

            if msg_type == 'state':
                handle_state(msg)
            elif msg_type == 'update':
                handle_update(msg)
            elif msg_type == 'notification':
                handle_notification(msg)

        except Exception as e:
            log(f"Error processing message: {e}")

if __name__ == '__main__':
    main()
```

**ExaBGP configuration:**

```ini
process prometheus_exporter {
    run /usr/local/bin/exabgp_exporter.py;
    encoder json;
    receive {
        parsed;
        updates;
        neighbor-changes;
    }
}

neighbor 192.168.1.1 {
    router-id 192.168.1.2;
    local-address 192.168.1.2;
    local-as 65001;
    peer-as 65000;

    api {
        processes [ prometheus_exporter ];
    }
}
```

---

### Prometheus Scrape Config

```yaml
# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'exabgp'
    static_configs:
      - targets: ['localhost:9101']
        labels:
          instance: 'bgp-server-1'
          datacenter: 'dc1'
```

---

### Grafana Dashboard

**Key metrics to visualize:**

1. **BGP Session Status**
   - Query: `exabgp_bgp_session_up`
   - Type: Stat panel
   - Alert: Session down

2. **Route Count**
   - Query: `exabgp_active_routes`
   - Type: Graph
   - Alert: Sudden drops

3. **Announcement Rate**
   - Query: `rate(exabgp_routes_announced_total[5m])`
   - Type: Graph
   - Alert: Unusual spikes

4. **NOTIFICATION Messages**
   - Query: `rate(exabgp_notifications_total[5m])`
   - Type: Graph
   - Alert: Any notifications

---

### Health Check Monitoring

**Monitor ExaBGP process:**

```bash
#!/bin/bash
# /usr/local/bin/check_exabgp.sh

# Check if ExaBGP is running
if ! systemctl is-active --quiet exabgp; then
    echo "CRITICAL: ExaBGP is not running"
    exit 2
fi

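# Check if BGP session is established
# (`ss` prints the state column first, so filter by state and port rather than grepping ':179.*ESTAB')
if ! ss -tn state established '( sport = :179 or dport = :179 )' | grep -q ':179'; then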
    echo "WARNING: No established BGP sessions"
    exit 1
fi

echo "OK: ExaBGP running with established sessions"
exit 0
```

**Nagios/Icinga check:**

```ini
# /etc/nagios/nrpe.d/exabgp.cfg
command[check_exabgp]=/usr/local/bin/check_exabgp.sh
```

---

### Centralized Logging

**Syslog configuration:**

```bash
# ExaBGP environment variables
# (shell variable names cannot contain dots; ExaBGP also accepts the underscore form)
export exabgp_log_destination=syslog
export exabgp_log_level=INFO
export exabgp_log_rib=true
export exabgp_log_packets=false  # Disable in production (noisy)
```

**Rsyslog configuration:**

```
# /etc/rsyslog.d/exabgp.conf
if $programname == 'exabgp' then /var/log/exabgp/exabgp.log
& stop
```

**Logrotate:**

```
# /etc/logrotate.d/exabgp
/var/log/exabgp/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 0640 exabgp exabgp
    sharedscripts
    postrotate
        systemctl reload rsyslog
    endscript
}
```

---

## High Availability Architecture

### Active-Active HA

**Multiple ExaBGP instances announcing same routes:**

```
┌────────────────────────────────────────────────────────┐
│              Router (ECMP enabled)                     │
└────────────────────────────────────────────────────────┘
         ↑                ↑                ↑
         │ BGP            │ BGP            │ BGP
         │                │                │
    ┌────┴────┐      ┌────┴────┐      ┌────┴────┐
    │ ExaBGP  │      │ ExaBGP  │      │ ExaBGP  │
    │ Server1 │      │ Server2 │      │ Server3 │
    └─────────┘      └─────────┘      └─────────┘
         ↓                ↓                ↓
    ┌─────────┐      ┌─────────┐      ┌─────────┐
    │ Service │      │ Service │      │ Service │
    │ Healthy │      │ Healthy │      │ Healthy │
    └─────────┘      └─────────┘      └─────────┘
```

**Benefits:**
- ✅ No single point of failure
- ✅ Automatic load distribution (ECMP)
- ✅ Fast failover (BGP convergence)
- ✅ Horizontal scaling

**Configuration on each server:**

```ini
neighbor 192.168.1.1 {
    router-id 192.168.1.2;  # Unique per server
    local-address 192.168.1.2;  # Unique per server
    local-as 65001;
    peer-as 65000;

    family {
        ipv4 unicast;
    }

    api {
        processes [ healthcheck ];
    }
}

process healthcheck {
    run /usr/local/bin/healthcheck.py;
    encoder text;
}
```

---

### Active-Passive HA with MED

**Primary/backup failover:**

```python
#!/usr/bin/env python3
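# Primary server (low MED)
# is_healthy() is a placeholder for your own service check
SERVICE_IP = "100.10.0.100"
MED = 100  # Lower is preferred

if is_healthy():
    # flush=True: ExaBGP reads commands from a pipe, so unflushed output may be delayed
    print(f"announce route {SERVICE_IP}/32 next-hop self med {MED}", flush=True)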
```

```python
#!/usr/bin/env python3
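# Backup server (high MED)
# is_healthy() is a placeholder for your own service check
SERVICE_IP = "100.10.0.100"
MED = 200  # Higher MED = backup

if is_healthy():
    # flush=True: ExaBGP reads commands from a pipe, so unflushed output may be delayed
    print(f"announce route {SERVICE_IP}/32 next-hop self med {MED}", flush=True)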
```
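
The two fragments above differ only in the MED value, so one script can serve both roles. The following is a sketch under stated assumptions: the role names, `MED_BY_ROLE`, `build_command`, and `send` are hypothetical names introduced for illustration, and the health state is expected to come from your own check.

```python
import sys

SERVICE_IP = "100.10.0.100"
MED_BY_ROLE = {"primary": 100, "backup": 200}  # lower MED is preferred


def build_command(role: str, healthy: bool, service_ip: str = SERVICE_IP) -> str:
    """Return the ExaBGP API command for this role and health state."""
    if healthy:
        med = MED_BY_ROLE[role]
        return f"announce route {service_ip}/32 next-hop self med {med}"
    return f"withdraw route {service_ip}/32"


def send(command: str) -> None:
    """Write one command to ExaBGP and flush so it is not held in a pipe buffer."""
    sys.stdout.write(command + "\n")
    sys.stdout.flush()
```

Keeping command construction separate from I/O makes the MED logic easy to unit test without a running ExaBGP instance. Note that routers typically compare MED only between routes learned from the same neighboring AS, so both servers should use the same `local-as`.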

---

### Geographic HA

**Multi-datacenter deployment:**

```
       Internet
           │
    ┌──────┴──────┐
    │   Router    │
    └──────┬──────┘
           │ BGP sessions
    ┌──────┴──────────────┐
    │                     │
┌───▼───┐             ┌───▼───┐
│  DC1  │             │  DC2  │
│       │             │       │
│ExaBGP │             │ExaBGP │
│Service│             │Service│
└───────┘             └───────┘
  MED=100               MED=200
```

**Benefits:**
- ✅ Geographic redundancy
- ✅ Disaster recovery
- ✅ Reduced latency (users routed to nearest DC)

---

### Keepalived Integration

**Use keepalived for local HA:**

```
# /etc/keepalived/keepalived.conf
vrrp_script check_exabgp {
    script "/usr/local/bin/check_exabgp.sh"
    interval 2
    weight -20
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100

    virtual_ipaddress {
        192.168.1.100/24
    }

    track_script {
        check_exabgp
    }

    # Run ExaBGP when becoming MASTER
    notify_master "/usr/local/bin/exabgp_start.sh"
    notify_backup "/usr/local/bin/exabgp_stop.sh"
}
```

---

## Performance Tuning

### System Tuning

**Sysctl parameters:**

```bash
# /etc/sysctl.d/99-exabgp.conf

# Increase connection tracking
net.netfilter.nf_conntrack_max = 262144

# TCP tuning
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 3

# Socket buffers
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864

# Backlog
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_max_syn_backlog = 8192

# Apply with:
#   sudo sysctl -p /etc/sysctl.d/99-exabgp.conf
```

---

### ExaBGP Tuning

**Environment variables:**

```bash
# Performance tuning
# (shell variable names cannot contain dots, so use the underscore form with `export`)
export exabgp_daemon_daemonize=true  # Use false when systemd runs ExaBGP with Type=simple
export exabgp_log_packets=false  # Disable packet logging (high overhead)
export exabgp_log_parser=false   # Disable parser logging
export exabgp_tcp_bind=''        # Empty = do not listen for incoming connections
# Note: tcp.attempts is internal/debug only - not for production use
# export exabgp_tcp_attempts=0   # (4.x: tcp.once=false) 0=infinite retries

# Cache tuning
export exabgp_cache_attributes=true
export exabgp_cache_nexthops=true

# Performance mode
export exabgp_profile_enable=false  # Disable profiling
```

---

### Async Reactor (ExaBGP 6.0.0+)

ExaBGP 6.0.0 runs on an async/await-based event loop (the async reactor), which integrates with modern Python async code and may improve performance.

**Async reactor (6.0.0):**
```bash
# 6.0.0 uses async reactor (only mode)
exabgp /etc/exabgp/exabgp.conf
```

**Note**: The async reactor is the only reactor mode in ExaBGP 6.0.0, so no option is needed to enable it. Generator-based callbacks still work inside the async system for backward compatibility.

**Benefits of async reactor:**
- Modern Python async/await patterns
- Better event loop integration
- Potential performance improvements
- Comprehensive test coverage (72/72 functional tests)

**If you encounter issues:**
1. Report the issue on GitHub with detailed reproduction steps
2. Include debug output: `env exabgp.log.level=DEBUG exabgp /etc/exabgp/exabgp.conf`
3. Check for any blocking operations in your API scripts
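
One common blocking operation in API scripts is an unconditional `sys.stdin.readline()` in a program that also has to run periodic checks. The sketch below shows one way to wait for input with a timeout instead. It assumes a POSIX system (where `select` works on pipes); `read_available` is a hypothetical helper, not an ExaBGP API.

```python
import os
import select
from typing import Optional


def read_available(fd: int, timeout: float, max_bytes: int = 65536) -> Optional[bytes]:
    """Read from fd without blocking longer than `timeout` seconds.

    Returns None if nothing arrived in time, b"" on end-of-file,
    otherwise the bytes that were available.
    """
    readable, _, _ = select.select([fd], [], [], timeout)
    if not readable:
        return None
    # os.read bypasses Python's buffered reader, so no data is hidden
    # in a buffer that select() cannot see.
    return os.read(fd, max_bytes)
```

In a main loop, `read_available(sys.stdin.fileno(), CHECK_INTERVAL)` can replace both the `readline()` call and the `time.sleep()`; the caller then splits the returned bytes on newlines and exits when it receives `b""`.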

---

### API Process Optimization

**Efficient health checking:**

```python
#!/usr/bin/env python3
"""
Optimized health check - minimize overhead
"""
import sys
import time
import socket

# Configuration
SERVICE_IP = "100.10.0.100"
SERVICE_PORT = 80
CHECK_INTERVAL = 5
SOCKET_TIMEOUT = 2

# A new socket is created for every check (a TCP socket cannot be reconnected once used)
def create_socket():
    """Create configured socket"""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(SOCKET_TIMEOUT)
    # Set TCP_NODELAY for faster checks
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

def check_health_fast():
    """Fast health check"""
    sock = create_socket()
    try:
        result = sock.connect_ex(('127.0.0.1', SERVICE_PORT))
        return result == 0
    except OSError:
        return False
    finally:
        sock.close()

# State tracking (avoid redundant announcements)
announced = False

# Main loop
time.sleep(2)

while True:
    healthy = check_health_fast()

    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32\n")
        sys.stdout.flush()
        announced = False

    time.sleep(CHECK_INTERVAL)
```

---

### Resource Limits

**Systemd limits:**

```ini
[Service]
# Limit memory (MemoryMax= replaces the deprecated MemoryLimit=)
MemoryMax=512M
MemoryHigh=400M

# Limit CPU
CPUQuota=50%

# Limit file descriptors
LimitNOFILE=10000

# Limit processes
LimitNPROC=100
```

---

## Logging and Alerting

### Structured Logging

**JSON logging for easy parsing:**

```python
import json
import socket
import sys
import time

def log_json(level, event, **kwargs):
    """Structured JSON logging"""
    entry = {
        'timestamp': time.time(),
        'level': level,
        'event': event,
        'hostname': socket.gethostname(),
        **kwargs
    }
    print(json.dumps(entry), file=sys.stderr)

# Use it
log_json('INFO', 'route_announced', prefix='100.10.0.0/24', nexthop='self')
log_json('ERROR', 'health_check_failed', service='web', port=80, error='timeout')
log_json('CRITICAL', 'bgp_session_down', peer='192.168.1.1', reason='hold_timer')
```

---

### Alertmanager Integration

**Alert on BGP events:**

```yaml
# /etc/prometheus/alerts/exabgp.yml
groups:
  - name: exabgp
    interval: 30s
    rules:
      # BGP session down
      - alert: BGPSessionDown
        expr: exabgp_bgp_session_up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "BGP session down: {{ $labels.peer }}"
          description: "BGP session to {{ $labels.peer }} has been down for 2 minutes"

      # Route count dropped
      - alert: RouteCountDropped
        expr: |
          (exabgp_active_routes - exabgp_active_routes offset 5m)
          / exabgp_active_routes offset 5m < -0.5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Route count dropped >50%"

      # High notification rate
      - alert: HighNotificationRate
        expr: rate(exabgp_notifications_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High BGP NOTIFICATION rate"
          description: "Receiving {{ $value }} notifications/sec from {{ $labels.peer }}"

      # No route announcements
      - alert: NoRouteAnnouncements
        expr: rate(exabgp_routes_announced_total[10m]) == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "No routes announced in 15 minutes"
```

---

### PagerDuty Integration

**Alert on critical events:**

```python
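import os
import socket

import requests

# Assumes log_json() from the Structured Logging section, plus a
# consecutive_failures counter and SERVICE_IP maintained by the health check.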

def alert_pagerduty(severity, summary, details):
    """Send alert to PagerDuty"""
    PAGERDUTY_API_KEY = os.getenv('PAGERDUTY_API_KEY')

    event = {
        'routing_key': PAGERDUTY_API_KEY,
        'event_action': 'trigger',
        'payload': {
            'summary': summary,
            'severity': severity,
            'source': socket.gethostname(),
            'custom_details': details
        }
    }

    try:
        response = requests.post(
            'https://events.pagerduty.com/v2/enqueue',
            json=event,
            timeout=5
        )
        response.raise_for_status()
        log_json('INFO', 'pagerduty_alert_sent', summary=summary)
    except Exception as e:
        log_json('ERROR', 'pagerduty_alert_failed', error=str(e))

# Use it
if consecutive_failures > 10:
    alert_pagerduty(
        severity='critical',
        summary='ExaBGP health check failing',
        details={'failures': consecutive_failures, 'service': SERVICE_IP}
    )
```

---

## Disaster Recovery

### Backup Procedures

**Backup ExaBGP configuration:**

```bash
#!/bin/bash
# /usr/local/bin/backup_exabgp.sh

BACKUP_DIR="/var/backups/exabgp"
DATE=$(date +%Y%m%d_%H%M%S)

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Backup config
tar czf "$BACKUP_DIR/exabgp-config-$DATE.tar.gz" \
    /etc/exabgp/*.conf \
    /etc/exabgp/api/ \
    /etc/systemd/system/exabgp.service

# Backup state (if using stateful mode)
tar czf "$BACKUP_DIR/exabgp-state-$DATE.tar.gz" -C /var/lib exabgp

# Delete backups older than 30 days
find "$BACKUP_DIR" -type f -name "exabgp-*.tar.gz" -mtime +30 -delete

echo "Backup completed: $BACKUP_DIR/exabgp-config-$DATE.tar.gz"
```

**Automate with cron:**

```cron
# /etc/cron.d/exabgp-backup
0 2 * * * root /usr/local/bin/backup_exabgp.sh
```

---

### Recovery Procedures

**Restore from backup:**

```bash
#!/bin/bash
# /usr/local/bin/restore_exabgp.sh

if [ $# -ne 1 ]; then
    echo "Usage: $0 <backup-file>"
    exit 1
fi

BACKUP_FILE="$1"

# Stop ExaBGP
systemctl stop exabgp

# Restore config
tar xzf "$BACKUP_FILE" -C /

# Verify config (ExaBGP 5.x)
exabgp configuration validate /etc/exabgp/exabgp.conf

if [ $? -eq 0 ]; then
    echo "Config valid, starting ExaBGP"
    systemctl start exabgp
else
    echo "ERROR: Invalid config, ExaBGP not started"
    exit 1
fi
```

---

### Disaster Recovery Plan

**Document DR procedures:**

```markdown
# ExaBGP Disaster Recovery Plan

## Scenario 1: ExaBGP Process Crash
**Detection:** Systemd alert, Prometheus metric `up{job="exabgp"}==0`
**Impact:** Routes withdrawn, traffic stops
**Recovery:**
1. Check systemd status: `systemctl status exabgp`
2. Check logs: `journalctl -u exabgp -n 100`
3. Restart: `systemctl restart exabgp`
4. Verify: `exabgpcli show neighbor`

## Scenario 2: BGP Session Down
**Detection:** `exabgp_bgp_session_up==0`
**Impact:** Routes not advertised
**Recovery:**
1. Check peer reachability: `ping <peer-ip>`
2. Check firewall: `iptables -L -n | grep 179`
3. Verify config: `grep <peer-ip> /etc/exabgp/exabgp.conf`
4. Check peer logs
5. Restart session: `systemctl restart exabgp`

## Scenario 3: Server Failure
**Detection:** Host unreachable
**Impact:** Routes withdrawn, traffic fails over to backup
**Recovery:**
1. Verify failover occurred
2. Monitor backup server load
3. Replace/repair failed server
4. Test before returning to service
5. Gradual traffic shift back

## Scenario 4: Configuration Error
**Detection:** ExaBGP fails to start
**Impact:** No BGP announcements
**Recovery:**
1. Validate config: `exabgp configuration validate /etc/exabgp/exabgp.conf` (5.x)
2. Check syntax errors in logs
3. Restore from backup: `/usr/local/bin/restore_exabgp.sh`
4. Test restored config
5. Start ExaBGP

## Scenario 5: API Program Crash Loop
**Detection:** Repeated restarts in logs
**Impact:** Inconsistent route announcements
**Recovery:**
1. Check API program logs
2. Test API program standalone
3. Disable API program temporarily
4. Fix bug
5. Deploy fixed version
6. Re-enable
```

---

## Configuration Management

### Version Control

**Store configs in Git:**

```bash
# Initialize repo
cd /etc/exabgp
git init
git add *.conf api/
git commit -m "Initial ExaBGP configuration"

# Add remote
git remote add origin git@github.com:yourorg/exabgp-configs.git
git push -u origin master
```

**Automated deployment:**

```bash
#!/bin/bash
# /usr/local/bin/deploy_exabgp_config.sh

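# Pull latest config, remembering the current commit for rollback
cd /etc/exabgp || exit 1
PREVIOUS_COMMIT=$(git rev-parse HEAD)
git pull origin master

# Validate config (ExaBGP 5.x)
exabgp configuration validate /etc/exabgp/exabgp.conf

if [ $? -eq 0 ]; then
    # Reload ExaBGP (requires an ExecReload= line in the unit file)
    systemctl reload exabgp
    echo "Configuration deployed successfully"
else
    # Roll back to the commit that was deployed before the pull
    git reset --hard "$PREVIOUS_COMMIT"
    echo "ERROR: Invalid configuration, rolled back"
    exit 1
fi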
```

---

### Ansible Automation

**Ansible playbook:**

```yaml
---
# playbooks/exabgp.yml
- name: Deploy ExaBGP
  hosts: bgp_servers
  become: yes

  vars:
    exabgp_version: "4.2.25"
    service_ip: "100.10.0.100"
    peer_ip: "192.168.1.1"
    local_as: 65001
    peer_as: 65000

  tasks:
    - name: Install ExaBGP
      pip:
        name: "exabgp=={{ exabgp_version }}"
        state: present

    - name: Create exabgp user
      user:
        name: exabgp
        system: yes
        shell: /bin/false
        home: /var/lib/exabgp

    - name: Create directories
      file:
        path: "{{ item }}"
        state: directory
        owner: exabgp
        group: exabgp
        mode: 0750
      loop:
        - /etc/exabgp
        - /etc/exabgp/api
        - /var/log/exabgp
        - /var/lib/exabgp

    - name: Deploy configuration
      template:
        src: templates/exabgp.conf.j2
        dest: /etc/exabgp/exabgp.conf
        owner: exabgp
        group: exabgp
        mode: 0640
      notify: restart exabgp

    - name: Deploy API programs
      copy:
        src: "{{ item }}"
        dest: /etc/exabgp/api/
        owner: root
        group: root
        mode: 0755
      loop:
        - files/healthcheck.py
        - files/exporter.py
      notify: restart exabgp

    - name: Deploy systemd service
      template:
        src: templates/exabgp.service.j2
        dest: /etc/systemd/system/exabgp.service
        mode: 0644
      notify:
        - reload systemd
        - restart exabgp

    - name: Enable ExaBGP service
      systemd:
        name: exabgp
        enabled: yes
        state: started

  handlers:
    - name: reload systemd
      systemd:
        daemon_reload: yes

    - name: restart exabgp
      systemd:
        name: exabgp
        state: restarted
```

---

## Real-World Deployment Patterns

### Pattern 1: Anycast DNS

**Use case:** Global DNS service with anycast IPs

```python
#!/usr/bin/env python3
"""
DNS anycast health check
"""
import sys
import time
import dns.resolver  # provided by the dnspython package

SERVICE_IP = "1.1.1.1"  # Anycast IP
DNS_PORT = 53
TEST_QUERY = "example.com"

def check_dns():
    """Check if DNS resolver is working"""
    try:
        resolver = dns.resolver.Resolver()
        resolver.nameservers = ['127.0.0.1']
        resolver.timeout = 2
        resolver.lifetime = 2

        answer = resolver.resolve(TEST_QUERY, 'A')
        return len(answer) > 0
    except Exception:
        return False

announced = False
time.sleep(2)

while True:
    healthy = check_dns()

    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32\n")
        sys.stdout.flush()
        announced = False

    time.sleep(5)
```

---

### Pattern 2: DDoS Scrubbing Center

**Use case:** Redirect attack traffic to scrubbing center via FlowSpec

```python
#!/usr/bin/env python3
"""
DDoS mitigation with FlowSpec
"""
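import sys
import time

SCRUBBING_VRF = "65001:999"

def log(message):
    """Log to STDERR (STDOUT is reserved for ExaBGP commands)"""
    sys.stderr.write(f"{message}\n")
    sys.stderr.flush()

def announce_flowspec_block(src_prefix, dst_port, protocol):
    """Announce FlowSpec rule that redirects matching traffic to the scrubbing VRF"""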
    rule = (
        f"announce flow route {{ "
        f"match {{ source {src_prefix}; destination-port ={dst_port}; protocol ={protocol}; }} "
        f"then {{ redirect {SCRUBBING_VRF}; }} "
        f"}}"
    )
    sys.stdout.write(rule + "\n")
    sys.stdout.flush()

def detect_attack():
    """Detect DDoS attack (integrate with your IDS)"""
    # Example: Read from IDS output
    # Return (source, dest_port, protocol) if attack detected
    return None

time.sleep(2)

while True:
    attack = detect_attack()

    if attack:
        src, port, proto = attack
        announce_flowspec_block(src, port, proto)
        log(f"Blocked {src} to port {port}")

    time.sleep(1)
```

---

### Pattern 3: Multi-Tier Load Balancing

**Facebook/Meta Katran pattern:**

```
┌─────────────────────────────────────────────────┐
│           Border Router (ECMP)                  │
└─────────────────┬───────────────────────────────┘
                  │
       ┌──────────┼──────────┐
       ▼          ▼          ▼
  ┌─────────┐ ┌─────────┐ ┌─────────┐
  │ ExaBGP  │ │ ExaBGP  │ │ ExaBGP  │
  │ + L4LB  │ │ + L4LB  │ │ + L4LB  │
  │ (XDP)   │ │ (XDP)   │ │ (XDP)   │
  └────┬────┘ └────┬────┘ └────┬────┘
       │           │           │
  ┌────▼───────────▼───────────▼────┐
  │      Backend Servers (ECMP)     │
  └──────────────────────────────────┘
```

---

## Testing Strategies

### Integration Testing

**Test BGP session establishment:**

```bash
#!/bin/bash
# test_bgp_session.sh

# Validate ExaBGP configuration (ExaBGP 5.x)
timeout 30 exabgp configuration validate /etc/exabgp/exabgp.conf

if [ $? -eq 0 ]; then
    echo "✓ Configuration valid"
else
    echo "✗ Configuration invalid"
    exit 1
fi

# Start ExaBGP
systemctl start exabgp
sleep 5

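# Check BGP session
# (`ss` prints the state column first, so filter by state and port rather than grepping ':179.*ESTAB')
if ss -tn state established '( sport = :179 or dport = :179 )' | grep -q ':179'; then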
    echo "✓ BGP session established"
else
    echo "✗ BGP session not established"
    systemctl stop exabgp
    exit 1
fi

# Verify route announcement
if exabgpcli show adj-rib out | grep -q '100.10.0.0/24'; then
    echo "✓ Route announced"
else
    echo "✗ Route not announced"
    systemctl stop exabgp
    exit 1
fi

echo "All tests passed"
```

---

### Load Testing

**Simulate high route count:**

```python
#!/usr/bin/env python3
"""
Load test - announce many routes
"""
import sys
import time

# Announce 10,000 routes
time.sleep(2)

for i in range(10000):
    prefix = f"100.{i // 256}.{i % 256}.0/24"
    sys.stdout.write(f"announce route {prefix} next-hop self\n")

    if i % 100 == 0:
        sys.stdout.flush()
        time.sleep(0.1)  # Rate limit

sys.stdout.flush()

# Keep running
while True:
    time.sleep(60)
```
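
The prefix formula in the script above produces valid addresses only while `i // 256` stays at or below 255, which caps it at 65,536 routes. For larger or differently sized tests, the standard-library `ipaddress` module can enumerate subnets directly. This is a sketch; `generate_prefixes` and the `10.0.0.0/8` supernet are assumptions chosen for illustration.

```python
import ipaddress
from itertools import islice
from typing import List


def generate_prefixes(count: int, supernet: str = "10.0.0.0/8", new_prefix: int = 24) -> List[str]:
    """Return `count` unique, consecutive subnets of `supernet` as strings."""
    network = ipaddress.ip_network(supernet)
    available = 2 ** (new_prefix - network.prefixlen)
    if count > available:
        raise ValueError(f"{supernet} only contains {available} /{new_prefix} subnets")
    # subnets() is a generator, so only `count` networks are materialized
    return [str(net) for net in islice(network.subnets(new_prefix=new_prefix), count)]
```

Each returned string can be dropped into the same `announce route {prefix} next-hop self` command used in the load test.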

---

## Capacity Planning

### Route Capacity

**Estimate memory usage:**

```
Memory per route (IPv4 unicast):
- Route: ~100 bytes
- Attributes: ~200 bytes
Total: ~300 bytes per route

Example:
- 100,000 routes = 30 MB
- 1,000,000 routes = 300 MB

Add 100 MB for ExaBGP overhead
Add 50 MB per API process

Total for 1M routes with 2 API processes:
300 + 100 + 100 = 500 MB
```
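
The estimate above is simple enough to encode as a helper for capacity spreadsheets or pre-deployment checks. The sketch uses only the figures stated above (about 300 bytes per route, 100 MB base overhead, 50 MB per API process) and decimal megabytes, matching the worked example; `estimate_memory_mb` is a hypothetical name and the result is a rough planning figure, not a measurement.

```python
BYTES_PER_ROUTE = 300        # ~100 bytes route + ~200 bytes attributes
BASE_OVERHEAD_MB = 100       # ExaBGP itself
PER_API_PROCESS_MB = 50      # each API process


def estimate_memory_mb(routes: int, api_processes: int = 0) -> float:
    """Rough memory estimate in MB (1 MB = 1,000,000 bytes)."""
    route_mb = routes * BYTES_PER_ROUTE / 1_000_000
    return route_mb + BASE_OVERHEAD_MB + api_processes * PER_API_PROCESS_MB
```

For one million routes and two API processes this reproduces the 500 MB total shown above.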

**System requirements:**

| Routes | RAM | CPU |
|--------|-----|-----|
| 1,000 | 256 MB | 1 core |
| 10,000 | 512 MB | 1 core |
| 100,000 | 1 GB | 2 cores |
| 1,000,000 | 2 GB | 4 cores |

---

### BGP Session Limits

**Sessions per server:**

```
ExaBGP can handle:
- 100+ BGP sessions per server (tested)
- Limited by CPU and network bandwidth
- Use separate ExaBGP instances for isolation
```

---

## Operational Procedures

### Deployment Checklist

**Before deploying to production:**

- [ ] Configuration validated with `exabgp configuration validate` (5.x)
- [ ] MD5 authentication configured
- [ ] Firewall rules applied
- [ ] Monitoring configured (Prometheus)
- [ ] Alerts configured (Alertmanager/PagerDuty)
- [ ] Logs centralized (syslog)
- [ ] Backups automated (cron)
- [ ] DR procedures documented
- [ ] Runbooks created
- [ ] Team trained
- [ ] Load testing completed
- [ ] Failover tested
- [ ] Rollback plan tested

---

### Runbook: BGP Session Troubleshooting

````markdown
# Runbook: BGP Session Not Establishing

## Symptoms
- `exabgp_bgp_session_up==0`
- No routes announced
- Log shows: "connection refused" or "timeout"

## Diagnosis

### 1. Check ExaBGP Status
```bash
systemctl status exabgp
journalctl -u exabgp -n 50
```

### 2. Check Network Connectivity
```bash
ping <peer-ip>
traceroute <peer-ip>
```

### 3. Check BGP Port
```bash
# From ExaBGP server
telnet <peer-ip> 179

# Check listening
ss -tlnp | grep 179
```

### 4. Check Firewall
```bash
iptables -L -n -v | grep 179
```

### 5. Verify Configuration
```bash
grep <peer-ip> /etc/exabgp/exabgp.conf
exabgp configuration validate /etc/exabgp/exabgp.conf
```

## Resolution

### If peer unreachable:
- Check network path
- Verify peer is running
- Check firewall on both sides

### If connection refused:
- Verify peer is listening on 179
- Check peer configuration
- Verify MD5 password matches

### If connection timeout:
- Check firewall rules
- Verify routing to peer
- Check MTU/MSS issues

## Escalation
If issue persists after 15 minutes:
1. Page network team
2. Check peer router logs
3. Open vendor support ticket
````

---

## Troubleshooting

### Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Routes not announced | API program not running | Check process status |
| Route flapping | No hysteresis in health check | Add consecutive check threshold |
| High CPU usage | Too many routes or verbose logging | Reduce announced routes, enable attribute/next-hop caching (`exabgp.cache.*`), disable packet logging |
| Memory leak | API program not cleaning up | Fix resource management |
| BGP session flapping | Network issues or MD5 mismatch | Check logs, verify auth |

---

## See Also

- **[API Overview](API-Overview)** - API architecture
- **[Writing API Programs](Writing-API-Programs)** - Program development
- **[Error Handling](Error-Handling)** - Error handling strategies
- **[Service High Availability](Service-High-Availability)** - HA patterns
- **[Monitoring](Monitoring)** - Monitoring guide
- **[Debugging](Debugging)** - Debugging techniques

---