# Production Best Practices

**Comprehensive guide to deploying ExaBGP in production environments**

---

## Table of Contents

- [Introduction](#introduction)
- [Security Hardening](#security-hardening)
- [Monitoring and Observability](#monitoring-and-observability)
- [High Availability Architecture](#high-availability-architecture)
- [Performance Tuning](#performance-tuning)
- [Logging and Alerting](#logging-and-alerting)
- [Disaster Recovery](#disaster-recovery)
- [Configuration Management](#configuration-management)
- [Real-World Deployment Patterns](#real-world-deployment-patterns)
- [Testing Strategies](#testing-strategies)
- [Capacity Planning](#capacity-planning)
- [Operational Procedures](#operational-procedures)
- [Troubleshooting](#troubleshooting)

---

## Introduction

Deploying ExaBGP in production requires careful attention to security, reliability, and operational excellence. This guide provides battle-tested patterns from real-world deployments.

### Production Readiness Checklist

**Before going to production:**

- ✅ Security hardening applied
- ✅ Monitoring and alerting configured
- ✅ HA architecture tested
- ✅ Disaster recovery plan documented
- ✅ Logging centralized
- ✅ Runbooks created
- ✅ Load testing completed
- ✅ Rollback procedures tested

**Critical reminder:**

> 🔴 **ExaBGP does NOT manipulate RIB/FIB** - ExaBGP is a pure BGP protocol speaker. When your API programs announce/withdraw routes, ExaBGP sends BGP messages. The **router** installs/removes routes in its RIB/FIB. ExaBGP never touches routing tables directly.
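
To make the reminder above concrete, here is a minimal sketch of what an API program actually does: it only writes text commands to standard output. ExaBGP turns those lines into BGP messages, and the peer router decides what to install. The helper names (`announce_command`, `withdraw_command`, `send`) are hypothetical and chosen for illustration; the command strings follow the `announce route ... next-hop self` form used in the health-check examples in this guide.

```python
#!/usr/bin/env python3
"""Sketch: an API program emits text commands; it never edits routing tables."""
import sys


def announce_command(prefix, next_hop="self"):
    """Build the text command asking ExaBGP to advertise a prefix."""
    return f"announce route {prefix} next-hop {next_hop}"


def withdraw_command(prefix):
    """Build the text command asking ExaBGP to withdraw a prefix."""
    return f"withdraw route {prefix}"


def send(command, stream=sys.stdout):
    """Write one command per line and flush so ExaBGP sees it immediately."""
    stream.write(command + "\n")
    stream.flush()


if __name__ == "__main__":
    # ExaBGP reads these lines from the program's stdout and sends BGP messages;
    # the peer router installs or removes the route in its own RIB/FIB.
    send(announce_command("100.10.0.100/32"))
    send(withdraw_command("100.10.0.100/32"))
```

Keeping command construction separate from the write makes the strings easy to unit test without a running ExaBGP instance.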
---

## Security Hardening

### Process Isolation

**Run ExaBGP as dedicated user:**

```bash
# Create exabgp user
sudo useradd -r -s /bin/false -d /var/lib/exabgp exabgp

# Set ownership
sudo chown -R exabgp:exabgp /etc/exabgp
sudo chown -R exabgp:exabgp /var/log/exabgp
sudo chown -R exabgp:exabgp /var/lib/exabgp

# Restrict permissions
sudo chmod 750 /etc/exabgp
sudo chmod 640 /etc/exabgp/*.conf
```

---

### Systemd Hardening

**Systemd service with security restrictions:**

```ini
[Unit]
Description=ExaBGP BGP Speaker
After=network.target
Documentation=https://github.com/Exa-Networks/exabgp/wiki

[Service]
Type=simple
User=exabgp
Group=exabgp
ExecStart=/usr/local/bin/exabgp /etc/exabgp/exabgp.conf
# SIGUSR1 asks ExaBGP to reload its configuration (needed for "systemctl reload")
ExecReload=/bin/kill -USR1 $MAINPID
Restart=on-failure
RestartSec=5s

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/log/exabgp /var/lib/exabgp
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictRealtime=true
RestrictNamespaces=true
LockPersonality=true
MemoryDenyWriteExecute=true
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources
SystemCallErrorNumber=EPERM

# Resource limits (MemoryMax replaces the deprecated MemoryLimit)
LimitNOFILE=65536
LimitNPROC=100
MemoryMax=512M
CPUQuota=100%

[Install]
WantedBy=multi-user.target
```

**Enable and start:**

```bash
sudo systemctl daemon-reload
sudo systemctl enable exabgp
sudo systemctl start exabgp
```

---

### BGP Authentication

**MD5 authentication (recommended for production):**

```ini
neighbor 192.168.1.1 {
    router-id 192.168.1.2;
    local-address 192.168.1.2;
    local-as 65001;
    peer-as 65000;

    # MD5 authentication
    md5-password "your-strong-password-here";

    # TTL security (GTSM)
    incoming-ttl 255;  # Only accept packets with TTL=255

    family {
        ipv4 unicast;
        ipv4 flow;
    }
}
```

**Generate strong passwords:**

```bash
# Generate random 32-character password
openssl rand -base64 24
```

**Store passwords securely:**

```bash
# Use environment variables
export BGP_MD5_PASSWORD="$(cat /etc/exabgp/secrets/bgp_password)"

# Reference in config (ExaBGP 4.x+)
neighbor 192.168.1.1 {
    md5-password env['BGP_MD5_PASSWORD'];
}
```

---

### API Security

**Restrict API programs:**

```ini
process healthcheck {
    # Use absolute paths
    run /usr/local/bin/exabgp-healthcheck.py;

    # Set working directory
    working-directory /var/lib/exabgp;

    # Limit environment
    env {
        SERVICE_IP = '100.10.0.100';
        CHECK_INTERVAL = '5';
    }

    encoder text;
}
```

**Validate API program permissions:**

```bash
# API programs should be owned by root, not writable by exabgp
sudo chown root:root /usr/local/bin/exabgp-healthcheck.py
sudo chmod 755 /usr/local/bin/exabgp-healthcheck.py

# Prevent tampering
sudo chattr +i /usr/local/bin/exabgp-healthcheck.py  # Make immutable
```

---

### Network Security

**Firewall rules (iptables):**

```bash
# Allow BGP from specific peers only
sudo iptables -A INPUT -p tcp --dport 179 -s 192.168.1.1 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 179 -j DROP

# Allow established connections
sudo iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

# Save rules (run the redirect as root too; "sudo cmd > file" opens the file as the calling user)
sudo sh -c 'iptables-save > /etc/iptables/rules.v4'
```

**Firewall rules (nftables):**

```bash
# /etc/nftables.conf
table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;

        # Allow established connections
        ct state established,related accept

        # Allow BGP from specific peer
        ip saddr 192.168.1.1 tcp dport 179 accept

        # Drop other BGP
        tcp dport 179 drop
    }
}
```

---

## Monitoring and Observability

### Prometheus Metrics Exporter

**Complete metrics exporter:**

```python
#!/usr/bin/env python3
"""
exabgp_exporter.py - Prometheus metrics for ExaBGP
"""
import sys
import json
import time
from prometheus_client import start_http_server, Counter, Gauge, Histogram

# Metrics
bgp_session_up = Gauge('exabgp_bgp_session_up', 'BGP session state (1=up, 0=down)',
                       ['peer', 'local_as', 'peer_as'])
bgp_routes_announced = Counter('exabgp_routes_announced_total', 'Total routes announced',
                               ['peer', 'afi', 'safi'])
bgp_routes_withdrawn = Counter('exabgp_routes_withdrawn_total', 'Total routes withdrawn',
                               ['peer', 'afi', 'safi'])
bgp_active_routes = Gauge('exabgp_active_routes', 'Number of active routes',
                          ['peer', 'afi', 'safi'])
bgp_notifications = Counter('exabgp_notifications_total', 'BGP NOTIFICATION messages',
                            ['peer', 'code', 'subcode'])
bgp_update_processing_time = Histogram('exabgp_update_processing_seconds',
                                       'Time to process UPDATE messages', ['peer'])

# State tracking
route_counts = {}


def log(message):
    """Log to STDERR"""
    sys.stderr.write(f"[{time.strftime('%Y-%m-%d %H:%M:%S')}] {message}\n")
    sys.stderr.flush()


def count_routes(routes):
    """Count NLRIs. Assumption: announcements may arrive grouped by next-hop
    (a dict of lists), while withdrawals arrive as a plain list."""
    if isinstance(routes, dict):
        return sum(len(nlris) for nlris in routes.values())
    return len(routes)


def handle_state(msg):
    """Process STATE messages"""
    peer = msg['neighbor']['address']['peer']
    local_as = msg['neighbor']['asn']['local']
    peer_as = msg['neighbor']['asn']['peer']
    state = msg['neighbor']['state']

    # Update metric
    value = 1 if state == 'up' else 0
    bgp_session_up.labels(peer=peer, local_as=local_as, peer_as=peer_as).set(value)

    log(f"Session {peer}: {state}")


def handle_update(msg):
    """Process UPDATE messages"""
    start_time = time.time()

    peer = msg['neighbor']['address']['peer']
    update = msg['neighbor']['message']['update']

    # Process announcements
    if 'announce' in update:
        for family, routes in update['announce'].items():
            afi_safi = family  # e.g., "ipv4 unicast"
            count = count_routes(routes)
            bgp_routes_announced.labels(peer=peer, afi=family.split()[0],
                                        safi=family.split()[1]).inc(count)

            # Update active count
            key = (peer, afi_safi)
            route_counts[key] = route_counts.get(key, 0) + count
            bgp_active_routes.labels(peer=peer, afi=family.split()[0],
                                     safi=family.split()[1]).set(route_counts[key])

    # Process withdrawals
    if 'withdraw' in update:
        for family, routes in update['withdraw'].items():
            count = count_routes(routes)
            bgp_routes_withdrawn.labels(peer=peer, afi=family.split()[0],
                                        safi=family.split()[1]).inc(count)

            # Update active count
            key = (peer, family)
            route_counts[key] = max(0, route_counts.get(key, 0) - count)
            bgp_active_routes.labels(peer=peer,
                                     afi=family.split()[0],
                                     safi=family.split()[1]).set(route_counts[key])

    # Record processing time
    duration = time.time() - start_time
    bgp_update_processing_time.labels(peer=peer).observe(duration)


def handle_notification(msg):
    """Process NOTIFICATION messages"""
    peer = msg['neighbor']['address']['peer']
    notification = msg['neighbor']['message']['notification']
    code = notification.get('code', 0)
    subcode = notification.get('subcode', 0)

    bgp_notifications.labels(peer=peer, code=code, subcode=subcode).inc()

    log(f"NOTIFICATION from {peer}: code={code} subcode={subcode}")


def main():
    """Main metrics exporter"""
    # Start Prometheus HTTP server
    port = 9101
    start_http_server(port)
    log(f"Prometheus metrics server started on port {port}")

    # Process BGP messages
    while True:
        line = sys.stdin.readline()
        if not line:
            break

        try:
            msg = json.loads(line.strip())
            msg_type = msg.get('type')

            if msg_type == 'state':
                handle_state(msg)
            elif msg_type == 'update':
                handle_update(msg)
            elif msg_type == 'notification':
                handle_notification(msg)

        except Exception as e:
            log(f"Error processing message: {e}")


if __name__ == '__main__':
    main()
```

**ExaBGP configuration:**

```ini
process prometheus_exporter {
    run /usr/local/bin/exabgp_exporter.py;
    encoder json;
    receive {
        parsed;
        updates;
        neighbor-changes;
    }
}

neighbor 192.168.1.1 {
    router-id 192.168.1.2;
    local-address 192.168.1.2;
    local-as 65001;
    peer-as 65000;

    api {
        processes [ prometheus_exporter ];
    }
}
```

---

### Prometheus Scrape Config

```yaml
# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'exabgp'
    static_configs:
      - targets: ['localhost:9101']
        labels:
          instance: 'bgp-server-1'
          datacenter: 'dc1'
```

---

### Grafana Dashboard

**Key metrics to visualize:**

1. **BGP Session Status**
   - Query: `exabgp_bgp_session_up`
   - Type: Stat panel
   - Alert: Session down

2. **Route Count**
   - Query: `exabgp_active_routes`
   - Type: Graph
   - Alert: Sudden drops

3. **Announcement Rate**
   - Query: `rate(exabgp_routes_announced_total[5m])`
   - Type: Graph
   - Alert: Unusual spikes

4. **NOTIFICATION Messages**
   - Query: `rate(exabgp_notifications_total[5m])`
   - Type: Graph
   - Alert: Any notifications

---

### Health Check Monitoring

**Monitor ExaBGP process:**

```bash
#!/bin/bash
# /usr/local/bin/check_exabgp.sh

# Check if ExaBGP is running
if ! systemctl is-active --quiet exabgp; then
    echo "CRITICAL: ExaBGP is not running"
    exit 2
fi

# Check if BGP session is established
# (ss prints the state column before the addresses, so match ESTAB first)
if ! ss -tn | grep -qE 'ESTAB.*:179( |$)'; then
    echo "WARNING: No established BGP sessions"
    exit 1
fi

echo "OK: ExaBGP running with established sessions"
exit 0
```

**Nagios/Icinga check:**

```ini
# /etc/nagios/nrpe.d/exabgp.cfg
command[check_exabgp]=/usr/local/bin/check_exabgp.sh
```

---

### Centralized Logging

**Syslog configuration:**

```bash
# ExaBGP environment variables
# (underscore form: exported shell variable names cannot contain dots)
export exabgp_log_destination=syslog
export exabgp_log_level=INFO
export exabgp_log_rib=true
export exabgp_log_packets=false  # Disable in production (noisy)
```

**Rsyslog configuration:**

```
# /etc/rsyslog.d/exabgp.conf
if $programname == 'exabgp' then /var/log/exabgp/exabgp.log
& stop
```

**Logrotate:**

```
# /etc/logrotate.d/exabgp
/var/log/exabgp/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 0640 exabgp exabgp
    sharedscripts
    postrotate
        systemctl reload rsyslog
    endscript
}
```

---

## High Availability Architecture

### Active-Active HA

**Multiple ExaBGP instances announcing same routes:**

```
┌────────────────────────────────────────────────────────┐
│                 Router (ECMP enabled)                  │
└────────────────────────────────────────────────────────┘
          ↑                  ↑                  ↑
          │ BGP              │ BGP              │ BGP
          │                  │                  │
     ┌────┴────┐        ┌────┴────┐        ┌────┴────┐
     │ ExaBGP  │        │ ExaBGP  │        │ ExaBGP  │
     │ Server1 │        │ Server2 │        │ Server3 │
     └─────────┘        └─────────┘        └─────────┘
          ↓                  ↓                  ↓
     ┌─────────┐        ┌─────────┐        ┌─────────┐
     │ Service │        │ Service │        │ Service │
     │ Healthy │        │ Healthy │        │ Healthy │
     └─────────┘        └─────────┘        └─────────┘
```

**Benefits:**

- ✅ No
single point of failure
- ✅ Automatic load distribution (ECMP)
- ✅ Fast failover (BGP convergence)
- ✅ Horizontal scaling

**Configuration on each server:**

```ini
neighbor 192.168.1.1 {
    router-id 192.168.1.2;      # Unique per server
    local-address 192.168.1.2;  # Unique per server
    local-as 65001;
    peer-as 65000;

    family {
        ipv4 unicast;
    }

    api {
        processes [ healthcheck ];
    }
}

process healthcheck {
    run /usr/local/bin/healthcheck.py;
    encoder text;
}
```

---

### Active-Passive HA with MED

**Primary/backup failover:**

```python
#!/usr/bin/env python3
# Primary server (low MED)
SERVICE_IP = "100.10.0.100"
MED = 100  # Lower is preferred


def is_healthy():
    """Placeholder: replace with a real service check."""
    return True


if is_healthy():
    # flush=True so ExaBGP receives the command immediately
    print(f"announce route {SERVICE_IP}/32 next-hop self med {MED}", flush=True)
```

```python
#!/usr/bin/env python3
# Backup server (high MED)
SERVICE_IP = "100.10.0.100"
MED = 200  # Higher MED = backup


def is_healthy():
    """Placeholder: replace with a real service check."""
    return True


if is_healthy():
    # flush=True so ExaBGP receives the command immediately
    print(f"announce route {SERVICE_IP}/32 next-hop self med {MED}", flush=True)
```

---

### Geographic HA

**Multi-datacenter deployment:**

```
            Internet
                │
         ┌──────┴──────┐
         │   Router    │
         └──────┬──────┘
                │ BGP sessions
         ┌──────┴──────────────┐
         │                     │
     ┌───▼───┐             ┌───▼───┐
     │  DC1  │             │  DC2  │
     │       │             │       │
     │ExaBGP │             │ExaBGP │
     │Service│             │Service│
     └───────┘             └───────┘
      MED=100               MED=200
```

**Benefits:**

- ✅ Geographic redundancy
- ✅ Disaster recovery
- ✅ Reduced latency (users routed to nearest DC)

---

### Keepalived Integration

**Use keepalived for local HA:**

```
# /etc/keepalived/keepalived.conf
vrrp_script check_exabgp {
    script "/usr/local/bin/check_exabgp.sh"
    interval 2
    weight -20
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100

    virtual_ipaddress {
        192.168.1.100/24
    }

    track_script {
        check_exabgp
    }

    # Run ExaBGP when becoming MASTER
    notify_master "/usr/local/bin/exabgp_start.sh"
    notify_backup "/usr/local/bin/exabgp_stop.sh"
}
```

---

## Performance Tuning

### System Tuning

**Sysctl parameters:**

```bash
# /etc/sysctl.d/99-exabgp.conf

# Increase connection tracking
net.netfilter.nf_conntrack_max = 262144

# TCP tuning
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 3

# Socket buffers
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864

# Backlog
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_max_syn_backlog = 8192

# Apply with: sudo sysctl -p /etc/sysctl.d/99-exabgp.conf
```

---

### ExaBGP Tuning

**Environment variables:**

```bash
# Performance tuning
# (underscore form: exported shell variable names cannot contain dots)
export exabgp_daemon_daemonize=true
export exabgp_log_packets=false   # Disable packet logging (high overhead)
export exabgp_log_parser=false    # Disable parser logging
export exabgp_tcp_bind=''         # Empty = do not listen for incoming connections

# Note: tcp.attempts is internal/debug only - not for production use
# export exabgp_tcp_attempts=0    # (4.x: tcp.once=false) 0=infinite retries

# Cache tuning
export exabgp_cache_attributes=true
export exabgp_cache_nexthops=true

# Performance mode
export exabgp_profile_enable=false  # Disable profiling
```

---

### Async Reactor (ExaBGP 6.0.0+)

ExaBGP 6.0.0 uses an async/await-based event loop as its reactor, with the goal of better performance and modern Python integration.

**Async reactor (6.0.0):**

```bash
# 6.0.0 uses async reactor (only mode)
exabgp /etc/exabgp/exabgp.conf
```

**Note**: ExaBGP 6.0.0 uses the async reactor as the only mode. Generator-based callbacks still work inside the async system for backward compatibility.

**Benefits of async reactor:**

- Modern Python async/await patterns
- Better event loop integration
- Potential performance improvements
- Comprehensive test coverage (72/72 functional tests)

**If you encounter issues:**

1. Report the issue on GitHub with detailed reproduction steps
2. Include debug output: `env exabgp.log.level=DEBUG exabgp /etc/exabgp/exabgp.conf`
3. Check for any blocking operations in your API scripts

---

### API Process Optimization

**Efficient health checking:**

```python
#!/usr/bin/env python3
"""
Optimized health check - minimize overhead
"""
import sys
import time
import socket

# Configuration
SERVICE_IP = "100.10.0.100"
SERVICE_PORT = 80
CHECK_INTERVAL = 5
SOCKET_TIMEOUT = 2


# A TCP socket cannot be reconnected after use, so create a fresh one per check
def create_socket():
    """Create configured socket"""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(SOCKET_TIMEOUT)
    # Set TCP_NODELAY for faster checks
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def check_health_fast():
    """Fast health check"""
    sock = create_socket()
    try:
        result = sock.connect_ex(('127.0.0.1', SERVICE_PORT))
        return result == 0
    except OSError:
        return False
    finally:
        sock.close()


# State tracking (avoid redundant announcements)
announced = False

# Main loop
time.sleep(2)

while True:
    healthy = check_health_fast()

    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32\n")
        sys.stdout.flush()
        announced = False

    time.sleep(CHECK_INTERVAL)
```

---

### Resource Limits

**Systemd limits:**

```ini
[Service]
# Limit memory (MemoryMax replaces the deprecated MemoryLimit)
MemoryMax=512M
MemoryHigh=400M

# Limit CPU
CPUQuota=50%

# Limit file descriptors
LimitNOFILE=10000

# Limit processes
LimitNPROC=100
```

---

## Logging and Alerting

### Structured Logging

**JSON logging for easy parsing:**

```python
import json
import socket
import sys
import time


def log_json(level, event, **kwargs):
    """Structured JSON logging"""
    entry = {
        'timestamp': time.time(),
        'level': level,
        'event': event,
        'hostname': socket.gethostname(),
        **kwargs
    }
    # STDERR only: STDOUT is reserved for commands sent to ExaBGP
    print(json.dumps(entry), file=sys.stderr)


# Use it
log_json('INFO', 'route_announced', prefix='100.10.0.0/24', nexthop='self')
log_json('ERROR', 'health_check_failed', service='web', port=80, error='timeout')
log_json('CRITICAL',
         'bgp_session_down', peer='192.168.1.1', reason='hold_timer')
```

---

### Alertmanager Integration

**Alert on BGP events:**

```yaml
# /etc/prometheus/alerts/exabgp.yml
groups:
  - name: exabgp
    interval: 30s
    rules:
      # BGP session down
      - alert: BGPSessionDown
        expr: exabgp_bgp_session_up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "BGP session down: {{ $labels.peer }}"
          description: "BGP session to {{ $labels.peer }} has been down for 2 minutes"

      # Route count dropped
      - alert: RouteCountDropped
        expr: |
          (exabgp_active_routes - exabgp_active_routes offset 5m)
          / exabgp_active_routes offset 5m < -0.5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Route count dropped >50%"

      # High notification rate
      - alert: HighNotificationRate
        expr: rate(exabgp_notifications_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High BGP NOTIFICATION rate"
          description: "Receiving {{ $value }} notifications/sec from {{ $labels.peer }}"

      # No route announcements
      - alert: NoRouteAnnouncements
        expr: rate(exabgp_routes_announced_total[10m]) == 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "No routes announced in 15 minutes"
```

---

### PagerDuty Integration

**Alert on critical events:**

```python
import os
import socket

import requests

# log_json is the helper defined in the Structured Logging section


def alert_pagerduty(severity, summary, details):
    """Send alert to PagerDuty"""
    PAGERDUTY_API_KEY = os.getenv('PAGERDUTY_API_KEY')

    event = {
        'routing_key': PAGERDUTY_API_KEY,
        'event_action': 'trigger',
        'payload': {
            'summary': summary,
            'severity': severity,
            'source': socket.gethostname(),
            'custom_details': details
        }
    }

    try:
        response = requests.post(
            'https://events.pagerduty.com/v2/enqueue',
            json=event,
            timeout=5
        )
        response.raise_for_status()
        log_json('INFO', 'pagerduty_alert_sent', summary=summary)
    except Exception as e:
        log_json('ERROR', 'pagerduty_alert_failed', error=str(e))


# Use it (consecutive_failures and SERVICE_IP come from your health check loop)
if consecutive_failures > 10:
    alert_pagerduty(
        severity='critical',
        summary='ExaBGP health check failing',
        details={'failures': consecutive_failures,
                 'service': SERVICE_IP}
    )
```

---

## Disaster Recovery

### Backup Procedures

**Backup ExaBGP configuration:**

```bash
#!/bin/bash
# /usr/local/bin/backup_exabgp.sh

BACKUP_DIR="/var/backups/exabgp"
DATE=$(date +%Y%m%d_%H%M%S)

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Backup config
tar czf "$BACKUP_DIR/exabgp-config-$DATE.tar.gz" \
    /etc/exabgp/*.conf \
    /etc/exabgp/api/ \
    /etc/systemd/system/exabgp.service

# Backup state (if using stateful mode)
cp -a /var/lib/exabgp "$BACKUP_DIR/exabgp-state-$DATE"

# Remove backups older than 30 days (rm -rf because state backups are directories)
find "$BACKUP_DIR" -maxdepth 1 -name "exabgp-*" -mtime +30 -exec rm -rf {} +

echo "Backup completed: $BACKUP_DIR/exabgp-config-$DATE.tar.gz"
```

**Automate with cron:**

```cron
# /etc/cron.d/exabgp-backup
0 2 * * * root /usr/local/bin/backup_exabgp.sh
```

---

### Recovery Procedures

**Restore from backup:**

```bash
#!/bin/bash
# /usr/local/bin/restore_exabgp.sh

if [ $# -ne 1 ]; then
    echo "Usage: $0 "
    exit 1
fi

BACKUP_FILE="$1"

# Stop ExaBGP
systemctl stop exabgp

# Restore config
tar xzf "$BACKUP_FILE" -C /

# Verify config (ExaBGP 5.x)
if exabgp configuration validate /etc/exabgp/exabgp.conf; then
    echo "Config valid, starting ExaBGP"
    systemctl start exabgp
else
    echo "ERROR: Invalid config, ExaBGP not started"
    exit 1
fi
```

---

### Disaster Recovery Plan

**Document DR procedures:**

```markdown
# ExaBGP Disaster Recovery Plan

## Scenario 1: ExaBGP Process Crash

**Detection:** Systemd alert, Prometheus metric `up{job="exabgp"}==0`
**Impact:** Routes withdrawn, traffic stops
**Recovery:**
1. Check systemd status: `systemctl status exabgp`
2. Check logs: `journalctl -u exabgp -n 100`
3. Restart: `systemctl restart exabgp`
4. Verify: `exabgpcli show neighbor`

## Scenario 2: BGP Session Down

**Detection:** `exabgp_bgp_session_up==0`
**Impact:** Routes not advertised
**Recovery:**
1. Check peer reachability: `ping `
2. Check firewall: `iptables -L -n | grep 179`
3. Verify config: `grep /etc/exabgp/exabgp.conf`
4. Check peer logs
5. Restart session: `systemctl restart exabgp`

## Scenario 3: Server Failure

**Detection:** Host unreachable
**Impact:** Routes withdrawn, traffic fails over to backup
**Recovery:**
1. Verify failover occurred
2. Monitor backup server load
3. Replace/repair failed server
4. Test before returning to service
5. Gradual traffic shift back

## Scenario 4: Configuration Error

**Detection:** ExaBGP fails to start
**Impact:** No BGP announcements
**Recovery:**
1. Validate config: `exabgp configuration validate /etc/exabgp/exabgp.conf` (5.x)
2. Check syntax errors in logs
3. Restore from backup: `/usr/local/bin/restore_exabgp.sh`
4. Test restored config
5. Start ExaBGP

## Scenario 5: API Program Crash Loop

**Detection:** Repeated restarts in logs
**Impact:** Inconsistent route announcements
**Recovery:**
1. Check API program logs
2. Test API program standalone
3. Disable API program temporarily
4. Fix bug
5. Deploy fixed version
6. Re-enable
```

---

## Configuration Management

### Version Control

**Store configs in Git:**

```bash
# Initialize repo
cd /etc/exabgp
git init
git add *.conf api/
git commit -m "Initial ExaBGP configuration"

# Add remote
git remote add origin git@github.com:yourorg/exabgp-configs.git
git push -u origin master
```

**Automated deployment:**

```bash
#!/bin/bash
# /usr/local/bin/deploy_exabgp_config.sh

cd /etc/exabgp || exit 1

# Remember the current commit so a failed deploy rolls back to exactly this state
PREVIOUS_COMMIT=$(git rev-parse HEAD)

# Pull latest config
git pull origin master

# Validate config (ExaBGP 5.x)
if exabgp configuration validate /etc/exabgp/exabgp.conf; then
    # Reload ExaBGP
    systemctl reload exabgp
    echo "Configuration deployed successfully"
else
    # Rollback (HEAD^ would be wrong if the pull brought in zero or several commits)
    git reset --hard "$PREVIOUS_COMMIT"
    echo "ERROR: Invalid configuration, rolled back"
    exit 1
fi
```

---

### Ansible Automation

**Ansible playbook:**

```yaml
---
# playbooks/exabgp.yml
- name: Deploy ExaBGP
  hosts: bgp_servers
  become: yes
  vars:
    exabgp_version: "4.2.25"
    service_ip: "100.10.0.100"
    peer_ip: "192.168.1.1"
    local_as: 65001
    peer_as: 65000

  tasks:
    - name: Install ExaBGP
      pip:
        name: "exabgp=={{ exabgp_version }}"
        state: present

    - name: Create exabgp user
      user:
        name: exabgp
        system: yes
        shell: /bin/false
        home: /var/lib/exabgp

    - name: Create directories
      file:
        path: "{{ item }}"
        state: directory
        owner: exabgp
        group: exabgp
        mode: "0750"
      loop:
        - /etc/exabgp
        - /etc/exabgp/api
        - /var/log/exabgp
        - /var/lib/exabgp

    - name: Deploy configuration
      template:
        src: templates/exabgp.conf.j2
        dest: /etc/exabgp/exabgp.conf
        owner: exabgp
        group: exabgp
        mode: "0640"
      notify: restart exabgp

    - name: Deploy API programs
      copy:
        src: "{{ item }}"
        dest: /etc/exabgp/api/
        owner: root
        group: root
        mode: "0755"
      loop:
        - files/healthcheck.py
        - files/exporter.py
      notify: restart exabgp

    - name: Deploy systemd service
      template:
        src: templates/exabgp.service.j2
        dest: /etc/systemd/system/exabgp.service
        mode: "0644"
      notify:
        - reload systemd
        - restart exabgp

    - name: Enable ExaBGP service
      systemd:
        name: exabgp
        enabled: yes
        state: started

  handlers:
    - name: reload systemd
      systemd:
        daemon_reload: yes

    - name: restart exabgp
      systemd:
        name: exabgp
        state: restarted
```

---

## Real-World Deployment Patterns

### Pattern 1: Anycast DNS

**Use case:** Global DNS service with anycast IPs

```python
#!/usr/bin/env python3
"""
DNS anycast health check
"""
import sys
import time
import dns.resolver

SERVICE_IP = "1.1.1.1"  # Anycast IP
DNS_PORT = 53
TEST_QUERY = "example.com"


def check_dns():
    """Check if DNS resolver is working"""
    try:
        resolver = dns.resolver.Resolver()
        resolver.nameservers = ['127.0.0.1']
        resolver.timeout = 2
        resolver.lifetime = 2

        answer = resolver.resolve(TEST_QUERY, 'A')
        return len(answer) > 0
    except Exception:
        return False


announced = False
time.sleep(2)

while True:
    healthy = check_dns()

    if healthy and not announced:
        sys.stdout.write(f"announce route {SERVICE_IP}/32 next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {SERVICE_IP}/32\n")
        sys.stdout.flush()
        announced = False

    time.sleep(5)
```

---

### Pattern 2: DDoS Scrubbing Center

**Use case:** Redirect attack traffic to scrubbing center via FlowSpec

```python
#!/usr/bin/env python3
"""
DDoS mitigation with FlowSpec
"""
import sys
import time

SCRUBBING_VRF = "65001:999"


def log(message):
    """Log to STDERR (STDOUT is reserved for ExaBGP commands)"""
    sys.stderr.write(message + "\n")
    sys.stderr.flush()


def announce_flowspec_block(src_prefix, dst_port, protocol):
    """Announce FlowSpec rule redirecting matching traffic to the scrubbing VRF"""
    rule = (
        f"announce flow route {{ "
        f"match {{ source {src_prefix}; destination-port ={dst_port}; protocol ={protocol}; }} "
        f"then {{ redirect {SCRUBBING_VRF}; }} "
        f"}}"
    )
    sys.stdout.write(rule + "\n")
    sys.stdout.flush()


def detect_attack():
    """Detect DDoS attack (integrate with your IDS)"""
    # Example: Read from IDS output
    # Return (source, dest_port, protocol) if attack detected
    return None


time.sleep(2)

while True:
    attack = detect_attack()
    if attack:
        src, port, proto = attack
        announce_flowspec_block(src, port, proto)
        log(f"Redirected {src} to port {port} for scrubbing")

    time.sleep(1)
```

---

### Pattern 3: Multi-Tier Load Balancing

**Facebook/Meta Katran pattern:**

```
┌─────────────────────────────────────────────────┐
│              Border Router (ECMP)               │
└──────────────────┬──────────────────────────────┘
                   │
       ┌───────────┼───────────┐
       ▼           ▼           ▼
  ┌─────────┐ ┌─────────┐ ┌─────────┐
  │ ExaBGP  │ │ ExaBGP  │ │ ExaBGP  │
  │ + L4LB  │ │ + L4LB  │ │ + L4LB  │
  │  (XDP)  │ │  (XDP)  │ │  (XDP)  │
  └────┬────┘ └────┬────┘ └────┬────┘
       │           │           │
  ┌────▼───────────▼───────────▼────┐
  │     Backend Servers (ECMP)      │
  └─────────────────────────────────┘
```

---

## Testing Strategies

### Integration Testing

**Test BGP session establishment:**

```bash
#!/bin/bash
# test_bgp_session.sh

# Validate ExaBGP configuration (ExaBGP 5.x)
if timeout 30 exabgp configuration validate /etc/exabgp/exabgp.conf; then
    echo "✓ Configuration valid"
else
    echo "✗ Configuration invalid"
    exit 1
fi

# Start ExaBGP
systemctl start exabgp
sleep 5

# Check BGP session (ss prints the state column before the addresses)
if ss -tn | grep -qE 'ESTAB.*:179( |$)'; then
    echo "✓ BGP session established"
else
    echo "✗ BGP session not established"
    systemctl stop exabgp
    exit 1
fi

# Verify route announcement (-F: match the prefix literally, not as a regex)
if exabgpcli show adj-rib out | grep -qF '100.10.0.0/24'; then
    echo "✓ Route announced"
else
    echo "✗ Route not announced"
    systemctl stop exabgp
    exit 1
fi

echo "All tests passed"
```

---

### Load Testing

**Simulate high route count:**

```python
#!/usr/bin/env python3
"""
Load test - announce many routes
"""
import sys
import time

# Announce 10,000 routes
time.sleep(2)

for i in range(10000):
    prefix = f"100.{i // 256}.{i % 256}.0/24"
    sys.stdout.write(f"announce route {prefix} next-hop self\n")

    if i % 100 == 0:
        sys.stdout.flush()
        time.sleep(0.1)  # Rate limit

sys.stdout.flush()

# Keep running
while True:
    time.sleep(60)
```

---

## Capacity Planning

### Route Capacity

**Estimate memory usage:**

```
Memory per route (IPv4 unicast):
- Route: ~100 bytes
- Attributes: ~200 bytes
Total: ~300 bytes per route

Example:
- 100,000 routes = 30 MB
- 1,000,000 routes = 300 MB

Add 100 MB for ExaBGP overhead
Add 50 MB per API process

Total for 1M routes with 2 API processes:
300 + 100 + 100 = 500 MB
```

**System requirements:**

| Routes | RAM | CPU |
|--------|-----|-----|
| 1,000 | 256 MB | 1 core |
| 10,000 | 512 MB | 1 core |
| 100,000 | 1 GB | 2 cores |
| 1,000,000 | 2 GB | 4 cores |

---

### BGP Session Limits

**Sessions per server:**

```
ExaBGP can handle:
- 100+ BGP sessions per server (tested)
- Limited by CPU and network bandwidth
- Use separate ExaBGP instances for isolation
```

---

## Operational Procedures

### Deployment Checklist

**Before deploying to production:**

- [ ] Configuration validated with
`exabgp configuration validate` (5.x)
- [ ] MD5 authentication configured
- [ ] Firewall rules applied
- [ ] Monitoring configured (Prometheus)
- [ ] Alerts configured (Alertmanager/PagerDuty)
- [ ] Logs centralized (syslog)
- [ ] Backups automated (cron)
- [ ] DR procedures documented
- [ ] Runbooks created
- [ ] Team trained
- [ ] Load testing completed
- [ ] Failover tested
- [ ] Rollback plan tested

---

### Runbook: BGP Session Troubleshooting

````markdown
# Runbook: BGP Session Not Establishing

## Symptoms
- `exabgp_bgp_session_up==0`
- No routes announced
- Log shows: "connection refused" or "timeout"

## Diagnosis

### 1. Check ExaBGP Status
```bash
systemctl status exabgp
journalctl -u exabgp -n 50
```

### 2. Check Network Connectivity
```bash
ping
traceroute
```

### 3. Check BGP Port
```bash
# From ExaBGP server
telnet 179

# Check listening
ss -tlnp | grep 179
```

### 4. Check Firewall
```bash
iptables -L -n -v | grep 179
```

### 5. Verify Configuration
```bash
grep /etc/exabgp/exabgp.conf
exabgp configuration validate /etc/exabgp/exabgp.conf
```

## Resolution

### If peer unreachable:
- Check network path
- Verify peer is running
- Check firewall on both sides

### If connection refused:
- Verify peer is listening on 179
- Check peer configuration

### If connection timeout:
- Check firewall rules
- Verify routing to peer
- Verify MD5 password matches (segments with a wrong TCP MD5 signature are dropped silently)
- Check MTU/MSS issues

## Escalation
If issue persists after 15 minutes:
1. Page network team
2. Check peer router logs
3. Open vendor support ticket
````

---

## Troubleshooting

### Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Routes not announced | API program not running | Check process status |
| Route flapping | No hysteresis in health check | Add consecutive check threshold |
| High CPU usage | Too many routes | Optimize, add caching |
| Memory leak | API program not cleaning up | Fix resource management |
| BGP session flapping | Network issues or MD5 mismatch | Check logs, verify auth |

---

## See Also

- **[API Overview](API-Overview)** - API architecture
- **[Writing API Programs](Writing-API-Programs)** - Program development
- **[Error Handling](Error-Handling)** - Error handling strategies
- **[Service High Availability](Service-High-Availability)** - HA patterns
- **[Monitoring](Monitoring)** - Monitoring guide
- **[Debugging](Debugging)** - Debugging techniques

---
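
**Appendix: hysteresis sketch.** The "Add consecutive check threshold" fix listed for route flapping under Common Issues can be sketched as follows. This is an illustrative example and not part of ExaBGP: the `Hysteresis` class and its `rise`/`fall` parameters are hypothetical names. The state only flips after several identical results in a row, so a single failed probe does not withdraw the route.

```python
class Hysteresis:
    """Flip state only after `rise` consecutive successes or `fall` consecutive failures."""

    def __init__(self, rise=3, fall=2, initially_up=False):
        self.rise = rise
        self.fall = fall
        self.up = initially_up
        self._streak = 0  # consecutive results that disagree with the current state

    def update(self, check_ok):
        """Feed one health-check result; return True if the state changed."""
        if check_ok == self.up:
            self._streak = 0  # result agrees with the current state: reset the streak
            return False
        self._streak += 1
        needed = self.rise if check_ok else self.fall
        if self._streak >= needed:
            self.up = check_ok
            self._streak = 0
            return True
        return False


if __name__ == "__main__":
    state = Hysteresis(rise=3, fall=2)
    # Simulated probe results; in a real program these come from the health check
    for result in [True, True, False, True, True, True, False, False]:
        if state.update(result):
            if state.up:
                print("announce route 100.10.0.100/32 next-hop self", flush=True)
            else:
                print("withdraw route 100.10.0.100/32", flush=True)
```

Using a higher `rise` than `fall` makes the service slow to come back and quick to leave, which is usually the safer bias for anycast addresses.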