Database Intelligence Collector - Operations Runbook

Table of Contents

  1. Overview
  2. Startup Procedures
  3. Health Monitoring
  4. Common Issues and Solutions
  5. Performance Tuning
  6. Emergency Procedures
  7. Maintenance Operations
  8. Troubleshooting Guide

Overview

The Database Intelligence Collector is an OpenTelemetry-based monitoring solution for PostgreSQL and MySQL databases. This runbook provides operational procedures for managing the collector in production.

Key Components

  • Data Collection: PostgreSQL/MySQL receivers collecting metrics and query data
  • Processing Pipeline: Custom processors for sampling, enrichment, and safety
  • Export: OTLP export to New Relic with optional Prometheus metrics

Critical Metrics to Monitor

  • Collector memory usage (target: <512MB)
  • Pipeline throughput (records/second)
  • Circuit breaker status (per database)
  • Sampling rates and dropped records
  • Export success rates
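A quick spot-check of these indicators against the self-telemetry endpoint (a sketch; the circuit breaker and sampler gauges come from this collector's custom components, so their exact metric names may differ in your build):

# One-pass check of memory, throughput, and export health
curl -s http://localhost:8888/metrics | \
  grep -E 'process_resident_memory_bytes|otelcol_exporter_sent|otelcol_exporter_send_failed'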

Startup Procedures

Pre-flight Checks

# 1. Verify environment variables
./scripts/validate-env.sh

# 2. Test database connectivity
./scripts/test-connections.sh

# 3. Validate configuration
otelcol validate --config=config/collector-production.yaml

# 4. Check resource limits
ulimit -n  # Should be at least 65536
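# If the limit is lower, raise it before starting (raising the hard
# limit may require root or an /etc/security/limits.conf entry)
ulimit -n 65536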

Normal Startup

# 1. Start collector with production config
otelcol --config=config/collector-production.yaml \
        --set=service.telemetry.logs.level=info

# 2. Verify startup (wait 30 seconds)
curl -s http://localhost:13133/health/ready | jq .

# 3. Check metrics endpoint
curl -s http://localhost:8888/metrics | grep otelcol_

Startup Validation

# Check all components are healthy
curl -s http://localhost:13133/health/ready | jq '.components | keys'

# Expected output:
# ["adaptive_sampler", "circuit_breaker", "planattributeextractor", "verification"]

# Verify data flow
curl -s http://localhost:8888/metrics | grep -E 'otelcol_processor_processed|otelcol_exporter_sent'

Health Monitoring

Health Check Endpoints

Endpoint              Purpose                  Expected Response
:13133/health/live    Kubernetes liveness      200 OK
:13133/health/ready   Kubernetes readiness     200 OK if healthy
:13133/health         Detailed health status   JSON with component status
:8888/metrics         Prometheus metrics       Prometheus exposition format
:55679/debug/tracez   zPages trace debugging   HTML trace viewer
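To probe the HTTP endpoints in one pass (a sketch using the ports and paths from the table above):

for ep in 13133/health/live 13133/health/ready 13133/health 8888/metrics; do
  printf '%s -> ' "$ep"
  curl -s -o /dev/null -w '%{http_code}\n' "http://localhost:$ep"
done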

Key Health Indicators

# Check overall health
curl -s http://localhost:13133/health | jq '.healthy'

# Check memory usage
curl -s http://localhost:8888/metrics | grep 'process_resident_memory_bytes'

# Check pipeline status
curl -s http://localhost:13133/health | jq '.pipeline_status'

# Check circuit breaker states
curl -s http://localhost:13133/health | jq '.components.circuit_breaker.metrics'

Monitoring Dashboard Queries

# Collector health score (0=unhealthy, 1=healthy)
up{job="otel-collector"}

# Memory usage percentage
100 * (process_resident_memory_bytes / otelcol_process_memory_limit)

# Processing rate (metric pipelines; this collector ships metric points, not spans)
rate(otelcol_processor_accepted_metric_points[5m])

# Error rate
rate(otelcol_exporter_send_failed_metric_points[5m]) / rate(otelcol_exporter_sent_metric_points[5m])

# Circuit breaker trips
increase(otelcol_circuitbreaker_trips_total[1h])

Common Issues and Solutions

Issue: High Memory Usage

Symptoms: Memory usage >75% of limit, potential OOM kills

Diagnosis:

# Check current memory usage
curl -s http://localhost:8888/metrics | grep -E 'go_memstats_alloc_bytes|process_resident_memory'

# Check cache sizes
curl -s http://localhost:13133/health | jq '.components.adaptive_sampler.metrics.cache_size'

# Check batch processor queue
curl -s http://localhost:8888/metrics | grep 'otelcol_processor_batch_batch_size'

Solutions:

  1. Reduce cache sizes:

    adaptive_sampler:
      deduplication:
        cache_size: 5000  # Reduced from 10000
  2. Adjust batch settings:

    batch:
      send_batch_size: 500  # Reduced from 1000
      timeout: 5s          # Reduced from 10s
  3. Enable memory limiter backpressure:

    memory_limiter:
      limit_percentage: 65  # Reduced from 75
      spike_limit_percentage: 15  # Reduced from 20

Issue: Circuit Breaker Frequently Opening

Symptoms: Databases showing as unhealthy, data collection stops

Diagnosis:

# Check circuit breaker status
curl -s http://localhost:13133/health | jq '.components.circuit_breaker.metrics.database_states'

# Check receiver throughput (a stall here often tracks a slow database)
curl -s http://localhost:8888/metrics | grep 'otelcol_receiver_accepted_metric_points'

# Review logs for errors
journalctl -u otel-collector -n 100 | grep "circuit.*open"

Solutions:

  1. Increase circuit breaker thresholds:

    circuit_breaker:
      failure_threshold: 10  # Increased from 5
      open_state_timeout: 60s  # Increased from 30s
  2. Reduce query load:

    postgresql:
      collection_interval: 60s  # Increased from 30s
  3. Check database performance:

    -- Check for long-running queries
    SELECT pid, now() - pg_stat_activity.query_start AS duration, query 
    FROM pg_stat_activity 
    WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';

Issue: Export Failures to New Relic

Symptoms: Data not appearing in New Relic, export errors in logs

Diagnosis:

# Check export metrics
curl -s http://localhost:8888/metrics | grep -E 'otelcol_exporter_sent|otelcol_exporter_send_failed'

# Check for cardinality errors
journalctl -u otel-collector -n 100 | grep -i "cardinality\|NrIntegrationError"

# Verify API key against the OTLP/HTTP endpoint (port 4318; 4317 is gRPC and won't answer plain HTTP)
curl -X POST https://otlp.nr-data.net:4318/v1/metrics \
  -H "Api-Key: $NEW_RELIC_LICENSE_KEY" \
  -H "Content-Type: application/x-protobuf"

Solutions:

  1. Enable aggressive sampling:

    adaptive_sampler:
      default_sample_rate: 0.01  # Very aggressive sampling
      rules:
        - name: high_cardinality_protection
          conditions:
            - attribute: unique_values
              operator: gt
              value: 1000
          sample_rate: 0.001
  2. Add cardinality reduction:

    transform:
      metric_statements:
        - context: metric
          statements:
            - truncate_all(attributes, 100)  # Limit attribute length
            - delete_key(attributes, "query_text") where attributes["sensitive"] == true
  3. Enable export retry with backoff:

    otlp/newrelic:
      retry_on_failure:
        enabled: true
        initial_interval: 30s
        max_interval: 300s
        max_elapsed_time: 900s

Performance Tuning

Baseline Performance Targets

Metric               Target    Warning   Critical
Memory Usage         <256MB    >384MB    >450MB
CPU Usage            <20%      >50%      >80%
Processing Latency   <5ms      >10ms     >50ms
Export Queue Size    <1000     >5000     >10000
Sampling Rate        Dynamic   <0.001    0
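The memory targets can be checked mechanically. A minimal sketch that reads resident memory from the self-telemetry endpoint and compares it to the warning/critical thresholds above:

# Warn at 384MB, alert at 450MB (values from the table above)
rss=$(curl -s http://localhost:8888/metrics | awk '/^process_resident_memory_bytes/ {print int($2)}')
if   [ "$rss" -gt $((450*1024*1024)) ]; then echo "CRITICAL: RSS ${rss}B"
elif [ "$rss" -gt $((384*1024*1024)) ]; then echo "WARNING: RSS ${rss}B"
else echo "OK: RSS ${rss}B"
fi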

Optimization Procedures

1. Query Optimization

# Reduce query frequency for non-critical databases
sqlquery/analytics:
  collection_interval: 300s  # 5 minutes for analytics DB
  
sqlquery/production:
  collection_interval: 30s   # 30 seconds for production

2. Sampling Optimization

adaptive_sampler:
  # Environment-specific sampling
  environment_overrides:
    production:
      slow_query_threshold_ms: 2000
      max_records_per_second: 500
      rules:
        - name: business_critical
          conditions:
            - attribute: db.name
              operator: eq
              value: "orders"
          sample_rate: 1.0  # Always sample critical DB

3. Batch Processing Optimization

batch:
  # Optimize for throughput vs latency
  send_batch_size: ${env:BATCH_SIZE:-1000}
  send_batch_max_size: ${env:BATCH_MAX_SIZE:-2000}
  timeout: ${env:BATCH_TIMEOUT:-10s}
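These defaults can be overridden per environment at startup without editing the file, for example:

# Higher-throughput profile for a busy host (values illustrative)
BATCH_SIZE=2000 BATCH_MAX_SIZE=4000 BATCH_TIMEOUT=5s \
  otelcol --config=config/collector-production.yaml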

Load Testing Procedure

# 1. Generate test load
./scripts/load-test.sh --duration=300 --rps=1000

# 2. Monitor during test
watch -n 1 'curl -s http://localhost:8888/metrics | grep -E "rate|latency|memory"'

# 3. Analyze results
./scripts/analyze-load-test.sh --output=load-test-report.html

Emergency Procedures

Emergency Stop

# Graceful shutdown (allows pipeline drain)
kill -TERM $(pgrep otelcol)

# Force stop if graceful fails after 30s
kill -KILL $(pgrep otelcol)
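A one-shot version that waits up to 30 seconds for a clean exit before forcing (a sketch; assumes GNU tail):

pid=$(pgrep otelcol)
kill -TERM "$pid"
# tail --pid exits when the process dies; timeout enforces the 30s grace period
timeout 30 tail --pid="$pid" -f /dev/null || kill -KILL "$pid"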

Circuit Breaker Manual Control

# Open circuit breaker for specific database
curl -X POST http://localhost:13133/circuit_breaker/open \
  -d '{"database": "production_primary"}'

# Close circuit breaker
curl -X POST http://localhost:13133/circuit_breaker/close \
  -d '{"database": "production_primary"}'

Data Pipeline Bypass

# Emergency config with minimal processing
service:
  pipelines:
    metrics/emergency:
      receivers: [postgresql]
      processors: [memory_limiter, batch]  # Skip custom processors
      exporters: [otlp/newrelic]
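To switch over, save the emergency pipeline as its own config and restart against it (the file name is illustrative):

# Start with the minimal emergency pipeline
otelcol --config=config/collector-emergency.yaml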

Rollback Procedure

# 1. Keep last known good config
cp config/collector-production.yaml config/collector-production.backup.yaml

# 2. Rollback if needed
otelcol --config=config/collector-production.backup.yaml

# 3. Verify rollback success
curl -s http://localhost:13133/health/ready

Maintenance Operations

Configuration Updates

# 1. Validate new configuration
otelcol validate --config=config/collector-new.yaml

# 2. Test in dry-run mode
otelcol --config=config/collector-new.yaml --dry-run

# 3. Apply configuration with zero downtime
./scripts/rolling-update.sh --config=config/collector-new.yaml

Cache Maintenance

# Clear adaptive sampler cache
curl -X POST http://localhost:13133/cache/clear \
  -d '{"component": "adaptive_sampler"}'

# Clear plan extractor cache
curl -X POST http://localhost:13133/cache/clear \
  -d '{"component": "plan_extractor"}'

Log Rotation

# Configure log rotation (quote the heredoc delimiter so $(pgrep ...) is
# written literally and expanded by logrotate at rotation time, not now)
cat > /etc/logrotate.d/otel-collector << 'EOF'
/var/log/otel/collector.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    create 0644 otel otel
    postrotate
        kill -USR1 $(pgrep otelcol) > /dev/null 2>&1 || true
    endscript
}
EOF
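Verify the rotation config with logrotate's debug mode, which parses the file and reports what it would do without actually rotating:

logrotate -d /etc/logrotate.d/otel-collector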

Troubleshooting Guide

Debug Mode Activation

# Enable debug logging
otelcol --config=config/collector-production.yaml \
        --set=service.telemetry.logs.level=debug

# Enable specific component debugging
export GODEBUG=gctrace=1  # GC debugging
export OTEL_RESOURCE_ATTRIBUTES="debug.mode=true"

Common Debug Commands

# Check goroutine count (memory leaks); quote URLs so the shell doesn't glob '?'
curl -s 'http://localhost:1777/debug/pprof/goroutine?debug=1' | grep goroutine

# CPU profiling
curl -s 'http://localhost:1777/debug/pprof/profile?seconds=30' > cpu.prof
go tool pprof cpu.prof

# Heap profiling
curl -s http://localhost:1777/debug/pprof/heap > heap.prof
go tool pprof heap.prof

# Trace analysis
curl -s http://localhost:55679/debug/tracez

Log Analysis Patterns

# Find processing errors
grep -E "ERROR.*processor" /var/log/otel/collector.log

# Find export failures
grep -E "failed.*export|export.*failed" /var/log/otel/collector.log

# Find circuit breaker events
grep -E "circuit.*open|circuit.*close" /var/log/otel/collector.log

# Find memory pressure events
grep -E "memory.*limit|OOM|allocation.*failed" /var/log/otel/collector.log

Recovery Procedures

From Memory Pressure

  1. Reduce batch sizes
  2. Clear caches
  3. Increase memory limit
  4. Enable aggressive sampling
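A sketch combining these steps, using the cache-clear endpoint and --set overrides shown earlier (the cache endpoint is specific to this collector's health extension):

# Clear the heaviest cache, then restart with a smaller batch size
curl -X POST http://localhost:13133/cache/clear -d '{"component": "adaptive_sampler"}'
otelcol --config=config/collector-production.yaml \
        --set=processors.batch.send_batch_size=500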

From Export Failures

  1. Check New Relic API status
  2. Verify credentials
  3. Reduce cardinality
  4. Enable local buffering
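A quick triage sketch for the first three steps (the credentials probe mirrors the OTLP/HTTP check in the export-failures section above):

# Recent export errors, then a credentials probe against New Relic
journalctl -u otel-collector -n 200 | grep -iE 'export.*failed|NrIntegrationError'
# A 403 here typically indicates a bad or wrong-region license key
curl -s -o /dev/null -w '%{http_code}\n' -X POST https://otlp.nr-data.net:4318/v1/metrics \
  -H "Api-Key: $NEW_RELIC_LICENSE_KEY" -H "Content-Type: application/x-protobuf"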

From Database Overload

  1. Increase collection intervals
  2. Reduce query complexity
  3. Enable circuit breakers
  4. Add connection pooling
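Collection intervals can be raised without a config edit via --set (a sketch; receiver names as used in this runbook's configs):

# Halve the query load on the primary while investigating
otelcol --config=config/collector-production.yaml \
        --set=receivers.postgresql.collection_interval=120s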

Appendix: Quick Reference

Environment Variables

# Required
NEW_RELIC_LICENSE_KEY    # New Relic API key
POSTGRES_HOST/PORT/USER  # PostgreSQL connection
MYSQL_HOST/PORT/USER     # MySQL connection

# Tuning
SLOW_QUERY_THRESHOLD     # Slow query threshold (ms)
MAX_RECORDS_PER_SECOND   # Rate limit
BATCH_SIZE               # Batch processor size
MEMORY_LIMIT_MB          # Memory limit

# Debugging
OTEL_LOG_LEVEL          # Log level (debug/info/warn/error)
GODEBUG                 # Go runtime debug flags
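A typical session setup before startup (variable names from the list above; values illustrative):

export NEW_RELIC_LICENSE_KEY="<license-key>"
export POSTGRES_HOST=db.internal POSTGRES_PORT=5432 POSTGRES_USER=otel_monitor
export OTEL_LOG_LEVEL=info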

Useful Aliases

alias otel-health='curl -s http://localhost:13133/health | jq .'
alias otel-metrics='curl -s http://localhost:8888/metrics'
alias otel-logs='journalctl -u otel-collector -f'
alias otel-restart='systemctl restart otel-collector'

Support Contacts