
Monitoring and Logging

This guide covers logging, basic health checks, and the monitoring capabilities of the Open Resource Broker.

Overview

The Open Resource Broker provides basic monitoring capabilities through:

  • Application Logging: Detailed operation logs
  • Health Checks: Basic system health monitoring
  • Error Tracking: Error detection and logging
  • Operation Tracking: Request and machine lifecycle logging

Logging

Log Configuration

Configure logging in your config.json:

{
  "logging": {
    "level": "INFO",
    "file_path": "logs/app.log",
    "console_enabled": true
  }
}
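A sketch of how these keys could map onto Python's `logging` module (the mapping, the format string, and the `/tmp` path are illustrative assumptions, not the broker's actual wiring):

```shell
# Hypothetical mapping of the config keys above onto Python's logging module
python3 - <<'EOF'
import logging

cfg = {"level": "INFO", "file_path": "/tmp/app.log", "console_enabled": True}

handlers = [logging.FileHandler(cfg["file_path"])]
if cfg["console_enabled"]:
    handlers.append(logging.StreamHandler())

logging.basicConfig(
    level=getattr(logging, cfg["level"]),
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    handlers=handlers,
)
logging.getLogger("RequestService").info("logging configured")
EOF
```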

Log Levels

  • DEBUG: Detailed diagnostic information
  • INFO: General operational information
  • WARNING: Warning messages for potential issues
  • ERROR: Error conditions
  • CRITICAL: Critical errors that may cause failures

Log Format

The application uses structured logging:

2025-06-30 10:00:00,123 INFO [RequestService] Request created successfully request_id=req-123 template_id=template-1 machine_count=3
2025-06-30 10:00:01,456 ERROR [AWSProvider] Failed to provision machine error=InvalidParameterValue request_id=req-123
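The key=value tail makes individual fields easy to pull out with standard tools; for example, on a single line copied from the format above:

```shell
# Extract the component name and a key=value field from one log line
line='2025-06-30 10:00:00,123 INFO [RequestService] Request created successfully request_id=req-123 template_id=template-1 machine_count=3'

component=$(echo "$line" | grep -o '\[[^]]*\]' | tr -d '[]')
request_id=$(echo "$line" | grep -o 'request_id=[^ ]*' | cut -d= -f2)
echo "$component $request_id"
# prints: RequestService req-123
```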

Log Analysis

Common Log Patterns

Request Lifecycle:

# Track request from creation to completion
grep "req-123" logs/app.log | grep -E "(created|status|completed)"

Error Analysis:

# Count errors per component
grep "ERROR" logs/app.log | grep -o '\[[^]]*\]' | sort | uniq -c

# Find recent errors
tail -100 logs/app.log | grep "ERROR"
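To see which component produces the most errors, group on the bracketed component name. A self-contained illustration against a small synthetic log (the log contents here are hypothetical):

```shell
# Build a tiny sample log and count ERROR lines per component
cat > /tmp/sample.log <<'EOF'
2025-06-30 10:00:00,123 INFO [RequestService] Request created successfully request_id=req-1
2025-06-30 10:00:01,456 ERROR [AWSProvider] Failed to provision machine error=InvalidParameterValue
2025-06-30 10:00:02,789 ERROR [AWSProvider] AWS API error: RequestLimitExceeded
2025-06-30 10:00:03,012 ERROR [ConfigManager] Invalid JSON in configuration file
EOF

grep "ERROR" /tmp/sample.log | grep -o '\[[^]]*\]' | sort | uniq -c | sort -nr
```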

Performance Analysis:

# Find slow operations
grep "slow" logs/app.log

# Track request duration
grep "Request.*completed" logs/app.log | grep -o "duration=[0-9]*"
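If completion lines carry a duration field, summary statistics fall out of awk. A sketch over synthetic data (the `duration=` field and its unit are assumptions about the log format):

```shell
cat > /tmp/durations.log <<'EOF'
2025-06-30 10:00:05,000 INFO [RequestService] Request completed request_id=req-1 duration=120
2025-06-30 10:00:09,000 INFO [RequestService] Request completed request_id=req-2 duration=450
2025-06-30 10:00:12,000 INFO [RequestService] Request completed request_id=req-3 duration=300
EOF

# Average and maximum duration across completed requests
grep -o 'duration=[0-9]*' /tmp/durations.log | cut -d= -f2 | \
    awk '{sum += $1; if ($1 > max) max = $1} END {printf "avg=%d max=%d\n", sum/NR, max}'
# prints: avg=290 max=450
```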

Log Rotation

For production environments, set up log rotation:

Using logrotate (Linux)

Create /etc/logrotate.d/hostfactory:

/path/to/logs/app.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 644 hostfactory hostfactory
}

Manual Log Management

# Archive old logs
mv logs/app.log logs/app.log.$(date +%Y%m%d)
touch logs/app.log

# Compress old logs
gzip logs/app.log.*

# Clean up old logs (keep last 30 days)
find logs/ -name "app.log.*" -mtime +30 -delete

Health Checks

Basic Health Check

The application provides basic health check functionality:

# Check if the application can start and load configuration
python run.py getAvailableTemplates

# Should return templates or an empty list without errors

AWS Connectivity Check

# Test AWS credentials and connectivity
aws sts get-caller-identity

# Test EC2 API access
aws ec2 describe-regions --region us-east-1

Configuration Validation

# Validate configuration file
python -c "
import json
with open('config/config.json') as f:
    config = json.load(f)
    print('Configuration is valid JSON')
    print(f'Provider: {config.get(\"provider\", {}).get(\"type\", \"unknown\")}')
"
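The same check works as a heredoc, which avoids the shell quote escaping (the sample config written here is hypothetical):

```shell
# Write a hypothetical minimal config and validate it without quote escaping
mkdir -p /tmp/cfg
cat > /tmp/cfg/config.json <<'EOF'
{"provider": {"type": "aws"}, "logging": {"level": "INFO"}}
EOF

python3 - <<'EOF'
import json, sys
try:
    with open('/tmp/cfg/config.json') as f:
        config = json.load(f)
except (OSError, json.JSONDecodeError) as e:
    sys.exit(f"Configuration invalid: {e}")
print(f"Provider: {config.get('provider', {}).get('type', 'unknown')}")
EOF
# prints: Provider: aws
```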

Storage Health Check

# Check data directory
ls -la data/

# Check if database file is accessible
if [ -f "data/request_database.json" ]; then
    echo "Database file exists"
    python -c "
import json
with open('data/request_database.json') as f:
    data = json.load(f)
    print('Database loaded successfully')
"
else
    echo "Database file not found - will be created on first use"
fi

Error Monitoring

Error Types

The application logs various types of errors:

Configuration Errors

ERROR [ConfigManager] Failed to load configuration: File not found
ERROR [ConfigManager] Invalid JSON in configuration file

AWS Provider Errors

ERROR [AWSProvider] AWS API error: InvalidParameterValue
ERROR [AWSProvider] Failed to provision machine: InsufficientInstanceCapacity
ERROR [AWSProvider] Authentication failed: InvalidUserID.NotFound

Application Errors

ERROR [RequestService] Template not found: template-123
ERROR [RequestService] Invalid machine count: -1
ERROR [ApplicationService] Failed to create request: ValidationError

Error Tracking Script

Create a simple error monitoring script:

#!/bin/bash
# error_monitor.sh

LOG_FILE="logs/app.log"
ERROR_COUNT=$(grep "ERROR" "$LOG_FILE" | wc -l)
RECENT_ERRORS=$(tail -100 "$LOG_FILE" | grep "ERROR" | wc -l)

echo "Total errors: $ERROR_COUNT"
echo "Recent errors (last 100 lines): $RECENT_ERRORS"

if [ "$RECENT_ERRORS" -gt 5 ]; then
    echo "WARNING: High error rate detected"
    echo "Recent errors:"
    tail -100 "$LOG_FILE" | grep "ERROR" | tail -5
fi

Operation Monitoring

Request Tracking

Monitor request lifecycle:

# Count active requests
python run.py getReturnRequests --active-only | jq '. | length'

# List recent requests
grep "Request.*created" logs/app.log | tail -10

# Track request completion
grep "Request.*completed" logs/app.log | tail -10
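Created and completed lines can also be diffed to find requests still in flight. A sketch over a synthetic log (message wording taken from the log format shown earlier):

```shell
cat > /tmp/req.log <<'EOF'
2025-06-30 10:00:00,000 INFO [RequestService] Request created successfully request_id=req-1
2025-06-30 10:00:01,000 INFO [RequestService] Request created successfully request_id=req-2
2025-06-30 10:05:00,000 INFO [RequestService] Request completed request_id=req-1
EOF

# IDs that were created but never completed (requires bash for <(...))
comm -23 \
    <(grep "Request created" /tmp/req.log | grep -o 'request_id=[^ ]*' | sort -u) \
    <(grep "Request completed" /tmp/req.log | grep -o 'request_id=[^ ]*' | sort -u)
# prints: request_id=req-2
```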

Machine Monitoring

Track machine provisioning:

# Count machines by status
python run.py getReturnRequests | jq '.[] | .machines[] | .status' | sort | uniq -c

# Monitor provisioning time
grep "Machine.*provisioned" logs/app.log | tail -10

AWS API Monitoring

Monitor AWS API usage:

# Count API calls
grep "AWS API" logs/app.log | wc -l

# Check for rate limiting
grep "rate limit" logs/app.log

# Monitor API errors
grep "AWS API.*error" logs/app.log | tail -10
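Throttling tends to arrive in bursts, so bucketing throttle errors per minute is more telling than a raw count. Illustrated on synthetic lines (the error codes are standard EC2 throttling codes, but their exact phrasing in this log is an assumption):

```shell
cat > /tmp/api.log <<'EOF'
2025-06-30 10:01:10,000 ERROR [AWSProvider] AWS API error: RequestLimitExceeded
2025-06-30 10:01:40,000 ERROR [AWSProvider] AWS API error: RequestLimitExceeded
2025-06-30 10:02:05,000 ERROR [AWSProvider] AWS API error: Throttling
EOF

# Throttling errors bucketed per minute
grep -E "RequestLimitExceeded|Throttling" /tmp/api.log | \
    cut -d' ' -f1-2 | cut -d: -f1-2 | sort | uniq -c
```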

Performance Monitoring

Response Time Tracking

Monitor command execution time:

# Time command execution
time python run.py getAvailableTemplates

# Monitor slow operations
grep "slow" logs/app.log

Resource Usage

Monitor system resources:

# Check memory usage (bracket trick excludes the grep itself from the results)
ps aux | grep "[r]un.py"

# Check disk usage
du -sh data/ logs/

# Count open file handles held by python processes
lsof -c python 2>/dev/null | wc -l

Database Performance

For JSON storage:

# Check database file size
ls -lh data/request_database.json

# Monitor database operations
grep "database" logs/app.log | tail -10
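Beyond file size, the record count gives a rough growth signal. A sketch against a hypothetical database file (the `requests` top-level key is an assumption about the storage schema; adjust to the real one):

```shell
mkdir -p /tmp/data
printf '{"requests": [{"id": "req-1"}, {"id": "req-2"}]}' > /tmp/data/request_database.json

python3 - <<'EOF'
import json, os
path = '/tmp/data/request_database.json'
with open(path) as f:
    data = json.load(f)
# "requests" is a guessed key name for this illustration
print(f"{os.path.getsize(path)} bytes, {len(data.get('requests', []))} requests")
EOF
```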

Alerting

Simple Email Alerts

Create a basic alerting script:

#!/bin/bash
# alert_check.sh

LOG_FILE="logs/app.log"
ALERT_EMAIL="admin@example.com"

# Check for critical errors
CRITICAL_ERRORS=$(grep "CRITICAL" "$LOG_FILE" | wc -l)

if [ "$CRITICAL_ERRORS" -gt 0 ]; then
    echo "CRITICAL errors detected in Host Factory Plugin" | \
    mail -s "Host Factory Alert" "$ALERT_EMAIL"
fi

# Check for high error rate
RECENT_ERRORS=$(tail -1000 "$LOG_FILE" | grep "ERROR" | wc -l)

if [ "$RECENT_ERRORS" -gt 50 ]; then
    echo "High error rate detected: $RECENT_ERRORS errors in last 1000 log lines" | \
    mail -s "Host Factory High Error Rate" "$ALERT_EMAIL"
fi

Cron Job Setup

# Add to crontab
crontab -e

# Check every 15 minutes
*/15 * * * * /path/to/alert_check.sh

# Daily log summary
0 8 * * * /path/to/daily_summary.sh

Monitoring Scripts

Daily Summary Script

#!/bin/bash
# daily_summary.sh

LOG_FILE="logs/app.log"
DATE=$(date +%Y-%m-%d)

echo "Host Factory Daily Summary - $DATE"
echo "=================================="

# Request statistics
echo "Requests:"
echo "  Created: $(grep "Request.*created" "$LOG_FILE" | grep "$DATE" | wc -l)"
echo "  Completed: $(grep "Request.*completed" "$LOG_FILE" | grep "$DATE" | wc -l)"
echo "  Failed: $(grep "Request.*failed" "$LOG_FILE" | grep "$DATE" | wc -l)"

# Error statistics
echo "Errors:"
echo "  Total: $(grep "ERROR" "$LOG_FILE" | grep "$DATE" | wc -l)"
echo "  AWS: $(grep "ERROR.*AWS" "$LOG_FILE" | grep "$DATE" | wc -l)"
echo "  Config: $(grep "ERROR.*Config" "$LOG_FILE" | grep "$DATE" | wc -l)"

# Machine statistics
echo "Machines:"
echo "  Provisioned: $(grep "Machine.*provisioned" "$LOG_FILE" | grep "$DATE" | wc -l)"
echo "  Terminated: $(grep "Machine.*terminated" "$LOG_FILE" | grep "$DATE" | wc -l)"

Health Check Script

#!/bin/bash
# health_check.sh

echo "Host Factory Health Check"
echo "========================"

# Test basic functionality
echo -n "Basic functionality: "
if python run.py getAvailableTemplates > /dev/null 2>&1; then
    echo "OK"
else
    echo "FAILED"
fi

# Test AWS connectivity
echo -n "AWS connectivity: "
if aws sts get-caller-identity > /dev/null 2>&1; then
    echo "OK"
else
    echo "FAILED"
fi

# Check disk space
echo -n "Disk space: "
DISK_USAGE=$(df -P . | tail -1 | awk '{print $5}' | tr -d '%')
if [ "$DISK_USAGE" -lt 90 ]; then
    echo "OK ($DISK_USAGE%)"
else
    echo "WARNING ($DISK_USAGE%)"
fi

# Check log file size
echo -n "Log file size: "
if [ -f "logs/app.log" ]; then
    LOG_SIZE=$(du -m logs/app.log | cut -f1)
    if [ "$LOG_SIZE" -lt 100 ]; then
        echo "OK (${LOG_SIZE}MB)"
    else
        echo "WARNING (${LOG_SIZE}MB)"
    fi
else
    echo "No log file"
fi

Log Analysis Tools

Error Analysis

# Top error messages (everything after the component bracket)
grep "ERROR" logs/app.log | cut -d']' -f2- | sort | uniq -c | sort -nr | head -10

# Error timeline (errors per minute)
grep "ERROR" logs/app.log | cut -d' ' -f1-2 | cut -d: -f1-2 | sort | uniq -c

# AWS-specific errors
grep "ERROR.*AWS" logs/app.log | tail -20

Performance Analysis

# Slow operations
grep -E "(slow|timeout|delay)" logs/app.log

# Request duration analysis (bare values, numeric sort)
grep -o "duration=[0-9]*" logs/app.log | cut -d= -f2 | sort -n

# API call frequency per minute
grep "AWS API" logs/app.log | cut -d' ' -f1-2 | cut -d: -f1-2 | sort | uniq -c

Troubleshooting Monitoring

Common Issues

Log File Not Created

# Check directory permissions
ls -la logs/

# Create directory if needed
mkdir -p logs
chmod 755 logs

High Log File Size

# Check log file size
ls -lh logs/app.log

# Rotate logs manually
mv logs/app.log logs/app.log.old
touch logs/app.log

Missing Health Check Data

# Verify configuration
python -c "
import json
with open('config/config.json') as f:
    config = json.load(f)
    print('Logging config:', config.get('logging', {}))
"

Integration with External Monitoring

Syslog Integration

Configure syslog forwarding:

# In logging configuration
{
  "logging": {
    "level": "INFO",
    "file_path": "logs/app.log",
    "console_enabled": true,
    "syslog_enabled": true,
    "syslog_facility": "local0"
  }
}

Log Forwarding

Forward logs to centralized logging:

# Using rsyslog (requires root)
echo "local0.*    @@logserver:514" >> /etc/rsyslog.conf
systemctl restart rsyslog

# Using filebeat (ELK stack)
# Configure filebeat.yml to monitor logs/app.log
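A minimal filebeat input sketch for the comment above (the `filestream` input type, the paths, and the output host are assumptions to adapt):

```yaml
# filebeat.yml (fragment, hypothetical)
filebeat.inputs:
  - type: filestream
    id: hostfactory-app-log
    paths:
      - /path/to/logs/app.log

output.logstash:
  hosts: ["logserver:5044"]
```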

Next Steps