
Infrastructure Monitoring & Security Alerts for Allocator IAM Role #20

@eberrigan

Description

Problem Statement

PR #19 grants the allocator EC2 instance powerful IAM permissions to autonomously create, manage, and destroy client VMs. While this enables the intended self-service architecture, it also introduces significant security and cost risks:

Security Risks:

  • Unauthorized access to the allocator instance could allow an attacker to:
    • Launch expensive EC2 instances (e.g., p4d.24xlarge at $32/hour)
    • Create IAM roles and escalate privileges
    • Modify security groups to expose resources
    • Delete critical infrastructure
  • No visibility into when/how these powerful permissions are being used
  • No audit trail for compliance or forensic analysis

Cost Risks:

  • Bugs in allocator code could launch hundreds of instances
  • Runaway processes could create resources indefinitely
  • No automated alerts for cost threshold breaches
  • No enforcement of instance type restrictions

Current State:

  • ✅ CloudWatch logging exists for client VM logs (sent to allocator via Lambda)
  • ❌ No monitoring of allocator's AWS API calls (EC2, IAM operations)
  • ❌ No CloudTrail configuration for audit logging
  • ❌ No CloudWatch alarms for security events
  • ❌ No cost monitoring or budget alerts
  • ❌ No SNS notifications for critical events

Context: Existing vs Planned Logging Systems

Current: CloudWatch → Lambda → Allocator (Client VM Logs)

Purpose: Collect application logs from client VMs
Architecture: Client VMs → CloudWatch Logs → Lambda → Allocator API
Scope: Client VM cloud-init and container logs
Status: Currently deployed

Future: Self-Hosted Logging System (Issue #225)

Purpose: Replace CloudWatch for client VM application logs
Architecture: Client VMs → Log Shipper → Allocator API → PostgreSQL
Scope: Client VM application logs only (not infrastructure)
Status: Proposed (will deprecate current CloudWatch → Lambda pipeline)

Proposed: Infrastructure Security Monitoring (This Issue)

Purpose: Monitor allocator's AWS API usage for security/cost control
Architecture: AWS API Calls → CloudTrail → CloudWatch Logs → Metric Filters → Alarms → SNS
Scope: EC2, IAM, and other AWS service calls made by allocator IAM role
Status: Not implemented

Important: This issue is complementary to issue #225. The self-hosted logging system handles application logs for end-users, while this infrastructure monitoring ensures secure operation of the platform itself.

Scope

This issue focuses on monitoring the allocator's infrastructure operations, not client application logs:

In Scope

  • ✅ CloudTrail logging for all API calls by lablink_instance_role_*
  • ✅ CloudWatch metric filters for security-relevant events
  • ✅ CloudWatch alarms for anomalous behavior
  • ✅ SNS notifications to administrators
  • ✅ AWS Budget alerts for cost overruns
  • ✅ Dashboard for real-time monitoring

Out of Scope

  • ❌ Client VM application logs (handled by issue #225)
  • ❌ Allocator application logs (handled by existing Docker logging)
  • ❌ Performance monitoring (separate concern)

Requirements

1. CloudTrail Configuration

Enable CloudTrail to capture all API events related to the allocator's IAM role:

Features:

  • Multi-region trail (captures API calls from all regions)
  • S3 bucket for log storage (encrypted at rest)
  • CloudWatch Logs integration for real-time analysis
  • Log file validation enabled
  • 90-day retention minimum (configurable)

Events to Capture:

  • All ec2:RunInstances calls by allocator role
  • All ec2:TerminateInstances calls by allocator role
  • All iam:CreateRole, iam:DeleteRole calls
  • All security group modifications
  • All failed API calls (authentication/authorization errors)

Cost Considerations:

  • CloudTrail: first copy of management events is free; additional copies ~$2.00/100,000 events. CloudWatch Logs ingestion: ~$0.50/GB
  • S3 storage: ~$0.023/GB/month
  • Expected cost: $5-15/month depending on usage

2. CloudWatch Metric Filters

Create metric filters to extract security-relevant patterns from CloudTrail logs:

Metric Filters:

| Filter Name | Pattern | Threshold | Purpose |
| --- | --- | --- | --- |
| RunInstancesCount | eventName = RunInstances AND userIdentity.arn LIKE %lablink_instance_role% | >10 / 5 min | Detect mass instance launches |
| TerminateInstancesCount | eventName = TerminateInstances AND userIdentity.arn LIKE %lablink_instance_role% | >20 / 5 min | Detect mass terminations |
| LargeInstanceLaunched | eventName = RunInstances AND requestParameters.instanceType LIKE (p4d.* OR p3.* OR g5.*) | >0 | Detect expensive instances |
| IAMRoleCreationCount | eventName = CreateRole AND userIdentity.arn LIKE %lablink_instance_role% | >10 / hour | Detect unusual role creation |
| SecurityGroupModifications | eventName LIKE *SecurityGroup* AND userIdentity.arn LIKE %lablink_instance_role% | >20 / hour | Detect SG tampering |
| UnauthorizedAPICalls | errorCode = (AccessDenied OR UnauthorizedOperation) AND userIdentity.arn LIKE %lablink_instance_role% | >5 / 15 min | Detect permission issues or attacks |
| ConsoleLoginFailures | eventName = ConsoleLogin AND errorMessage = "Failed authentication" | >3 / 5 min | Detect brute force on allocator |

Note: for assumed-role sessions, CloudTrail records the role name in userIdentity.arn; userIdentity.principalId carries the role's unique ID plus session name, so patterns should match on the ARN.
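
For reference, a trimmed CloudTrail RunInstances event shows the JSON fields these patterns match (values illustrative; note that for assumed-role sessions the role name appears in userIdentity.arn, while principalId carries the role's unique ID and session name):

```json
{
  "eventName": "RunInstances",
  "eventSource": "ec2.amazonaws.com",
  "userIdentity": {
    "type": "AssumedRole",
    "principalId": "AROAEXAMPLEID:i-0abc123def456",
    "arn": "arn:aws:sts::123456789012:assumed-role/lablink_instance_role_prod/i-0abc123def456"
  },
  "requestParameters": {
    "instanceType": "g5.xlarge"
  }
}
```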

3. CloudWatch Alarms

Create alarms that trigger on metric filter thresholds:

Alarm Configuration:

  • Evaluation period: 5 minutes (configurable)
  • Datapoints to alarm: 1 out of 1 (immediate alerting)
  • Missing data treatment: notBreaching (avoid false alarms during quiet periods)
  • Actions: Publish to SNS topic for admin notifications

Example Alarm (Terraform):

resource "aws_cloudwatch_metric_alarm" "mass_instance_launch" {
  alarm_name          = "lablink-mass-instance-launch-${var.resource_suffix}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "RunInstancesCount"
  namespace           = "LabLinkSecurity"
  period              = 300  # 5 minutes
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "Alert when allocator launches >10 instances in 5 minutes"
  alarm_actions       = [aws_sns_topic.admin_alerts.arn]
  treat_missing_data  = "notBreaching"
}

4. SNS Notifications

SNS Topic Configuration:

  • Topic name: lablink-admin-alerts-${resource_suffix}
  • Protocol: Email (configurable for Slack/PagerDuty later)
  • Subscription email: Configured in config.yaml

Example config.yaml additions:

monitoring:
  enabled: true  # Enable/disable all monitoring features
  alerts:
    email: "[email protected]"  # Email for critical alerts
    # Future: Slack webhook, PagerDuty integration key

  # Thresholds (all optional - use defaults if not specified)
  thresholds:
    max_instances_per_5min: 10  # Alert if >10 RunInstances in 5min
    max_terminations_per_5min: 20
    max_iam_roles_per_hour: 10
    max_security_group_changes_per_hour: 20
    max_unauthorized_calls_per_15min: 5

5. AWS Budget Alerts

Budget Configuration:

  • Budget type: Cost budget
  • Time period: Monthly recurring
  • Budget amount: Configurable in config.yaml (default: $500/month)
  • Alert thresholds:
    • 50% of budget (warning)
    • 80% of budget (urgent)
    • 100% of budget (critical)
    • 150% of budget (severe overage)

Example config.yaml:

monitoring:
  budget:
    enabled: true
    monthly_limit_usd: 500  # Alert if monthly costs exceed this
    alerts:
      - threshold_percent: 50
        severity: "warning"
      - threshold_percent: 80
        severity: "urgent"
      - threshold_percent: 100
        severity: "critical"

Terraform Resource:

resource "aws_budgets_budget" "lablink_monthly" {
  name              = "lablink-monthly-budget-${var.resource_suffix}"
  budget_type       = "COST"
  limit_amount      = local.monitoring_config.budget.monthly_limit_usd
  limit_unit        = "USD"
  time_period_start = "2025-01-01_00:00"
  time_unit         = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [local.monitoring_config.alerts.email]
  }
}

6. CloudWatch Dashboard (Optional)

Create a dashboard for real-time visibility into allocator operations:

Widgets:

  • Total instances launched (last 24h)
  • Total instances terminated (last 24h)
  • Active client VMs (current count)
  • API error rate (unauthorized calls)
  • Cost-to-date this month
  • Top 5 instance types launched
  • IAM roles created/deleted timeline

Access: Available at AWS Console → CloudWatch → Dashboards → lablink-${resource_suffix}
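
A minimal Terraform sketch of such a dashboard, assuming a `var.region` variable and the `LabLinkSecurity/${var.resource_suffix}` metric namespace used by the metric filters (widget layout and metric names are illustrative):

```hcl
# One-widget dashboard tracking launches vs terminations over time.
resource "aws_cloudwatch_dashboard" "lablink" {
  dashboard_name = "lablink-${var.resource_suffix}"

  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        x      = 0
        y      = 0
        width  = 12
        height = 6
        properties = {
          title  = "Instances launched vs terminated"
          region = var.region
          stat   = "Sum"
          period = 3600
          metrics = [
            ["LabLinkSecurity/${var.resource_suffix}", "RunInstancesCount"],
            ["LabLinkSecurity/${var.resource_suffix}", "TerminateInstancesCount"]
          ]
        }
      }
    ]
  })
}
```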

Implementation Plan

Phase 1: CloudTrail Setup (Foundational)

Priority: Critical
Estimated Time: 2-4 hours

Files to Modify:

  • lablink-infrastructure/cloudtrail.tf (new file)
  • lablink-infrastructure/main.tf (add CloudTrail module reference)
  • lablink-infrastructure/config/config.yaml (add monitoring section)

Tasks:

  1. Create S3 bucket for CloudTrail logs with encryption
  2. Create CloudTrail trail with CloudWatch Logs integration
  3. Capture management events for the APIs of interest; allocator-role filtering is applied downstream by metric filters (cost optimization)
  4. Enable log file validation
  5. Set retention policy (90 days default, configurable)

Terraform Resources:

# lablink-infrastructure/cloudtrail.tf

# S3 bucket for CloudTrail logs
resource "aws_s3_bucket" "cloudtrail_logs" {
  bucket = "lablink-cloudtrail-${var.resource_suffix}-${data.aws_caller_identity.current.account_id}"

  tags = {
    Name        = "lablink-cloudtrail-${var.resource_suffix}"
    Environment = var.resource_suffix
  }
}

# S3 bucket encryption
resource "aws_s3_bucket_server_side_encryption_configuration" "cloudtrail_encryption" {
  bucket = aws_s3_bucket.cloudtrail_logs.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# S3 bucket policy for CloudTrail
resource "aws_s3_bucket_policy" "cloudtrail_policy" {
  bucket = aws_s3_bucket.cloudtrail_logs.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AWSCloudTrailAclCheck"
        Effect = "Allow"
        Principal = {
          Service = "cloudtrail.amazonaws.com"
        }
        Action   = "s3:GetBucketAcl"
        Resource = aws_s3_bucket.cloudtrail_logs.arn
      },
      {
        Sid    = "AWSCloudTrailWrite"
        Effect = "Allow"
        Principal = {
          Service = "cloudtrail.amazonaws.com"
        }
        Action   = "s3:PutObject"
        Resource = "${aws_s3_bucket.cloudtrail_logs.arn}/*"
        Condition = {
          StringEquals = {
            "s3:x-amz-acl" = "bucket-owner-full-control"
          }
        }
      }
    ]
  })
}
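
The CloudWatch log group below gets an explicit retention policy; the S3 copy of the trail can be aged out the same way. A sketch (90-day expiration to match the CloudWatch default; adjust to your compliance needs):

```hcl
# Expire CloudTrail objects in S3 after the retention window.
resource "aws_s3_bucket_lifecycle_configuration" "cloudtrail_lifecycle" {
  bucket = aws_s3_bucket.cloudtrail_logs.id

  rule {
    id     = "expire-cloudtrail-logs"
    status = "Enabled"

    filter {}

    expiration {
      days = 90
    }
  }
}
```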

# CloudWatch Log Group for CloudTrail
resource "aws_cloudwatch_log_group" "cloudtrail_logs" {
  name              = "lablink-cloudtrail-${var.resource_suffix}"
  retention_in_days = try(local.config_file.monitoring.cloudtrail.retention_days, 90)
}

# IAM role for CloudTrail to write to CloudWatch
resource "aws_iam_role" "cloudtrail_cloudwatch_role" {
  name = "lablink_cloudtrail_cloudwatch_${var.resource_suffix}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Service = "cloudtrail.amazonaws.com"
        }
        Action = "sts:AssumeRole"
      }
    ]
  })
}

# IAM policy for CloudTrail CloudWatch access
resource "aws_iam_role_policy" "cloudtrail_cloudwatch_policy" {
  name = "lablink_cloudtrail_cloudwatch_policy_${var.resource_suffix}"
  role = aws_iam_role.cloudtrail_cloudwatch_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "${aws_cloudwatch_log_group.cloudtrail_logs.arn}:*"
      }
    ]
  })
}

# CloudTrail
resource "aws_cloudtrail" "lablink_trail" {
  name                          = "lablink-trail-${var.resource_suffix}"
  s3_bucket_name                = aws_s3_bucket.cloudtrail_logs.id
  include_global_service_events = true
  is_multi_region_trail         = true
  enable_log_file_validation    = true

  cloud_watch_logs_group_arn = "${aws_cloudwatch_log_group.cloudtrail_logs.arn}:*"
  cloud_watch_logs_role_arn  = aws_iam_role.cloudtrail_cloudwatch_role.arn

  event_selector {
    read_write_type           = "All"
    include_management_events = true

    # Note: RunInstances, TerminateInstances, and IAM calls are management
    # events, so they are already captured above. Basic event selectors cannot
    # filter by IAM role; per-role filtering happens downstream in the
    # CloudWatch metric filters (Phase 2).
  }

  depends_on = [
    aws_s3_bucket_policy.cloudtrail_policy,
    aws_cloudwatch_log_group.cloudtrail_logs,
    aws_iam_role_policy.cloudtrail_cloudwatch_policy
  ]

  tags = {
    Name        = "lablink-trail-${var.resource_suffix}"
    Environment = var.resource_suffix
  }
}

Testing:

  1. Deploy CloudTrail configuration
  2. Launch a test client VM via allocator
  3. Verify CloudTrail logs appear in S3 bucket
  4. Verify logs appear in CloudWatch Logs group
  5. Check for RunInstances events in CloudWatch Logs Insights
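
For step 5, a CloudWatch Logs Insights query along these lines (run against the CloudTrail log group created above) surfaces recent launches:

```
fields @timestamp, eventName, userIdentity.arn, requestParameters.instanceType
| filter eventName = "RunInstances"
| sort @timestamp desc
| limit 20
```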

Phase 2: Metric Filters & Alarms (Security Monitoring)

Priority: High
Estimated Time: 4-6 hours

Files to Modify:

  • lablink-infrastructure/cloudwatch_alarms.tf (new file)
  • lablink-infrastructure/config/config.yaml (update monitoring section)

Tasks:

  1. Create SNS topic for admin alerts
  2. Add email subscription (from config.yaml)
  3. Create metric filters for each security pattern
  4. Create CloudWatch alarms for each metric
  5. Test alarm triggering with synthetic events

Terraform Resources:

# lablink-infrastructure/cloudwatch_alarms.tf

# SNS Topic for Admin Alerts
resource "aws_sns_topic" "admin_alerts" {
  name = "lablink-admin-alerts-${var.resource_suffix}"

  tags = {
    Name        = "lablink-admin-alerts-${var.resource_suffix}"
    Environment = var.resource_suffix
  }
}

# SNS Email Subscription
resource "aws_sns_topic_subscription" "admin_email" {
  count     = try(local.config_file.monitoring.enabled, false) ? 1 : 0
  topic_arn = aws_sns_topic.admin_alerts.arn
  protocol  = "email"
  endpoint  = try(local.config_file.monitoring.alerts.email, "")
}

# Metric Filter: Mass Instance Launches
resource "aws_cloudwatch_log_metric_filter" "run_instances" {
  name           = "lablink-run-instances-${var.resource_suffix}"
  log_group_name = aws_cloudwatch_log_group.cloudtrail_logs.name

  pattern = <<PATTERN
{ ($.eventName = "RunInstances") && ($.userIdentity.arn = "*lablink_instance_role*") }
PATTERN

  metric_transformation {
    name      = "RunInstancesCount"
    namespace = "LabLinkSecurity/${var.resource_suffix}"
    value     = "1"
    unit      = "Count"
  }
}

# Alarm: Mass Instance Launches
resource "aws_cloudwatch_metric_alarm" "mass_instance_launch" {
  alarm_name          = "lablink-mass-instance-launch-${var.resource_suffix}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "RunInstancesCount"
  namespace           = "LabLinkSecurity/${var.resource_suffix}"
  period              = 300  # 5 minutes
  statistic           = "Sum"
  threshold           = try(local.config_file.monitoring.thresholds.max_instances_per_5min, 10)
  alarm_description   = "Alert when allocator launches >${try(local.config_file.monitoring.thresholds.max_instances_per_5min, 10)} instances in 5 minutes"
  alarm_actions       = [aws_sns_topic.admin_alerts.arn]
  treat_missing_data  = "notBreaching"

  tags = {
    Name        = "lablink-mass-instance-launch-${var.resource_suffix}"
    Environment = var.resource_suffix
    Severity    = "high"
  }
}

# Metric Filter: Large Instance Types
resource "aws_cloudwatch_log_metric_filter" "large_instances" {
  name           = "lablink-large-instances-${var.resource_suffix}"
  log_group_name = aws_cloudwatch_log_group.cloudtrail_logs.name

  pattern = <<PATTERN
{ ($.eventName = "RunInstances") && ($.userIdentity.arn = "*lablink_instance_role*") && (($.requestParameters.instanceType = "p4d.*") || ($.requestParameters.instanceType = "p3.*") || ($.requestParameters.instanceType = "g5.*")) }
PATTERN

  metric_transformation {
    name      = "LargeInstanceLaunched"
    namespace = "LabLinkSecurity/${var.resource_suffix}"
    value     = "1"
    unit      = "Count"
  }
}

# Alarm: Large Instance Types
resource "aws_cloudwatch_metric_alarm" "large_instance_launched" {
  alarm_name          = "lablink-large-instance-launched-${var.resource_suffix}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "LargeInstanceLaunched"
  namespace           = "LabLinkSecurity/${var.resource_suffix}"
  period              = 300
  statistic           = "Sum"
  threshold           = 0  # Alert on ANY large instance
  alarm_description   = "Alert when allocator launches expensive instance types (p4d, p3, g5)"
  alarm_actions       = [aws_sns_topic.admin_alerts.arn]
  treat_missing_data  = "notBreaching"

  tags = {
    Name        = "lablink-large-instance-launched-${var.resource_suffix}"
    Environment = var.resource_suffix
    Severity    = "critical"
  }
}

# Metric Filter: Unauthorized API Calls
resource "aws_cloudwatch_log_metric_filter" "unauthorized_calls" {
  name           = "lablink-unauthorized-calls-${var.resource_suffix}"
  log_group_name = aws_cloudwatch_log_group.cloudtrail_logs.name

  pattern = <<PATTERN
{ (($.errorCode = "AccessDenied") || ($.errorCode = "UnauthorizedOperation")) && ($.userIdentity.arn = "*lablink_instance_role*") }
PATTERN

  metric_transformation {
    name      = "UnauthorizedAPICalls"
    namespace = "LabLinkSecurity/${var.resource_suffix}"
    value     = "1"
    unit      = "Count"
  }
}

# Alarm: Unauthorized API Calls
resource "aws_cloudwatch_metric_alarm" "unauthorized_calls" {
  alarm_name          = "lablink-unauthorized-calls-${var.resource_suffix}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "UnauthorizedAPICalls"
  namespace           = "LabLinkSecurity/${var.resource_suffix}"
  period              = 900  # 15 minutes
  statistic           = "Sum"
  threshold           = try(local.config_file.monitoring.thresholds.max_unauthorized_calls_per_15min, 5)
  alarm_description   = "Alert when allocator makes unauthorized API calls (possible attack or permission issue)"
  alarm_actions       = [aws_sns_topic.admin_alerts.arn]
  treat_missing_data  = "notBreaching"

  tags = {
    Name        = "lablink-unauthorized-calls-${var.resource_suffix}"
    Environment = var.resource_suffix
    Severity    = "critical"
  }
}

# Additional alarms follow same pattern for:
# - TerminateInstances
# - IAM role creation
# - Security group modifications
# (omitted for brevity - implement similarly)

Phase 3: AWS Budget Alerts (Cost Monitoring)

Priority: High
Estimated Time: 2-3 hours

Files to Modify:

  • lablink-infrastructure/budgets.tf (new file)
  • lablink-infrastructure/config/config.yaml (update budget settings)

Tasks:

  1. Create AWS Budget for monthly cost tracking
  2. Configure multi-threshold alerts (50%, 80%, 100%, 150%)
  3. Link to SNS topic for notifications
  4. Test with simulated cost data

Terraform Resources:

# lablink-infrastructure/budgets.tf

resource "aws_budgets_budget" "lablink_monthly" {
  count = try(local.config_file.monitoring.budget.enabled, false) ? 1 : 0

  name              = "lablink-monthly-budget-${var.resource_suffix}"
  budget_type       = "COST"
  limit_amount      = try(local.config_file.monitoring.budget.monthly_limit_usd, "500")
  limit_unit        = "USD"
  # Note: timestamp() changes on every plan; pin a fixed start date to avoid perpetual diffs
  time_period_start = formatdate("YYYY-MM-01_00:00", timestamp())
  time_unit         = "MONTHLY"

  cost_filter {
    name = "TagKeyValue"
    values = [
      # "$${" renders as a literal "${" in Terraform, so the original value was
      # never interpolated; format() keeps the literal "$" separator AWS expects.
      format("user:Environment$%s", var.resource_suffix),
      "user:ManagedBy$lablink-allocator-${var.resource_suffix}"
    ]
  }

  # 50% warning
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 50
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [try(local.config_file.monitoring.alerts.email, "")]
  }

  # 80% urgent
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [try(local.config_file.monitoring.alerts.email, "")]
  }

  # 100% critical
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [try(local.config_file.monitoring.alerts.email, "")]
  }

  # 150% severe overage
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 150
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [try(local.config_file.monitoring.alerts.email, "")]
  }

  tags = {
    Name        = "lablink-monthly-budget-${var.resource_suffix}"
    Environment = var.resource_suffix
  }
}

Phase 4: Dashboard (Optional - Nice to Have)

Priority: Low
Estimated Time: 3-4 hours

Files to Modify:

  • lablink-infrastructure/dashboard.tf (new file)

Tasks:

  1. Create CloudWatch Dashboard
  2. Add widgets for key metrics
  3. Configure auto-refresh
  4. Document dashboard URL in README

Phase 5: Documentation & Testing

Priority: High
Estimated Time: 2-3 hours

Files to Modify:

  • lablink-infrastructure/README.md (add monitoring section)
  • lablink-infrastructure/SECURITY.md (update security best practices)
  • lablink-infrastructure/config/*.example.yaml (add monitoring config examples)

Tasks:

  1. Update README with monitoring architecture
  2. Document how to view CloudTrail logs
  3. Document how to respond to alerts
  4. Create runbook for common alert scenarios
  5. Test end-to-end alert flow

Configuration Schema

Additions to config.yaml:

# Infrastructure Monitoring Configuration (optional - defaults shown)
monitoring:
  enabled: true  # Master switch - set to false to disable all monitoring

  # Alert notification settings
  alerts:
    email: "[email protected]"  # Required if monitoring.enabled = true
    # Future integrations:
    # slack_webhook: ""
    # pagerduty_key: ""

  # CloudTrail configuration
  cloudtrail:
    retention_days: 90  # How long to keep CloudTrail logs in CloudWatch

  # Alert thresholds (optional - defaults shown)
  thresholds:
    max_instances_per_5min: 10  # Alarm if >10 RunInstances in 5 minutes
    max_terminations_per_5min: 20  # Alarm if >20 TerminateInstances in 5 minutes
    max_iam_roles_per_hour: 10  # Alarm if >10 IAM CreateRole in 1 hour
    max_security_group_changes_per_hour: 20  # Alarm if >20 SG changes in 1 hour
    max_unauthorized_calls_per_15min: 5  # Alarm if >5 AccessDenied in 15 minutes

  # AWS Budget alerts
  budget:
    enabled: true
    monthly_limit_usd: 500  # Monthly cost budget
    # Alerts automatically sent at 50%, 80%, 100%, 150% thresholds
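
The Terraform snippets above reference `local.config_file`; assuming it is not already defined elsewhere in the module, it can be loaded with a small `locals` block (path is illustrative):

```hcl
# Parse config.yaml so monitoring settings are reachable as
# local.config_file.monitoring.* with try() fallbacks to defaults.
locals {
  config_file = yamldecode(file("${path.module}/config/config.yaml"))
}
```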

Example config files to update:

  • config/ci-test.example.yaml - Low budget ($100/month), strict thresholds
  • config/prod.example.yaml - Higher budget ($1000/month), relaxed thresholds
  • config/dev.example.yaml - Monitoring disabled (optional)

Migration Path from Issue #225

Timeline Coordination:

  1. Now (Immediate): Implement this infrastructure monitoring

  2. After Issue #225 Lands: Deprecate client VM CloudWatch logging

    • Remove aws_cloudwatch_log_group.client_vm_logs
    • Remove aws_lambda_function.log_processor
    • Remove aws_cloudwatch_log_subscription_filter.lambda_subscription
    • Keep CloudTrail CloudWatch integration (different log group)

No Conflicts:

  • This issue uses aws_cloudwatch_log_group.cloudtrail_logs (new)
  • Issue #225 deprecates aws_cloudwatch_log_group.client_vm_logs (existing)
  • Different log groups, different purposes

Testing Plan

Unit Tests

  1. Validate Terraform syntax: terraform validate
  2. Verify config schema: lablink-validate-config
  3. Check for resource naming conflicts

Integration Tests (ci-test environment)

  1. Deploy monitoring infrastructure
  2. Verify CloudTrail logging:
    • Launch test VM → Check S3 for CloudTrail logs
    • Terminate test VM → Check CloudWatch Logs
  3. Trigger alarms:
    • Launch 11 instances rapidly → Verify SNS email sent
    • Attempt unauthorized API call → Verify alarm triggers
  4. Verify budget alerts:
    • Simulate cost threshold breach (may require manual AWS Budgets testing)
  5. Check dashboard (if implemented)

Acceptance Criteria

  • CloudTrail logs appear in S3 within 15 minutes of API call
  • CloudTrail logs appear in CloudWatch within 5 minutes
  • Metric filter increments when matching event occurs
  • Alarm triggers when threshold exceeded
  • SNS email delivered to configured address
  • Budget alert sent at 50% threshold
  • Documentation updated with monitoring architecture
  • Example configs include monitoring section

Cost Estimate

Monthly Costs (ci-test environment, ~50 VMs/day):

| Service | Usage | Cost |
| --- | --- | --- |
| CloudTrail | ~150K events/month | $1.50 |
| S3 Storage | ~2 GB/month | $0.05 |
| CloudWatch Logs Ingestion | ~1 GB/month | $0.50 |
| CloudWatch Logs Storage | ~1 GB/month | $0.03 |
| CloudWatch Alarms | 7 alarms | $0.70 |
| SNS Notifications | ~20 emails/month | $0.00 |
| AWS Budgets | 1 budget | $0.00 (first 2 free) |
| **Total** | | **~$2.78/month** |

Production environment (10x usage): ~$15-20/month

ROI: One prevented p4d.24xlarge instance running for 1 hour ($32) pays for 11 months of monitoring.

Security Considerations

Principle of Least Privilege:

  • CloudTrail role can only write to specific CloudWatch log group
  • SNS topic policy restricts who can publish
  • S3 bucket encrypted at rest (AES256)
  • CloudTrail log file validation prevents tampering

Alert Fatigue Mitigation:

  • Conservative thresholds by default (tunable in config.yaml)
  • treat_missing_data = "notBreaching" prevents false alarms
  • Severity tags on alarms for prioritization

False Positive Scenarios:

  • Legitimate batch operations (e.g., spinning up 20 VMs for class)
    • Solution: Temporarily increase thresholds in config.yaml
  • Development testing triggering alarms
    • Solution: Disable monitoring in dev environment

Future Enhancements (Not in Scope)

  • Slack/PagerDuty integration for alerts
  • Automated remediation (Lambda to terminate unauthorized instances)
  • Cost anomaly detection (ML-based)
  • Integration with AWS Security Hub
  • Custom CloudWatch dashboard with cost projections
  • Log analytics queries for trend analysis
  • Integration with allocator web UI (show alerts in admin panel)

Acceptance Checklist

Before closing this issue:

  • CloudTrail configured and logging to S3 + CloudWatch
  • All 7 metric filters created
  • All 7 CloudWatch alarms created and tested
  • SNS topic created with email subscription
  • AWS Budget configured with multi-threshold alerts
  • Configuration schema added to config.yaml
  • All example config files updated
  • README.md updated with monitoring section
  • SECURITY.md updated with monitoring best practices
  • End-to-end testing completed in ci-test
  • Cost estimate validated against actual usage
  • Runbook created for responding to alerts

Labels: enhancement (New feature or request)