
Infrastructure Monitoring & Security Alerts for Allocator IAM Role #20

@eberrigan

Description

Problem Statement

PR #19 grants the allocator EC2 instance powerful IAM permissions to autonomously create, manage, and destroy client VMs. While this enables the intended self-service architecture, it also introduces significant security and cost risks:

Security Risks:

  • Unauthorized access to the allocator instance could allow an attacker to:
    • Launch expensive EC2 instances (e.g., p4d.24xlarge at $32/hour)
    • Create IAM roles and escalate privileges
    • Modify security groups to expose resources
    • Delete critical infrastructure
  • No visibility into when/how these powerful permissions are being used
  • No audit trail for compliance or forensic analysis

Cost Risks:

  • Bugs in allocator code could launch hundreds of instances
  • Runaway processes could create resources indefinitely
  • No automated alerts for cost threshold breaches
  • No enforcement of instance type restrictions

Current State:

  • ✅ CloudWatch logging exists for client VM logs (sent to allocator via Lambda)
  • ❌ No monitoring of allocator's AWS API calls (EC2, IAM operations)
  • ❌ No CloudTrail configuration for audit logging
  • ❌ No CloudWatch alarms for security events
  • ❌ No cost monitoring or budget alerts
  • ❌ No SNS notifications for critical events

Context: Existing vs Planned Logging Systems

Current: CloudWatch → Lambda → Allocator (Client VM Logs)

Purpose: Collect application logs from client VMs
Architecture: Client VMs → CloudWatch Logs → Lambda → Allocator API
Scope: Client VM cloud-init and container logs
Status: Currently deployed

Future: Self-Hosted Logging System (Issue #225)

Purpose: Replace CloudWatch for client VM application logs
Architecture: Client VMs → Log Shipper → Allocator API → PostgreSQL
Scope: Client VM application logs only (not infrastructure)
Status: Proposed (will deprecate current CloudWatch → Lambda pipeline)

Proposed: Infrastructure Security Monitoring (This Issue)

Purpose: Monitor allocator's AWS API usage for security/cost control
Architecture: AWS API Calls → CloudTrail → CloudWatch Logs → Metric Filters → Alarms → SNS
Scope: EC2, IAM, and other AWS service calls made by allocator IAM role
Status: Not implemented

Important: This issue is complementary to issue #225. The self-hosted logging system handles application logs for end-users, while this infrastructure monitoring ensures secure operation of the platform itself.

Scope

This issue focuses on monitoring the allocator's infrastructure operations, not client application logs:

In Scope

  • ✅ CloudTrail logging for all API calls by lablink_instance_role_*
  • ✅ CloudWatch metric filters for security-relevant events
  • ✅ CloudWatch alarms for anomalous behavior
  • ✅ SNS notifications to administrators
  • ✅ AWS Budget alerts for cost overruns
  • ✅ Dashboard for real-time monitoring

Out of Scope

  • ❌ Client VM application logs (handled by issue #225)
  • ❌ Allocator application logs (handled by existing Docker logging)
  • ❌ Performance monitoring (separate concern)

Requirements

1. CloudTrail Configuration

Enable CloudTrail to capture all API events related to the allocator's IAM role:

Features:

  • Multi-region trail (captures API calls from all regions)
  • S3 bucket for log storage (encrypted at rest)
  • CloudWatch Logs integration for real-time analysis
  • Log file validation enabled
  • 90-day retention minimum (configurable)

Events to Capture:

  • All ec2:RunInstances calls by allocator role
  • All ec2:TerminateInstances calls by allocator role
  • All iam:CreateRole, iam:DeleteRole calls
  • All security group modifications
  • All failed API calls (authentication/authorization errors)

Cost Considerations:

  • CloudTrail: first copy of management events is free; additional copies ~$2.00/100,000 events. CloudWatch Logs ingestion: ~$0.50/GB
  • S3 storage: ~$0.023/GB/month
  • Expected cost: $5-15/month depending on usage

2. CloudWatch Metric Filters

Create metric filters to extract security-relevant patterns from CloudTrail logs:

Metric Filters:

| Filter Name | Pattern | Threshold | Purpose |
| --- | --- | --- | --- |
| RunInstancesCount | eventName = RunInstances AND userIdentity.arn LIKE %lablink_instance_role% | >10 / 5 min | Detect mass instance launches |
| TerminateInstancesCount | eventName = TerminateInstances AND userIdentity.arn LIKE %lablink_instance_role% | >20 / 5 min | Detect mass terminations |
| LargeInstanceLaunched | eventName = RunInstances AND requestParameters.instanceType LIKE (p4d.* OR p3.* OR g5.*) | >0 | Detect expensive instances |
| IAMRoleCreationCount | eventName = CreateRole AND userIdentity.arn LIKE %lablink_instance_role% | >10 / hour | Detect unusual role creation |
| SecurityGroupModifications | eventName LIKE *SecurityGroup* AND userIdentity.arn LIKE %lablink_instance_role% | >20 / hour | Detect SG tampering |
| UnauthorizedAPICalls | errorCode = (AccessDenied OR UnauthorizedOperation) AND userIdentity.arn LIKE %lablink_instance_role% | >5 / 15 min | Detect permission issues or attacks |
| ConsoleLoginFailures | eventName = ConsoleLogin AND errorMessage = "Failed authentication" | >3 / 5 min | Detect brute force on allocator |

Note: for assumed-role sessions, CloudTrail records the role name in userIdentity.arn; userIdentity.principalId carries the role's unique ID plus session name, so patterns should match on the ARN.
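
For reference, a trimmed CloudTrail RunInstances event shows the JSON fields these patterns match (values illustrative; note that for assumed-role sessions the role name appears in userIdentity.arn, while principalId carries the role's unique ID and session name):

```json
{
  "eventName": "RunInstances",
  "eventSource": "ec2.amazonaws.com",
  "userIdentity": {
    "type": "AssumedRole",
    "principalId": "AROAEXAMPLEID:i-0abc123def456",
    "arn": "arn:aws:sts::123456789012:assumed-role/lablink_instance_role_prod/i-0abc123def456"
  },
  "requestParameters": {
    "instanceType": "g5.xlarge"
  }
}
```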

3. CloudWatch Alarms

Create alarms that trigger on metric filter thresholds:

Alarm Configuration:

  • Evaluation period: 5 minutes (configurable)
  • Datapoints to alarm: 1 out of 1 (immediate alerting)
  • Missing data treatment: notBreaching (avoid false alarms during quiet periods)
  • Actions: Publish to SNS topic for admin notifications

Example Alarm (Terraform):

resource "aws_cloudwatch_metric_alarm" "mass_instance_launch" {
  alarm_name          = "lablink-mass-instance-launch-${var.resource_suffix}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "RunInstancesCount"
  namespace           = "LabLinkSecurity"
  period              = 300  # 5 minutes
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "Alert when allocator launches >10 instances in 5 minutes"
  alarm_actions       = [aws_sns_topic.admin_alerts.arn]
  treat_missing_data  = "notBreaching"
}

4. SNS Notifications

SNS Topic Configuration:

  • Topic name: lablink-admin-alerts-${resource_suffix}
  • Protocol: Email (configurable for Slack/PagerDuty later)
  • Subscription email: Configured in config.yaml

Example config.yaml additions:

monitoring:
  enabled: true  # Enable/disable all monitoring features
  alerts:
    email: "[email protected]"  # Email for critical alerts
    # Future: Slack webhook, PagerDuty integration key

  # Thresholds (all optional - use defaults if not specified)
  thresholds:
    max_instances_per_5min: 10  # Alert if >10 RunInstances in 5min
    max_terminations_per_5min: 20
    max_iam_roles_per_hour: 10
    max_security_group_changes_per_hour: 20
    max_unauthorized_calls_per_15min: 5

5. AWS Budget Alerts

Budget Configuration:

  • Budget type: Cost budget
  • Time period: Monthly recurring
  • Budget amount: Configurable in config.yaml (default: $500/month)
  • Alert thresholds:
    • 50% of budget (warning)
    • 80% of budget (urgent)
    • 100% of budget (critical)
    • 150% of budget (severe overage)

Example config.yaml:

monitoring:
  budget:
    enabled: true
    monthly_limit_usd: 500  # Alert if monthly costs exceed this
    alerts:
      - threshold_percent: 50
        severity: "warning"
      - threshold_percent: 80
        severity: "urgent"
      - threshold_percent: 100
        severity: "critical"

Terraform Resource:

resource "aws_budgets_budget" "lablink_monthly" {
  name              = "lablink-monthly-budget-${var.resource_suffix}"
  budget_type       = "COST"
  limit_amount      = local.monitoring_config.budget.monthly_limit_usd
  limit_unit        = "USD"
  time_period_start = "2025-01-01_00:00"
  time_unit         = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [local.monitoring_config.alerts.email]
  }
}

6. CloudWatch Dashboard (Optional)

Create a dashboard for real-time visibility into allocator operations:

Widgets:

  • Total instances launched (last 24h)
  • Total instances terminated (last 24h)
  • Active client VMs (current count)
  • API error rate (unauthorized calls)
  • Cost-to-date this month
  • Top 5 instance types launched
  • IAM roles created/deleted timeline

Access: Available at AWS Console → CloudWatch → Dashboards → lablink-${resource_suffix}
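
A minimal Terraform sketch of such a dashboard, assuming a `var.region` variable and the `LabLinkSecurity/${var.resource_suffix}` metric namespace used by the metric filters (widget layout and metric names are illustrative):

```hcl
# One-widget dashboard tracking launches vs terminations over time.
resource "aws_cloudwatch_dashboard" "lablink" {
  dashboard_name = "lablink-${var.resource_suffix}"

  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        x      = 0
        y      = 0
        width  = 12
        height = 6
        properties = {
          title  = "Instances launched vs terminated"
          region = var.region
          stat   = "Sum"
          period = 3600
          metrics = [
            ["LabLinkSecurity/${var.resource_suffix}", "RunInstancesCount"],
            ["LabLinkSecurity/${var.resource_suffix}", "TerminateInstancesCount"]
          ]
        }
      }
    ]
  })
}
```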

Implementation Plan

Phase 1: CloudTrail Setup (Foundational)

Priority: Critical
Estimated Time: 2-4 hours

Files to Modify:

  • lablink-infrastructure/cloudtrail.tf (new file)
  • lablink-infrastructure/main.tf (add CloudTrail module reference)
  • lablink-infrastructure/config/config.yaml (add monitoring section)

Tasks:

  1. Create S3 bucket for CloudTrail logs with encryption
  2. Create CloudTrail trail with CloudWatch Logs integration
  3. Capture management events for the APIs of interest; allocator-role filtering is applied downstream by metric filters (cost optimization)
  4. Enable log file validation
  5. Set retention policy (90 days default, configurable)

Terraform Resources:

# lablink-infrastructure/cloudtrail.tf

# S3 bucket for CloudTrail logs
resource "aws_s3_bucket" "cloudtrail_logs" {
  bucket = "lablink-cloudtrail-${var.resource_suffix}-${data.aws_caller_identity.current.account_id}"

  tags = {
    Name        = "lablink-cloudtrail-${var.resource_suffix}"
    Environment = var.resource_suffix
  }
}

# S3 bucket encryption
resource "aws_s3_bucket_server_side_encryption_configuration" "cloudtrail_encryption" {
  bucket = aws_s3_bucket.cloudtrail_logs.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# S3 bucket policy for CloudTrail
resource "aws_s3_bucket_policy" "cloudtrail_policy" {
  bucket = aws_s3_bucket.cloudtrail_logs.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AWSCloudTrailAclCheck"
        Effect = "Allow"
        Principal = {
          Service = "cloudtrail.amazonaws.com"
        }
        Action   = "s3:GetBucketAcl"
        Resource = aws_s3_bucket.cloudtrail_logs.arn
      },
      {
        Sid    = "AWSCloudTrailWrite"
        Effect = "Allow"
        Principal = {
          Service = "cloudtrail.amazonaws.com"
        }
        Action   = "s3:PutObject"
        Resource = "${aws_s3_bucket.cloudtrail_logs.arn}/*"
        Condition = {
          StringEquals = {
            "s3:x-amz-acl" = "bucket-owner-full-control"
          }
        }
      }
    ]
  })
}
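
The CloudWatch log group below gets an explicit retention policy; the S3 copy of the trail can be aged out the same way. A sketch (90-day expiration to match the CloudWatch default; adjust to your compliance needs):

```hcl
# Expire CloudTrail objects in S3 after the retention window.
resource "aws_s3_bucket_lifecycle_configuration" "cloudtrail_lifecycle" {
  bucket = aws_s3_bucket.cloudtrail_logs.id

  rule {
    id     = "expire-cloudtrail-logs"
    status = "Enabled"

    filter {}

    expiration {
      days = 90
    }
  }
}
```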

# CloudWatch Log Group for CloudTrail
resource "aws_cloudwatch_log_group" "cloudtrail_logs" {
  name              = "lablink-cloudtrail-${var.resource_suffix}"
  retention_in_days = try(local.config_file.monitoring.cloudtrail.retention_days, 90)
}

# IAM role for CloudTrail to write to CloudWatch
resource "aws_iam_role" "cloudtrail_cloudwatch_role" {
  name = "lablink_cloudtrail_cloudwatch_${var.resource_suffix}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Service = "cloudtrail.amazonaws.com"
        }
        Action = "sts:AssumeRole"
      }
    ]
  })
}

# IAM policy for CloudTrail CloudWatch access
resource "aws_iam_role_policy" "cloudtrail_cloudwatch_policy" {
  name = "lablink_cloudtrail_cloudwatch_policy_${var.resource_suffix}"
  role = aws_iam_role.cloudtrail_cloudwatch_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "${aws_cloudwatch_log_group.cloudtrail_logs.arn}:*"
      }
    ]
  })
}

# CloudTrail
resource "aws_cloudtrail" "lablink_trail" {
  name                          = "lablink-trail-${var.resource_suffix}"
  s3_bucket_name                = aws_s3_bucket.cloudtrail_logs.id
  include_global_service_events = true
  is_multi_region_trail         = true
  enable_log_file_validation    = true

  cloud_watch_logs_group_arn = "${aws_cloudwatch_log_group.cloudtrail_logs.arn}:*"
  cloud_watch_logs_role_arn  = aws_iam_role.cloudtrail_cloudwatch_role.arn

  event_selector {
    read_write_type           = "All"
    include_management_events = true

    # Note: RunInstances, TerminateInstances, and IAM calls are management
    # events, so they are already captured above. Basic event selectors cannot
    # filter by IAM role; per-role filtering happens downstream in the
    # CloudWatch metric filters (Phase 2).
  }

  depends_on = [
    aws_s3_bucket_policy.cloudtrail_policy,
    aws_cloudwatch_log_group.cloudtrail_logs,
    aws_iam_role_policy.cloudtrail_cloudwatch_policy
  ]

  tags = {
    Name        = "lablink-trail-${var.resource_suffix}"
    Environment = var.resource_suffix
  }
}

Testing:

  1. Deploy CloudTrail configuration
  2. Launch a test client VM via allocator
  3. Verify CloudTrail logs appear in S3 bucket
  4. Verify logs appear in CloudWatch Logs group
  5. Check for RunInstances events in CloudWatch Logs Insights
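
For step 5, a CloudWatch Logs Insights query along these lines (run against the CloudTrail log group created above) surfaces recent launches:

```
fields @timestamp, eventName, userIdentity.arn, requestParameters.instanceType
| filter eventName = "RunInstances"
| sort @timestamp desc
| limit 20
```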

Phase 2: Metric Filters & Alarms (Security Monitoring)

Priority: High
Estimated Time: 4-6 hours

Files to Modify:

  • lablink-infrastructure/cloudwatch_alarms.tf (new file)
  • lablink-infrastructure/config/config.yaml (update monitoring section)

Tasks:

  1. Create SNS topic for admin alerts
  2. Add email subscription (from config.yaml)
  3. Create metric filters for each security pattern
  4. Create CloudWatch alarms for each metric
  5. Test alarm triggering with synthetic events

Terraform Resources:

# lablink-infrastructure/cloudwatch_alarms.tf

# SNS Topic for Admin Alerts
resource "aws_sns_topic" "admin_alerts" {
  name = "lablink-admin-alerts-${var.resource_suffix}"

  tags = {
    Name        = "lablink-admin-alerts-${var.resource_suffix}"
    Environment = var.resource_suffix
  }
}

# SNS Email Subscription
resource "aws_sns_topic_subscription" "admin_email" {
  count     = try(local.config_file.monitoring.enabled, false) ? 1 : 0
  topic_arn = aws_sns_topic.admin_alerts.arn
  protocol  = "email"
  endpoint  = try(local.config_file.monitoring.alerts.email, "")
}

# Metric Filter: Mass Instance Launches
resource "aws_cloudwatch_log_metric_filter" "run_instances" {
  name           = "lablink-run-instances-${var.resource_suffix}"
  log_group_name = aws_cloudwatch_log_group.cloudtrail_logs.name

  pattern = <<PATTERN
{ ($.eventName = "RunInstances") && ($.userIdentity.arn = "*lablink_instance_role*") }
PATTERN

  metric_transformation {
    name      = "RunInstancesCount"
    namespace = "LabLinkSecurity/${var.resource_suffix}"
    value     = "1"
    unit      = "Count"
  }
}

# Alarm: Mass Instance Launches
resource "aws_cloudwatch_metric_alarm" "mass_instance_launch" {
  alarm_name          = "lablink-mass-instance-launch-${var.resource_suffix}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "RunInstancesCount"
  namespace           = "LabLinkSecurity/${var.resource_suffix}"
  period              = 300  # 5 minutes
  statistic           = "Sum"
  threshold           = try(local.config_file.monitoring.thresholds.max_instances_per_5min, 10)
  alarm_description   = "Alert when allocator launches >${try(local.config_file.monitoring.thresholds.max_instances_per_5min, 10)} instances in 5 minutes"
  alarm_actions       = [aws_sns_topic.admin_alerts.arn]
  treat_missing_data  = "notBreaching"

  tags = {
    Name        = "lablink-mass-instance-launch-${var.resource_suffix}"
    Environment = var.resource_suffix
    Severity    = "high"
  }
}

# Metric Filter: Large Instance Types
resource "aws_cloudwatch_log_metric_filter" "large_instances" {
  name           = "lablink-large-instances-${var.resource_suffix}"
  log_group_name = aws_cloudwatch_log_group.cloudtrail_logs.name

  pattern = <<PATTERN
{ ($.eventName = "RunInstances") && ($.userIdentity.arn = "*lablink_instance_role*") && (($.requestParameters.instanceType = "p4d.*") || ($.requestParameters.instanceType = "p3.*") || ($.requestParameters.instanceType = "g5.*")) }
PATTERN

  metric_transformation {
    name      = "LargeInstanceLaunched"
    namespace = "LabLinkSecurity/${var.resource_suffix}"
    value     = "1"
    unit      = "Count"
  }
}

# Alarm: Large Instance Types
resource "aws_cloudwatch_metric_alarm" "large_instance_launched" {
  alarm_name          = "lablink-large-instance-launched-${var.resource_suffix}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "LargeInstanceLaunched"
  namespace           = "LabLinkSecurity/${var.resource_suffix}"
  period              = 300
  statistic           = "Sum"
  threshold           = 0  # Alert on ANY large instance
  alarm_description   = "Alert when allocator launches expensive instance types (p4d, p3, g5)"
  alarm_actions       = [aws_sns_topic.admin_alerts.arn]
  treat_missing_data  = "notBreaching"

  tags = {
    Name        = "lablink-large-instance-launched-${var.resource_suffix}"
    Environment = var.resource_suffix
    Severity    = "critical"
  }
}

# Metric Filter: Unauthorized API Calls
resource "aws_cloudwatch_log_metric_filter" "unauthorized_calls" {
  name           = "lablink-unauthorized-calls-${var.resource_suffix}"
  log_group_name = aws_cloudwatch_log_group.cloudtrail_logs.name

  pattern = <<PATTERN
{ (($.errorCode = "AccessDenied") || ($.errorCode = "UnauthorizedOperation")) && ($.userIdentity.arn = "*lablink_instance_role*") }
PATTERN

  metric_transformation {
    name      = "UnauthorizedAPICalls"
    namespace = "LabLinkSecurity/${var.resource_suffix}"
    value     = "1"
    unit      = "Count"
  }
}

# Alarm: Unauthorized API Calls
resource "aws_cloudwatch_metric_alarm" "unauthorized_calls" {
  alarm_name          = "lablink-unauthorized-calls-${var.resource_suffix}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "UnauthorizedAPICalls"
  namespace           = "LabLinkSecurity/${var.resource_suffix}"
  period              = 900  # 15 minutes
  statistic           = "Sum"
  threshold           = try(local.config_file.monitoring.thresholds.max_unauthorized_calls_per_15min, 5)
  alarm_description   = "Alert when allocator makes unauthorized API calls (possible attack or permission issue)"
  alarm_actions       = [aws_sns_topic.admin_alerts.arn]
  treat_missing_data  = "notBreaching"

  tags = {
    Name        = "lablink-unauthorized-calls-${var.resource_suffix}"
    Environment = var.resource_suffix
    Severity    = "critical"
  }
}

# Additional alarms follow same pattern for:
# - TerminateInstances
# - IAM role creation
# - Security group modifications
# (omitted for brevity - implement similarly)

Phase 3: AWS Budget Alerts (Cost Monitoring)

Priority: High
Estimated Time: 2-3 hours

Files to Modify:

  • lablink-infrastructure/budgets.tf (new file)
  • lablink-infrastructure/config/config.yaml (update budget settings)

Tasks:

  1. Create AWS Budget for monthly cost tracking
  2. Configure multi-threshold alerts (50%, 80%, 100%, 150%)
  3. Link to SNS topic for notifications
  4. Test with simulated cost data

Terraform Resources:

# lablink-infrastructure/budgets.tf

resource "aws_budgets_budget" "lablink_monthly" {
  count = try(local.config_file.monitoring.budget.enabled, false) ? 1 : 0

  name              = "lablink-monthly-budget-${var.resource_suffix}"
  budget_type       = "COST"
  limit_amount      = try(local.config_file.monitoring.budget.monthly_limit_usd, "500")
  limit_unit        = "USD"
  # Note: timestamp() changes on every plan; pin a fixed start date to avoid perpetual diffs
  time_period_start = formatdate("YYYY-MM-01_00:00", timestamp())
  time_unit         = "MONTHLY"

  cost_filter {
    name = "TagKeyValue"
    values = [
      # "$${" renders as a literal "${" in Terraform, so the original value was
      # never interpolated; format() keeps the literal "$" separator AWS expects.
      format("user:Environment$%s", var.resource_suffix),
      "user:ManagedBy$lablink-allocator-${var.resource_suffix}"
    ]
  }

  # 50% warning
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 50
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [try(local.config_file.monitoring.alerts.email, "")]
  }

  # 80% urgent
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [try(local.config_file.monitoring.alerts.email, "")]
  }

  # 100% critical
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [try(local.config_file.monitoring.alerts.email, "")]
  }

  # 150% severe overage
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 150
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [try(local.config_file.monitoring.alerts.email, "")]
  }

  tags = {
    Name        = "lablink-monthly-budget-${var.resource_suffix}"
    Environment = var.resource_suffix
  }
}

Phase 4: Dashboard (Optional - Nice to Have)

Priority: Low
Estimated Time: 3-4 hours

Files to Modify:

  • lablink-infrastructure/dashboard.tf (new file)

Tasks:

  1. Create CloudWatch Dashboard
  2. Add widgets for key metrics
  3. Configure auto-refresh
  4. Document dashboard URL in README

Phase 5: Documentation & Testing

Priority: High
Estimated Time: 2-3 hours

Files to Modify:

  • lablink-infrastructure/README.md (add monitoring section)
  • lablink-infrastructure/SECURITY.md (update security best practices)
  • lablink-infrastructure/config/*.example.yaml (add monitoring config examples)

Tasks:

  1. Update README with monitoring architecture
  2. Document how to view CloudTrail logs
  3. Document how to respond to alerts
  4. Create runbook for common alert scenarios
  5. Test end-to-end alert flow

Configuration Schema

Additions to config.yaml:

# Infrastructure Monitoring Configuration (optional - defaults shown)
monitoring:
  enabled: true  # Master switch - set to false to disable all monitoring

  # Alert notification settings
  alerts:
    email: "[email protected]"  # Required if monitoring.enabled = true
    # Future integrations:
    # slack_webhook: ""
    # pagerduty_key: ""

  # CloudTrail configuration
  cloudtrail:
    retention_days: 90  # How long to keep CloudTrail logs in CloudWatch

  # Alert thresholds (optional - defaults shown)
  thresholds:
    max_instances_per_5min: 10  # Alarm if >10 RunInstances in 5 minutes
    max_terminations_per_5min: 20  # Alarm if >20 TerminateInstances in 5 minutes
    max_iam_roles_per_hour: 10  # Alarm if >10 IAM CreateRole in 1 hour
    max_security_group_changes_per_hour: 20  # Alarm if >20 SG changes in 1 hour
    max_unauthorized_calls_per_15min: 5  # Alarm if >5 AccessDenied in 15 minutes

  # AWS Budget alerts
  budget:
    enabled: true
    monthly_limit_usd: 500  # Monthly cost budget
    # Alerts automatically sent at 50%, 80%, 100%, 150% thresholds
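
The Terraform snippets above reference `local.config_file`; assuming it is not already defined elsewhere in the module, it can be loaded with a small `locals` block (path is illustrative):

```hcl
# Parse config.yaml so monitoring settings are reachable as
# local.config_file.monitoring.* with try() fallbacks to defaults.
locals {
  config_file = yamldecode(file("${path.module}/config/config.yaml"))
}
```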

Example config files to update:

  • config/ci-test.example.yaml - Low budget ($100/month), strict thresholds
  • config/prod.example.yaml - Higher budget ($1000/month), relaxed thresholds
  • config/dev.example.yaml - Monitoring disabled (optional)

Migration Path from Issue #225

Timeline Coordination:

  1. Now (Immediate): Implement this infrastructure monitoring

  2. After Issue #225 Lands: Deprecate client VM CloudWatch logging

    • Remove aws_cloudwatch_log_group.client_vm_logs
    • Remove aws_lambda_function.log_processor
    • Remove aws_cloudwatch_log_subscription_filter.lambda_subscription
    • Keep CloudTrail CloudWatch integration (different log group)

No Conflicts:

  • This issue uses aws_cloudwatch_log_group.cloudtrail_logs (new)
  • Issue #225 deprecates aws_cloudwatch_log_group.client_vm_logs (existing)
  • Different log groups, different purposes

Testing Plan

Unit Tests

  1. Validate Terraform syntax: terraform validate
  2. Verify config schema: lablink-validate-config
  3. Check for resource naming conflicts

Integration Tests (ci-test environment)

  1. Deploy monitoring infrastructure
  2. Verify CloudTrail logging:
    • Launch test VM → Check S3 for CloudTrail logs
    • Terminate test VM → Check CloudWatch Logs
  3. Trigger alarms:
    • Launch 11 instances rapidly → Verify SNS email sent
    • Attempt unauthorized API call → Verify alarm triggers
  4. Verify budget alerts:
    • Simulate cost threshold breach (may require manual AWS Budgets testing)
  5. Check dashboard (if implemented)

Acceptance Criteria

  • CloudTrail logs appear in S3 within 15 minutes of API call
  • CloudTrail logs appear in CloudWatch within 5 minutes
  • Metric filter increments when matching event occurs
  • Alarm triggers when threshold exceeded
  • SNS email delivered to configured address
  • Budget alert sent at 50% threshold
  • Documentation updated with monitoring architecture
  • Example configs include monitoring section

Cost Estimate

Monthly Costs (ci-test environment, ~50 VMs/day):

| Service | Usage | Cost |
| --- | --- | --- |
| CloudTrail | ~150K events/month | $1.50 |
| S3 Storage | ~2 GB/month | $0.05 |
| CloudWatch Logs Ingestion | ~1 GB/month | $0.50 |
| CloudWatch Logs Storage | ~1 GB/month | $0.03 |
| CloudWatch Alarms | 7 alarms | $0.70 |
| SNS Notifications | ~20 emails/month | $0.00 |
| AWS Budgets | 1 budget | $0.00 (first 2 free) |
| **Total** | | **~$2.78/month** |

Production environment (10x usage): ~$15-20/month

ROI: One prevented p4d.24xlarge instance running for 1 hour ($32) pays for 11 months of monitoring.

Security Considerations

Principle of Least Privilege:

  • CloudTrail role can only write to specific CloudWatch log group
  • SNS topic policy restricts who can publish
  • S3 bucket encrypted at rest (AES256)
  • CloudTrail log file validation prevents tampering

Alert Fatigue Mitigation:

  • Conservative thresholds by default (tunable in config.yaml)
  • treat_missing_data = "notBreaching" prevents false alarms
  • Severity tags on alarms for prioritization

False Positive Scenarios:

  • Legitimate batch operations (e.g., spinning up 20 VMs for class)
    • Solution: Temporarily increase thresholds in config.yaml
  • Development testing triggering alarms
    • Solution: Disable monitoring in dev environment

Future Enhancements (Not in Scope)

  • Slack/PagerDuty integration for alerts
  • Automated remediation (Lambda to terminate unauthorized instances)
  • Cost anomaly detection (ML-based)
  • Integration with AWS Security Hub
  • Custom CloudWatch dashboard with cost projections
  • Log analytics queries for trend analysis
  • Integration with allocator web UI (show alerts in admin panel)

Acceptance Checklist

Before closing this issue:

  • CloudTrail configured and logging to S3 + CloudWatch
  • All 7 metric filters created
  • All 7 CloudWatch alarms created and tested
  • SNS topic created with email subscription
  • AWS Budget configured with multi-threshold alerts
  • Configuration schema added to config.yaml
  • All example config files updated
  • README.md updated with monitoring section
  • SECURITY.md updated with monitoring best practices
  • End-to-end testing completed in ci-test
  • Cost estimate validated against actual usage
  • Runbook created for responding to alerts

Labels: enhancement (New feature or request)