Infrastructure Monitoring & Security Alerts for Allocator IAM Role
Problem Statement
PR #19 grants the allocator EC2 instance powerful IAM permissions to autonomously create, manage, and destroy client VMs. While this enables the intended self-service architecture, it also introduces significant security and cost risks:
Security Risks:
- Unauthorized access to the allocator instance could allow an attacker to:
- Launch expensive EC2 instances (e.g., p4d.24xlarge at $32/hour)
- Create IAM roles and escalate privileges
- Modify security groups to expose resources
- Delete critical infrastructure
- No visibility into when/how these powerful permissions are being used
- No audit trail for compliance or forensic analysis
Cost Risks:
- Bugs in allocator code could launch hundreds of instances
- Runaway processes could create resources indefinitely
- No automated alerts for cost threshold breaches
- No enforcement of instance type restrictions
Current State:
- ✅ CloudWatch logging exists for client VM logs (sent to allocator via Lambda)
- ❌ No monitoring of allocator's AWS API calls (EC2, IAM operations)
- ❌ No CloudTrail configuration for audit logging
- ❌ No CloudWatch alarms for security events
- ❌ No cost monitoring or budget alerts
- ❌ No SNS notifications for critical events
Context: Existing vs Planned Logging Systems
Current: CloudWatch → Lambda → Allocator (Client VM Logs)
Purpose: Collect application logs from client VMs
Architecture: Client VMs → CloudWatch Logs → Lambda → Allocator API
Scope: Client VM cloud-init and container logs
Status: Currently deployed
Future: Self-Hosted Logging System (Issue #225)
Purpose: Replace CloudWatch for client VM application logs
Architecture: Client VMs → Log Shipper → Allocator API → PostgreSQL
Scope: Client VM application logs only (not infrastructure)
Status: Proposed (will deprecate current CloudWatch → Lambda pipeline)
Proposed: Infrastructure Security Monitoring (This Issue)
Purpose: Monitor allocator's AWS API usage for security/cost control
Architecture: AWS API Calls → CloudTrail → CloudWatch Logs → Metric Filters → Alarms → SNS
Scope: EC2, IAM, and other AWS service calls made by allocator IAM role
Status: Not implemented
Important: This issue is complementary to issue #225. The self-hosted logging system handles application logs for end-users, while this infrastructure monitoring ensures secure operation of the platform itself.
Scope
This issue focuses on monitoring the allocator's infrastructure operations, not client application logs:
In Scope
- ✅ CloudTrail logging for all API calls by `lablink_instance_role_*`
- ✅ CloudWatch metric filters for security-relevant events
- ✅ CloudWatch alarms for anomalous behavior
- ✅ SNS notifications to administrators
- ✅ AWS Budget alerts for cost overruns
- ✅ Dashboard for real-time monitoring
Out of Scope
- ❌ Client VM application logs (handled by issue #225)
- ❌ Allocator application logs (handled by existing Docker logging)
- ❌ Performance monitoring (separate concern)
Requirements
1. CloudTrail Configuration
Enable CloudTrail to capture all API events related to the allocator's IAM role:
Features:
- Multi-region trail (captures API calls from all regions)
- S3 bucket for log storage (encrypted at rest)
- CloudWatch Logs integration for real-time analysis
- Log file validation enabled
- 90-day retention minimum (configurable)
Events to Capture:
- All `ec2:RunInstances` calls by allocator role
- All `ec2:TerminateInstances` calls by allocator role
- All `iam:CreateRole` and `iam:DeleteRole` calls
- All security group modifications
- All failed API calls (authentication/authorization errors)
Cost Considerations:
- CloudTrail: ~$2.00 per 100,000 management events (the first copy of management events is free) + CloudWatch Logs ingestion (~$0.50/GB)
- S3 storage: ~$0.023/GB/month
- Expected cost: $5-15/month depending on usage
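The estimate above is simple arithmetic on event volume and log size. A minimal sketch of that back-of-envelope calculation, using the per-unit prices quoted above (verify against current AWS pricing before relying on them):

```python
# Assumed prices, taken from the figures quoted in this issue.
EVENT_PRICE_PER_100K = 2.00    # CloudTrail, per 100,000 events
S3_PRICE_PER_GB_MONTH = 0.023  # S3 standard storage, per GB-month

def monthly_cloudtrail_cost(events_per_month: int, log_gb: float) -> float:
    """Rough monthly cost: CloudTrail event charges plus S3 log storage."""
    event_cost = events_per_month / 100_000 * EVENT_PRICE_PER_100K
    storage_cost = log_gb * S3_PRICE_PER_GB_MONTH
    return round(event_cost + storage_cost, 2)

# e.g. ~150K events and ~2 GB of logs per month:
print(monthly_cloudtrail_cost(150_000, 2.0))  # → 3.05
```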
2. CloudWatch Metric Filters
Create metric filters to extract security-relevant patterns from CloudTrail logs:
Metric Filters:
| Filter Name | Pattern | Threshold | Purpose |
|---|---|---|---|
| `RunInstancesCount` | `eventName = RunInstances AND userIdentity.principalId LIKE %lablink_instance_role%` | >10/5min | Detect mass instance launches |
| `TerminateInstancesCount` | `eventName = TerminateInstances AND userIdentity.principalId LIKE %lablink_instance_role%` | >20/5min | Detect mass terminations |
| `LargeInstanceLaunched` | `eventName = RunInstances AND requestParameters.instanceType LIKE (p4d.* OR p3.* OR g5.*)` | >0 | Detect expensive instances |
| `IAMRoleCreationCount` | `eventName = CreateRole AND userIdentity.principalId LIKE %lablink_instance_role%` | >10/hour | Detect unusual role creation |
| `SecurityGroupModifications` | `eventName LIKE *SecurityGroup* AND userIdentity.principalId LIKE %lablink_instance_role%` | >20/hour | Detect SG tampering |
| `UnauthorizedAPICalls` | `errorCode = (AccessDenied OR UnauthorizedOperation) AND userIdentity.principalId LIKE %lablink_instance_role%` | >5/15min | Detect permission issues or attacks |
| `ConsoleLoginFailures` | `eventName = ConsoleLogin AND errorMessage = "Failed authentication"` | >3/5min | Detect brute force on allocator |
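The patterns above match JSON fields inside CloudTrail events; the actual matching is done by CloudWatch Logs metric filters. As a purely illustrative sketch (not the CloudWatch implementation), here is the first filter's logic re-stated in Python against a sample CloudTrail event:

```python
import fnmatch

def matches_run_instances_filter(event: dict) -> bool:
    """Illustrative re-implementation of the RunInstancesCount filter:
    eventName = RunInstances AND principalId LIKE %lablink_instance_role%."""
    principal = event.get("userIdentity", {}).get("principalId", "")
    return (event.get("eventName") == "RunInstances"
            and fnmatch.fnmatch(principal, "*lablink_instance_role*"))

# Minimal CloudTrail-shaped event (field names as in the table above,
# principalId value is a made-up example):
sample = {
    "eventName": "RunInstances",
    "userIdentity": {"principalId": "AROAEXAMPLE:lablink_instance_role_prod"},
}
print(matches_run_instances_filter(sample))  # → True
```

Each matching event increments the metric by 1; the alarm then sums these increments over its evaluation period.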
3. CloudWatch Alarms
Create alarms that trigger on metric filter thresholds:
Alarm Configuration:
- Evaluation period: 5 minutes (configurable)
- Datapoints to alarm: 1 out of 1 (immediate alerting)
- Missing data treatment: `notBreaching` (avoids false alarms during quiet periods)
- Actions: Publish to SNS topic for admin notifications
Example Alarm (Terraform):
```hcl
resource "aws_cloudwatch_metric_alarm" "mass_instance_launch" {
  alarm_name          = "lablink-mass-instance-launch-${var.resource_suffix}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "RunInstancesCount"
  namespace           = "LabLinkSecurity"
  period              = 300 # 5 minutes
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "Alert when allocator launches >10 instances in 5 minutes"
  alarm_actions       = [aws_sns_topic.admin_alerts.arn]
  treat_missing_data  = "notBreaching"
}
```
4. SNS Notifications
SNS Topic Configuration:
- Topic name: `lablink-admin-alerts-${resource_suffix}`
- Protocol: Email (configurable for Slack/PagerDuty later)
- Subscription email: Configured in `config.yaml`
Example config.yaml additions:
```yaml
monitoring:
  enabled: true # Enable/disable all monitoring features
  alerts:
    email: "[email protected]" # Email for critical alerts
    # Future: Slack webhook, PagerDuty integration key
  # Thresholds (all optional - use defaults if not specified)
  thresholds:
    max_instances_per_5min: 10 # Alert if >10 RunInstances in 5min
    max_terminations_per_5min: 20
    max_iam_roles_per_hour: 10
    max_security_group_changes_per_hour: 20
    max_unauthorized_calls_per_15min: 5
```
5. AWS Budget Alerts
Budget Configuration:
- Budget type: Cost budget
- Time period: Monthly recurring
- Budget amount: Configurable in `config.yaml` (default: $500/month)
- Alert thresholds:
- 50% of budget (warning)
- 80% of budget (urgent)
- 100% of budget (critical)
- 150% of budget (severe overage)
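For the default $500/month budget, those percentage thresholds translate into dollar amounts at which each alert fires. A one-line sketch of that arithmetic:

```python
def alert_points(monthly_limit_usd: float, thresholds=(50, 80, 100, 150)):
    """Dollar spend at which each percentage threshold fires."""
    return [monthly_limit_usd * pct / 100 for pct in thresholds]

# Default $500/month budget:
print(alert_points(500))  # → [250.0, 400.0, 500.0, 750.0]
```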
Example config.yaml:
```yaml
monitoring:
  budget:
    enabled: true
    monthly_limit_usd: 500 # Alert if monthly costs exceed this
    alerts:
      - threshold_percent: 50
        severity: "warning"
      - threshold_percent: 80
        severity: "urgent"
      - threshold_percent: 100
        severity: "critical"
```
Terraform Resource:
```hcl
resource "aws_budgets_budget" "lablink_monthly" {
  name              = "lablink-monthly-budget-${var.resource_suffix}"
  budget_type       = "COST"
  limit_amount      = local.monitoring_config.budget.monthly_limit_usd
  limit_unit        = "USD"
  time_period_start = "2025-01-01_00:00"
  time_unit         = "MONTHLY"
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [local.monitoring_config.alerts.email]
  }
}
```
6. CloudWatch Dashboard (Optional)
Create a dashboard for real-time visibility into allocator operations:
Widgets:
- Total instances launched (last 24h)
- Total instances terminated (last 24h)
- Active client VMs (current count)
- API error rate (unauthorized calls)
- Cost-to-date this month
- Top 5 instance types launched
- IAM roles created/deleted timeline
Access: Available at AWS Console → CloudWatch → Dashboards → lablink-${resource_suffix}
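CloudWatch dashboards are defined by a JSON body of widgets. As an illustrative sketch only (helper name and layout values are ours; the widget schema keys are the standard CloudWatch dashboard body format), the first widget could be generated like this, referencing the custom metric namespace used by the filters in this issue:

```python
import json

def instances_launched_widget(resource_suffix: str, region: str = "us-east-1") -> dict:
    """One dashboard widget: sum of RunInstancesCount, hourly buckets.
    The namespace matches the metric filters proposed in this issue."""
    return {
        "type": "metric",
        "width": 12,
        "height": 6,
        "properties": {
            "title": "Instances launched (24h)",
            "region": region,
            "stat": "Sum",
            "period": 3600,
            "metrics": [[f"LabLinkSecurity/{resource_suffix}", "RunInstancesCount"]],
        },
    }

# The full dashboard body is {"widgets": [...]} serialized to JSON:
body = json.dumps({"widgets": [instances_launched_widget("dev")]})
print("RunInstancesCount" in body)  # → True
```

In Terraform, this JSON string would be passed as the `dashboard_body` of an `aws_cloudwatch_dashboard` resource.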
Implementation Plan
Phase 1: CloudTrail Setup (Foundational)
Priority: Critical
Estimated Time: 2-4 hours
Files to Modify:
- `lablink-infrastructure/cloudtrail.tf` (new file)
- `lablink-infrastructure/main.tf` (add CloudTrail module reference)
- `lablink-infrastructure/config/config.yaml` (add monitoring section)
Tasks:
- Create S3 bucket for CloudTrail logs with encryption
- Create CloudTrail trail with CloudWatch Logs integration
- Configure trail to filter for allocator role events only (cost optimization)
- Enable log file validation
- Set retention policy (90 days default, configurable)
Terraform Resources:
```hcl
# lablink-infrastructure/cloudtrail.tf

# S3 bucket for CloudTrail logs
resource "aws_s3_bucket" "cloudtrail_logs" {
  bucket = "lablink-cloudtrail-${var.resource_suffix}-${data.aws_caller_identity.current.account_id}"
  tags = {
    Name        = "lablink-cloudtrail-${var.resource_suffix}"
    Environment = var.resource_suffix
  }
}

# S3 bucket encryption
resource "aws_s3_bucket_server_side_encryption_configuration" "cloudtrail_encryption" {
  bucket = aws_s3_bucket.cloudtrail_logs.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# S3 bucket policy for CloudTrail
resource "aws_s3_bucket_policy" "cloudtrail_policy" {
  bucket = aws_s3_bucket.cloudtrail_logs.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AWSCloudTrailAclCheck"
        Effect = "Allow"
        Principal = {
          Service = "cloudtrail.amazonaws.com"
        }
        Action   = "s3:GetBucketAcl"
        Resource = aws_s3_bucket.cloudtrail_logs.arn
      },
      {
        Sid    = "AWSCloudTrailWrite"
        Effect = "Allow"
        Principal = {
          Service = "cloudtrail.amazonaws.com"
        }
        Action   = "s3:PutObject"
        Resource = "${aws_s3_bucket.cloudtrail_logs.arn}/*"
        Condition = {
          StringEquals = {
            "s3:x-amz-acl" = "bucket-owner-full-control"
          }
        }
      }
    ]
  })
}

# CloudWatch Log Group for CloudTrail
resource "aws_cloudwatch_log_group" "cloudtrail_logs" {
  name              = "lablink-cloudtrail-${var.resource_suffix}"
  retention_in_days = try(local.config_file.monitoring.cloudtrail.retention_days, 90)
}

# IAM role for CloudTrail to write to CloudWatch
resource "aws_iam_role" "cloudtrail_cloudwatch_role" {
  name = "lablink_cloudtrail_cloudwatch_${var.resource_suffix}"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Service = "cloudtrail.amazonaws.com"
        }
        Action = "sts:AssumeRole"
      }
    ]
  })
}

# IAM policy for CloudTrail CloudWatch access
resource "aws_iam_role_policy" "cloudtrail_cloudwatch_policy" {
  name = "lablink_cloudtrail_cloudwatch_policy_${var.resource_suffix}"
  role = aws_iam_role.cloudtrail_cloudwatch_role.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "${aws_cloudwatch_log_group.cloudtrail_logs.arn}:*"
      }
    ]
  })
}

# CloudTrail
resource "aws_cloudtrail" "lablink_trail" {
  name                          = "lablink-trail-${var.resource_suffix}"
  s3_bucket_name                = aws_s3_bucket.cloudtrail_logs.id
  include_global_service_events = true
  is_multi_region_trail         = true
  enable_log_file_validation    = true
  cloud_watch_logs_group_arn    = "${aws_cloudwatch_log_group.cloudtrail_logs.arn}:*"
  cloud_watch_logs_role_arn     = aws_iam_role.cloudtrail_cloudwatch_role.arn
  event_selector {
    read_write_type           = "All"
    include_management_events = true
    # Note: ec2:RunInstances, ec2:TerminateInstances, and IAM calls are
    # management events, so include_management_events = true already captures
    # them. "AWS::EC2::Instance" is NOT a valid data_resource type for basic
    # event selectors (only S3 objects, Lambda functions, and DynamoDB tables
    # are); narrowing the trail to allocator-role events would require
    # advanced event selectors instead.
  }
  depends_on = [
    aws_s3_bucket_policy.cloudtrail_policy,
    aws_cloudwatch_log_group.cloudtrail_logs,
    aws_iam_role_policy.cloudtrail_cloudwatch_policy
  ]
  tags = {
    Name        = "lablink-trail-${var.resource_suffix}"
    Environment = var.resource_suffix
  }
}
```
Testing:
- Deploy CloudTrail configuration
- Launch a test client VM via allocator
- Verify CloudTrail logs appear in S3 bucket
- Verify logs appear in CloudWatch Logs group
- Check for `RunInstances` events in CloudWatch Logs Insights
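The last verification step amounts to running a Logs Insights query against the CloudTrail log group. A sketch of a hypothetical helper (the helper name is ours; the query text is standard Logs Insights syntax) that builds that query:

```python
def run_instances_query(role_fragment: str = "lablink_instance_role") -> str:
    """Build a CloudWatch Logs Insights query for RunInstances events
    made by the allocator role (substring match on principalId)."""
    return (
        "fields @timestamp, userIdentity.principalId, requestParameters.instanceType\n"
        f"| filter eventName = 'RunInstances' and userIdentity.principalId like '{role_fragment}'\n"
        "| sort @timestamp desc\n"
        "| limit 50"
    )

print("RunInstances" in run_instances_query())  # → True
```

The resulting string can be pasted into the Logs Insights console or passed to the `StartQuery` API against the `lablink-cloudtrail-*` log group.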
Phase 2: Metric Filters & Alarms (Security Monitoring)
Priority: High
Estimated Time: 4-6 hours
Files to Modify:
- `lablink-infrastructure/cloudwatch_alarms.tf` (new file)
- `lablink-infrastructure/config/config.yaml` (update monitoring section)
Tasks:
- Create SNS topic for admin alerts
- Add email subscription (from config.yaml)
- Create metric filters for each security pattern
- Create CloudWatch alarms for each metric
- Test alarm triggering with synthetic events
Terraform Resources:
```hcl
# lablink-infrastructure/cloudwatch_alarms.tf

# SNS Topic for Admin Alerts
resource "aws_sns_topic" "admin_alerts" {
  name = "lablink-admin-alerts-${var.resource_suffix}"
  tags = {
    Name        = "lablink-admin-alerts-${var.resource_suffix}"
    Environment = var.resource_suffix
  }
}

# SNS Email Subscription
resource "aws_sns_topic_subscription" "admin_email" {
  count     = try(local.config_file.monitoring.enabled, false) ? 1 : 0
  topic_arn = aws_sns_topic.admin_alerts.arn
  protocol  = "email"
  endpoint  = try(local.config_file.monitoring.alerts.email, "")
}

# Metric Filter: Mass Instance Launches
resource "aws_cloudwatch_log_metric_filter" "run_instances" {
  name           = "lablink-run-instances-${var.resource_suffix}"
  log_group_name = aws_cloudwatch_log_group.cloudtrail_logs.name
  pattern        = <<PATTERN
{ ($.eventName = "RunInstances") && ($.userIdentity.principalId = "*lablink_instance_role*") }
PATTERN
  metric_transformation {
    name      = "RunInstancesCount"
    namespace = "LabLinkSecurity/${var.resource_suffix}"
    value     = "1"
    unit      = "Count"
  }
}

# Alarm: Mass Instance Launches
resource "aws_cloudwatch_metric_alarm" "mass_instance_launch" {
  alarm_name          = "lablink-mass-instance-launch-${var.resource_suffix}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "RunInstancesCount"
  namespace           = "LabLinkSecurity/${var.resource_suffix}"
  period              = 300 # 5 minutes
  statistic           = "Sum"
  threshold           = try(local.config_file.monitoring.thresholds.max_instances_per_5min, 10)
  alarm_description   = "Alert when allocator launches >${try(local.config_file.monitoring.thresholds.max_instances_per_5min, 10)} instances in 5 minutes"
  alarm_actions       = [aws_sns_topic.admin_alerts.arn]
  treat_missing_data  = "notBreaching"
  tags = {
    Name        = "lablink-mass-instance-launch-${var.resource_suffix}"
    Environment = var.resource_suffix
    Severity    = "high"
  }
}

# Metric Filter: Large Instance Types
resource "aws_cloudwatch_log_metric_filter" "large_instances" {
  name           = "lablink-large-instances-${var.resource_suffix}"
  log_group_name = aws_cloudwatch_log_group.cloudtrail_logs.name
  pattern        = <<PATTERN
{ ($.eventName = "RunInstances") && ($.userIdentity.principalId = "*lablink_instance_role*") && (($.requestParameters.instanceType = "p4d.*") || ($.requestParameters.instanceType = "p3.*") || ($.requestParameters.instanceType = "g5.*")) }
PATTERN
  metric_transformation {
    name      = "LargeInstanceLaunched"
    namespace = "LabLinkSecurity/${var.resource_suffix}"
    value     = "1"
    unit      = "Count"
  }
}

# Alarm: Large Instance Types
resource "aws_cloudwatch_metric_alarm" "large_instance_launched" {
  alarm_name          = "lablink-large-instance-launched-${var.resource_suffix}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "LargeInstanceLaunched"
  namespace           = "LabLinkSecurity/${var.resource_suffix}"
  period              = 300
  statistic           = "Sum"
  threshold           = 0 # Alert on ANY large instance
  alarm_description   = "Alert when allocator launches expensive instance types (p4d, p3, g5)"
  alarm_actions       = [aws_sns_topic.admin_alerts.arn]
  treat_missing_data  = "notBreaching"
  tags = {
    Name        = "lablink-large-instance-launched-${var.resource_suffix}"
    Environment = var.resource_suffix
    Severity    = "critical"
  }
}

# Metric Filter: Unauthorized API Calls
resource "aws_cloudwatch_log_metric_filter" "unauthorized_calls" {
  name           = "lablink-unauthorized-calls-${var.resource_suffix}"
  log_group_name = aws_cloudwatch_log_group.cloudtrail_logs.name
  # Parentheses around the OR are required: without them, && binds tighter
  # than ||, and the filter would match ANY AccessDenied event account-wide.
  pattern        = <<PATTERN
{ (($.errorCode = "AccessDenied") || ($.errorCode = "UnauthorizedOperation")) && ($.userIdentity.principalId = "*lablink_instance_role*") }
PATTERN
  metric_transformation {
    name      = "UnauthorizedAPICalls"
    namespace = "LabLinkSecurity/${var.resource_suffix}"
    value     = "1"
    unit      = "Count"
  }
}

# Alarm: Unauthorized API Calls
resource "aws_cloudwatch_metric_alarm" "unauthorized_calls" {
  alarm_name          = "lablink-unauthorized-calls-${var.resource_suffix}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "UnauthorizedAPICalls"
  namespace           = "LabLinkSecurity/${var.resource_suffix}"
  period              = 900 # 15 minutes
  statistic           = "Sum"
  threshold           = try(local.config_file.monitoring.thresholds.max_unauthorized_calls_per_15min, 5)
  alarm_description   = "Alert when allocator makes unauthorized API calls (possible attack or permission issue)"
  alarm_actions       = [aws_sns_topic.admin_alerts.arn]
  treat_missing_data  = "notBreaching"
  tags = {
    Name        = "lablink-unauthorized-calls-${var.resource_suffix}"
    Environment = var.resource_suffix
    Severity    = "critical"
  }
}

# Additional alarms follow the same pattern for:
# - TerminateInstances
# - IAM role creation
# - Security group modifications
# (omitted for brevity - implement similarly)
```
Phase 3: AWS Budget Alerts (Cost Monitoring)
Priority: High
Estimated Time: 2-3 hours
Files to Modify:
- `lablink-infrastructure/budgets.tf` (new file)
- `lablink-infrastructure/config/config.yaml` (update budget settings)
Tasks:
- Create AWS Budget for monthly cost tracking
- Configure multi-threshold alerts (50%, 80%, 100%, 150%)
- Link to SNS topic for notifications
- Test with simulated cost data
Terraform Resources:
```hcl
# lablink-infrastructure/budgets.tf
resource "aws_budgets_budget" "lablink_monthly" {
  count        = try(local.config_file.monitoring.budget.enabled, false) ? 1 : 0
  name         = "lablink-monthly-budget-${var.resource_suffix}"
  budget_type  = "COST"
  limit_amount = try(local.config_file.monitoring.budget.monthly_limit_usd, "500")
  limit_unit   = "USD"
  # First day of the current month; keeping the literal "-01_00:00" outside
  # formatdate() avoids its format-sequence parsing of digits.
  time_period_start = "${formatdate("YYYY-MM", timestamp())}-01_00:00"
  time_unit         = "MONTHLY"
  cost_filter {
    name = "TagKeyValue"
    # format() sidesteps HCL's "$${" escape sequence: each value must be the
    # literal string user:<TagKey>$<TagValue>.
    values = [
      format("user:Environment$%s", var.resource_suffix),
      format("user:ManagedBy$lablink-allocator-%s", var.resource_suffix)
    ]
  }
  # 50% warning
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 50
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [try(local.config_file.monitoring.alerts.email, "")]
  }
  # 80% urgent
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [try(local.config_file.monitoring.alerts.email, "")]
  }
  # 100% critical
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [try(local.config_file.monitoring.alerts.email, "")]
  }
  # 150% severe overage
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 150
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [try(local.config_file.monitoring.alerts.email, "")]
  }
  tags = {
    Name        = "lablink-monthly-budget-${var.resource_suffix}"
    Environment = var.resource_suffix
  }
}
```
Phase 4: Dashboard (Optional - Nice to Have)
Priority: Low
Estimated Time: 3-4 hours
Files to Modify:
- `lablink-infrastructure/dashboard.tf` (new file)
Tasks:
- Create CloudWatch Dashboard
- Add widgets for key metrics
- Configure auto-refresh
- Document dashboard URL in README
Phase 5: Documentation & Testing
Priority: High
Estimated Time: 2-3 hours
Files to Modify:
- `lablink-infrastructure/README.md` (add monitoring section)
- `lablink-infrastructure/SECURITY.md` (update security best practices)
- `lablink-infrastructure/config/*.example.yaml` (add monitoring config examples)
Tasks:
- Update README with monitoring architecture
- Document how to view CloudTrail logs
- Document how to respond to alerts
- Create runbook for common alert scenarios
- Test end-to-end alert flow
Configuration Schema
Additions to config.yaml:
```yaml
# Infrastructure Monitoring Configuration (optional - defaults shown)
monitoring:
  enabled: true # Master switch - set to false to disable all monitoring
  # Alert notification settings
  alerts:
    email: "[email protected]" # Required if monitoring.enabled = true
    # Future integrations:
    # slack_webhook: ""
    # pagerduty_key: ""
  # CloudTrail configuration
  cloudtrail:
    retention_days: 90 # How long to keep CloudTrail logs in CloudWatch
  # Alert thresholds (optional - defaults shown)
  thresholds:
    max_instances_per_5min: 10 # Alarm if >10 RunInstances in 5 minutes
    max_terminations_per_5min: 20 # Alarm if >20 TerminateInstances in 5 minutes
    max_iam_roles_per_hour: 10 # Alarm if >10 IAM CreateRole in 1 hour
    max_security_group_changes_per_hour: 20 # Alarm if >20 SG changes in 1 hour
    max_unauthorized_calls_per_15min: 5 # Alarm if >5 AccessDenied in 15 minutes
  # AWS Budget alerts
  budget:
    enabled: true
    monthly_limit_usd: 500 # Monthly cost budget
    # Alerts automatically sent at 50%, 80%, 100%, 150% thresholds
```
Example config files to update:
- `config/ci-test.example.yaml` - Low budget ($100/month), strict thresholds
- `config/prod.example.yaml` - Higher budget ($1000/month), relaxed thresholds
- `config/dev.example.yaml` - Monitoring disabled (optional)
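Since every threshold is optional, tooling that reads `config.yaml` needs the same fall-back-to-default behavior as the Terraform `try(...)` calls. A minimal sketch of that merge (function name is ours, not an existing project API):

```python
# Defaults mirror the documented threshold defaults above.
DEFAULT_THRESHOLDS = {
    "max_instances_per_5min": 10,
    "max_terminations_per_5min": 20,
    "max_iam_roles_per_hour": 10,
    "max_security_group_changes_per_hour": 20,
    "max_unauthorized_calls_per_15min": 5,
}

def resolve_thresholds(config: dict) -> dict:
    """Overlay user-provided monitoring.thresholds on the defaults."""
    user = config.get("monitoring", {}).get("thresholds", {})
    return {**DEFAULT_THRESHOLDS, **user}

# Only one threshold overridden; the rest fall back to defaults:
cfg = {"monitoring": {"thresholds": {"max_instances_per_5min": 25}}}
print(resolve_thresholds(cfg)["max_instances_per_5min"])  # → 25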
Migration Path from Issue #225
Timeline Coordination:
1. Now (Immediate): Implement this infrastructure monitoring
   - Independent of issue #225
   - Provides security/cost protection for PR #19 (Add Policy Attached to the Allocator EC2 Instance)
2. After Issue #225 Lands: Deprecate client VM CloudWatch logging
   - Remove `aws_cloudwatch_log_group.client_vm_logs`
   - Remove `aws_lambda_function.log_processor`
   - Remove `aws_cloudwatch_log_subscription_filter.lambda_subscription`
   - Keep CloudTrail CloudWatch integration (different log group)

No Conflicts:
- This issue uses `aws_cloudwatch_log_group.cloudtrail_logs` (new)
- Issue #225 deprecates `aws_cloudwatch_log_group.client_vm_logs` (existing)
- Different log groups, different purposes
Testing Plan
Unit Tests
- Validate Terraform syntax: `terraform validate`
- Verify config schema: `lablink-validate-config`
- Check for resource naming conflicts
Integration Tests (ci-test environment)
- Deploy monitoring infrastructure
- Verify CloudTrail logging:
- Launch test VM → Check S3 for CloudTrail logs
- Terminate test VM → Check CloudWatch Logs
- Trigger alarms:
- Launch 11 instances rapidly → Verify SNS email sent
- Attempt unauthorized API call → Verify alarm triggers
- Verify budget alerts:
- Simulate cost threshold breach (may require manual AWS Budgets testing)
- Check dashboard (if implemented)
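The "launch 11 instances rapidly" test fires the alarm because the alarm sums `RunInstancesCount` over a 5-minute period and compares against >10. An illustrative simulation of that evaluation (simplified to a rolling window; CloudWatch actually evaluates fixed aligned periods):

```python
def alarm_fires(event_timestamps, period_s=300, threshold=10) -> bool:
    """True if any `period_s`-second window holds more than `threshold`
    events - a rolling-window approximation of Sum-over-period > threshold."""
    ts = sorted(event_timestamps)
    for i, start in enumerate(ts):
        in_window = [t for t in ts[i:] if t - start < period_s]
        if len(in_window) > threshold:
            return True
    return False

# 11 launches within one minute exceeds >10 per 5 minutes:
print(alarm_fires([i * 5 for i in range(11)]))  # → True
```

The same check returns False for 10 launches spread evenly over 50 minutes, which is why the integration test must launch the instances rapidly.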
Acceptance Criteria
- CloudTrail logs appear in S3 within 15 minutes of API call
- CloudTrail logs appear in CloudWatch within 5 minutes
- Metric filter increments when matching event occurs
- Alarm triggers when threshold exceeded
- SNS email delivered to configured address
- Budget alert sent at 50% threshold
- Documentation updated with monitoring architecture
- Example configs include monitoring section
Cost Estimate
Monthly Costs (ci-test environment, ~50 VMs/day):
| Service | Usage | Cost |
|---|---|---|
| CloudTrail | ~150K events/month | $1.50 |
| S3 Storage | ~2GB/month | $0.05 |
| CloudWatch Logs Ingestion | ~1GB/month | $0.50 |
| CloudWatch Logs Storage | ~1GB/month | $0.03 |
| CloudWatch Alarms | 7 alarms | $0.70 |
| SNS Notifications | ~20 emails/month | $0.00 |
| AWS Budgets | 1 budget | $0.00 (first 2 free) |
| **Total** | | **~$2.78/month** |
Production environment (10x usage): ~$15-20/month
ROI: One prevented p4d.24xlarge instance running for 1 hour ($32) pays for 11 months of monitoring.
Security Considerations
Principle of Least Privilege:
- CloudTrail role can only write to specific CloudWatch log group
- SNS topic policy restricts who can publish
- S3 bucket encrypted at rest (AES256)
- CloudTrail log file validation prevents tampering
Alert Fatigue Mitigation:
- Conservative thresholds by default (tunable in config.yaml)
- `treat_missing_data = "notBreaching"` prevents false alarms
- Severity tags on alarms for prioritization
False Positive Scenarios:
- Legitimate batch operations (e.g., spinning up 20 VMs for class)
- Solution: Temporarily increase thresholds in config.yaml
- Development testing triggering alarms
- Solution: Disable monitoring in dev environment
Future Enhancements (Not in Scope)
- Slack/PagerDuty integration for alerts
- Automated remediation (Lambda to terminate unauthorized instances)
- Cost anomaly detection (ML-based)
- Integration with AWS Security Hub
- Custom CloudWatch dashboard with cost projections
- Log analytics queries for trend analysis
- Integration with allocator web UI (show alerts in admin panel)
Related Issues & PRs
- PR #19 (Add Policy Attached to the Allocator EC2 Instance): Adds EC2/IAM permissions to allocator (triggers need for this monitoring)
- Issue #14 (Add EC2 permissions to allocator instance role to eliminate need for user AWS credentials): Original request for allocator autonomy
- Issue #225: Self-hosted logging system (complementary, not conflicting)
Acceptance Checklist
Before closing this issue:
- CloudTrail configured and logging to S3 + CloudWatch
- All 7 metric filters created
- All 7 CloudWatch alarms created and tested
- SNS topic created with email subscription
- AWS Budget configured with multi-threshold alerts
- Configuration schema added to config.yaml
- All example config files updated
- README.md updated with monitoring section
- SECURITY.md updated with monitoring best practices
- End-to-end testing completed in ci-test
- Cost estimate validated against actual usage
- Runbook created for responding to alerts