| 1 | +# AWS Cleanup |
| 2 | + |
| 3 | +Automated cleanup of orphaned AWS resources left behind by CI jobs (OpenShift Prow). CI jobs create infrastructure (VPCs, EC2 instances, load balancers, etc.) tagged with a `ci-op-` prefix. When jobs fail or time out, these resources remain and accumulate cost. This tooling finds and deletes them. |
| 4 | + |
| 5 | +## Components |
| 6 | + |
| 7 | +| File | Description | |
| 8 | +| --- | --- | |
| 9 | +| `aws_delete.py` | Main cleanup script. Runs standalone (CLI) or as an AWS Lambda function. | |
| 10 | +| `tf/main.tf` | Terraform configuration that provisions the Lambda, IAM roles, S3 bucket, and EventBridge schedule. | |
| 11 | + |
| 12 | +## Prerequisites |
| 13 | + |
| 14 | +- Python 3.10+ |
| 15 | +- `boto3` (`pip install boto3`) |
| 16 | +- An AWS CLI profile (default: `telco-ci`) with sufficient permissions |
| 17 | + |
| 18 | +## CLI Usage |
| 19 | + |
| 20 | +```bash |
| 21 | +# Dry run across all regions (us-east-1, us-east-2, us-west-1, us-west-2) |
| 22 | +python aws_cleanup/aws_delete.py --dry-run |
| 23 | + |
| 24 | +# Real deletion in a specific region |
| 25 | +python aws_cleanup/aws_delete.py --profile telco-ci --region us-east-2 |
| 26 | + |
| 27 | +# Custom tag prefix (default: ci-op-) |
| 28 | +python aws_cleanup/aws_delete.py --tag ci-op- --profile telco-ci |
| 29 | + |
| 30 | +# With email report |
| 31 | +python aws_cleanup/aws_delete.py --profile telco-ci --send-email |
| 32 | +python aws_cleanup/aws_delete.py --profile telco-ci --send-email --to someone@redhat.com |
| 33 | +``` |
| 34 | + |
| 35 | +### CLI Options |
| 36 | + |
| 37 | +| Flag | Default | Description | |
| 38 | +|---|---|---| |
| 39 | +| `--tag` | `ci-op-` | Tag prefix that identifies CI-created resources | |
| 40 | +| `--profile` | `telco-ci` | AWS CLI profile name | |
| 41 | +| `--region` | all 4 US regions | Limit to a single region (`us-east-1`, `us-east-2`, `us-west-1`, `us-west-2`) | |
| 42 | +| `--dry-run` | off | Print what would be deleted without making changes | |
| 43 | +| `--send-email` | off | Send a cost-savings summary email after cleanup | |
| 44 | +| `--to` | `sshnaidm@redhat.com` | Email recipient (used with `--send-email`) | |
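
The option table above maps naturally onto `argparse`. The following is a minimal sketch of such a parser, not the actual code in `aws_delete.py`; the names `build_parser`, `regions_to_process`, and `REGIONS` are hypothetical and chosen only for illustration.

```python
import argparse

# The four US regions listed in the table above.
REGIONS = ["us-east-1", "us-east-2", "us-west-1", "us-west-2"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Delete orphaned CI resources in AWS")
    parser.add_argument("--tag", default="ci-op-",
                        help="tag prefix that identifies CI-created resources")
    parser.add_argument("--profile", default="telco-ci",
                        help="AWS CLI profile name")
    parser.add_argument("--region", choices=REGIONS, default=None,
                        help="limit cleanup to a single region")
    parser.add_argument("--dry-run", action="store_true",
                        help="print what would be deleted without making changes")
    parser.add_argument("--send-email", action="store_true",
                        help="send a cost-savings summary email after cleanup")
    parser.add_argument("--to", default="sshnaidm@redhat.com",
                        help="email recipient (used with --send-email)")
    return parser


def regions_to_process(args: argparse.Namespace) -> list[str]:
    """Return the single requested region, or all four when none was given."""
    return [args.region] if args.region else list(REGIONS)
```

With this sketch, `build_parser().parse_args(["--dry-run"])` yields `dry_run=True` and leaves every other option at its documented default.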
| 45 | + |
| 46 | +## How It Works |
| 47 | + |
| 48 | +### Expiration Logic |
| 49 | + |
| 50 | +A resource is eligible for deletion if any of these apply: |
| 51 | + |
| 52 | +1. **`expirationDate` tag** exists and is more than 6 hours past due (format: `YYYY-MM-DDTHH:MM+00:00`). |
| 53 | +2. **`CreateDate`/`CreateTime`** is older than 6 hours AND the resource is tagged with the `ci-op-` prefix (via Name tag, `kubernetes.io/cluster/` tag, UserName, RoleName, or InstanceProfileName). |
| 54 | +3. **Unattached Elastic IPs** with no associated instance or network interface. |
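
The three rules above can be sketched as a pair of small predicates. This is an illustration under stated assumptions rather than the script's real implementation: `is_eligible`, `is_unattached_eip`, and the `ci_names` argument are hypothetical, and the caller is assumed to have already collected the identifying strings (Name tag, cluster tag, UserName, and so on).

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(hours=6)
# Matches the documented format, e.g. 2024-05-01T12:30+00:00
EXPIRATION_FORMAT = "%Y-%m-%dT%H:%M%z"


def is_eligible(tags, created, ci_names, now, prefix="ci-op-"):
    """Apply rules 1 and 2.

    tags     -- dict of tag key -> value for the resource
    created  -- timezone-aware CreateDate/CreateTime, or None if unknown
    ci_names -- identifying strings gathered by the caller (assumption)
    now      -- timezone-aware current time
    """
    # Rule 1: expirationDate tag more than 6 hours past due.
    expiration = tags.get("expirationDate")
    if expiration:
        try:
            expires_at = datetime.strptime(expiration, EXPIRATION_FORMAT)
        except ValueError:
            expires_at = None  # unparseable tag: fall through to rule 2
        if expires_at is not None and now - expires_at > GRACE:
            return True
    # Rule 2: older than 6 hours AND identified by the CI prefix.
    is_ci = any(name.startswith(prefix) for name in ci_names if name)
    return created is not None and now - created > GRACE and is_ci


def is_unattached_eip(address):
    """Rule 3: an address entry with no instance or network interface."""
    return not address.get("InstanceId") and not address.get("NetworkInterfaceId")
```

Passing `now` in explicitly (for example `datetime.now(timezone.utc)`) keeps the check deterministic and easy to test.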
| 55 | + |
| 56 | +### Deletion Order |
| 57 | + |
| 58 | +VPC sub-resources are deleted in dependency order: |
| 59 | + |
| 60 | +1. Load Balancers (Classic and v2) |
| 61 | +2. EC2 Instances |
| 62 | +3. NAT Gateways |
| 63 | +4. Elastic IPs (tagged + unattached) |
| 64 | +5. Internet Gateways (detach, then delete) |
| 65 | +6. Network Interfaces (skips `in-use`) |
| 66 | +7. Route Tables |
| 67 | +8. Security Groups (revokes all rules if blocked by dependencies) |
| 68 | +9. Subnets |
| 69 | +10. VPC Endpoints |
| 70 | +11. VPC |
| 71 | + |
| 72 | +After VPC cleanup, associated S3 buckets and EBS volumes are deleted by tag. Finally, `AWSExpiredResources.eliminate()` sweeps all resource types globally (EC2, LB, NAT, IGW, VPC endpoints, target groups, ENIs, route tables, security groups, subnets, DHCP options, VPCs, EIPs, volumes, S3, IAM users/roles/instance profiles). |
| 73 | + |
| 74 | +Each VPC deletion cycle retries up to 10 times (with 60-second waits) until all sub-resources are removed. |
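
The ordering and retry behaviour described above can be sketched as follows. The step names, `run_cycle`, and `delete_with_retries` are hypothetical; in particular, this sketch assumes each step reports how many resources it could not remove and that no wait follows the final attempt, neither of which is confirmed by the source.

```python
import time

# Dependency order from the list above (step names are illustrative).
DELETION_ORDER = [
    "load_balancers", "instances", "nat_gateways", "elastic_ips",
    "internet_gateways", "network_interfaces", "route_tables",
    "security_groups", "subnets", "vpc_endpoints", "vpc",
]


def run_cycle(deleters):
    """Call each step in dependency order; return how many resources remain.

    deleters maps a step name to a callable returning the number of
    resources it failed to delete (an assumption of this sketch).
    """
    return sum(deleters[step]() for step in DELETION_ORDER if step in deleters)


def delete_with_retries(cycle, max_attempts=10, wait_seconds=60, sleep=time.sleep):
    """Repeat a deletion cycle until nothing remains or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        if cycle() == 0:
            return True
        if attempt < max_attempts:
            sleep(wait_seconds)  # give AWS time to release dependencies
    return False
```

Injecting `sleep` makes the loop testable without real 60-second pauses.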
| 75 | + |
| 76 | +### Cost Estimation |
| 77 | + |
| 78 | +The `Price` class estimates hourly cost savings for each deleted resource using the AWS Pricing API (us-east-1). Prices are cached per resource type. Fallback defaults: |
| 79 | + |
| 80 | +| Resource | Default Price | |
| 81 | +|---|---| |
| 82 | +| EC2 instance | `$0.17/hr` (looked up by instance type) | |
| 83 | +| Classic LB | `$0.025/hr` | |
| 84 | +| NLB/ALB | `$0.0225/hr` | |
| 85 | +| NAT Gateway | `$0.045/hr` | |
| 86 | +| Elastic IP | `$0.005/hr` | |
| 87 | +| EBS Volume | price-per-GB-month / 720 (720 = hours in a 30-day month) | |
| 88 | +| S3 Bucket | price-per-GB-month / 720 (calculates actual size) | |
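
A rough sketch of how the caching and fallback behaviour might look is shown below. It is not the actual `Price` class: `PriceCache`, `storage_hourly_cost`, and the `lookup` callable are hypothetical, and multiplying the per-GB rate by the resource size is an assumption about how the storage formula is applied.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours

# Fallback hourly prices in USD, taken from the table above.
FALLBACK_HOURLY_USD = {
    "ec2_instance": 0.17,
    "classic_lb": 0.025,
    "nlb_alb": 0.0225,
    "nat_gateway": 0.045,
    "elastic_ip": 0.005,
}


def storage_hourly_cost(size_gb, price_per_gb_month):
    """Convert a per-GB-month price into an hourly cost for size_gb gigabytes."""
    return size_gb * price_per_gb_month / HOURS_PER_MONTH


class PriceCache:
    """Cache one hourly price per resource type, with fallback defaults."""

    def __init__(self, lookup):
        # lookup(resource_type) stands in for a Pricing API query and
        # returns a float, or None when no price was found (assumption).
        self._lookup = lookup
        self._cache = {}

    def hourly(self, resource_type):
        if resource_type not in self._cache:
            try:
                price = self._lookup(resource_type)
            except Exception:
                price = None  # treat any lookup failure as "no price"
            if price is None:
                price = FALLBACK_HOURLY_USD[resource_type]
            self._cache[resource_type] = price
        return self._cache[resource_type]
```

Each resource type triggers at most one lookup; later calls are served from the cache.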
| 89 | + |
| 90 | +## AWS Lambda |
| 91 | + |
| 92 | +The script doubles as a Lambda function via the `lambda_handler` entry point. The Lambda is triggered weekly by an EventBridge rule and writes its report to an S3 bucket. |
| 93 | + |
| 94 | +### Lambda Event Payload |
| 95 | + |
| 96 | +```json |
| 97 | +{ |
| 98 | + "tag": "ci-op-", |
| 99 | + "dry_run": false, |
| 100 | + "report_bucket": "telco-ci-cleanup-reports", |
| 101 | + "region": null |
| 102 | +} |
| 103 | +``` |
| 104 | + |
| 105 | +All fields are optional and fall back to the defaults shown above. When `region` is null, all four US regions are processed. |
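
The fallback behaviour can be illustrated with a small helper. `parse_event` is a hypothetical name, not a function confirmed to exist in `aws_delete.py`; the defaults are the ones shown in the payload above.

```python
DEFAULT_REGIONS = ["us-east-1", "us-east-2", "us-west-1", "us-west-2"]

# Defaults as documented in the event payload above.
DEFAULTS = {
    "tag": "ci-op-",
    "dry_run": False,
    "report_bucket": "telco-ci-cleanup-reports",
    "region": None,
}


def parse_event(event):
    """Merge a Lambda event with the defaults and expand `region`."""
    event = event or {}
    settings = {key: event.get(key, default) for key, default in DEFAULTS.items()}
    region = settings.pop("region")
    # A null or missing region means all four US regions are processed.
    settings["regions"] = [region] if region else list(DEFAULT_REGIONS)
    return settings
```

For example, `parse_event({"region": "us-west-2", "dry_run": True})` restricts the run to one region in dry-run mode and keeps the default tag and bucket.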
| 106 | + |
| 107 | +### Reports |
| 108 | + |
| 109 | +Lambda writes reports to `s3://telco-ci-cleanup-reports/reports/YYYY-MM-DD.txt`. Reports are automatically expired after 90 days via an S3 lifecycle rule. |
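
As a sketch of the naming scheme, the object key can be derived from the run date. `report_key` and `report_uri` are hypothetical helpers, and which clock (UTC or local) supplies the date is an assumption left to the caller.

```python
from datetime import date


def report_key(day: date) -> str:
    """Build the object key under the reports/ prefix for a given day."""
    return f"reports/{day.isoformat()}.txt"


def report_uri(bucket: str, day: date) -> str:
    """Full s3:// URI of the report for a given day."""
    return f"s3://{bucket}/{report_key(day)}"


# An upload could then use boto3's S3 client, for example:
#   s3.put_object(Bucket=bucket, Key=report_key(day), Body=report_text)
```

Because the key contains only the date, a second run on the same day would overwrite the earlier report under this scheme.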
| 110 | + |
| 111 | +## Terraform (`tf/`) |
| 112 | + |
| 113 | +The `tf/` directory contains the Terraform configuration that provisions the AWS infrastructure for the Lambda-based cleanup. |
| 114 | + |
| 115 | +### Resources Created |
| 116 | + |
| 117 | +- **IAM user** (`telco-ci-cleanup`) with policies for Lambda deployment, CloudWatch Logs, and S3 report bucket access |
| 118 | +- **IAM role** (`telco-ci-cleanup-lambda-role`) with policies for EC2, ELB, S3, IAM, and Pricing API access |
| 119 | +- **Lambda function** (`telco-ci-aws-cleanup`) running Python 3.13, 256 MB memory, 15-minute timeout |
| 120 | +- **EventBridge rule** triggering the Lambda every Monday at 10:00 AM UTC |
| 121 | +- **S3 bucket** (`telco-ci-cleanup-reports`) with a 90-day lifecycle policy on `reports/` |
| 122 | + |
| 123 | +### Terraform Variables |
| 124 | + |
| 125 | +| Variable | Default | Description | |
| 126 | +|---|---|---| |
| 127 | +| `aws_region` | `us-east-1` | Region for Terraform provider | |
| 128 | +| `aws_profile` | `telco-ci` | AWS CLI profile | |
| 129 | +| `user_name` | `telco-ci-cleanup` | IAM user name | |
| 130 | +| `lambda_role_name` | `telco-ci-cleanup-lambda-role` | Lambda execution role name | |
| 131 | +| `schedule_expression` | `cron(0 10 ? * MON *)` | EventBridge schedule (Monday 10:00 UTC) | |
| 132 | +| `report_bucket_name` | `telco-ci-cleanup-reports` | S3 bucket for reports | |
| 133 | + |
| 134 | +### Usage |
| 135 | + |
| 136 | +```bash |
| 137 | +cd aws_cleanup/tf |
| 138 | +terraform init |
| 139 | +terraform plan -out=tfplan |
| 140 | +terraform apply tfplan |
| 141 | +``` |
| 142 | + |
| 143 | +### Outputs |
| 144 | + |
| 145 | +| Output | Description | |
| 146 | +|---|---| |
| 147 | +| `user_name` | IAM user name | |
| 148 | +| `user_access_key_id` | Access key ID for the IAM user | |
| 149 | +| `user_secret_access_key` | Secret access key (sensitive) | |
| 150 | +| `lambda_role_arn` | ARN of the Lambda execution role | |
| 151 | +| `lambda_function_name` | Name of the Lambda function | |
| 152 | +| `report_bucket` | S3 bucket name for reports | |
| 153 | + |
| 154 | +## CI / CD |
| 155 | + |
| 156 | +Two GitHub Actions workflows: |
| 157 | + |
| 158 | +- **aws-cleanup-check.yml** -- Runs on pushes to `master` and on PRs touching `aws_cleanup/**`. Runs a syntax check (`py_compile`), linters (`pyflakes`, `flake8`), an import check, and a CLI help check. |
| 159 | +- **aws-cleanup-deploy.yml** -- Runs on pushes to `master` when `aws_cleanup/aws_delete.py` changes. Packages and deploys the updated code to the Lambda function. Requires `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` repository secrets. |