Commit 657a115

Add readme for AWS cleanup
1 parent 5ca68ed commit 657a115

1 file changed

Lines changed: 159 additions & 0 deletions

aws_cleanup/README.md

@@ -0,0 +1,159 @@
# AWS Cleanup

Automated cleanup of orphaned AWS resources left behind by CI jobs (OpenShift Prow). CI jobs create infrastructure (VPCs, EC2 instances, load balancers, etc.) tagged with a `ci-op-` prefix. When jobs fail or time out, these resources remain and accumulate cost. This tooling finds and deletes them.

## Components

| File | Description |
| --- | --- |
| `aws_delete.py` | Main cleanup script. Runs standalone (CLI) or as an AWS Lambda function. |
| `tf/main.tf` | Terraform configuration that provisions the Lambda, IAM roles, S3 bucket, and EventBridge schedule. |

## Prerequisites

- Python 3.10+
- `boto3` (`pip install boto3`)
- An AWS CLI profile (default: `telco-ci`) with sufficient permissions

## CLI Usage

```bash
# Dry run across all regions (us-east-1, us-east-2, us-west-1, us-west-2)
python aws_cleanup/aws_delete.py --dry-run

# Real deletion in a specific region
python aws_cleanup/aws_delete.py --profile telco-ci --region us-east-2

# Custom tag prefix (default: ci-op-)
python aws_cleanup/aws_delete.py --tag ci-op- --profile telco-ci

# With email report
python aws_cleanup/aws_delete.py --profile telco-ci --send-email
python aws_cleanup/aws_delete.py --profile telco-ci --send-email --to someone@redhat.com
```

### CLI Options

| Flag | Default | Description |
|---|---|---|
| `--tag` | `ci-op-` | Tag prefix that identifies CI-created resources |
| `--profile` | `telco-ci` | AWS CLI profile name |
| `--region` | all four US regions | Limit to a single region (`us-east-1`, `us-east-2`, `us-west-1`, `us-west-2`) |
| `--dry-run` | off | Print what would be deleted without making changes |
| `--send-email` | off | Send a cost-savings summary email after cleanup |
| `--to` | `sshnaidm@redhat.com` | Email recipient (used with `--send-email`) |

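The options table maps naturally onto an `argparse` parser. The following is a minimal sketch of how such a parser could look, not the parser in `aws_delete.py`; the names `build_parser`, `regions_to_process`, and `REGIONS` are hypothetical, and only the flags and defaults come from the table.

```python
import argparse

# The four regions listed in the options table.
REGIONS = ["us-east-1", "us-east-2", "us-west-1", "us-west-2"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Delete orphaned CI resources in AWS")
    parser.add_argument("--tag", default="ci-op-",
                        help="Tag prefix that identifies CI-created resources")
    parser.add_argument("--profile", default="telco-ci", help="AWS CLI profile name")
    parser.add_argument("--region", choices=REGIONS, default=None,
                        help="Limit to a single region (default: all four)")
    parser.add_argument("--dry-run", action="store_true",
                        help="Print what would be deleted without making changes")
    parser.add_argument("--send-email", action="store_true",
                        help="Send a cost-savings summary email after cleanup")
    parser.add_argument("--to", default="sshnaidm@redhat.com",
                        help="Email recipient (used with --send-email)")
    return parser


def regions_to_process(args: argparse.Namespace) -> list[str]:
    """Return the single requested region, or all four when none is given."""
    return [args.region] if args.region else list(REGIONS)


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(regions_to_process(args), "dry run" if args.dry_run else "real run")
```

Using `action="store_true"` keeps `--dry-run` and `--send-email` off unless they are passed explicitly, which matches the "off" defaults in the table.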
## How It Works

### Expiration Logic

A resource is eligible for deletion if any of the following applies:

1. Its **`expirationDate` tag** exists and is more than 6 hours in the past (format: `YYYY-MM-DDTHH:MM+00:00`).
2. Its **`CreateDate`/`CreateTime`** is more than 6 hours old and the resource carries the `ci-op-` prefix (in its `Name` tag, a `kubernetes.io/cluster/` tag, or its `UserName`, `RoleName`, or `InstanceProfileName`).
3. It is an **unattached Elastic IP** with no associated instance or network interface.

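The first two rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in `aws_delete.py`: the function names are hypothetical, tags are assumed to be already flattened into a plain `dict`, and the `kubernetes.io/cluster/` check assumes the `ci-op-` prefix appears in the tag key right after that path.

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(hours=6)   # grace period from the rules above
TAG_PREFIX = "ci-op-"        # default CI tag prefix


def expired_by_tag(tags: dict, now: datetime) -> bool:
    """Rule 1: expirationDate tag exists and is more than 6 hours in the past."""
    raw = tags.get("expirationDate")
    if not raw:
        return False
    try:
        # Documented format: YYYY-MM-DDTHH:MM+00:00
        expires = datetime.strptime(raw, "%Y-%m-%dT%H:%M%z")
    except ValueError:
        return False  # unparseable value: do not delete based on this rule
    return now - expires > GRACE


def has_ci_marker(tags: dict, names: tuple = ()) -> bool:
    """True if the Name tag, a cluster tag key, or an IAM name has the CI prefix."""
    if tags.get("Name", "").startswith(TAG_PREFIX):
        return True
    # Assumption: the cluster name follows the path in the tag key.
    cluster_prefix = "kubernetes.io/cluster/" + TAG_PREFIX
    if any(key.startswith(cluster_prefix) for key in tags):
        return True
    # names would hold UserName, RoleName, or InstanceProfileName values.
    return any(name.startswith(TAG_PREFIX) for name in names)


def expired_by_age(created: datetime, tags: dict, names: tuple, now: datetime) -> bool:
    """Rule 2: older than 6 hours and marked with the CI prefix."""
    return now - created > GRACE and has_ci_marker(tags, names)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    old = now - timedelta(hours=7)
    print(expired_by_age(old, {"Name": "ci-op-example"}, (), now))
```

Passing `now` in as a parameter, rather than calling `datetime.now()` inside each function, keeps the checks deterministic and easy to test.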
### Deletion Order

VPC sub-resources are deleted in dependency order:

1. Load Balancers (Classic and v2)
2. EC2 Instances
3. NAT Gateways
4. Elastic IPs (tagged + unattached)
5. Internet Gateways (detach, then delete)
6. Network Interfaces (skips `in-use`)
7. Route Tables
8. Security Groups (revokes all rules if blocked by dependencies)
9. Subnets
10. VPC Endpoints
11. VPC

After VPC cleanup, associated S3 buckets and EBS volumes are deleted by tag. Finally, `AWSExpiredResources.eliminate()` sweeps all resource types globally (EC2, LB, NAT, IGW, VPC endpoints, target groups, ENIs, route tables, security groups, subnets, DHCP options, VPCs, EIPs, volumes, S3, IAM users/roles/instance profiles).

Each VPC deletion cycle retries up to 10 times (with 60-second waits) until all sub-resources are removed.

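The retry behaviour can be pictured as a loop that runs the ordered deletion steps and stops once nothing is left. The sketch below assumes each step is a callable that deletes what it can and returns the number of resources still remaining; that contract, and the name `delete_vpc_with_retries`, are hypothetical and only the limits (10 attempts, 60-second waits) come from the text above.

```python
import time
from typing import Callable, Sequence


def delete_vpc_with_retries(
    steps: Sequence[Callable[[], int]],
    max_attempts: int = 10,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run the ordered deletion steps until none reports leftover resources.

    Each step deletes what it can and returns how many resources remain
    (hypothetical contract). Returns True once a full pass leaves nothing.
    """
    for attempt in range(1, max_attempts + 1):
        # Steps run in the listed order: load balancers first, the VPC last.
        remaining = sum(step() for step in steps)
        if remaining == 0:
            return True
        if attempt < max_attempts:
            sleep(wait_seconds)  # give AWS time to release dependencies
    return False


if __name__ == "__main__":
    # Fake step that needs two passes before everything is gone.
    state = {"left": 2}

    def fake_step() -> int:
        state["left"] = max(0, state["left"] - 1)
        return state["left"]

    print(delete_vpc_with_retries([fake_step], wait_seconds=0))
```

Retrying the whole ordered pass, rather than a single step, fits the dependency problem: a subnet may only become deletable after a NAT gateway from an earlier step has finished shutting down.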
### Cost Estimation

The `Price` class estimates hourly cost savings for each deleted resource using the AWS Pricing API (us-east-1). Prices are cached per resource type. The fallback defaults are:

| Resource | Default price or formula |
|---|---|
| EC2 instance | `$0.17/hr` (normally looked up by instance type) |
| Classic LB | `$0.025/hr` |
| NLB/ALB | `$0.0225/hr` |
| NAT Gateway | `$0.045/hr` |
| Elastic IP | `$0.005/hr` |
| EBS Volume | price-per-GB-month / 720 |
| S3 Bucket | price-per-GB-month / 720 (based on the bucket's actual size) |

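As a worked example of the arithmetic, the sketch below converts a per-GB-month storage price into an hourly figure and sums fallback prices for a batch of deleted resources. The divisor 720 is presumably 30 days × 24 hours. The function names, the dictionary keys, and the `0.08` sample price are illustrative assumptions, and multiplying the per-GB rate by the size in GB is an assumption about how the formula is applied; the real `Price` class may be structured differently.

```python
HOURS_PER_MONTH = 720  # 30 days * 24 hours, the divisor used in the table

# Fallback hourly prices in USD, copied from the table above.
FALLBACK_HOURLY_USD = {
    "ec2": 0.17,
    "classic_lb": 0.025,
    "lb_v2": 0.0225,
    "nat_gateway": 0.045,
    "elastic_ip": 0.005,
}


def storage_hourly_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Hourly cost of stored data: size * (monthly per-GB price / 720)."""
    return size_gb * price_per_gb_month / HOURS_PER_MONTH


def hourly_savings(deleted_counts: dict[str, int]) -> float:
    """Sum fallback hourly prices for the given resource counts."""
    return sum(FALLBACK_HOURLY_USD[kind] * count
               for kind, count in deleted_counts.items())


if __name__ == "__main__":
    # 100 GB volume at an illustrative 0.08 USD per GB-month.
    print(f"{storage_hourly_cost(100, 0.08):.4f} USD/hr")
    print(f"{hourly_savings({'nat_gateway': 2, 'elastic_ip': 1}):.3f} USD/hr")
```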
## AWS Lambda

The script doubles as a Lambda function via the `lambda_handler` entry point. The Lambda is triggered weekly by an EventBridge rule and writes its report to an S3 bucket.

### Lambda Event Payload

```json
{
  "tag": "ci-op-",
  "dry_run": false,
  "report_bucket": "telco-ci-cleanup-reports",
  "region": null
}
```

All fields are optional and fall back to the defaults shown above. When `region` is null, all four US regions are processed.

### Reports

Lambda writes reports to `s3://telco-ci-cleanup-reports/reports/YYYY-MM-DD.txt`. Reports are automatically expired after 90 days via an S3 lifecycle rule.

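For reference, the report path follows directly from the run date. The helper names below are hypothetical; the sketch only builds the object key and URI and leaves the actual upload out.

```python
from datetime import date

REPORT_BUCKET = "telco-ci-cleanup-reports"  # default bucket name from this README


def report_key(day: date) -> str:
    """Object key in the form reports/YYYY-MM-DD.txt."""
    return f"reports/{day.isoformat()}.txt"


def report_uri(day: date, bucket: str = REPORT_BUCKET) -> str:
    """Full S3 URI for the report written on the given day."""
    return f"s3://{bucket}/{report_key(day)}"


if __name__ == "__main__":
    print(report_uri(date(2025, 1, 6)))
    # prints: s3://telco-ci-cleanup-reports/reports/2025-01-06.txt
```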
## Terraform (`tf/`)

The `tf/` directory contains a Terraform configuration that provisions all the AWS infrastructure for the Lambda-based cleanup.

### Resources Created

- **IAM user** (`telco-ci-cleanup`) with policies for Lambda deployment, CloudWatch Logs, and S3 report bucket access
- **IAM role** (`telco-ci-cleanup-lambda-role`) with policies for EC2, ELB, S3, IAM, and Pricing API access
- **Lambda function** (`telco-ci-aws-cleanup`) running Python 3.13, 256 MB memory, 15-minute timeout
- **EventBridge rule** triggering the Lambda every Monday at 10:00 AM UTC
- **S3 bucket** (`telco-ci-cleanup-reports`) with a 90-day lifecycle policy on `reports/`

### Terraform Variables

| Variable | Default | Description |
|---|---|---|
| `aws_region` | `us-east-1` | Region for Terraform provider |
| `aws_profile` | `telco-ci` | AWS CLI profile |
| `user_name` | `telco-ci-cleanup` | IAM user name |
| `lambda_role_name` | `telco-ci-cleanup-lambda-role` | Lambda execution role name |
| `schedule_expression` | `cron(0 10 ? * MON *)` | EventBridge schedule (Monday 10:00 UTC) |
| `report_bucket_name` | `telco-ci-cleanup-reports` | S3 bucket for reports |

### Usage

```bash
cd aws_cleanup/tf
terraform init
terraform plan -out=tfplan
terraform apply tfplan
```

### Outputs

| Output | Description |
|---|---|
| `user_name` | IAM user name |
| `user_access_key_id` | Access key ID for the IAM user |
| `user_secret_access_key` | Secret access key (sensitive) |
| `lambda_role_arn` | ARN of the Lambda execution role |
| `lambda_function_name` | Name of the Lambda function |
| `report_bucket` | S3 bucket name for reports |

## CI / CD

Two GitHub Actions workflows:

- **aws-cleanup-check.yml** -- Runs on pushes to `master` and PRs touching `aws_cleanup/**`. Checks syntax (`py_compile`), lint (`pyflakes`, `flake8`), import verification, and CLI help.
- **aws-cleanup-deploy.yml** -- Runs on pushes to `master` when `aws_cleanup/aws_delete.py` changes. Packages and deploys the updated code to the Lambda function. Requires `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` repository secrets.
