|
| 1 | +# Deletion protection in template-only CI |
| 2 | + |
| 3 | +Template-only CI (`.github/workflows/template-only-ci-infra.yml`) provisions real AWS resources, runs Terratest against them, and tears them down — all within a single workflow run. This document explains how deletion protection works in that context and what to do when adding new deletion-protected resources. |
| 4 | + |
| 5 | +Deletion protection matters in two different situations: |
| 6 | + |
| 7 | +- **Temporary/PR environments** — use non-default Terraform workspaces, so `is_temporary` is automatically `true` and deletion protection is disabled. This is documented separately. |
| 8 | +- **Template-only CI** (this document) — runs in the default workspace, so `is_temporary` is `false` and the destroy scripts must explicitly override deletion protection before tearing down resources. |
| 9 | + |
| 10 | +## How template-only CI creates and destroys resources |
| 11 | + |
| 12 | +Each CI run follows this lifecycle: |
| 13 | + |
| 14 | +1. **Install** — `nava-platform infra install` creates a fresh project directory with a randomized project name (`plt-tst-act-XXXXX`) |
| 15 | +2. **Set up** — Terratest creates infrastructure layers in order: account → network → build-repository → service |
| 16 | +3. **Test** — Terratest validates the deployed resources (e.g. hitting the service endpoint) |
| 17 | +4. **Destroy** — Go `defer` functions tear down each layer in reverse order using the `template-only-bin/destroy-*` scripts |
| 18 | + |
| 19 | +All of this runs in the **default Terraform workspace**. This is different from PR environments (documented separately), which use temporary workspaces. |
| 20 | + |
| 21 | +## The `is_temporary` pattern for temporary environments |
| 22 | + |
| 23 | +Many resources in template-infra use an `is_temporary` variable to gate deletion protection: |
| 24 | + |
| 25 | +```hcl |
| 26 | +# infra/modules/service/load_balancer.tf |
| 27 | +enable_deletion_protection = !var.is_temporary |
| 28 | + |
| 29 | +# infra/modules/database/resources/main.tf |
| 30 | +deletion_protection = !var.is_temporary |
| 31 | + |
| 32 | +# infra/modules/service/access_logs.tf |
| 33 | +force_destroy = var.is_temporary |
| 34 | + |
| 35 | +# infra/modules/identity-provider/resources/main.tf |
| 36 | +deletion_protection = var.is_temporary ? "INACTIVE" : "ACTIVE" |
| 37 | +``` |
| 38 | + |
| 39 | +In project repos, `is_temporary` is typically derived from the workspace: |
| 40 | + |
| 41 | +```hcl |
| 42 | +is_temporary = terraform.workspace != "default" |
| 43 | +``` |
| 44 | + |
| 45 | +This means temporary/PR workspaces get deletion protection disabled automatically. But template-only CI runs in the **default workspace**, so `is_temporary` evaluates to `false` — deletion protection stays **enabled**. The destroy scripts must explicitly override this. |
| 46 | + |
| 47 | +## How the template destroy scripts handle deletion protection |
| 48 | + |
| 49 | +The `template-only-bin/destroy-*` scripts use `sed` to replace `is_temporary`-based expressions with hardcoded values that disable protection, then run a targeted `terraform apply` to apply those overrides before running `terraform destroy`. The general pattern in each script is: |
| 50 | + |
| 51 | +1. `sed` rewrites the Terraform source to hardcode deletion protection off (e.g. `force_destroy = true`, `enable_deletion_protection = false`) |
| 52 | +2. `terraform apply -target=...` applies only the changed resources so the protection settings take effect in AWS |
| 53 | +3. `terraform destroy` tears down all resources in the layer |
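The three steps above can be sketched as follows. This is a minimal illustration, not a copy of the real scripts: the temporary file, the resource name `aws_lb.alb`, and the exact `sed` expression are assumptions made for the example, and the Terraform commands are only echoed because they need real AWS credentials and state.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for a module file; the real scripts edit files under infra/.
workdir="$(mktemp -d)"
tf_file="${workdir}/load_balancer.tf"

cat > "${tf_file}" <<'EOF'
resource "aws_lb" "alb" {
  name = "example"
  # Use a separate line to support automated terraform destroy commands
  enable_deletion_protection = !var.is_temporary
}
EOF

# Step 1: hardcode deletion protection off.
# -i.bak keeps a backup and is accepted by both GNU and BSD sed.
sed -i.bak \
  's/enable_deletion_protection = !var.is_temporary/enable_deletion_protection = false/' \
  "${tf_file}"

cat "${tf_file}"

# Steps 2 and 3 are echoed rather than executed (the target address is hypothetical).
echo "terraform apply -target=aws_lb.alb"
echo "terraform destroy"
```

Keeping the protection attribute on its own line is what makes a single-line `sed` substitution like this reliable.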
| 54 | + |
| 55 | +The scripts that handle deletion protection overrides are: |
| 56 | + |
| 57 | +- [`template-only-bin/destroy-app-service`](../template-only-bin/destroy-app-service) — overrides ALB deletion protection, S3 `force_destroy` for access logs and storage buckets, and Cognito user pool deletion protection |
| 58 | +- [`template-only-bin/destroy-app-database`](../template-only-bin/destroy-app-database) — overrides RDS cluster deletion protection and backup vault `force_destroy` |
| 59 | +- [`template-only-bin/destroy-account`](../template-only-bin/destroy-account) — uses a different pattern: adds `force_destroy = true` to Terraform backend S3 buckets and flips `prevent_destroy = true` to `false` in lifecycle rules |
| 60 | + |
| 61 | +The remaining scripts ([`template-only-bin/destroy-network`](../template-only-bin/destroy-network) and [`template-only-bin/destroy-app-build-repository`](../template-only-bin/destroy-app-build-repository)) have no deletion-protected resources and run `terraform destroy` directly. |
| 62 | + |
| 63 | +## Adding a new deletion-protected resource |
| 64 | + |
| 65 | +When you add a resource that has deletion protection or `force_destroy` behavior: |
| 66 | + |
| 67 | +1. **Use the `is_temporary` pattern in Terraform.** Gate the protection attribute on `var.is_temporary`, following the conventions above. Add a comment like `# Use a separate line to support automated terraform destroy commands` so the intent is clear. |
| 68 | + |
| 69 | +2. **Update the relevant `template-only-bin/destroy-*` script.** Add a `sed` command that replaces your `is_temporary` expression with a hardcoded value that disables protection. Add a matching `-target` to the `terraform apply` command so the override is applied before `terraform destroy` runs. |
| 70 | + |
| 71 | +3. **Test temporary environment cleanup first.** Verify that your resource is properly cleaned up when destroying a temporary/PR environment (non-default workspace where `is_temporary = true`). This is a much faster dev/test cycle than template-only CI. |
| 72 | + |
| 73 | +4. **Test template-only CI.** Once temporary environment cleanup works, run the template-only CI workflow to verify that the destroy step completes successfully with your `sed` overrides. A failed destroy leaves orphaned resources in AWS that need manual cleanup (see below). |
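One failure mode worth guarding against when following these steps is a `sed` pattern that silently matches nothing, which leaves protection enabled and makes `terraform destroy` fail later. The sketch below shows one possible guard; the function `assert_no_is_temporary` and the sample file are hypothetical and are not part of the existing destroy scripts.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: fails when a protection attribute still depends on is_temporary.
assert_no_is_temporary() {
  local file="$1"
  if grep -En '(deletion_protection|force_destroy)[[:space:]]*=.*is_temporary' "${file}"; then
    echo "override did not apply in ${file}" >&2
    return 1
  fi
}

workdir="$(mktemp -d)"
tf_file="${workdir}/main.tf"
printf '%s\n' \
  'resource "aws_s3_bucket" "example" {' \
  '  force_destroy = var.is_temporary' \
  '}' > "${tf_file}"

# Before the override, the guard reports the leftover expression.
if assert_no_is_temporary "${tf_file}"; then
  echo "unexpected: guard passed before override"
else
  echo "guard caught the unmodified file"
fi

# Apply the override, then confirm nothing still references is_temporary.
sed -i.bak 's/force_destroy = var.is_temporary/force_destroy = true/' "${tf_file}"
assert_no_is_temporary "${tf_file}" && echo "guard passed after override"
```

A check like this could run between the `sed` step and the targeted `terraform apply`, so a mismatch shows up locally instead of as an orphaned resource in AWS.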
| 74 | + |
| 75 | +## Detecting and cleaning up orphaned resources |
| 76 | + |
| 77 | +If a CI run fails or is cancelled before the destroy step completes, resources are left behind in AWS. Template-only CI resources are tagged with the `plt-tst-act-*` project name pattern, and there are two workflows that handle them: |
| 78 | + |
| 79 | +- **Scan** — [`.github/workflows/template-only-scan-orphaned-infra-test-resources.yml`](../.github/workflows/template-only-scan-orphaned-infra-test-resources.yml) runs daily and calls [`template-only-bin/cleanup-test-resources --dry-run`](../template-only-bin/cleanup-test-resources) to detect orphaned resources. It uses the AWS Resource Groups Tagging API to find all resources tagged with `plt-tst-act-*` project names. If orphaned resources are found, the workflow deliberately fails so that a notification is sent. |
| 80 | + |
| 81 | +- **Cleanup** — [`.github/workflows/template-only-cleanup-orphaned-infra-test-resources.yml`](../.github/workflows/template-only-cleanup-orphaned-infra-test-resources.yml) is a manually triggered workflow that runs [`template-only-bin/cleanup-test-resources`](../template-only-bin/cleanup-test-resources) to delete orphaned resources. It can target a specific project (e.g. `plt-tst-act-12345`) or find all matching projects. Cleanup is intentionally manual so that it does not mask underlying test failures that should be fixed. |
| 82 | + |
| 83 | +The cleanup script works via AWS APIs and resource tags, not Terraform state — it finds resources by tag, then deletes them in dependency order (ECS services before clusters, S3 contents before buckets, etc.). This means it can clean up resources even when Terraform state has been lost. |