Commit c845769

Add documentation for deletion protection handling in template-only CI (#1006)
- Adds `template-only-docs/deletion-protection-in-ci.md` explaining how template-only CI handles resources with deletion protection during the install → test → destroy lifecycle
- Covers the `is_temporary` pattern, the `sed` override approach in destroy scripts, a table of all protected resources, and step-by-step guidance for adding new deletion-protected resources
- Documents the orphaned resource detection mechanism via `scan-orphaned-environments` workflow
1 parent 8141122 commit c845769

1 file changed

Lines changed: 83 additions & 0 deletions

@@ -0,0 +1,83 @@
# Deletion protection in template-only CI

Template-only CI (`.github/workflows/template-only-ci-infra.yml`) provisions real AWS resources, runs Terratest against them, and tears them down, all within a single workflow run. This document explains how deletion protection works in that context and what to do when adding new deletion-protected resources.

Deletion protection matters in two different situations:

- **Temporary/PR environments**: use non-default Terraform workspaces, so `is_temporary` is automatically `true` and deletion protection is disabled. This is documented separately.
- **Template-only CI** (this document): runs in the default workspace, so `is_temporary` is `false` and the destroy scripts must explicitly override deletion protection before tearing down resources.

## How template-only CI creates and destroys resources

Each CI run follows this lifecycle:

1. **Install**: `nava-platform infra install` creates a fresh project directory with a randomized project name (`plt-tst-act-XXXXX`)
2. **Set up**: Terratest creates infrastructure layers in order (account → network → build-repository → service)
3. **Test**: Terratest validates the deployed resources (e.g. hitting the service endpoint)
4. **Destroy**: Go `defer` functions tear down each layer in reverse order using the `template-only-bin/destroy-*` scripts

All of this runs in the **default Terraform workspace**. This is different from PR environments (documented separately), which use temporary workspaces.
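
The ordering in the lifecycle above can be illustrated with a small shell sketch. This is not how the CI is implemented (the real teardown relies on Go `defer` in the Terratest code); it only mimics the last-in, first-out order. The `set_up_layer` and `destroy_layer` functions are hypothetical stand-ins that print what a real step would do.

```shell
#!/bin/sh
# Hypothetical stand-ins for the real Terratest set-up and destroy steps.
set_up_layer()  { echo "set up $1"; }
destroy_layer() { echo "destroy $1"; }

# Layers in creation order, as listed in the lifecycle above.
LAYERS="account network build-repository service"

run_lifecycle() {
  created=""
  for layer in $LAYERS; do
    set_up_layer "$layer"
    # Prepend, so the most recently created layer is destroyed first.
    created="$layer $created"
  done
  echo "run tests"
  for layer in $created; do
    destroy_layer "$layer"
  done
}

run_lifecycle
```

Running the sketch prints the four set-up lines, then `run tests`, then the destroy lines starting with `service` and ending with `account`.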

## The `is_temporary` pattern for temporary environments

Many resources in template-infra use an `is_temporary` variable to gate deletion protection:

```hcl
# infra/modules/service/load_balancer.tf
enable_deletion_protection = !var.is_temporary

# infra/modules/database/resources/main.tf
deletion_protection = !var.is_temporary

# infra/modules/service/access_logs.tf
force_destroy = var.is_temporary

# infra/modules/identity-provider/resources/main.tf
deletion_protection = var.is_temporary ? "INACTIVE" : "ACTIVE"
```

In project repos, `is_temporary` is typically derived from the workspace:

```hcl
is_temporary = terraform.workspace != "default"
```

This means temporary/PR workspaces get deletion protection disabled automatically. But template-only CI runs in the **default workspace**, so `is_temporary` evaluates to `false` and deletion protection stays **enabled**. The destroy scripts must explicitly override this.
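
To make the two cases concrete, the sketch below mirrors the HCL expressions in shell and evaluates them for both kinds of workspace. The function names and the workspace name `p-123` are illustrative assumptions, not names taken from the template.

```shell
# Mirrors the HCL expression: is_temporary = terraform.workspace != "default"
is_temporary() {
  if [ "$1" != "default" ]; then echo true; else echo false; fi
}

# Mirrors the HCL expression: enable_deletion_protection = !var.is_temporary
deletion_protection_enabled() {
  if [ "$(is_temporary "$1")" = "true" ]; then echo false; else echo true; fi
}

# "p-123" is a made-up PR workspace name used only for illustration.
for workspace in default p-123; do
  echo "$workspace: is_temporary=$(is_temporary "$workspace")" \
       "deletion_protection=$(deletion_protection_enabled "$workspace")"
done
```

The `default` row is the template-only CI case: protection is on, so a plain `terraform destroy` would be rejected for the protected resources.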

## How the template destroy scripts handle deletion protection

The `template-only-bin/destroy-*` scripts use `sed` to replace `is_temporary`-based expressions with hardcoded values that disable protection, then run a targeted `terraform apply` to apply those overrides before running `terraform destroy`. The general pattern in each script is:

1. `sed` rewrites the Terraform source to hardcode deletion protection off (e.g. `force_destroy = true`, `enable_deletion_protection = false`)
2. `terraform apply -target=...` applies only the changed resources so the protection settings take effect in AWS
3. `terraform destroy` tears down all resources in the layer

The scripts that handle deletion protection overrides are:

- [`template-only-bin/destroy-app-service`](../template-only-bin/destroy-app-service): overrides ALB deletion protection, S3 `force_destroy` for access logs and storage buckets, and Cognito user pool deletion protection
- [`template-only-bin/destroy-app-database`](../template-only-bin/destroy-app-database): overrides RDS cluster deletion protection and backup vault `force_destroy`
- [`template-only-bin/destroy-account`](../template-only-bin/destroy-account): uses a different pattern, adding `force_destroy = true` to Terraform backend S3 buckets and flipping `prevent_destroy = true` to `false` in lifecycle rules

The remaining scripts ([`template-only-bin/destroy-network`](../template-only-bin/destroy-network) and [`template-only-bin/destroy-app-build-repository`](../template-only-bin/destroy-app-build-repository)) have no deletion-protected resources and run `terraform destroy` directly.
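
The three-step pattern described in this section can be sketched roughly as follows. This is an illustration, not a copy of any real destroy script: the file contents, the resource address `aws_lb.alb`, and the scratch directory are assumptions, and the Terraform commands appear only as comments because they need real AWS credentials and state.

```shell
# Work on a scratch copy so the sketch is safe to run anywhere.
workdir=$(mktemp -d)
cat > "$workdir/load_balancer.tf" <<'EOF'
resource "aws_lb" "alb" {
  name                       = "example"
  enable_deletion_protection = !var.is_temporary
}
EOF

# Step 1: hardcode deletion protection off.
# Writing to a temp file and moving it into place avoids `sed -i`,
# whose syntax differs between GNU and BSD sed.
sed 's/enable_deletion_protection = !var.is_temporary/enable_deletion_protection = false/' \
  "$workdir/load_balancer.tf" > "$workdir/load_balancer.tf.tmp"
mv "$workdir/load_balancer.tf.tmp" "$workdir/load_balancer.tf"

cat "$workdir/load_balancer.tf"

# Steps 2 and 3 (the resource address is a placeholder):
#   terraform apply -auto-approve -target=aws_lb.alb
#   terraform destroy -auto-approve
```

The targeted apply matters because deletion protection is a setting on the live AWS resource, so changing the source alone is not enough; the new value has to be applied before the destroy can succeed.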

## Adding a new deletion-protected resource

When you add a resource that has deletion protection or `force_destroy` behavior:

1. **Use the `is_temporary` pattern in Terraform.** Gate the protection attribute on `var.is_temporary`, following the conventions above. Add a comment like `# Use a separate line to support automated terraform destroy commands` so the intent is clear.

2. **Update the relevant `template-only-bin/destroy-*` script.** Add a `sed` command that replaces your `is_temporary` expression with a hardcoded value that disables protection. Add a matching `-target` to the `terraform apply` command so the override is applied before `terraform destroy` runs.

3. **Test temporary environment cleanup first.** Verify that your resource is properly cleaned up when destroying a temporary/PR environment (non-default workspace where `is_temporary = true`). This is a much faster dev/test cycle than template-only CI.

4. **Test template-only CI.** Once temporary environment cleanup works, run the template-only CI workflow to verify that the destroy step completes successfully with your `sed` overrides. A failed destroy leaves orphaned resources in AWS that need manual cleanup (see below).
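
One thing to watch for when writing the `sed` command in step 2: `sed` exits successfully even when its pattern matches nothing, so an override can silently stop working if the Terraform expression is later reformatted. A possible guard is sketched below. The `override_or_fail` helper, the scratch file, and the `deletion_protection_enabled` attribute are hypothetical examples; nothing here is taken from the existing destroy scripts.

```shell
# Replace PATTERN with REPLACEMENT in FILE, failing loudly if PATTERN is absent.
# Both arguments are used as basic regular expressions and must not contain "/".
override_or_fail() {
  file=$1 pattern=$2 replacement=$3
  if ! grep -q -e "$pattern" "$file"; then
    echo "error: pattern not found in $file: $pattern" >&2
    return 1
  fi
  sed "s/$pattern/$replacement/" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Example run against a scratch file standing in for a new resource.
scratch=$(mktemp -d)
printf '%s\n' 'deletion_protection_enabled = !var.is_temporary' > "$scratch/main.tf"

override_or_fail "$scratch/main.tf" \
  'deletion_protection_enabled = !var.is_temporary' \
  'deletion_protection_enabled = false'

cat "$scratch/main.tf"
```

Failing early in the destroy script is cheaper than discovering the problem when `terraform destroy` is rejected and resources are left behind.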

## Detecting and cleaning up orphaned resources

If a CI run fails or is cancelled before the destroy step completes, resources are left behind in AWS. Template-only CI resources are tagged with the `plt-tst-act-*` project name pattern, and there are two workflows that handle them:

- **Scan**: [`.github/workflows/template-only-scan-orphaned-infra-test-resources.yml`](../.github/workflows/template-only-scan-orphaned-infra-test-resources.yml) runs daily and calls [`template-only-bin/cleanup-test-resources --dry-run`](../template-only-bin/cleanup-test-resources) to detect orphaned resources. It uses the AWS Resource Groups Tagging API to find all resources tagged with `plt-tst-act-*` project names. If orphaned resources are found, the workflow fails deliberately, which triggers a notification.

- **Cleanup**: [`.github/workflows/template-only-cleanup-orphaned-infra-test-resources.yml`](../.github/workflows/template-only-cleanup-orphaned-infra-test-resources.yml) is a manually triggered workflow that runs [`template-only-bin/cleanup-test-resources`](../template-only-bin/cleanup-test-resources) to delete orphaned resources. It supports targeting a specific project (e.g. `plt-tst-act-12345`) or finding all matching projects. Cleanup is intentionally manual (not automatic) to avoid masking underlying test issues that should be fixed.

The cleanup script works via AWS APIs and resource tags, not Terraform state. It finds resources by tag, then deletes them in dependency order (ECS services before clusters, S3 contents before buckets, etc.). This means it can clean up resources even when Terraform state has been lost.
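
The scan behavior described above (match project names against `plt-tst-act-*`, then fail if any are found) can be sketched as follows. This is a simplified illustration rather than the contents of `cleanup-test-resources`: the function names are hypothetical, and the project names are fed in from a fixed list, whereas the real script reads them from the AWS Resource Groups Tagging API.

```shell
# Print only the project names that match the CI naming pattern.
filter_test_projects() {
  while IFS= read -r project; do
    case "$project" in
      plt-tst-act-*) echo "$project" ;;
    esac
  done
}

# Report each orphaned project and return non-zero if any were found,
# so that a calling workflow step is marked as failed.
scan_orphans() {
  count=0
  while IFS= read -r project; do
    echo "orphaned project: $project"
    count=$((count + 1))
  done
  if [ "$count" -gt 0 ]; then return 1; fi
  return 0
}

# Made-up project names standing in for tag values found in AWS.
printf '%s\n' plt-tst-act-12345 my-real-project plt-tst-act-67890 \
  | filter_test_projects \
  | scan_orphans || echo "scan failed: orphaned resources found"
```

In this run, the two `plt-tst-act-*` names are reported, `my-real-project` is ignored, and the non-zero exit status produces the final failure message.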
