Improves CI test resource cleanup to prevent orphaned AWS resources from failed/cancelled test runs.#973
Conversation
Test runs that fail, timeout, or are cancelled leave AWS resources behind, causing subsequent test failures (e.g., plt-tst-act-32078 left a Route53 zone causing 'multiple zones matched' errors). This adds resilient error handling to in-test teardown functions, a GitHub Actions cleanup step that always runs regardless of test outcome, and a scheduled job to detect orphaned resources. Together these layers ensure cleanup happens even when tests fail unexpectedly and provides a safety net for any missed cleanups.
Quote all variables to prevent word splitting and use command grouping for redirects.
Use bash parameter expansion instead of sed and disable false positive for JMESPath syntax.
doshitan
left a comment
There was a problem hiding this comment.
Go test timeout kills process before defer cleanup runs
Have we seen this happen in practice? If so, where and what was the hangup that caused us to hit the timeout?
Depending what it was, the better fix may be to increase/disable the go test timeout and have the tests implement a more graceful timeout internally. With either the increased go test timeout or GH Action timeout as the backup catch-all for hanging processes.
- Move cleanup script from bin/ to template-only-bin/ (template-specific)
- Fix variable syntax to use ${VAR} consistently throughout script
- Extract hardcoded account ID and region to variables at top of script
- Convert to manual workflow_dispatch trigger instead of automatic cleanup
to avoid masking underlying test issues
Runs daily at 08:00 UTC to detect test resources that weren't cleaned up. If found, sends notification and fails the workflow to alert maintainers. Does not auto-delete - maintainers can manually trigger the cleanup workflow.
Each teardown function now retries up to 3 times with 5s delay between attempts to handle intermittent AWS timeouts during resource deletion. Uses terratest's built-in retry module for robust error handling.
doshitan
left a comment
There was a problem hiding this comment.
Do we have answers to the previous questions in #973 (review)?
.github/workflows/template-only-cleanup-orphaned-infra-test-resources.yml
Show resolved
Hide resolved
- Rename workflows to follow template-only- naming convention - Fix teardown error handling to preserve test failures - Create runCommandWithRetry utility function to reduce code duplication - Remove GitHub Actions cleanup step (now handled by test teardowns)
|
Re: the question about go test timeouts - I reviewed the codebase and workflow runs but couldn't find concrete evidence that the 30m go test timeout has been hit in practice. The PR description mentions I've already addressed the other review comments:
For the timeout question: if we haven't actually seen the go test timeout being hit, the current 30m timeout + the resilient teardown logic (with retry) should be sufficient. The GH Actions job timeout would still serve as the backup catch-all for truly hanging processes. Happy to make additional changes if there's evidence this has been a problem in practice. |
doshitan
left a comment
There was a problem hiding this comment.
The infra tests failed:
Error: ./template_infra_test.go:197:2: undefined: runCommandWithRetry
Error: ./template_infra_test.go:207:2: undefined: runCommandWithRetry
Error: ./template_infra_test.go:217:2: undefined: runCommandWithRetry
Error: ./template_infra_test.go:227:2: undefined: runCommandWithRetry
See inline comments.
And we should post evidence of the cleanup-test-resources working, probably via the GitHub actions? So like update the new GitHub actions to trigger based on pushes to this feature branch temporarily, then run the scan one, capture the results. Eventually we should run the cleanup action itself, but can wait until we are ready to merge (in case we want to do more testing of the cleanup script).
.github/workflows/template-only-cleanup-orphaned-infra-test-resources.yml
Outdated
Show resolved
Hide resolved
.github/workflows/template-only-scan-orphaned-infra-test-resources.yml
Outdated
Show resolved
Hide resolved
.github/workflows/template-only-scan-orphaned-infra-test-resources.yml
Outdated
Show resolved
Hide resolved
The function was previously in infra/test/helpers.go but needs to be accessible from template-only-test package. Moving it to the same file resolves the undefined function error.
Add push trigger for fix/improve-test-cleanup-process branch to demonstrate the cleanup-test-resources script working in CI. Also fix workflow filename in notification message. This trigger should be removed before merging.
|
Fixed the test failures and added workflow testing: Fixes:
Testing the cleanup script:
The scan workflow should now run automatically on pushes to this branch, demonstrating the cleanup-test-resources script in action. Once testing is complete and before merging, we'll remove the temporary push trigger. |
- template-scan-orphaned-infra-test-resources.yml → template-only-scan-orphaned-infra-test-resources.yml - template-cleanup-orphaned-infra-test-resources.yml → template-only-cleanup-orphaned-infra-test-resources.yml - Updated workflow names inside files to match - Updated workflow reference in notification message This follows the template-only- naming convention for resources that should not be installed to projects.
The cleanup-test-resources script is already executable in the git repository (file mode 100755), so chmod +x is not needed.
- cleanup-test-resources → template-only-cleanup-test-resources - Updated all workflow references to use new script name Consistent with template-only- naming convention for resources that should not be installed to projects.
Use { cmd1; cmd2; } >> file syntax instead of individual redirects
to GITHUB_OUTPUT as recommended by shellcheck.
.github/workflows/template-only-cleanup-orphaned-infra-test-resources.yml
Outdated
Show resolved
Hide resolved
.github/workflows/template-only-scan-orphaned-infra-test-resources.yml
Outdated
Show resolved
Hide resolved
.github/workflows/template-only-scan-orphaned-infra-test-resources.yml
Outdated
Show resolved
Hide resolved
…sources.yml Co-authored-by: Tanner Doshier <git@doshitan.com>
- Rename template-only-cleanup-test-resources to cleanup-test-resources (no template-only- prefix needed since it's in template-only-bin/) - Update workflow display names to be human readable - Fix broken pipe errors in scan workflow output parsing
- Use printf with stderr redirect to fix broken pipe errors in scan workflow - Add cleanup_inactive_task_definitions() to directly query and delete inactive ECS task definitions matching plt-tst-act-* pattern - Inactive task definitions aren't returned by Resource Groups Tagging API
|
Issues Fixed
Fixed in commit: a5d77ae Changed from: Evidence: https://github.com/navapbc/template-infra/actions/runs/20143437468/job/57817201077
Fixed in commit: aea90f5 Added stderr redirect to suppress broken pipe messages when head terminates early: Evidence: https://github.com/navapbc/template-infra/actions/runs/20143436251/job/57817196876
Fixed in commit: daa0f2e Added cleanup_inactive_task_definitions() function that directly queries ECS for inactive task definitions matching plt-tst-act-* aws ecs list-task-definition-families --family-prefix "plt-tst-act-" --status INACTIVE CI Status All checks passing:
|
That run still shows: And does not list any orphaned resources, but it found a bunch of "orphaned" projects, so there should be resources right?
The latest scan runs don't show any inactive task definitions? |
- Use '> /dev/null' instead of 'grep -q' to avoid broken pipe - Print full output for better debugging - Add summary showing how many projects were checked
The broken pipe errors occur when piping large output through grep/head because the receiving command exits before all input is consumed. Using a temp file avoids this issue entirely.
…refix Task definitions are named like 'app-dev' but tagged with project=plt-tst-act-*. The previous code searched for family prefix 'plt-tst-act-*' which never matched. Now we list all task definition families, check their tags, and only delete ones tagged with plt-tst-act-* projects.
Should be fixed now |
|
The PR description doesn't quite match the current scope of the work, so can update that. A few other tweaks and probably some clearer comments/docs, but then probably ready to ship. Once we are confident things are mostly working, we should run the cleanup script, confirming everything is cleared out. Then merge and monitor how things work out going forward. |
Removes the runCommandWithRetry function from helpers.go as it's not currently used in any tests. Also removes the unused imports (time, retry, shell) that were only needed for that function.
The --age-hours option was defined but never actually used to filter resources by age. Removing it to avoid confusion.
- Clarify why cleanup_inactive_task_definitions scans ECS directly - Document that Terraform manages task definitions but AWS provider doesn't support automatic deletion on destroy - Add link to upstream issue hashicorp/terraform-provider-aws#5919 - Refactor destroy-app-service to use pushd/popd instead of cd
|
Some other issues we can address out of scope:
|
doshitan
left a comment
There was a problem hiding this comment.
I don't see a recent run of the "scan" GitHub workflow. What does that look like at the moment? What's the current state of the account?
The link in the description is still incorrect, should point to hashicorp/terraform-provider-aws#29682
But if the script is running successfully in dry run mode and clean up mode, with and without resources to find, we should ship it!
Summary
Improves CI test resource cleanup to prevent orphaned AWS resources from failed/cancelled test runs.
Changes
New comprehensive script to find and delete orphaned test resources tagged with plt-tst-act-*:
Enhanced teardown script with ECS task definition cleanup:
Daily scheduled workflow to detect orphaned resources:
Manual workflow to clean up orphaned resources when needed.
Test Plan