Skip to content

Improves CI test resource cleanup to prevent orphaned AWS resources from failed/cancelled test runs.#973

Merged
sean-navapbc merged 46 commits intomainfrom
fix/improve-test-cleanup-process
Jan 9, 2026
Merged

Improves CI test resource cleanup to prevent orphaned AWS resources from failed/cancelled test runs.#973
sean-navapbc merged 46 commits intomainfrom
fix/improve-test-cleanup-process

Conversation

@sean-navapbc
Copy link
Copy Markdown
Contributor

@sean-navapbc sean-navapbc commented Nov 5, 2025

Summary

Improves CI test resource cleanup to prevent orphaned AWS resources from failed/cancelled test runs.

Changes

  1. Cleanup Script (template-only-bin/cleanup-test-resources)

New comprehensive script to find and delete orphaned test resources tagged with plt-tst-act-*:

  • Route 53 hosted zones
  • ECS clusters, services, and task definitions
  • Load balancers and target groups
  • S3 buckets and DynamoDB tables
  • KMS keys (scheduled for deletion with 7-day wait)
  • Supports --dry-run mode to preview changes
  1. Destroy Script (template-only-bin/destroy-app-service)

Enhanced teardown script with ECS task definition cleanup:

  1. Scan Workflow (.github/workflows/template-only-scan-orphaned-infra-test-resources.yml)

Daily scheduled workflow to detect orphaned resources:

  • Runs cleanup script in dry-run mode
  • Fails if orphaned resources are found (alerts on resource leaks)
  1. Cleanup Workflow (.github/workflows/template-only-cleanup-orphaned-infra-test-resources.yml)

Manual workflow to clean up orphaned resources when needed.

Test Plan

  • Run cleanup script with --dry-run to verify detection
  • Run cleanup script to clear existing orphaned resources
  • Verify CI tests pass
  • Monitor for orphaned resources after merge

Test runs that fail, timeout, or are cancelled leave AWS resources behind, causing subsequent test failures (e.g., plt-tst-act-32078 left a Route53 zone causing 'multiple zones matched' errors). This adds resilient error handling to in-test teardown functions, a GitHub Actions cleanup step that always runs regardless of test outcome, and a scheduled job to detect orphaned resources. Together these layers ensure cleanup happens even when tests fail unexpectedly and provides a safety net for any missed cleanups.
@sean-navapbc sean-navapbc requested a review from a team as a code owner November 5, 2025 01:58
Sean Thomas added 3 commits November 5, 2025 14:25
Quote all variables to prevent word splitting and use command grouping for redirects.
Use bash parameter expansion instead of sed and disable false positive for JMESPath syntax.
@sean-navapbc sean-navapbc changed the title Add three-layer defense against orphaned test resources Clean up template-infra pipeline with 2 major changes Nov 5, 2025
@sean-navapbc sean-navapbc requested a review from doshitan November 5, 2025 23:11
Copy link
Copy Markdown
Contributor

@doshitan doshitan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go test timeout kills process before defer cleanup runs

Have we seen this happen in practice? If so, where and what was the hangup that caused us to hit the timeout?

Depending what it was, the better fix may be to increase/disable the go test timeout and have the tests implement a more graceful timeout internally. With either the increased go test timeout or GH Action timeout as the backup catch-all for hanging processes.

Sean Thomas added 3 commits November 14, 2025 14:36
- Move cleanup script from bin/ to template-only-bin/ (template-specific)
- Fix variable syntax to use ${VAR} consistently throughout script
- Extract hardcoded account ID and region to variables at top of script
- Convert to manual workflow_dispatch trigger instead of automatic cleanup
  to avoid masking underlying test issues
Runs daily at 08:00 UTC to detect test resources that weren't cleaned up.
If found, sends notification and fails the workflow to alert maintainers.
Does not auto-delete - maintainers can manually trigger the cleanup workflow.
Each teardown function now retries up to 3 times with 5s delay between
attempts to handle intermittent AWS timeouts during resource deletion.
Uses terratest's built-in retry module for robust error handling.
Copy link
Copy Markdown
Contributor

@doshitan doshitan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have answers to the previous questions in #973 (review)?

- Rename workflows to follow template-only- naming convention
- Fix teardown error handling to preserve test failures
- Create runCommandWithRetry utility function to reduce code duplication
- Remove GitHub Actions cleanup step (now handled by test teardowns)
@sean-navapbc
Copy link
Copy Markdown
Contributor Author

sean-navapbc commented Nov 25, 2025

Re: the question about go test timeouts - I reviewed the codebase and workflow runs but couldn't find concrete evidence that the 30m go test timeout has been hit in practice.

The PR description mentions plt-tst-act-32078 from Oct 20 as an example of orphaned resources, but it's unclear if that was specifically due to a go test timeout or another cause (cancellation, failure, etc.).

I've already addressed the other review comments:

  • Fixed teardown error handling (changed from DoWithRetryE to DoWithRetry)
  • Created runCommandWithRetry utility function
  • Removed GitHub Actions cleanup step

For the timeout question: if we haven't actually seen the go test timeout being hit, the current 30m timeout + the resilient teardown logic (with retry) should be sufficient. The GH Actions job timeout would still serve as the backup catch-all for truly hanging processes.

Happy to make additional changes if there's evidence this has been a problem in practice.

Copy link
Copy Markdown
Contributor

@doshitan doshitan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The infra tests failed:

Error: ./template_infra_test.go:197:2: undefined: runCommandWithRetry
Error: ./template_infra_test.go:207:2: undefined: runCommandWithRetry
Error: ./template_infra_test.go:217:2: undefined: runCommandWithRetry
Error: ./template_infra_test.go:227:2: undefined: runCommandWithRetry

See inline comments.

And we should post evidence of the cleanup-test-resources working, probably via the GitHub actions? So like update the new GitHub actions to trigger based on pushes to this feature branch temporarily, then run the scan one, capture the results. Eventually we should run the cleanup action itself, but can wait until we are ready to merge (in case we want to do more testing of the cleanup script).

Sean Thomas added 2 commits November 26, 2025 15:12
The function was previously in infra/test/helpers.go but needs to be
accessible from template-only-test package. Moving it to the same file
resolves the undefined function error.
Add push trigger for fix/improve-test-cleanup-process branch to
demonstrate the cleanup-test-resources script working in CI.
Also fix workflow filename in notification message.

This trigger should be removed before merging.
@sean-navapbc
Copy link
Copy Markdown
Contributor Author

Fixed the test failures and added workflow testing:

Fixes:

  • ✅ Added runCommandWithRetry helper function to template-only-test/template_infra_test.go (was previously in separate package)
  • ✅ Added missing retry import

Testing the cleanup script:

  • ✅ Added temporary push trigger to template-scan-orphaned-infra-test-resources.yml for this branch
  • ✅ Fixed workflow filename reference in notification message

The scan workflow should now run automatically on pushes to this branch, demonstrating the cleanup-test-resources script in action. Once testing is complete and before merging, we'll remove the temporary push trigger.

Sean Thomas added 4 commits November 26, 2025 15:17
- template-scan-orphaned-infra-test-resources.yml → template-only-scan-orphaned-infra-test-resources.yml
- template-cleanup-orphaned-infra-test-resources.yml → template-only-cleanup-orphaned-infra-test-resources.yml
- Updated workflow names inside files to match
- Updated workflow reference in notification message

This follows the template-only- naming convention for resources that
should not be installed to projects.
The cleanup-test-resources script is already executable in the git
repository (file mode 100755), so chmod +x is not needed.
- cleanup-test-resources → template-only-cleanup-test-resources
- Updated all workflow references to use new script name

Consistent with template-only- naming convention for resources that
should not be installed to projects.
Use { cmd1; cmd2; } >> file syntax instead of individual redirects
to GITHUB_OUTPUT as recommended by shellcheck.
sean-navapbc and others added 2 commits December 1, 2025 11:01
…sources.yml

Co-authored-by: Tanner Doshier <git@doshitan.com>
- Rename template-only-cleanup-test-resources to cleanup-test-resources
  (no template-only- prefix needed since it's in template-only-bin/)
- Update workflow display names to be human readable
- Fix broken pipe errors in scan workflow output parsing
sean-navapbc and others added 4 commits December 11, 2025 12:23
- Use printf with stderr redirect to fix broken pipe errors in scan workflow
- Add cleanup_inactive_task_definitions() to directly query and delete
  inactive ECS task definitions matching plt-tst-act-* pattern
- Inactive task definitions aren't returned by Resource Groups Tagging API
@sean-navapbc
Copy link
Copy Markdown
Contributor Author


Issues Fixed

  1. Lint issue (SC2001) ✅

Fixed in commit: a5d77ae

Changed from:
bucket_name=$(echo "${s3_arn}" | sed 's/arn:aws:s3::://')
To:
bucket_name="${s3_arn#arn:aws:s3:::}"

Evidence: https://github.com/navapbc/template-infra/actions/runs/20143437468/job/57817201077

  1. Broken pipe errors ✅

Fixed in commit: aea90f5

Added stderr redirect to suppress broken pipe messages when head terminates early:
{ echo "$output" | head -50 || true; } 2>/dev/null

Evidence: https://github.com/navapbc/template-infra/actions/runs/20143436251/job/57817196876

  1. Inactive task definitions not being found ✅

Fixed in commit: daa0f2e

Added cleanup_inactive_task_definitions() function that directly queries ECS for inactive task definitions matching plt-tst-act-*
pattern, bypassing the Resource Groups Tagging API which doesn't return INACTIVE task definitions.

aws ecs list-task-definition-families --family-prefix "plt-tst-act-" --status INACTIVE
aws ecs delete-task-definitions --task-definitions "${task_arn}"

CI Status

All checks passing:

  • Lint template scripts ✅
  • Scan for orphaned test resources ✅ (no broken pipe errors)
  • tfsec, checkov, terraform format ✅

@doshitan
Copy link
Copy Markdown
Contributor

2. Broken pipe errors ✅

Fixed in commit: aea90f5

Added stderr redirect to suppress broken pipe messages when head terminates early: { echo "$output" | head -50 || true; } 2>/dev/null

Evidence: https://github.com/navapbc/template-infra/actions/runs/20143436251/job/57817196876

That run still shows:

/home/runner/work/_temp/0c386a34-b716-470a-ad98-86ea74797aaf.sh: line 8: echo: write error: Broken pipe

And does not list any orphaned resources, but it found a bunch of "orphaned" projects, so there should be resources right?

3. Inactive task definitions not being found ✅

The latest scan runs don't show any inactive task definitions?

Sean Thomas added 3 commits December 16, 2025 12:54
- Use '> /dev/null' instead of 'grep -q' to avoid broken pipe
- Print full output for better debugging
- Add summary showing how many projects were checked
The broken pipe errors occur when piping large output through grep/head
because the receiving command exits before all input is consumed.

Using a temp file avoids this issue entirely.
…refix

Task definitions are named like 'app-dev' but tagged with project=plt-tst-act-*.
The previous code searched for family prefix 'plt-tst-act-*' which never matched.
Now we list all task definition families, check their tags, and only delete
ones tagged with plt-tst-act-* projects.
@sean-navapbc
Copy link
Copy Markdown
Contributor Author

2. Broken pipe errors ✅

Fixed in commit: aea90f5
Added stderr redirect to suppress broken pipe messages when head terminates early: { echo "$output" | head -50 || true; } 2>/dev/null
Evidence: https://github.com/navapbc/template-infra/actions/runs/20143436251/job/57817196876

That run still shows:

/home/runner/work/_temp/0c386a34-b716-470a-ad98-86ea74797aaf.sh: line 8: echo: write error: Broken pipe

And does not list any orphaned resources, but it found a bunch of "orphaned" projects, so there should be resources right?

3. Inactive task definitions not being found ✅

The latest scan runs don't show any inactive task definitions?

Should be fixed now

@doshitan
Copy link
Copy Markdown
Contributor

The PR description doesn't quite match the current scope of the work, so can update that. A few other tweaks and probably some clearer comments/docs, but then probably ready to ship. Once we are confident things are mostly working, we should run the cleanup script, confirming everything is cleared out. Then merge and monitor how things work out going forward.

Sean Thomas added 4 commits January 5, 2026 17:48
Removes the runCommandWithRetry function from helpers.go as it's not
currently used in any tests. Also removes the unused imports (time,
retry, shell) that were only needed for that function.
The --age-hours option was defined but never actually used to filter
resources by age. Removing it to avoid confusion.
- Clarify why cleanup_inactive_task_definitions scans ECS directly
- Document that Terraform manages task definitions but AWS provider
  doesn't support automatic deletion on destroy
- Add link to upstream issue hashicorp/terraform-provider-aws#5919
- Refactor destroy-app-service to use pushd/popd instead of cd
@sean-navapbc sean-navapbc changed the title Clean up template-infra pipeline with 2 major changes Improves CI test resource cleanup to prevent orphaned AWS resources from failed/cancelled test runs. Jan 6, 2026
@sean-navapbc
Copy link
Copy Markdown
Contributor Author

Some other issues we can address out of scope:

  1. Route 53 deletion - might fail if hosted zone has records (would need to delete records first)
  2. Hardcoded AWS account ID - AWS_ACCOUNT_ID="533267424629" in cleanup-test-resources

@sean-navapbc sean-navapbc requested a review from doshitan January 8, 2026 00:26
Copy link
Copy Markdown
Contributor

@doshitan doshitan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a recent run of the "scan" GitHub workflow. What does that look like at the moment? What's the current state of the account?

The link in the description is still incorrect, should point to hashicorp/terraform-provider-aws#29682

But if the script is running successfully in dry run mode and clean up mode, with and without resources to find, we should ship it!

@sean-navapbc sean-navapbc merged commit 7ad5ef3 into main Jan 9, 2026
9 checks passed
@sean-navapbc sean-navapbc deleted the fix/improve-test-cleanup-process branch January 9, 2026 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants