fix(smoke): cover S3 Files access points + async-delete propagation #344
Merged
PR #321's smoke retry surfaced three more layers of orphan-cleanup gaps:
1. delete-file-system fails with ConflictException "has access points"
even after mount targets are gone. AWS::S3Files::AccessPoint must be
deleted before mount targets, which must be deleted before the file
system. Update the s3files sweep to delete in that order:
access points (15s settle) → mount targets (30s settle) → file system.
2. AppRegistry delete-application is async on the server side. The
sweep returned success, deploy fired immediately, and CFN's "already
own application <NDXTry_BOPS_Planning_…>" check rejected the create.
Add a 30s grace period at the end of sweep_orphan_appregistry,
matching the s3files pattern.
3. Orphan ${STACK}-* stacks left over from prior runs were sometimes
in DELETE_IN_PROGRESS state, which made them invisible to the sweep's
status filter (it only listed terminal states). The deploy raced their
AppRegistryAssociation children and hit
"stack status as DELETE_IN_PROGRESS … is not allowed". Add a second
pass at the end of sweep_orphan_stacks that polls list-stacks for
in-progress matches (ParentId==null) and runs wait_for_stable on each.
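The second pass described in item 3 could look roughly like the sketch below. It is an illustration under assumptions, not the script's actual code: the function names are hypothetical, and the `aws cloudformation wait stack-delete-complete` waiter stands in for the script's own wait_for_stable helper, whose body is not shown here. The JMESPath filter uses the backtick literal `` `null` `` for the ParentId check.

```shell
# Hypothetical helper: list top-level stacks with the given prefix that are
# still mid-delete. ParentId is null for stacks that are not nested.
list_in_progress_stacks() {
  local prefix="$1"
  aws cloudformation list-stacks \
    --stack-status-filter DELETE_IN_PROGRESS \
    --query "StackSummaries[?starts_with(StackName, '${prefix}-') && ParentId==\`null\`].StackName" \
    --output text
}

# Hypothetical second pass: block until each in-progress delete settles.
# The built-in waiter is a stand-in for the script's wait_for_stable.
wait_for_in_progress_stacks() {
  local prefix="$1" name
  for name in $(list_in_progress_stacks "$prefix"); do
    echo "waiting for $name"
    # A waiter timeout should not abort the sweep under set -e.
    aws cloudformation wait stack-delete-complete --stack-name "$name" || true
  done
}
```

Running this after the terminal-state sweep means the deploy only starts once no matching top-level stack is still deleting.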
Also bundled in: `|| true` on the cleanup_orphan and delete_bucket_completely
calls inside their respective sweep loops, so a per-item failure no
longer kills the whole pre-deploy via set -e. (The stranded-stack issues
remain the audit trail.) In addition, s3files delete stderr is now captured
to a variable instead of piped through sed, which avoids the pipefail trap
that bit us earlier.
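A minimal sketch of both patterns follows. The cleanup_orphan body here is a hypothetical stand-in that fails for one item, and sweep and try_delete are illustrative names. The exact pipefail failure is not spelled out above, so the second function only shows the general shape of capturing stderr into a variable and branching on the exit status.

```shell
# Hypothetical stand-in for the real cleanup_orphan: fails for item "bad".
cleanup_orphan() {
  [ "$1" != "bad" ] || return 1
  echo "cleaned $1"
}

# Sweep loop: `|| true` keeps one failing item from aborting the whole
# run when the script is executing under set -e.
sweep() {
  local item
  for item in "$@"; do
    cleanup_orphan "$item" || true
  done
}

# Run a command, keep only its stderr in a variable, and branch on the
# command's own exit status. No pipeline is involved, so pipefail cannot
# change the outcome.
try_delete() {
  local err
  if err="$("$@" 2>&1 >/dev/null)"; then
    echo "ok"
  else
    echo "failed: ${err}"
  fi
}
```

With this shape the caller decides what a failure means (retry, log, ignore) instead of leaving it to the shell options in effect.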
When iterative pre-deploy cleanup has accumulated more debris than the
script can sweep in 90 minutes, manual recovery is needed. Add a
workflow-dispatch workflow + script that runs under the existing
smoke-test-deploy role and clears every:
- ${STACK}* CloudFormation stack (top-level)
- AWS::S3Files::FileSystem (with access points + mount targets first)
- NDXTry_* AppRegistry applications
- ndx-try-* Connect instances
- ndx-try-*${ACCOUNT_ID}* S3 buckets
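The file-system entry in the list above implies an ordering: access points, then mount targets, then the file system itself, with the 15s and 30s settle delays from the sweep. A sketch of that ordering, assuming three hypothetical delete_* helpers (placeholders that only echo; the real AWS CLI calls are not shown in this description):

```shell
# Hypothetical placeholders; the real script would call the AWS CLI here.
delete_access_points() { echo "access-points $1"; }
delete_mount_targets() { echo "mount-targets $1"; }
delete_file_system()   { echo "file-system $1"; }

# Settle delays in seconds, overridable for dry runs.
AP_SETTLE="${AP_SETTLE:-15}"
MT_SETTLE="${MT_SETTLE:-30}"

# Tear down one file system in dependency order, pausing after each
# layer so the asynchronous deletes can propagate before the next step.
teardown_file_system() {
  local fs_id="$1"
  delete_access_points "$fs_id" || return 1
  sleep "$AP_SETTLE"
  delete_mount_targets "$fs_id" || return 1
  sleep "$MT_SETTLE"
  delete_file_system "$fs_id"
}
```

Stopping at the first failed layer avoids calling the file-system delete while dependents may still exist, which is the ConflictException case.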
Same identity, same concurrency group as smoke.yml, so the two can't
run simultaneously. Requires typing NUKE as the confirm input to fire,
to avoid accidental destruction of the smoke account.
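The confirm gate can be as small as the sketch below. The function name require_confirm is hypothetical, and how the workflow passes the input to the script is an assumption (here, as the first argument).

```shell
# Hypothetical guard: proceed only if the confirm input is exactly NUKE.
require_confirm() {
  if [ "${1:-}" != "NUKE" ]; then
    echo "refusing to run: confirm input must be exactly NUKE" >&2
    return 1
  fi
}
```

The comparison is case-sensitive on purpose, so a partial or lowercase entry does not fire the cleanup.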
This is the scripted version of the "manual cleanup" we kept reaching for during
PR #321's CI iterations. After running this, the next smoke run starts
from a true clean slate.
Summary
The PR #321 retry surfaced three more pre-deploy orphan-cleanup gaps. This PR closes all three and bundles in the mount-targets work that went directly onto renovate/node-26.x (PR #321):
Bundled in:
Test plan