fix(smoke): hotfix — orphan-stack sweep skipped live nested children#328
Merged
CRITICAL hotfix. Post-merge smoke run on main found the umbrella in CREATE_COMPLETE (the success state from the prior PR-CI deploy) and went into use_canonical → sweep_orphan_stacks. The sweep matched ``StackSummaries[?starts_with(StackName, `all-demo-`)]``, which includes ALL 16 of the umbrella's LIVE nested children. Pre-deploy then started cheerfully deleting them one at a time. After 90 minutes GH killed the job for timeout-minutes; by then 14-16 nested stacks had been torn down, half-destroying the working umbrella.

Add a ``ParentId==`null` `` filter so the sweep only considers top-level stacks (true orphans). Active nested children of any parent stack have ParentId set and are now skipped.

Also switched the starts_with arg from backtick-literal to single-quoted JMESPath string, matching the bucket/appregistry queries we fixed earlier (defensive: backticks-as-JSON has bitten us twice already).

The smoke account is now in a torn-half-down state from this incident; the script's UPDATE_ROLLBACK_FAILED ladder should recover it on the next run, but expect 30-90m of CFN churn before the umbrella reaches CREATE_COMPLETE again.
Summary
Post-merge smoke run on main (#26127242680) found the umbrella in `CREATE_COMPLETE` from the prior PR-CI success. `use_canonical` → `sweep_orphan_stacks` then matched `StackSummaries[?starts_with(StackName, 'all-demo-')]`, which includes all 16 of the umbrella's live nested children, and the script started deleting them one by one. After 90 min GH killed the job for `timeout-minutes`; by then 14-16 nested stacks were gone.
Fix
Add a ``ParentId==`null` `` filter so the sweep only considers top-level stacks (true orphans). Live nested children of any parent have `ParentId` set and are now skipped.
Also moved the `starts_with` arg from a backtick-literal to a single-quoted JMESPath string, matching the bucket/appregistry queries we already fixed. (Backticks-as-JSON has bitten us twice already on this PR; making the whole sweep consistent.)
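To make the effect of the `ParentId` filter concrete, here is a minimal Python sketch of the selection logic. It is not the script's actual code: the function name `select_sweep_candidates`, the `top_level_only` flag, and the sample stack names and ARN are all hypothetical, and it assumes the input has the shape of the `StackSummaries` list, where top-level stacks simply have no `ParentId` key.

```python
from typing import Any


def select_sweep_candidates(
    summaries: list[dict[str, Any]],
    prefix: str = "all-demo-",
    top_level_only: bool = True,
) -> list[str]:
    """Return stack names the sweep would consider for deletion.

    With top_level_only=False this mimics the old prefix-only match.
    With top_level_only=True it also requires ParentId to be null/absent,
    which is the condition the fix adds.
    """
    selected = []
    for summary in summaries:
        name = summary.get("StackName", "")
        if not name.startswith(prefix):
            continue
        # Nested children carry their parent's stack ID; true orphans do not.
        if top_level_only and summary.get("ParentId") is not None:
            continue
        selected.append(name)
    return selected


# Hypothetical sample data, for illustration only.
SAMPLE_SUMMARIES = [
    {"StackName": "all-demo", "StackStatus": "CREATE_COMPLETE"},
    {
        "StackName": "all-demo-NetworkStack-ABC123",
        "StackStatus": "CREATE_COMPLETE",
        "ParentId": "arn:aws:cloudformation:us-east-1:111111111111:stack/all-demo/example",
    },
    {"StackName": "all-demo-leftover", "StackStatus": "DELETE_FAILED"},
]

if __name__ == "__main__":
    print("old match:", select_sweep_candidates(SAMPLE_SUMMARIES, top_level_only=False))
    print("new match:", select_sweep_candidates(SAMPLE_SUMMARIES))
```

In this sample the prefix-only match picks up the live nested child alongside the real orphan, while the filtered match keeps only the top-level leftover. The umbrella itself is never selected because its name lacks the trailing hyphen of the prefix.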
Recovery plan
The smoke account is now half-deleted. The script's `UPDATE_ROLLBACK_FAILED` / `DELETE_FAILED` ladders should self-recover on the next run (delete the half-broken umbrella, sweep any real orphans now that ParentId filter works, then recreate). Expect 30-90 min of CFN churn before `all-demo` reaches `CREATE_COMPLETE` again.
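The recovery sequence described above can be sketched as a small state-to-steps mapping. This is only an illustration under stated assumptions: `plan_recovery` is a hypothetical helper, the step names `delete_umbrella` and `create_umbrella` are invented labels (only `use_canonical` and `sweep_orphan_stacks` appear in this PR), and the real script may handle more states or order things differently.

```python
# Umbrella states that, per the recovery plan, lead to delete + sweep + recreate.
FAILED_STATES = {"UPDATE_ROLLBACK_FAILED", "DELETE_FAILED"}


def plan_recovery(umbrella_status: str) -> list[str]:
    """Return the ordered steps expected for a given umbrella stack status.

    Only the states mentioned in this PR are covered; anything else raises
    so that an unknown state is never silently treated as safe to sweep.
    """
    if umbrella_status in FAILED_STATES:
        # Delete the half-broken umbrella, sweep true orphans, then recreate.
        return ["delete_umbrella", "sweep_orphan_stacks", "create_umbrella"]
    if umbrella_status == "CREATE_COMPLETE":
        # Healthy umbrella: reuse it and sweep only top-level leftovers.
        return ["use_canonical", "sweep_orphan_stacks"]
    raise ValueError(f"no recovery plan sketched for status {umbrella_status!r}")


if __name__ == "__main__":
    for status in ("UPDATE_ROLLBACK_FAILED", "DELETE_FAILED", "CREATE_COMPLETE"):
        print(status, "->", plan_recovery(status))
```

Raising on unlisted states is a deliberate choice in this sketch: given that this incident came from a sweep acting on more stacks than intended, failing loudly seems safer than guessing.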
Test plan