fix(smoke): hotfix — orphan-stack sweep skipped live nested children #328

Merged
chrisns merged 1 commit into main from fix/smoke-sweep-orphans-only
May 20, 2026

@chrisns
Member

@chrisns chrisns commented May 20, 2026

Summary

The post-merge smoke run on main (#26127242680) found the umbrella stack in `CREATE_COMPLETE`, left over from the prior PR-CI success, and took the `use_canonical` → `sweep_orphan_stacks` path. The sweep's query, `StackSummaries[?starts_with(StackName, 'all-demo-')]`, matches every stack whose name begins with `all-demo-`, which includes all 16 of the umbrella's live nested children. The script then started deleting them one by one. After 90 min GitHub killed the job for exceeding `timeout-minutes`; by then 14-16 nested stacks were gone.

Fix

Add a `ParentId==null` filter so the sweep only considers top-level stacks. Live nested children of any parent have `ParentId` set and are now skipped.

Also moved the `starts_with` argument from a backtick literal to a single-quoted JMESPath string, matching the bucket/appregistry queries we already fixed. (Backticks-as-JSON has bitten us twice already on this PR; this makes the whole sweep consistent.)
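For reviewers, a minimal sketch of what the filtered query could look like. The helper names `orphan_query` and `list_orphan_stacks` are hypothetical and are not taken from the script; the sketch assumes the sweep uses `aws cloudformation list-stacks`, whose stack summaries carry `ParentId` only for nested stacks. Any status filtering the real sweep applies is left out.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: build the JMESPath query for a stack-name prefix.
# The prefix is a single-quoted JMESPath raw string; only the JSON null
# literal uses backticks, which must be escaped inside bash double quotes.
orphan_query() {
  local prefix="$1"
  printf "StackSummaries[?starts_with(StackName, '%s') && ParentId==\`null\`].StackName" "$prefix"
}

# Hypothetical wrapper: list top-level stacks only. A missing ParentId
# evaluates to null in JMESPath, so nested children are filtered out.
list_orphan_stacks() {
  aws cloudformation list-stacks \
    --query "$(orphan_query "$1")" \
    --output text
}
```

Calling `orphan_query 'all-demo-'` prints the query string without touching AWS, which makes the quoting easy to eyeball before it reaches the CLI.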

Recovery plan

The smoke account is now half-deleted. The script's `UPDATE_ROLLBACK_FAILED` / `DELETE_FAILED` ladders should self-recover on the next run (delete the half-broken umbrella, sweep any real orphans now that ParentId filter works, then recreate). Expect 30-90 min of CFN churn before `all-demo` reaches `CREATE_COMPLETE` again.
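A rough sketch of the recovery ladder as described above, assuming it is a dispatch on the umbrella's stack status. `plan_for_status`, `delete_umbrella`, and `recreate` are hypothetical names for illustration; only `use_canonical` and `sweep_orphan_stacks` appear in this PR, and the real script may handle more states.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical: map the umbrella's CloudFormation status to the ordered
# steps pre-deploy would take. Statuses not mentioned in this PR fall
# through to "unhandled" rather than guessing at the script's behaviour.
plan_for_status() {
  case "$1" in
    CREATE_COMPLETE)
      echo "use_canonical" ;;
    UPDATE_ROLLBACK_FAILED|DELETE_FAILED)
      # Delete the half-broken umbrella, sweep true orphans, recreate.
      echo "delete_umbrella sweep_orphan_stacks recreate" ;;
    *)
      echo "unhandled" ;;
  esac
}
```

For example, `plan_for_status DELETE_FAILED` prints the three recovery steps in order.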

Test plan

  • `bash -n` clean
  • Post-merge smoke on main reaches success on `all-demo` from the half-deleted starting state

CRITICAL hotfix. Post-merge smoke run on main found the umbrella in
CREATE_COMPLETE (the success state from the prior PR-CI deploy) and
went into use_canonical → sweep_orphan_stacks. The sweep matched
StackSummaries[?starts_with(StackName, \`all-demo-\`)] — which
includes ALL 16 of the umbrella's LIVE nested children. Pre-deploy
then started cheerfully deleting them one at a time. After 90 minutes
GH killed the job for timeout-minutes; by then 14-16 nested stacks had
been torn down, half-destroying the working umbrella.

Add a `ParentId==\`null\`` filter so the sweep only considers top-level
stacks (true orphans). Active nested children of any parent stack have
ParentId set and are now skipped.

Also switched the starts_with arg from backtick-literal to
single-quoted JMESPath string, matching the bucket/appregistry queries
we fixed earlier (defensive — backticks-as-JSON has bitten us twice
already).

The smoke account is now in a torn-half-down state from this incident;
the script's UPDATE_ROLLBACK_FAILED ladder should recover it on the
next run, but expect 30-90m of CFN churn before the umbrella reaches
CREATE_COMPLETE again.
@chrisns chrisns had a problem deploying to smoke-test-deploy May 20, 2026 05:39 — with GitHub Actions Failure
@chrisns chrisns added this pull request to the merge queue May 20, 2026
Merged via the queue into main with commit 47c3fba May 20, 2026
10 of 12 checks passed
@chrisns chrisns deleted the fix/smoke-sweep-orphans-only branch May 20, 2026 08:13