fix(smoke): wait for CFN to stabilise before deploying; narrow push trigger #257
Merged
The recurring `ValidationError ... is in UPDATE_ROLLBACK_IN_PROGRESS state and can not be updated` failure during a burst of smoke runs wasn't a concurrency problem (`concurrency: smoke` already serialises everything that touches AWS). It was a race inside `smoke-pre-deploy-state.sh`:

1. The `UPDATE_ROLLBACK_FAILED` branch called `continue-update-rollback`, an async op that flips the stack back into `UPDATE_ROLLBACK_IN_PROGRESS`, then immediately wrote `stack_name=$STACK` and returned. The next step tried `CreateChangeSet` ~1s later and got rejected.
2. The `*_IN_PROGRESS` branch only slept 60s before declaring the stack stuck and switching to a `*-recovery-<run_id>` name. An `all-demo` rollback takes 20-40 minutes, so the script gave up far too early and either created an orphan recovery stack or, when the deploy step ran against the original name, hit the same ValidationError.

Replace both with a polling waiter (`wait_for_stable`) that blocks up to 60 minutes for the stack to leave any `*_IN_PROGRESS` state. We poll ourselves rather than using `aws cloudformation wait` because the built-in waiter caps at 30 minutes and our umbrella rollbacks routinely exceed that.

`cancel-update-stack` is also no longer issued on `UPDATE_COMPLETE_CLEANUP_IN_PROGRESS` or `UPDATE_ROLLBACK_IN_PROGRESS` since CFN rejects it on those states anyway.

Additionally narrow `smoke.yml`'s `push: branches: [main]` trigger with the same `paths:` filter the PR trigger already uses, so unrelated main-branch pushes (docs, CI tweaks, scenario READMEs) don't queue up behind real smoke work.
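A minimal sketch of what such a polling waiter could look like. Only the name `wait_for_stable` and the 60-minute ceiling come from the commit message; the `stack_status` helper, the argument order and the 30-second poll interval are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: stack_status and the default interval are hypothetical.

# Print the stack's current status, or DOES_NOT_EXIST if describe-stacks
# fails. (Simplification: any failure is treated as "stack is gone".)
stack_status() {
  aws cloudformation describe-stacks --stack-name "$1" \
    --query 'Stacks[0].StackStatus' --output text 2>/dev/null \
    || echo "DOES_NOT_EXIST"
}

# Block until the stack leaves every *_IN_PROGRESS state.
# Prints the last observed status; returns 1 if the deadline passes first.
wait_for_stable() {
  local stack="$1"
  local max_seconds="${2:-3600}"  # 60 minutes, per the commit message
  local interval="${3:-30}"       # assumed poll interval
  local waited=0 status

  while :; do
    status="$(stack_status "$stack")"
    case "$status" in
      *_IN_PROGRESS) ;;                 # still moving, keep polling
      *) echo "$status"; return 0 ;;    # stable, or gone
    esac
    if [ "$waited" -ge "$max_seconds" ]; then
      echo "$status"
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}
```

A caller would then branch on the printed status, for example `status="$(wait_for_stable "$STACK")" || echo "still in progress after 60 min"`, and only write `stack_name` once the status is one CFN accepts for updates.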
First PR-CI run revealed the gap: when the umbrella enters UPDATE_ROLLBACK_FAILED because one or more leaf resources are stuck in UPDATE_FAILED, a plain `continue-update-rollback` retries the same failing resources and lands the stack right back in UPDATE_ROLLBACK_FAILED. The script then wrote stack_name and the deploy step hit the same ValidationError we set out to prevent.

Now, if the first continue-update-rollback settles but the stack is still UPDATE_ROLLBACK_FAILED, list the direct child resources currently in UPDATE_FAILED and retry the rollback with --resources-to-skip. CFN permits skipping nested-stack resources, so umbrella-wide UPDATE_FAILED caused by a deep leaf still gets unblocked.

If the skip retry also lands in UPDATE_ROLLBACK_FAILED, fall back to emit_recovery and open a stranded-stack issue for human triage.
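The skip retry described above might be sketched as follows. The function names `failed_children` and `retry_rollback_with_skip` are hypothetical; the two AWS CLI calls and the `--resources-to-skip` flag are the standard ones, but the exact query the script uses is an assumption.

```bash
#!/usr/bin/env bash
# Sketch only: function names are illustrative, not the script's own.

# Logical ids of direct child resources in UPDATE_FAILED, whitespace-separated.
failed_children() {
  aws cloudformation describe-stack-resources --stack-name "$1" \
    --query "StackResources[?ResourceStatus=='UPDATE_FAILED'].LogicalResourceId" \
    --output text
}

# Re-issue the rollback, skipping whatever is stuck.
# Returns 1 without calling CFN if there is nothing to skip.
retry_rollback_with_skip() {
  local stack="$1" skip
  skip="$(failed_children "$stack")"
  if [ -z "$skip" ] || [ "$skip" = "None" ]; then
    echo "no UPDATE_FAILED children under ${stack}; nothing to skip" >&2
    return 1
  fi
  echo "retrying rollback of ${stack}, skipping: ${skip}" >&2
  # Logical ids are alphanumeric, so unquoted word-splitting is safe here.
  # shellcheck disable=SC2086
  aws cloudformation continue-update-rollback \
    --stack-name "$stack" --resources-to-skip $skip
}
```

The retry is itself asynchronous, so the caller still has to wait for the stack to settle and re-check whether it ended in UPDATE_ROLLBACK_FAILED again.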
PR-CI exposed the next tier: even with --resources-to-skip on the failed AICC nested stack, continue-update-rollback can leave the umbrella in UPDATE_ROLLBACK_FAILED indefinitely. Falling back to a recovery name doesn't work either, because the original stack still owns globally-unique resources (AppRegistryApplication 'NDXTry_All_Scenarios_<acct>'), so the fresh-name deploy hits AlreadyExists on first create.

Add a final fallback: after the skip-retry still ends in UPDATE_ROLLBACK_FAILED, delete-stack the umbrella outright and wait for DELETE_COMPLETE. CFN accepts delete-stack from UPDATE_ROLLBACK_FAILED. If the delete itself fails (DELETE_FAILED), re-issue with --retain-resources for the stuck leaves and open a stranded-stack follow-up so humans clean those up later.

Either way the umbrella is now gone (or deleting + retained), so the next deploy creates it fresh under the original name and the AppRegistry conflict can't recur.
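A rough sketch of the delete-then-retain fallback. For brevity it uses the built-in `aws cloudformation wait stack-delete-complete` waiter rather than the script's own polling, and the function name `delete_with_retain_fallback` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: swaps in the built-in waiter where the script polls itself.

# Delete the stack; if that ends in DELETE_FAILED, delete again while
# retaining the resources that refused to go.
delete_with_retain_fallback() {
  local stack="$1" stuck

  aws cloudformation delete-stack --stack-name "$stack" || return 1
  if aws cloudformation wait stack-delete-complete --stack-name "$stack"; then
    return 0
  fi

  # The waiter failed: collect the leaves that are stuck in DELETE_FAILED.
  stuck="$(aws cloudformation describe-stack-resources --stack-name "$stack" \
    --query "StackResources[?ResourceStatus=='DELETE_FAILED'].LogicalResourceId" \
    --output text)"
  if [ -z "$stuck" ] || [ "$stuck" = "None" ]; then
    echo "delete of ${stack} failed with no DELETE_FAILED resources" >&2
    return 1
  fi

  echo "retaining stuck resources of ${stack}: ${stuck}" >&2
  # shellcheck disable=SC2086  # logical ids contain no whitespace
  aws cloudformation delete-stack --stack-name "$stack" \
    --retain-resources $stuck || return 1
  aws cloudformation wait stack-delete-complete --stack-name "$stack"
}
```

Anything passed to `--retain-resources` stays behind in the account, which is why the commit pairs this path with a stranded-stack follow-up.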
When delete-stack of the umbrella succeeds via --retain-resources for
stuck nested stacks, those nested stacks live on as top-level orphans
(StackName all-demo-<LogicalId>-<random>) and still own globally-unique
child resources — AppRegistryApplication names in particular. The fresh
umbrella deploy then collides on those names when its own nested stack
(e.g. PaperlessNgx) tries to create its AppRegistryApplication.
After the umbrella reaches DOES_NOT_EXIST/DELETE_COMPLETE, list all
top-level stacks whose name starts with ${STACK}- and try delete-stack
on each. The orphan delete usually succeeds because the parent-child
race that originally blocked it is gone. If an orphan still ends in
DELETE_FAILED, retain its stuck leaves and open a stranded-stack issue
for human triage; we don't recurse further to keep the script bounded.
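One way the listing and first delete pass could be sketched. The function names and the awk-based filter are assumptions; the real sweep also waits for each delete and handles DELETE_FAILED, which is left out here.

```bash
#!/usr/bin/env bash
# Sketch only: covers listing and the initial delete, not the retain retry.

# Names of top-level stacks whose name starts with "<stack>-".
# In text output a missing ParentId prints as "None", which marks a
# top-level stack; nested stacks carry their parent's stack id instead.
list_orphan_stacks() {
  local stack="$1"
  aws cloudformation describe-stacks \
    --query 'Stacks[].[StackName, ParentId]' --output text \
    | awk -v prefix="${stack}-" \
        '$2 == "None" && index($1, prefix) == 1 { print $1 }'
}

# Issue delete-stack for every orphan; failures are logged, not fatal.
sweep_orphan_stacks() {
  local stack="$1" orphan
  for orphan in $(list_orphan_stacks "$stack"); do
    echo "deleting orphan stack ${orphan}" >&2
    aws cloudformation delete-stack --stack-name "$orphan" \
      || echo "delete-stack failed for ${orphan}; leaving for triage" >&2
  done
}
```

Because the prefix includes the trailing hyphen, the umbrella itself (`all-demo`) never matches, only its former children.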
Yesterday's UPDATE_ROLLBACK_FAILED → delete-with-retain path left orphan nested stacks (all-demo-PaperlessNgx-*, etc.) in the smoke account. They hold globally-unique resources (AppRegistryApplication names) that block the umbrella's next create. The orphan-sweep code was only wired into the UPDATE_ROLLBACK_FAILED branch, so a follow-up run from DOES_NOT_EXIST (stack is gone, orphans linger) still tripped on the same conflict.

Refactor: extract the sweep into sweep_orphan_stacks() and call it from a new use_canonical() helper, which every "use the original STACK name" exit path goes through. Each branch that calls use_canonical now sweeps exactly once, just before stack_name is written, regardless of how we got into a deployable state.

emit_recovery still bypasses the sweep (recovery names rarely collide on globally-unique resources, and the sweep would add latency on the give-up path).
PR-CI exposed two more failure shapes after the last fix:

1. An initial status of ROLLBACK_IN_PROGRESS would settle to ROLLBACK_COMPLETE inside the *_IN_PROGRESS branch's wait_for_stable, and the branch then called use_canonical with a status that CFN refuses for updates.
2. An orphan whose delete-with-retain hit the 30m wait timeout used to be abandoned (issue opened, script proceeded). The lingering orphan still blocked the umbrella's recreation by holding AppRegistryApplication names, so the subsequent deploy failed in the same way.

Rebuild the script as a re-evaluation loop (max 8 iterations) that re-reads the stack's status after every CFN-mutating call and re-dispatches via the case statement. Transitions like ROLLBACK_IN_PROGRESS → ROLLBACK_COMPLETE → delete + recreate now flow naturally without explicit chaining.

Orphan cleanup is hardened with a three-stage cleanup_orphan(): plain delete, retain-on-DELETE_FAILED retry, then force-retain-everything as a last resort. Force-retain leaves debris on the account (logged as a stranded-stack issue) but at least the orphan stops blocking the umbrella's create.

Also handle ROLLBACK_FAILED (rare; same shape as UPDATE_ROLLBACK_FAILED) and DELETE_FAILED-with-retain (so the script doesn't fall off the case statement into the * branch).
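The shape of such a re-evaluation loop, reduced to a few states. Everything named here (`resolve_stack`, `current_status`, `wait_until_stable`) is hypothetical, the branches are heavily simplified, and the printed words `canonical` and `recovery` merely stand in for the script's use_canonical and emit_recovery exits.

```bash
#!/usr/bin/env bash
# Sketch only: a reduced state machine, not the script's full case statement.
MAX_ITERATIONS=8

current_status() {
  aws cloudformation describe-stacks --stack-name "$1" \
    --query 'Stacks[0].StackStatus' --output text 2>/dev/null \
    || echo "DOES_NOT_EXIST"
}

# Simplified stand-in for wait_for_stable (no deadline).
wait_until_stable() {
  local s
  while :; do
    s="$(current_status "$1")"
    case "$s" in
      *_IN_PROGRESS) sleep "${POLL_SECONDS:-30}" ;;
      *) return 0 ;;
    esac
  done
}

# Re-read the status on every pass and dispatch on it.
resolve_stack() {
  local stack="$1" i=0 status
  while [ "$i" -lt "$MAX_ITERATIONS" ]; do
    i=$(( i + 1 ))
    status="$(current_status "$stack")"
    echo "iteration ${i}: ${stack} is ${status}" >&2
    case "$status" in
      DOES_NOT_EXIST|CREATE_COMPLETE|UPDATE_COMPLETE|UPDATE_ROLLBACK_COMPLETE)
        echo "canonical"; return 0 ;;          # deployable under original name
      *_IN_PROGRESS)
        wait_until_stable "$stack" ;;          # then re-dispatch next pass
      ROLLBACK_COMPLETE|ROLLBACK_FAILED)
        aws cloudformation delete-stack --stack-name "$stack" || true
        wait_until_stable "$stack" ;;
      UPDATE_ROLLBACK_FAILED)
        aws cloudformation continue-update-rollback --stack-name "$stack" || true
        wait_until_stable "$stack" ;;
      *)
        break ;;                               # unknown state: give up
    esac
  done
  echo "recovery"
  return 1
}
```

No branch needs to know what comes next: each one mutates (or waits) and hands control back to the top of the loop, and the iteration cap keeps a state that never changes from spinning forever.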
…slate

Force-retain orphan-stack cleanups leave behind non-stack-owned resources that still hold globally-unique names. The umbrella's next create then trips on AlreadyExists at the leaf:

    ndx-try-planning-docs-<acct>-us-east-1 already exists
    NDXTry_All_Scenarios_<acct> AppRegistryApplication conflict

Pattern: scenario templates use deterministic names with the account id in the suffix (S3) or NDXTry_ prefix (AppRegistry).

Add two new sweep helpers, sweep_orphan_s3_buckets (object-version-aware empty + rb) and sweep_orphan_appregistry, and invoke them from use_canonical *only* when the umbrella is truly absent from CFN (status DOES_NOT_EXIST). That DOES_NOT_EXIST guard is important: running these while CFN is mid-create would yank live state out from under the stack. The smoke account hosts nothing but smoke fixtures, so blanket-deleting matching resources when no stack exists is safe.

If a bucket or app refuses deletion, open a stranded-stack issue and keep going: the deploy can still proceed against the names we did manage to free, and the remaining few will surface a clear failure that a human can clean up.
…orphans

PR-CI run showed "No ndx-try-*464453619983* buckets to sweep." even though five such buckets demonstrably existed (the next deploy failed with AlreadyExists on every one).

Root cause: JMESPath backtick-literals parse their contents as JSON, so the 12-digit account id became a *number* in the expression, and `contains(Name, <number>)` against a string never matches.

Switch to single-quoted JMESPath strings for both the bucket filter (account-id-bearing names) and the AppRegistry filter (NDXTry_ prefix). Single-quoted string literals are interpreted as raw strings regardless of content, avoiding the JSON-parse pitfall.
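To make the quoting difference concrete, here is a small sketch of building the bucket filter with raw-string literals. The helper names and the exact filter expression are assumptions; `starts_with` and `contains` are standard JMESPath functions.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and the exact filter are illustrative.
#
# Broken: backticks make a JSON literal, so the account id parses as a number
# and contains() never matches a string name:
#   Buckets[?contains(Name, `464453619983`)].Name
# Fixed: single quotes make a raw string literal, whatever the content.

# Print the JMESPath filter for smoke buckets belonging to one account.
bucket_filter() {
  local acct="$1"
  printf "Buckets[?starts_with(Name, 'ndx-try-') && contains(Name, '%s')].Name" "$acct"
}

# List matching bucket names, whitespace-separated.
list_smoke_buckets() {
  aws s3api list-buckets --query "$(bucket_filter "$1")" --output text
}
```

Keeping the expression in its own function makes it easy to echo the exact query into the CI log, which is how a silently empty match like this one tends to get noticed.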
… sweep

PR-CI run 26110208253 exposed two more orphan classes:

1. Bucket sweep claimed success but ndx-try-paperless-archive-v2-... was demonstrably present 21 s later when deploy ran. The old code swallowed stderr on every step (delete-objects, s3 rb), so a permissions error or partial delete looked identical to a clean delete. Refactored into delete_bucket_completely(): three attempts per bucket, stderr surfaced on each, and a positive head-bucket check verifies the bucket actually disappeared (rather than trusting the rb exit code).
2. Amazon Connect instances survive CFN delete and their alias is account-globally-unique. Added sweep_orphan_connect: list instances, filter to ndx-try-* aliases, delete-instance each. Failures route to a stranded-stack issue.

Both new sweeps run from use_canonical only when the umbrella is DOES_NOT_EXIST, matching the existing safety gate.
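The verify-after-delete idea from item 1 could look roughly like this. It is deliberately simpler than delete_bucket_completely(): it leans on `aws s3 rb --force`, which does not remove old object versions, and the names `bucket_exists` and `delete_bucket_verified` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: no version-aware emptying; names are illustrative.

# True while head-bucket succeeds. Any failure (including 403) counts as
# "gone" here; a stricter check would inspect the error code.
bucket_exists() {
  aws s3api head-bucket --bucket "$1" >/dev/null 2>&1
}

# Up to three attempts. Success is decided by head-bucket, not by rb's
# exit code, and rb's stderr is surfaced on every failed attempt.
delete_bucket_verified() {
  local bucket="$1" attempt err
  for attempt in 1 2 3; do
    if ! err="$(aws s3 rb "s3://${bucket}" --force 2>&1 >/dev/null)"; then
      echo "attempt ${attempt}: rb ${bucket} failed:" >&2
      echo "    ${err}" >&2
    fi
    if ! bucket_exists "$bucket"; then
      return 0
    fi
    sleep "${RETRY_SECONDS:-5}"
  done
  return 1
}
```

A non-zero return is the caller's cue to open the stranded-stack issue instead of reporting the sweep as clean.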
…ollback

ROLLBACK_FAILED (initial CREATE rolled back, rollback itself failed) does not accept continue-update-rollback; that API verb is for the UPDATE variant. The old branch silently no-op'd and the loop burned all 8 iterations in 18 seconds, fell out to emit_recovery, and the deploy under the recovery name then collided with leftover orphan resources that hadn't been swept (sweep gates on DOES_NOT_EXIST).

Mirror the ROLLBACK_COMPLETE branch's structure: delete-stack, then retain-on-DELETE_FAILED retry, then `continue` so the next iteration sees DOES_NOT_EXIST and routes through use_canonical (which then runs the resource sweep).
The Paperless-ngx scenario uses AWS::S3Files::FileSystem (the newer
mountpoint-style S3 filesystem service, ARN namespace s3files) attached
to its archive bucket. CFN delete with --retain-resources leaves the
file system orphaned, and the bucket then refuses every subsequent
delete with BucketHasS3FileSystemAttached.
Add sweep_orphan_s3files: `aws s3files list-file-systems`, filter by
bucket name matching our ndx-try-*${acct}* pattern, delete-file-system
--force-delete each match. The s3files API was added to AWS CLI in
2025-05, so newer than the local CLI but available on the GH Actions
runner. If list-file-systems is missing, log + continue rather than
crash.
Call sweep_orphan_s3files BEFORE sweep_orphan_s3_buckets in
use_canonical so the buckets have a chance to delete cleanly. Delete is
async — added a 60s grace period before bucket sweep runs.
Pre-deploy failed with:
aws: [ERROR]: An error occurred (ParamValidation): Error parsing
parameter '--delete': Expected: '=', received: '"' for input
Root cause: I was building --delete as CLI shorthand prefixed with
`Objects=` and then concatenating a JSON array. The CLI parses shorthand
character-by-character and rejects JSON's double-quoted keys.
Two fixes:
1. Build the payload as full JSON via `jq -n --argjson o "$versions"
'{Objects: $o}'`; the CLI auto-detects values starting with `{` as
JSON and parses correctly.
2. Stop piping stderr through sed. With set -euo pipefail, a non-zero
exit anywhere in the pipeline kills the script (and the earlier aws CLI
parse error sent us down exactly that path). Capture stderr to a var
and echo it indented on a separate line.
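Both fixes can be sketched together. The `jq` invocation follows the commit message (with `-c` added for compact output); the function names and the assumption that `$versions` holds a JSON array of `{Key, VersionId}` objects are illustrative.

```bash
#!/usr/bin/env bash
# Sketch only: function names are hypothetical; requires jq on PATH.

# Build the --delete payload as full JSON from a JSON array such as
#   [{"Key":"a.txt","VersionId":"v1"}]
build_delete_payload() {
  jq -cn --argjson o "$1" '{Objects: $o}'
}

# Delete the given object versions. stderr is captured into a variable
# instead of being piped, so a failure cannot trip pipefail mid-pipeline.
delete_object_versions() {
  local bucket="$1" versions="$2" payload err
  payload="$(build_delete_payload "$versions")" || return 1
  if ! err="$(aws s3api delete-objects --bucket "$bucket" \
        --delete "$payload" 2>&1 >/dev/null)"; then
    echo "delete-objects on ${bucket} failed:" >&2
    echo "    ${err}" >&2
    return 1
  fi
}
```

The `2>&1 >/dev/null` order matters: stderr is pointed at the capture first, then stdout is discarded, so `err` holds only the error text.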
The S3 Files sweep in the same run did its job: deleted
fs-0ba2dd6e6d16e2fa8 and the dependent paperless-archive bucket got
deleted on attempt 1. The next bucket then tripped the parse error.
Summary
Concurrent smoke runs (e.g. the Renovate-PR burst this morning) kept failing with
```
aws: [ERROR]: An error occurred (ValidationError) when calling the CreateChangeSet operation:
Stack:arn:aws:cloudformation:us-east-1:464453619983:stack/all-demo/... is in
UPDATE_ROLLBACK_IN_PROGRESS state and can not be updated.
```
The `concurrency: smoke / cancel-in-progress: false` lock is already correct — only one smoke run hits AWS at a time. The actual bug was in `scripts/smoke-pre-deploy-state.sh`:
1. The `UPDATE_ROLLBACK_FAILED` branch called `continue-update-rollback` (async) and returned immediately, so the next step's `CreateChangeSet` hit a stack that was back in `UPDATE_ROLLBACK_IN_PROGRESS`.
2. The `*_IN_PROGRESS` branch slept only 60s before giving up, although an `all-demo` rollback takes 20-40 minutes.
Both branches now block on a polling waiter (`wait_for_stable`, 60 min max) until the stack leaves any `*_IN_PROGRESS` state. We poll ourselves because the built-in `aws cloudformation wait` caps at 30 min. Also stopped issuing `cancel-update-stack` on `UPDATE_COMPLETE_CLEANUP_IN_PROGRESS` and `UPDATE_ROLLBACK_IN_PROGRESS` since CFN rejects it on those states anyway.
Bundled in: narrow `smoke.yml`'s `push: branches: [main]` trigger with the same `paths:` filter the PR trigger already uses. Doc-only or CI-only commits to main no longer queue smoke runs.
Test plan
Out of scope