fix(smoke): wait for CFN to stabilise before deploying; narrow push trigger by chrisns · Pull Request #257 · co-cddo/ndx_try_aws_scenarios

chrisns · 2026-05-19T11:41:57Z

Summary

Concurrent smoke runs (e.g. the Renovate-PR burst this morning) kept failing with

```
aws: [ERROR]: An error occurred (ValidationError) when calling the CreateChangeSet operation:
Stack:arn:aws:cloudformation:us-east-1:464453619983:stack/all-demo/... is in
UPDATE_ROLLBACK_IN_PROGRESS state and can not be updated.
```

The `concurrency: smoke / cancel-in-progress: false` lock is already correct — only one smoke run hits AWS at a time. The actual bug was in `scripts/smoke-pre-deploy-state.sh`:

1. The `UPDATE_ROLLBACK_FAILED` branch ran `continue-update-rollback` (an async op that flips the stack into `UPDATE_ROLLBACK_IN_PROGRESS`), then wrote `stack_name` and returned immediately. The deploy step raced straight back into the in-flight rollback.
2. The `*_IN_PROGRESS` branch only slept 60s before giving up. An `all-demo` umbrella rollback takes 20–40 min, so the script either created a useless `*-recovery-<run_id>` stack or, more often, the next deploy hit the same ValidationError.

Both branches now block on a polling waiter (`wait_for_stable`, 60 min max) until the stack leaves any `*_IN_PROGRESS` state. We poll ourselves because the built-in `aws cloudformation wait` caps at 30 min. Also stopped issuing `cancel-update-stack` on `UPDATE_COMPLETE_CLEANUP_IN_PROGRESS` and `UPDATE_ROLLBACK_IN_PROGRESS` since CFN rejects it on those states anyway.
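A minimal sketch of what such a polling waiter could look like, not the script's actual code. The `stack_status` helper, the 30s poll interval, and the argument order are assumptions made for illustration; only the `wait_for_stable` name and the 60 min ceiling come from the description above.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. stack_status and the poll interval are assumed.

# Print the stack's current status, or DOES_NOT_EXIST if describe-stacks fails.
stack_status() {
  aws cloudformation describe-stacks --stack-name "$1" \
    --query 'Stacks[0].StackStatus' --output text 2>/dev/null \
    || echo DOES_NOT_EXIST
}

# Block until the stack leaves every *_IN_PROGRESS state.
# Prints the last status seen; returns 1 if the deadline passes first.
wait_for_stable() {
  local stack="$1"
  local max_seconds="${2:-3600}"   # 60 min ceiling
  local interval="${3:-30}"        # assumed poll interval
  local waited=0 status
  while :; do
    status="$(stack_status "$stack")"
    case "$status" in
      *_IN_PROGRESS) ;;                    # still moving, keep polling
      *) echo "$status"; return 0 ;;       # settled (or gone)
    esac
    if [ "$waited" -ge "$max_seconds" ]; then
      echo "$status"
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

A caller would typically capture the settled status, e.g. `status="$(wait_for_stable "$STACK")"`, and only then decide whether the stack is deployable.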

Bundled in: narrow `smoke.yml`'s `push: branches: [main]` trigger with the same `paths:` filter the PR trigger already uses. Doc-only or CI-only commits to main no longer queue smoke runs.

Test plan

`bash -n scripts/smoke-pre-deploy-state.sh` — clean
Workflow YAML parses (`js-yaml`)
After merge: next push to main with a no-op-for-smoke path (e.g. README edit) does not trigger Smoke Pack
Next time the umbrella is in `UPDATE_ROLLBACK_IN_PROGRESS` at the start of a run, pre-deploy waits it out instead of failing immediately

Out of scope

The underlying cause of the original rollback (separate investigation; pre-existing umbrella churn)
Cross-workflow concurrency (only `smoke.yml` touches the smoke account, so the existing per-workflow group is enough)

…rigger

The recurring `ValidationError ... is in UPDATE_ROLLBACK_IN_PROGRESS state and can not be updated` failure during a burst of smoke runs wasn't a concurrency problem (`concurrency: smoke` already serialises everything that touches AWS). It was a race inside `smoke-pre-deploy-state.sh`:

1. The `UPDATE_ROLLBACK_FAILED` branch called `continue-update-rollback`, an async op that flips the stack back into `UPDATE_ROLLBACK_IN_PROGRESS`, then immediately wrote `stack_name=$STACK` and returned. The next step tried `CreateChangeSet` ~1s later and got rejected.
2. The `*_IN_PROGRESS` branch only slept 60s before declaring the stack stuck and switching to a `*-recovery-<run_id>` name. An `all-demo` rollback takes 20-40 minutes, so the script gave up far too early and either created an orphan recovery stack or, when the deploy step ran against the original name, hit the same ValidationError.

Replace both with a polling waiter (`wait_for_stable`) that blocks up to 60 minutes for the stack to leave any `*_IN_PROGRESS` state. We poll ourselves rather than using `aws cloudformation wait` because the built-in waiter caps at 30 minutes and our umbrella rollbacks routinely exceed that. `cancel-update-stack` is also no longer issued on `UPDATE_COMPLETE_CLEANUP_IN_PROGRESS` or `UPDATE_ROLLBACK_IN_PROGRESS` since CFN rejects it on those states anyway.

Additionally narrow `smoke.yml`'s `push: branches: [main]` trigger with the same `paths:` filter the PR trigger already uses, so unrelated main-branch pushes (docs, CI tweaks, scenario READMEs) don't queue up behind real smoke work.

First PR-CI run revealed the gap: when the umbrella enters UPDATE_ROLLBACK_FAILED because one or more leaf resources are stuck in UPDATE_FAILED, a plain `continue-update-rollback` retries the same failing resources and lands the stack right back in UPDATE_ROLLBACK_FAILED. The script then wrote stack_name and the deploy step hit the same ValidationError we set out to prevent.

Now, if the first continue-update-rollback resolves stable but is still UPDATE_ROLLBACK_FAILED, list the direct child resources currently in UPDATE_FAILED and retry the rollback with --resources-to-skip. CFN permits skipping nested-stack resources, so umbrella-wide UPDATE_FAILED caused by a deep leaf still gets unblocked.

If the skip retry also lands in UPDATE_ROLLBACK_FAILED, fall back to emit_recovery and open a stranded-stack issue for human triage.
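The skip-retry described above could be sketched roughly as follows. The function names `failed_resources` and `retry_rollback_skipping_failed` are hypothetical, and the real script may gather the resource list differently; the `describe-stack-resources` and `continue-update-rollback --resources-to-skip` calls are standard AWS CLI.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; both function names are hypothetical.

# Logical IDs of the stack's direct resources in UPDATE_FAILED, one per line.
failed_resources() {
  aws cloudformation describe-stack-resources --stack-name "$1" \
    --query "StackResources[?ResourceStatus=='UPDATE_FAILED'].LogicalResourceId" \
    --output text | tr '\t' '\n' | sed -e '/^$/d' -e '/^None$/d'
}

# Retry the rollback, telling CFN to skip whatever is stuck in UPDATE_FAILED.
retry_rollback_skipping_failed() {
  local stack="$1"
  # Word splitting is intentional: CFN logical IDs are alphanumeric.
  set -- $(failed_resources "$stack")
  if [ "$#" -eq 0 ]; then
    echo "no UPDATE_FAILED resources to skip on ${stack}" >&2
    return 1
  fi
  aws cloudformation continue-update-rollback --stack-name "$stack" \
    --resources-to-skip "$@"
}
```

Because `continue-update-rollback` is asynchronous, the caller still has to wait for the stack to settle and re-read its status afterwards.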

PR-CI exposed the next tier: even with --resources-to-skip on the failed AICC nested stack, continue-update-rollback can leave the umbrella in UPDATE_ROLLBACK_FAILED indefinitely. Falling back to a recovery name doesn't work either, because the original stack still owns globally-unique resources (AppRegistryApplication 'NDXTry_All_Scenarios_<acct>'), so the fresh-name deploy hits AlreadyExists on first create.

Add a final fallback: after the skip-retry still ends in UPDATE_ROLLBACK_FAILED, delete-stack the umbrella outright and wait for DELETE_COMPLETE. CFN accepts delete-stack from UPDATE_ROLLBACK_FAILED. If the delete itself fails (DELETE_FAILED), re-issue with --retain-resources for the stuck leaves and open a stranded-stack follow-up so humans clean those up later.

Either way the umbrella is now gone (or deleting + retained), so the next deploy creates it fresh under the original name and the AppRegistry conflict can't recur.
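As a rough illustration of the delete-then-retain fallback, assuming the built-in `stack-delete-complete` waiter is acceptable for the delete path. `delete_with_retain` and `delete_failed_resources` are hypothetical names, and the sketch treats any waiter failure as DELETE_FAILED, which a real script would want to confirm by re-reading the status.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; both function names are hypothetical.

# Logical IDs of the stack's direct resources stuck in DELETE_FAILED.
delete_failed_resources() {
  aws cloudformation describe-stack-resources --stack-name "$1" \
    --query "StackResources[?ResourceStatus=='DELETE_FAILED'].LogicalResourceId" \
    --output text | tr '\t' '\n' | sed -e '/^$/d' -e '/^None$/d'
}

# Delete the stack; if that fails, retry while retaining the stuck resources.
delete_with_retain() {
  local stack="$1"
  aws cloudformation delete-stack --stack-name "$stack"
  if aws cloudformation wait stack-delete-complete --stack-name "$stack"; then
    return 0
  fi
  # Word splitting is intentional: CFN logical IDs are alphanumeric.
  set -- $(delete_failed_resources "$stack")
  if [ "$#" -eq 0 ]; then
    echo "delete of ${stack} failed with nothing to retain" >&2
    return 1
  fi
  echo "retaining stuck resources on ${stack}: $*" >&2
  aws cloudformation delete-stack --stack-name "$stack" --retain-resources "$@"
  aws cloudformation wait stack-delete-complete --stack-name "$stack"
}
```

Retained resources are left in the account untouched, which is why the commit pairs this path with a stranded-stack follow-up issue.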

When delete-stack of the umbrella succeeds via --retain-resources for stuck nested stacks, those nested stacks live on as top-level orphans (StackName all-demo-<LogicalId>-<random>) and still own globally-unique child resources, AppRegistryApplication names in particular. The fresh umbrella deploy then collides on those names when its own nested stack (e.g. PaperlessNgx) tries to create its AppRegistryApplication.

After the umbrella reaches DOES_NOT_EXIST/DELETE_COMPLETE, list all top-level stacks whose name starts with ${STACK}- and try delete-stack on each. The orphan delete usually succeeds because the parent-child race that originally blocked it is gone. If an orphan still ends in DELETE_FAILED, retain its stuck leaves and open a stranded-stack issue for human triage; we don't recurse further to keep the script bounded.
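One way the orphan listing and sweep might look, as a sketch under assumptions. It assumes a top-level stack can be recognised by an absent `ParentId` in `describe-stacks` output; `list_orphan_stacks` is a hypothetical helper, and the retain-on-failure step is left out for brevity.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; list_orphan_stacks is a hypothetical helper.

# Names of live top-level stacks that start with "<stack>-", one per line.
# Assumption: nested stacks carry a ParentId, top-level orphans do not.
list_orphan_stacks() {
  aws cloudformation describe-stacks \
    --query "Stacks[?starts_with(StackName, '${1}-') && !ParentId].StackName" \
    --output text | tr '\t' '\n' | sed -e '/^$/d' -e '/^None$/d'
}

# Try a plain delete on every orphan; report the ones that need triage.
sweep_orphan_stacks() {
  local stack="$1" orphan failed=0
  for orphan in $(list_orphan_stacks "$stack"); do
    echo "deleting orphan ${orphan}" >&2
    aws cloudformation delete-stack --stack-name "$orphan"
    if ! aws cloudformation wait stack-delete-complete --stack-name "$orphan"; then
      echo "orphan ${orphan} did not delete cleanly; needs retain or triage" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

The sweep is deliberately one level deep, matching the decision above not to recurse into orphans of orphans.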

Yesterday's UPDATE_ROLLBACK_FAILED → delete-with-retain path left orphan nested stacks (all-demo-PaperlessNgx-*, etc.) in the smoke account. They hold globally-unique resources (AppRegistryApplication names) that block the umbrella's next create. The orphan-sweep code was only wired into the UPDATE_ROLLBACK_FAILED branch, so a follow-up run from DOES_NOT_EXIST (stack is gone, orphans linger) still tripped on the same conflict.

Refactor: extract sweep into sweep_orphan_stacks() and call it from a new use_canonical() helper, which every "use the original STACK name" exit path goes through. Each branch that calls use_canonical now sweeps exactly once, just before stack_name is written, regardless of how we got into a deployable state. emit_recovery still bypasses (recovery names rarely collide on globally-unique resources, and the sweep would add latency on the give-up path).

PR-CI exposed two more failure shapes after the last fix:

1. Initial status of ROLLBACK_IN_PROGRESS would settle to ROLLBACK_COMPLETE inside the *_IN_PROGRESS branch's wait_for_stable, and the branch then called use_canonical with a status that CFN refuses for updates.
2. An orphan whose delete-with-retain hit the 30m wait timeout used to be abandoned (issue opened, script proceeded). The lingering orphan still blocked the umbrella's recreation by holding AppRegistryApplication names, so the subsequent deploy failed in the same way.

Rebuild the script as a re-evaluation loop (max 8 iterations) that re-reads the stack's status after every CFN-mutating call and re-dispatches via the case statement. Transitions like ROLLBACK_IN_PROGRESS → ROLLBACK_COMPLETE → delete + recreate now flow naturally without explicit chaining.

Orphan cleanup is hardened with a three-stage cleanup_orphan(): plain delete, retain-on-DELETE_FAILED retry, then force-retain-everything as a last resort. Force-retain leaves debris on the account (logged as a stranded-stack issue) but at least the orphan stops blocking the umbrella's create.

Also handle ROLLBACK_FAILED (rare; same shape as UPDATE_ROLLBACK_FAILED) and DELETE_FAILED-with-retain (so the script doesn't fall off the case statement into the * branch).
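The re-evaluation loop can be pictured with a simplified sketch like the one below. The status-to-action mapping, the `next_action` and `resolve_stack` names, and the `do_*` handlers are all illustrative assumptions; the real script's branches do considerably more (skip retries, retain retries, sweeps).

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the mapping and all helper names are assumed.
MAX_ITERATIONS=8

stack_status() {
  aws cloudformation describe-stacks --stack-name "$1" \
    --query 'Stacks[0].StackStatus' --output text 2>/dev/null \
    || echo DOES_NOT_EXIST
}

# Map a stack status to the next step (simplified for illustration).
next_action() {
  case "$1" in
    *_IN_PROGRESS) echo wait ;;
    DOES_NOT_EXIST|CREATE_COMPLETE|UPDATE_COMPLETE|UPDATE_ROLLBACK_COMPLETE)
      echo use_canonical ;;
    UPDATE_ROLLBACK_FAILED) echo continue_rollback ;;
    ROLLBACK_COMPLETE|ROLLBACK_FAILED|DELETE_FAILED) echo delete ;;
    *) echo emit_recovery ;;
  esac
}

# Minimal stand-ins for the CFN-mutating handlers.
do_wait()              { sleep 30; }
do_continue_rollback() { aws cloudformation continue-update-rollback --stack-name "$1"; }
do_delete()            { aws cloudformation delete-stack --stack-name "$1"; }

# Re-read the status after every mutating step and dispatch again.
# Prints the terminal decision: use_canonical or emit_recovery.
resolve_stack() {
  local stack="$1" i=0 status action
  while [ "$i" -lt "$MAX_ITERATIONS" ]; do
    i=$((i + 1))
    status="$(stack_status "$stack")"
    action="$(next_action "$status")"
    echo "iteration ${i}: ${status} -> ${action}" >&2
    case "$action" in
      use_canonical|emit_recovery) echo "$action"; return 0 ;;
      *) "do_${action}" "$stack" >&2 || true ;;
    esac
  done
  echo emit_recovery   # iteration budget exhausted
}
```

Keeping the mapping in a pure function makes each transition easy to check in isolation, and the iteration cap bounds the worst case when a handler silently does nothing.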

…slate

Force-retain orphan-stack cleanups leave behind non-stack-owned resources that still hold globally-unique names. The umbrella's next create then trips on AlreadyExists at the leaf:

  ndx-try-planning-docs-<acct>-us-east-1 already exists
  NDXTry_All_Scenarios_<acct> AppRegistryApplication conflict

Pattern: scenario templates use deterministic names with the account id in the suffix (S3) or NDXTry_ prefix (AppRegistry).

Add two new sweep helpers, sweep_orphan_s3_buckets (object-version-aware empty + rb) and sweep_orphan_appregistry, and invoke them from use_canonical *only* when the umbrella is truly absent from CFN (status DOES_NOT_EXIST). That DOES_NOT_EXIST guard is important: running these while CFN is mid-create would yank live state out from under the stack. The smoke account hosts nothing but smoke fixtures, so blanket-deleting matching resources when no stack exists is safe.

If a bucket or app refuses deletion, open a stranded-stack issue and keep going: the deploy can still proceed against the names we did manage to free, and the remaining few will surface a clear failure that a human can clean up.

…orphans

PR-CI run showed "No ndx-try-*464453619983* buckets to sweep." even though five such buckets demonstrably existed (the next deploy failed with AlreadyExists on every one).

Root cause: JMESPath backtick-literals parse their contents as JSON, so the 12-digit account id became a *number* in the expression, and `contains(Name, <number>)` against a string never matches.

Switch to single-quoted JMESPath strings for both the bucket filter (account-id-bearing names) and the AppRegistry filter (NDXTry_ prefix). Single-quoted string literals are interpreted as raw strings regardless of content, avoiding the JSON-parse pitfall.
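To make the quoting difference concrete, here is a small sketch. The helper names `bucket_filter` and `list_account_buckets` are hypothetical and the account id is a placeholder; the exact query the script uses may differ.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; helper names and the account id are placeholders.

# Problematic form: a backtick literal is parsed as JSON, so an all-digit
# account id is read as a number rather than a string:
#   --query 'Buckets[?contains(Name, `123456789012`)].Name'
#
# Safer form: a single-quoted JMESPath raw string is always a string.
bucket_filter() {
  printf "Buckets[?contains(Name, '%s')].Name" "$1"
}

# List bucket names containing the given account id.
list_account_buckets() {
  aws s3api list-buckets --query "$(bucket_filter "$1")" --output text
}
```

Wrapping the shell string in double quotes keeps the JMESPath single quotes intact while still letting the account id be interpolated.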

… sweep

PR-CI run 26110208253 exposed two more orphan classes:

1. Bucket sweep claimed success but ndx-try-paperless-archive-v2-... was demonstrably present 21 s later when deploy ran. The old code swallowed stderr on every step (delete-objects, s3 rb), so a permissions error or partial delete looked identical to a clean delete. Refactored into delete_bucket_completely(): three attempts per bucket, stderr surfaced on each, and a positive head-bucket check verifies the bucket actually disappeared (rather than trusting the rb exit code).
2. Amazon Connect instances survive CFN delete and their alias is account-globally-unique. Added sweep_orphan_connect: list instances, filter to ndx-try-* aliases, delete-instance each. Failures route to a stranded-stack issue.

Both new sweeps run from use_canonical only when the umbrella is DOES_NOT_EXIST, matching the existing safety gate.
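A reduced sketch of the retry-and-verify idea behind `delete_bucket_completely`. It leans on `aws s3 rb --force` and omits the object-version-aware emptying the real helper performs, so it is not sufficient for versioned buckets; the 5s pause between attempts is an assumption.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. Versioned-object cleanup is intentionally omitted.

delete_bucket_completely() {
  local bucket="$1" attempt err
  for attempt in 1 2 3; do
    # Capture stderr rather than discarding it, so failures are visible.
    if ! err="$(aws s3 rb "s3://${bucket}" --force 2>&1 >/dev/null)"; then
      echo "  attempt ${attempt}: rb failed: ${err}" >&2
    fi
    # Verify independently of the rb exit code. head-bucket exits non-zero
    # when the bucket is gone (a real script would also want to tell a 404
    # apart from a 403, since both fail here).
    if ! aws s3api head-bucket --bucket "$bucket" >/dev/null 2>&1; then
      echo "deleted ${bucket} on attempt ${attempt}"
      return 0
    fi
    sleep 5   # assumed pause before retrying
  done
  echo "  ${bucket} still present after 3 attempts" >&2
  return 1
}
```

The key design point is that success is defined by the bucket's absence, not by any single command's exit status.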

…ollback

ROLLBACK_FAILED (initial CREATE rolled back, rollback itself failed) does not accept continue-update-rollback; that API verb is for the UPDATE variant. The old branch silently no-op'd and the loop burned all 8 iterations in 18 seconds, fell out to emit_recovery, and the deploy under the recovery name then collided with leftover orphan resources that hadn't been swept (sweep gates on DOES_NOT_EXIST).

Mirror the ROLLBACK_COMPLETE branch's structure: delete-stack, then retain-on-DELETE_FAILED retry, then `continue` so the next iteration sees DOES_NOT_EXIST and routes through use_canonical (which then runs the resource sweep).

The Paperless-ngx scenario uses AWS::S3Files::FileSystem (the newer mountpoint-style S3 filesystem service, ARN namespace s3files) attached to its archive bucket. CFN delete with --retain-resources leaves the file system orphaned, and the bucket then refuses every subsequent delete with BucketHasS3FileSystemAttached.

Add sweep_orphan_s3files: `aws s3files list-file-systems`, filter by bucket name matching our ndx-try-*${acct}* pattern, delete-file-system --force-delete each match. The s3files API was added to AWS CLI in 2025-05, so newer than the local CLI but available on the GH Actions runner. If list-file-systems is missing, log + continue rather than crash.

Call sweep_orphan_s3files BEFORE sweep_orphan_s3_buckets in use_canonical so the buckets have a chance to delete cleanly. Delete is async, so a 60s grace period was added before the bucket sweep runs.

Pre-deploy failed with:

  aws: [ERROR]: An error occurred (ParamValidation): Error parsing parameter '--delete': Expected: '=', received: '"' for input

Root cause: I was building --delete as CLI shorthand prefixed with `Objects=` and then concatenating a JSON array. The CLI parses shorthand character-by-character and rejects JSON's double-quoted keys.

Two fixes:

1. Build the payload as full JSON via `jq -n --argjson o "$versions" '{Objects: $o}'`; the CLI auto-detects values starting with `{` as JSON and parses correctly.
2. Stop piping stderr through sed. With set -euo pipefail, a non-zero exit anywhere in the pipeline kills the script (and the earlier aws CLI parse error sent us down exactly that path). Capture stderr to a var and echo it indented on a separate line.

The S3 Files sweep in the same run did its job: deleted fs-0ba2dd6e6d16e2fa8 and the dependent paperless-archive bucket got deleted on attempt 1. The next bucket then tripped the parse error.
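Both fixes can be sketched in a few lines. `build_delete_payload` and `run_showing_stderr` are hypothetical helper names; the `jq` expression is the one quoted above, with `-c` added here only to keep the output on one line.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; both helper names are hypothetical.

# Fix 1: wrap a JSON array of {Key, VersionId} objects as full JSON.
# $1 is the JSON array as a string.
build_delete_payload() {
  jq -cn --argjson o "$1" '{Objects: $o}'
}

# Fix 2: run a command, capture its stderr in a variable (no pipeline, so
# pipefail cannot kill the script), then echo it indented. The command's
# stdout is discarded here and its exit code is passed through.
run_showing_stderr() {
  local err rc=0 line
  err="$("$@" 2>&1 >/dev/null)" || rc=$?
  if [ -n "$err" ]; then
    while IFS= read -r line; do
      printf '    %s\n' "$line" >&2
    done <<EOF
$err
EOF
  fi
  return "$rc"
}
```

A delete call might then read `run_showing_stderr aws s3api delete-objects --bucket "$bucket" --delete "$(build_delete_payload "$versions")"`, where `$bucket` and `$versions` are supplied by the caller.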

chrisns had a problem deploying to smoke-test-deploy May 19, 2026 11:42 — with GitHub Actions Failure

chrisns had a problem deploying to smoke-test-deploy May 19, 2026 12:14 — with GitHub Actions Failure

chrisns had a problem deploying to smoke-test-deploy May 19, 2026 12:34 — with GitHub Actions Failure

chrisns had a problem deploying to smoke-test-deploy May 19, 2026 12:58 — with GitHub Actions Error

chrisns had a problem deploying to smoke-test-deploy May 19, 2026 13:01 — with GitHub Actions Failure

chrisns had a problem deploying to smoke-test-deploy May 19, 2026 13:44 — with GitHub Actions Failure

chrisns had a problem deploying to smoke-test-deploy May 19, 2026 15:09 — with GitHub Actions Failure

chrisns had a problem deploying to smoke-test-deploy May 19, 2026 16:19 — with GitHub Actions Failure

chrisns had a problem deploying to smoke-test-deploy May 19, 2026 16:50 — with GitHub Actions Failure

chrisns had a problem deploying to smoke-test-deploy May 19, 2026 16:55 — with GitHub Actions Failure

chrisns had a problem deploying to smoke-test-deploy May 19, 2026 17:50 — with GitHub Actions Failure

chrisns temporarily deployed to smoke-test-deploy May 19, 2026 18:11 — with GitHub Actions Inactive

chrisns added this pull request to the merge queue May 19, 2026

Merged via the queue into main with commit 5d87af3 May 19, 2026
12 checks passed

chrisns deleted the fix/smoke-pre-deploy-wait branch May 19, 2026 21:45

chrisns merged 12 commits into main from fix/smoke-pre-deploy-wait
