fix(smoke): wait for CFN to stabilise before deploying; narrow push trigger#257

Merged: chrisns merged 12 commits into main from fix/smoke-pre-deploy-wait on May 19, 2026

@chrisns chrisns commented May 19, 2026

Summary

Concurrent smoke runs (e.g. the Renovate-PR burst this morning) kept failing with:

```
aws: [ERROR]: An error occurred (ValidationError) when calling the CreateChangeSet operation:
Stack:arn:aws:cloudformation:us-east-1:464453619983:stack/all-demo/... is in
UPDATE_ROLLBACK_IN_PROGRESS state and can not be updated.
```

The `concurrency: smoke / cancel-in-progress: false` lock is already correct — only one smoke run hits AWS at a time. The actual bug was in `scripts/smoke-pre-deploy-state.sh`:

  • `UPDATE_ROLLBACK_FAILED` ran `continue-update-rollback` (an async op that flips the stack into `UPDATE_ROLLBACK_IN_PROGRESS`) then wrote `stack_name` and returned immediately. The deploy step raced straight back into the in-flight rollback.
  • `*_IN_PROGRESS` only slept 60s before giving up. An `all-demo` umbrella rollback takes 20–40 min, so the script either fired up a useless `*-recovery-<run_id>` stack or — more often — the next deploy hit the same ValidationError.

Both branches now block on a polling waiter (`wait_for_stable`, 60 min max) until the stack leaves any `*_IN_PROGRESS` state. We poll ourselves because the built-in `aws cloudformation wait` caps at 30 min. Also stopped issuing `cancel-update-stack` on `UPDATE_COMPLETE_CLEANUP_IN_PROGRESS` and `UPDATE_ROLLBACK_IN_PROGRESS` since CFN rejects it on those states anyway.
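For illustration, a polling waiter of this shape could look roughly like the sketch below. It is not the code from `scripts/smoke-pre-deploy-state.sh`: the `stack_status` helper, the argument order, and the 30-second default interval are assumptions; only the name `wait_for_stable` and the 60-minute ceiling come from the description above.

```shell
# Hypothetical sketch, not the code from scripts/smoke-pre-deploy-state.sh.

# Print the stack's current status. Any describe-stacks failure is reported
# as DOES_NOT_EXIST, which is a simplification: a real script would want to
# tell "stack missing" apart from throttling or credential errors.
stack_status() {
  aws cloudformation describe-stacks --stack-name "$1" \
    --query 'Stacks[0].StackStatus' --output text 2>/dev/null \
    || echo "DOES_NOT_EXIST"
}

# Poll until the stack leaves every *_IN_PROGRESS state.
#   $1 stack name, $2 max wait in seconds (default 3600),
#   $3 poll interval in seconds (default 30, an assumed value)
# Prints the last status seen; returns 1 if still in progress at the deadline.
wait_for_stable() {
  local stack="$1" max="${2:-3600}" interval="${3:-30}"
  local waited=0 status
  while :; do
    status="$(stack_status "$stack")"
    case "$status" in
      *_IN_PROGRESS) ;;                    # still moving: keep polling
      *) echo "$status"; return 0 ;;       # stable, or the stack is gone
    esac
    if [ "$waited" -ge "$max" ]; then
      echo "$status"
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

A caller might use it as `final="$(wait_for_stable all-demo)" || echo "still $final after the deadline" >&2` and then branch on `$final`.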

Bundled in: narrow `smoke.yml`'s `push: branches: [main]` trigger with the same `paths:` filter the PR trigger already uses. Doc-only or CI-only commits to main no longer queue smoke runs.

Test plan

  • `bash -n scripts/smoke-pre-deploy-state.sh` — clean
  • Workflow YAML parses (`js-yaml`)
  • After merge: next push to main with a no-op-for-smoke path (e.g. README edit) does not trigger Smoke Pack
  • Next time the umbrella is in `UPDATE_ROLLBACK_IN_PROGRESS` at the start of a run, pre-deploy waits it out instead of failing immediately

Out of scope

  • The underlying cause of the original rollback (separate investigation; pre-existing umbrella churn)
  • Cross-workflow concurrency (only `smoke.yml` touches the smoke account, so the existing per-workflow group is enough)

…rigger

The recurring `ValidationError ... is in UPDATE_ROLLBACK_IN_PROGRESS state
and can not be updated` failure during a burst of smoke runs wasn't a
concurrency problem (`concurrency: smoke` already serialises everything
that touches AWS) — it was a race inside `smoke-pre-deploy-state.sh`:

1. `UPDATE_ROLLBACK_FAILED` branch called `continue-update-rollback`, an
   async op that flips the stack back into `UPDATE_ROLLBACK_IN_PROGRESS`,
   then immediately wrote `stack_name=$STACK` and returned. The next step
   tried `CreateChangeSet` ~1s later and got rejected.
2. `*_IN_PROGRESS` branch only slept 60s before declaring the stack stuck
   and switching to a `*-recovery-<run_id>` name. An `all-demo` rollback
   takes 20-40 minutes, so the script gave up far too early and either
   created an orphan recovery stack or, when the deploy step ran against
   the original name, hit the same ValidationError.

Replace both with a polling waiter (`wait_for_stable`) that blocks up to
60 minutes for the stack to leave any `*_IN_PROGRESS` state. We poll
ourselves rather than using `aws cloudformation wait` because the
built-in waiter caps at 30 minutes and our umbrella rollbacks routinely
exceed that. `cancel-update-stack` is also no longer issued on
`UPDATE_COMPLETE_CLEANUP_IN_PROGRESS` or `UPDATE_ROLLBACK_IN_PROGRESS`
since CFN rejects it on those states anyway.

Additionally narrow `smoke.yml`'s `push: branches: [main]` trigger with
the same `paths:` filter the PR trigger already uses, so unrelated
main-branch pushes (docs, CI tweaks, scenario READMEs) don't queue up
behind real smoke work.
@chrisns chrisns had a problem deploying to smoke-test-deploy May 19, 2026 11:42 — with GitHub Actions Failure
First PR-CI run revealed the gap: when the umbrella enters
UPDATE_ROLLBACK_FAILED because one or more leaf resources are stuck in
UPDATE_FAILED, a plain `continue-update-rollback` retries the same
failing resources and lands the stack right back in
UPDATE_ROLLBACK_FAILED. The script then wrote stack_name and the deploy
step hit the same ValidationError we set out to prevent.

Now, if the first continue-update-rollback resolves stable but is still
UPDATE_ROLLBACK_FAILED, list the direct child resources currently in
UPDATE_FAILED and re-try the rollback with --resources-to-skip. CFN
permits skipping nested-stack resources, so umbrella-wide UPDATE_FAILED
caused by a deep leaf still gets unblocked. If the skip retry also lands
in UPDATE_ROLLBACK_FAILED, fall back to emit_recovery and open a
stranded-stack issue for human triage.
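
The skip retry described above might be sketched as follows. The function names are hypothetical, and the sketch only looks at the umbrella's direct resources; it does not attempt the NestedStackName.LogicalId form that CFN uses for resources inside nested stacks.

```shell
# Hypothetical helpers; names are illustrative, not taken from the script.

# Logical IDs of the stack's direct resources in UPDATE_FAILED,
# whitespace-separated (empty when there are none).
failed_resources() {
  aws cloudformation describe-stack-resources --stack-name "$1" \
    --query "StackResources[?ResourceStatus=='UPDATE_FAILED'].LogicalResourceId" \
    --output text
}

# Re-issue continue-update-rollback, skipping whatever is stuck.
# Returns 1 when there is nothing to skip or the describe call fails.
retry_rollback_skipping_failed() {
  local stack="$1" ids
  ids="$(failed_resources "$stack")" || return 1
  if [ -z "$ids" ]; then
    echo "no UPDATE_FAILED resources to skip on $stack" >&2
    return 1
  fi
  echo "retrying rollback of $stack, skipping: $ids" >&2
  # Intentional word splitting: each logical ID becomes its own argument.
  # shellcheck disable=SC2086
  aws cloudformation continue-update-rollback --stack-name "$stack" \
    --resources-to-skip $ids
}
```

The call is asynchronous, so the caller still has to wait for the stack to settle and re-read its status afterwards.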
@chrisns chrisns had a problem deploying to smoke-test-deploy May 19, 2026 12:14 — with GitHub Actions Failure
PR-CI exposed the next tier: even with --resources-to-skip on the failed
AICC nested stack, continue-update-rollback can leave the umbrella in
UPDATE_ROLLBACK_FAILED indefinitely. Falling back to a recovery name
doesn't work either, because the original stack still owns globally-
unique resources (AppRegistryApplication 'NDXTry_All_Scenarios_<acct>'),
so the fresh-name deploy hits AlreadyExists on first create.

Add a final fallback: after the skip-retry still ends in
UPDATE_ROLLBACK_FAILED, delete-stack the umbrella outright and wait for
DELETE_COMPLETE. CFN accepts delete-stack from UPDATE_ROLLBACK_FAILED.
If the delete itself fails (DELETE_FAILED), re-issue with
--retain-resources for the stuck leaves and open a stranded-stack
follow-up so humans clean those up later. Either way the umbrella is
now gone (or deleting + retained), so the next deploy creates it fresh
under the original name and the AppRegistry conflict can't recur.
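
A minimal sketch of that delete fallback, assuming the stuck leaves can be identified as the resources left in DELETE_FAILED. The function name is hypothetical, and the built-in `stack-delete-complete` waiter stands in for whatever wait the script actually uses.

```shell
# Hypothetical sketch of "delete, then retry with --retain-resources".
delete_stack_with_retain() {
  local stack="$1" stuck
  aws cloudformation delete-stack --stack-name "$stack" || return 1
  if aws cloudformation wait stack-delete-complete --stack-name "$stack"; then
    return 0
  fi
  # The waiter failed. Collect whatever is in DELETE_FAILED and retry,
  # telling CloudFormation to leave those resources behind.
  stuck="$(aws cloudformation describe-stack-resources --stack-name "$stack" \
    --query "StackResources[?ResourceStatus=='DELETE_FAILED'].LogicalResourceId" \
    --output text)" || return 1
  [ -n "$stuck" ] || return 1
  echo "retaining stuck resources on $stack: $stuck" >&2
  # Intentional word splitting: one argument per logical ID.
  # shellcheck disable=SC2086
  aws cloudformation delete-stack --stack-name "$stack" \
    --retain-resources $stuck || return 1
  aws cloudformation wait stack-delete-complete --stack-name "$stack"
}
```

Anything passed to `--retain-resources` stays in the account unmanaged, which is why the commit pairs this path with a stranded-stack follow-up.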
@chrisns chrisns had a problem deploying to smoke-test-deploy May 19, 2026 12:34 — with GitHub Actions Failure
When delete-stack of the umbrella succeeds via --retain-resources for
stuck nested stacks, those nested stacks live on as top-level orphans
(StackName all-demo-<LogicalId>-<random>) and still own globally-unique
child resources — AppRegistryApplication names in particular. The fresh
umbrella deploy then collides on those names when its own nested stack
(e.g. PaperlessNgx) tries to create its AppRegistryApplication.

After the umbrella reaches DOES_NOT_EXIST/DELETE_COMPLETE, list all
top-level stacks whose name starts with ${STACK}- and try delete-stack
on each. The orphan delete usually succeeds because the parent-child
race that originally blocked it is gone. If an orphan still ends in
DELETE_FAILED, retain its stuck leaves and open a stranded-stack issue
for human triage; we don't recurse further to keep the script bounded.
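
The prefix sweep could be sketched like this. The helper names are hypothetical, and the sketch matches on the name prefix only: it does not reproduce the restriction to top-level stacks, so it is only safe to run once the umbrella itself is gone.

```shell
# Hypothetical sketch of a prefix-based orphan sweep.

# JMESPath filter for live stacks named "<stack>-...". list-stacks also
# returns recently deleted stacks, so DELETE_COMPLETE is excluded.
orphan_query() {
  printf "StackSummaries[?starts_with(StackName, '%s-') && StackStatus != 'DELETE_COMPLETE'].StackName" "$1"
}

list_orphan_stacks() {
  aws cloudformation list-stacks --query "$(orphan_query "$1")" --output text
}

# Issue delete-stack for each orphan; returns 1 if any request was rejected.
sweep_orphan_stacks() {
  local stack="$1" name failed=0
  for name in $(list_orphan_stacks "$stack"); do
    echo "deleting orphan stack $name" >&2
    if ! aws cloudformation delete-stack --stack-name "$name"; then
      echo "delete-stack rejected for $name; needs human triage" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Because `delete-stack` is asynchronous, a real sweep would also wait for each orphan to finish deleting before the umbrella is recreated.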
Yesterday's UPDATE_ROLLBACK_FAILED → delete-with-retain path left orphan
nested stacks (all-demo-PaperlessNgx-*, etc.) in the smoke account. They
hold globally-unique resources (AppRegistryApplication names) that block
the umbrella's next create. The orphan-sweep code was only wired into
the UPDATE_ROLLBACK_FAILED branch, so a follow-up run from DOES_NOT_EXIST
(stack is gone, orphans linger) still tripped on the same conflict.

Refactor: extract sweep into sweep_orphan_stacks() and call it from a
new use_canonical() helper, which every "use the original STACK name"
exit path goes through. Each branch that calls use_canonical now sweeps
exactly once, just before stack_name is written, regardless of how we
got into a deployable state. emit_recovery still bypasses the sweep
(recovery names rarely collide on globally-unique resources, and the
sweep would add latency on the give-up path).
@chrisns chrisns had a problem deploying to smoke-test-deploy May 19, 2026 13:01 — with GitHub Actions Failure
PR-CI exposed two more failure shapes after the last fix:

1. An initial status of ROLLBACK_IN_PROGRESS would settle to ROLLBACK_COMPLETE
   inside the *_IN_PROGRESS branch's wait_for_stable, and the branch then
   called use_canonical on a stack whose status CFN refuses to update.

2. An orphan whose delete-with-retain hit the 30m wait timeout used to be
   abandoned (issue opened, script proceeded). The lingering orphan still
   blocked the umbrella's recreation by holding AppRegistryApplication
   names, so the subsequent deploy failed in the same way.

Rebuild the script as a re-evaluation loop (max 8 iterations) that
re-reads the stack's status after every CFN-mutating call and re-
dispatches via the case statement. Transitions like
ROLLBACK_IN_PROGRESS → ROLLBACK_COMPLETE → delete + recreate now flow
naturally without explicit chaining.

Orphan cleanup is hardened with a three-stage cleanup_orphan(): plain
delete, retain-on-DELETE_FAILED retry, then force-retain-everything as
a last resort. Force-retain leaves debris on the account (logged as a
stranded-stack issue) but at least the orphan stops blocking the
umbrella's create.

Also handle ROLLBACK_FAILED (rare; same shape as UPDATE_ROLLBACK_FAILED)
and DELETE_FAILED-with-retain (so the script doesn't fall off the case
statement into the * branch).
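
The re-evaluation loop might be shaped roughly like the sketch below. The helper names, the grouping of statuses into branches, and the `RUN_ID` variable are assumptions made for illustration; the real script's branches do more work (skip retries, retain retries, sweeps) than shown here.

```shell
# Hypothetical sketch of a bounded re-evaluation loop.
MAX_ITERATIONS=8

stack_status() {
  aws cloudformation describe-stacks --stack-name "$1" \
    --query 'Stacks[0].StackStatus' --output text 2>/dev/null \
    || echo "DOES_NOT_EXIST"
}

# Wait until the stack is out of *_IN_PROGRESS (unbounded here for brevity).
settle() {
  while :; do
    case "$(stack_status "$1")" in
      *_IN_PROGRESS) sleep "${POLL_SECONDS:-30}" ;;
      *) return 0 ;;
    esac
  done
}

delete_and_wait() {
  aws cloudformation delete-stack --stack-name "$1" \
    && aws cloudformation wait stack-delete-complete --stack-name "$1"
}

# Prints the stack name to deploy under. Returns 1 when it gives up and
# falls back to a recovery name.
resolve_stack_name() {
  local stack="$1" i=1 status
  while [ "$i" -le "$MAX_ITERATIONS" ]; do
    status="$(stack_status "$stack")"      # re-read after every mutation
    echo "iteration $i: $stack is $status" >&2
    case "$status" in
      DOES_NOT_EXIST|CREATE_COMPLETE|UPDATE_COMPLETE|UPDATE_ROLLBACK_COMPLETE)
        echo "$stack"; return 0 ;;         # deployable under the original name
      *_IN_PROGRESS)
        settle "$stack" ;;                 # then loop and re-dispatch
      ROLLBACK_COMPLETE|ROLLBACK_FAILED|DELETE_FAILED)
        delete_and_wait "$stack" || true ;;  # cannot be updated: delete, recreate
      UPDATE_ROLLBACK_FAILED)
        aws cloudformation continue-update-rollback --stack-name "$stack" || true ;;
      *)
        break ;;                           # unknown status: give up
    esac
    i=$((i + 1))
  done
  echo "${stack}-recovery-${RUN_ID:-local}"
  return 1
}
```

No branch chains to another one directly. Each mutating call is followed by a fresh status read, so a ROLLBACK_IN_PROGRESS stack settles, is seen as ROLLBACK_COMPLETE on the next pass, gets deleted, and is seen as DOES_NOT_EXIST on the pass after that.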
@chrisns chrisns had a problem deploying to smoke-test-deploy May 19, 2026 13:44 — with GitHub Actions Failure
…slate

Force-retain orphan-stack cleanups leave behind non-stack-owned resources
that still hold globally-unique names. The umbrella's next create then
trips on AlreadyExists at the leaf:

  ndx-try-planning-docs-<acct>-us-east-1  already exists
  NDXTry_All_Scenarios_<acct>             AppRegistryApplication conflict

Pattern: scenario templates use deterministic names with the account id
in the suffix (S3) or NDXTry_ prefix (AppRegistry). Add two new sweep
helpers — sweep_orphan_s3_buckets (object-version-aware empty + rb) and
sweep_orphan_appregistry — and invoke them from use_canonical *only* when
the umbrella is truly absent from CFN (status DOES_NOT_EXIST). That
DOES_NOT_EXIST guard is important: running these while CFN is mid-create
would yank live state out from under the stack. The smoke account hosts
nothing but smoke fixtures, so blanket-deleting matching resources when
no stack exists is safe.

If a bucket or app refuses deletion, open a stranded-stack issue and
keep going — the deploy can still proceed against the names we did
manage to free, and the remaining few will surface a clear failure that
a human can clean up.
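
The DOES_NOT_EXIST gate described above could be sketched as below. The three sweep functions are placeholders that only log, and writing to `GITHUB_OUTPUT` is an assumption about where `stack_name` ends up.

```shell
# Hypothetical sketch of the safety gate. The sweeps are placeholders:
# the real helpers would call AWS, these only log what they would do.
sweep_orphan_stacks()      { echo "sweep: orphan stacks for $1" >&2; }
sweep_orphan_s3_buckets()  { echo "sweep: s3 buckets" >&2; }
sweep_orphan_appregistry() { echo "sweep: appregistry applications" >&2; }

#   $1 canonical stack name, $2 the status just read from CFN
use_canonical() {
  local stack="$1" status="$2"
  sweep_orphan_stacks "$stack"
  if [ "$status" = "DOES_NOT_EXIST" ]; then
    # No stack exists, so nothing live can own the matching resources.
    sweep_orphan_s3_buckets
    sweep_orphan_appregistry
  else
    echo "skipping resource sweeps: $stack is $status" >&2
  fi
  # Assumed destination: the GitHub Actions step-output file.
  echo "stack_name=$stack" >> "${GITHUB_OUTPUT:-/dev/stdout}"
}
```

Keeping the gate in one helper means every exit path that reuses the original name gets the same check, rather than each branch deciding for itself.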
@chrisns chrisns had a problem deploying to smoke-test-deploy May 19, 2026 15:09 — with GitHub Actions Failure
…orphans

PR-CI run showed "No ndx-try-*464453619983* buckets to sweep." even though
five such buckets demonstrably existed (the next deploy failed with
AlreadyExists on every one). Root cause: JMESPath backtick-literals
parse their contents as JSON, so the 12-digit account id became a
*number* in the expression, and `contains(Name, <number>)` against a
string never matches.

Switch to single-quoted JMESPath strings for both the bucket filter
(account-id-bearing names) and the AppRegistry filter (NDXTry_ prefix).
Single-quoted string literals are interpreted as raw strings regardless
of content, avoiding the JSON-parse pitfall.
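
To make the quoting difference concrete, here is a small sketch of how the bucket filter might be built. The function names and the exact filter expression are assumptions; only the single-quote fix itself comes from the commit.

```shell
# Hypothetical sketch of the corrected filter.
# Broken form, for contrast (backticks make a JSON literal, so an
# all-digit account id is parsed as a number):
#   Buckets[?contains(Name, `464453619983`)].Name

# Single quotes make a JMESPath raw string, so the account id stays a string.
bucket_query() {
  printf "Buckets[?starts_with(Name, 'ndx-try-') && contains(Name, '%s')].Name" "$1"
}

#   $1 = 12-digit account id
list_smoke_buckets() {
  aws s3api list-buckets --query "$(bucket_query "$1")" --output text
}
```

Wrapping the whole expression in shell double quotes lets the JMESPath single quotes pass through untouched while the account id is still interpolated.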
@chrisns chrisns had a problem deploying to smoke-test-deploy May 19, 2026 16:19 — with GitHub Actions Failure
… sweep

PR-CI run 26110208253 exposed two more orphan classes:

1. Bucket sweep claimed success but ndx-try-paperless-archive-v2-... was
   demonstrably present 21 s later when deploy ran. The old code swallowed
   stderr on every step (delete-objects, s3 rb), so a permissions error or
   partial delete looked identical to a clean delete. Refactored into
   delete_bucket_completely(): three attempts per bucket, stderr surfaced
   on each, and a positive head-bucket check verifies the bucket actually
   disappeared (rather than trusting the rb exit code).

2. Amazon Connect instances survive CFN delete and their alias is
   account-globally-unique. Added sweep_orphan_connect: list instances,
   filter to ndx-try-* aliases, delete-instance each. Failures route to
   a stranded-stack issue.

Both new sweeps run from use_canonical only when the umbrella is
DOES_NOT_EXIST, matching the existing safety gate.
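
A sketch of the verify-after-delete idea from item 1, under the assumption that the bucket has already been emptied. The function names and the pause between attempts are illustrative, not the ones in delete_bucket_completely().

```shell
# Hypothetical sketch: retry, surface stderr, then verify with head-bucket.

# Caveat: head-bucket also fails on permission errors, so "absent" here
# really means "head-bucket did not succeed".
bucket_exists() {
  aws s3api head-bucket --bucket "$1" >/dev/null 2>&1
}

#   $1 bucket, $2 attempts (default 3), $3 pause in seconds (default 5)
delete_bucket_verified() {
  local bucket="$1" attempts="${2:-3}" pause="${3:-5}"
  local i=1 err
  while [ "$i" -le "$attempts" ]; do
    # Capture stderr only, so a failure is printed instead of swallowed.
    if ! err="$(aws s3 rb "s3://$bucket" 2>&1 >/dev/null)"; then
      echo "attempt $i: rb failed for $bucket:" >&2
      echo "    $err" >&2
    fi
    # Do not trust the exit code alone: check the bucket is really gone.
    if ! bucket_exists "$bucket"; then
      echo "bucket $bucket is gone after attempt $i" >&2
      return 0
    fi
    sleep "$pause"
    i=$((i + 1))
  done
  echo "bucket $bucket still present after $attempts attempts" >&2
  return 1
}
```

The verification runs after every attempt, including ones where `rb` reported success, which is the case that previously went unnoticed.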
@chrisns chrisns had a problem deploying to smoke-test-deploy May 19, 2026 16:50 — with GitHub Actions Failure
…ollback

ROLLBACK_FAILED (initial CREATE rolled back, rollback itself failed) does
not accept continue-update-rollback — that API verb is for the UPDATE
variant. The old branch silently no-op'd and the loop burned all 8
iterations in 18 seconds, fell out to emit_recovery, and the deploy
under the recovery name then collided with leftover orphan resources
that hadn't been swept (sweep gates on DOES_NOT_EXIST).

Mirror the ROLLBACK_COMPLETE branch's structure: delete-stack, then
retain-on-DELETE_FAILED retry, then `continue` so the next iteration
sees DOES_NOT_EXIST and routes through use_canonical (which then runs
the resource sweep).
@chrisns chrisns had a problem deploying to smoke-test-deploy May 19, 2026 16:55 — with GitHub Actions Failure
The Paperless-ngx scenario uses AWS::S3Files::FileSystem (the newer
mountpoint-style S3 filesystem service, ARN namespace s3files) attached
to its archive bucket. CFN delete with --retain-resources leaves the
file system orphaned, and the bucket then refuses every subsequent
delete with BucketHasS3FileSystemAttached.

Add sweep_orphan_s3files: `aws s3files list-file-systems`, filter by
bucket name matching our ndx-try-*${acct}* pattern, delete-file-system
--force-delete each match. The s3files API was added to AWS CLI in
2025-05, so it is newer than the local CLI but available on the GH
Actions runner. If list-file-systems is missing, log and continue rather
than crash.

Call sweep_orphan_s3files BEFORE sweep_orphan_s3_buckets in
use_canonical so the buckets have a chance to delete cleanly. Delete is
async — added a 60s grace period before bucket sweep runs.
@chrisns chrisns had a problem deploying to smoke-test-deploy May 19, 2026 17:50 — with GitHub Actions Failure
Pre-deploy failed with:
  aws: [ERROR]: An error occurred (ParamValidation): Error parsing
  parameter '--delete': Expected: '=', received: '"' for input

Root cause: I was building --delete as CLI shorthand prefixed with
`Objects=` and then concatenating a JSON array. The CLI parses shorthand
character-by-character and rejects JSON's double-quoted keys.

Two fixes:
1. Build the payload as full JSON via `jq -n --argjson o "$versions"
   '{Objects: $o}'`; the CLI auto-detects values starting with `{` as
   JSON and parses correctly.
2. Stop piping stderr through sed. With set -euo pipefail, a non-zero
   exit anywhere in the pipeline kills the script, and the earlier aws
   CLI parse error sent us down exactly that path. Capture stderr to a
   var and echo it indented on a separate line.
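
Both fixes can be sketched together. The `jq` invocation follows the one quoted above (with `-c` added for compact output); the function names are hypothetical, and the sketch ignores delete markers and the 1000-key limit per delete-objects request.

```shell
# Hypothetical sketch of the two fixes.

# Fix 1: build --delete as full JSON rather than CLI shorthand.
#   $1 = JSON array of {Key, VersionId} objects
build_delete_payload() {
  jq -cn --argjson o "$1" '{Objects: $o}'
}

# Fix 2: no pipeline. Run the command, discard stdout, and on failure
# print its stderr indented on a separate line.
run_logged() {
  local err rc=0
  err="$("$@" 2>&1 >/dev/null)" || rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "command failed (exit $rc): $*" >&2
    echo "    $err" >&2
  fi
  return "$rc"
}

delete_object_versions() {
  local bucket="$1" versions
  versions="$(aws s3api list-object-versions --bucket "$bucket" \
    --query 'Versions[].{Key: Key, VersionId: VersionId}' \
    --output json)" || return 1
  # An empty bucket yields null (or an empty list): nothing to delete.
  case "$versions" in null|'[]'|'') return 0 ;; esac
  run_logged aws s3api delete-objects --bucket "$bucket" \
    --delete "$(build_delete_payload "$versions")"
}
```

Because `run_logged` returns the command's exit code instead of dying mid-pipeline, the caller decides whether a failed delete is fatal, even under `set -euo pipefail`.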

The S3 Files sweep in the same run did its job: deleted
fs-0ba2dd6e6d16e2fa8 and the dependent paperless-archive bucket got
deleted on attempt 1. The next bucket then tripped the parse error.
@chrisns chrisns temporarily deployed to smoke-test-deploy May 19, 2026 18:11 — with GitHub Actions Inactive
@chrisns chrisns added this pull request to the merge queue May 19, 2026
Merged via the queue into main with commit 5d87af3 May 19, 2026
12 checks passed
@chrisns chrisns deleted the fix/smoke-pre-deploy-wait branch May 19, 2026 21:45