
fix(smoke): cover S3 Files access points + async-delete propagation #344

Merged: chrisns merged 2 commits into main from fix/smoke-s3files-access-points on May 20, 2026


@chrisns commented May 20, 2026

Summary

The retry of PR #321 surfaced three more pre-deploy orphan-cleanup gaps. This PR closes all three and bundles in the mount-targets work that went directly onto renovate/node-26.x (PR #321):

  1. `delete-file-system` fails with `has access points` even after mount targets are gone. The sweep now deletes in order: access points → mount targets → file system, with a settle delay after each of the first two steps (15s after access points, 30s after mount targets).
  2. AppRegistry `delete-application` is async. The sweep returned success, the deploy fired immediately, and CFN's "already own application" check rejected the create. A 30s grace period is added at the end of the sweep.
  3. Orphan `${STACK}-*` stacks in `DELETE_IN_PROGRESS` were invisible to the sweep's status filter, so the deploy raced their AppRegistryAssociation children. A second pass now polls for in-progress orphans and runs `wait_for_stable` on each.
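
The delete ordering in item 1 can be sketched roughly as below. This is a minimal illustration, not the script's actual code: `delete_fs_in_order`, the three callback arguments, and the two `*_SETTLE_SECONDS` variables are hypothetical names, and the real S3 Files delete calls are deliberately left to the caller rather than guessed at here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: runs the three delete steps in dependency order,
# pausing after the first two so the service has time to settle.
# $1 is the file system ID; $2..$4 are names of caller-supplied functions
# (assumptions, not real CLI commands) that each take that ID.
delete_fs_in_order() {
  local fs_id="$1" del_access_points="$2" del_mount_targets="$3" del_file_system="$4"

  "$del_access_points" "$fs_id"
  sleep "${ACCESS_POINT_SETTLE_SECONDS:-15}"

  "$del_mount_targets" "$fs_id"
  sleep "${MOUNT_TARGET_SETTLE_SECONDS:-30}"

  "$del_file_system" "$fs_id"
}

# Usage (with caller-defined functions):
#   delete_fs_in_order "$fs_id" delete_access_points delete_mount_targets delete_file_system
```

Passing the steps in as function names keeps the ordering and delays in one place and lets them be exercised without touching AWS.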

Bundled in:

  • `|| true` on `cleanup_orphan` and `delete_bucket_completely` calls inside their loops so per-item failures don't kill the whole pre-deploy via `set -e`
  • s3files delete stderr captured to a var instead of piped through sed (avoids the pipefail trap that bit earlier iterations)
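
For reference, the `|| true` pattern looks roughly like this. The `cleanup_orphan` body here is a hypothetical stand-in that fails for one item; only the loop shape is meant to mirror the change.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the real cleanup_orphan: fails for the item "bad".
cleanup_orphan() {
  [[ "$1" != "bad" ]] || return 1
  echo "cleaned $1"
}

sweep() {
  local item
  for item in "$@"; do
    # Without `|| true`, the first failing item would abort the whole
    # script under `set -e`; with it, the loop moves on to the next item.
    cleanup_orphan "$item" || true
  done
}

sweep good-1 bad good-2
```

The trade-off is that per-item failures become silent unless they are logged somewhere else.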

Test plan

  • `bash -n` clean
  • Post-merge smoke on main from torn-down state reaches success
  • Re-trigger of PR 321 with this branch's content reaches success

PR 321's smoke retry surfaced three more layers of orphan-cleanup gaps:

1. delete-file-system fails with ConflictException "has access points"
   even after mount targets are gone. AWS::S3Files::AccessPoint must be
   deleted before mount targets, which must be deleted before the file
   system. Update the s3files sweep to delete in that order:
   access points (15s settle) → mount targets (30s settle) → file system.

2. AppRegistry delete-application is async on the server side. The
   sweep returned success, deploy fired immediately, and CFN's "already
   own application <NDXTry_BOPS_Planning_…>" check rejected the create.
   Add a 30s grace period at the end of sweep_orphan_appregistry,
   matching the s3files pattern.

3. Orphan ${STACK}-* stacks left over from prior runs were sometimes
   in DELETE_IN_PROGRESS state — invisible to the sweep's status filter
   (which only listed terminal states). The deploy raced their
   AppRegistryAssociation children and hit
   "stack status as DELETE_IN_PROGRESS … is not allowed". Add a second
   pass at the end of sweep_orphan_stacks that polls list-stacks for
   in-progress matches (ParentId==null) and runs wait_for_stable on each.
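
A rough sketch of the second pass in item 3, assuming a single listing rather than a polling loop. `wait_for_in_progress_orphans` is a hypothetical name, and the `wait_for_stable` defined here is only a stand-in built on `aws cloudformation wait stack-delete-complete`; the script's real helper may behave differently.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for the script's wait_for_stable helper (assumed to take a stack name).
wait_for_stable() {
  aws cloudformation wait stack-delete-complete --stack-name "$1"
}

# Hypothetical second pass: find top-level stacks with the given prefix that
# are still deleting, and wait on each before the deploy is allowed to start.
wait_for_in_progress_orphans() {
  local prefix="$1" query names name
  # Top-level stacks only: nested stacks carry a ParentId.
  query="StackSummaries[?starts_with(StackName, '${prefix}-') && ParentId==\`null\`].StackName"
  names="$(aws cloudformation list-stacks \
    --stack-status-filter DELETE_IN_PROGRESS \
    --query "$query" --output text)"
  for name in $names; do
    # Text output can render an empty result as "None"; skip it defensively.
    if [[ "$name" == "None" ]]; then
      continue
    fi
    wait_for_stable "$name" || true
  done
}

# Usage: wait_for_in_progress_orphans "$STACK"
```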

Also bundled in: `|| true` on cleanup_orphan and delete_bucket_completely
calls inside their respective sweep loops so a per-item failure no
longer kills the whole pre-deploy via set -e. (The stranded-stack issues
remain the audit trail.) Plus s3files delete stderr captured to var
instead of piped through sed — avoids the pipefail trap that bit us
earlier.
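
The stderr capture can be sketched as follows. `delete_file_system` is a hypothetical stand-in that always fails, and `report_delete` is an illustrative wrapper; the point is only that no pipeline is involved, so `pipefail` has nothing to propagate and the `if !` guard keeps `set -e` from firing.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the delete call: writes an error to stderr and fails.
delete_file_system() {
  echo "ConflictException: has access points" >&2
  return 1
}

report_delete() {
  local err
  # `2>&1 >/dev/null` sends stderr into the capture and discards stdout.
  # Declaring `err` separately keeps the exit status of the substitution.
  if ! err="$(delete_file_system 2>&1 >/dev/null)"; then
    echo "delete failed: ${err}"
  fi
}

report_delete
```

With a `cmd 2>&1 | sed ...` form, a failing `cmd` fails the whole pipeline under `pipefail`, which is presumably the trap the commit message refers to.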

@chrisns had a problem deploying to smoke-test-deploy May 20, 2026 10:24 (GitHub Actions: Failure)

When iterative pre-deploy cleanup has accumulated more debris than the
script can sweep in 90 minutes, manual recovery is needed. Add a
workflow-dispatch workflow + script that runs under the existing
smoke-test-deploy role and clears every:

  - ${STACK}* CloudFormation stack (top-level)
  - AWS::S3Files::FileSystem (with access points + mount targets first)
  - NDXTry_* AppRegistry applications
  - ndx-try-* Connect instances
  - ndx-try-*${ACCOUNT_ID}* S3 buckets

Same identity, same concurrency group as smoke.yml, so the two can't
run simultaneously. Requires typing NUKE as the confirm input to fire,
to avoid accidental destruction of the smoke account.

This is the script side of "manual cleanup" we kept reaching for during
PR #321's CI iterations. After running this, the next smoke run starts
from a true clean slate.
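
The confirm gate could look something like this on the script side. `require_confirm` and the `CONFIRM` variable are hypothetical names; how the workflow input actually reaches the script is not shown in this PR text.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical guard: succeeds only when the confirm input is exactly "NUKE".
require_confirm() {
  local confirm="${1:-}"
  if [[ "$confirm" != "NUKE" ]]; then
    echo "refusing to run: confirm input must be NUKE" >&2
    return 1
  fi
}

# Usage, assuming the workflow passes its input in as CONFIRM:
#   require_confirm "${CONFIRM:-}"
```

The match is case-sensitive and exact, so an empty or mistyped input stops the run before anything is deleted.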

@chrisns had a problem deploying to smoke-test-deploy May 20, 2026 13:28 (GitHub Actions: Failure)
@chrisns added this pull request to the merge queue May 20, 2026
Merged via the queue into main with commit 0919f63 May 20, 2026
7 of 8 checks passed
@chrisns deleted the fix/smoke-s3files-access-points branch May 20, 2026 13:28