Commit ff5acf3

fix(smoke): treat UPDATE_ROLLBACK_COMPLETE as needs-recreate, not deployable
Post-merge smoke on main found all-demo in UPDATE_ROLLBACK_COMPLETE, called use_canonical, and the subsequent CFN update failed because the internal rollback (triggered by a leaf failure) tried to delete StorageFileSystem61EA7B3D, which has pending S3 export data and needs forceDelete=true. The AWS::S3Files::FileSystem CFN handler does not pass forceDelete, so the rollback fails and the deploy reports "Failed to create/update the stack".

CFN technically accepts updates from UPDATE_ROLLBACK_COMPLETE, but for the all-demo umbrella that state always hides this kind of half-cleaned S3Files / nested-stack debris. Safer to mirror the ROLLBACK_COMPLETE branch: delete-stack, wait, retain-on-DELETE_FAILED, then `continue` the loop so the next iteration hits DOES_NOT_EXIST → use_canonical → full resource sweep (which DOES force-delete file systems).

Cost: one extra ~60m recreate cycle when CFN rolled back.
Benefit: the umbrella self-recovers from S3Files-stuck rollbacks instead of needing human cleanup.
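
The routing this commit describes can be summarised as a small lookup. The sketch below is illustrative only: `classify_status` and the action labels (`deploy`, `recreate`, `fix-forward`, `other`) are hypothetical names, not part of scripts/smoke-pre-deploy-state.sh, and statuses the commit does not mention fall through to `other`.

```shell
#!/bin/sh
# Hypothetical helper: maps a stack status to the action the smoke loop
# takes after this commit. Name and labels are assumptions for illustration.
classify_status() {
  case "$1" in
    DOES_NOT_EXIST|CREATE_COMPLETE|UPDATE_COMPLETE)
      echo "deploy" ;;        # the script calls use_canonical here
    ROLLBACK_COMPLETE|UPDATE_ROLLBACK_COMPLETE)
      echo "recreate" ;;      # delete-stack, wait, then re-check the status
    CREATE_FAILED|UPDATE_FAILED)
      echo "fix-forward" ;;   # handled by the existing *_FAILED branch
    *)
      echo "other" ;;         # states this sketch does not cover
  esac
}

for s in UPDATE_COMPLETE UPDATE_ROLLBACK_COMPLETE; do
  printf '%s -> %s\n' "$s" "$(classify_status "$s")"
done
```

Before this commit, UPDATE_ROLLBACK_COMPLETE sat in the first group; the change moves it into the second.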
1 parent 47c3fba commit ff5acf3

1 file changed

Lines changed: 30 additions & 1 deletion

scripts/smoke-pre-deploy-state.sh

@@ -323,9 +323,38 @@ for ITER in $(seq 1 $MAX_ITER); do
   echo "[iter ${ITER}/${MAX_ITER}] $STACK status: $STATUS"
 
   case "$STATUS" in
-    DOES_NOT_EXIST|CREATE_COMPLETE|UPDATE_COMPLETE|UPDATE_ROLLBACK_COMPLETE)
+    DOES_NOT_EXIST|CREATE_COMPLETE|UPDATE_COMPLETE)
       use_canonical
       ;;
+    UPDATE_ROLLBACK_COMPLETE)
+      # Technically deployable per CFN, but the post-merge smoke run on
+      # main proved this is fragile for the all-demo umbrella: a stale
+      # UPDATE_ROLLBACK_COMPLETE often hides nested resources (S3Files
+      # file systems with pending exports, half-deleted nested stacks)
+      # that fail the next update's own internal rollback. Treat it the
+      # same as ROLLBACK_COMPLETE — nuke and recreate from clean slate.
+      echo "Deleting from UPDATE_ROLLBACK_COMPLETE for clean recreate"
+      aws cloudformation delete-stack --stack-name "$STACK" 2>/dev/null || true
+      STATUS_NOW=$(wait_for_stable "$STACK" 3600) || \
+        emit_recovery "delete from UPDATE_ROLLBACK_COMPLETE still running after 60m"
+      if [ "$STATUS_NOW" = "DELETE_FAILED" ]; then
+        RETAIN=$(aws cloudformation list-stack-resources --stack-name "$STACK" \
+          --query 'StackResourceSummaries[?ResourceStatus==`DELETE_FAILED`].LogicalResourceId' \
+          --output text | tr '\t' ' ')
+        if [ -n "$RETAIN" ]; then
+          echo "Retrying delete-stack from UPDATE_ROLLBACK_COMPLETE retaining: $RETAIN"
+          # shellcheck disable=SC2086
+          aws cloudformation delete-stack --stack-name "$STACK" \
+            --retain-resources $RETAIN 2>/dev/null || true
+          wait_for_stable "$STACK" 3600 || \
+            emit_recovery "UPDATE_ROLLBACK_COMPLETE retain-delete still running after 60m"
+          gh issue create --title "smoke: retained resources after $STACK delete (UPDATE_ROLLBACK_COMPLETE)" \
+            --label stranded-stack \
+            --body "Retained on delete: $RETAIN. Run ${GITHUB_RUN_ID}." || true
+        fi
+      fi
+      continue
+      ;;
     CREATE_FAILED|UPDATE_FAILED)
       # Fix-forward: CFN's `update-stack` (which `aws cloudformation deploy`
       # uses) accepts both *_FAILED states and replaces failed resources
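
The new branch calls `wait_for_stable "$STACK" 3600`, which is defined elsewhere in the script and is not part of this diff. Judging only from how it is called (the status is captured from stdout and a non-zero exit triggers `emit_recovery`), a helper of that shape might look roughly like the sketch below. Everything here is an assumption: the `get_stack_status` function, the `POLL_INTERVAL` variable, and the mapping of any `describe-stacks` failure to DOES_NOT_EXIST are simplifications made for illustration, not the script's actual implementation.

```shell
#!/bin/sh
# Hypothetical status lookup. Simplification: ANY describe-stacks failure
# (including a missing stack) is reported as DOES_NOT_EXIST.
get_stack_status() {
  aws cloudformation describe-stacks --stack-name "$1" \
    --query 'Stacks[0].StackStatus' --output text 2>/dev/null \
    || echo "DOES_NOT_EXIST"
}

# Assumed behaviour: poll until the status no longer ends in _IN_PROGRESS,
# print the last status seen, and return 1 if the timeout (seconds) elapses.
wait_for_stable() {
  local stack="$1" timeout="$2" interval="${POLL_INTERVAL:-15}"
  local waited=0 status
  while :; do
    status=$(get_stack_status "$stack")
    case "$status" in
      *_IN_PROGRESS) ;;                 # still transitioning, keep polling
      *) echo "$status"; return 0 ;;    # stable state (or stack is gone)
    esac
    if [ "$waited" -ge "$timeout" ]; then
      echo "$status"
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

With a helper like this, `STATUS_NOW=$(wait_for_stable "$STACK" 3600) || emit_recovery "..."` captures the final status and falls through to the recovery path only on timeout.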

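The `# shellcheck disable=SC2086` in the retain path is deliberate: `--output text` returns the logical IDs on one tab-separated line, `tr` turns the tabs into spaces, and the unquoted `$RETAIN` then relies on word splitting to pass each ID as its own argument. The sketch below simulates that without calling AWS. `count_args` is a hypothetical helper, and `LeafStackA` / `LeafStackB` are made-up IDs used only for illustration.

```shell
#!/bin/sh
# Simulated `--output text` result: logical IDs separated by tabs.
# LeafStackA and LeafStackB are invented for this example.
raw=$(printf 'StorageFileSystem61EA7B3D\tLeafStackA\tLeafStackB')

# Same normalisation the script applies to the CLI output.
RETAIN=$(printf '%s' "$raw" | tr '\t' ' ')

# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# shellcheck disable=SC2086
count_args $RETAIN      # prints 3: one argument per logical ID
count_args "$RETAIN"    # prints 1: quoting would pass a single string
```

Quoting `$RETAIN` in the `delete-stack` call would hand `--retain-resources` one space-joined string instead of a list, which is why the expansion is left unquoted.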