fix(ci-lease): cover all sandbox lifecycle OUs + retry safety#394
Merged
Conversation
Two correctness bugs that surfaced once we got past the WAF and into real lease acquisition. 1) StackSet wasn't following lease lifecycle. SERVICE_MANAGED StackSets with AutoDeployment don't recurse into descendant OUs — they only deploy to accounts DIRECTLY in the listed OUs. Original template targeted just the parent (ou-2laj-4dyae1oa), which has zero direct accounts, plus an out-of-band create-stack-instances I ran yesterday against the Available OU. Result: when an account moved Available -> Active for a lease, the StackSet treated it as "left the targeted OU", REMOVED the stack (retainStacksOnAccountRemoval=false), and the workflow's subsequent AssumeRole into the leased account hit a missing role. Confirmed via list-stack-instances against account=736822755656 in CleanUp OU returning 0 results. Fix: enumerate every ISB lifecycle OU explicitly (Entry, Available, Active, Frozen, CleanUp, Quarantine, Exit). The role now persists through the full Available -> Active -> CleanUp -> Available cycle. The CFN stack instance is also protected from aws-nuke by the upstream nuke-config.yaml `StackSet-Isb-*` glob rule. 2) Lambda's make_isb_api_request retried 5xx errors on POST /leases, which is non-idempotent. A transient 500 after the server actually committed the record would create a duplicate lease, eating into the maxLeasesPerUser quota (currently 1) and leaking a pool account. Added an `idempotent` flag (default GET/HEAD/DELETE/PUT yes, POST no). Acquire's POST explicitly opts out; terminate's POST explicitly opts in (the API returns 404/409 for already-terminated leases, which we treat as success). The StackSet update will re-deploy to all 7 OUs — adds ~425 instances spread across Active/Exit/etc. Wallclock ~15-25 min at MaxConcurrentPercentage=100.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Two fixes uncovered by getting past the WAF: (1) StackSet now enumerates every ISB lifecycle OU (Entry/Available/Active/Frozen/CleanUp/Quarantine/Exit) so the role persists through the full lease cycle. (2) POST /leases no longer retries on 5xx (was non-idempotent — created duplicate leases that ate into the quota).