Skip to content

fix(ci-lease): cover all sandbox lifecycle OUs + retry safety#394

Merged
chrisns merged 1 commit into
mainfrom
chore/ci-lease-lambda-timeout
May 22, 2026
Merged

fix(ci-lease): cover all sandbox lifecycle OUs + retry safety#394
chrisns merged 1 commit into
mainfrom
chore/ci-lease-lambda-timeout

Conversation

@chrisns
Copy link
Copy Markdown
Member

@chrisns chrisns commented May 22, 2026

Two fixes uncovered by getting past the WAF: (1) StackSet now enumerates every ISB lifecycle OU (Entry/Available/Active/Frozen/CleanUp/Quarantine/Exit) so the role persists through the full lease cycle. (2) POST /leases no longer retries on 5xx (was non-idempotent — created duplicate leases that ate into the quota).

Two correctness bugs that surfaced once we got past the WAF and into
real lease acquisition.

1) StackSet wasn't following lease lifecycle. SERVICE_MANAGED StackSets
   with AutoDeployment don't recurse into descendant OUs — they only
   deploy to accounts DIRECTLY in the listed OUs. Original template
   targeted just the parent (ou-2laj-4dyae1oa), which has zero direct
   accounts, plus an out-of-band create-stack-instances I ran yesterday
   against the Available OU. Result: when an account moved Available
   -> Active for a lease, the StackSet treated it as "left the targeted
   OU", REMOVED the stack (retainStacksOnAccountRemoval=false), and the
   workflow's subsequent AssumeRole into the leased account hit a
   missing role. Confirmed via list-stack-instances against
   account=736822755656 in CleanUp OU returning 0 results.

   Fix: enumerate every ISB lifecycle OU explicitly (Entry, Available,
   Active, Frozen, CleanUp, Quarantine, Exit). The role now persists
   through the full Available -> Active -> CleanUp -> Available cycle.
   The CFN stack instance is also protected from aws-nuke by the
   upstream nuke-config.yaml `StackSet-Isb-*` glob rule.

2) Lambda's make_isb_api_request retried 5xx errors on POST /leases,
   which is non-idempotent. A transient 500 after the server actually
   committed the record would create a duplicate lease, eating into
   the maxLeasesPerUser quota (currently 1) and leaking a pool
   account. Added an `idempotent` flag (default GET/HEAD/DELETE/PUT
   yes, POST no). Acquire's POST explicitly opts out; terminate's POST
   explicitly opts in (the API returns 404/409 for already-terminated
   leases, which we treat as success).

The StackSet update will re-deploy to all 7 OUs — adds ~425 instances
spread across Active/Exit/etc. Wallclock ~15-25 min at
MaxConcurrentPercentage=100.
@chrisns chrisns added this pull request to the merge queue May 22, 2026
Merged via the queue into main with commit 98cd41e May 22, 2026
5 checks passed
@chrisns chrisns deleted the chore/ci-lease-lambda-timeout branch May 22, 2026 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant