feat(ci): clean up stale stacks with global vitest setup hook#1499
Conversation
|
Claude Security Review: no high-confidence findings. (run) |
Package TarballHow to installgh release download pr-1499-tarball --repo aws/agentcore-cli --pattern "*.tgz" --dir /tmp/pr-tarball
npm install -g /tmp/pr-tarball/aws-agentcore-0.18.0.tgz |
agentcore-cli-automation
left a comment
There was a problem hiding this comment.
Nice safety net for the orphaned-stack problem. A few things worth addressing before merge — main concerns are around the production behavior diverging from what was actually tested manually, and unbounded parallelism against the CloudFormation API.
|
|
||
| const cfn = new CloudFormationClient({ region: REGION, maxAttempts: 10 }); | ||
| try { | ||
| await cleanUpOldStacks(cfn); |
There was a problem hiding this comment.
The PR description says this was tested with the retry flag set very high to clean up the existing backlog, but the production call on line 118 invokes cleanUpOldStacks(cfn) with no options — so options?.retries is undefined and the retry block on line 100 never fires. As written, the hook is effectively single-shot.
If single-shot is intentional, that's fine but worth a comment so it doesn't drift. If you want retries in CI:
- Pass an explicit
retries(e.g.cleanUpOldStacks(cfn, { retries: 2 })), or - Default
retriesinsidecleanUpOldStacksso the production path matches what was tested.
Either way the manually-tested configuration isn't what will run in CI on every e2e invocation, which is a gap worth closing.
There was a problem hiding this comment.
this is intentional. If we retry, it will loop and take a significantly longer amount of time. This one shot is a best effort attempt of cleanup, and if an operator needs to, they can adjust the retries on local runs to make more powerful.
| const names = stacks.map(s => s.StackName!); | ||
|
|
||
| logger.info(`deleting ${names.length} stacks with names=${names.join(',')}`); | ||
| const results = await Promise.allSettled(names.map(name => deleteStackAndVerify(client, name))); |
There was a problem hiding this comment.
With no maxStacksDeleted cap passed from the production call site (line 118), listStacks returns every matching stack, and then Promise.allSettled(names.map(...)) (line 96) fans out a DeleteStackCommand plus a waitUntilStackDeleteComplete polling loop for every single one in parallel. Given the PR description mentions a 1700+ stack backlog, that's potentially 1700 concurrent waiters each calling DescribeStacks every 15s — this will hammer the CFN API and almost certainly hit throttling that even maxAttempts: 10 won't absorb cleanly. It will also drag out the e2e setup time meaningfully.
A couple of options:
- Set a sensible default
maxStacksDeleted(e.g. 50) so each CI run nibbles at the backlog instead of trying to drain it. - Process in batches of N concurrent deletes (e.g. with
p-limitor a simple chunked loop) instead of unboundedPromise.allSettled. - Both — cap total per run and limit concurrency.
Option 3 gives the best behavior: bounded blast radius on the API and bounded time impact on e2e setup.
There was a problem hiding this comment.
It does get throttled, this is what the retries were for. However, in practice, if this is running each time they shouldn't build up to the point where we get throttled.
| const startTime = Date.now(); | ||
| try { | ||
| const result = await waitUntilStackDeleteComplete( | ||
| { client: cfn, maxWaitTime: 60 * 3, minDelay: 15 }, |
There was a problem hiding this comment.
maxWaitTime: 60 * 3 is 180 seconds. AgentCore stacks contain ECR repos, IAM roles, log groups, CodeBuild projects, etc. Stack deletion under throttling (which is likely given the unbounded parallelism above) will frequently exceed 3 minutes, especially when many stacks are deleting concurrently and each waiter is also being throttled on its DescribeStacks polls.
When the waiter times out, deleteStackAndVerify returns false even though the underlying DeleteStackCommand was accepted by CFN — so the stack will still get deleted asynchronously, but this hook's accounting (deleted X of Y) and the recursive retry decision will be misled.
Suggestions:
- Bump
maxWaitTimeto something like60 * 10or60 * 15. - Or: skip the waiter entirely. The point of this hook is to issue
DeleteStackcalls; CFN will finish them asynchronously and the next CI run will clean up anything that didn't finish. That also dramatically reduces API call volume from this hook.
There was a problem hiding this comment.
experimentally verified it took 30-45 seconds for existing stacks. If a stack takes longer than 3 minutes we skip it to avoid bloating the e2e test run with unnecessary clean up time.
|
Claude Security Review: no high-confidence findings. (run) |
|
Claude Security Review: no high-confidence findings. (run) |
|
Claude Security Review: no high-confidence findings. (run) |
| bedrockCPClient.destroy(); | ||
| } | ||
|
|
||
| logger.info(`setup finished in ${Date.now() - startTime / 1000} seconds`); |
There was a problem hiding this comment.
Operator precedence bug in the duration: / binds tighter than -, so this evaluates as Date.now() - (startTime / 1000) — it subtracts ~1.7M from Date.now() and logs a nonsense duration (~1.7 billion "seconds"). Should be:
logger.info(`setup finished in ${(Date.now() - startTime) / 1000} seconds`);Log-only (no functional impact), but the two timing logs just above (lines 34 and the stack-cleanup one) already parenthesize correctly, so this one reads inconsistently.
There was a problem hiding this comment.
good catch, lemme fix this.
|
Claude Security Review: no high-confidence findings. (run) |
Problem
Stacks are being orphaned by e2e tests, resulting in 1700+ deployed cloud-formation stacks. This wastes resources, and risks hitting resource limits.
#1493
Solution
Note: we use a global hook, instead of beforeAll on e2e-helper to ensure this runs ONCE per e2e test invoke, instead of ONCE per test suite invoked to reduce noisy API calls.
Testing
Ran this with the retry flag set very high to delete most of the existing old stacks.
Also verified this doesn't affect subsequent test runs by running with a single file: