feat(ci): clean up stale stacks with global vitest setup hook#1499

Merged
Hweinstock merged 5 commits into
aws:mainfrom
Hweinstock:feat/avoid-stale-stacks
Jun 10, 2026
@Hweinstock Hweinstock commented Jun 9, 2026

Problem

E2E tests are orphaning stacks, leaving 1700+ deployed CloudFormation stacks. This wastes resources and risks hitting resource limits.
#1493

Solution

  • Add a pre-test hook that cleans up stacks older than 3 hours.
  • The hook is resilient to failures and throttling, and does not fail tests if it throws.
  • Create a utils folder for this functionality so that it can be reused.
  • Move deleteCredentialProvider into utils.
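
To illustrate the first bullet, here is a minimal sketch of how stale stacks could be selected by age and name prefix. The `StackInfo` shape and the `selectStaleStacks` helper are hypothetical names for illustration, not the code from this PR; the 3-hour window and the `AgentCore-E2e` prefix come from the description and the log output in the Testing section.

```typescript
// Hypothetical minimal shape; the real hook works with CloudFormation stack summaries.
interface StackInfo {
  name: string;
  createdAt: Date;
}

const THREE_HOURS_MS = 3 * 60 * 60 * 1000;

// Returns stacks whose name starts with `prefix` and that were created before the cutoff.
function selectStaleStacks(
  stacks: StackInfo[],
  now: Date,
  prefix: string,
  maxAgeMs: number = THREE_HOURS_MS,
): StackInfo[] {
  const cutoff = now.getTime() - maxAgeMs;
  return stacks.filter(s => s.name.startsWith(prefix) && s.createdAt.getTime() < cutoff);
}

// Example data (made up): with this `now`, the cutoff is 2026-06-10T10:21:01.520Z.
const now = new Date("2026-06-10T13:21:01.520Z");
const stale = selectStaleStacks(
  [
    { name: "AgentCore-E2e-old", createdAt: new Date("2026-06-10T09:00:00Z") },
    { name: "AgentCore-E2e-fresh", createdAt: new Date("2026-06-10T12:30:00Z") },
    { name: "Unrelated-old", createdAt: new Date("2026-06-09T09:00:00Z") },
  ],
  now,
  "AgentCore-E2e",
);
console.log(stale.map(s => s.name)); // only the old AgentCore-E2e stack is selected
```

Filtering on both prefix and age keeps the hook away from stacks that belong to a test run still in progress or to anything outside the e2e suite.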

Note: we use a global hook instead of a beforeAll in e2e-helper so that cleanup runs once per e2e test invocation rather than once per test suite, which reduces noisy API calls.
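
As a rough sketch of the "does not fail tests if it throws" behavior, each cleanup step can be wrapped so that errors are logged and swallowed. The `bestEffort` and `globalSetup` names and the log wording here are assumptions for illustration; in a real Vitest global setup file the entry point would be the exported setup function.

```typescript
// Runs one cleanup step, logging failures instead of rethrowing so the test run continues.
async function bestEffort(label: string, step: () => Promise<void>): Promise<boolean> {
  const startTime = Date.now();
  try {
    await step();
    console.log(`done ${label} after ${(Date.now() - startTime) / 1000} seconds`);
    return true;
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    console.warn(`${label} failed, continuing: ${message}`);
    return false;
  }
}

// Hypothetical entry point that runs once per test invocation.
async function globalSetup(): Promise<boolean[]> {
  const stacks = await bestEffort("cleaning up stacks", async () => {
    throw new Error("Rate exceeded"); // simulated throttling error
  });
  const providers = await bestEffort("cleaning up credential providers", async () => {});
  return [stacks, providers];
}

// The first step fails, the second still runs, and nothing is rethrown.
const setupOutcome = globalSetup();
setupOutcome.then(outcome => console.log(outcome));
```

Because the wrapper never rethrows, a throttled or failed cleanup shows up only in the logs and the e2e tests start regardless.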

Testing

Ran this with the retry flag set very high to delete most of the existing old stacks.

Also verified this doesn't affect subsequent test runs by running with a single file:

Running: e2e-tests/byo-custom-jwt.test.ts

 RUN  v4.1.8 **/agentcore-cli

[global-setup]:starting global setup in region: us-east-1
[global-setup]:cleaning up stale stacks...
[global-setup:stack-cleanup]:listing stacks with cutoff=2026-06-10T10:21:01.520Z, prefix=AgentCore-E2e
(node:1515769) Warning: NodeVersionSupportWarning: The AWS SDK for JavaScript (v3)
versions published after the first week of January 2027
will require node >=22. You are running node v20.20.2.

To continue receiving updates to AWS services, bug fixes,
and security updates please upgrade to node >=22.

More information can be found at: https://a.co/c895JFp
(Use `node --trace-warnings ...` to show where the warning was created)
[global-setup:stack-cleanup]:found 0 stacks
[global-setup:stack-cleanup]:no stacks found!
[global-setup]:done cleaning up stacks after 0.108 seconds
[global-setup]:cleaning up stale credential providers...
(node:1515868) Warning: NodeVersionSupportWarning: The AWS SDK for JavaScript (v3)
versions published after the first week of January 2027
will require node >=22. You are running node v20.20.2.

To continue receiving updates to AWS services, bug fixes,
and security updates please upgrade to node >=22.

More information can be found at: https://a.co/c895JFp
(Use `node --trace-warnings ...` to show where the warning was created)

@github-actions github-actions Bot added size/m PR size: M agentcore-harness-reviewing AgentCore Harness review in progress labels Jun 9, 2026
@agentcore-devx-automation agentcore-devx-automation Bot added the claude-security-reviewing Claude Code /security-review in progress label Jun 9, 2026
@agentcore-devx-automation

Claude Security Review: no high-confidence findings. (run)

@agentcore-devx-automation agentcore-devx-automation Bot removed the claude-security-reviewing Claude Code /security-review in progress label Jun 9, 2026
@github-actions

github-actions Bot commented Jun 9, 2026

Package Tarball

aws-agentcore-0.18.0.tgz

How to install

gh release download pr-1499-tarball --repo aws/agentcore-cli --pattern "*.tgz" --dir /tmp/pr-tarball
npm install -g /tmp/pr-tarball/aws-agentcore-0.18.0.tgz

@agentcore-cli-automation agentcore-cli-automation left a comment

Nice safety net for the orphaned-stack problem. A few things worth addressing before merge — main concerns are around the production behavior diverging from what was actually tested manually, and unbounded parallelism against the CloudFormation API.

Comment thread e2e-tests/global-setup.ts Outdated

const cfn = new CloudFormationClient({ region: REGION, maxAttempts: 10 });
try {
await cleanUpOldStacks(cfn);

The PR description says this was tested with the retry flag set very high to clean up the existing backlog, but the production call on line 118 invokes cleanUpOldStacks(cfn) with no options — so options?.retries is undefined and the retry block on line 100 never fires. As written, the hook is effectively single-shot.

If single-shot is intentional, that's fine but worth a comment so it doesn't drift. If you want retries in CI:

  1. Pass an explicit retries (e.g. cleanUpOldStacks(cfn, { retries: 2 })), or
  2. Default retries inside cleanUpOldStacks so the production path matches what was tested.

Either way the manually-tested configuration isn't what will run in CI on every e2e invocation, which is a gap worth closing.

Contributor Author

This is intentional. If we retry, it will loop and take significantly longer. This single shot is a best-effort cleanup attempt; if an operator needs to, they can raise the retries on local runs to make it more aggressive.
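
The single-shot versus retry behavior discussed in this thread can be sketched as follows. This is an assumption-laden illustration, not the PR's code: `cleanUpWithRetries`, `CleanupOptions`, and `DeletePass` are hypothetical names, and the delete step is replaced by a fake that removes one stack per pass.

```typescript
interface CleanupOptions {
  retries?: number; // undefined means a single pass
}

// Hypothetical stand-in for the real delete step: returns the names still remaining after one pass.
type DeletePass = (names: string[]) => Promise<string[]>;

async function cleanUpWithRetries(
  names: string[],
  deletePass: DeletePass,
  options?: CleanupOptions,
): Promise<string[]> {
  const remaining = await deletePass(names);
  const retriesLeft = options?.retries ?? 0;
  if (remaining.length === 0 || retriesLeft <= 0) {
    return remaining; // done, or out of retries: leave the rest for the next run
  }
  return cleanUpWithRetries(remaining, deletePass, { retries: retriesLeft - 1 });
}

// Fake pass that only manages to delete the first stack each time.
const deleteFirstOnly: DeletePass = async names => names.slice(1);

const singleShot = cleanUpWithRetries(["a", "b", "c"], deleteFirstOnly);
const withRetries = cleanUpWithRetries(["a", "b", "c"], deleteFirstOnly, { retries: 2 });

singleShot.then(left => console.log(left)); // two stacks remain after one pass
withRetries.then(left => console.log(left)); // nothing remains after three passes
```

With no options the function makes exactly one pass, which matches the best-effort behavior described above; passing a higher `retries` trades setup time for a more thorough cleanup.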

Comment thread e2e-tests/global-setup.ts Outdated
const names = stacks.map(s => s.StackName!);

logger.info(`deleting ${names.length} stacks with names=${names.join(',')}`);
const results = await Promise.allSettled(names.map(name => deleteStackAndVerify(client, name)));

With no maxStacksDeleted cap passed from the production call site (line 118), listStacks returns every matching stack, and then Promise.allSettled(names.map(...)) (line 96) fans out a DeleteStackCommand plus a waitUntilStackDeleteComplete polling loop for every single one in parallel. Given the PR description mentions a 1700+ stack backlog, that's potentially 1700 concurrent waiters each calling DescribeStacks every 15s — this will hammer the CFN API and almost certainly hit throttling that even maxAttempts: 10 won't absorb cleanly. It will also drag out the e2e setup time meaningfully.

A couple of options:

  1. Set a sensible default maxStacksDeleted (e.g. 50) so each CI run nibbles at the backlog instead of trying to drain it.
  2. Process in batches of N concurrent deletes (e.g. with p-limit or a simple chunked loop) instead of unbounded Promise.allSettled.
  3. Both — cap total per run and limit concurrency.

Option 3 gives the best behavior: bounded blast radius on the API and bounded time impact on e2e setup.
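
Option 3 could look roughly like the sketch below, using a simple chunked loop rather than a library. The `deleteInBatches` helper, the `DeleteFn` type, and the batch size of 5 are assumptions for illustration; the cap of 50 is the example value suggested above.

```typescript
// Hypothetical delete function type; the real hook calls deleteStackAndVerify(client, name).
type DeleteFn = (stackName: string) => Promise<boolean>;

async function deleteInBatches(
  names: string[],
  deleteFn: DeleteFn,
  maxStacksDeleted = 50,
  batchSize = 5,
): Promise<{ deleted: number; attempted: number }> {
  const capped = names.slice(0, maxStacksDeleted); // cap the total work per run
  let deleted = 0;
  for (let i = 0; i < capped.length; i += batchSize) {
    const batch = capped.slice(i, i + batchSize);
    // A rejected delete counts as a failure rather than aborting the batch.
    const outcomes = await Promise.all(batch.map(n => deleteFn(n).catch(() => false)));
    deleted += outcomes.filter(ok => ok).length;
  }
  return { deleted, attempted: capped.length };
}

// Fake delete that records how many calls are in flight at once.
let inFlight = 0;
let peakInFlight = 0;
const fakeDelete: DeleteFn = async stackName => {
  inFlight += 1;
  peakInFlight = Math.max(peakInFlight, inFlight);
  await new Promise<void>(resolve => setTimeout(resolve, 1));
  inFlight -= 1;
  if (stackName.endsWith("-7")) throw new Error("Rate exceeded"); // simulated throttling
  return true;
};

const stackNames = Array.from({ length: 120 }, (_, i) => `AgentCore-E2e-${i}`);
const batchSummary = deleteInBatches(stackNames, fakeDelete);
batchSummary.then(summary => console.log(summary, `peak concurrency: ${peakInFlight}`));
```

A chunked loop waits for the slowest delete in each batch, so a limiter such as p-limit would keep the pipeline fuller; the chunked form is shown only because it needs no dependency.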

Contributor Author

It does get throttled; that is what the retries were for. In practice, though, if this runs on every invocation, stale stacks shouldn't build up to the point where we get throttled.

Comment thread e2e-tests/global-setup.ts Outdated
const startTime = Date.now();
try {
const result = await waitUntilStackDeleteComplete(
{ client: cfn, maxWaitTime: 60 * 3, minDelay: 15 },

maxWaitTime: 60 * 3 is 180 seconds. AgentCore stacks contain ECR repos, IAM roles, log groups, CodeBuild projects, etc. Stack deletion under throttling (which is likely given the unbounded parallelism above) will frequently exceed 3 minutes, especially when many stacks are deleting concurrently and each waiter is also being throttled on its DescribeStacks polls.

When the waiter times out, deleteStackAndVerify returns false even though the underlying DeleteStackCommand was accepted by CFN — so the stack will still get deleted asynchronously, but this hook's accounting (deleted X of Y) and the recursive retry decision will be misled.

Suggestions:

  • Bump maxWaitTime to something like 60 * 10 or 60 * 15.
  • Or: skip the waiter entirely. The point of this hook is to issue DeleteStack calls; CFN will finish them asynchronously and the next CI run will clean up anything that didn't finish. That also dramatically reduces API call volume from this hook.
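
One way to keep the hook's accounting honest when the waiter times out is to report a third state instead of a plain false. The sketch below is hypothetical: `deleteWithDeadline`, `DeleteOutcome`, and the two callback parameters are illustrative names that stand in for the DeleteStackCommand call and the waiter, not the SDK's API.

```typescript
type DeleteOutcome = "deleted" | "pending" | "failed";

// Hypothetical inputs: `requestDelete` issues the delete, `waitForDelete` resolves once it is confirmed.
async function deleteWithDeadline(
  requestDelete: () => Promise<void>,
  waitForDelete: () => Promise<void>,
  maxWaitMs: number,
): Promise<DeleteOutcome> {
  try {
    await requestDelete();
  } catch {
    return "failed"; // the delete request itself was rejected
  }
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<DeleteOutcome>(resolve => {
    timer = setTimeout(() => resolve("pending"), maxWaitMs);
  });
  const confirmed = waitForDelete().then(
    (): DeleteOutcome => "deleted",
    (): DeleteOutcome => "failed",
  );
  try {
    // "pending" means the delete was accepted but not confirmed within the deadline.
    return await Promise.race([confirmed, deadline]);
  } finally {
    clearTimeout(timer);
  }
}

const accepted = async (): Promise<void> => {};
const neverConfirms = (): Promise<void> => new Promise<void>(() => {});

const slowOutcome = deleteWithDeadline(accepted, neverConfirms, 10);
const fastOutcome = deleteWithDeadline(accepted, accepted, 1000);
slowOutcome.then(outcome => console.log(outcome)); // "pending"
fastOutcome.then(outcome => console.log(outcome)); // "deleted"
```

Counting "pending" separately from "failed" would let the summary log distinguish stacks that are still deleting asynchronously from stacks whose delete was rejected.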

@Hweinstock Hweinstock Jun 9, 2026

Experimentally verified that deletion took 30-45 seconds for existing stacks. If a stack takes longer than 3 minutes, we skip it to avoid bloating the e2e test run with unnecessary cleanup time.

@github-actions github-actions Bot removed the agentcore-harness-reviewing AgentCore Harness review in progress label Jun 9, 2026
@github-actions github-actions Bot removed the size/m PR size: M label Jun 10, 2026
@github-actions github-actions Bot added the size/m PR size: M label Jun 10, 2026
@Hweinstock Hweinstock changed the title feat(ci): clean up stale stacks with global vest setup hook feat(ci): clean up stale stacks with global vitest setup hook Jun 10, 2026
@Hweinstock Hweinstock marked this pull request as ready for review June 10, 2026 13:39
@Hweinstock Hweinstock requested a review from a team June 10, 2026 13:39
Comment thread e2e-tests/global-setup.ts Outdated
bedrockCPClient.destroy();
}

logger.info(`setup finished in ${Date.now() - startTime / 1000} seconds`);

Operator precedence bug in the duration: / binds tighter than -, so this evaluates as Date.now() - (startTime / 1000). That subtracts only about 1.8 billion (startTime expressed in seconds) from a Date.now() of about 1.8 trillion milliseconds, so the log shows a nonsense duration of roughly 1.8 trillion "seconds". Should be:

logger.info(`setup finished in ${(Date.now() - startTime) / 1000} seconds`);

Log-only (no functional impact), but the two timing logs just above (lines 34 and the stack-cleanup one) already parenthesize correctly, so this one reads inconsistently.
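
The arithmetic can be checked with fixed numbers. The timestamps below are made-up values chosen to resemble epoch milliseconds in mid-2026, with the setup assumed to take 2.5 seconds.

```typescript
const startTime = 1781000000000; // hypothetical epoch milliseconds at setup start
const nowMs = startTime + 2500; // 2.5 seconds later

const buggy = nowMs - startTime / 1000; // parsed as nowMs - (startTime / 1000)
const fixed = (nowMs - startTime) / 1000;

console.log(buggy); // 1779219002500: close to nowMs itself, not a duration
console.log(fixed); // 2.5
```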

Contributor Author

good catch, lemme fix this.

@Hweinstock Hweinstock merged commit 9966e9d into aws:main Jun 10, 2026
29 of 30 checks passed
@Hweinstock Hweinstock deleted the feat/avoid-stale-stacks branch June 10, 2026 15:13
Labels

size/m PR size: M
