Replace per-run CI cleanup with capacity-aware LRU eviction#529
ghostwriternr merged 3 commits into main
Conversation
PR #512 added per-run cleanup that deleted `pr-*` image tags after every CI run. This broke the content-addressed Docker cache because `ci-*` and `pr-*` tags share underlying manifests in the CF registry: the cache hit rate dropped from 100% to 30%, doubling average CI time from 5.4 to 10.4 minutes.

Remove image tag deletion from per-run cleanup while keeping worker/container deletion for the container app platform limit. Decouple the cleanup job from gate so it no longer blocks the required status check. Move image cleanup to PR close only.

Also fix four pre-existing issues in cleanup.yml:

- grep pipelines failing under `set -eo pipefail` on empty results
- fork PR resources leaking because the `pull_request` event lacks secrets
- the daily sweep being blind to image-only orphans
- `ci-*` cache tags accumulating without eviction against the 50GB storage limit
The previous commit removed image deletion but still deleted workers per-run, which defeated deploy-skip. This replaces the entire per-run cleanup model with a pre-deploy capacity guard that evicts only when the account approaches its ~100 container app limit. Workers and containers now persist across runs, restoring both the Docker cache and deploy-skip optimizations from PR #447. The capacity guard evicts whole-PR resources in least-recently-active order, using GitHub API for reliable active-run protection. PR-close cleanup is scoped to workers/containers/caches only — image tags are left alone to avoid poisoning the shared ci-* Docker cache manifests.
Only protect PRs with in-progress CI from capacity eviction. Queued runs have not started using workers yet and can redeploy if evicted. This widens the effective eviction pool and makes the fail-fast scenario less likely. Remove nightly ci-* image tag eviction from the daily sweep. The 50GB storage limit is not an immediate concern and the eviction was costing ~6 minutes on the first build each day. Can be re-added if storage pressure becomes a problem.
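The in-progress-only protection rule can be sketched as a jq filter over the GitHub Actions "list workflow runs" response shape. This is an illustrative sketch with the API response stubbed as sample JSON, not the workflow's actual code:

```shell
# Only PRs with in-progress runs are protected from eviction; queued runs
# have not started using workers yet, so their PRs remain evictable.
# Sample data mirroring the workflow-runs API response shape (stubbed).
RUNS='{"workflow_runs":[
  {"status":"in_progress","pull_requests":[{"number":101}]},
  {"status":"queued","pull_requests":[{"number":102}]}
]}'
PROTECTED=$(echo "$RUNS" | jq -r \
  '.workflow_runs[] | select(.status == "in_progress") | .pull_requests[].number')
echo "protected PRs: $PROTECTED"
```

Here only PR 101 ends up protected; the queued run's PR 102 stays in the eviction pool.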
```shell
CONTAINERS=$(wrangler containers list --json 2>/dev/null)
if [ $? -ne 0 ] || [ -z "$CONTAINERS" ]; then
  echo "::error::Failed to list container apps — cannot verify capacity"
  exit 1
fi
```
🟡 $? check is unreachable due to set -e in GitHub Actions default shell
GitHub Actions runs bash with set -eo pipefail by default. When wrangler containers list --json exits non-zero at line 93, the command substitution's exit code propagates to the variable assignment, and set -e terminates the script immediately — before line 94 is reached. The if [ $? -ne 0 ] check and its descriptive error message (Failed to list container apps — cannot verify capacity) are dead code for the failure case. The step still fails correctly, but the operator sees only a generic "Process completed with exit code 1" with stderr suppressed by 2>/dev/null, making failures hard to diagnose.
The fix is to suppress set -e for this command, e.g., CONTAINERS=$(wrangler containers list --json 2>/dev/null) || true, and then check only [ -z "$CONTAINERS" ].
Suggested change:

```diff
-CONTAINERS=$(wrangler containers list --json 2>/dev/null)
-if [ $? -ne 0 ] || [ -z "$CONTAINERS" ]; then
-  echo "::error::Failed to list container apps — cannot verify capacity"
-  exit 1
-fi
+CONTAINERS=$(wrangler containers list --json 2>/dev/null) || true
+if [ -z "$CONTAINERS" ] || ! echo "$CONTAINERS" | jq empty 2>/dev/null; then
+  echo "::error::Failed to list container apps — cannot verify capacity"
+  exit 1
+fi
```
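The dead-code behavior is easy to reproduce in a minimal script, assuming bash with the Actions default options:

```shell
set -eo pipefail
# Unguarded, `OUT=$(false)` would abort the script right here under set -e,
# so any following `[ $? -ne 0 ]` check is never reached. The `|| true`
# guard keeps execution alive so the emptiness check can run and print a
# diagnosable error instead of a bare "exit code 1".
OUT=$(false) || true
if [ -z "$OUT" ]; then
  echo "command failed or produced no output"
fi
```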
The reusable-e2e workflow's deploy job requires actions:read for the capacity guard added in #529, but the Release caller didn't grant it, causing startup_failure on every push to main.
Replace the per-run cleanup model from #512 with a pre-deploy capacity guard that evicts resources only when the account approaches its container app limit.
#512 added a cleanup job that deleted workers, containers, and image tags after every CI run. This defeated two key optimizations from #447: the content-addressed Docker cache (ci-* tags invalidated via shared manifest GC, hit rate dropped from 100% to 30%) and deploy-skip (workers deleted so health checks always fail). Average CI time doubled from 5.4 to 10.4 minutes.
The root cause is a mismatch between resource lifecycle and cleanup policy. #447 designed PR resources to be reused across runs — images persist for cache hits, workers persist for deploy-skip, and cleanup happens only at PR close or via daily sweep. #512 treated all per-run artifacts as disposable, which is correct for some resources (GHA caches) but destructive for others (Docker manifests, deployed workers).
This change removes per-run cleanup entirely and instead adds a capacity guard to the deploy step in reusable-e2e.yml. Before deploying, it checks the account's container app count against a budget of 84 (out of ~100 limit, with 12 reserved for main). If over budget, it evicts the least-recently-active PR's resources as a whole unit, using GitHub PR updatedAt as the recency signal. It protects the current PR, main workers, and any PR with in-progress CI from eviction via the GitHub Actions workflow run API. Queued runs are not protected since they have not started using workers and can redeploy if evicted. If no evictable candidate exists, the step fails fast with a clear error rather than silently breaking another PR's test run.
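The guard's eviction choice reduces to a sort on recency. A minimal sketch, with the wrangler and GitHub API inputs stubbed out and all variable names (`BUDGET`, `APP_COUNT`, `CANDIDATES`) hypothetical rather than taken from the workflow:

```shell
set -eo pipefail
BUDGET=84
APP_COUNT=90   # stub for: wrangler containers list --json | jq length
# Evictable PRs with their GitHub updatedAt timestamps (stubbed; the real
# guard would exclude the current PR, main, and in-progress PRs first).
CANDIDATES='7104 2024-05-01T10:00:00Z
7099 2024-04-28T09:00:00Z'

if [ "$APP_COUNT" -gt "$BUDGET" ]; then
  # Least-recently-active PR = oldest updatedAt; ISO 8601 sorts lexically.
  VICTIM=$(echo "$CANDIDATES" | sort -k2 | head -n1 | awk '{print $1}')
  if [ -z "$VICTIM" ]; then
    echo "::error::Over budget with no evictable PR, failing fast"
    exit 1
  fi
  echo "evict pr-$VICTIM"
fi
```

With the stubbed data above, the guard picks pr-7099 (the older `updatedAt`) for whole-unit eviction.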
PR-close cleanup is preserved for workers, containers, and GHA caches, but image tag deletion is removed. Deleting pr-* tags was observed to break the ci-* Docker cache, and since pr-* and ci-* tags are created via crane copy and share the same manifest, the same risk applies at PR close. The daily sweep still deletes pr-* images for closed and stale PRs where no future CI run will reference them. The cleanup trigger is changed to pull_request_target so fork PRs that ran privileged CI via ok-to-test also get cleaned at close. Grep pipelines that previously failed under the Actions default shell's set -eo pipefail when no matches existed are fixed with || true guards.
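The grep fix pattern is worth spelling out, since it trips both `-e` and `pipefail`. A minimal sketch with made-up tag names:

```shell
set -eo pipefail
# grep exits 1 when nothing matches, which under pipefail fails the whole
# pipeline and under set -e kills the step; `|| true` turns "no matches"
# into an empty result the script can handle.
TAGS=$(printf 'ci-abc\nci-def\n' | grep '^pr-' || true)
if [ -z "$TAGS" ]; then
  echo "no pr-* tags to delete"
fi
```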
Deploy-skip is restored for PRs whose workers survive between runs. In practice, with 7 PR slots (84 budget / 12 apps per PR) and typical concurrency of 3-5 active PRs, most PRs will keep their workers warm across multiple pushes.