Replace per-run CI cleanup with capacity-aware LRU eviction#529

Merged
ghostwriternr merged 3 commits into main from fix-ci-cleanup-cache-regression on Mar 26, 2026
Conversation

Member

@ghostwriternr ghostwriternr commented Mar 26, 2026

Replace the per-run cleanup model from #512 with a pre-deploy capacity guard that evicts resources only when the account approaches its container app limit.

#512 added a cleanup job that deleted workers, containers, and image tags after every CI run. This defeated two key optimizations from #447: the content-addressed Docker cache (deleting per-run tags garbage-collected manifests shared with ci-* tags, dropping the hit rate from 100% to 30%) and deploy-skip (with workers deleted, health checks always failed, forcing a full redeploy). Average CI time doubled from 5.4 to 10.4 minutes.

The root cause is a mismatch between resource lifecycle and cleanup policy. #447 designed PR resources to be reused across runs — images persist for cache hits, workers persist for deploy-skip, and cleanup happens only at PR close or via daily sweep. #512 treated all per-run artifacts as disposable, which is correct for some resources (GHA caches) but destructive for others (Docker manifests, deployed workers).

This change removes per-run cleanup entirely and instead adds a capacity guard to the deploy step in reusable-e2e.yml. Before deploying, it checks the account's container app count against a budget of 84 (out of ~100 limit, with 12 reserved for main). If over budget, it evicts the least-recently-active PR's resources as a whole unit, using GitHub PR updatedAt as the recency signal. It protects the current PR, main workers, and any PR with in-progress CI from eviction via the GitHub Actions workflow run API. Queued runs are not protected since they have not started using workers and can redeploy if evicted. If no evictable candidate exists, the step fails fast with a clear error rather than silently breaking another PR's test run.
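The capacity-guard decision described above can be sketched in a few lines of shell. This is a minimal sketch over hypothetical stand-in data: the real step queries wrangler containers list --json and the GitHub API, and the candidates shape (pr, updatedAt, inProgress) is an illustrative simplification, not the workflow's actual data model.

```shell
#!/usr/bin/env bash
set -euo pipefail

BUDGET=84   # 84 of the ~100 container app limit, 12 reserved for main

# over_budget APP_COUNT BUDGET - true when eviction is needed.
over_budget() {
  [ "$1" -gt "$2" ]
}

# pick_victim CANDIDATES_JSON CURRENT_PR - choose the least-recently-active
# PR that is not the current PR and has no in-progress CI. Empty output
# means there is no evictable candidate and the caller should fail fast.
pick_victim() {
  echo "$1" | jq -r --argjson cur "$2" \
    '[.[] | select(.pr != $cur and .inProgress == false)]
     | sort_by(.updatedAt) | first | .pr // empty'
}

# Hypothetical stand-in for the wrangler + GitHub API lookups.
CANDIDATES='[
  {"pr": 512, "updatedAt": "2026-03-20T10:00:00Z", "inProgress": false},
  {"pr": 518, "updatedAt": "2026-03-18T09:00:00Z", "inProgress": true},
  {"pr": 520, "updatedAt": "2026-03-24T09:00:00Z", "inProgress": false}
]'

if over_budget 85 "$BUDGET"; then
  VICTIM=$(pick_victim "$CANDIDATES" 529)
  if [ -z "$VICTIM" ]; then
    echo "::error::Over budget but no evictable PR - failing fast"
    exit 1
  fi
  echo "Would evict resources for PR #$VICTIM"
fi
```

Here PR 518 is skipped because its CI is in progress, so the oldest evictable PR (512) is chosen even though 520 is also idle.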

PR-close cleanup is preserved for workers, containers, and GHA caches, but image tag deletion is removed. Deleting pr-* tags was observed to break the ci-* Docker cache, and since pr-* and ci-* tags are created via crane copy and share the same manifest, the same risk applies at PR close. The daily sweep still deletes pr-* images for closed and stale PRs where no future CI run will reference them. The cleanup trigger is changed to pull_request_target so fork PRs that ran privileged CI via ok-to-test are also cleaned up at close. Grep pipelines that previously failed under the set -eo pipefail options GitHub Actions applies to bash run steps (grep exits non-zero when nothing matches) are now guarded with || true.
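The grep guard works as follows; this is a standalone sketch with made-up tag names, not the workflow's actual cleanup script.

```shell
set -eo pipefail

TAGS=$'ci-abc123\npr-512-deadbeef\npr-518-cafef00d'

# grep exits 1 when it matches nothing, and under set -eo pipefail that
# non-zero status would abort the whole step. The || true guard turns
# "no matches" into an empty result instead of a fatal error.
PR_TAGS=$(echo "$TAGS" | grep '^pr-' || true)
echo "pr tags:"
echo "$PR_TAGS"

# Empty-match case: without || true this line would kill the script.
STALE=$(echo "$TAGS" | grep '^stale-' || true)
if [ -z "$STALE" ]; then
  echo "no stale tags, continuing"
fi
```

The same pattern applies anywhere a filter may legitimately produce zero results.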

Deploy-skip is restored for PRs whose workers survive between runs. In practice, with 7 PR slots (84 budget / 12 apps per PR) and typical concurrency of 3-5 active PRs, most PRs will keep their workers warm across multiple pushes.



PR #512 added per-run cleanup that deleted pr-* image tags after
every CI run. This broke the content-addressed Docker cache because
ci-* and pr-* tags share underlying manifests in the CF registry.
Cache hit rate dropped from 100% to 30%, doubling average CI time
from 5.4 to 10.4 minutes.

Remove image tag deletion from per-run cleanup while keeping
worker/container deletion for the container app platform limit.
Decouple the cleanup job from gate so it no longer blocks the
required status check. Move image cleanup to PR close only.

Also fix four pre-existing issues in cleanup.yml: grep pipelines
failing under set -eo pipefail on empty results, fork PR resources
leaking because pull_request event lacks secrets, the daily sweep
being blind to image-only orphans, and ci-* cache tags accumulating
without eviction against the 50GB storage limit.
The previous commit removed image deletion but still deleted workers
per-run, which defeated deploy-skip. This replaces the entire per-run
cleanup model with a pre-deploy capacity guard that evicts only when
the account approaches its ~100 container app limit.

Workers and containers now persist across runs, restoring both the
Docker cache and deploy-skip optimizations from PR #447. The capacity
guard evicts whole-PR resources in least-recently-active order, using
GitHub API for reliable active-run protection. PR-close cleanup is
scoped to workers/containers/caches only — image tags are left alone
to avoid poisoning the shared ci-* Docker cache manifests.
Only protect PRs with in-progress CI from capacity eviction. Queued
runs have not started using workers yet and can redeploy if evicted.
This widens the effective eviction pool and makes the fail-fast
scenario less likely.
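The protection check can be sketched against the payload shape of the GitHub Actions workflow-runs API (GET /repos/{owner}/{repo}/actions/runs). This is a hedged sketch over a hypothetical payload with invented branch names; the real step fetches the payload with an authenticated API call.

```shell
# protected_branches RUNS_JSON - branches with CI currently in progress.
# Queued runs are excluded on purpose: they have not started using
# workers yet and can simply redeploy if their PR's resources are evicted.
protected_branches() {
  echo "$1" | jq -r \
    '[.workflow_runs[]
      | select(.status == "in_progress")
      | .head_branch]
     | unique | .[]'
}

# Hypothetical payload in the shape returned by the workflow-runs API.
RUNS='{"workflow_runs":[
  {"status": "in_progress", "head_branch": "fix-login-flow"},
  {"status": "queued",      "head_branch": "add-metrics"},
  {"status": "completed",   "head_branch": "fix-login-flow"}
]}'

protected_branches "$RUNS"
```

Only fix-login-flow is protected here; the queued run on add-metrics does not shield its PR from eviction.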

Remove nightly ci-* image tag eviction from the daily sweep. The
50GB storage limit is not an immediate concern and the eviction was
costing ~6 minutes on the first build each day. Can be re-added if
storage pressure becomes a problem.

changeset-bot bot commented Mar 26, 2026

⚠️ No Changeset found

Latest commit: 9723314

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@ghostwriternr ghostwriternr merged commit 40ef275 into main Mar 26, 2026
15 checks passed
@ghostwriternr ghostwriternr deleted the fix-ci-cleanup-cache-regression branch March 26, 2026 16:37
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 potential issue.

View 6 additional findings in Devin Review.


Comment on lines +93 to +97
CONTAINERS=$(wrangler containers list --json 2>/dev/null)
if [ $? -ne 0 ] || [ -z "$CONTAINERS" ]; then
echo "::error::Failed to list container apps — cannot verify capacity"
exit 1
fi

🟡 $? check is unreachable due to set -e in GitHub Actions default shell

GitHub Actions runs bash with set -eo pipefail by default. When wrangler containers list --json exits non-zero at line 93, the command substitution's exit code propagates to the variable assignment, and set -e terminates the script immediately — before line 94 is reached. The if [ $? -ne 0 ] check and its descriptive error message (Failed to list container apps — cannot verify capacity) are dead code for the failure case. The step still fails correctly, but the operator sees only a generic "Process completed with exit code 1" with stderr suppressed by 2>/dev/null, making failures hard to diagnose.

The fix is to suppress set -e for this command, e.g., CONTAINERS=$(wrangler containers list --json 2>/dev/null) || true, and then check only [ -z "$CONTAINERS" ].
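The unreachable-check behavior is easy to reproduce in isolation; this is a standalone demo, not project code.

```shell
# Under set -e, a failing $(...) substitution in a plain assignment
# aborts the script immediately - the next line never runs, so a
# following [ $? -ne 0 ] check is dead code for the failure case.
OUT=$(bash -c '
  set -e
  V=$(false)                      # script exits here with status 1
  if [ $? -ne 0 ]; then
    echo "reached the check"      # never printed
  fi
' 2>&1 || true)

if [ -z "$OUT" ]; then
  echo "the inner script died before its \$? check could run"
fi
```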

Suggested change:

- CONTAINERS=$(wrangler containers list --json 2>/dev/null)
- if [ $? -ne 0 ] || [ -z "$CONTAINERS" ]; then
-   echo "::error::Failed to list container apps — cannot verify capacity"
-   exit 1
- fi
+ CONTAINERS=$(wrangler containers list --json 2>/dev/null) || true
+ if [ -z "$CONTAINERS" ] || ! echo "$CONTAINERS" | jq empty 2>/dev/null; then
+   echo "::error::Failed to list container apps — cannot verify capacity"
+   exit 1
+ fi


ghostwriternr added a commit that referenced this pull request Mar 27, 2026
The reusable-e2e workflow's deploy job requires actions:read for
the capacity guard added in #529, but the Release caller didn't
grant it, causing startup_failure on every push to main.
