Replace per-run CI cleanup with capacity-aware LRU eviction#529
ghostwriternr merged 3 commits into main
Conversation
PR #512 added per-run cleanup that deleted `pr-*` image tags after every CI run. This broke the content-addressed Docker cache because `ci-*` and `pr-*` tags share underlying manifests in the CF registry: the cache hit rate dropped from 100% to 30%, doubling average CI time from 5.4 to 10.4 minutes.

Remove image tag deletion from per-run cleanup while keeping worker/container deletion for the container app platform limit. Decouple the cleanup job from gate so it no longer blocks the required status check. Move image cleanup to PR close only.

Also fix four pre-existing issues in cleanup.yml:

- grep pipelines failing under `set -eo pipefail` on empty results
- fork PR resources leaking because the `pull_request` event lacks secrets
- the daily sweep being blind to image-only orphans
- `ci-*` cache tags accumulating without eviction against the 50GB storage limit
The previous commit removed image deletion but still deleted workers per-run, which defeated deploy-skip. This replaces the entire per-run cleanup model with a pre-deploy capacity guard that evicts only when the account approaches its ~100 container app limit. Workers and containers now persist across runs, restoring both the Docker cache and deploy-skip optimizations from PR #447. The capacity guard evicts whole-PR resources in least-recently-active order, using GitHub API for reliable active-run protection. PR-close cleanup is scoped to workers/containers/caches only — image tags are left alone to avoid poisoning the shared ci-* Docker cache manifests.
Only protect PRs with in-progress CI from capacity eviction. Queued runs have not started using workers yet and can redeploy if evicted. This widens the effective eviction pool and makes the fail-fast scenario less likely. Remove nightly ci-* image tag eviction from the daily sweep. The 50GB storage limit is not an immediate concern and the eviction was costing ~6 minutes on the first build each day. Can be re-added if storage pressure becomes a problem.
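The in-progress-only protection rule can be sketched as a jq filter over the GitHub Actions "list workflow runs" response shape. This is an illustrative sketch with the API response stubbed as sample JSON, not the workflow's actual code:

```shell
# Only PRs with in-progress runs are protected from eviction; queued runs
# have not started using workers yet, so their PRs remain evictable.
# Sample data mirroring the workflow-runs API response shape (stubbed).
RUNS='{"workflow_runs":[
  {"status":"in_progress","pull_requests":[{"number":101}]},
  {"status":"queued","pull_requests":[{"number":102}]}
]}'
PROTECTED=$(echo "$RUNS" | jq -r \
  '.workflow_runs[] | select(.status == "in_progress") | .pull_requests[].number')
echo "protected PRs: $PROTECTED"
```

Here only PR 101 ends up protected; the queued run's PR 102 stays in the eviction pool.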
```shell
CONTAINERS=$(wrangler containers list --json 2>/dev/null)
if [ $? -ne 0 ] || [ -z "$CONTAINERS" ]; then
  echo "::error::Failed to list container apps — cannot verify capacity"
  exit 1
fi
```
🟡 $? check is unreachable due to set -e in GitHub Actions default shell
GitHub Actions runs bash with set -eo pipefail by default. When wrangler containers list --json exits non-zero at line 93, the command substitution's exit code propagates to the variable assignment, and set -e terminates the script immediately — before line 94 is reached. The if [ $? -ne 0 ] check and its descriptive error message (Failed to list container apps — cannot verify capacity) are dead code for the failure case. The step still fails correctly, but the operator sees only a generic "Process completed with exit code 1" with stderr suppressed by 2>/dev/null, making failures hard to diagnose.
The fix is to suppress set -e for this command, e.g., CONTAINERS=$(wrangler containers list --json 2>/dev/null) || true, and then check only [ -z "$CONTAINERS" ].
Suggested change:

```diff
-CONTAINERS=$(wrangler containers list --json 2>/dev/null)
-if [ $? -ne 0 ] || [ -z "$CONTAINERS" ]; then
-  echo "::error::Failed to list container apps — cannot verify capacity"
-  exit 1
-fi
+CONTAINERS=$(wrangler containers list --json 2>/dev/null) || true
+if [ -z "$CONTAINERS" ] || ! echo "$CONTAINERS" | jq empty 2>/dev/null; then
+  echo "::error::Failed to list container apps — cannot verify capacity"
+  exit 1
+fi
```
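The dead-code behavior is easy to reproduce in a minimal script, assuming bash with the Actions default options:

```shell
set -eo pipefail
# Unguarded, `OUT=$(false)` would abort the script right here under set -e,
# so any following `[ $? -ne 0 ]` check is never reached. The `|| true`
# guard keeps execution alive so the emptiness check can run and print a
# diagnosable error instead of a bare "exit code 1".
OUT=$(false) || true
if [ -z "$OUT" ]; then
  echo "command failed or produced no output"
fi
```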
The reusable-e2e workflow's deploy job requires actions:read for the capacity guard added in #529, but the Release caller didn't grant it, causing startup_failure on every push to main.
Replace the per-run cleanup model from #512 with a pre-deploy capacity guard that evicts resources only when the account approaches its container app limit.
#512 added a cleanup job that deleted workers, containers, and image tags after every CI run. This defeated two key optimizations from #447: the content-addressed Docker cache (ci-* tags invalidated via shared manifest GC, hit rate dropped from 100% to 30%) and deploy-skip (workers deleted so health checks always fail). Average CI time doubled from 5.4 to 10.4 minutes.
The root cause is a mismatch between resource lifecycle and cleanup policy. #447 designed PR resources to be reused across runs — images persist for cache hits, workers persist for deploy-skip, and cleanup happens only at PR close or via daily sweep. #512 treated all per-run artifacts as disposable, which is correct for some resources (GHA caches) but destructive for others (Docker manifests, deployed workers).
This change removes per-run cleanup entirely and instead adds a capacity guard to the deploy step in reusable-e2e.yml. Before deploying, it checks the account's container app count against a budget of 84 (out of ~100 limit, with 12 reserved for main). If over budget, it evicts the least-recently-active PR's resources as a whole unit, using GitHub PR updatedAt as the recency signal. It protects the current PR, main workers, and any PR with in-progress CI from eviction via the GitHub Actions workflow run API. Queued runs are not protected since they have not started using workers and can redeploy if evicted. If no evictable candidate exists, the step fails fast with a clear error rather than silently breaking another PR's test run.
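The guard's eviction choice reduces to a sort on recency. A minimal sketch, with the wrangler and GitHub API inputs stubbed out and all variable names (`BUDGET`, `APP_COUNT`, `CANDIDATES`) hypothetical rather than taken from the workflow:

```shell
set -eo pipefail
BUDGET=84
APP_COUNT=90   # stub for: wrangler containers list --json | jq length
# Evictable PRs with their GitHub updatedAt timestamps (stubbed; the real
# guard would exclude the current PR, main, and in-progress PRs first).
CANDIDATES='7104 2024-05-01T10:00:00Z
7099 2024-04-28T09:00:00Z'

if [ "$APP_COUNT" -gt "$BUDGET" ]; then
  # Least-recently-active PR = oldest updatedAt; ISO 8601 sorts lexically.
  VICTIM=$(echo "$CANDIDATES" | sort -k2 | head -n1 | awk '{print $1}')
  if [ -z "$VICTIM" ]; then
    echo "::error::Over budget with no evictable PR, failing fast"
    exit 1
  fi
  echo "evict pr-$VICTIM"
fi
```

With the stubbed data above, the guard picks pr-7099 (the older `updatedAt`) for whole-unit eviction.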
PR-close cleanup is preserved for workers, containers, and GHA caches, but image tag deletion is removed. Deleting pr-* tags was observed to break the ci-* Docker cache, and since pr-* and ci-* tags are created via crane copy and share the same manifest, the same risk applies at PR close. The daily sweep still deletes pr-* images for closed and stale PRs where no future CI run will reference them. The cleanup trigger is changed to pull_request_target so fork PRs that ran privileged CI via ok-to-test also get cleaned at close. Grep pipelines that previously failed under the Actions default shell's set -eo pipefail when no matches existed are fixed with || true guards.
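The grep fix pattern is worth spelling out, since it trips both `-e` and `pipefail`. A minimal sketch with made-up tag names:

```shell
set -eo pipefail
# grep exits 1 when nothing matches, which under pipefail fails the whole
# pipeline and under set -e kills the step; `|| true` turns "no matches"
# into an empty result the script can handle.
TAGS=$(printf 'ci-abc\nci-def\n' | grep '^pr-' || true)
if [ -z "$TAGS" ]; then
  echo "no pr-* tags to delete"
fi
```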
Deploy-skip is restored for PRs whose workers survive between runs. In practice, with 7 PR slots (84 budget / 12 apps per PR) and typical concurrency of 3-5 active PRs, most PRs will keep their workers warm across multiple pushes.