Overhaul CI pipeline and eliminate E2E test flakiness#447
ghostwriternr merged 43 commits into main
Conversation
🦋 Changeset detected. Latest commit: 36f4e47. The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
Removed:
- pullrequest.yml (replaced by pr.yml)
- pkg-pr-new.yml (absorbed into pr.yml publish-preview job)
- sync-docs.yml (dead code since Feb 2026)
- cleanup-stale.yml (absorbed into cleanup.yml daily schedule)
- scripts/cleanup-test-deployment.sh (logic inlined into cleanup.yml)
OpenCode Review
This is an impressive CI/CD architecture overhaul that transforms a monolithic 18-25 minute pipeline into a modular, intelligent system with change detection and parallelism. The compositional approach using reusable workflows is excellent.
Architectural highlights:
- Change detection via dorny/paths-filter prevents unnecessary work (docs-only PRs run nothing, SDK changes skip Docker entirely)
- Docker Buildx Bake replaces 4 separate builds with unified configuration and shared cache layers
- Reusable workflows enable proper job parallelism (quality checks, E2E suites run concurrently)
- Smart cleanup consolidates scattered logic into focused jobs with proper timeouts
Technical improvements:
- Bun version pinning via the .bun-version file ensures consistency
- crane copy for release publishing eliminates rebuild drift
- Turbo remote caching for build acceleration
- Container instance type standardization (standard-4)
Two concerns:
- Missing concurrency safeguards in reusable-e2e.yml: The deploy job creates both HTTP and WebSocket workers sequentially, but both reference the same image tag. If a container rollout is slow, the second deploy might use stale images. Consider adding a brief wait or status check between deployments when images_changed is true.
- Hardcoded Bun version fallback in docker-bake.hcl: The default BUN_VERSION=1 seems inconsistent with the new .bun-version file approach. Should this default match the file content or be removed entirely?
Overall: This modernizes the CI pipeline significantly and should dramatically improve developer experience. The change detection alone will save substantial CI time.
with:
  node-version: 24
  cache: 'npm'
Both workers deploy the same image tag but run sequentially. If container rollout is slow, the WebSocket worker might deploy before HTTP worker's containers are fully rolled out, potentially using stale images. Consider adding a brief wait between deploys when images_changed is true.
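The suggested wait can be made concrete as a small polling helper. This is a hedged sketch, not the project's workflow code: `probe` and `expectedTag` are hypothetical stand-ins for however the deploy surfaces its currently-served image tag.

```typescript
// Hypothetical sketch: poll a status probe until it reports the expected
// image tag, so a second deploy never starts against a stale rollout.
async function waitForRollout(
  probe: () => Promise<string>, // returns the tag currently serving traffic
  expectedTag: string,
  { timeoutMs = 120_000, intervalMs = 2_000 } = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if ((await probe()) === expectedTag) return;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`rollout did not reach ${expectedTag} within ${timeoutMs}ms`);
}
```

Called between the HTTP and WebSocket deploy steps (only when images_changed is true), this turns the race into an explicit readiness gate rather than a fixed sleep.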
This sequential deployment pattern still exists in the new reusable workflow. Consider adding a brief sleep or container status check between the HTTP and WebSocket deployments when images_changed is true to ensure container rollout completion.
This sequential deployment issue persists after the rewrite. The risk remains: if container rollout is slow, the WebSocket worker could deploy with stale images before the HTTP worker's containers complete rollout. Consider adding a brief wait or container status check between deployments when images_changed is true.
inherits = ["_common"]
target = "musl"
tags = ["sandbox-musl:${TAG}"]
cache-from = CACHE_REPO != "" ? ["type=registry,ref=${CACHE_REPO}:musl"] : []
Default BUN_VERSION=1 seems inconsistent with the new pinned .bun-version file approach. Should this match .bun-version content or require explicit override?
The inconsistency persists. Since .bun-version now pins to 1.3, the docker-bake.hcl default should either match this (BUN_VERSION=1.3) or require explicit override. The current setup could lead to version mismatches if someone doesn't pass the build arg.
The version inconsistency still exists after the rewrite. With .bun-version pinned to 1.3, the docker-bake.hcl default of BUN_VERSION=1 creates a mismatch. This should either default to 1.3 or explicitly require the build arg to avoid version drift.
This version mismatch persists after the major workflow restructure. The new .bun-version file pins to 1.3, but docker-bake.hcl still defaults to BUN_VERSION=1. This inconsistency could lead to different Bun versions being used when the build arg isn't explicitly passed.
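One way to keep the two sources of truth honest is a CI guard that parses the bake file's default and compares it to the pinned file. A sketch under assumptions: the regex expects a `default = "…"` inside a `variable "BUN_VERSION"` block, and `checkBunVersion` is an illustrative name, not something in the repo.

```typescript
// Illustrative CI guard: fail fast if docker-bake.hcl's BUN_VERSION default
// drifts from the pinned .bun-version file.
function checkBunVersion(bakeHcl: string, pinned: string): void {
  const m = bakeHcl.match(
    /variable\s+"BUN_VERSION"\s*{[^}]*default\s*=\s*"([^"]+)"/
  );
  if (!m) throw new Error("no BUN_VERSION default found in docker-bake.hcl");
  if (m[1] !== pinned.trim()) {
    throw new Error(`BUN_VERSION default ${m[1]} != pinned ${pinned.trim()}`);
  }
}

// Usage in a CI step (paths assumed):
//   import { readFileSync } from "node:fs";
//   checkBunVersion(readFileSync("docker-bake.hcl", "utf8"),
//                   readFileSync(".bun-version", "utf8"));
```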
/global/health returns 200 unconditionally before OpenCode initializes. /path goes through the instance middleware, ensuring InstanceBootstrap completes before declaring ready.
All cleanup now goes through stopWatch() which is idempotent via stopPromise, awaits graceful exit with SIGKILL fallback, and cleans up stale same-path watches before spawning new ones.
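The idempotent-stop-with-escalation pattern described above can be sketched as follows; `ProcLike`, `Watch`, and `graceMs` are illustrative stand-ins, not the sandbox's real types.

```typescript
// Minimal sketch: memoize the teardown in stopPromise so concurrent callers
// await the same shutdown instead of racing, and escalate SIGTERM -> SIGKILL
// if the child does not exit within the grace period.
interface ProcLike {
  kill(signal: "SIGTERM" | "SIGKILL"): void;
  waitForExit(): Promise<void>;
}

class Watch {
  private stopPromise: Promise<void> | null = null;
  constructor(private proc: ProcLike, private graceMs = 2_000) {}

  stopWatch(): Promise<void> {
    // Idempotent: every caller gets the same in-flight teardown promise.
    this.stopPromise ??= (async () => {
      this.proc.kill("SIGTERM");
      const timer = setTimeout(() => this.proc.kill("SIGKILL"), this.graceMs);
      try {
        await this.proc.waitForExit(); // graceful exit preferred
      } finally {
        clearTimeout(timer); // cancel escalation once the child is gone
      }
    })();
    return this.stopPromise;
  }
}
```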
Replace github/gitignore with octocat/Spoon-Knife to reduce clone time and network failure surface in CI.
All 23 test files shared a single sandbox, causing flaky failures from cross-test contamination of filesystem, ports, and processes. Each file now creates and tears down its own sandbox, and the automatic retry workaround is removed.
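The per-file lifecycle reduces to a create/try/finally shape. A framework-free sketch; the PR's `createTestSandbox`/`cleanupTestSandbox` helpers are modeled here by injected `create`/`destroy` functions.

```typescript
// Illustrative per-file isolation: each test file owns one sandbox and
// always tears it down, even when the body throws, so filesystem, port,
// and process state cannot leak into the next file.
interface Sandbox {
  id: string;
}

async function withSandbox<T>(
  create: () => Promise<Sandbox>,
  destroy: (s: Sandbox) => Promise<void>,
  body: (s: Sandbox) => Promise<T>
): Promise<T> {
  const sandbox = await create();
  try {
    return await body(sandbox);
  } finally {
    // Runs on success and on failure alike.
    await destroy(sandbox);
  }
}
```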
Love this!
The crane copy loop and PR comment template now cover all five published variants, matching the release workflow.
The container's handleStream registered output listeners AFTER replaying buffered logs, creating a window where output emitted between the buffer snapshot and listener registration was lost. This caused intermittent waitForLog timeouts on HTTP transport while WebSocket was unaffected. Register listeners before replay so no output can slip through the gap. Remove the now-dead streamProcessLogs service method the handler no longer calls, and switch test-worker process lookups from listProcesses().find() to getProcess() to match real SDK usage patterns.
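The ordering fix can be modeled with a minimal buffered source: attach the listener first, then replay the snapshot, so a chunk can at worst be delivered twice but never zero times. All names here are illustrative, not the container's real API.

```typescript
// Minimal model of the fix. A Source buffers everything it has emitted;
// subscribe() listens first and replays history second, closing the window
// in which output could be emitted but never delivered.
interface Source {
  buffer: string[];
  listeners: Set<(chunk: string) => void>;
}

function subscribe(src: Source, onChunk: (chunk: string) => void): () => void {
  src.listeners.add(onChunk);        // 1. listen first: closes the gap
  for (const chunk of src.buffer) {  // 2. then replay buffered history
    onChunk(chunk);
  }
  return () => src.listeners.delete(onChunk);
}

function emit(src: Source, chunk: string): void {
  src.buffer.push(chunk);
  for (const l of src.listeners) l(chunk);
}
```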
Add biome.json and changesets to path filters so config and changeset-only PRs trigger the right jobs. Include workflows in needs-docker so E2E has images available on pipeline-only changes. Replace bare wait with per-PID waits to surface individual crane copy failures. Catch cancelled jobs in the gate check alongside failures. Remove workflows from publish-preview trigger to avoid noisy npm previews.
Set fetch-depth: 1 on 11 checkout steps that only need the current commit (build, quality, e2e, performance). The two checkouts that need full history (publish-preview, release) already set fetch-depth: 0 and are left unchanged.
Build job populates the cache on first run, downstream jobs restore it and skip npm ci entirely. Saves ~85-100s aggregate compute per PR push by eliminating 7 redundant installs.
Cache workspace-local node_modules directories alongside root to ensure packages like @cloudflare/kumo (in examples/desktop/) survive cache restore. Use restore-only action in downstream jobs since only the build job needs to save. Add missing zod peer dep for kumo.
Document the reusable workflow architecture, four optimization layers, Docker image topology, and key design decisions.
Accept main's removal of streamProcessLogs from ProcessService, which was deleted in PR #447.
await existingProcess.waitForPort(port, {
  mode: 'http',
-  path: '/global/health',
+  path: '/path',
🔴 OpenCode readiness probe path changed to invalid endpoint /path
The waitForPort readiness probe path was changed from /global/health to /path in all three call sites within opencode.ts. The endpoint /path is not a valid OpenCode server endpoint — it's a nonsensical path that will likely return 404, causing the readiness check to time out after 180 seconds (the OPENCODE_STARTUP_TIMEOUT_MS). The original /global/health was a real OpenCode API endpoint that returns {healthy: true, version: ...}, as confirmed by the E2E test at tests/e2e/opencode-workflow.test.ts:116 which still uses /global/health, and the CHANGELOG which explicitly documents this endpoint choice. The test at packages/sandbox/tests/opencode/opencode.test.ts:149 was updated to expect /path, masking the bug in unit tests while the E2E test still validates the real endpoint.
Prompt for agents
In packages/sandbox/src/opencode/opencode.ts, change all three occurrences of path: '/path' back to path: '/global/health' (lines 98, 138, and 232). Also revert the corresponding test expectations in packages/sandbox/tests/opencode/opencode.test.ts lines 149 and 214 from path: '/path' back to path: '/global/health'.
The refactored stream construction in handleStream (a shared enqueueSSE helper replaces per-event TextEncoder boilerplate, and listeners are registered before the buffer replay):

const process = processResult.data;

const stream = new ReadableStream({
  start(controller) {
    const encoder = new TextEncoder();
    const enqueueSSE = (payload: Record<string, unknown>) => {
      controller.enqueue(
        encoder.encode(`data: ${JSON.stringify(payload)}\n\n`)
      );
    };

    // Send initial process info
    enqueueSSE({
      type: 'process_info',
      processId: process.id,
      command: process.command,
      status: process.status,
      timestamp: new Date().toISOString()
    });

    // Register listeners BEFORE replaying buffered output so no
    // chunk emitted between the snapshot read and registration is
    // lost. Duplicates from replay + listener overlap are harmless;
    // missing data is not.
    const outputListener = (stream: 'stdout' | 'stderr', data: string) => {
      enqueueSSE({
        type: stream,
        data,
        processId: process.id,
        timestamp: new Date().toISOString()
      });
    };

    const statusListener = (status: string) => {
      if (['completed', 'failed', 'killed', 'error'].includes(status)) {
        enqueueSSE({
          type: 'exit',
          processId: process.id,
          exitCode: process.exitCode,
          data: `Process ${status} with exit code ${process.exitCode}`,
          timestamp: new Date().toISOString()
        });
        controller.close();
      }
    };

    process.outputListeners.add(outputListener);
    process.statusListeners.add(statusListener);

    // Replay buffered output collected before the listener was added
    if (process.stdout) {
      enqueueSSE({
        type: 'stdout',
        data: process.stdout,
        processId: process.id,
        timestamp: new Date().toISOString()
      });
    }

    if (process.stderr) {
      enqueueSSE({
        type: 'stderr',
        data: process.stderr,
        processId: process.id,
        timestamp: new Date().toISOString()
      });
    }

    // If the process already completed, emit the exit event immediately
    if (
      ['completed', 'failed', 'killed', 'error'].includes(process.status)
    ) {
      enqueueSSE({
        type: 'exit',
        processId: process.id,
        exitCode: process.exitCode,
        data: `Process ${process.status} with exit code ${process.exitCode}`,
        timestamp: new Date().toISOString()
      });
      controller.close();
    }

    return () => {
      process.outputListeners.delete(outputListener);
      process.statusListeners.delete(statusListener);
    };
  }
});

return new Response(stream, {
  status: 200,
  headers: {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
    ...context.corsHeaders
  }
});
🔴 SSE process stream leaks listeners on client disconnect — missing cancel() handler
The refactored handleStream method creates a ReadableStream with listener registration in start() and returns a cleanup function from start(). However, the ReadableStream spec ignores the return value of start() — it is not a cleanup mechanism. When a client disconnects (stream cancellation), there is no cancel() callback to remove the outputListener and statusListener from the process's listener sets. The old code in process-service.ts:streamProcessLogs (now deleted at line 467) had a proper cancel() handler on its ReadableStream. The replacement code in the handler should add a cancel() method to the ReadableStream constructor options to delete the listeners, matching the pattern used in packages/sandbox-container/src/services/watch-service.ts:467-471.
Prompt for agents
In packages/sandbox-container/src/handlers/process-handler.ts, the ReadableStream created in handleStream (starting around line 232) needs a cancel() method added alongside the start() method. Move the cleanup logic from the return statement inside start() (lines 311-314) into a cancel() callback on the ReadableStream options object:
Change the ReadableStream constructor from:
new ReadableStream({
start(controller) {
...
return () => {
process.outputListeners.delete(outputListener);
process.statusListeners.delete(statusListener);
};
}
});
To:
new ReadableStream({
start(controller) {
...
// Remove the return () => { ... } from here
},
cancel() {
process.outputListeners.delete(outputListener);
process.statusListeners.delete(statusListener);
}
});
Note: The outputListener and statusListener variables need to be hoisted to the outer scope of the ReadableStream constructor so cancel() can reference them, or you can use a shared closure variable.
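The reviewer's suggested shape can be exercised end to end with a minimal stand-in for the process record; `Proc` and `makeLogStream` are illustrative names, not the handler's real types.

```typescript
// Sketch of the suggested fix: hoist the listener to the constructor's
// closure so a cancel() handler can detach it when the client disconnects.
// The Streams spec ignores start()'s return value, so cleanup must live
// in cancel(), not in a function returned from start().
interface Proc {
  outputListeners: Set<(stream: string, data: string) => void>;
}

function makeLogStream(proc: Proc): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  let outputListener: ((stream: string, data: string) => void) | null = null;

  return new ReadableStream<Uint8Array>({
    start(controller) {
      outputListener = (stream, data) => {
        controller.enqueue(
          encoder.encode(`data: ${JSON.stringify({ type: stream, data })}\n\n`)
        );
      };
      proc.outputListeners.add(outputListener);
    },
    cancel() {
      // Runs when the consumer cancels (client disconnect): this is the
      // only reliable place to remove the listener.
      if (outputListener) proc.outputListeners.delete(outputListener);
    },
  });
}
```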
* Stop per-run image deletion to restore Docker cache

  PR #512 added per-run cleanup that deleted pr-* image tags after every CI run. This broke the content-addressed Docker cache because ci-* and pr-* tags share underlying manifests in the CF registry. Cache hit rate dropped from 100% to 30%, doubling average CI time from 5.4 to 10.4 minutes. Remove image tag deletion from per-run cleanup while keeping worker/container deletion for the container app platform limit. Decouple the cleanup job from gate so it no longer blocks the required status check. Move image cleanup to PR close only. Also fix four pre-existing issues in cleanup.yml: grep pipelines failing under set -eo pipefail on empty results, fork PR resources leaking because the pull_request event lacks secrets, the daily sweep being blind to image-only orphans, and ci-* cache tags accumulating without eviction against the 50GB storage limit.

* Replace per-run cleanup with capacity-aware LRU eviction

  The previous commit removed image deletion but still deleted workers per-run, which defeated deploy-skip. This replaces the entire per-run cleanup model with a pre-deploy capacity guard that evicts only when the account approaches its ~100 container app limit. Workers and containers now persist across runs, restoring both the Docker cache and deploy-skip optimizations from PR #447. The capacity guard evicts whole-PR resources in least-recently-active order, using the GitHub API for reliable active-run protection. PR-close cleanup is scoped to workers/containers/caches only; image tags are left alone to avoid poisoning the shared ci-* Docker cache manifests.

* Drop queued-run protection and remove nightly cache eviction

  Only protect PRs with in-progress CI from capacity eviction. Queued runs have not started using workers yet and can redeploy if evicted. This widens the effective eviction pool and makes the fail-fast scenario less likely. Remove nightly ci-* image tag eviction from the daily sweep. The 50GB storage limit is not an immediate concern and the eviction was costing ~6 minutes on the first build each day. Can be re-added if storage pressure becomes a problem.
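The capacity guard's eviction order can be sketched as a pure planning function. `PrGroup` and its fields are illustrative stand-ins; real active-run protection would come from the GitHub API as described above.

```typescript
// Illustrative capacity guard: when the account nears its container app
// limit, evict whole-PR resource groups in least-recently-active order,
// skipping any PR with CI currently in progress.
interface PrGroup {
  pr: number;
  lastActive: number; // epoch ms of the PR's most recent CI activity
  appCount: number;   // container apps this PR's resources occupy
  inProgress: boolean;
}

function planEvictions(
  groups: PrGroup[],
  totalApps: number,
  limit: number
): number[] {
  let excess = totalApps - limit;
  if (excess <= 0) return []; // under capacity: nothing to do
  const evictable = groups
    .filter((g) => !g.inProgress)                 // never evict active CI
    .sort((a, b) => a.lastActive - b.lastActive); // oldest activity first
  const plan: number[] = [];
  for (const g of evictable) {
    if (excess <= 0) break;
    plan.push(g.pr);
    excess -= g.appCount;
  }
  return plan;
}
```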
CI was both slow and unreliable. Every PR ran the full pipeline regardless of what changed (~18-25 min wall clock), Docker images were rebuilt from scratch in both PR and release paths, and quality checks ran sequentially. E2E tests compounded this — 17 of 23 test files shared a single sandbox, leaking filesystem/port/process state between tests. The OpenCode readiness probe hit a no-op endpoint. File watch cleanup had race conditions. The SSE stream handler had a listener registration ordering bug.
retry: 1 papered over all of it. This PR replaces the monolithic CI with a compositional architecture and fixes the structural causes of E2E flakiness, so that a green pipeline actually means something.
Pipeline architecture
Three reusable workflows (reusable-build, reusable-quality, reusable-e2e) composed by both pr.yml and release.yml. Quality checks (lint+types, SDK tests, container tests) and E2E suites (HTTP, WebSocket, browser) run in parallel. dorny/paths-filter at the entry point skips entire workflow trees when changes don't affect them: a docs-only PR runs nothing; an SDK-only change skips Docker entirely.

Docker Buildx Bake with a single HCL config. 4 image variants built in one invocation with shared base layers and a single registry cache round-trip, replacing 4 separate docker build commands with independent cache pulls.

crane copy for releases. The old release rebuilt Docker images from scratch despite identical images already existing from E2E. Now we retag the tested image: no rebuild, no drift between what was tested and what ships.

Unified cleanup with explicit container teardown. Worker deletion does not delete containers. cleanup.yml now handles both PR-close and daily-sweep triggers, with explicit container deletion and per-step timeouts.

E2E test isolation
Every test file now creates and destroys its own sandbox via createTestSandbox()/cleanupTestSandbox(), with correct X-Sandbox-Type header routing for typed variants (python, opencode, desktop, musl, standalone). retry: 1 is removed: tests pass on first attempt or they fail.

SDK and container fixes caught by the audit
OpenCode readiness probe. /global/health unconditionally returns { healthy: true } regardless of init state. Changed to probe /path, which only succeeds after the binary is actually ready. 3 call sites in opencode.ts.

Watch process lifecycle. watchDirectory() had a race where the 4-hop cancel chain (SDK → DO → container → shell) was slower than a new watch starting. Added centralized stopWatch() with SIGTERM→SIGKILL escalation and stale watcher cleanup before spawning.

SSE stream listener ordering. The container's handleStream registered output listeners after replaying buffered logs, creating a window where output emitted between the buffer snapshot and listener registration was lost. This caused intermittent waitForLog timeouts on HTTP transport while WebSocket was unaffected (different streaming path). Fixed by registering listeners before replay. Also removed the now-dead streamProcessLogs service method the handler no longer calls.

Test worker process lookup. The waitForLog, waitForPort, and waitForExit routes in the E2E test worker used listProcesses().find() to locate a process by ID. Changed to getProcess(id): a direct lookup that matches how real SDK users would use the API.

Git clone test repo. Replaced github/gitignore (~15MB) with octocat/Spoon-Knife (~1KB). The test validates clone mechanics, not repo size.

What was removed
pullrequest.yml, cleanup-stale.yml, pkg-pr-new.yml, sync-docs.yml, scripts/cleanup-test-deployment.sh: replaced or dead code.

Post-merge
E2E worker naming changes from sandbox-e2e-test-worker-main to sandbox-e2e-test-worker-main-http + sandbox-e2e-test-worker-main-websocket. The old worker needs manual cleanup.