
Overhaul CI pipeline and eliminate E2E test flakiness#447

Merged
ghostwriternr merged 43 commits into main from ci-restructure on Mar 4, 2026

Conversation

@ghostwriternr
Member

@ghostwriternr ghostwriternr commented Mar 3, 2026

CI was both slow and unreliable. Every PR ran the full pipeline regardless of what changed (~18-25 min wall clock), Docker images were rebuilt from scratch in both PR and release paths, and quality checks ran sequentially. E2E tests compounded this — 17 of 23 test files shared a single sandbox, leaking filesystem/port/process state between tests. The OpenCode readiness probe hit a no-op endpoint. File watch cleanup had race conditions. The SSE stream handler had a listener registration ordering bug. retry: 1 papered over all of it.

This PR replaces the monolithic CI with a compositional architecture and fixes the structural causes of E2E flakiness, so that a green pipeline actually means something.

Pipeline architecture

Three reusable workflows (reusable-build, reusable-quality, reusable-e2e) composed by both pr.yml and release.yml. Quality checks (lint+types, SDK tests, container tests) and E2E suites (HTTP, WebSocket, browser) run in parallel. dorny/paths-filter at the entry point skips entire workflow trees when changes don't affect them — a docs-only PR runs nothing; an SDK-only change skips Docker entirely.

Docker Buildx Bake with a single HCL config. 4 image variants built in one invocation with shared base layers and a single registry cache round-trip, replacing 4 separate docker build commands with independent cache pulls.

crane copy for releases. The old release rebuilt Docker images from scratch despite identical images already existing from E2E. Now we retag the tested image — no rebuild, no drift between what was tested and what ships.

Unified cleanup with explicit container teardown. Worker deletion does not delete containers. cleanup.yml now handles both PR-close and daily-sweep triggers, with explicit container deletion and per-step timeouts.

E2E test isolation

Every test file now creates and destroys its own sandbox via createTestSandbox() / cleanupTestSandbox(), with correct X-Sandbox-Type header routing for typed variants (python, opencode, desktop, musl, standalone). retry: 1 is removed — tests pass on first attempt or they fail.

SDK and container fixes caught by the audit

OpenCode readiness probe. /global/health unconditionally returns { healthy: true } regardless of init state. Changed to probe /path, which only succeeds after the binary is actually ready. 3 call sites in opencode.ts.
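The fix amounts to probing an endpoint whose success actually depends on init state. A minimal sketch of a generic readiness poll (the SDK's real waitForPort options and timeouts are assumptions here, not its actual API):

```typescript
// Generic readiness poll. A probe against a no-op endpoint (like the old
// /global/health) returns success immediately and defeats the whole loop;
// the probed endpoint must fail until the service is actually initialized.
async function waitForReady(
  check: () => Promise<boolean>,
  { timeoutMs = 180_000, intervalMs = 500 } = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      if (await check()) return;
    } catch {
      // not ready yet; keep polling
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`readiness probe timed out after ${timeoutMs}ms`);
}
```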

Watch process lifecycle. watchDirectory() had a race where the 4-hop cancel chain (SDK → DO → container → shell) was slower than a new watch starting. Added centralized stopWatch() with SIGTERM→SIGKILL escalation and stale watcher cleanup before spawning.
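The escalation pattern can be sketched as follows — a generic helper, not stopWatch's actual signature; the grace period, WeakMap bookkeeping, and function names are illustrative:

```typescript
import { spawn, type ChildProcess } from 'node:child_process';

// SIGTERM→SIGKILL escalation, idempotent via a cached promise (mirroring
// the stopPromise idea described above; names and timings are assumptions).
const stopPromises = new WeakMap<ChildProcess, Promise<void>>();

function stopWithEscalation(child: ChildProcess, graceMs = 2000): Promise<void> {
  const cached = stopPromises.get(child);
  if (cached) return cached; // a second caller awaits the same teardown

  const done = new Promise<void>((resolve) => {
    if (child.exitCode !== null || child.signalCode !== null) {
      resolve(); // already exited
      return;
    }
    const hardKill = setTimeout(() => child.kill('SIGKILL'), graceMs);
    child.once('exit', () => {
      clearTimeout(hardKill);
      resolve();
    });
    child.kill('SIGTERM'); // graceful first
  });
  stopPromises.set(child, done);
  return done;
}
```

Because repeated calls return the same promise, a new watch can await the old watcher's teardown before spawning, closing the race window.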

SSE stream listener ordering. The container's handleStream registered output listeners after replaying buffered logs, creating a window where output emitted between the buffer snapshot and listener registration was lost. This caused intermittent waitForLog timeouts on HTTP transport while WebSocket was unaffected (different streaming path). Fixed by registering listeners before replay. Also removed the now-dead streamProcessLogs service method the handler no longer calls.
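The ordering bug is easy to model in isolation. A toy sketch (not the container's actual handler) of why replay-then-register loses data while register-then-replay at worst duplicates it:

```typescript
// Toy model: a process appends output to a buffer and fans it out to
// registered listeners, like the container's process state does.
type OutputListener = (chunk: string) => void;

class FakeProcess {
  stdout = '';
  outputListeners = new Set<OutputListener>();
  emit(chunk: string) {
    this.stdout += chunk;
    for (const l of this.outputListeners) l(chunk);
  }
}

// Buggy order: replay the buffered snapshot, then register. Output emitted
// in between reaches neither the snapshot nor the listener.
function streamBuggy(p: FakeProcess, emitInGap: () => void): string[] {
  const sent: string[] = [];
  sent.push(p.stdout);                         // replay snapshot
  emitInGap();                                 // output arrives in the gap
  p.outputListeners.add((c) => sent.push(c));  // registered too late
  return sent;
}

// Fixed order: register first, then replay. A chunk can be duplicated
// (listener + replay), but none can be lost.
function streamFixed(p: FakeProcess, emitInGap: () => void): string[] {
  const sent: string[] = [];
  p.outputListeners.add((c) => sent.push(c));
  emitInGap();                                 // captured by the listener
  sent.push(p.stdout);                         // replay buffered output
  return sent;
}
```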

Test worker process lookup. The waitForLog, waitForPort, and waitForExit routes in the E2E test worker used listProcesses().find() to locate a process by ID. Changed to getProcess(id) — a direct lookup that matches how real SDK users would use the API.
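Sketched against a hypothetical client shape (the real SDK types and method signatures may differ):

```typescript
// Hypothetical shapes standing in for the SDK's process API.
interface ProcessInfo { id: string; status: string }
interface Client {
  listProcesses(): Promise<ProcessInfo[]>;
  getProcess(id: string): Promise<ProcessInfo | undefined>;
}

// Before: fetch every process and scan — O(n) and exercises an
// unrelated endpoint just to resolve a known ID.
async function findViaList(client: Client, id: string) {
  return (await client.listProcesses()).find((p) => p.id === id);
}

// After: direct lookup, matching how SDK users resolve a known ID.
async function findDirect(client: Client, id: string) {
  return client.getProcess(id);
}
```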

Git clone test repo. Replaced github/gitignore (~15MB) with octocat/Spoon-Knife (~1KB). The test validates clone mechanics, not repo size.

What was removed

pullrequest.yml, cleanup-stale.yml, pkg-pr-new.yml, sync-docs.yml, scripts/cleanup-test-deployment.sh — replaced or dead code.

Post-merge

E2E worker naming changes from sandbox-e2e-test-worker-main to sandbox-e2e-test-worker-main-http + sandbox-e2e-test-worker-main-websocket. The old worker needs manual cleanup.



@changeset-bot

changeset-bot bot commented Mar 3, 2026

🦋 Changeset detected

Latest commit: 36f4e47

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@cloudflare/sandbox Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR



@github-actions github-actions bot left a comment


OpenCode Review

This is an impressive CI/CD architecture overhaul that transforms a monolithic 18-25 minute pipeline into a modular, intelligent system with change detection and parallelism. The compositional approach using reusable workflows is excellent.

Architectural highlights:

  • Change detection via dorny/paths-filter prevents unnecessary work (docs-only PRs run nothing, SDK changes skip Docker entirely)
  • Docker Buildx Bake replaces 4 separate builds with unified configuration and shared cache layers
  • Reusable workflows enable proper job parallelism (quality checks, E2E suites run concurrently)
  • Smart cleanup consolidates scattered logic into focused jobs with proper timeouts

Technical improvements:

  • Bun version pinning via .bun-version file ensures consistency
  • crane copy for release publishing eliminates rebuild drift
  • Turbo remote caching for build acceleration
  • Container instance type standardization (standard-4)

Two concerns:

  1. Missing concurrency safeguards in reusable-e2e.yml: The deploy job creates both HTTP and WebSocket workers sequentially, but both reference the same image tag. If a container rollout is slow, the second deploy might use stale images. Consider adding a brief wait or status check between deployments when images_changed is true.

  2. Hardcoded Bun version fallback in docker-bake.hcl: The default BUN_VERSION=1 seems inconsistent with the new .bun-version file approach. Should this default match the file content or be removed entirely?

Overall: This modernizes the CI pipeline significantly and should dramatically improve developer experience. The change detection alone will save substantial CI time.

with:
  node-version: 24
  cache: 'npm'


Both workers deploy the same image tag but run sequentially. If container rollout is slow, the WebSocket worker might deploy before HTTP worker's containers are fully rolled out, potentially using stale images. Consider adding a brief wait between deploys when images_changed is true.


inherits = ["_common"]
target = "musl"
tags = ["sandbox-musl:${TAG}"]
cache-from = CACHE_REPO != "" ? ["type=registry,ref=${CACHE_REPO}:musl"] : []

The default BUN_VERSION=1 in docker-bake.hcl is inconsistent with the new pinned .bun-version file, which pins 1.3. Should the default match the file content (BUN_VERSION=1.3) or require an explicit override? As it stands, builds that don't pass the build arg can silently use a different Bun version.



/global/health returns 200 unconditionally before OpenCode
initializes. /path goes through the instance middleware,
ensuring InstanceBootstrap completes before declaring ready.
All cleanup now goes through stopWatch() which is idempotent
via stopPromise, awaits graceful exit with SIGKILL fallback,
and cleans up stale same-path watches before spawning new ones.
Replace github/gitignore with octocat/Spoon-Knife to reduce
clone time and network failure surface in CI.

All 23 test files shared a single sandbox, causing flaky failures
from cross-test contamination of filesystem, ports, and processes.
Each file now creates and tears down its own sandbox, and the
automatic retry workaround is removed.
@whoiskatrin
Collaborator

Love this!

The crane copy loop and PR comment template now cover all five
published variants, matching the release workflow.
The container's handleStream registered output listeners AFTER replaying
buffered logs, creating a window where output emitted between the buffer
snapshot and listener registration was lost. This caused intermittent
waitForLog timeouts on HTTP transport while WebSocket was unaffected.

Register listeners before replay so no output can slip through the gap.
Remove the now-dead streamProcessLogs service method the handler no
longer calls, and switch test-worker process lookups from
listProcesses().find() to getProcess() to match real SDK usage patterns.

Add biome.json and changesets to path filters so config and changeset-
only PRs trigger the right jobs. Include workflows in needs-docker so
E2E has images available on pipeline-only changes. Replace bare wait
with per-PID waits to surface individual crane copy failures. Catch
cancelled jobs in the gate check alongside failures. Remove workflows
from publish-preview trigger to avoid noisy npm previews.

Set fetch-depth: 1 on 11 checkout steps that only need the
current commit (build, quality, e2e, performance). The two
checkouts that need full history (publish-preview, release)
already set fetch-depth: 0 and are left unchanged.

Build job populates the cache on first run, downstream jobs
restore it and skip npm ci entirely. Saves ~85-100s aggregate
compute per PR push by eliminating 7 redundant installs.
Cache workspace-local node_modules directories alongside root to
ensure packages like @cloudflare/kumo (in examples/desktop/) survive
cache restore. Use restore-only action in downstream jobs since only
the build job needs to save. Add missing zod peer dep for kumo.

Document the reusable workflow architecture, four optimization
layers, Docker image topology, and key design decisions.

@ghostwriternr ghostwriternr merged commit 4088435 into main Mar 4, 2026
5 checks passed
@ghostwriternr ghostwriternr deleted the ci-restructure branch March 4, 2026 13:29
@github-actions github-actions bot mentioned this pull request Mar 4, 2026
whoiskatrin added a commit that referenced this pull request Mar 4, 2026
Accept main's removal of streamProcessLogs from ProcessService,
which was deleted in PR #447.

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 2 potential issues.


await existingProcess.waitForPort(port, {
  mode: 'http',
-  path: '/global/health',
+  path: '/path',

🔴 OpenCode readiness probe path changed to invalid endpoint /path

The waitForPort readiness probe path was changed from /global/health to /path in all three call sites within opencode.ts. The endpoint /path is not a valid OpenCode server endpoint — it's a nonsensical path that will likely return 404, causing the readiness check to time out after 180 seconds (the OPENCODE_STARTUP_TIMEOUT_MS). The original /global/health was a real OpenCode API endpoint that returns {healthy: true, version: ...}, as confirmed by the E2E test at tests/e2e/opencode-workflow.test.ts:116 which still uses /global/health, and the CHANGELOG which explicitly documents this endpoint choice. The test at packages/sandbox/tests/opencode/opencode.test.ts:149 was updated to expect /path, masking the bug in unit tests while the E2E test still validates the real endpoint.

Prompt for agents
In packages/sandbox/src/opencode/opencode.ts, change all three occurrences of path: '/path' back to path: '/global/health' (lines 98, 138, and 232). Also revert the corresponding test expectations in packages/sandbox/tests/opencode/opencode.test.ts lines 149 and 214 from path: '/path' back to path: '/global/health'.


Comment on lines +232 to +316
const process = processResult.data;

const stream = new ReadableStream({
  start(controller) {
    const encoder = new TextEncoder();
    const enqueueSSE = (payload: Record<string, unknown>) => {
      controller.enqueue(encoder.encode(`data: ${JSON.stringify(payload)}\n\n`));
    };

    // Send initial process info
    enqueueSSE({
      type: 'process_info',
      processId: process.id,
      command: process.command,
      status: process.status,
      timestamp: new Date().toISOString()
    });

    // Register listeners BEFORE replaying buffered output so no
    // chunk emitted between the snapshot read and registration is
    // lost. Duplicates from replay + listener overlap are harmless;
    // missing data is not.
    const outputListener = (stream: 'stdout' | 'stderr', data: string) => {
      enqueueSSE({
        type: stream,
        data,
        processId: process.id,
        timestamp: new Date().toISOString()
      });
    };

    const statusListener = (status: string) => {
      // Close stream when process completes
      if (['completed', 'failed', 'killed', 'error'].includes(status)) {
        enqueueSSE({
          type: 'exit',
          processId: process.id,
          exitCode: process.exitCode,
          data: `Process ${status} with exit code ${process.exitCode}`,
          timestamp: new Date().toISOString()
        });
        controller.close();
      }
    };

    process.outputListeners.add(outputListener);
    process.statusListeners.add(statusListener);

    // Replay buffered output collected before the listener was added
    if (process.stdout) {
      enqueueSSE({
        type: 'stdout',
        data: process.stdout,
        processId: process.id,
        timestamp: new Date().toISOString()
      });
    }

    if (process.stderr) {
      enqueueSSE({
        type: 'stderr',
        data: process.stderr,
        processId: process.id,
        timestamp: new Date().toISOString()
      });
    }

    // If process already completed, send exit event immediately
    if (['completed', 'failed', 'killed', 'error'].includes(process.status)) {
      enqueueSSE({
        type: 'exit',
        processId: process.id,
        exitCode: process.exitCode,
        data: `Process ${process.status} with exit code ${process.exitCode}`,
        timestamp: new Date().toISOString()
      });
      controller.close();
    }

    // Cleanup when stream is cancelled
    return () => {
      process.outputListeners.delete(outputListener);
      process.statusListeners.delete(statusListener);
    };
  }
});

return new Response(stream, {
  status: 200,
  headers: {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
    ...context.corsHeaders
  }
});

🔴 SSE process stream leaks listeners on client disconnect — missing cancel() handler

The refactored handleStream method creates a ReadableStream with listener registration in start() and returns a cleanup function from start(). However, the ReadableStream spec ignores the return value of start() — it is not a cleanup mechanism. When a client disconnects (stream cancellation), there is no cancel() callback to remove the outputListener and statusListener from the process's listener sets. The old code in process-service.ts:streamProcessLogs (now deleted at line 467) had a proper cancel() handler on its ReadableStream. The replacement code in the handler should add a cancel() method to the ReadableStream constructor options to delete the listeners, matching the pattern used in packages/sandbox-container/src/services/watch-service.ts:467-471.

Prompt for agents
In packages/sandbox-container/src/handlers/process-handler.ts, the ReadableStream created in handleStream (starting around line 232) needs a cancel() method added alongside the start() method. Move the cleanup logic from the return statement inside start() (lines 311-314) into a cancel() callback on the ReadableStream options object:

Change the ReadableStream constructor from:
  new ReadableStream({
    start(controller) {
      ...
      return () => {
        process.outputListeners.delete(outputListener);
        process.statusListeners.delete(statusListener);
      };
    }
  });

To:
  new ReadableStream({
    start(controller) {
      ...
      // Remove the return () => { ... } from here
    },
    cancel() {
      process.outputListeners.delete(outputListener);
      process.statusListeners.delete(statusListener);
    }
  });

Note: The outputListener and statusListener variables need to be hoisted to the outer scope of the ReadableStream constructor so cancel() can reference them, or you can use a shared closure variable.


ghostwriternr added a commit that referenced this pull request Mar 26, 2026
The previous commit removed image deletion but still deleted workers
per-run, which defeated deploy-skip. This replaces the entire per-run
cleanup model with a pre-deploy capacity guard that evicts only when
the account approaches its ~100 container app limit.

Workers and containers now persist across runs, restoring both the
Docker cache and deploy-skip optimizations from PR #447. The capacity
guard evicts whole-PR resources in least-recently-active order, using
GitHub API for reliable active-run protection. PR-close cleanup is
scoped to workers/containers/caches only — image tags are left alone
to avoid poisoning the shared ci-* Docker cache manifests.
ghostwriternr added a commit that referenced this pull request Mar 26, 2026
* Stop per-run image deletion to restore Docker cache

PR #512 added per-run cleanup that deleted pr-* image tags after
every CI run. This broke the content-addressed Docker cache because
ci-* and pr-* tags share underlying manifests in the CF registry.
Cache hit rate dropped from 100% to 30%, doubling average CI time
from 5.4 to 10.4 minutes.

Remove image tag deletion from per-run cleanup while keeping
worker/container deletion for the container app platform limit.
Decouple the cleanup job from gate so it no longer blocks the
required status check. Move image cleanup to PR close only.

Also fix four pre-existing issues in cleanup.yml: grep pipelines
failing under set -eo pipefail on empty results, fork PR resources
leaking because pull_request event lacks secrets, the daily sweep
being blind to image-only orphans, and ci-* cache tags accumulating
without eviction against the 50GB storage limit.

* Replace per-run cleanup with capacity-aware LRU eviction

The previous commit removed image deletion but still deleted workers
per-run, which defeated deploy-skip. This replaces the entire per-run
cleanup model with a pre-deploy capacity guard that evicts only when
the account approaches its ~100 container app limit.

Workers and containers now persist across runs, restoring both the
Docker cache and deploy-skip optimizations from PR #447. The capacity
guard evicts whole-PR resources in least-recently-active order, using
GitHub API for reliable active-run protection. PR-close cleanup is
scoped to workers/containers/caches only — image tags are left alone
to avoid poisoning the shared ci-* Docker cache manifests.

* Drop queued-run protection and remove nightly cache eviction

Only protect PRs with in-progress CI from capacity eviction. Queued
runs have not started using workers yet and can redeploy if evicted.
This widens the effective eviction pool and makes the fail-fast
scenario less likely.

Remove nightly ci-* image tag eviction from the daily sweep. The
50GB storage limit is not an immediate concern and the eviction was
costing ~6 minutes on the first build each day. Can be re-added if
storage pressure becomes a problem.