
Overhaul CI pipeline and eliminate E2E test flakiness#447

Merged
ghostwriternr merged 43 commits into main from ci-restructure on Mar 4, 2026

Conversation

@ghostwriternr
Member

@ghostwriternr ghostwriternr commented Mar 3, 2026

CI was both slow and unreliable. Every PR ran the full pipeline regardless of what changed (~18-25 min wall clock), Docker images were rebuilt from scratch in both PR and release paths, and quality checks ran sequentially. E2E tests compounded this — 17 of 23 test files shared a single sandbox, leaking filesystem/port/process state between tests. The OpenCode readiness probe hit a no-op endpoint. File watch cleanup had race conditions. The SSE stream handler had a listener registration ordering bug. retry: 1 papered over all of it.

This PR replaces the monolithic CI with a compositional architecture and fixes the structural causes of E2E flakiness, so that a green pipeline actually means something.

Pipeline architecture

Three reusable workflows (reusable-build, reusable-quality, reusable-e2e) composed by both pr.yml and release.yml. Quality checks (lint+types, SDK tests, container tests) and E2E suites (HTTP, WebSocket, browser) run in parallel. dorny/paths-filter at the entry point skips entire workflow trees when changes don't affect them — a docs-only PR runs nothing; an SDK-only change skips Docker entirely.

Docker Buildx Bake with a single HCL config. 4 image variants built in one invocation with shared base layers and a single registry cache round-trip, replacing 4 separate docker build commands with independent cache pulls.

crane copy for releases. The old release rebuilt Docker images from scratch despite identical images already existing from E2E. Now we retag the tested image — no rebuild, no drift between what was tested and what ships.

Unified cleanup with explicit container teardown. Worker deletion does not delete containers. cleanup.yml now handles both PR-close and daily-sweep triggers, with explicit container deletion and per-step timeouts.

E2E test isolation

Every test file now creates and destroys its own sandbox via createTestSandbox() / cleanupTestSandbox(), with correct X-Sandbox-Type header routing for typed variants (python, opencode, desktop, musl, standalone). retry: 1 is removed — tests pass on first attempt or they fail.

SDK and container fixes caught by the audit

OpenCode readiness probe. /global/health unconditionally returns { healthy: true } regardless of init state. Changed to probe /path, which only succeeds after the binary is actually ready. 3 call sites in opencode.ts.
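The fix amounts to probing an endpoint whose success actually depends on init state. A minimal sketch of a generic readiness poll (the SDK's real waitForPort options and timeouts are assumptions here, not its actual API):

```typescript
// Generic readiness poll. A probe against a no-op endpoint (like the old
// /global/health) returns success immediately and defeats the whole loop;
// the probed endpoint must fail until the service is actually initialized.
async function waitForReady(
  check: () => Promise<boolean>,
  { timeoutMs = 180_000, intervalMs = 500 } = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      if (await check()) return;
    } catch {
      // not ready yet; keep polling
    }
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  throw new Error(`readiness probe timed out after ${timeoutMs}ms`);
}
```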

Watch process lifecycle. watchDirectory() had a race where the 4-hop cancel chain (SDK → DO → container → shell) was slower than a new watch starting. Added centralized stopWatch() with SIGTERM→SIGKILL escalation and stale watcher cleanup before spawning.
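The escalation pattern can be sketched as follows — a generic helper, not stopWatch's actual signature; the grace period, WeakMap bookkeeping, and function names are illustrative:

```typescript
import { spawn, type ChildProcess } from 'node:child_process';

// SIGTERM→SIGKILL escalation, idempotent via a cached promise (mirroring
// the stopPromise idea described above; names and timings are assumptions).
const stopPromises = new WeakMap<ChildProcess, Promise<void>>();

function stopWithEscalation(child: ChildProcess, graceMs = 2000): Promise<void> {
  const cached = stopPromises.get(child);
  if (cached) return cached; // a second caller awaits the same teardown

  const done = new Promise<void>((resolve) => {
    if (child.exitCode !== null || child.signalCode !== null) {
      resolve(); // already exited
      return;
    }
    const hardKill = setTimeout(() => child.kill('SIGKILL'), graceMs);
    child.once('exit', () => {
      clearTimeout(hardKill);
      resolve();
    });
    child.kill('SIGTERM'); // graceful first
  });
  stopPromises.set(child, done);
  return done;
}
```

Because repeated calls return the same promise, a new watch can await the old watcher's teardown before spawning, closing the race window.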

SSE stream listener ordering. The container's handleStream registered output listeners after replaying buffered logs, creating a window where output emitted between the buffer snapshot and listener registration was lost. This caused intermittent waitForLog timeouts on HTTP transport while WebSocket was unaffected (different streaming path). Fixed by registering listeners before replay. Also removed the now-dead streamProcessLogs service method the handler no longer calls.
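The ordering bug is easy to model in isolation. A toy sketch (not the container's actual handler) of why replay-then-register loses data while register-then-replay at worst duplicates it:

```typescript
// Toy model: a process appends output to a buffer and fans it out to
// registered listeners, like the container's process state does.
type OutputListener = (chunk: string) => void;

class FakeProcess {
  stdout = '';
  outputListeners = new Set<OutputListener>();
  emit(chunk: string) {
    this.stdout += chunk;
    for (const l of this.outputListeners) l(chunk);
  }
}

// Buggy order: replay the buffered snapshot, then register. Output emitted
// in between reaches neither the snapshot nor the listener.
function streamBuggy(p: FakeProcess, emitInGap: () => void): string[] {
  const sent: string[] = [];
  sent.push(p.stdout);                         // replay snapshot
  emitInGap();                                 // output arrives in the gap
  p.outputListeners.add((c) => sent.push(c));  // registered too late
  return sent;
}

// Fixed order: register first, then replay. A chunk can be duplicated
// (listener + replay), but none can be lost.
function streamFixed(p: FakeProcess, emitInGap: () => void): string[] {
  const sent: string[] = [];
  p.outputListeners.add((c) => sent.push(c));
  emitInGap();                                 // captured by the listener
  sent.push(p.stdout);                         // replay buffered output
  return sent;
}
```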

Test worker process lookup. The waitForLog, waitForPort, and waitForExit routes in the E2E test worker used listProcesses().find() to locate a process by ID. Changed to getProcess(id) — a direct lookup that matches how real SDK users would use the API.
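Sketched against a hypothetical client shape (the real SDK types and method signatures may differ):

```typescript
// Hypothetical shapes standing in for the SDK's process API.
interface ProcessInfo { id: string; status: string }
interface Client {
  listProcesses(): Promise<ProcessInfo[]>;
  getProcess(id: string): Promise<ProcessInfo | undefined>;
}

// Before: fetch every process and scan — O(n) and exercises an
// unrelated endpoint just to resolve a known ID.
async function findViaList(client: Client, id: string) {
  return (await client.listProcesses()).find((p) => p.id === id);
}

// After: direct lookup, matching how SDK users resolve a known ID.
async function findDirect(client: Client, id: string) {
  return client.getProcess(id);
}
```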

Git clone test repo. Replaced github/gitignore (~15MB) with octocat/Spoon-Knife (~1KB). The test validates clone mechanics, not repo size.

What was removed

pullrequest.yml, cleanup-stale.yml, pkg-pr-new.yml, sync-docs.yml, scripts/cleanup-test-deployment.sh — replaced or dead code.

Post-merge

E2E worker naming changes from sandbox-e2e-test-worker-main to sandbox-e2e-test-worker-main-http + sandbox-e2e-test-worker-main-websocket. The old worker needs manual cleanup.



@changeset-bot

changeset-bot bot commented Mar 3, 2026

🦋 Changeset detected

Latest commit: 36f4e47

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@cloudflare/sandbox Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR



@github-actions github-actions bot left a comment


OpenCode Review

This is an impressive CI/CD architecture overhaul that transforms a monolithic 18-25 minute pipeline into a modular, intelligent system with change detection and parallelism. The compositional approach using reusable workflows is excellent.

Architectural highlights:

  • Change detection via dorny/paths-filter prevents unnecessary work (docs-only PRs run nothing, SDK changes skip Docker entirely)
  • Docker Buildx Bake replaces 4 separate builds with unified configuration and shared cache layers
  • Reusable workflows enable proper job parallelism (quality checks, E2E suites run concurrently)
  • Smart cleanup consolidates scattered logic into focused jobs with proper timeouts

Technical improvements:

  • Bun version pinning via .bun-version file ensures consistency
  • crane copy for release publishing eliminates rebuild drift
  • Turbo remote caching for build acceleration
  • Container instance type standardization (standard-4)

Two concerns:

  1. Missing concurrency safeguards in reusable-e2e.yml: The deploy job creates both HTTP and WebSocket workers sequentially, but both reference the same image tag. If a container rollout is slow, the second deploy might use stale images. Consider adding a brief wait or status check between deployments when images_changed is true.

  2. Hardcoded Bun version fallback in docker-bake.hcl: The default BUN_VERSION=1 seems inconsistent with the new .bun-version file approach. Should this default match the file content or be removed entirely?

Overall: This modernizes the CI pipeline significantly and should dramatically improve developer experience. The change detection alone will save substantial CI time.

with:
  node-version: 24
  cache: 'npm'


Both workers deploy the same image tag but run sequentially. If container rollout is slow, the WebSocket worker might deploy before HTTP worker's containers are fully rolled out, potentially using stale images. Consider adding a brief wait between deploys when images_changed is true.


inherits = ["_common"]
target = "musl"
tags = ["sandbox-musl:${TAG}"]
cache-from = CACHE_REPO != "" ? ["type=registry,ref=${CACHE_REPO}:musl"] : []

The default BUN_VERSION=1 in docker-bake.hcl is inconsistent with the new pinned .bun-version file, which pins 1.3. Should the default match the file content (BUN_VERSION=1.3) or require an explicit override? As it stands, builds that don't pass the build arg can silently use a different Bun version.



/global/health returns 200 unconditionally before OpenCode
initializes. /path goes through the instance middleware,
ensuring InstanceBootstrap completes before declaring ready.
All cleanup now goes through stopWatch() which is idempotent
via stopPromise, awaits graceful exit with SIGKILL fallback,
and cleans up stale same-path watches before spawning new ones.
Replace github/gitignore with octocat/Spoon-Knife to reduce
clone time and network failure surface in CI.

All 23 test files shared a single sandbox, causing flaky failures
from cross-test contamination of filesystem, ports, and processes.
Each file now creates and tears down its own sandbox, and the
automatic retry workaround is removed.
@whoiskatrin
Collaborator

Love this!

The crane copy loop and PR comment template now cover all five
published variants, matching the release workflow.
The container's handleStream registered output listeners AFTER replaying
buffered logs, creating a window where output emitted between the buffer
snapshot and listener registration was lost. This caused intermittent
waitForLog timeouts on HTTP transport while WebSocket was unaffected.

Register listeners before replay so no output can slip through the gap.
Remove the now-dead streamProcessLogs service method the handler no
longer calls, and switch test-worker process lookups from
listProcesses().find() to getProcess() to match real SDK usage patterns.

Add biome.json and changesets to path filters so config and changeset-
only PRs trigger the right jobs. Include workflows in needs-docker so
E2E has images available on pipeline-only changes. Replace bare wait
with per-PID waits to surface individual crane copy failures. Catch
cancelled jobs in the gate check alongside failures. Remove workflows
from publish-preview trigger to avoid noisy npm previews.

Set fetch-depth: 1 on 11 checkout steps that only need the
current commit (build, quality, e2e, performance). The two
checkouts that need full history (publish-preview, release)
already set fetch-depth: 0 and are left unchanged.

Build job populates the cache on first run, downstream jobs
restore it and skip npm ci entirely. Saves ~85-100s aggregate
compute per PR push by eliminating 7 redundant installs.
Cache workspace-local node_modules directories alongside root to
ensure packages like @cloudflare/kumo (in examples/desktop/) survive
cache restore. Use restore-only action in downstream jobs since only
the build job needs to save. Add missing zod peer dep for kumo.

Document the reusable workflow architecture, four optimization
layers, Docker image topology, and key design decisions.

@ghostwriternr ghostwriternr merged commit 4088435 into main Mar 4, 2026
5 checks passed
@ghostwriternr ghostwriternr deleted the ci-restructure branch March 4, 2026 13:29
@github-actions github-actions bot mentioned this pull request Mar 4, 2026
whoiskatrin added a commit that referenced this pull request Mar 4, 2026
Accept main's removal of streamProcessLogs from ProcessService,
which was deleted in PR #447.

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 2 potential issues.


await existingProcess.waitForPort(port, {
  mode: 'http',
-  path: '/global/health',
+  path: '/path',

🔴 OpenCode readiness probe path changed to invalid endpoint /path

The waitForPort readiness probe path was changed from /global/health to /path in all three call sites within opencode.ts. The endpoint /path is not a valid OpenCode server endpoint — it's a nonsensical path that will likely return 404, causing the readiness check to time out after 180 seconds (the OPENCODE_STARTUP_TIMEOUT_MS). The original /global/health was a real OpenCode API endpoint that returns {healthy: true, version: ...}, as confirmed by the E2E test at tests/e2e/opencode-workflow.test.ts:116 which still uses /global/health, and the CHANGELOG which explicitly documents this endpoint choice. The test at packages/sandbox/tests/opencode/opencode.test.ts:149 was updated to expect /path, masking the bug in unit tests while the E2E test still validates the real endpoint.

Prompt for agents
In packages/sandbox/src/opencode/opencode.ts, change all three occurrences of path: '/path' back to path: '/global/health' (lines 98, 138, and 232). Also revert the corresponding test expectations in packages/sandbox/tests/opencode/opencode.test.ts lines 149 and 214 from path: '/path' back to path: '/global/health'.


Comment on lines +232 to +316
const process = processResult.data;

const stream = new ReadableStream({
  start(controller) {
    const encoder = new TextEncoder();
    const enqueueSSE = (payload: Record<string, unknown>) => {
      controller.enqueue(encoder.encode(`data: ${JSON.stringify(payload)}\n\n`));
    };

    // Send initial process info
    enqueueSSE({
      type: 'process_info',
      processId: process.id,
      command: process.command,
      status: process.status,
      timestamp: new Date().toISOString()
    });

    // Register listeners BEFORE replaying buffered output so no
    // chunk emitted between the snapshot read and registration is
    // lost. Duplicates from replay + listener overlap are harmless;
    // missing data is not.
    const outputListener = (stream: 'stdout' | 'stderr', data: string) => {
      enqueueSSE({
        type: stream,
        data,
        processId: process.id,
        timestamp: new Date().toISOString()
      });
    };

    const statusListener = (status: string) => {
      // Close stream when process completes
      if (['completed', 'failed', 'killed', 'error'].includes(status)) {
        enqueueSSE({
          type: 'exit',
          processId: process.id,
          exitCode: process.exitCode,
          data: `Process ${status} with exit code ${process.exitCode}`,
          timestamp: new Date().toISOString()
        });
        controller.close();
      }
    };

    process.outputListeners.add(outputListener);
    process.statusListeners.add(statusListener);

    // Replay buffered output collected before the listener was added
    if (process.stdout) {
      enqueueSSE({
        type: 'stdout',
        data: process.stdout,
        processId: process.id,
        timestamp: new Date().toISOString()
      });
    }

    if (process.stderr) {
      enqueueSSE({
        type: 'stderr',
        data: process.stderr,
        processId: process.id,
        timestamp: new Date().toISOString()
      });
    }

    // If process already completed, send exit event immediately
    if (['completed', 'failed', 'killed', 'error'].includes(process.status)) {
      enqueueSSE({
        type: 'exit',
        processId: process.id,
        exitCode: process.exitCode,
        data: `Process ${process.status} with exit code ${process.exitCode}`,
        timestamp: new Date().toISOString()
      });
      controller.close();
    }

    // Cleanup when stream is cancelled
    return () => {
      process.outputListeners.delete(outputListener);
      process.statusListeners.delete(statusListener);
    };
  }
});

return new Response(stream, {
  status: 200,
  headers: {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
    ...context.corsHeaders
  }
});

🔴 SSE process stream leaks listeners on client disconnect — missing cancel() handler

The refactored handleStream method creates a ReadableStream with listener registration in start() and returns a cleanup function from start(). However, the ReadableStream spec ignores the return value of start() — it is not a cleanup mechanism. When a client disconnects (stream cancellation), there is no cancel() callback to remove the outputListener and statusListener from the process's listener sets. The old code in process-service.ts:streamProcessLogs (now deleted at line 467) had a proper cancel() handler on its ReadableStream. The replacement code in the handler should add a cancel() method to the ReadableStream constructor options to delete the listeners, matching the pattern used in packages/sandbox-container/src/services/watch-service.ts:467-471.

Prompt for agents
In packages/sandbox-container/src/handlers/process-handler.ts, the ReadableStream created in handleStream (starting around line 232) needs a cancel() method added alongside the start() method. Move the cleanup logic from the return statement inside start() (lines 311-314) into a cancel() callback on the ReadableStream options object:

Change the ReadableStream constructor from:
  new ReadableStream({
    start(controller) {
      ...
      return () => {
        process.outputListeners.delete(outputListener);
        process.statusListeners.delete(statusListener);
      };
    }
  });

To:
  new ReadableStream({
    start(controller) {
      ...
      // Remove the return () => { ... } from here
    },
    cancel() {
      process.outputListeners.delete(outputListener);
      process.statusListeners.delete(statusListener);
    }
  });

Note: The outputListener and statusListener variables need to be hoisted to the outer scope of the ReadableStream constructor so cancel() can reference them, or you can use a shared closure variable.


ghostwriternr added a commit that referenced this pull request Mar 26, 2026
The previous commit removed image deletion but still deleted workers
per-run, which defeated deploy-skip. This replaces the entire per-run
cleanup model with a pre-deploy capacity guard that evicts only when
the account approaches its ~100 container app limit.

Workers and containers now persist across runs, restoring both the
Docker cache and deploy-skip optimizations from PR #447. The capacity
guard evicts whole-PR resources in least-recently-active order, using
GitHub API for reliable active-run protection. PR-close cleanup is
scoped to workers/containers/caches only — image tags are left alone
to avoid poisoning the shared ci-* Docker cache manifests.
ghostwriternr added a commit that referenced this pull request Mar 26, 2026
* Stop per-run image deletion to restore Docker cache

PR #512 added per-run cleanup that deleted pr-* image tags after
every CI run. This broke the content-addressed Docker cache because
ci-* and pr-* tags share underlying manifests in the CF registry.
Cache hit rate dropped from 100% to 30%, doubling average CI time
from 5.4 to 10.4 minutes.

Remove image tag deletion from per-run cleanup while keeping
worker/container deletion for the container app platform limit.
Decouple the cleanup job from gate so it no longer blocks the
required status check. Move image cleanup to PR close only.

Also fix four pre-existing issues in cleanup.yml: grep pipelines
failing under set -eo pipefail on empty results, fork PR resources
leaking because pull_request event lacks secrets, the daily sweep
being blind to image-only orphans, and ci-* cache tags accumulating
without eviction against the 50GB storage limit.

* Replace per-run cleanup with capacity-aware LRU eviction

The previous commit removed image deletion but still deleted workers
per-run, which defeated deploy-skip. This replaces the entire per-run
cleanup model with a pre-deploy capacity guard that evicts only when
the account approaches its ~100 container app limit.

Workers and containers now persist across runs, restoring both the
Docker cache and deploy-skip optimizations from PR #447. The capacity
guard evicts whole-PR resources in least-recently-active order, using
GitHub API for reliable active-run protection. PR-close cleanup is
scoped to workers/containers/caches only — image tags are left alone
to avoid poisoning the shared ci-* Docker cache manifests.

* Drop queued-run protection and remove nightly cache eviction

Only protect PRs with in-progress CI from capacity eviction. Queued
runs have not started using workers yet and can redeploy if evicted.
This widens the effective eviction pool and makes the fail-fast
scenario less likely.

Remove nightly ci-* image tag eviction from the daily sweep. The
50GB storage limit is not an immediate concern and the eviction was
costing ~6 minutes on the first build each day. Can be re-added if
storage pressure becomes a problem.