
Fix parallel context creation serialization in process pool #515

Merged
ghostwriternr merged 12 commits into main from create-executors-outside-of-mutex on Apr 1, 2026

Conversation

Contributor

@Muhammad-Bin-Ali Muhammad-Bin-Ali commented Mar 23, 2026

Fix parallel context creation serialization in process pool

Fixes #276

Problem

reserveExecutorForContext() and borrowExecutor() hold the per-language mutex while spawning child processes. Spawning blocks for 300-500ms waiting for a "ready" signal, during which all other requests for the same language queue behind the lock. With 10 parallel context creations against a pre-warmed pool of 3, requests 4-10 staircase — each waiting for the previous spawn to finish before starting its own:

Request 1:    443ms   (from pool)
Request 2:  1,016ms   (spawn, serialized)
Request 3:  1,477ms   (queued behind #2)
...
Request 10: 5,509ms   (queued behind all previous)

Wall time: 5.5s. Should be ~500ms.
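The staircase falls directly out of holding the lock across the slow spawn. A minimal, self-contained TypeScript sketch (not the SDK's code — `TinyMutex` and `sleep` stand in for the per-language mutex and the 300-500ms ready-signal wait) reproduces the effect:

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Bare promise-chain mutex: each runExclusive waits for the previous one.
class TinyMutex {
  private tail: Promise<void> = Promise.resolve();
  runExclusive<T>(fn: () => Promise<T>): Promise<T> {
    const result = this.tail.then(fn);
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

async function spawnHoldingLock(lock: TinyMutex): Promise<void> {
  // Spawn inside the critical section: concurrent requests staircase.
  await lock.runExclusive(() => sleep(50));
}

async function spawnOutsideLock(lock: TinyMutex): Promise<void> {
  await lock.runExclusive(async () => { /* bookkeeping only */ });
  await sleep(50); // slow spawn runs in parallel with other requests
}

async function wallTime(
  fn: (lock: TinyMutex) => Promise<void>,
  n: number
): Promise<number> {
  const lock = new TinyMutex();
  const start = Date.now();
  await Promise.all(Array.from({ length: n }, () => fn(lock)));
  return Date.now() - start;
}

export async function demo(): Promise<[number, number]> {
  const serialized = await wallTime(spawnHoldingLock, 4); // ~4 x 50ms
  const parallel = await wallTime(spawnOutsideLock, 4);   // ~1 x 50ms
  return [serialized, parallel];
}
```

With 4 requests and a 50ms fake spawn, the lock-holding variant takes roughly 4x the wall time of the lock-free-spawn variant — the same shape as the 5.5s vs ~500ms numbers above.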

Fix

Move createProcess() outside the mutex via a new spawnAndRegister() method that all spawn paths funnel through:

  1. Check available pool under lock — claim a pre-warmed executor if one exists (fast path)
  2. Acquire a semaphore permit under lock — enforces maxProcesses limit, throws if at capacity
  3. Spawn outside lock — parallel requests spawn concurrently
  4. Register under lock — the onSpawned callback adds the process to tracking structures and stores a one-shot release function in processReleasers
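The four steps can be sketched as follows. This is a hypothetical reduction, not the code in process-pool.ts: the real method uses async-mutex primitives and per-language state, while `CountingSemaphore`, `spawn()`, and the module-level maps here are stand-ins chosen so the locking shape is visible (steps 1-2 contain no awaits, so single-threaded JavaScript makes them effectively atomic, like the mutex does in the real code):

```typescript
type Proc = { id: string };

class CountingSemaphore {
  constructor(private permits: number) {}
  getValue(): number { return this.permits; }
  tryAcquire(): (() => void) | null {
    if (this.permits === 0) return null;
    this.permits -= 1;
    let released = false;
    return () => { if (!released) { released = true; this.permits += 1; } };
  }
}

const available: Proc[] = [];                      // pre-warmed pool
const tracked = new Map<string, Proc>();          // registered processes
const releasers = new Map<string, () => void>();  // one-shot permit releasers
const sem = new CountingSemaphore(3);             // maxProcesses = 3
let nextId = 0;

async function spawn(): Promise<Proc> {
  // Stands in for createProcess(): slow, runs OUTSIDE the lock.
  await new Promise((r) => setTimeout(r, 10));
  return { id: `proc-${nextId++}` };
}

export async function spawnAndRegister(): Promise<Proc> {
  const warm = available.pop();            // 1. fast path: claim pre-warmed
  if (warm) return warm;
  const release = sem.tryAcquire();        // 2. permit, or throw at capacity
  if (!release) throw new Error("Maximum process limit reached");
  try {
    const proc = await spawn();            // 3. spawn outside the lock
    tracked.set(proc.id, proc);            // 4. register
    releasers.set(proc.id, release);       //    store one-shot releaser
    return proc;
  } catch (err) {
    release();                             // no permit leak on spawn failure
    throw err;
  }
}
```

Because step 3 is outside the critical section, N concurrent callers each hold a permit and spawn simultaneously instead of queuing behind one another.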

A per-language Semaphore (from async-mutex) replaces the previous manual pool.length check inside createProcess. Permits are acquired before spawning and released when a process leaves the pool. Three removal paths exist — context deletion, idle cleanup, and unexpected exit — and processReleasers (a Map<processId, releaseFunction>) ensures exactly-once release: whichever path fires first consumes the entry, subsequent paths find nothing.
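The exactly-once guarantee can be illustrated with a minimal sketch. The names (`processReleasers`, `releaseProcessSlot`) follow the PR text, but this is an assumed shape, not the PR's actual code — a plain counter stands in for the semaphore permit:

```typescript
let permits = 0;
const processReleasers = new Map<string, () => void>();

function registerProcess(id: string): void {
  processReleasers.set(id, () => { permits += 1; });
}

// All three removal paths (exit handler, context release, idle cleanup)
// funnel through this: whichever fires first consumes the map entry.
function releaseProcessSlot(id: string): void {
  const release = processReleasers.get(id);
  if (!release) return;         // entry already consumed: no-op
  processReleasers.delete(id);  // delete BEFORE calling, so re-entry is safe
  release();
}

registerProcess("p1");
releaseProcessSlot("p1"); // releases the permit
releaseProcessSlot("p1"); // no-op: entry already consumed
```

Deleting the map entry before invoking the releaser is what makes racing removal paths safe — the second caller observes an empty slot rather than a stale function.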

Also fixed:

  • createUnassignedExecutor() previously mutated pool data structures without holding the mutex
  • cleanupIdleProcesses() did not remove exit handlers before killing, which could fire the handler redundantly

Changes

  • packages/sandbox-container/src/runtime/process-pool.ts
    • Extract spawnAndRegister() — shared spawn-outside-mutex skeleton with onSpawned callback
    • Add releaseProcessSlot() — one-shot semaphore release via processReleasers map
    • Replace pool.length check in createProcess with per-language Semaphore for maxProcesses enforcement
    • Refactor reserveExecutorForContext(), borrowExecutor(), createUnassignedExecutor() to use spawnAndRegister()
    • Release semaphore permits at all process removal points (exit handler, context release, idle cleanup)
    • Remove exit handler before killing in cleanupIdleProcesses() (consistency with releaseExecutorForContext())
    • Kill leaked process and release permit if onSpawned callback fails
  • packages/sandbox-container/tests/runtime/process-pool-concurrency.test.ts — 17 unit tests
  • tests/e2e/parallel-context-creation.test.ts — 4 E2E tests

Test coverage

| Area | Tests | What's verified |
| --- | --- | --- |
| Parallelism | 3 | 6/10 concurrent spawns complete in ~1 cycle, no staircase |
| Fast path | 2 | Pre-warmed executors assigned without spawning |
| Correctness | 3 | Unique executors, pool drain, cross-language isolation |
| maxProcesses | 5 | Exact boundary, rejection, off-by-one, permit release on context deletion, permit release on spawn failure |
| Permit accounting | 4 | Single release frees exactly one permit, full cycle restores all, no drift over 3 cycles, partial releases |
| E2E | 4 | Parallel create/delete against real container, cross-language, timing ratio |

Full container unit suite: 598 pass, 0 fail.

@changeset-bot

changeset-bot bot commented Mar 23, 2026

🦋 Changeset detected

Latest commit: c04d438

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
| Name | Type |
| --- | --- |
| @cloudflare/sandbox | Patch |



@pkg-pr-new

pkg-pr-new bot commented Mar 23, 2026


npm i https://pkg.pr.new/cloudflare/sandbox-sdk/@cloudflare/sandbox@515

commit: c04d438

@github-actions
Contributor

github-actions bot commented Mar 23, 2026

🐳 Docker Images Published

| Variant | Image |
| --- | --- |
| Default | cloudflare/sandbox:0.0.0-pr-515-c04d438 |
| Python | cloudflare/sandbox:0.0.0-pr-515-c04d438-python |
| OpenCode | cloudflare/sandbox:0.0.0-pr-515-c04d438-opencode |
| Musl | cloudflare/sandbox:0.0.0-pr-515-c04d438-musl |
| Desktop | cloudflare/sandbox:0.0.0-pr-515-c04d438-desktop |

Usage:

FROM cloudflare/sandbox:0.0.0-pr-515-c04d438

Version: 0.0.0-pr-515-c04d438


📦 Standalone Binary

For arbitrary Dockerfiles:

COPY --from=cloudflare/sandbox:0.0.0-pr-515-c04d438 /container-server/sandbox /sandbox
ENTRYPOINT ["/sandbox"]

Download via GitHub CLI:

gh run download 23850388228 -n sandbox-binary

Extract from Docker:

docker run --rm cloudflare/sandbox:0.0.0-pr-515-c04d438 cat /container-server/sandbox > sandbox && chmod +x sandbox

@Muhammad-Bin-Ali Muhammad-Bin-Ali force-pushed the create-executors-outside-of-mutex branch from c4f634c to d124ecb on March 23, 2026 at 18:26
@Muhammad-Bin-Ali
Contributor Author

/bonk review this PR

Contributor

@ask-bonk ask-bonk bot left a comment


Review

The core idea is sound — moving createProcess() outside the mutex to allow parallel spawning is the right fix for the serialization bottleneck. The pendingSpawns counter approach is a reasonable way to track in-flight spawns for limit enforcement.

However, there are two correctness issues that need fixing before merge:

1. Off-by-one in maxProcesses enforcement (bug)

Devin's review caught this correctly. In createProcess() at line 407:

pool.length + pending >= config.maxProcesses

The caller increments pendingSpawns inside the mutex before calling createProcess(). So pending already includes the caller's own reservation. Using >= means the limit is effectively reduced by 1.

Example: maxProcesses = 3, pool.length = 0, one caller increments pendingSpawns to 1. createProcess checks 0 + 1 >= 3 → false → OK. Second caller makes it pendingSpawns = 2, checks 0 + 2 >= 3 → false → OK. Third caller makes it pendingSpawns = 3, checks 0 + 3 >= 3 → true → rejects. But this should be allowed — the old code would check pool.length >= 3, i.e. 0 >= 3 → false → allowed.

Fix: change >= to >.
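The boundary behavior is easy to check in isolation. A tiny sketch (hypothetical names; the two predicates mirror the `>=` check under review and the proposed `>` fix):

```typescript
// With the caller's own reservation already counted in `pending`,
// `>=` rejects the third caller that `>` would admit.
const maxProcesses = 3;
const poolLength = 0;

const rejectsWithGte = (pending: number) =>
  poolLength + pending >= maxProcesses; // current code
const rejectsWithGt = (pending: number) =>
  poolLength + pending > maxProcesses;  // proposed fix

rejectsWithGte(3); // true:  third caller wrongly rejected, limit is really 2
rejectsWithGt(3);  // false: third caller admitted, pool tops out at exactly 3
rejectsWithGt(4);  // true:  fourth caller correctly rejected
```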

2. createUnassignedExecutor doesn't participate in pendingSpawns (bug)

createUnassignedExecutor() (line 880) calls createProcess() directly without incrementing pendingSpawns first. Meanwhile, createProcess() now reads pendingSpawns for the limit check. This means:

  • Pre-warming and ensureMinimumPool spawns are invisible to pendingSpawns, so concurrent borrowExecutor/reserveExecutorForContext calls could over-count available slots
  • More importantly, createUnassignedExecutor also doesn't hold the mutex when it mutates availableExecutors and pools arrays (lines 886-892). Before this PR it was called the same way, but now that other callers explicitly protect these mutations with the mutex, the inconsistency is more concerning.

This is a pre-existing issue made slightly worse by the new code reading pendingSpawns in createProcess. In practice, createUnassignedExecutor is only called during startup pre-warming and after releaseExecutorForContext, so the window for conflict is narrow. But it should be addressed for correctness. Consider either having createUnassignedExecutor follow the same pendingSpawns+mutex pattern, or at minimum protecting its pool mutations with the mutex.

Tests

The tests are well-structured and clearly validate the parallelism improvement. The mock approach is appropriate for unit-testing concurrency behavior. One minor note: none of the tests exercise the maxProcesses limit path, which is where the off-by-one bug lives. Adding a test with maxProcesses configured would catch the regression.


@ask-bonk
Contributor

ask-bonk bot commented Mar 23, 2026

Note on createUnassignedExecutor (line 880): This method calls createProcess without incrementing pendingSpawns, making these spawns invisible to the limit check. Since createProcess now reads pendingSpawns for enforcement, concurrent calls from borrowExecutor/reserveExecutorForContext and ensureMinimumPool could exceed maxProcesses.

Additionally, the pool mutations in createUnassignedExecutor (lines 886-892) happen without holding the mutex, while the refactored borrowExecutor and reserveExecutorForContext now explicitly protect these same mutations under the lock. Consider applying the same pendingSpawns+mutex pattern there for consistency.

@ask-bonk
Contributor

ask-bonk bot commented Mar 23, 2026

Review posted on PR #515 with "request changes". Summary of findings:

  1. Off-by-one bug in maxProcesses check (line 407): >= should be > because pendingSpawns already includes the current caller's reservation. This effectively reduces the process limit by 1 — and with maxProcesses = 1, no processes can ever be created. Added an inline comment with a suggested fix. This confirms Devin's automated review finding.

  2. createUnassignedExecutor doesn't participate in pendingSpawns tracking: This method calls createProcess() without incrementing pendingSpawns and mutates pool arrays without holding the mutex, creating an inconsistency with the refactored borrowExecutor/reserveExecutorForContext which now protect these operations. Posted a general comment since this code wasn't in the diff.

  3. Missing test coverage for maxProcesses: None of the new tests configure maxProcesses, so the off-by-one bug isn't caught. Recommended adding a test case.

The core approach (moving spawns outside the mutex, tracking with pendingSpawns) is correct and well-motivated.



@aron-cf
Contributor

aron-cf commented Mar 26, 2026

@ask-bonk can you please review this pr

ask-bonk[bot]
ask-bonk bot previously approved these changes Mar 26, 2026
Contributor

@ask-bonk ask-bonk bot left a comment


Review

The switch from pendingSpawns counters to Semaphore was the right call — it eliminated the off-by-one and double-counting bugs from the earlier approach. The overall design is sound and well-structured. CI is green.

Correctness analysis of spawnAndRegister

I traced through all the failure and race-condition paths in the current code:

Semaphore check + acquire — The getValue() === 0 check followed by acquire() under the mutex is correct. Because the mutex serializes these, no other caller can consume the permit between the check and the acquire. No off-by-one: if maxProcesses = 3 and 3 permits exist, all 3 callers successfully acquire. The 4th sees getValue() === 0 and throws.

createProcess rejection — The catch block calls release?.() directly. The exit handler fires but calls releaseProcessSlot(id) which finds no entry in processReleasers (never populated since createProcess rejected). No double-release.

Process dies between createProcess resolution and registration — processReleasers is set immediately after createProcess resolves. If the process dies before the registration mutex is acquired, the exit handler fires and consumes the releaser. The registration mutex then detects the dead process via exitCode/signalCode check and throws. The catch block's releaseProcessSlot finds nothing (already consumed). No double-release, no permit leak.

onSpawned callback throws — The catch block removes the exit handler, kills the process, and calls releaseProcessSlot. Clean.

Three removal paths — Exit handler, releaseExecutorForContext, and cleanupIdleProcesses all correctly handle semaphore release: the latter two remove the exit handler before killing (preventing its releaseProcessSlot call), then call releaseProcessSlot themselves. The one-shot processReleasers map ensures exactly-once release.

createUnassignedExecutor — now correct

The earlier concern about this method not participating in pendingSpawns tracking is resolved: it now goes through spawnAndRegister, which handles semaphore acquisition, mutex-protected registration, and all the error paths. The pre-existing issue of unprotected pool mutations is fixed.

Previous review findings — status

The earlier "changes requested" review from bonk and the Devin findings were against the pendingSpawns counter approach (commit d124ecb). All three issues identified there are no longer applicable:

  1. Off-by-one in maxProcesses — eliminated by switching to Semaphore (no manual counting)
  2. createUnassignedExecutor not participating in tracking — fixed, now uses spawnAndRegister
  3. Double semaphore release on spawn failure — not possible with the one-shot processReleasers map pattern

Two minor items (non-blocking)

  1. Exit handler pool mutations aren't mutex-protected. The exit handler modifies pools, availableExecutors, and contextExecutors without holding poolLocks. In JavaScript's single-threaded model this is safe against parallel mutation (the handler is synchronous), but it means pool state could be temporarily inconsistent from the perspective of code running after an await that allowed the handler to interleave. This is a pre-existing pattern from the base code and not something this PR introduced — fine to address separately if at all.

  2. ensureMinimumPool calls createUnassignedExecutor in parallel (Promise.all(spawnPromises)). Under maxProcesses, if the pool is fully depleted and minSize replenishment spawns compete with user requests for semaphore permits, user requests could get rejected with "Maximum limit reached" while background replenishment is consuming permits. This is an edge case and also pre-existing behavior — just worth noting for future consideration.

Tests

The 17 unit tests and 4 E2E tests cover the important cases well: parallelism verification, fast-path pool hits, unique executor assignment, cross-language isolation, maxProcesses enforcement (exact boundary, rejection, permit release on context deletion, permit release on spawn failure), permit accounting across cycles, and dead process detection. The mock approach with controlled spawn delays is appropriate for testing concurrency behavior deterministically.
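One common way to build such deterministic concurrency tests (an assumed approach, not a copy of the suite's code) is to gate each mock spawn on a manually resolved promise and record the peak number of in-flight spawns:

```typescript
// Each fake spawn parks on a deferred "gate" instead of a real timer, so
// the test controls exactly when spawns complete and can observe how many
// overlapped. A serialized pool would show peak = 1; a parallel one, peak = n.
type Deferred = { promise: Promise<void>; resolve: () => void };
const deferred = (): Deferred => {
  let resolve!: () => void;
  const promise = new Promise<void>((r) => { resolve = r; });
  return { promise, resolve };
};

export async function countConcurrentSpawns(n: number): Promise<number> {
  const gates: Deferred[] = [];
  let inFlight = 0;
  let peak = 0;

  const fakeSpawn = async () => {
    inFlight += 1;
    peak = Math.max(peak, inFlight);
    const gate = deferred();
    gates.push(gate);
    await gate.promise;   // parked until the test releases it
    inFlight -= 1;
  };

  // Start all requests; each runs synchronously up to its gate.
  const requests = Array.from({ length: n }, () => fakeSpawn());
  gates.forEach((g) => g.resolve()); // release every gate at once
  await Promise.all(requests);
  return peak;
}
```

Gating on deferred promises avoids real timers entirely, so the parallelism assertion cannot flake on scheduler jitter.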

@ask-bonk
Contributor

ask-bonk bot commented Mar 26, 2026

Review posted and approved on PR #515. Here's a summary of the findings:

Verdict: Approve. The code has been substantially rewritten since the earlier "changes requested" reviews. The switch from pendingSpawns counters to async-mutex Semaphore eliminated the off-by-one and double-counting bugs. All three issues from the earlier bonk and Devin reviews are no longer applicable in the current code.

I traced through every failure and race-condition path in spawnAndRegister:

  • Semaphore check+acquire under mutex — correct, no off-by-one
  • createProcess rejection — no double-release (exit handler's releaseProcessSlot is a no-op since processReleasers was never populated)
  • Process death between spawn and registration — handled by the exitCode/signalCode check and one-shot processReleasers map
  • All three removal paths (exit handler, context release, idle cleanup) correctly coordinate via the one-shot releaser pattern

Two non-blocking observations noted: exit handler pool mutations aren't mutex-protected (pre-existing pattern), and ensureMinimumPool replenishment could compete with user requests for semaphore permits under maxProcesses (edge case, pre-existing).


spawnAndRegister() did not delete the executorLocks entry when a
process died between spawn and registration, leaking a stale Mutex.
Seven unit tests and one E2E test overlapped with other tests
in the suite without guarding distinct behavior.
Member

@ghostwriternr ghostwriternr left a comment


nice, looks pretty clean! reserve/spawn/register seems like a good pattern to use under the mutex while allowing the slow spawns to happen in parallel.

i made 2 small changes: cleaned out some redundant tests and fixed a 1-line cleanup miss in spawnAndRegister.

@ghostwriternr ghostwriternr enabled auto-merge (squash) April 1, 2026 13:15
@ghostwriternr ghostwriternr merged commit bf54f69 into main Apr 1, 2026
16 checks passed
@ghostwriternr ghostwriternr deleted the create-executors-outside-of-mutex branch April 1, 2026 13:22
@sandy-bonk sandy-bonk bot mentioned this pull request Apr 1, 2026


Development

Successfully merging this pull request may close these issues.

Parallel code context ops crash container

3 participants