
Fix parallel context creation serialization in process pool #515

Merged
ghostwriternr merged 12 commits into main from create-executors-outside-of-mutex on Apr 1, 2026

Conversation

Contributor

@Muhammad-Bin-Ali Muhammad-Bin-Ali commented Mar 23, 2026

Fix parallel context creation serialization in process pool

Fixes #276

Problem

reserveExecutorForContext() and borrowExecutor() hold the per-language mutex while spawning child processes. Spawning blocks for 300-500ms waiting for a "ready" signal, during which all other requests for the same language queue behind the lock. With 10 parallel context creations against a pre-warmed pool of 3, requests 4-10 staircase — each waiting for the previous spawn to finish before starting its own:

Request 1:    443ms   (from pool)
Request 2:  1,016ms   (spawn, serialized)
Request 3:  1,477ms   (queued behind #2)
...
Request 10: 5,509ms   (queued behind all previous)

Wall time: 5.5s. Should be ~500ms.
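The staircase falls directly out of holding the lock across the slow spawn. A minimal, self-contained TypeScript sketch (not the SDK's code — `TinyMutex` and `sleep` stand in for the per-language mutex and the 300-500ms ready-signal wait) reproduces the effect:

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Bare promise-chain mutex: each runExclusive waits for the previous one.
class TinyMutex {
  private tail: Promise<void> = Promise.resolve();
  runExclusive<T>(fn: () => Promise<T>): Promise<T> {
    const result = this.tail.then(fn);
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

async function spawnHoldingLock(lock: TinyMutex): Promise<void> {
  // Spawn inside the critical section: concurrent requests staircase.
  await lock.runExclusive(() => sleep(50));
}

async function spawnOutsideLock(lock: TinyMutex): Promise<void> {
  await lock.runExclusive(async () => { /* bookkeeping only */ });
  await sleep(50); // slow spawn runs in parallel with other requests
}

async function wallTime(
  fn: (lock: TinyMutex) => Promise<void>,
  n: number
): Promise<number> {
  const lock = new TinyMutex();
  const start = Date.now();
  await Promise.all(Array.from({ length: n }, () => fn(lock)));
  return Date.now() - start;
}

export async function demo(): Promise<[number, number]> {
  const serialized = await wallTime(spawnHoldingLock, 4); // ~4 x 50ms
  const parallel = await wallTime(spawnOutsideLock, 4);   // ~1 x 50ms
  return [serialized, parallel];
}
```

With 4 requests and a 50ms fake spawn, the lock-holding variant takes roughly 4x the wall time of the lock-free-spawn variant — the same shape as the 5.5s vs ~500ms numbers above.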

Fix

Move createProcess() outside the mutex via a new spawnAndRegister() method that all spawn paths funnel through:

  1. Check available pool under lock — claim a pre-warmed executor if one exists (fast path)
  2. Acquire a semaphore permit under lock — enforces maxProcesses limit, throws if at capacity
  3. Spawn outside lock — parallel requests spawn concurrently
  4. Register under lock — the onSpawned callback adds the process to tracking structures and stores a one-shot release function in processReleasers
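The four steps can be sketched as follows. This is a hypothetical reduction, not the code in process-pool.ts: the real method uses async-mutex primitives and per-language state, while `CountingSemaphore`, `spawn()`, and the module-level maps here are stand-ins chosen so the locking shape is visible (steps 1-2 contain no awaits, so single-threaded JavaScript makes them effectively atomic, like the mutex does in the real code):

```typescript
type Proc = { id: string };

class CountingSemaphore {
  constructor(private permits: number) {}
  getValue(): number { return this.permits; }
  tryAcquire(): (() => void) | null {
    if (this.permits === 0) return null;
    this.permits -= 1;
    let released = false;
    return () => { if (!released) { released = true; this.permits += 1; } };
  }
}

const available: Proc[] = [];                      // pre-warmed pool
const tracked = new Map<string, Proc>();          // registered processes
const releasers = new Map<string, () => void>();  // one-shot permit releasers
const sem = new CountingSemaphore(3);             // maxProcesses = 3
let nextId = 0;

async function spawn(): Promise<Proc> {
  // Stands in for createProcess(): slow, runs OUTSIDE the lock.
  await new Promise((r) => setTimeout(r, 10));
  return { id: `proc-${nextId++}` };
}

export async function spawnAndRegister(): Promise<Proc> {
  const warm = available.pop();            // 1. fast path: claim pre-warmed
  if (warm) return warm;
  const release = sem.tryAcquire();        // 2. permit, or throw at capacity
  if (!release) throw new Error("Maximum process limit reached");
  try {
    const proc = await spawn();            // 3. spawn outside the lock
    tracked.set(proc.id, proc);            // 4. register
    releasers.set(proc.id, release);       //    store one-shot releaser
    return proc;
  } catch (err) {
    release();                             // no permit leak on spawn failure
    throw err;
  }
}
```

Because step 3 is outside the critical section, N concurrent callers each hold a permit and spawn simultaneously instead of queuing behind one another.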

A per-language Semaphore (from async-mutex) replaces the previous manual pool.length check inside createProcess. Permits are acquired before spawning and released when a process leaves the pool. Three removal paths exist — context deletion, idle cleanup, and unexpected exit — and processReleasers (a Map<processId, releaseFunction>) ensures exactly-once release: whichever path fires first consumes the entry, subsequent paths find nothing.
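The exactly-once guarantee can be illustrated with a minimal sketch. The names (`processReleasers`, `releaseProcessSlot`) follow the PR text, but this is an assumed shape, not the PR's actual code — a plain counter stands in for the semaphore permit:

```typescript
let permits = 0;
const processReleasers = new Map<string, () => void>();

function registerProcess(id: string): void {
  processReleasers.set(id, () => { permits += 1; });
}

// All three removal paths (exit handler, context release, idle cleanup)
// funnel through this: whichever fires first consumes the map entry.
function releaseProcessSlot(id: string): void {
  const release = processReleasers.get(id);
  if (!release) return;         // entry already consumed: no-op
  processReleasers.delete(id);  // delete BEFORE calling, so re-entry is safe
  release();
}

registerProcess("p1");
releaseProcessSlot("p1"); // releases the permit
releaseProcessSlot("p1"); // no-op: entry already consumed
```

Deleting the map entry before invoking the releaser is what makes racing removal paths safe — the second caller observes an empty slot rather than a stale function.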

Also fixed:

  • createUnassignedExecutor() previously mutated pool data structures without holding the mutex
  • cleanupIdleProcesses() did not remove exit handlers before killing, which could fire the handler redundantly

Changes

  • packages/sandbox-container/src/runtime/process-pool.ts
    • Extract spawnAndRegister() — shared spawn-outside-mutex skeleton with onSpawned callback
    • Add releaseProcessSlot() — one-shot semaphore release via processReleasers map
    • Replace pool.length check in createProcess with per-language Semaphore for maxProcesses enforcement
    • Refactor reserveExecutorForContext(), borrowExecutor(), createUnassignedExecutor() to use spawnAndRegister()
    • Release semaphore permits at all process removal points (exit handler, context release, idle cleanup)
    • Remove exit handler before killing in cleanupIdleProcesses() (consistency with releaseExecutorForContext())
    • Kill leaked process and release permit if onSpawned callback fails
  • packages/sandbox-container/tests/runtime/process-pool-concurrency.test.ts — 17 unit tests
  • tests/e2e/parallel-context-creation.test.ts — 4 E2E tests

Test coverage

| Area | Tests | What's verified |
| --- | --- | --- |
| Parallelism | 3 | 6/10 concurrent spawns complete in ~1 cycle, no staircase |
| Fast path | 2 | Pre-warmed executors assigned without spawning |
| Correctness | 3 | Unique executors, pool drain, cross-language isolation |
| maxProcesses | 5 | Exact boundary, rejection, off-by-one, permit release on context deletion, permit release on spawn failure |
| Permit accounting | 4 | Single release frees exactly one permit, full cycle restores all, no drift over 3 cycles, partial releases |
| E2E | 4 | Parallel create/delete against real container, cross-language, timing ratio |

Full container unit suite: 598 pass, 0 fail.

@changeset-bot

changeset-bot bot commented Mar 23, 2026

🦋 Changeset detected

Latest commit: c04d438

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
| Name | Type |
| --- | --- |
| @cloudflare/sandbox | Patch |



@pkg-pr-new

pkg-pr-new bot commented Mar 23, 2026


npm i https://pkg.pr.new/cloudflare/sandbox-sdk/@cloudflare/sandbox@515

commit: c04d438

@github-actions
Contributor

github-actions bot commented Mar 23, 2026

🐳 Docker Images Published

| Variant | Image |
| --- | --- |
| Default | cloudflare/sandbox:0.0.0-pr-515-c04d438 |
| Python | cloudflare/sandbox:0.0.0-pr-515-c04d438-python |
| OpenCode | cloudflare/sandbox:0.0.0-pr-515-c04d438-opencode |
| Musl | cloudflare/sandbox:0.0.0-pr-515-c04d438-musl |
| Desktop | cloudflare/sandbox:0.0.0-pr-515-c04d438-desktop |

Usage:

FROM cloudflare/sandbox:0.0.0-pr-515-c04d438

Version: 0.0.0-pr-515-c04d438


📦 Standalone Binary

For arbitrary Dockerfiles:

COPY --from=cloudflare/sandbox:0.0.0-pr-515-c04d438 /container-server/sandbox /sandbox
ENTRYPOINT ["/sandbox"]

Download via GitHub CLI:

gh run download 23850388228 -n sandbox-binary

Extract from Docker:

docker run --rm cloudflare/sandbox:0.0.0-pr-515-c04d438 cat /container-server/sandbox > sandbox && chmod +x sandbox

@Muhammad-Bin-Ali Muhammad-Bin-Ali force-pushed the create-executors-outside-of-mutex branch from c4f634c to d124ecb on March 23, 2026 at 18:26
@Muhammad-Bin-Ali
Contributor Author

/bonk review this PR

Contributor

@ask-bonk ask-bonk bot left a comment


Review

The core idea is sound — moving createProcess() outside the mutex to allow parallel spawning is the right fix for the serialization bottleneck. The pendingSpawns counter approach is a reasonable way to track in-flight spawns for limit enforcement.

However, there are two correctness issues that need fixing before merge:

1. Off-by-one in maxProcesses enforcement (bug)

Devin's review caught this correctly. In createProcess() at line 407:

pool.length + pending >= config.maxProcesses

The caller increments pendingSpawns inside the mutex before calling createProcess(). So pending already includes the caller's own reservation. Using >= means the limit is effectively reduced by 1.

Example: maxProcesses = 3, pool.length = 0, one caller increments pendingSpawns to 1. createProcess checks 0 + 1 >= 3 → false → OK. Second caller makes it pendingSpawns = 2, checks 0 + 2 >= 3 → false → OK. Third caller makes it pendingSpawns = 3, checks 0 + 3 >= 3 → true → rejects. But this should be allowed — the old code would check pool.length >= 3, i.e. 0 >= 3 → false → allowed.

Fix: change >= to >.
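The boundary behavior is easy to check in isolation. A tiny sketch (hypothetical names; the two predicates mirror the `>=` check under review and the proposed `>` fix):

```typescript
// With the caller's own reservation already counted in `pending`,
// `>=` rejects the third caller that `>` would admit.
const maxProcesses = 3;
const poolLength = 0;

const rejectsWithGte = (pending: number) =>
  poolLength + pending >= maxProcesses; // current code
const rejectsWithGt = (pending: number) =>
  poolLength + pending > maxProcesses;  // proposed fix

rejectsWithGte(3); // true:  third caller wrongly rejected, limit is really 2
rejectsWithGt(3);  // false: third caller admitted, pool tops out at exactly 3
rejectsWithGt(4);  // true:  fourth caller correctly rejected
```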

2. createUnassignedExecutor doesn't participate in pendingSpawns (bug)

createUnassignedExecutor() (line 880) calls createProcess() directly without incrementing pendingSpawns first. Meanwhile, createProcess() now reads pendingSpawns for the limit check. This means:

  • Pre-warming and ensureMinimumPool spawns are invisible to pendingSpawns, so concurrent borrowExecutor/reserveExecutorForContext calls could over-count available slots
  • More importantly, createUnassignedExecutor also doesn't hold the mutex when it mutates availableExecutors and pools arrays (lines 886-892). Before this PR it was called the same way, but now that other callers explicitly protect these mutations with the mutex, the inconsistency is more concerning.

This is a pre-existing issue made slightly worse by the new code reading pendingSpawns in createProcess. In practice, createUnassignedExecutor is only called during startup pre-warming and after releaseExecutorForContext, so the window for conflict is narrow. But it should be addressed for correctness. Consider either having createUnassignedExecutor follow the same pendingSpawns+mutex pattern, or at minimum protecting its pool mutations with the mutex.

Tests

The tests are well-structured and clearly validate the parallelism improvement. The mock approach is appropriate for unit-testing concurrency behavior. One minor note: none of the tests exercise the maxProcesses limit path, which is where the off-by-one bug lives. Adding a test with maxProcesses configured would catch the regression.


@ask-bonk
Contributor

ask-bonk bot commented Mar 23, 2026

Note on createUnassignedExecutor (line 880): This method calls createProcess without incrementing pendingSpawns, making these spawns invisible to the limit check. Since createProcess now reads pendingSpawns for enforcement, concurrent calls from borrowExecutor/reserveExecutorForContext and ensureMinimumPool could exceed maxProcesses.

Additionally, the pool mutations in createUnassignedExecutor (lines 886-892) happen without holding the mutex, while the refactored borrowExecutor and reserveExecutorForContext now explicitly protect these same mutations under the lock. Consider applying the same pendingSpawns+mutex pattern there for consistency.

@ask-bonk
Contributor

ask-bonk bot commented Mar 23, 2026

Review posted on PR #515 with "request changes". Summary of findings:

  1. Off-by-one bug in maxProcesses check (line 407): >= should be > because pendingSpawns already includes the current caller's reservation. This effectively reduces the process limit by 1 — and with maxProcesses = 1, no processes can ever be created. Added an inline comment with a suggested fix. This confirms Devin's automated review finding.

  2. createUnassignedExecutor doesn't participate in pendingSpawns tracking: This method calls createProcess() without incrementing pendingSpawns and mutates pool arrays without holding the mutex, creating an inconsistency with the refactored borrowExecutor/reserveExecutorForContext which now protect these operations. Posted a general comment since this code wasn't in the diff.

  3. Missing test coverage for maxProcesses: None of the new tests configure maxProcesses, so the off-by-one bug isn't caught. Recommended adding a test case.

The core approach (moving spawns outside the mutex, tracking with pendingSpawns) is correct and well-motivated.



@aron-cf
Contributor

aron-cf commented Mar 26, 2026

@ask-bonk can you please review this pr

ask-bonk[bot]
ask-bonk bot previously approved these changes Mar 26, 2026
Contributor

@ask-bonk ask-bonk bot left a comment


Review

The switch from pendingSpawns counters to Semaphore was the right call — it eliminated the off-by-one and double-counting bugs from the earlier approach. The overall design is sound and well-structured. CI is green.

Correctness analysis of spawnAndRegister

I traced through all the failure and race-condition paths in the current code:

Semaphore check + acquire — The getValue() === 0 check followed by acquire() under the mutex is correct. Because the mutex serializes these, no other caller can consume the permit between the check and the acquire. No off-by-one: if maxProcesses = 3 and 3 permits exist, all 3 callers successfully acquire. The 4th sees getValue() === 0 and throws.

createProcess rejection — The catch block calls release?.() directly. The exit handler fires but calls releaseProcessSlot(id) which finds no entry in processReleasers (never populated since createProcess rejected). No double-release.

Process dies between createProcess resolution and registration — processReleasers is set immediately after createProcess resolves. If the process dies before the registration mutex is acquired, the exit handler fires and consumes the releaser. The registration mutex then detects the dead process via exitCode/signalCode check and throws. The catch block's releaseProcessSlot finds nothing (already consumed). No double-release, no permit leak.

onSpawned callback throws — The catch block removes the exit handler, kills the process, and calls releaseProcessSlot. Clean.

Three removal paths — Exit handler, releaseExecutorForContext, and cleanupIdleProcesses all correctly handle semaphore release: the latter two remove the exit handler before killing (preventing its releaseProcessSlot call), then call releaseProcessSlot themselves. The one-shot processReleasers map ensures exactly-once release.

createUnassignedExecutor — now correct

The earlier concern about this method not participating in pendingSpawns tracking is resolved: it now goes through spawnAndRegister, which handles semaphore acquisition, mutex-protected registration, and all the error paths. The pre-existing issue of unprotected pool mutations is fixed.

Previous review findings — status

The earlier "changes requested" review from bonk and the Devin findings were against the pendingSpawns counter approach (commit d124ecb). All three issues identified there are no longer applicable:

  1. Off-by-one in maxProcesses — eliminated by switching to Semaphore (no manual counting)
  2. createUnassignedExecutor not participating in tracking — fixed, now uses spawnAndRegister
  3. Double semaphore release on spawn failure — not possible with the one-shot processReleasers map pattern

Two minor items (non-blocking)

  1. Exit handler pool mutations aren't mutex-protected. The exit handler modifies pools, availableExecutors, and contextExecutors without holding poolLocks. In JavaScript's single-threaded model this is safe against parallel mutation (the handler is synchronous), but it means pool state could be temporarily inconsistent from the perspective of code running after an await that allowed the handler to interleave. This is a pre-existing pattern from the base code and not something this PR introduced — fine to address separately if at all.

  2. ensureMinimumPool calls createUnassignedExecutor in parallel (Promise.all(spawnPromises)). Under maxProcesses, if the pool is fully depleted and minSize replenishment spawns compete with user requests for semaphore permits, user requests could get rejected with "Maximum limit reached" while background replenishment is consuming permits. This is an edge case and also pre-existing behavior — just worth noting for future consideration.

Tests

The 17 unit tests and 4 E2E tests cover the important cases well: parallelism verification, fast-path pool hits, unique executor assignment, cross-language isolation, maxProcesses enforcement (exact boundary, rejection, permit release on context deletion, permit release on spawn failure), permit accounting across cycles, and dead process detection. The mock approach with controlled spawn delays is appropriate for testing concurrency behavior deterministically.
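One common way to build such deterministic concurrency tests (an assumed approach, not a copy of the suite's code) is to gate each mock spawn on a manually resolved promise and record the peak number of in-flight spawns:

```typescript
// Each fake spawn parks on a deferred "gate" instead of a real timer, so
// the test controls exactly when spawns complete and can observe how many
// overlapped. A serialized pool would show peak = 1; a parallel one, peak = n.
type Deferred = { promise: Promise<void>; resolve: () => void };
const deferred = (): Deferred => {
  let resolve!: () => void;
  const promise = new Promise<void>((r) => { resolve = r; });
  return { promise, resolve };
};

export async function countConcurrentSpawns(n: number): Promise<number> {
  const gates: Deferred[] = [];
  let inFlight = 0;
  let peak = 0;

  const fakeSpawn = async () => {
    inFlight += 1;
    peak = Math.max(peak, inFlight);
    const gate = deferred();
    gates.push(gate);
    await gate.promise;   // parked until the test releases it
    inFlight -= 1;
  };

  // Start all requests; each runs synchronously up to its gate.
  const requests = Array.from({ length: n }, () => fakeSpawn());
  gates.forEach((g) => g.resolve()); // release every gate at once
  await Promise.all(requests);
  return peak;
}
```

Gating on deferred promises avoids real timers entirely, so the parallelism assertion cannot flake on scheduler jitter.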

@ask-bonk
Contributor

ask-bonk bot commented Mar 26, 2026

Review posted and approved on PR #515. Here's a summary of the findings:

Verdict: Approve. The code has been substantially rewritten since the earlier "changes requested" reviews. The switch from pendingSpawns counters to async-mutex Semaphore eliminated the off-by-one and double-counting bugs. All three issues from the earlier bonk and Devin reviews are no longer applicable in the current code.

I traced through every failure and race-condition path in spawnAndRegister:

  • Semaphore check+acquire under mutex — correct, no off-by-one
  • createProcess rejection — no double-release (exit handler's releaseProcessSlot is a no-op since processReleasers was never populated)
  • Process death between spawn and registration — handled by the exitCode/signalCode check and one-shot processReleasers map
  • All three removal paths (exit handler, context release, idle cleanup) correctly coordinate via the one-shot releaser pattern

Two non-blocking observations noted: exit handler pool mutations aren't mutex-protected (pre-existing pattern), and ensureMinimumPool replenishment could compete with user requests for semaphore permits under maxProcesses (edge case, pre-existing).


spawnAndRegister() did not delete the executorLocks entry when a
process died between spawn and registration, leaking a stale Mutex.
Seven unit tests and one E2E test overlapped with other tests
in the suite without guarding distinct behavior.
Member

@ghostwriternr ghostwriternr left a comment


nice, looks pretty clean! reserve/spawn/register seems like a good pattern to use under the mutex while allowing the slow spawns to happen in parallel.

i made 2 small changes: cleaned out some redundant tests and fixed a 1-line cleanup miss in spawnAndRegister.

@ghostwriternr ghostwriternr enabled auto-merge (squash) April 1, 2026 13:15
@ghostwriternr ghostwriternr merged commit bf54f69 into main Apr 1, 2026
16 checks passed
@ghostwriternr ghostwriternr deleted the create-executors-outside-of-mutex branch April 1, 2026 13:22
@sandy-bonk sandy-bonk bot mentioned this pull request Apr 1, 2026


Development

Successfully merging this pull request may close these issues.

Parallel code context ops crash container

3 participants