Fix parallel context creation serialization in process pool#515
Fix parallel context creation serialization in process pool#515ghostwriternr merged 12 commits intomainfrom
Conversation
🦋 Changeset detectedLatest commit: c04d438 The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
commit: |
🐳 Docker Images Published
Usage: FROM cloudflare/sandbox:0.0.0-pr-515-c04d438Version: 📦 Standalone BinaryFor arbitrary Dockerfiles: COPY --from=cloudflare/sandbox:0.0.0-pr-515-c04d438 /container-server/sandbox /sandbox
ENTRYPOINT ["/sandbox"]Download via GitHub CLI: gh run download 23850388228 -n sandbox-binaryExtract from Docker: docker run --rm cloudflare/sandbox:0.0.0-pr-515-c04d438 cat /container-server/sandbox > sandbox && chmod +x sandbox |
c4f634c to
d124ecb
Compare
|
/bonk review this PR |
There was a problem hiding this comment.
Review
The core idea is sound — moving createProcess() outside the mutex to allow parallel spawning is the right fix for the serialization bottleneck. The pendingSpawns counter approach is a reasonable way to track in-flight spawns for limit enforcement.
However, there are two correctness issues that need fixing before merge:
1. Off-by-one in maxProcesses enforcement (bug)
Devin's review caught this correctly. In createProcess() at line 407:
pool.length + pending >= config.maxProcessesThe caller increments pendingSpawns inside the mutex before calling createProcess(). So pending already includes the caller's own reservation. Using >= means the limit is effectively reduced by 1.
Example: maxProcesses = 3, pool.length = 0, one caller increments pendingSpawns to 1. createProcess checks 0 + 1 >= 3 → false → OK. Second caller makes it pendingSpawns = 2, checks 0 + 2 >= 3 → false → OK. Third caller makes it pendingSpawns = 3, checks 0 + 3 >= 3 → rejects. But this should be allowed — the old code would check pool.length >= 3 → 0 >= 3 → false → allowed.
Fix: change >= to >.
2. createUnassignedExecutor doesn't participate in pendingSpawns (bug)
createUnassignedExecutor() (line 880) calls createProcess() directly without incrementing pendingSpawns first. Meanwhile, createProcess() now reads pendingSpawns for the limit check. This means:
- Pre-warming and
ensureMinimumPoolspawns are invisible topendingSpawns, so concurrentborrowExecutor/reserveExecutorForContextcalls could over-count available slots - More importantly,
createUnassignedExecutoralso doesn't hold the mutex when it mutatesavailableExecutorsandpoolsarrays (lines 886-892). Before this PR it was called the same way, but now that other callers explicitly protect these mutations with the mutex, the inconsistency is more concerning.
This is a pre-existing issue made slightly worse by the new code reading pendingSpawns in createProcess. In practice, createUnassignedExecutor is only called during startup pre-warming and after releaseExecutorForContext, so the window for conflict is narrow. But it should be addressed for correctness. Consider either having createUnassignedExecutor follow the same pendingSpawns+mutex pattern, or at minimum protecting its pool mutations with the mutex.
Tests
The tests are well-structured and clearly validate the parallelism improvement. The mock approach is appropriate for unit-testing concurrency behavior. One minor note: none of the tests exercise the maxProcesses limit path, which is where the off-by-one bug lives. Adding a test with maxProcesses configured would catch the regression.
|
Note on Additionally, the pool mutations in |
|
Review posted on PR #515 with "request changes". Summary of findings:
The core approach (moving spawns outside the mutex, tracking with |
|
@ask-bonk can you please review this pr |
There was a problem hiding this comment.
Review
The switch from pendingSpawns counters to Semaphore was the right call — it eliminated the off-by-one and double-counting bugs from the earlier approach. The overall design is sound and well-structured. CI is green.
Correctness analysis of spawnAndRegister
I traced through all the failure and race-condition paths in the current code:
Semaphore check + acquire — The getValue() === 0 check followed by acquire() under the mutex is correct. Because the mutex serializes these, no other caller can consume the permit between the check and the acquire. No off-by-one: if maxProcesses = 3 and 3 permits exist, all 3 callers successfully acquire. The 4th sees getValue() === 0 and throws.
createProcess rejection — The catch block calls release?.() directly. The exit handler fires but calls releaseProcessSlot(id) which finds no entry in processReleasers (never populated since createProcess rejected). No double-release.
Process dies between createProcess resolution and registration — processReleasers is set immediately after createProcess resolves. If the process dies before the registration mutex is acquired, the exit handler fires and consumes the releaser. The registration mutex then detects the dead process via exitCode/signalCode check and throws. The catch block's releaseProcessSlot finds nothing (already consumed). No double-release, no permit leak.
onSpawned callback throws — The catch block removes the exit handler, kills the process, and calls releaseProcessSlot. Clean.
Three removal paths — Exit handler, releaseExecutorForContext, and cleanupIdleProcesses all correctly handle semaphore release: the latter two remove the exit handler before killing (preventing its releaseProcessSlot call), then call releaseProcessSlot themselves. The one-shot processReleasers map ensures exactly-once release.
createUnassignedExecutor — now correct
The earlier concern about this method not participating in pendingSpawns tracking is resolved: it now goes through spawnAndRegister, which handles semaphore acquisition, mutex-protected registration, and all the error paths. The pre-existing issue of unprotected pool mutations is fixed.
Previous review findings — status
The earlier "changes requested" review from bonk and the Devin findings were against the pendingSpawns counter approach (commit d124ecb). All three issues identified there are no longer applicable:
- Off-by-one in
maxProcesses— eliminated by switching to Semaphore (no manual counting) createUnassignedExecutornot participating in tracking — fixed, now usesspawnAndRegister- Double semaphore release on spawn failure — not possible with the one-shot
processReleasersmap pattern
Two minor items (non-blocking)
-
Exit handler pool mutations aren't mutex-protected. The exit handler modifies
pools,availableExecutors, andcontextExecutorswithout holdingpoolLocks. In JavaScript's single-threaded model this is safe against parallel mutation (the handler is synchronous), but it means pool state could be temporarily inconsistent from the perspective of code running after anawaitthat allowed the handler to interleave. This is a pre-existing pattern from the base code and not something this PR introduced — fine to address separately if at all. -
ensureMinimumPoolcallscreateUnassignedExecutorin parallel (Promise.all(spawnPromises)). UndermaxProcesses, if the pool is fully depleted andminSizereplenishment spawns compete with user requests for semaphore permits, user requests could get rejected with "Maximum limit reached" while background replenishment is consuming permits. This is an edge case and also pre-existing behavior — just worth noting for future consideration.
Tests
The 17 unit tests and 4 E2E tests cover the important cases well: parallelism verification, fast-path pool hits, unique executor assignment, cross-language isolation, maxProcesses enforcement (exact boundary, rejection, permit release on context deletion, permit release on spawn failure), permit accounting across cycles, and dead process detection. The mock approach with controlled spawn delays is appropriate for testing concurrency behavior deterministically.
|
Review posted and approved on PR #515. Here's a summary of the findings: Verdict: Approve. The code has been substantially rewritten since the earlier "changes requested" reviews. The switch from I traced through every failure and race-condition path in
Two non-blocking observations noted: exit handler pool mutations aren't mutex-protected (pre-existing pattern), and |
6bcd0fd to
26a655c
Compare
spawnAndRegister() did not delete the executorLocks entry when a process died between spawn and registration, leaking a stale Mutex.
Seven unit tests and one E2E test overlapped with other tests in the suite without guarding distinct behavior.
ghostwriternr
left a comment
There was a problem hiding this comment.
nice, looks pretty clean! reserve/spawn/register seems like a good pattern to use under the mutex while allowing the slow spawns to happen in parallel.
i made 2 small changes: cleaned out some redundant tests and fixed 1-line cleanup miss under spawnAndRegister.
Fix parallel context creation serialization in process pool
Fixes #276
Problem
reserveExecutorForContext()andborrowExecutor()hold the per-language mutex while spawning child processes. Spawning blocks for 300-500ms waiting for a"ready"signal, during which all other requests for the same language queue behind the lock. With 10 parallel context creations against a pre-warmed pool of 3, requests 4-10 staircase — each waiting for the previous spawn to finish before starting its own:Wall time: 5.5s. Should be ~500ms.
Fix
Move
createProcess()outside the mutex via a newspawnAndRegister()method that all spawn paths funnel through:maxProcesseslimit, throws if at capacityonSpawnedcallback adds the process to tracking structures, stores a one-shot release function inprocessReleasersA per-language
Semaphore(fromasync-mutex) replaces the previous manualpool.lengthcheck insidecreateProcess. Permits are acquired before spawning and released when a process leaves the pool. Three removal paths exist — context deletion, idle cleanup, and unexpected exit — andprocessReleasers(aMap<processId, releaseFunction>) ensures exactly-once release: whichever path fires first consumes the entry, subsequent paths find nothing.Also fixed:
createUnassignedExecutor()previously mutated pool data structures without holding the mutexcleanupIdleProcesses()did not remove exit handlers before killing, which could fire the handler redundantlyChanges
packages/sandbox-container/src/runtime/process-pool.tsspawnAndRegister()— shared spawn-outside-mutex skeleton withonSpawnedcallbackreleaseProcessSlot()— one-shot semaphore release viaprocessReleasersmappool.lengthcheck increateProcesswith per-languageSemaphoreformaxProcessesenforcementreserveExecutorForContext(),borrowExecutor(),createUnassignedExecutor()to usespawnAndRegister()cleanupIdleProcesses()(consistency withreleaseExecutorForContext())onSpawnedcallback failspackages/sandbox-container/tests/runtime/process-pool-concurrency.test.ts— 17 unit teststests/e2e/parallel-context-creation.test.ts— 4 E2E testsTest coverage
Full container unit suite: 598 pass, 0 fail.