feat(cli): add --shard-split-dynamic for work-stealing parallel test execution #3341
Open
thamys-moraes wants to merge 3 commits into
…execution

## Problem

The existing `--shard-split N` flag divides flows into static chunks before execution (round-robin by index). When flows have heterogeneous durations, faster devices finish their chunk early and sit idle while slower devices are still running. The total time is bound by the slowest shard.

## Solution

Add `--shard-split-dynamic N` which distributes flows via a shared queue (work-stealing): each device opens one session and consumes the next available flow as soon as it finishes the previous one. Faster devices naturally pick up more flows. The total time trends toward sum/N rather than max(shard).

```
# Before (static): device B idles while A finishes slow flows
device A: [slow][slow][slow]
device B: [fast][fast]░░░░░

# After (dynamic): no idle time
device A: [slow][slow]
device B: [fast][fast][fast]
```

## New flags

`--shard-split-dynamic N`
Distribute flows dynamically across N devices using a shared queue. Mutually exclusive with `--shard-split` and `--shard-all`.

`--min-healthy-devices M` (default: 2)
Minimum number of alive workers required to continue. If devices crash and the count drops below M, the run is aborted with a clear error instead of letting one surviving device run the entire remaining queue.

## Robustness

- Re-enqueue: if a device session crashes mid-flow, the in-progress flow is returned to the queue for another worker to execute.
- Fail-fast: aborts when alive workers < `--min-healthy-devices`.
- Auto memory cleanup: stops the app after each flow (resolved from the flow's own `launchApp` command) to free device memory and prevent slowdown in long test suites. No extra configuration needed.
- Zero regression: `--shard-split` / `--shard-all` paths are unchanged.

## Implementation

- `DynamicShardScheduler`: coroutine-per-device over a `Channel<Path>`, `AtomicInteger` for pending/alive counters, `AtomicBoolean` for cancellation (no manual locking).
- `TestSuiteInteractor`: extracted `buildSummary()`, added `runFromQueue()` which drives the work-stealing loop.
- `TestCommand`: new flags with mutual-exclusion validation; dynamic branch in `handleSessions()` returns before static `makeChunkPlans()`.
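
For reviewers who want the scheduling idea in isolation, the shared-queue loop, re-enqueue on crash, and the `--min-healthy-devices` abort described above can be sketched roughly as follows. This is an illustrative Java sketch only, not the code in this PR: it swaps the coroutine-per-device and `Channel<Path>` design for plain threads and a `LinkedBlockingQueue`, and the names `DynamicShardSketch`, `FlowRunner`, and `Result` are hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class DynamicShardSketch {
    /** Hypothetical stand-in for running one flow on one device; throwing models a session crash. */
    public interface FlowRunner {
        void run(int device, String flow) throws Exception;
    }

    /** Outcome of a run: which flows finished, and whether the run was aborted. */
    public static final class Result {
        public final Set<String> completed;
        public final boolean aborted;

        Result(Set<String> completed, boolean aborted) {
            this.completed = completed;
            this.aborted = aborted;
        }
    }

    public static Result run(List<String> flows, int devices, int minHealthy, FlowRunner runner)
            throws InterruptedException {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(flows);
        AtomicInteger pending = new AtomicInteger(flows.size()); // flows not yet completed
        AtomicInteger alive = new AtomicInteger(devices);        // workers still running
        AtomicBoolean cancelled = new AtomicBoolean(false);      // set when alive < minHealthy
        Set<String> completed = ConcurrentHashMap.newKeySet();

        Thread[] workers = new Thread[devices];
        for (int d = 0; d < devices; d++) {
            final int device = d;
            workers[d] = new Thread(() -> {
                // Loop on the pending counter, not on queue emptiness:
                // a crashed worker may put its flow back at any time.
                while (pending.get() > 0 && !cancelled.get()) {
                    String flow = null;
                    try {
                        flow = queue.poll(20, TimeUnit.MILLISECONDS);
                        if (flow == null) continue;
                        runner.run(device, flow);
                        completed.add(flow);
                        pending.decrementAndGet();
                    } catch (Exception crash) {
                        // Fail-fast check first, then return the in-progress flow to the queue.
                        if (alive.decrementAndGet() < minHealthy) cancelled.set(true);
                        if (flow != null) queue.add(flow);
                        return; // this device's session is gone
                    }
                }
            });
            workers[d].start();
        }
        for (Thread w : workers) w.join();
        return new Result(completed, cancelled.get());
    }
}
```

In this sketch, calling `DynamicShardSketch.run(flows, 2, 1, runner)` with a runner that throws once lets the surviving worker pick up the re-enqueued flow and finish the queue, while the same call with `minHealthy` set to 2 reports `aborted` as true.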