feat(cli): add --shard-split-dynamic for work-stealing parallel test execution #3341

Open
thamys-moraes wants to merge 3 commits into mobile-dev-inc:main from thamys-moraes:feature/dynamic-sharding

thamys-moraes commented Jun 5, 2026

No description provided.

…execution

## Problem

The existing `--shard-split N` flag divides flows into static chunks
before execution (round-robin by index). When flow durations vary
widely, devices that finish their chunk early sit idle while the slower
shards are still running. Total run time is bounded by the slowest
shard.
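
To make the cost concrete, here is a minimal sketch of round-robin
assignment in Python. It is a simulation, not Maestro code: the function
name and the flow durations are made up for illustration, and it assumes
every device runs a given flow in the same time.

```python
def static_makespan(durations, n_devices):
    """Round-robin by index: flow i is assigned to shard i % n_devices."""
    shard_totals = [0.0] * n_devices
    for i, duration in enumerate(durations):
        shard_totals[i % n_devices] += duration
    # The run ends only when the slowest shard ends.
    return max(shard_totals), shard_totals


# Hypothetical flow durations in seconds, alternating slow and fast.
durations = [60, 5, 60, 5, 60, 5]
makespan, totals = static_makespan(durations, 2)
print(totals)    # [180.0, 15.0]
print(makespan)  # 180.0
```

With these numbers one shard gets all three slow flows, so the run takes
180 s even though an even split of the work would be 195 / 2 = 97.5 s.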

## Solution

Add `--shard-split-dynamic N`, which distributes flows via a shared queue
(work-stealing): each device opens one session and takes the next
available flow as soon as it finishes the previous one. Faster devices
naturally pick up more flows. Total run time trends toward
sum(flow durations)/N rather than the duration of the slowest shard.

```
# Before (static): device B idles while A finishes slow flows
device A: [slow][slow][slow]
device B: [fast][fast]░░░░░

# After (dynamic): no idle time
device A: [slow][slow]
device B: [fast][fast][fast]
```
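
The shared-queue behaviour can be approximated with a small simulation
in which the earliest-free device always takes the next flow. This is a
sketch under assumptions, not the PR's scheduler: the function name and
durations are hypothetical, flows are taken in list order, and all
devices are assumed to be equally fast.

```python
import heapq


def dynamic_makespan(durations, n_devices):
    """Simulate a shared queue: the earliest-free device takes the next flow."""
    # Each heap entry is (time at which the device becomes free, device index).
    free_at = [(0.0, dev) for dev in range(n_devices)]
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations:  # queue order is list order
        t, dev = heapq.heappop(free_at)
        t += duration
        finish = max(finish, t)
        heapq.heappush(free_at, (t, dev))
    return finish


# Same hypothetical durations (seconds) as a slow/fast alternating suite.
durations = [60, 5, 60, 5, 60, 5]
print(dynamic_makespan(durations, 2))  # 125.0
```

For this input, round-robin by index would put all three 60 s flows on
one shard (180 s), while the queue-based run finishes in 125 s. It does
not reach the ideal 97.5 s because the last slow flow starts late; queue
order still matters.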

## New flags

`--shard-split-dynamic N`
  Distribute flows dynamically across N devices using a shared queue.
  Mutually exclusive with `--shard-split` and `--shard-all`.

`--min-healthy-devices M` (default: 2)
  Minimum number of alive workers required to continue. If devices crash
  and the count drops below M, the run is aborted with a clear error
  instead of letting one surviving device run the entire remaining queue.
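
The mutual-exclusion rule can be sketched as a small validation
function. This is illustrative Python, not the `TestCommand` code: the
function and parameter names are hypothetical, and it checks only the
rule stated above (the dynamic flag against the two static ones).

```python
def validate_shard_flags(shard_split=None, shard_all=None,
                         shard_split_dynamic=None):
    """Reject --shard-split-dynamic combined with either static flag."""
    if shard_split_dynamic is None:
        return  # nothing to check for the static paths in this sketch
    conflicts = [name for name, value in (
        ("--shard-split", shard_split),
        ("--shard-all", shard_all),
    ) if value is not None]
    if conflicts:
        raise ValueError(
            "--shard-split-dynamic cannot be combined with "
            + ", ".join(conflicts)
        )


validate_shard_flags(shard_split_dynamic=3)  # accepted
try:
    validate_shard_flags(shard_split=2, shard_split_dynamic=3)
except ValueError as err:
    print(err)  # --shard-split-dynamic cannot be combined with --shard-split
```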

## Robustness

- Re-enqueue: if a device session crashes mid-flow, the in-progress flow
  is returned to the queue for another worker to execute.
- Fail-fast: aborts the run when the number of alive workers drops below
  `--min-healthy-devices`.
- Auto memory cleanup: stops the app after each flow (the app is resolved
  from the flow's own launchApp command) to free device memory and
  prevent slowdown in long test suites. No extra configuration needed.
- Zero regression: the `--shard-split` / `--shard-all` code paths are
  unchanged.
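
The re-enqueue and fail-fast rules above can be sketched with one
coroutine per device pulling from a shared queue. This uses Python's
`asyncio` purely as an illustration; it is not the `DynamicShardScheduler`
code, and `DeviceCrashed`, `run_dynamic`, and `flaky_run` are hypothetical
names. The idle worker polls with `sleep(0)`, which is fine for a sketch
but not something a real scheduler would want.

```python
import asyncio


class DeviceCrashed(Exception):
    """Hypothetical signal that a device session died mid-flow."""


async def run_dynamic(flows, n_devices, min_healthy, run_flow):
    queue = asyncio.Queue()
    for flow in flows:
        queue.put_nowait(flow)
    state = {"pending": len(flows), "alive": n_devices, "aborted": False}
    completed = []

    async def worker(dev):
        while state["pending"] > 0 and not state["aborted"]:
            try:
                flow = queue.get_nowait()
            except asyncio.QueueEmpty:
                # Flows are still in progress on other devices; yield so
                # that a crashed worker's flow can reappear in the queue.
                await asyncio.sleep(0)
                continue
            try:
                await run_flow(dev, flow)
            except DeviceCrashed:
                queue.put_nowait(flow)        # re-enqueue the in-progress flow
                state["alive"] -= 1
                if state["alive"] < min_healthy:
                    state["aborted"] = True   # fail fast
                return                        # this worker is gone
            completed.append((dev, flow))
            state["pending"] -= 1

    await asyncio.gather(*(worker(d) for d in range(n_devices)))
    if state["aborted"]:
        raise RuntimeError(f"alive workers dropped below {min_healthy}")
    return completed


async def flaky_run(dev, flow):
    await asyncio.sleep(0)   # stand-in for actually running the flow
    if dev == 1:             # device 1 always crashes in this example
        raise DeviceCrashed()


flows = ["a.yaml", "b.yaml", "c.yaml", "d.yaml"]
result = asyncio.run(run_dynamic(flows, 3, 2, flaky_run))
print(sorted(flow for _, flow in result))
# ['a.yaml', 'b.yaml', 'c.yaml', 'd.yaml']
```

With three devices and a minimum of two, the flow that device 1 dropped
is picked up by a surviving worker and the run completes. Raising the
minimum to three makes the same crash abort the run with an error.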

## Implementation

- `DynamicShardScheduler`: coroutine-per-device over a `Channel<Path>`,
  `AtomicInteger` for pending/alive counters, `AtomicBoolean` for
  cancellation — no manual locking.
- `TestSuiteInteractor`: extracted `buildSummary()`, added `runFromQueue()`
  which drives the work-stealing loop.
- `TestCommand`: new flags with mutual-exclusion validation; dynamic branch
  in `handleSessions()` returns before static `makeChunkPlans()`.