Problem
The current --shard-split N flag divides flows into static chunks before execution starts (round-robin by index). When flows have heterogeneous durations, faster devices finish their chunk early and sit idle while slower devices are still running. The total wall-clock time is bounded by the slowest shard, not by the average flow duration.
# --shard-split 2 with heterogeneous flows:
device A: [5m][5m][5m] ← determines total time
device B: [1m][1m]░░░░░░░░░░░ ← idle 13 minutes
This leads to:
- Underutilization of the device farm
- Unnecessarily long CI pipeline durations
- Manual load-balancing effort to keep shards even
Related: #1818, #2337
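To make the bottleneck concrete, here is a minimal Kotlin sketch of a round-robin-by-index split and its resulting wall-clock time. The function names (roundRobinShards, staticMakespan) and the flow order are assumptions for illustration, not Maestro code; durations are in minutes.

```kotlin
// Hypothetical helper: static round-robin split by index,
// mirroring how --shard-split is described above.
fun roundRobinShards(durations: List<Int>, shards: Int): List<List<Int>> {
    require(shards > 0) { "shards must be positive" }
    return List(shards) { shard ->
        durations.filterIndexed { index, _ -> index % shards == shard }
    }
}

// The wall-clock time of a static split is the duration of the slowest shard.
fun staticMakespan(durations: List<Int>, shards: Int): Int =
    roundRobinShards(durations, shards).maxOf { it.sum() }
```

With durations listOf(5, 1, 5, 1, 5) and 2 shards, the shards come out as [5, 5, 5] and [1, 1], so staticMakespan returns 15 even though the second device is busy for only 2 minutes.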
Proposed solution
Add --shard-split-dynamic N — a new flag that distributes flows via a shared queue (work-stealing):
- All flows are placed in a shared queue
- N workers (one per device) open a single session each
- Each worker consumes the next available flow as soon as it finishes the previous one
- Faster devices naturally pick up more flows — no idle time
# --shard-split-dynamic 2 with the same flows:
device A: [5m][5m] ← no idle time
device B: [1m][1m][5m] ← picks up flows as they become available
Total time trends toward sum(durations) / N instead of max(shard_duration).
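The effect of the shared queue can be estimated with a small simulation. This is a sketch under simplifying assumptions (no crashes, no per-flow setup cost, flows taken in queue order); dynamicMakespan is a hypothetical name, not part of Maestro.

```kotlin
import java.util.PriorityQueue

// Simulates a shared queue: whichever worker becomes free first takes the next flow.
// Returns the time at which the last flow finishes.
fun dynamicMakespan(durations: List<Int>, workers: Int): Int {
    require(workers > 0) { "workers must be positive" }
    // Each entry is the time at which one worker becomes free; all start at 0.
    val freeAt = PriorityQueue<Int>()
    repeat(workers) { freeAt.add(0) }
    var makespan = 0
    for (duration in durations) {
        val start = freeAt.poll()      // earliest-free worker picks up the flow
        val end = start + duration
        freeAt.add(end)
        makespan = maxOf(makespan, end)
    }
    return makespan
}
```

For listOf(5, 1, 5, 1, 5) and 2 workers this returns 11, compared with 15 for a round-robin split of the same list. The result depends on queue order: it can never go below max(sum(durations) / N, longest flow), and this greedy assignment stays within sum(durations) / N + (1 - 1/N) * longest flow.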
Additional flag: --min-healthy-devices M (default: 2)
Prevents a pathological scenario where multiple devices crash and a single surviving device ends up running the entire remaining queue (making dynamic worse than static):
- Tracks alive workers with an AtomicInteger
- If the number of alive workers drops below M, the run is aborted with a clear error
- The in-progress flow on the crashed device is re-enqueued for another worker
# N=3, --min-healthy-devices 2
device C crashes → re-enqueue flow → alive=2 (≥2) → CONTINUE
device B crashes → re-enqueue flow → alive=1 (<2) → ABORT
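The threshold check in the trace above could look roughly like this. WorkerHealth and its methods are hypothetical names chosen for illustration; re-enqueueing the in-progress flow is left out to keep the sketch focused on the counter.

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical tracker; the class name and API are illustrative, not Maestro's.
class WorkerHealth(totalWorkers: Int, private val minHealthy: Int) {
    private val alive = AtomicInteger(totalWorkers)

    // Called when a device session crashes.
    // Returns true if enough workers remain for the run to continue.
    fun onWorkerCrashed(): Boolean = alive.decrementAndGet() >= minHealthy

    fun aliveCount(): Int = alive.get()
}
```

With WorkerHealth(3, 2), the first call to onWorkerCrashed() returns true (alive=2) and the second returns false (alive=1), matching the CONTINUE and ABORT steps in the trace.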
Automatic app memory cleanup
After each flow, the app is stopped automatically (resolved from the flow's own launchApp command — no config needed). This frees device memory before the next flow starts, preventing slowdown in long test suites where accumulated memory causes animation timeouts.
Related: #1862 (per-device attribution in output is also cleaner — each shard label maps to one device throughout the run)
Implementation sketch
- DynamicShardScheduler: coroutineScope + one async per device over a Channel<Path>. AtomicInteger for pending flow count and alive worker count. AtomicBoolean for cancellation. No manual locking.
- TestSuiteInteractor.runFromQueue(): work-stealing loop: tryReceive(), runFlow(), decrement pending. Session crash → re-enqueue + check alive threshold.
- TestCommand: new flags with mutual-exclusion validation against --shard-split/--shard-all; dynamic branch in handleSessions() returns early, leaving the static path untouched.
- Zero regression: --shard-split and --shard-all paths are unchanged.
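For readers who want to see the loop end to end, here is a rough, self-contained stand-in. It is not the proposed implementation: it deliberately swaps coroutines and Channel<Path> for plain JVM threads and a LinkedBlockingQueue so that it runs without kotlinx.coroutines, and it uses String identifiers for flows and devices. The names runDynamic, SessionCrashed, and the runFlow callback are all assumptions made for this sketch.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicInteger
import kotlin.concurrent.thread

// Hypothetical: thrown by the flow runner when a device session dies.
class SessionCrashed(message: String) : Exception(message)

// Runs all flows on the given devices from one shared queue.
// Returns device -> flows it completed; throws if the run had to abort.
fun runDynamic(
    flows: List<String>,
    devices: List<String>,
    minHealthy: Int,
    runFlow: (device: String, flow: String) -> Unit,
): Map<String, List<String>> {
    val queue = LinkedBlockingQueue(flows)
    val pending = AtomicInteger(flows.size)   // flows not yet completed
    val alive = AtomicInteger(devices.size)   // workers still healthy
    val completed = ConcurrentHashMap<String, MutableList<String>>()

    val workers = devices.map { device ->
        thread {
            val done = completed.getOrPut(device) { mutableListOf() }
            while (pending.get() > 0 && alive.get() >= minHealthy) {
                // Short poll: the queue can be empty while another worker's
                // in-flight flow may still be re-enqueued after a crash.
                val flow = queue.poll(10, TimeUnit.MILLISECONDS) ?: continue
                try {
                    runFlow(device, flow)
                    done += flow
                    pending.decrementAndGet()
                } catch (e: SessionCrashed) {
                    queue.put(flow)           // hand the flow to another worker
                    alive.decrementAndGet()
                    return@thread             // this worker is finished
                }
            }
        }
    }
    workers.forEach { it.join() }
    check(pending.get() == 0) {
        "Aborted: ${alive.get()} healthy device(s), need $minHealthy; " +
            "${pending.get()} flow(s) not completed"
    }
    return completed
}
```

Workers exit on pending reaching zero rather than on an empty queue, because an empty queue does not mean the run is done while a flow that might crash is still in flight. A coroutine version would replace the timed poll with tryReceive() plus a suspension point, as the sketch above the code describes.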
Working implementation
I have a working implementation ready. Tested against a real test suite of 16 flows across 3 iOS simulators — all flows complete, correct per-device attribution, and measurably reduced total execution time vs --shard-split. Happy to open a PR if the team is interested.