feat: --shard-split-dynamic — work-stealing parallel execution to eliminate device idle time #3340

@thamys-moraes

Problem

The current --shard-split N flag divides flows into static chunks before execution starts (round-robin by index). When flows have heterogeneous durations, faster devices finish their chunk early and sit idle while slower devices are still running. The total wall-clock time is bound by the slowest shard, not by the average flow duration.

# --shard-split 2 with heterogeneous flows:
device A: [5m][5m][5m]          ← determines total time
device B: [1m][1m]░░░░░░░░░░░   ← idle 13 minutes

This leads to:

  • Underutilization of the device farm
  • Unnecessarily long CI pipeline durations
  • Manual load-balancing effort to keep shards even

Related: #1818, #2337
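
To make the idle-time arithmetic concrete, here is a minimal sketch of the static split described above. It assumes the five flows are queued in the order 5m, 1m, 5m, 1m, 5m, which reproduces the diagram. The names `roundRobinShards` and `staticMakespan` are hypothetical and are not Maestro internals.

```kotlin
// Round-robin by index, as described for --shard-split: flow i goes to shard i % n.
// Function names are hypothetical, not Maestro internals.
fun roundRobinShards(durations: List<Int>, n: Int): List<List<Int>> =
    List(n) { shard -> durations.filterIndexed { i, _ -> i % n == shard } }

// Wall-clock time of a static split is the total of the busiest shard.
fun staticMakespan(durations: List<Int>, n: Int): Int =
    roundRobinShards(durations, n).maxOfOrNull { it.sum() } ?: 0

fun main() {
    val flows = listOf(5, 1, 5, 1, 5) // minutes; the order is an assumption
    val shards = roundRobinShards(flows, 2)
    val totals = shards.map { it.sum() }
    val makespan = staticMakespan(flows, 2)
    println("shards = $shards")                                   // [[5, 5, 5], [1, 1]]
    println("totals = $totals, wall clock = $makespan min")      // [15, 2], 15
    println("idle per device = ${totals.map { makespan - it }}") // [0, 13]
}
```

The run takes 15 minutes even though the total work is only 17 device-minutes across two devices.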

Proposed solution

Add --shard-split-dynamic N — a new flag that distributes flows via a shared queue (work-stealing):

  1. All flows are placed in a shared queue
  2. N workers (one per device) open a single session each
  3. Each worker consumes the next available flow as soon as it finishes the previous one
  4. Faster devices naturally pick up more flows, so no device sits idle while work remains

# --shard-split-dynamic 2 with a similar mix of flows:
device A: [5m][5m]              ← no idle time
device B: [1m][1m][1m][5m]      ← picks up flows as they become available

Total time trends toward sum(durations) / N instead of max(shard_duration).
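
The shared-queue behaviour can be checked with a small deterministic simulation: each flow, in queue order, goes to whichever worker becomes free first. This is only a sketch. `dynamicMakespan` is a hypothetical name, and the queue order below is an assumption chosen to reproduce the diagram above (A runs 5m + 5m, B runs 1m + 1m + 1m + 5m).

```kotlin
import java.util.PriorityQueue

// Simulates a shared queue: the next flow always goes to the worker that frees up first.
// Hypothetical helper for illustration only.
fun dynamicMakespan(durations: List<Int>, workers: Int): Int {
    val freeAt = PriorityQueue<Int>()      // time at which each worker becomes free
    repeat(workers) { freeAt.add(0) }
    for (d in durations) {
        val start = freeAt.poll()          // earliest available worker takes the flow
        freeAt.add(start + d)
    }
    return freeAt.maxOrNull() ?: 0         // the run ends when the last worker finishes
}

fun main() {
    val flows = listOf(5, 1, 1, 1, 5, 5)   // minutes; the order is an assumption
    val makespan = dynamicMakespan(flows, 2)
    val average = flows.sum().toDouble() / 2
    println("dynamic wall clock = $makespan min")  // 10
    println("sum(durations) / N = $average min")   // 9.0
}
```

The result depends on queue order and can never go below the longest single flow, which is why the total only trends toward sum(durations) / N rather than reaching it.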

Additional flag: --min-healthy-devices M (default: 2)

Prevents a pathological scenario where multiple devices crash and a single surviving device ends up running the entire remaining queue (making dynamic worse than static):

  • Tracks alive workers with an AtomicInteger
  • If the number of alive workers drops below M, the run is aborted with a clear error
  • The in-progress flow on the crashed device is re-enqueued for another worker

# N=3, --min-healthy-devices 2
device C crashes → re-enqueue flow → alive=2 (≥2) → CONTINUE
device B crashes → re-enqueue flow → alive=1 (<2)  → ABORT
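
The threshold check in the trace above could look roughly like this. `HealthTracker` and `Decision` are hypothetical names used only for illustration. The single `decrementAndGet()` call keeps the decrement and the comparison consistent when two devices crash at the same time.

```kotlin
import java.util.concurrent.atomic.AtomicInteger

enum class Decision { CONTINUE, ABORT }

// Hypothetical tracker, not a Maestro class.
class HealthTracker(workers: Int, private val minHealthy: Int) {
    private val alive = AtomicInteger(workers)

    // Called once per crashed device; aborts when survivors fall below the threshold.
    fun onCrash(): Decision =
        if (alive.decrementAndGet() < minHealthy) Decision.ABORT else Decision.CONTINUE
}

fun main() {
    val tracker = HealthTracker(workers = 3, minHealthy = 2)
    println(tracker.onCrash()) // device C crashes: alive=2, prints CONTINUE
    println(tracker.onCrash()) // device B crashes: alive=1, prints ABORT
}
```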

Automatic app memory cleanup

After each flow, the app is stopped automatically (resolved from the flow's own launchApp command — no config needed). This frees device memory before the next flow starts, preventing slowdown in long test suites where accumulated memory causes animation timeouts.

Related: #1862 (per-device attribution in output is also cleaner — each shard label maps to one device throughout the run)

Implementation sketch

  • DynamicShardScheduler: coroutineScope + one async per device over a Channel<Path>. AtomicInteger for pending flow count and alive worker count. AtomicBoolean for cancellation. No manual locking.
  • TestSuiteInteractor.runFromQueue(): work-stealing loop — tryReceive(), runFlow(), decrement pending. Session crash → re-enqueue + check alive threshold.
  • TestCommand: new flags with mutual-exclusion validation against --shard-split/--shard-all; dynamic branch in handleSessions() returns early, leaving the static path untouched.
  • Zero regression: --shard-split and --shard-all paths are unchanged.
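
For readers who want to see the worker loop end to end, here is a rough standalone sketch. It is not the actual implementation: to run with only the standard library it swaps `coroutineScope` and `Channel<Path>` for plain threads and a `ConcurrentLinkedQueue`. `runDynamic`, `RunResult`, and `SessionCrashed` are hypothetical names, and the `runFlow` callback stands in for real flow execution.

```kotlin
import java.util.Collections
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.atomic.AtomicInteger
import kotlin.concurrent.thread

// Hypothetical stand-in for a device session failure; not a Maestro type.
class SessionCrashed(message: String) : RuntimeException(message)

data class RunResult(val completed: Map<String, List<String>>, val aborted: Boolean)

fun runDynamic(
    devices: List<String>,
    flows: List<String>,
    minHealthy: Int,
    runFlow: (device: String, flow: String) -> Unit, // throws SessionCrashed if the device dies
): RunResult {
    val queue = ConcurrentLinkedQueue(flows)
    val pending = AtomicInteger(flows.size)   // flows not yet completed
    val alive = AtomicInteger(devices.size)   // workers still healthy
    val cancelled = AtomicBoolean(false)
    val completed = devices.associateWith {
        Collections.synchronizedList(mutableListOf<String>())
    }

    val workers = devices.map { device ->
        thread {
            // Loop on the pending count, not on queue emptiness: a flow that is
            // in flight elsewhere may still be re-enqueued after a crash.
            while (pending.get() > 0 && !cancelled.get()) {
                val flow = queue.poll()
                if (flow == null) {
                    Thread.sleep(5)
                    continue
                }
                try {
                    runFlow(device, flow)
                    completed.getValue(device).add(flow)
                    pending.decrementAndGet()
                } catch (e: SessionCrashed) {
                    queue.add(flow) // hand the in-progress flow to another worker
                    if (alive.decrementAndGet() < minHealthy) cancelled.set(true)
                    return@thread   // this device takes no further work
                }
            }
        }
    }
    workers.forEach { it.join() }
    return RunResult(completed.mapValues { it.value.toList() }, cancelled.get())
}

fun main() {
    val flows = (1..6).map { "flow$it" }
    // Device C is assumed to crash on every flow it picks up.
    val result = runDynamic(listOf("A", "B", "C"), flows, minHealthy = 2) { device, _ ->
        if (device == "C") throw SessionCrashed("C went away")
    }
    println("completed = ${result.completed.values.sumOf { it.size }}, aborted = ${result.aborted}")
}
```

Which device runs which flow varies between runs, so the demo prints only the totals. A coroutine version would replace the sleep-and-poll step with channel operations, but the pending-count loop and the re-enqueue step would stay the same.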

Working implementation

I have a working implementation ready. Tested against a real test suite of 16 flows across 3 iOS simulators — all flows complete, correct per-device attribution, and measurably reduced total execution time vs --shard-split. Happy to open a PR if the team is interested.
