- Author: @MilkClouds
- Status: Implemented
- Type: Standards Track
- Created: 2026-02-23
- Requires: RFC-0004
- Superseded-By: —
Add episode-level sharding so that N independent OS processes can evaluate disjoint subsets of episodes in parallel, sharing a single model server. A merge command combines shard results into a unified report. This is the primary mechanism for reducing wall-clock evaluation time.
LIBERO Spatial: 10 tasks × 50 episodes × ~220 steps = ~110,000 steps, executed sequentially. A single evaluation run takes hours. The bottleneck is the serial alternation of CPU simulation and GPU inference — while one is working, the other is idle.
Parallelizing at the episode level is the most impactful optimization because:
- Episodes are independent — no shared state between them.
- Multiple simulation processes can overlap their inference wait time with other simulations.
- The model server already accepts multiple WebSocket clients concurrently.
Some benchmark simulators (notably LIBERO/robosuite) have internal state that conflicts with Python multiprocessing (OpenGL contexts, MuJoCo shared memory, global state in `robosuite.utils`). Shell-level process isolation (`cmd &`) avoids these issues entirely: each process gets its own address space with no shared state.
```bash
# Run 4 shards in parallel
vla-eval run -c libero_spatial.yaml --shard-id 0 --num-shards 4 &
vla-eval run -c libero_spatial.yaml --shard-id 1 --num-shards 4 &
vla-eval run -c libero_spatial.yaml --shard-id 2 --num-shards 4 &
vla-eval run -c libero_spatial.yaml --shard-id 3 --num-shards 4 &
wait

# Merge results
vla-eval merge results/libero_spatial_shard*.json -o results/libero_spatial.json
```

Without `--shard-id`/`--num-shards`, behavior is unchanged (single process, all episodes).
Work items are the flat list of `(task, episode_idx)` pairs. Round-robin assignment distributes them evenly:

```python
work_items = [(task, ep) for task in tasks for ep in range(episodes_per_task)]
my_items = [w for i, w in enumerate(work_items) if i % num_shards == shard_id]
```

Round-robin (not contiguous blocks) because different tasks may have different `max_steps`, and round-robin naturally balances load across shards.
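As a quick sanity check (with made-up task and episode counts, not taken from a real benchmark config), the assignment above partitions the work list into disjoint shards whose sizes differ by at most one:

```python
# Illustrative check that round-robin sharding partitions the work list:
# every (task, episode) pair lands in exactly one shard.
tasks = [f"task_{i}" for i in range(10)]  # hypothetical task names
episodes_per_task = 50
num_shards = 4

work_items = [(task, ep) for task in tasks for ep in range(episodes_per_task)]
shards = [
    [w for i, w in enumerate(work_items) if i % num_shards == shard_id]
    for shard_id in range(num_shards)
]

# Disjoint and complete: the shards together cover every item exactly once.
assert sum(len(s) for s in shards) == len(work_items)
assert set().union(*map(set, shards)) == set(work_items)

# Round-robin keeps shard sizes within one item of each other.
sizes = [len(s) for s in shards]
assert max(sizes) - min(sizes) <= 1
print(sizes)  # each shard gets 125 of the 500 items
```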
Shard results use deterministic filenames (no timestamp) so re-runs overwrite previous attempts:

```
results/{benchmark_name}_shard{id}of{total}.json
```

Non-sharded runs keep the existing `{name}_{mode}_{timestamp}.json` format.
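A minimal sketch of the two naming schemes (`shard_result_path` is a hypothetical helper for illustration, not part of the harness):

```python
from datetime import datetime


def shard_result_path(benchmark_name, shard_id=None, num_shards=None, mode="sync"):
    """Deterministic name for shard results; timestamped name otherwise.

    Hypothetical helper illustrating the naming scheme, not harness code.
    """
    if shard_id is not None and num_shards is not None:
        # No timestamp: re-running a shard overwrites its previous attempt.
        return f"results/{benchmark_name}_shard{shard_id}of{num_shards}.json"
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"results/{benchmark_name}_{mode}_{ts}.json"


print(shard_result_path("libero_spatial", shard_id=0, num_shards=4))
# results/libero_spatial_shard0of4.json
```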
Each shard result includes shard provenance:
```json
{
  "benchmark": "libero_spatial",
  "mode": "sync",
  "shard": {"id": 0, "total": 4},
  "partial": false,
  "tasks": [ ... ],
  "overall_success_rate": 0.92
}
```

The `"shard"` field is absent in non-sharded runs.
`vla-eval merge <glob> [-o output.json]` performs:
- Load all shard JSON files.
- Validate consistency: same benchmark name, same shard total, no duplicate shard IDs.
- Detect missing shards: compare found shard IDs against `range(total)`.
- Merge episodes by task name. Deduplicate by `episode_id` (last-write-wins for re-runs).
- Aggregate metrics: per-task success rate, overall success rate.
- Report coverage and save merged result.
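The steps above can be sketched in Python. This is an illustrative implementation, not the harness code: `merge_shards` is a hypothetical name, and the per-task episode schema (`name`, `episodes`, `episode_id`, `success`) is an assumption based on the result example; the real schema may differ.

```python
import json
from pathlib import Path


def merge_shards(paths):
    """Sketch of the merge steps: load, validate, dedupe, aggregate."""
    shards = [json.loads(Path(p).read_text()) for p in paths]

    # Validate consistency: same benchmark, same shard total, unique shard IDs.
    benchmarks = {s["benchmark"] for s in shards}
    totals = {s["shard"]["total"] for s in shards}
    ids = [s["shard"]["id"] for s in shards]
    assert len(benchmarks) == 1 and len(totals) == 1, "inconsistent shards"
    assert len(ids) == len(set(ids)), "duplicate shard IDs"

    # Detect missing shards by comparing against range(total).
    missing = sorted(set(range(totals.pop())) - set(ids))

    # Merge episodes by task; last-write-wins per (task, episode_id).
    episodes = {}
    for shard in sorted(shards, key=lambda s: s["shard"]["id"]):
        for task in shard["tasks"]:
            for ep in task["episodes"]:
                episodes[(task["name"], ep["episode_id"])] = ep

    successes = sum(1 for ep in episodes.values() if ep["success"])
    return {
        "benchmark": benchmarks.pop(),
        "missing_shards": missing,
        "episodes_covered": len(episodes),
        "overall_success_rate": successes / len(episodes) if episodes else 0.0,
    }
```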
Output example:

```
All 4 shards complete. Coverage: 500/500 episodes (100.0%)
Overall success rate: 94.2% (471/500)
Saved to: results/libero_spatial.json
```
Partial example:

```
⚠ Missing shards: [2] (expected 0..3)
Coverage: 375/500 episodes (75.0%)
Merged result (PARTIAL): 93.1% (349/375)
Saved to: results/libero_spatial.json
To complete: vla-eval run -c libero_spatial.yaml --shard-id 2 --num-shards 4
```
| Scenario | Behavior |
|---|---|
| Shard process crashes | No result file (or a partial one). `merge` detects the missing shard, reports the coverage gap, and suggests a re-run command. |
| Some episodes fail within a shard | Existing per-episode error isolation (RFC-0004). Episodes are recorded with `failure_reason`; the shard file is marked `"partial": true`. |
| Re-run a failed shard | Deterministic filename overwrites the previous attempt; `merge` deduplicates by `episode_id` (last-write-wins). |
| All shards succeed | Clean merge, full coverage. |
Sharding args pass through to the Docker container transparently:
```bash
docker run ghcr.io/allenai/vla-evaluation-harness/libero:latest run --no-docker --config /tmp/config.yaml --shard-id 0 --num-shards 4
```

The `_run_via_docker` function forwards `--shard-id` and `--num-shards` to the container command.
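A sketch of the forwarding logic (`build_docker_command` is a hypothetical helper; the real `_run_via_docker` signature and argument order may differ):

```python
def build_docker_command(image, config_path, shard_id=None, num_shards=None):
    """Build a docker run invocation, forwarding sharding flags if set.

    Hypothetical sketch of how _run_via_docker might forward args;
    not the harness implementation.
    """
    cmd = ["docker", "run", image, "run", "--no-docker", "--config", config_path]
    if shard_id is not None and num_shards is not None:
        # Pass sharding args through to the in-container CLI unchanged.
        cmd += ["--shard-id", str(shard_id), "--num-shards", str(num_shards)]
    return cmd
```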
N shard processes connect to the same model server URL. The server already handles multiple WebSocket clients via `websockets.serve` (one coroutine per connection). `PredictModelServer` dispatches each request to `run_in_executor`, so N concurrent requests run in N threads.
This is not true GPU batching: requests are serialized on the GPU. True batching requires `BatchPredictModelServer` (RFC-0003, future work). However, even without batching, sharding provides significant speedup because simulation time in one process overlaps with inference time in another.
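A back-of-envelope model shows why sharding still helps when the GPU serializes requests: each shard's wall clock is bounded below by both its own sim+inference loop and the aggregate GPU time. The per-step times below are illustrative assumptions, not measurements.

```python
def wall_clock(num_shards, total_steps, sim_s, infer_s):
    """Estimate wall-clock seconds when the GPU serializes inference.

    Simplified model: each shard alternates simulation and inference;
    the GPU serves one request at a time, so wall clock is the larger
    of one shard's own loop and the serialized GPU floor.
    """
    steps_per_shard = total_steps / num_shards
    per_shard = steps_per_shard * (sim_s + infer_s)  # one shard's own loop
    gpu_floor = total_steps * infer_s                # serialized GPU lower bound
    return max(per_shard, gpu_floor)


# Illustrative numbers: 110,000 total steps, 50 ms sim, 50 ms inference.
t1 = wall_clock(1, 110_000, 0.05, 0.05)
t4 = wall_clock(4, 110_000, 0.05, 0.05)
print(f"1 shard: {t1 / 3600:.1f} h, 4 shards: {t4 / 3600:.1f} h, "
      f"speedup {t1 / t4:.1f}x")
```

Under these assumptions, 4 shards roughly halve wall-clock time even though the GPU never batches; true batching (RFC-0003) would raise the GPU floor's throughput further.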
| Change | File | ~Lines |
|---|---|---|
| CLI args `--shard-id`, `--num-shards` | `cli/main.py` | ~15 |
| Shard filtering + deterministic filenames | `orchestrator.py` | ~30 |
| `vla-eval merge` command | `cli/main.py` + `results/merge.py` | ~100 |
| Docker arg forwarding | `cli/main.py` | ~5 |
| Tests | `tests/` | ~80 |
Total: ~230 lines of new/changed code.
`vla-eval run-parallel`: a convenience wrapper that spawns N shard processes and auto-merges their results. Not needed for v1; shell scripts suffice.