- Author: @MilkClouds
- Status: Implemented
- Type: Standards Track
- Created: 2026-02-23
- Requires: RFC-0004
- Superseded-By: —
Add episode-level sharding so that N independent OS processes can evaluate disjoint subsets of episodes in parallel, sharing a single model server. A merge command combines shard results into a unified report. This is the primary mechanism for reducing wall-clock evaluation time.
LIBERO Spatial: 10 tasks × 50 episodes × ~220 steps = ~110,000 steps, executed sequentially. A single evaluation run takes hours. The bottleneck is the serial alternation of CPU simulation and GPU inference — while one is working, the other is idle.
Parallelizing at the episode level is the most impactful optimization because:
- Episodes are independent — no shared state between them.
- Multiple simulation processes can overlap their inference wait time with other simulations.
- The model server already accepts multiple WebSocket clients concurrently.
Some benchmark simulators (notably LIBERO/robosuite) have internal state that conflicts with Python multiprocessing (OpenGL contexts, MuJoCo shared memory, global state in `robosuite.utils`). Shell-level process isolation (`cmd &`) avoids these issues entirely: each process gets its own address space with no shared state.
```bash
# Run 4 shards in parallel
vla-eval run -c libero_spatial.yaml --shard-id 0 --num-shards 4 &
vla-eval run -c libero_spatial.yaml --shard-id 1 --num-shards 4 &
vla-eval run -c libero_spatial.yaml --shard-id 2 --num-shards 4 &
vla-eval run -c libero_spatial.yaml --shard-id 3 --num-shards 4 &
wait

# Merge results
vla-eval merge results/libero_spatial_shard*.json -o results/libero_spatial.json
```

Without `--shard-id`/`--num-shards`, behavior is unchanged (single process, all episodes).
Work items are the flat list of `(task, episode_idx)` pairs. Round-robin assignment distributes them evenly:

```python
work_items = [(task, ep) for task in tasks for ep in range(episodes_per_task)]
my_items = [w for i, w in enumerate(work_items) if i % num_shards == shard_id]
```

Round-robin (not contiguous blocks) because different tasks may have different `max_steps`, and round-robin naturally balances load across shards.
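As a quick sanity check (with made-up task and episode counts, not taken from a real benchmark config), the assignment above partitions the work list into disjoint shards whose sizes differ by at most one:

```python
# Illustrative check that round-robin sharding partitions the work list:
# every (task, episode) pair lands in exactly one shard.
tasks = [f"task_{i}" for i in range(10)]  # hypothetical task names
episodes_per_task = 50
num_shards = 4

work_items = [(task, ep) for task in tasks for ep in range(episodes_per_task)]
shards = [
    [w for i, w in enumerate(work_items) if i % num_shards == shard_id]
    for shard_id in range(num_shards)
]

# Disjoint and complete: the shards together cover every item exactly once.
assert sum(len(s) for s in shards) == len(work_items)
assert set().union(*map(set, shards)) == set(work_items)

# Round-robin keeps shard sizes within one item of each other.
sizes = [len(s) for s in shards]
assert max(sizes) - min(sizes) <= 1
print(sizes)  # each shard gets 125 of the 500 items
```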
Shard results use deterministic filenames (no timestamp) so re-runs overwrite previous attempts:

```
results/{benchmark_name}_shard{id}of{total}.json
```

Non-sharded runs keep the existing `{name}_{mode}_{timestamp}.json` format.
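A minimal sketch of the two naming schemes (`shard_result_path` is a hypothetical helper for illustration, not part of the harness):

```python
from datetime import datetime


def shard_result_path(benchmark_name, shard_id=None, num_shards=None, mode="sync"):
    """Deterministic name for shard results; timestamped name otherwise.

    Hypothetical helper illustrating the naming scheme, not harness code.
    """
    if shard_id is not None and num_shards is not None:
        # No timestamp: re-running a shard overwrites its previous attempt.
        return f"results/{benchmark_name}_shard{shard_id}of{num_shards}.json"
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"results/{benchmark_name}_{mode}_{ts}.json"


print(shard_result_path("libero_spatial", shard_id=0, num_shards=4))
# results/libero_spatial_shard0of4.json
```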
Each shard result includes shard provenance:
```json
{
  "benchmark": "libero_spatial",
  "mode": "sync",
  "shard": {"id": 0, "total": 4},
  "partial": false,
  "tasks": [ ... ],
  "overall_success_rate": 0.92
}
```

The `"shard"` field is absent in non-sharded runs.
`vla-eval merge <glob> [-o output.json]` performs:
- Load all shard JSON files.
- Validate consistency: same benchmark name, same shard total, no duplicate shard IDs.
- Detect missing shards: compare found shard IDs against `range(total)`.
- Merge episodes by task name. Deduplicate by `episode_id` (last-write-wins for re-runs).
- Aggregate metrics: per-task success rate, overall success rate.
- Report coverage and save merged result.
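The steps above can be sketched in Python. This is an illustrative implementation, not the harness code: `merge_shards` is a hypothetical name, and the per-task episode schema (`name`, `episodes`, `episode_id`, `success`) is an assumption based on the result example; the real schema may differ.

```python
import json
from pathlib import Path


def merge_shards(paths):
    """Sketch of the merge steps: load, validate, dedupe, aggregate."""
    shards = [json.loads(Path(p).read_text()) for p in paths]

    # Validate consistency: same benchmark, same shard total, unique shard IDs.
    benchmarks = {s["benchmark"] for s in shards}
    totals = {s["shard"]["total"] for s in shards}
    ids = [s["shard"]["id"] for s in shards]
    assert len(benchmarks) == 1 and len(totals) == 1, "inconsistent shards"
    assert len(ids) == len(set(ids)), "duplicate shard IDs"

    # Detect missing shards by comparing against range(total).
    missing = sorted(set(range(totals.pop())) - set(ids))

    # Merge episodes by task; last-write-wins per (task, episode_id).
    episodes = {}
    for shard in sorted(shards, key=lambda s: s["shard"]["id"]):
        for task in shard["tasks"]:
            for ep in task["episodes"]:
                episodes[(task["name"], ep["episode_id"])] = ep

    successes = sum(1 for ep in episodes.values() if ep["success"])
    return {
        "benchmark": benchmarks.pop(),
        "missing_shards": missing,
        "episodes_covered": len(episodes),
        "overall_success_rate": successes / len(episodes) if episodes else 0.0,
    }
```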
Output example:

```
All 4 shards complete. Coverage: 500/500 episodes (100.0%)
Overall success rate: 94.2% (471/500)
Saved to: results/libero_spatial.json
```
Partial example:

```
⚠ Missing shards: [2] (expected 0..3)
Coverage: 375/500 episodes (75.0%)
Merged result (PARTIAL): 93.1% (349/375)
Saved to: results/libero_spatial.json
To complete: vla-eval run -c libero_spatial.yaml --shard-id 2 --num-shards 4
```
| Scenario | Behavior |
|---|---|
| Shard process crashes | No result file (or a partial one). `merge` detects the missing shard, reports the coverage gap, and suggests a re-run command. |
| Some episodes fail within a shard | Existing per-episode error isolation (RFC-0004). Episodes are recorded with `failure_reason`; the shard file is marked `"partial": true`. |
| Re-run a failed shard | Deterministic filename overwrites the previous attempt; `merge` deduplicates by `episode_id` (last-write-wins). |
| All shards succeed | Clean merge, full coverage. |
Sharding args pass through to the Docker container transparently:
```bash
docker run ghcr.io/allenai/vla-evaluation-harness/libero:latest run --no-docker --config /tmp/config.yaml --shard-id 0 --num-shards 4
```

The `_run_via_docker` function forwards `--shard-id` and `--num-shards` to the container command.
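A sketch of the forwarding logic (`build_docker_command` is a hypothetical helper; the real `_run_via_docker` signature and argument order may differ):

```python
def build_docker_command(image, config_path, shard_id=None, num_shards=None):
    """Build a docker run invocation, forwarding sharding flags if set.

    Hypothetical sketch of how _run_via_docker might forward args;
    not the harness implementation.
    """
    cmd = ["docker", "run", image, "run", "--no-docker", "--config", config_path]
    if shard_id is not None and num_shards is not None:
        # Pass sharding args through to the in-container CLI unchanged.
        cmd += ["--shard-id", str(shard_id), "--num-shards", str(num_shards)]
    return cmd
```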
N shard processes connect to the same model server URL. The server already handles multiple WebSocket clients via `websockets.serve` (one coroutine per connection). `PredictModelServer` dispatches each request to `run_in_executor`, so N concurrent requests run in N threads.
This is not true GPU batching: requests are serialized on the GPU. True batching requires `BatchPredictModelServer` (RFC-0003, future work). However, even without batching, sharding provides significant speedup because simulation time in one process overlaps with inference time in another.
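A back-of-envelope model shows why sharding still helps when the GPU serializes requests: each shard's wall clock is bounded below by both its own sim+inference loop and the aggregate GPU time. The per-step times below are illustrative assumptions, not measurements.

```python
def wall_clock(num_shards, total_steps, sim_s, infer_s):
    """Estimate wall-clock seconds when the GPU serializes inference.

    Simplified model: each shard alternates simulation and inference;
    the GPU serves one request at a time, so wall clock is the larger
    of one shard's own loop and the serialized GPU floor.
    """
    steps_per_shard = total_steps / num_shards
    per_shard = steps_per_shard * (sim_s + infer_s)  # one shard's own loop
    gpu_floor = total_steps * infer_s                # serialized GPU lower bound
    return max(per_shard, gpu_floor)


# Illustrative numbers: 110,000 total steps, 50 ms sim, 50 ms inference.
t1 = wall_clock(1, 110_000, 0.05, 0.05)
t4 = wall_clock(4, 110_000, 0.05, 0.05)
print(f"1 shard: {t1 / 3600:.1f} h, 4 shards: {t4 / 3600:.1f} h, "
      f"speedup {t1 / t4:.1f}x")
```

Under these assumptions, 4 shards roughly halve wall-clock time even though the GPU never batches; true batching (RFC-0003) would raise the GPU floor's throughput further.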
| Change | File | ~Lines |
|---|---|---|
| CLI args `--shard-id`, `--num-shards` | `cli/main.py` | ~15 |
| Shard filtering + deterministic filenames | `orchestrator.py` | ~30 |
| `vla-eval merge` command | `cli/main.py` + `results/merge.py` | ~100 |
| Docker arg forwarding | `cli/main.py` | ~5 |
| Tests | `tests/` | ~80 |
Total: ~230 lines of new/changed code.
`vla-eval run-parallel`: a convenience wrapper that spawns N shard processes and auto-merges their results. Not needed for v1; shell scripts suffice.