JSON Benchmark Evaluation

Run learned policies on fixed, reproducible benchmarks.

Concepts

A benchmark is a benchmark.json file containing a list of self-contained episode specs. Each spec includes everything needed to recreate a task: scene, robot pose, object poses, cameras, language instructions.

An eval config is a normal datagen config with your policy attached. When pointed at a benchmark, episode-specific fields (cameras, init_qpos, object_poses, etc.) are overwritten by the benchmark spec. This means you can debug your policy in normal datagen using datagen/main, then just swap to benchmark mode for eval by using eval_main or importing run_evaluation from same.

Installing the Benchmarks

The benchmark datasets are installed when resource manager gets instantiated. You can either run

export MLSPACES_ASSETS_DIR=/path/to/symlink/resources
python -m molmo_spaces.molmo_spaces_constants

or you can explicitly code in the python script

from molmospaces.molmo_spaces_constants import get_resource_manager()
get_resource_manager()

Running Benchmarks

Quick guide: set up the Pi policy server once, then run the benchmark.

1. Setup: install and run the Pi policy server

Download the checkpoint and start the policy server (leave it running in a separate terminal):

git clone https://github.com/omarrayyann/openpi
mkdir checkpoints && cd checkpoints
gsutil cp -r gs://openpi-assets/checkpoints/pi05_droid_jointpos .
# other options: `pi05_droid_jointpos`, `pi0_fast_droid_jointpos`, `pi0_droid_jointpos`

Install openpi and run the server (default port: 8080):

uv run scripts/serve_policy.py --port=8080 policy:checkpoint \
  --policy.config=<checkpoint_name> \
  --policy.dir=checkpoints/<checkpoint_name>/

2. Run the benchmark

If using OpenPI models: pip install openpi_client.

Then launch benchmark episodes in MuJoCo:

python molmo_spaces/evaluation/eval_main.py \
    molmo_spaces.evaluation.configs.evaluation_configs:PiPolicyEvalConfig \
    --benchmark_dir assets/benchmarks/path-to-benchmark/ \
    --task_horizon_steps 500

Make sure the port number is the same in molmo_spaces.configs.policy_configs_baselines:PiPolicyConfig

Also, see molmo_spaces/evaluation/configs/evaluation_configs.py for more examples on eval configs.

Running benchmark with a custom asset

You might want to replace the target rigid object for pick or pick-and-place with a custom asset for a specific episode.

python molmo_spaces/evaluation/eval_main.py \
    molmo_spaces.evaluation.configs.evaluation_configs:PiPolicyEvalConfig \
    --benchmark_dir assets/benchmarks/path-to-benchmark.json \
    --checkpoint_path <path/to/checkpoint/pi0_fast_droid_jointpos> \
    --task_horizon_steps 500
    --idx 0
    --add_custom_object
    --custom_object_path <path/to/custom/object.xml>
    --custom_object_name <natural/language/name/of/object>

Implementing Eval in an External Repo

You need three things: a policy class, a policy config, and an eval config.

1. Policy Class

Extend InferencePolicy. Must implement prepare_model, reset, and get_action.

# my_repo/policy.py
from molmo_spaces.policy.base_policy import InferencePolicy

class MyPolicy(InferencePolicy):
    def __init__(self, config, task_type):
        super().__init__(config, task_type)
        self.camera_names = config.policy_config.camera_names
        self.action_spec = config.policy_config.action_spec
        self.prepare_model()

    def prepare_model(self):
        # Load your model from config.policy_config.checkpoint_path
        self.model = load_my_model(self.config.policy_config.checkpoint_path)

    def reset(self):
        # Called at the start of each episode
        pass

    def get_action(self, observation) -> dict[str, np.ndarray]:
        # observation is a dict with camera images and robot_state
        # Return dict mapping move group names to action arrays
        # e.g. {"arm": np.array([...]), "gripper": np.array([...])}
        obs = observation[0] if isinstance(observation, list) else observation
        images = [obs[cam] for cam in self.camera_names]
        state = obs["robot_state"]["qpos"]
        return self.model.predict(images, state)

See molmo_spaces/policy/learned_policy/synthvla_policy.py for a full example with action chunking.

2. Policy Config

Extend BasePolicyConfig. Define your model's interface.

# my_repo/configs.py
from molmo_spaces.configs.policy_configs import BasePolicyConfig

class MyPolicyConfig(BasePolicyConfig):
    policy_type: str = "learned"
    action_type: str = "joint_pos_rel"
    policy_cls: type = None

    def model_post_init(self, __context):
        if self.policy_cls is None:
            from my_repo.policy import MyPolicy
            object.__setattr__(self, "policy_cls", MyPolicy)

    checkpoint_path: str
    camera_names: list[str] = ["exo_camera_1", "wrist_camera"]
    action_move_group_names: list[str] = ["arm", "gripper"]
    action_spec: dict[str, int] = {"arm": 7, "gripper": 1}

3. Eval Config

Extend JsonBenchmarkEvalConfig. This is the minimal config for benchmark eval - episode-specific data (cameras, poses, task params) comes from the benchmark JSON.

# my_repo/configs.py
from molmo_spaces.configs.robot_configs import FrankaRobotConfig
from molmo_spaces.evaluation.configs.evaluation_configs import JsonBenchmarkEvalConfig

class MyEvalConfig(JsonBenchmarkEvalConfig):
    robot_config: FrankaRobotConfig = FrankaRobotConfig()
    policy_config: MyPolicyConfig = MyPolicyConfig(
        checkpoint_path="/path/to/default/checkpoint"
    )
    policy_dt_ms: float = 200.0  # Match your model's expected control rate

    def model_post_init(self, __context):
        super().model_post_init(__context)
        self.robot_config.action_noise_config.enabled = False

4. Run Evaluation

Command line:

python molmo_spaces/evaluation/eval_main.py \
    my_repo.configs:MyEvalConfig \
    --benchmark_dir <path/to/benchmark.json> \
    --checkpoint_path <path/to/checkpoint/directory> \
    --task_horizon_steps 500

Or programmatically:

from pathlib import Path
from molmo_spaces.evaluation.eval_main import run_evaluation

results = run_evaluation(
    eval_config_cls="my_repo.configs:MyEvalConfig",
    benchmark_dir=Path("<path/to/benchmark.json>"),
    checkpoint_path="<path/to/checkpoint/directory>",
    task_horizon_steps=500,
    use_wandb=True,
)

print(f"Success rate: {results.success_rate:.1%}")
for r in results.episode_results:
    print(f"{r.house_id}/ep{r.episode_idx}: {'pass' if r.success else 'fail'}")

You can also pass preloaded_policy= if you've already instantiated the policy.

Sample Episode Spec

{
  "source": {
    "h5_file": "/.../.../house_2115/trajectories_batch_3_of_3.h5",
    "traj_key": "traj_2",
    "episode_length": 49,
    "camera_system_class": "FrankaDroidCameraSystem",
    "source_data_date": "2025-12-19",
    "benchmark_created_date": "2026-01-21"
  },
  "house_index": 2115,
  "scene_dataset": "procthor-objaverse",
  "data_split": "val",
  "seed": null,
  "robot": {
    "robot_name": "franka_droid",
    "init_qpos": {
      "base": [],
      "arm": [-0.024, -0.737, -0.007, -2.327, -0.038, 1.590, 0.020],
      "gripper": [0.003, 0.003]
    }
  },
  "cameras": [
    {
      "name": "wrist_camera",
      "type": "robot_mounted",
      "reference_body_names": ["robot_0/gripper/base"],
      "camera_offset": [0.031, 0.074, 0.022],
      "lookat_offset": [0.0, 0.0, 0.08],
      "camera_quaternion": [-0.006, -0.001, 0.986, 0.169],
      "fov": 56.74
    },
    {
      "name": "exo_camera_1",
      "type": "robot_mounted",
      "reference_body_names": ["robot_0/fr3_link0"],
      "camera_offset": [0.1, 0.57, 0.66],
      "lookat_offset": [0.0, 0.0, 0.08],
      "camera_quaternion": [-0.363, -0.124, 0.426, 0.819],
      "fov": 71.0
    }
  ],
  "scene_modifications": {
    "added_objects": {},
    "object_poses": {
      "pillow_1c5c1394...": [6.79, 3.98, 0.84, -0.50, -0.50, 0.50, 0.50],
      "bowl_f159d8f5...": [2.03, 3.97, 0.82, 0.0, 0.0, 0.71, 0.71],
      "...": "// ~20 more objects with [x, y, z, qw, qx, qy, qz] poses"
    }
  },
  "task": {
    "task_cls": "molmo_spaces.tasks.pick_task.PickTask",
    "robot_base_pose": [1.61, 4.41, 0.07, 0.90, 0.0, 0.0, -0.45],
    "pickup_obj_name": "bowl_f159d8f5528715d01c1bddd6ef86dbcb_1_0_2",
    "pickup_obj_start_pose": [2.03, 3.97, 0.82, 0.0, 0.0, 0.71, 0.71],
    "pickup_obj_goal_pose": [2.03, 3.97, 0.87, 0.0, 0.0, 0.71, 0.71],
    "succ_pos_threshold": 0.01
  },
  "language": {
    "task_description": "Pick up a white bowl",
    "referral_expressions": {"pickup_obj_name": "white bowl"},
    "referral_expressions_priority": {
      "pickup_obj_name": [
        [0.045, 0.31, "smooth white bowl"],
        [0.041, 0.29, "white bowl"],
        [0.023, 0.25, "bowl"],
        "// ... more referrals with [clip_score, dino_score, text]"
      ]
    }
  },
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JSON Benchmark Evaluation

Concepts

Installing the Benchmarks

Running Benchmarks

Quick guide: set up the Pi policy server once, then run the benchmark.

1. Setup: install and run the Pi policy server

2. Run the benchmark

Running benchmark with a custom asset

Implementing Eval in an External Repo

1. Policy Class

2. Policy Config

3. Eval Config

4. Run Evaluation

Sample Episode Spec

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

JSON Benchmark Evaluation

Concepts

Installing the Benchmarks

Running Benchmarks

Quick guide: set up the Pi policy server once, then run the benchmark.

1. Setup: install and run the Pi policy server

2. Run the benchmark

Running benchmark with a custom asset

Implementing Eval in an External Repo

1. Policy Class

2. Policy Config

3. Eval Config

4. Run Evaluation

Sample Episode Spec