## Summary
I'm trying to reproduce the zero-shot evaluation results in Figure 10 using the public pi05_droid_jointpos checkpoint and the official evaluation code, but I consistently get significantly lower success rates on the Pick task (~14–18% vs. the reported 28%, depending on objaverse version and horizon; see the table below).
## Setup
- Checkpoint: `pi05_droid_jointpos`, downloaded via `gsutil cp -r gs://openpi-assets/checkpoints/pi05_droid_jointpos .` as instructed in the README
- Benchmark: `FrankaPickHardBench_20260206_json_benchmark` (procthor-objaverse, 2000 episodes)
- Eval config: `PiPolicyEvalConfig` (`policy_dt_ms=66.0`)
- `task_horizon_steps`: 300 (I also ran with 500; see Results)
- Eval command:

```python
from molmo_spaces.evaluation.eval_main import run_evaluation

results = run_evaluation(
    eval_config_cls=PiPolicyEvalConfig,
    benchmark_dir="...FrankaPickHardBench_20260206_json_benchmark",
    checkpoint_path="checkpoints/pi05_droid_jointpos",
    task_horizon_steps=300,
    num_workers=16,
)
```
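For context on whether this gap could be sampling noise: a quick Wilson score interval for my best run (~18.4% over 2000 episodes) shows the reported 28% lies far outside the 95% confidence interval. This is a plain-Python check, independent of any project code:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# ~18.4% success over 2000 episodes -> roughly 368 successes
lo, hi = wilson_ci(successes=368, n=2000)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")  # upper bound is well below 0.28
```

So the discrepancy is well beyond run-to-run sampling variance at this episode count, which is why I suspect a configuration difference rather than noise.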
- OpenPI server: `serve_policy.py --policy.config=pi05_droid_jointpos`
## Results

| Setting | Pick Success Rate |
| --- | --- |
| Paper Figure 10 (π0.5) | 28% |
| My result (objaverse 20260131, horizon=500) | ~15.9% |
| My result (objaverse 20251016_from_20250610, horizon=500) | ~18.4% |
| My result (objaverse 20251016_from_20250610, horizon=300) | ~14.1% |
## Questions

1. Which objaverse version? The code accepts both 20260131 and 20251016_from_20250610. Which was used for the reported results?
2. Any other configuration details not documented (e.g., specific server-side settings, random seeds) that might affect reproducibility?
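To illustrate why I'm asking about seeds: if episode initialization (object poses, spawn order, etc.) is driven by an unpinned RNG, different runs can diverge even with identical configs. A minimal stdlib-only sketch of the behavior I mean (`seeded_rollout` is a hypothetical stand-in, not project code):

```python
import random

def seeded_rollout(seed: int, n: int = 3) -> list[float]:
    # Per-run RNG instance, independent of global random state.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

print(seeded_rollout(42) == seeded_rollout(42))  # True: same seed, same episodes
print(seeded_rollout(42) == seeded_rollout(43))  # False: seed changes the rollout
```

If the reported numbers were produced with a pinned seed (or averaged over several), knowing that would help me match the setup.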
## Environment
- Python 3.10, conda environment
- MuJoCo with EGL rendering
- OpenPI server on the same machine
Any guidance on reproducing the reported numbers would be appreciated.