## Summary
I'm trying to reproduce the zero-shot evaluation results in Figure 10 using the public pi05_droid_jointpos checkpoint and the official evaluation code, but I consistently get significantly lower success rates on the Pick task (~14–18% vs. the reported 28%, depending on objaverse version and horizon; see the table below).
## Setup
- Checkpoint: `pi05_droid_jointpos`, downloaded via `gsutil cp -r gs://openpi-assets/checkpoints/pi05_droid_jointpos .` as instructed in the README
- Benchmark: `FrankaPickHardBench_20260206_json_benchmark` (procthor-objaverse, 2000 episodes)
- Eval config: `PiPolicyEvalConfig` (`policy_dt_ms=66.0`)
- `task_horizon_steps`: 300 (I also ran with 500; see Results)
- Eval command:

```python
from molmo_spaces.evaluation.eval_main import run_evaluation

results = run_evaluation(
    eval_config_cls=PiPolicyEvalConfig,
    benchmark_dir="...FrankaPickHardBench_20260206_json_benchmark",
    checkpoint_path="checkpoints/pi05_droid_jointpos",
    task_horizon_steps=300,
    num_workers=16,
)
```
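For context on whether this gap could be sampling noise: a quick Wilson score interval for my best run (~18.4% over 2000 episodes) shows the reported 28% lies far outside the 95% confidence interval. This is a plain-Python check, independent of any project code:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# ~18.4% success over 2000 episodes -> roughly 368 successes
lo, hi = wilson_ci(successes=368, n=2000)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")  # upper bound is well below 0.28
```

So the discrepancy is well beyond run-to-run sampling variance at this episode count, which is why I suspect a configuration difference rather than noise.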
- OpenPI server: `serve_policy.py --policy.config=pi05_droid_jointpos`
## Results

| Setting | Pick Success Rate |
| --- | --- |
| Paper Figure 10 (π0.5) | 28% |
| My result (objaverse 20260131, horizon=500) | ~15.9% |
| My result (objaverse 20251016_from_20250610, horizon=500) | ~18.4% |
| My result (objaverse 20251016_from_20250610, horizon=300) | ~14.1% |
## Questions

1. Which objaverse version? The code accepts both 20260131 and 20251016_from_20250610. Which was used for the reported results?
2. Any other configuration details not documented (e.g., specific server-side settings, random seeds) that might affect reproducibility?
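To illustrate why I'm asking about seeds: if episode initialization (object poses, spawn order, etc.) is driven by an unpinned RNG, different runs can diverge even with identical configs. A minimal stdlib-only sketch of the behavior I mean (`seeded_rollout` is a hypothetical stand-in, not project code):

```python
import random

def seeded_rollout(seed: int, n: int = 3) -> list[float]:
    # Per-run RNG instance, independent of global random state.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

print(seeded_rollout(42) == seeded_rollout(42))  # True: same seed, same episodes
print(seeded_rollout(42) == seeded_rollout(43))  # False: seed changes the rollout
```

If the reported numbers were produced with a pinned seed (or averaged over several), knowing that would help me match the setup.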
## Environment
- Python 3.10, conda environment
- MuJoCo with EGL rendering
- OpenPI server on the same machine
Any guidance on reproducing the reported numbers would be appreciated.