
Cannot reproduce SmolVLA results on LIBERO benchmark #2354

@Hesh0629

Description

Hello,

I am trying to reproduce the LIBERO benchmark results of SmolVLA.
However, I can't reproduce the results reported on the leaderboard or in the paper.

I am working on an NVIDIA Jetson AGX Orin Developer Kit (JetPack 6.2.1, Jetson Linux 36.4.4),
and below is my pip list.

pip list
absl-py==2.3.1
accelerate==1.10.1
aiohappyeyeballs==2.6.1
aiohttp==3.13.0
aiosignal==1.4.0
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.9.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==3.0.0
async-lru==2.0.5
attrs==23.2.0
av==15.1.0
babel==2.17.0
bddl==1.0.1
beautifulsoup4==4.13.4
bleach==6.2.0
blinker==1.7.0
certifi==2025.1.31
cffi==1.17.1
charset-normalizer==3.4.1
click==8.3.0
cloudpickle==3.1.1
cmake==3.31.6
comm==0.2.2
contourpy==1.3.2
cryptography==41.0.7
cuda-bindings==12.8.0
cuda-python==12.8.0
cycler==0.12.1
Cython==3.0.12
dataclasses==0.6
datasets==4.1.1
dbus-python==1.3.2
debugpy==1.8.14
decorator==5.2.1
deepdiff==8.6.1
defusedxml==0.7.1
diffusers @ file:///opt/diffusers-0.34.0.dev0-py3-none-any.whl#sha256=cf07a8004c994f02e0d41e9bface90486f53a98cd3abdda39972c5ffe7009d87
dill==0.4.0
distro==1.9.0
docopt==0.6.2
docutils==0.21.2
draccus==0.10.0
easydict==1.13
egl_probe @ git+https://github.com/huggingface/egl_probe.git@eb5e5f882236a5668e43a0e78121aaa10cdf2243
einops==0.8.1
etils==1.13.0
evdev==1.9.2
executing==2.2.0
Farama-Notifications==0.0.4
fastjsonschema==2.21.1
filelock==3.18.0
fonttools==4.57.0
fqdn==1.5.1
frozenlist==1.8.0
fsspec==2025.3.2
future==1.0.0
gitdb==4.0.12
GitPython==3.1.45
glfw==2.10.0
grpcio==1.75.1
gym==0.26.2
gym-notices==0.1.0
gymnasium==0.29.1
h11==0.14.0
h5py==3.13.0
hf-xet==1.1.10
hf_transfer==0.1.9
httpcore==1.0.8
httplib2==0.20.4
httpx==0.28.1
huggingface-hub==0.35.3
hydra-core==1.3.2
id==1.5.0
idna==3.10
imageio==2.37.0
imageio-ffmpeg==0.6.0
importlib_metadata==8.6.1
importlib_resources==6.5.2
iniconfig==2.1.0
inquirerpy==0.3.4
ipykernel==6.29.5
ipython==9.1.0
ipython_pygments_lexers==1.1.1
ipywidgets==8.1.6
isoduration==20.11.0
jaraco.classes==3.4.0
jaraco.context==6.0.1
jaraco.functools==4.1.0
jedi==0.19.2
jeepney==0.9.0
Jinja2==3.1.6
json5==0.12.0
jsonlines==4.0.0
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2025.4.1
jupyter==1.1.1
jupyter-console==6.6.3
jupyter-events==0.12.0
jupyter-lsp==2.2.5
jupyter_client==8.6.3
jupyter_core==5.7.2
jupyter_server==2.15.0
jupyter_server_terminals==0.5.3
jupyterlab==4.4.1
jupyterlab_myst==2.4.2
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
jupyterlab_widgets==3.0.14
jupytext==1.17.3
keyring==25.6.0
kiwisolver==1.4.8
launchpadlib==1.11.0
lazr.restfulclient==0.14.6
lazr.uri==1.0.6
-e git+https://github.com/huggingface/lerobot@6f5bb4d4a49fbdb47acfeaa2c190b5fa125f645a#egg=lerobot
libero @ git+https://github.com/huggingface/lerobot-libero.git@b053a4b0de70a3f2d736abe0f9a9ee64477365df
llvmlite==0.45.1
Mako==1.3.10
Markdown==3.9
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib==3.10.1
matplotlib-inline==0.1.7
mdit-py-plugins==0.5.0
mdurl==0.1.2
mergedeep==1.3.4
mistune==3.1.3
more-itertools==10.7.0
mpmath==1.3.0
mujoco==3.3.2
multidict==6.7.0
multiprocess==0.70.16
mypy_extensions==1.1.0
nbclient==0.10.2
nbconvert==7.16.6
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.4.2
nh3==0.2.21
ninja==1.11.1.4
notebook==7.4.1
notebook_shim==0.2.4
num2words==0.5.14
numba==0.62.1
numpy==2.2.5
oauthlib==3.2.2
omegaconf==2.3.0
onnx==1.17.0
opencv-contrib-python==4.11.0.86
opencv-python==4.11.0
opencv-python-headless==4.12.0.88
optimum==1.24.0
orderly-set==5.5.0
overrides==7.7.0
packaging==25.0
pandas==2.3.3
pandocfilters==1.5.1
parso==0.8.4
pexpect==4.9.0
pfzy==0.3.4
pillow==11.2.1
pkginfo==1.12.1.2
platformdirs==4.3.7
pluggy==1.6.0
prometheus_client==0.21.1
prompt_toolkit==3.0.51
propcache==0.4.1
protobuf==6.30.2
psutil==7.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pyarrow==21.0.0
pyav==14.2.1
pycparser==2.22
pycuda==2025.1
pydantic==2.12.1
pydantic_core==2.41.3
Pygments==2.19.1
PyGObject==3.48.2
PyJWT==2.7.0
pynput==1.8.1
PyOpenGL==3.1.10
PyOpenGL-accelerate==3.1.10
pyparsing==3.1.1
pyrsistent==0.20.0
pyserial==3.5
pytest==8.4.2
python-apt==2.7.7+ubuntu4
python-dateutil==2.9.0.post0
python-json-logger==3.3.0
python-xlib==0.33
pytools==2025.1.2
pytz==2025.2
PyYAML==6.0.2
pyyaml-include==1.4.1
pyzmq==26.4.0
readme_renderer==44.0
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
requests-toolbelt==1.0.0
rerun-sdk==0.22.1
rfc3339-validator==0.1.4
rfc3986==2.0.0
rfc3986-validator==0.1.1
rich==14.0.0
robomimic==0.2.0
robosuite==1.4.0
rpds-py==0.24.0
safetensors==0.5.3
scikit-build==0.18.1
scipy==1.16.2
SecretStorage==3.3.3
semantic-version==2.10.0
Send2Trash==1.8.3
sentencepiece==0.2.0
sentry-sdk==2.41.0
setuptools==79.0.1
setuptools-rust==1.11.1
six==1.16.0
smmap==5.0.2
sniffio==1.3.1
soupsieve==2.7
ssh-import-id==5.11
stack-data==0.6.3
sympy==1.13.3
tensorboard==2.20.0
tensorboard-data-server==0.7.2
tensorboardX==2.6.4
tensorrt @ file:///usr/src/tensorrt/python/tensorrt-10.7.0-cp312-none-linux_aarch64.whl#sha256=60e975db13ccc26b269bb63c57584435e11c9c0b91cbd4d7d91c12f4de8baddc
termcolor==3.1.0
terminado==0.18.1
thop==0.1.1.post2209072238
tinycss2==1.4.0
tokenizers==0.22.1
toml==0.10.2
torch==2.7.0
torchvision==0.22.0
tornado==6.4.2
tqdm==4.67.1
traitlets==5.14.3
transformers==4.57.0
twine==6.1.0
types-python-dateutil==2.9.0.20241206
typing-inspect==0.9.0
typing-inspection==0.4.2
typing_extensions==4.15.0
tzdata==2025.2
uri-template==1.3.0
urllib3==2.4.0
uv==0.6.16
wadllib==1.3.6
wandb==0.22.2
wcwidth==0.2.13
webcolors==24.11.1
webencodings==0.5.1
websocket-client==1.8.0
Werkzeug==3.1.3
wheel==0.45.1
widgetsnbextension==4.0.14
xxhash==3.6.0
yarl==1.22.0
zipp==3.21.0

My benchmark results are as follows:

|                 | Spatial | Object | Goal | Long |
|-----------------|---------|--------|------|------|
| Leaderboard     | 0.9     | 1.0    | 1.0  | 0.6  |
| Paper           | 0.90    | 0.96   | 0.92 | 0.71 |
| My reproduction | 0.73    | 0.91   | 0.83 | 0.43 |
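
For context (my addition, not part of the original report), a quick bit of arithmetic on the table above makes the gap explicit: the per-suite drop and the mean success rate over the four suites.

```python
# Quick arithmetic on the table above ("Long" = libero_10).
paper = {"Spatial": 0.90, "Object": 0.96, "Goal": 0.92, "Long": 0.71}
mine = {"Spatial": 0.73, "Object": 0.91, "Goal": 0.83, "Long": 0.43}

for suite in paper:
    print(f"{suite:<8} gap vs. paper: {mine[suite] - paper[suite]:+.2f}")

# Mean over the four suites (roughly 0.87 reported vs. 0.72-0.73 reproduced).
print(f"paper mean: {sum(paper.values()) / len(paper):.2f}")
print(f"my mean:    {sum(mine.values()) / len(mine):.2f}")
```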

Below is the command I used for evaluation
(following the original paper's setting of n_action_steps=1):

lerobot-eval \
  --policy.path=HuggingFaceVLA/smolvla_libero \
  --policy.num_steps=10 \
  --policy.n_action_steps=1 \
  --env.type=libero \
  --env.task=libero_spatial,libero_object,libero_goal,libero_10 \
  --eval.n_episodes=10 \
  --eval.batch_size=1

I also found that when I evaluate the model solely on the LIBERO Long (libero_10) suite,
its success rate improves (0.43 → 0.51), but it still remains below the reported baseline.
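
To dig into this, one option (my suggestion, not part of the original report) is to launch a separate lerobot-eval process per suite and compare against the combined run; the sketch below reuses only the flags from the command above.

```python
# Minimal sketch: evaluate each LIBERO suite in its own lerobot-eval process
# instead of passing a comma-separated list to --env.task.
# Assumes lerobot-eval is on PATH; flags are copied from the command above.
import subprocess

SUITES = ["libero_spatial", "libero_object", "libero_goal", "libero_10"]

for suite in SUITES:
    cmd = [
        "lerobot-eval",
        "--policy.path=HuggingFaceVLA/smolvla_libero",
        "--policy.num_steps=10",
        "--policy.n_action_steps=1",
        "--env.type=libero",
        f"--env.task={suite}",
        "--eval.n_episodes=10",
        "--eval.batch_size=1",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop early if any run fails
```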

Am I evaluating the model properly? Where might I be making a mistake?


Labels: policies, question, simulation
