
Cannot reproduce SmolVLA results on LIBERO benchmark #2354

@Hesh0629

Description

Hello,

I am trying to reproduce the LIBERO benchmark results of SmolVLA.
However, I can't reproduce the results reported on the leaderboard or in the paper.

I am working on an NVIDIA Jetson AGX Orin Developer Kit (JetPack 6.2.1, Jetson Linux 36.4.4),
and below is my pip list.

pip list
absl-py==2.3.1
accelerate==1.10.1
aiohappyeyeballs==2.6.1
aiohttp==3.13.0
aiosignal==1.4.0
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.9.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==3.0.0
async-lru==2.0.5
attrs==23.2.0
av==15.1.0
babel==2.17.0
bddl==1.0.1
beautifulsoup4==4.13.4
bleach==6.2.0
blinker==1.7.0
certifi==2025.1.31
cffi==1.17.1
charset-normalizer==3.4.1
click==8.3.0
cloudpickle==3.1.1
cmake==3.31.6
comm==0.2.2
contourpy==1.3.2
cryptography==41.0.7
cuda-bindings==12.8.0
cuda-python==12.8.0
cycler==0.12.1
Cython==3.0.12
dataclasses==0.6
datasets==4.1.1
dbus-python==1.3.2
debugpy==1.8.14
decorator==5.2.1
deepdiff==8.6.1
defusedxml==0.7.1
diffusers @ file:///opt/diffusers-0.34.0.dev0-py3-none-any.whl#sha256=cf07a8004c994f02e0d41e9bface90486f53a98cd3abdda39972c5ffe7009d87
dill==0.4.0
distro==1.9.0
docopt==0.6.2
docutils==0.21.2
draccus==0.10.0
easydict==1.13
egl_probe @ git+https://github.com/huggingface/egl_probe.git@eb5e5f882236a5668e43a0e78121aaa10cdf2243
einops==0.8.1
etils==1.13.0
evdev==1.9.2
executing==2.2.0
Farama-Notifications==0.0.4
fastjsonschema==2.21.1
filelock==3.18.0
fonttools==4.57.0
fqdn==1.5.1
frozenlist==1.8.0
fsspec==2025.3.2
future==1.0.0
gitdb==4.0.12
GitPython==3.1.45
glfw==2.10.0
grpcio==1.75.1
gym==0.26.2
gym-notices==0.1.0
gymnasium==0.29.1
h11==0.14.0
h5py==3.13.0
hf-xet==1.1.10
hf_transfer==0.1.9
httpcore==1.0.8
httplib2==0.20.4
httpx==0.28.1
huggingface-hub==0.35.3
hydra-core==1.3.2
id==1.5.0
idna==3.10
imageio==2.37.0
imageio-ffmpeg==0.6.0
importlib_metadata==8.6.1
importlib_resources==6.5.2
iniconfig==2.1.0
inquirerpy==0.3.4
ipykernel==6.29.5
ipython==9.1.0
ipython_pygments_lexers==1.1.1
ipywidgets==8.1.6
isoduration==20.11.0
jaraco.classes==3.4.0
jaraco.context==6.0.1
jaraco.functools==4.1.0
jedi==0.19.2
jeepney==0.9.0
Jinja2==3.1.6
json5==0.12.0
jsonlines==4.0.0
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2025.4.1
jupyter==1.1.1
jupyter-console==6.6.3
jupyter-events==0.12.0
jupyter-lsp==2.2.5
jupyter_client==8.6.3
jupyter_core==5.7.2
jupyter_server==2.15.0
jupyter_server_terminals==0.5.3
jupyterlab==4.4.1
jupyterlab_myst==2.4.2
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
jupyterlab_widgets==3.0.14
jupytext==1.17.3
keyring==25.6.0
kiwisolver==1.4.8
launchpadlib==1.11.0
lazr.restfulclient==0.14.6
lazr.uri==1.0.6
-e git+https://github.com/huggingface/lerobot@6f5bb4d4a49fbdb47acfeaa2c190b5fa125f645a#egg=lerobot
libero @ git+https://github.com/huggingface/lerobot-libero.git@b053a4b0de70a3f2d736abe0f9a9ee64477365df
llvmlite==0.45.1
Mako==1.3.10
Markdown==3.9
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib==3.10.1
matplotlib-inline==0.1.7
mdit-py-plugins==0.5.0
mdurl==0.1.2
mergedeep==1.3.4
mistune==3.1.3
more-itertools==10.7.0
mpmath==1.3.0
mujoco==3.3.2
multidict==6.7.0
multiprocess==0.70.16
mypy_extensions==1.1.0
nbclient==0.10.2
nbconvert==7.16.6
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.4.2
nh3==0.2.21
ninja==1.11.1.4
notebook==7.4.1
notebook_shim==0.2.4
num2words==0.5.14
numba==0.62.1
numpy==2.2.5
oauthlib==3.2.2
omegaconf==2.3.0
onnx==1.17.0
opencv-contrib-python==4.11.0.86
opencv-python==4.11.0
opencv-python-headless==4.12.0.88
optimum==1.24.0
orderly-set==5.5.0
overrides==7.7.0
packaging==25.0
pandas==2.3.3
pandocfilters==1.5.1
parso==0.8.4
pexpect==4.9.0
pfzy==0.3.4
pillow==11.2.1
pkginfo==1.12.1.2
platformdirs==4.3.7
pluggy==1.6.0
prometheus_client==0.21.1
prompt_toolkit==3.0.51
propcache==0.4.1
protobuf==6.30.2
psutil==7.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pyarrow==21.0.0
pyav==14.2.1
pycparser==2.22
pycuda==2025.1
pydantic==2.12.1
pydantic_core==2.41.3
Pygments==2.19.1
PyGObject==3.48.2
PyJWT==2.7.0
pynput==1.8.1
PyOpenGL==3.1.10
PyOpenGL-accelerate==3.1.10
pyparsing==3.1.1
pyrsistent==0.20.0
pyserial==3.5
pytest==8.4.2
python-apt==2.7.7+ubuntu4
python-dateutil==2.9.0.post0
python-json-logger==3.3.0
python-xlib==0.33
pytools==2025.1.2
pytz==2025.2
PyYAML==6.0.2
pyyaml-include==1.4.1
pyzmq==26.4.0
readme_renderer==44.0
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
requests-toolbelt==1.0.0
rerun-sdk==0.22.1
rfc3339-validator==0.1.4
rfc3986==2.0.0
rfc3986-validator==0.1.1
rich==14.0.0
robomimic==0.2.0
robosuite==1.4.0
rpds-py==0.24.0
safetensors==0.5.3
scikit-build==0.18.1
scipy==1.16.2
SecretStorage==3.3.3
semantic-version==2.10.0
Send2Trash==1.8.3
sentencepiece==0.2.0
sentry-sdk==2.41.0
setuptools==79.0.1
setuptools-rust==1.11.1
six==1.16.0
smmap==5.0.2
sniffio==1.3.1
soupsieve==2.7
ssh-import-id==5.11
stack-data==0.6.3
sympy==1.13.3
tensorboard==2.20.0
tensorboard-data-server==0.7.2
tensorboardX==2.6.4
tensorrt @ file:///usr/src/tensorrt/python/tensorrt-10.7.0-cp312-none-linux_aarch64.whl#sha256=60e975db13ccc26b269bb63c57584435e11c9c0b91cbd4d7d91c12f4de8baddc
termcolor==3.1.0
terminado==0.18.1
thop==0.1.1.post2209072238
tinycss2==1.4.0
tokenizers==0.22.1
toml==0.10.2
torch==2.7.0
torchvision==0.22.0
tornado==6.4.2
tqdm==4.67.1
traitlets==5.14.3
transformers==4.57.0
twine==6.1.0
types-python-dateutil==2.9.0.20241206
typing-inspect==0.9.0
typing-inspection==0.4.2
typing_extensions==4.15.0
tzdata==2025.2
uri-template==1.3.0
urllib3==2.4.0
uv==0.6.16
wadllib==1.3.6
wandb==0.22.2
wcwidth==0.2.13
webcolors==24.11.1
webencodings==0.5.1
websocket-client==1.8.0
Werkzeug==3.1.3
wheel==0.45.1
widgetsnbextension==4.0.14
xxhash==3.6.0
yarl==1.22.0
zipp==3.21.0

My benchmark results are as follows:

|                 | Spatial | Object | Goal | Long |
|-----------------|---------|--------|------|------|
| Leaderboard     | 0.9     | 1.0    | 1.0  | 0.6  |
| Paper           | 0.90    | 0.96   | 0.92 | 0.71 |
| My reproduction | 0.73    | 0.91   | 0.83 | 0.43 |
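
For context (my addition, not part of the original report), a quick bit of arithmetic on the table above makes the gap explicit: the per-suite drop and the mean success rate over the four suites.

```python
# Quick arithmetic on the table above ("Long" = libero_10).
paper = {"Spatial": 0.90, "Object": 0.96, "Goal": 0.92, "Long": 0.71}
mine = {"Spatial": 0.73, "Object": 0.91, "Goal": 0.83, "Long": 0.43}

for suite in paper:
    print(f"{suite:<8} gap vs. paper: {mine[suite] - paper[suite]:+.2f}")

# Mean over the four suites (roughly 0.87 reported vs. 0.72-0.73 reproduced).
print(f"paper mean: {sum(paper.values()) / len(paper):.2f}")
print(f"my mean:    {sum(mine.values()) / len(mine):.2f}")
```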

Below is the command I used for evaluation
(following the original paper's setting of n_action_steps=1):

lerobot-eval \
  --policy.path=HuggingFaceVLA/smolvla_libero \
  --policy.num_steps=10 \
  --policy.n_action_steps=1 \
  --env.type=libero \
  --env.task=libero_spatial,libero_object,libero_goal,libero_10 \
  --eval.n_episodes=10 \
  --eval.batch_size=1

I also found that when I evaluate the model solely on the LIBERO Long (libero_10) suite,
its success rate improves (0.43 → 0.51), but it still remains below the reported baseline.
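
To dig into this, one option (my suggestion, not part of the original report) is to launch a separate lerobot-eval process per suite and compare against the combined run; the sketch below reuses only the flags from the command above.

```python
# Minimal sketch: evaluate each LIBERO suite in its own lerobot-eval process
# instead of passing a comma-separated list to --env.task.
# Assumes lerobot-eval is on PATH; flags are copied from the command above.
import subprocess

SUITES = ["libero_spatial", "libero_object", "libero_goal", "libero_10"]

for suite in SUITES:
    cmd = [
        "lerobot-eval",
        "--policy.path=HuggingFaceVLA/smolvla_libero",
        "--policy.num_steps=10",
        "--policy.n_action_steps=1",
        "--env.type=libero",
        f"--env.task={suite}",
        "--eval.n_episodes=10",
        "--eval.batch_size=1",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop early if any run fails
```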

Am I evaluating the model properly? Where might I be making a mistake?


Labels: policies, question, simulation
