Description
Hello,
I am trying to reproduce the LIBERO benchmark results of SmolVLA.
However, I cannot match the numbers reported on the leaderboard or in the paper.
I am working on an NVIDIA Jetson AGX Orin Developer Kit (JetPack 6.2.1, Jetson Linux 36.4.4),
and below is my pip list.
pip list
absl-py==2.3.1
accelerate==1.10.1
aiohappyeyeballs==2.6.1
aiohttp==3.13.0
aiosignal==1.4.0
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.9.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==3.0.0
async-lru==2.0.5
attrs==23.2.0
av==15.1.0
babel==2.17.0
bddl==1.0.1
beautifulsoup4==4.13.4
bleach==6.2.0
blinker==1.7.0
certifi==2025.1.31
cffi==1.17.1
charset-normalizer==3.4.1
click==8.3.0
cloudpickle==3.1.1
cmake==3.31.6
comm==0.2.2
contourpy==1.3.2
cryptography==41.0.7
cuda-bindings==12.8.0
cuda-python==12.8.0
cycler==0.12.1
Cython==3.0.12
dataclasses==0.6
datasets==4.1.1
dbus-python==1.3.2
debugpy==1.8.14
decorator==5.2.1
deepdiff==8.6.1
defusedxml==0.7.1
diffusers @ file:///opt/diffusers-0.34.0.dev0-py3-none-any.whl#sha256=cf07a8004c994f02e0d41e9bface90486f53a98cd3abdda39972c5ffe7009d87
dill==0.4.0
distro==1.9.0
docopt==0.6.2
docutils==0.21.2
draccus==0.10.0
easydict==1.13
egl_probe @ git+https://github.com/huggingface/egl_probe.git@eb5e5f882236a5668e43a0e78121aaa10cdf2243
einops==0.8.1
etils==1.13.0
evdev==1.9.2
executing==2.2.0
Farama-Notifications==0.0.4
fastjsonschema==2.21.1
filelock==3.18.0
fonttools==4.57.0
fqdn==1.5.1
frozenlist==1.8.0
fsspec==2025.3.2
future==1.0.0
gitdb==4.0.12
GitPython==3.1.45
glfw==2.10.0
grpcio==1.75.1
gym==0.26.2
gym-notices==0.1.0
gymnasium==0.29.1
h11==0.14.0
h5py==3.13.0
hf-xet==1.1.10
hf_transfer==0.1.9
httpcore==1.0.8
httplib2==0.20.4
httpx==0.28.1
huggingface-hub==0.35.3
hydra-core==1.3.2
id==1.5.0
idna==3.10
imageio==2.37.0
imageio-ffmpeg==0.6.0
importlib_metadata==8.6.1
importlib_resources==6.5.2
iniconfig==2.1.0
inquirerpy==0.3.4
ipykernel==6.29.5
ipython==9.1.0
ipython_pygments_lexers==1.1.1
ipywidgets==8.1.6
isoduration==20.11.0
jaraco.classes==3.4.0
jaraco.context==6.0.1
jaraco.functools==4.1.0
jedi==0.19.2
jeepney==0.9.0
Jinja2==3.1.6
json5==0.12.0
jsonlines==4.0.0
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2025.4.1
jupyter==1.1.1
jupyter-console==6.6.3
jupyter-events==0.12.0
jupyter-lsp==2.2.5
jupyter_client==8.6.3
jupyter_core==5.7.2
jupyter_server==2.15.0
jupyter_server_terminals==0.5.3
jupyterlab==4.4.1
jupyterlab_myst==2.4.2
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
jupyterlab_widgets==3.0.14
jupytext==1.17.3
keyring==25.6.0
kiwisolver==1.4.8
launchpadlib==1.11.0
lazr.restfulclient==0.14.6
lazr.uri==1.0.6
-e git+https://github.com/huggingface/lerobot@6f5bb4d4a49fbdb47acfeaa2c190b5fa125f645a#egg=lerobot
libero @ git+https://github.com/huggingface/lerobot-libero.git@b053a4b0de70a3f2d736abe0f9a9ee64477365df
llvmlite==0.45.1
Mako==1.3.10
Markdown==3.9
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib==3.10.1
matplotlib-inline==0.1.7
mdit-py-plugins==0.5.0
mdurl==0.1.2
mergedeep==1.3.4
mistune==3.1.3
more-itertools==10.7.0
mpmath==1.3.0
mujoco==3.3.2
multidict==6.7.0
multiprocess==0.70.16
mypy_extensions==1.1.0
nbclient==0.10.2
nbconvert==7.16.6
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.4.2
nh3==0.2.21
ninja==1.11.1.4
notebook==7.4.1
notebook_shim==0.2.4
num2words==0.5.14
numba==0.62.1
numpy==2.2.5
oauthlib==3.2.2
omegaconf==2.3.0
onnx==1.17.0
opencv-contrib-python==4.11.0.86
opencv-python==4.11.0
opencv-python-headless==4.12.0.88
optimum==1.24.0
orderly-set==5.5.0
overrides==7.7.0
packaging==25.0
pandas==2.3.3
pandocfilters==1.5.1
parso==0.8.4
pexpect==4.9.0
pfzy==0.3.4
pillow==11.2.1
pkginfo==1.12.1.2
platformdirs==4.3.7
pluggy==1.6.0
prometheus_client==0.21.1
prompt_toolkit==3.0.51
propcache==0.4.1
protobuf==6.30.2
psutil==7.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pyarrow==21.0.0
pyav==14.2.1
pycparser==2.22
pycuda==2025.1
pydantic==2.12.1
pydantic_core==2.41.3
Pygments==2.19.1
PyGObject==3.48.2
PyJWT==2.7.0
pynput==1.8.1
PyOpenGL==3.1.10
PyOpenGL-accelerate==3.1.10
pyparsing==3.1.1
pyrsistent==0.20.0
pyserial==3.5
pytest==8.4.2
python-apt==2.7.7+ubuntu4
python-dateutil==2.9.0.post0
python-json-logger==3.3.0
python-xlib==0.33
pytools==2025.1.2
pytz==2025.2
PyYAML==6.0.2
pyyaml-include==1.4.1
pyzmq==26.4.0
readme_renderer==44.0
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
requests-toolbelt==1.0.0
rerun-sdk==0.22.1
rfc3339-validator==0.1.4
rfc3986==2.0.0
rfc3986-validator==0.1.1
rich==14.0.0
robomimic==0.2.0
robosuite==1.4.0
rpds-py==0.24.0
safetensors==0.5.3
scikit-build==0.18.1
scipy==1.16.2
SecretStorage==3.3.3
semantic-version==2.10.0
Send2Trash==1.8.3
sentencepiece==0.2.0
sentry-sdk==2.41.0
setuptools==79.0.1
setuptools-rust==1.11.1
six==1.16.0
smmap==5.0.2
sniffio==1.3.1
soupsieve==2.7
ssh-import-id==5.11
stack-data==0.6.3
sympy==1.13.3
tensorboard==2.20.0
tensorboard-data-server==0.7.2
tensorboardX==2.6.4
tensorrt @ file:///usr/src/tensorrt/python/tensorrt-10.7.0-cp312-none-linux_aarch64.whl#sha256=60e975db13ccc26b269bb63c57584435e11c9c0b91cbd4d7d91c12f4de8baddc
termcolor==3.1.0
terminado==0.18.1
thop==0.1.1.post2209072238
tinycss2==1.4.0
tokenizers==0.22.1
toml==0.10.2
torch==2.7.0
torchvision==0.22.0
tornado==6.4.2
tqdm==4.67.1
traitlets==5.14.3
transformers==4.57.0
twine==6.1.0
types-python-dateutil==2.9.0.20241206
typing-inspect==0.9.0
typing-inspection==0.4.2
typing_extensions==4.15.0
tzdata==2025.2
uri-template==1.3.0
urllib3==2.4.0
uv==0.6.16
wadllib==1.3.6
wandb==0.22.2
wcwidth==0.2.13
webcolors==24.11.1
webencodings==0.5.1
websocket-client==1.8.0
Werkzeug==3.1.3
wheel==0.45.1
widgetsnbextension==4.0.14
xxhash==3.6.0
yarl==1.22.0
zipp==3.21.0
My benchmark results are as follows:
| | Spatial | Object | Goal | Long |
|---|---|---|---|---|
| Leaderboard | 0.9 | 1.0 | 1.0 | 0.6 | 
| Paper | 0.90 | 0.96 | 0.92 | 0.71 | 
| My reproduction | 0.73 | 0.91 | 0.83 | 0.43 | 
Below is the command I used for evaluation
(I followed the original paper's setting of n_action_steps=1):
lerobot-eval --policy.path=HuggingFaceVLA/smolvla_libero --policy.num_steps=10 --policy.n_action_steps=1 --env.type=libero --env.task=libero_spatial,libero_object,libero_goal,libero_10 --eval.n_episodes=10 --eval.batch_size=1
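
Since --eval.n_episodes=10 only gives coarse per-suite estimates, here is a quick sanity check on how much of the gap could be sampling noise. This is a minimal sketch: it assumes n_episodes counts rollouts per task, i.e. about 100 rollouts over the 10 tasks of libero_10 — I have not verified that in the lerobot-eval code.

```python
# Quick sanity check on the noise from a small number of rollouts.
# Assumption (not verified): --eval.n_episodes=10 means 10 rollouts per
# task, i.e. ~100 rollouts total for a 10-task suite such as libero_10.
from scipy.stats import binomtest

successes, rollouts = 43, 100  # my LIBERO Long result: 0.43
ci = binomtest(k=successes, n=rollouts).proportion_ci(
    confidence_level=0.95, method="wilson"
)
print(f"success rate = {successes / rollouts:.2f}, "
      f"95% CI = [{ci.low:.2f}, {ci.high:.2f}]")
```

This prints a 95% interval of roughly [0.34, 0.53] for my 0.43 on Long, so under that assumption the gap to the paper's 0.71 does not look like evaluation noise alone.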
I also found that when I evaluate the model on the LIBERO Long (libero_10) suite by itself,
its performance improves (0.43 → 0.51), but it still remains below the reported baseline.
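
In case it helps with debugging, this is how I inspected the inference defaults baked into the checkpoint itself. A minimal sketch: it assumes the Hub repo ships a config.json, as lerobot pretrained policies typically do, and the key names below are guesses based on the CLI flags above.

```python
# Sketch: fetch and print the policy config stored with the checkpoint.
# Assumptions: the repo contains a config.json (typical for lerobot
# pretrained policies); the key names mirror the CLI flags and may differ.
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="HuggingFaceVLA/smolvla_libero", filename="config.json")
with open(path) as f:
    cfg = json.load(f)

for key in ("num_steps", "n_action_steps", "chunk_size"):
    print(key, "=", cfg.get(key))  # cfg.get avoids KeyError if a key is absent
```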
Am I evaluating the model properly? I wonder where I might be making a mistake.