Hi LeRobot team,
First of all, thank you for the amazing work on SmolVLA and the LeRobot library!
I am currently trying to reproduce the SmolVLA results on the LIBERO benchmark, but I am observing a significant gap between my evaluation results and the performance reported in the paper.
According to the paper, the 0.45B model achieves success rates of approximately 90%, 96%, 92%, 71% (Spatial, Object, Goal, Long), averaging 87.3%. However, my current fine-tuning setup yields lower results:
My Evaluation Results:
- libero_10 (Long): 40.6%
- libero_goal: 76.0%
- libero_object: 71.0%
- libero_spatial: 68.0%
My Training Setup:
- Base model: fine-tuned from `smolvla_base` (ensured `load_vlm_weights=True`)
- Policy type: SmolVLA
- Batch size: 32
- Steps: 80,000
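For reference, this is roughly the command I am using — a sketch assuming the standard `lerobot/scripts/train.py` entrypoint; the dataset repo_id is a placeholder, not the exact one I used:

```shell
python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=<my-libero-dataset> \
  --batch_size=32 \
  --steps=80000
```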
I suspect the discrepancy might be primarily due to the effective batch size. Since I am training with a batch size of 32, I was wondering if the paper's results were achieved using a much larger effective batch size via multi-GPU training.
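To illustrate what I mean by effective batch size, here is a minimal sketch (illustrative names only, not LeRobot API) of how gradient accumulation on a single GPU would approximate a larger global batch:

```python
# Sketch: emulating a larger effective batch size via gradient accumulation.
# All function names here are hypothetical, for illustration only.

def effective_batch_size(per_gpu_batch, accum_steps, num_gpus):
    """Global batch size seen by the optimizer per update."""
    return per_gpu_batch * accum_steps * num_gpus

def accumulated_step(micro_batches, grad_fn, apply_update):
    """Average gradients over several micro-batches, then take one
    optimizer step -- approximating a single large-batch update."""
    grads = [grad_fn(b) for b in micro_batches]
    avg = [sum(parts) / len(grads) for parts in zip(*grads)]
    apply_update(avg)
    return avg

# e.g. per-GPU batch 32 with 8 accumulation steps on 1 GPU
# behaves like a global batch of 256:
print(effective_batch_size(32, 8, 1))  # → 256
```

So if the paper used, say, 8 GPUs at batch 32 each, my single-GPU batch-32 runs would see a global batch 8x smaller, which could plausibly explain part of the gap.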
Could you please share the exact training configurations used for the LIBERO evaluations in the paper? Specifically, I would greatly appreciate details on:
- Effective / Global Batch Size (and the number of GPUs used)
- Learning Rate (and LR scheduler if applicable)
- The specific commit hash used for the paper's experiments (I would like to restore that exact version to rule out any codebase changes).
- Any other specific hyperparameter tweaks used for the LIBERO benchmark.
Thank you in advance for your time and guidance!