Hi! Thanks a lot for open-sourcing this awesome work.
I’ve been testing OmniVLA on a custom wheeled robot in outdoor environments, and I’m seeing mixed performance, as shown in the videos below (video speed is 3x):
Acceptable performance: vokoscreenNG-2026-01-15_10-54-24.mp4
Bad performance: vokoscreenNG-2026-01-15_10-53-56.mp4
The green line is generated from OmniVLA’s output waypoints, scaled by a metric resolution of 0.2 m. I tried both the satellite and prompt modalities, and the behavior is very similar.
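For reference, the scaling step for the green line can be sketched roughly as below. This is a minimal sketch, not the code from the wrapper: it assumes the model output has already been reduced to an array of unitless (x, y) waypoints in the robot frame, and the function name `waypoints_to_metric` is hypothetical. Only the 0.2 m resolution comes from the description above.

```python
import numpy as np

# Metric resolution quoted in the post: one waypoint unit = 0.2 m.
METRIC_RESOLUTION_M = 0.2


def waypoints_to_metric(waypoints, resolution_m=METRIC_RESOLUTION_M):
    """Scale unitless (x, y) waypoints to meters.

    Assumption: `waypoints` is an (N, 2) array-like in the robot frame.
    The real model output may carry extra channels that need slicing first.
    """
    wp = np.asarray(waypoints, dtype=float)
    return wp * resolution_m


# Hypothetical waypoints, for illustration only.
normalized = [[1.0, 0.0], [2.0, 0.5], [3.0, 1.0]]
metric = waypoints_to_metric(normalized)
```

If the plotted path looks too short or too long compared to the real scene, this scale factor is the first thing worth double-checking.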
I’m running inference on a desktop RTX 5090. The forward pass takes ~100 ms, but I’m only receiving images at 4 Hz. I also tried all released checkpoints (omnivla-original, omnivla-original-balance, omnivla-finetuned-cast) with similar results.
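To put the timing numbers in context, here is a rough back-of-the-envelope calculation. It is a sketch under simple assumptions: transport and actuation delays are ignored, each command is held until the next one arrives, and the 0.3 m/s value is the linear speed cap used in this setup. The variable names are illustrative.

```python
forward_pass_s = 0.100   # ~100 ms forward pass
image_rate_hz = 4.0      # camera frames arrive at 4 Hz
max_linear_mps = 0.3     # linear speed cap used in this setup

image_period_s = 1.0 / image_rate_hz          # time between frames
max_inference_rate_hz = 1.0 / forward_pass_s  # rate the GPU could sustain

# The slower of the two stages bounds the rate of fresh commands.
control_rate_hz = min(image_rate_hz, max_inference_rate_hz)

# Rough upper bound on how old the image behind a command can be:
# inference time plus the hold time until the next command.
worst_case_age_s = forward_pass_s + image_period_s
distance_on_stale_image_m = max_linear_mps * worst_case_age_s

print(f"control rate: {control_rate_hz:.1f} Hz")
print(f"worst-case image age: {worst_case_age_s:.2f} s")
print(f"distance travelled meanwhile: {distance_on_stale_image_m:.3f} m")
```

Under these assumptions the camera rate, not the forward pass, limits the loop, and the robot moves about 0.1 m on each stale observation at full speed.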
For inference, I wrote a ROS wrapper around the official script:
https://github.com/NHirose/OmniVLA/blob/main/inference/run_omnivla.py
The wrapper fills in the robot state and produces a cmd_vel output, while keeping the core inference code unchanged. I limited the command speeds to:
linear velocity ≤ 0.3 m/s
angular velocity ≤ 0.75 rad/s
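The speed limiting can be illustrated with the sketch below. It is not the actual wrapper code: it assumes a simple proportional mapping from a single metric waypoint (x forward, y left, in the robot frame) to a velocity command, and the names `waypoint_to_cmd_vel` and `horizon_s` are hypothetical. Only the two limits come from the setup described above.

```python
import math

MAX_LINEAR_MPS = 0.3     # linear velocity limit from the post
MAX_ANGULAR_RADPS = 0.75  # angular velocity limit from the post


def clamp(value, limit):
    """Clamp value to the symmetric range [-limit, limit]."""
    return max(-limit, min(limit, value))


def waypoint_to_cmd_vel(x_m, y_m, horizon_s=1.0):
    """Map one metric waypoint to (linear, angular) velocities.

    Assumption: reach the waypoint's forward distance and heading
    within `horizon_s` seconds, then saturate at the limits.
    """
    heading_rad = math.atan2(y_m, x_m)
    linear = clamp(x_m / horizon_s, MAX_LINEAR_MPS)
    angular = clamp(heading_rad / horizon_s, MAX_ANGULAR_RADPS)
    return linear, angular


# Hypothetical waypoint 0.6 m ahead and 0.2 m to the left.
linear, angular = waypoint_to_cmd_vel(0.6, 0.2)
```

In a ROS node these two values would typically go into the `linear.x` and `angular.z` fields of a `geometry_msgs/Twist` message. Note that clamping the two channels independently changes the curvature of the commanded path when only one of them saturates, which may matter for tracking tight waypoints.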
Here is a sample video where the model fails to keep the robot on the sidewalk. The prompt is: "navigate on the center of the sidewalk".
camera_color_image_raw.mp4
I was wondering:
What is the expected zero-shot performance of OmniVLA on unseen robots?
How much data do you think would be needed to fine-tune the model so that it matches the performance you get on the training platforms?
What is the best format for storing the fine-tuning dataset?
Thanks!