Expected zero shot performance on custom robot #11

@pepisg

Hi! Thanks a lot for open-sourcing this awesome work.

I’ve been testing OmniVLA on a custom wheeled robot in outdoor environments, and I’m seeing mixed performance, as shown in the videos below (video speed is 3x):

vokoscreenNG-2026-01-15_10-54-24.mp4

acceptable performance

vokoscreenNG-2026-01-15_10-53-56.mp4

bad performance

The green line is generated from OmniVLA’s output waypoints, scaled by a metric resolution of 0.2 m. I tried both the satellite and prompt modalities, and the behavior is very similar.
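For readers following along, the scaling step described above could look roughly like the sketch below. It assumes the model emits unitless (x, y) waypoints in the robot frame as an (N, 2) array; the function name and the example values are hypothetical and are not taken from the OmniVLA code.

```python
import numpy as np

METRIC_RESOLUTION_M = 0.2  # from the post: waypoints are scaled by 0.2 m


def waypoints_to_metres(waypoints, resolution_m=METRIC_RESOLUTION_M):
    """Scale unitless (x, y) waypoints to metres (hypothetical helper)."""
    wp = np.asarray(waypoints, dtype=float)
    if wp.ndim != 2 or wp.shape[1] < 2:
        raise ValueError("expected an array of shape (N, >=2)")
    # Keep only the first two columns, assumed here to be x and y.
    return wp[:, :2] * resolution_m


# Hypothetical model output, for illustration only.
example = [[1.0, 0.0], [2.0, 0.5], [3.0, 1.5]]
path_m = waypoints_to_metres(example)
```

If the checkpoint's output uses a different layout or normalization, the column selection and the scale factor would need to change accordingly.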

I’m running inference on a desktop RTX 5090. The forward pass takes ~100 ms, but I’m only receiving images at 4 Hz. I also tried all released checkpoints (omnivla-original, omnivla-original-balance, omnivla-finetuned-cast) with similar results.
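As a rough back-of-the-envelope check on these timings, the sketch below estimates how old an observation is by the time its command is in use. It assumes one inference per incoming image and ignores transport and actuation delays, so the numbers are only a lower bound on the real staleness.

```python
FORWARD_PASS_S = 0.100  # ~100 ms forward pass, from the post
IMAGE_RATE_HZ = 4.0     # image rate, from the post

image_period_s = 1.0 / IMAGE_RATE_HZ

# Assumption: one command per image, so the slower stage sets the command period.
command_period_s = max(image_period_s, FORWARD_PASS_S)

# Age of the underlying image when a command first becomes available...
min_age_s = FORWARD_PASS_S
# ...and just before the next command replaces it.
max_age_s = FORWARD_PASS_S + command_period_s

print(f"command period: {command_period_s:.3f} s")
print(f"observation age: {min_age_s:.3f} s to {max_age_s:.3f} s")
```

Under these assumptions the image rate, not the forward pass, is what limits the command rate.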

For inference, I wrote a ROS wrapper around the official script:

https://github.com/NHirose/OmniVLA/blob/main/inference/run_omnivla.py

The wrapper fills in the robot state and produces a cmd_vel output, while keeping the core inference code unchanged. I limited the command speeds to:

linear velocity ≤ 0.3 m/s

angular velocity ≤ 0.75 rad/s
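The limits above amount to a symmetric clamp on each velocity component. A minimal sketch, assuming the limits apply to the magnitude of each command; the helper names are hypothetical and this is not the wrapper's actual code:

```python
MAX_LINEAR = 0.3    # m/s, limit stated in the post
MAX_ANGULAR = 0.75  # rad/s, limit stated in the post


def clamp(value, limit):
    """Restrict value to the interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def limit_cmd(linear, angular):
    """Apply the speed limits to a (linear, angular) command pair."""
    return clamp(linear, MAX_LINEAR), clamp(angular, MAX_ANGULAR)


# Hypothetical raw commands, for illustration only.
saturated = limit_cmd(0.5, -1.2)
unchanged = limit_cmd(0.1, 0.2)
```

In a ROS node, the two returned values would typically be written to the `linear.x` and `angular.z` fields of a `geometry_msgs/Twist` message before publishing on `cmd_vel`.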

Here is a sample video where the model fails to keep the robot on the sidewalk. The prompt is: "navigate on the center of the sidewalk".

camera_color_image_raw.mp4

I was wondering:

  • What is the expected zero-shot performance of OmniVLA on unseen robots?
  • How much data do you think would be needed to fine-tune the model so that it matches the performance you get on the training platforms?
  • What is the best format for storing the fine-tuning dataset?

Thanks!
