
I collected 200 robot trajectories and fine-tuned Lingbot-VLA on them for a total of 200,000 steps, saving a checkpoint every 50,000 steps.
I tested the 100,000-step checkpoint and found the inference performance to be very poor, even though the training loss dropped to around 0.02. I am training on A100 GPUs with a batch size of 32 (16 per node). The dataset is in LeRobot format.
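One check I'm considering is whether the action statistics my deployment pipeline uses for (de)normalization actually match the training data. Below is a minimal sketch of that comparison; the `.npy` file paths are placeholders (the real loading step depends on the LeRobot version and where the checkpoint stores its norm stats):

```python
import numpy as np

# Placeholder: actions stacked from all 200 trajectories, shape (N, action_dim).
# The actual loading call depends on the LeRobot version, so it is stubbed here.
actions = np.load("all_actions.npy")

# Per-dimension statistics of the raw training actions.
train_mean = actions.mean(axis=0)
train_std = actions.std(axis=0)

# Placeholder: the stats the deployment pipeline loads to denormalize outputs.
deploy_mean = np.load("norm_stats_mean.npy")
deploy_std = np.load("norm_stats_std.npy")

# If these diverge, the policy's outputs are denormalized into the wrong range,
# which could explain a low training loss alongside poor rollouts.
print("mean max abs diff:", np.max(np.abs(train_mean - deploy_mean)))
print("std  max abs diff:", np.max(np.abs(train_std - deploy_std)))
```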
I am struggling to identify the root cause. Could you provide some insights? Is the issue more likely the dataset size, action normalization, the training configuration, or the deployment pipeline?