
High fluctuation in training loss during SFT on RoboTwin tasks - Is this expected? #21

@lwbscu

Hi LingBot-VLA team,

Thanks for releasing this amazing work and the highly efficient training codebase!

I am currently running the Post-Training (SFT) pipeline using the provided lingbot-vla-4b model and Qwen2.5-VL-3B-Instruct base. I have conducted rigorous experiments on both a single-task dataset (e.g., click_bell) and the RoboTwin 5-tasks mixed dataset.

1. My Training Setup

  • Hardware: 8x A800 GPUs
  • Batch Size: micro_batch_size = 4 per GPU (resulting in global_batch_size = 32)
  • Config: Modified from robotwin_load20000h.yaml
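For clarity, this is the effective-batch-size arithmetic I am assuming. Note that `grad_accum_steps = 1` is my assumption; the config excerpt does not state it, and `robotwin_load20000h.yaml` may set it differently.

```python
# Sanity check of the effective (global) batch size for the 8-GPU run.
# grad_accum_steps = 1 is an assumption, not taken from the config.
micro_batch_size = 4
num_gpus = 8
grad_accum_steps = 1
global_batch_size = micro_batch_size * num_gpus * grad_accum_steps
print(global_batch_size)  # 32
```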

2. Observation A: High Loss Fluctuation

During training, I plotted the loss from loss.jsonl. Interestingly, even in the single-task training scenario (as shown in the attached figure for the click_bell task), the raw training loss (blue line) fluctuates heavily with dense spikes.

[Figure: training loss curve for the click_bell single-task run; the raw loss (blue line) shows dense high-frequency spikes]
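For reference, this is roughly how I extract and smooth the curve before plotting; an exponential moving average separates the trend from per-batch noise. This is only a sketch: the one-JSON-object-per-line layout and the `"loss"` field name are my assumptions about the `loss.jsonl` schema.

```python
import json

def ema(values, alpha=0.02):
    """Exponential moving average; smaller alpha gives a smoother curve."""
    smoothed, avg = [], values[0]
    for v in values:
        avg = alpha * v + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed

def load_losses(path="loss.jsonl"):
    # Assumed schema: one JSON object per line containing a "loss" field.
    with open(path) as f:
        return [json.loads(line)["loss"] for line in f if line.strip()]

if __name__ == "__main__":
    losses = load_losses()
    trend = ema(losses)
    print(f"final raw loss: {losses[-1]:.4f}, smoothed: {trend[-1]:.4f}")
```

Even with this smoothing applied, the question below about the raw spikes still stands.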

3. Observation B: Significant Evaluation Performance Gap

While the official results report a highly capable model (e.g., ~80%+ success rates on RoboTwin tasks), my evaluation falls far short of this, despite strictly following the SFT pipeline:

  • Single-Task SFT: When training exclusively on the simplest task (click_bell), the success rate is only around 30%.
  • Multi-Task SFT (5 Tasks): I then strictly followed the official documentation snippet: "Training Configuration: We provide the mixed post-training configuration in five RoboTwin 2.0 tasks ('open microwave' 'click bell' 'stack blocks three' 'place shoe' 'put object cabinet')." Surprisingly, when training on the 5-task mixed dataset, the success rate for individual tasks drops drastically to single digits (just a few percent).

[Figure: per-task evaluation success rates for the 5-task mixed SFT run]

4. My Questions:

  1. Loss Fluctuation: Given the 8-GPU setup (micro_batch_size=4), is this high-frequency fluctuation normal for LingBot-VLA's architecture during SFT, even on a single task? How do you usually determine that training has converged?
  2. Performance Reproduction: How can we reproduce the high success rates mentioned in your reports? Are there specific hyperparameters, data augmentation strategies, or longer training epochs required that are not reflected in the default robotwin_load20000h.yaml?
  3. Multi-task Degradation: Is it expected that the 5-task mixed training performs significantly worse than single-task training under the current config? Does the multi-task setup require a different sampling strategy or a much larger batch size to handle task interference?
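To make question 3 concrete, this is the kind of task-balanced sampling I am asking about: each batch element first picks a task uniformly, then an episode within it, so larger tasks do not dominate the mixture. All names and counts here are illustrative; this is not code from the LingBot-VLA repo.

```python
import random

def balanced_sample(task_sizes, n_samples, seed=0):
    """task_sizes: mapping of task name -> episode count.
    Returns (task, episode_index) pairs with tasks drawn uniformly."""
    rng = random.Random(seed)
    tasks = list(task_sizes)
    samples = []
    for _ in range(n_samples):
        task = rng.choice(tasks)  # uniform over tasks, not over episodes
        samples.append((task, rng.randrange(task_sizes[task])))
    return samples

if __name__ == "__main__":
    sizes = {"click_bell": 50, "open_microwave": 5000}  # illustrative counts
    draws = balanced_sample(sizes, 1000)
    share = sum(t == "click_bell" for t, _ in draws) / len(draws)
    print(f"click_bell share: {share:.2f}")  # ~0.5 despite the 100x size gap
```

Is something along these lines (or a different mixture weighting) needed for the 5-task config, or does the default loader already balance tasks?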

Thanks in advance for your time and insights!
