
High fluctuation in training loss during SFT on RoboTwin tasks - Is this expected? #21

@lwbscu

Hi LingBot-VLA team,

Thanks for releasing this amazing work and the highly efficient training codebase!

I am currently running the Post-Training (SFT) pipeline using the provided lingbot-vla-4b model and Qwen2.5-VL-3B-Instruct base. I have conducted rigorous experiments on both a single-task dataset (e.g., click_bell) and the RoboTwin 5-tasks mixed dataset.

1. My Training Setup

  • Hardware: 8x A800 GPUs
  • Batch Size: micro_batch_size = 4 per GPU (resulting in global_batch_size = 32)
  • Config: Modified from robotwin_load20000h.yaml
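For clarity, this is the effective-batch-size arithmetic I am assuming. Note that `grad_accum_steps = 1` is my assumption; the config excerpt does not state it, and `robotwin_load20000h.yaml` may set it differently.

```python
# Sanity check of the effective (global) batch size for the 8-GPU run.
# grad_accum_steps = 1 is an assumption, not taken from the config.
micro_batch_size = 4
num_gpus = 8
grad_accum_steps = 1
global_batch_size = micro_batch_size * num_gpus * grad_accum_steps
print(global_batch_size)  # 32
```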

2. Observation A: High Loss Fluctuation

During training, I plotted the loss from loss.jsonl. Interestingly, even in the single-task training scenario (as shown in the attached figure for the click_bell task), the raw training loss (blue line) fluctuates heavily with dense spikes.

[Figure: training loss curve for the click_bell single-task run; the raw loss (blue line) shows dense high-frequency spikes]
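For reference, this is roughly how I extract and smooth the curve before plotting; an exponential moving average separates the trend from per-batch noise. This is only a sketch: the one-JSON-object-per-line layout and the `"loss"` field name are my assumptions about the `loss.jsonl` schema.

```python
import json

def ema(values, alpha=0.02):
    """Exponential moving average; smaller alpha gives a smoother curve."""
    smoothed, avg = [], values[0]
    for v in values:
        avg = alpha * v + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed

def load_losses(path="loss.jsonl"):
    # Assumed schema: one JSON object per line containing a "loss" field.
    with open(path) as f:
        return [json.loads(line)["loss"] for line in f if line.strip()]

if __name__ == "__main__":
    losses = load_losses()
    trend = ema(losses)
    print(f"final raw loss: {losses[-1]:.4f}, smoothed: {trend[-1]:.4f}")
```

Even with this smoothing applied, the question below about the raw spikes still stands.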

3. Observation B: Significant Evaluation Performance Gap

While the official results report a highly capable model (e.g., ~80%+ success rates on RoboTwin tasks), my evaluation falls far short of this, despite strictly following the SFT pipeline:

  • Single-Task SFT: When training exclusively on the simplest task (click_bell), the success rate is only around 30%.
  • Multi-Task SFT (5 Tasks): I then strictly followed the official documentation snippet: "Training Configuration: We provide the mixed post-training configuration in five RoboTwin 2.0 tasks ('open microwave' 'click bell' 'stack blocks three' 'place shoe' 'put object cabinet')." Surprisingly, when training on the 5-task mixed dataset, the success rate for individual tasks drops drastically to single digits (just a few percent).

[Figure: per-task evaluation success rates for the 5-task mixed SFT run]

4. My Questions:

  1. Loss Fluctuation: Given the 8-GPU setup (micro_batch_size=4), is this high-frequency fluctuation normal for LingBot-VLA's architecture during SFT, even on a single task? How do you usually determine that training has converged?
  2. Performance Reproduction: How can we reproduce the high success rates mentioned in your reports? Are there specific hyperparameters, data augmentation strategies, or longer training epochs required that are not reflected in the default robotwin_load20000h.yaml?
  3. Multi-task Degradation: Is it expected that the 5-task mixed training performs significantly worse than single-task training under the current config? Does the multi-task setup require a different sampling strategy or a much larger batch size to handle task interference?
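To make question 3 concrete, this is the kind of task-balanced sampling I am asking about: each batch element first picks a task uniformly, then an episode within it, so larger tasks do not dominate the mixture. All names and counts here are illustrative; this is not code from the LingBot-VLA repo.

```python
import random

def balanced_sample(task_sizes, n_samples, seed=0):
    """task_sizes: mapping of task name -> episode count.
    Returns (task, episode_index) pairs with tasks drawn uniformly."""
    rng = random.Random(seed)
    tasks = list(task_sizes)
    samples = []
    for _ in range(n_samples):
        task = rng.choice(tasks)  # uniform over tasks, not over episodes
        samples.append((task, rng.randrange(task_sizes[task])))
    return samples

if __name__ == "__main__":
    sizes = {"click_bell": 50, "open_microwave": 5000}  # illustrative counts
    draws = balanced_sample(sizes, 1000)
    share = sum(t == "click_bell" for t, _ in draws) / len(draws)
    print(f"click_bell share: {share:.2f}")  # ~0.5 despite the 100x size gap
```

Is something along these lines (or a different mixture weighting) needed for the 5-task config, or does the default loader already balance tasks?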

Thanks in advance for your time and insights!
