Hi LingBot-VLA team,
Thanks for releasing this amazing work and the highly efficient training codebase!
I am currently running the Post-Training (SFT) pipeline using the provided `lingbot-vla-4b` model and `Qwen2.5-VL-3B-Instruct` base. I have conducted rigorous experiments on both a single-task dataset (e.g., `click_bell`) and the RoboTwin 5-tasks mixed dataset.
1. My Training Setup
- Hardware: 8x A800 GPUs
- Batch Size: `micro_batch_size = 4` per GPU (resulting in `global_batch_size = 32`; see the sketch below)
- Config: Modified from `robotwin_load20000h.yaml`
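For completeness, this is how I arrive at the effective global batch size above (a minimal sketch; the gradient-accumulation value is my assumption, since I left that field at its default):

```python
# Effective global batch size for my run.
# Assumption: gradient accumulation steps = 1 (I did not override it).
micro_batch_size = 4   # per-GPU batch size from the config
num_gpus = 8           # 8x A800
grad_accum_steps = 1   # assumed default

global_batch_size = micro_batch_size * num_gpus * grad_accum_steps
print(global_batch_size)  # -> 32
```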
2. Observation A: High Loss Fluctuation
During training, I plotted the loss from `loss.jsonl`. Interestingly, even in the single-task training scenario (as shown in the attached figure for the `click_bell` task), the raw training loss (blue line) fluctuates heavily with dense spikes.
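For reference, this is roughly how I read and smooth the curve; the field names in `loss.jsonl` (`"step"`, `"loss"`) are my assumption about its schema and may need adjusting:

```python
# Plot raw vs. EMA-smoothed training loss from loss.jsonl.
# Assumption: each line is a JSON object with "step" and "loss" keys;
# adjust the key names if the actual schema differs.
import json
import matplotlib.pyplot as plt

steps, losses = [], []
with open("loss.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        steps.append(rec["step"])
        losses.append(rec["loss"])

# Exponential moving average to make the trend visible under the spikes.
alpha = 0.02
ema, smoothed = None, []
for x in losses:
    ema = x if ema is None else alpha * x + (1 - alpha) * ema
    smoothed.append(ema)

plt.plot(steps, losses, alpha=0.3, label="raw loss")
plt.plot(steps, smoothed, label="EMA (alpha=0.02)")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.savefig("click_bell_loss.png")
```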
3. Observation B: Significant Evaluation Performance Gap
While the official claims demonstrate a highly capable model (e.g., ~80%+ success rates on RoboTwin tasks), my evaluation results fall far short of this, despite strictly following the SFT pipeline:
- Single-Task SFT: When training exclusively on the simplest task (`click_bell`), the success rate is only around 30%.
- Multi-Task SFT (5 Tasks): I then strictly followed the official documentation snippet: "Training Configuration: We provide the mixed post-training configuration in five RoboTwin 2.0 tasks ('open microwave' 'click bell' 'stack blocks three' 'place shoe' 'put object cabinet')." Surprisingly, when training on the 5-task mixed dataset, the success rate for individual tasks drops drastically to single digits (just a few percent).
4. My Questions:
- Loss Fluctuation: Given the 8-GPU setup (`micro_batch_size=4`), is this high-frequency fluctuation expected behavior for LingBot-VLA's architecture during SFT, even on a single task? How do you usually determine optimal convergence?
- Performance Reproduction: How can we reproduce the high success rates mentioned in your reports? Are there specific hyperparameters, data augmentation strategies, or more training epochs required that are not reflected in the default `robotwin_load20000h.yaml`?
- Multi-task Degradation: Is it expected that the 5-task mixed training performs significantly worse than single-task training under the current config? Does the multi-task setup require a different sampling strategy (a rough sketch of what I mean is below) or a much larger batch size to handle task interference?
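For concreteness, the kind of per-task balanced sampling I have in mind looks roughly like this; it is a hypothetical sketch, not taken from the LingBot-VLA codebase, and the per-task sample counts are invented purely for illustration:

```python
# Hypothetical per-task balanced sampling for the 5-task mixed dataset.
# NOT from the LingBot-VLA codebase; the sample counts are made up.
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

# Toy example: one task label per training sample (real labels would come
# from the mixed dataset's metadata).
task_ids = (
    ["open_microwave"] * 500
    + ["click_bell"] * 300
    + ["stack_blocks_three"] * 800
    + ["place_shoe"] * 400
    + ["put_object_cabinet"] * 600
)

counts = Counter(task_ids)
# Inverse-frequency weights so every task is drawn roughly equally often.
weights = torch.tensor([1.0 / counts[t] for t in task_ids], dtype=torch.double)

sampler = WeightedRandomSampler(weights, num_samples=len(task_ids), replacement=True)
# In training, this sampler would replace shuffle=True in the DataLoader.
```

If the released pipeline already does something equivalent (or deliberately does not), it would be very helpful to know.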
Thanks in advance for your time and insights!