[Bug] DeepSpeed ZeRO-2 is unstable with GRPOTrainer using Qwen3-VL MoE #4631

@casper-hansen

Description

Reproduction

The exact same script, run with ZeRO-2 versus plain accelerate, leads to a measurably worse reward under ZeRO-2. Accelerate config used for the ZeRO-2 run:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
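For reference, the two runs differ only in how the launcher is configured. A sketch of the two invocations, assuming the config above is saved as `zero2.yaml` and the training script is `train_grpo.py` (both file names are hypothetical, the original script is not attached):

```shell
# Run with DeepSpeed ZeRO-2, using the accelerate config above
# (saved as zero2.yaml; file name hypothetical)
accelerate launch --config_file zero2.yaml train_grpo.py

# Same script with plain accelerate (DDP) on the same 8 GPUs and bf16
accelerate launch --num_processes 8 --mixed_precision bf16 train_grpo.py
```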
System Info

  • Platform: Linux-6.8.0-85-generic-x86_64-with-glibc2.35
  • Python version: 3.12.9
  • TRL version: 0.25.1
  • PyTorch version: 2.8.0
  • accelerator(s): 8× NVIDIA H200
  • Transformers version: 4.57.1
  • Accelerate version: 1.12.0
  • Accelerate config: not found
  • Datasets version: 4.4.1
  • HF Hub version: 0.36.0
  • bitsandbytes version: 0.48.2
  • DeepSpeed version: 0.18.2
  • Liger-Kernel version: 0.6.4
  • LLM-Blender version: not installed
  • OpenAI version: 2.8.1
  • PEFT version: 0.18.0
  • vLLM version: 0.11.0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
  • Any traceback provided is complete
