System OOM with Qwen3-235B-A22B

**Describe the bug**

Host (CPU) memory usage increases steadily when training Qwen3-235B-A22B until the system runs out of memory. The OOM occurs at the end of the chart below. This was reported by a customer and we do not have a full reproducer yet, but the RL environment is likely not the culprit.

<img width="720" height="240" alt="Image" src="https://github.com/user-attachments/assets/84baa9fe-5197-4011-b47d-b5ae9de97126" />
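Until there is a reproducer, one way to narrow down where the growth comes from might be to diff Python heap snapshots between steps. The sketch below uses only the standard-library `tracemalloc` module. The `leaked` list and the loop are hypothetical stand-ins for a training loop, not code from this repo. `tracemalloc` only sees allocations made through Python's allocator, so growth in native buffers would not show up here.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return up to `limit` source lines whose allocated size grew between snapshots."""
    stats = after.compare_to(before, "lineno")
    return [s for s in stats if s.size_diff > 0][:limit]


tracemalloc.start()
baseline = tracemalloc.take_snapshot()

leaked = []  # hypothetical state that accumulates across steps
for step in range(3):
    leaked.append(bytearray(1_000_000))  # stand-in for a per-step leak

current = tracemalloc.take_snapshot()
growth = top_growth(baseline, current)
for stat in growth:
    print(stat)  # file:line, size delta, allocation count delta
tracemalloc.stop()
```

If the largest entries keep growing from step to step, those source lines are candidates for the leak. If nothing grows here while host memory still climbs, the growth is probably outside the Python heap.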


```
--config examples/configs/grpo_math_qwen30ba3b_megatron.yaml \
policy.model_name=Qwen/Qwen3-235B-A22B \
cluster.gpus_per_node=8 \
policy.megatron_cfg.tensor_model_parallel_size=4 \
policy.megatron_cfg.expert_tensor_parallel_size=1 \
policy.megatron_cfg.pipeline_model_parallel_size=16 \
policy.megatron_cfg.expert_model_parallel_size=4 \
policy.megatron_cfg.context_parallel_size=2 \
policy.megatron_cfg.sequence_parallel=True \
policy.generation.vllm_cfg.tensor_parallel_size=16 \
policy.generation.vllm_cfg.pipeline_parallel_size=1 \
cluster.num_nodes=32 \
policy.megatron_cfg.num_layers_in_first_pipeline_stage=5 \
policy.megatron_cfg.num_layers_in_last_pipeline_stage=5 \
policy.max_total_sequence_length=8192 \
policy.train_global_batch_size=512 \
grpo.num_generations_per_prompt=16 \
grpo.num_prompts_per_step=32 \
policy.generation.vllm_cfg.enforce_eager=True
```
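For anyone triaging, here is a quick sanity check of the scale these overrides imply. The values are copied from the command above. Two things are assumptions on my part: the data-parallel size follows the usual Megatron decomposition (world size divided by TP × PP × CP), and generation uses all GPUs in the cluster, which gives the vLLM replica count.

```python
# Values copied from the overrides above.
num_nodes = 32
gpus_per_node = 8
tp, pp, cp = 4, 16, 2          # Megatron tensor / pipeline / context parallel sizes
vllm_tp, vllm_pp = 16, 1       # vLLM tensor / pipeline parallel sizes
num_prompts_per_step = 32
num_generations_per_prompt = 16
train_global_batch_size = 512

world_size = num_nodes * gpus_per_node                 # 256 GPUs
model_parallel_size = tp * pp * cp                     # 128 GPUs per training replica
assert world_size % model_parallel_size == 0

# Assumption: standard Megatron layout, DP = world / (TP * PP * CP).
data_parallel_size = world_size // model_parallel_size  # 2

# Assumption: every GPU also serves generation.
vllm_replicas = world_size // (vllm_tp * vllm_pp)       # 16

samples_per_step = num_prompts_per_step * num_generations_per_prompt  # 512

print(f"world_size={world_size} dp={data_parallel_size} "
      f"vllm_replicas={vllm_replicas} samples_per_step={samples_per_step}")
```

Under those assumptions the run uses 256 GPUs with 2 training replicas, and the 512 samples generated per step match `train_global_batch_size`, so each step is a single global batch.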
