system oom with qwen 235b #1442

@cmunley1

Describe the bug

Host (CPU) memory usage increases steadily until the system runs out of memory (OOM) when training Qwen3-235B-A22B. The OOM occurs at the end of the chart below. This was customer-reported and we do not have a full reproducer yet, but the RL environment is likely not the culprit.

[Image: CPU memory usage chart]
--config examples/configs/grpo_math_qwen30ba3b_megatron.yaml \
  policy.model_name=Qwen/Qwen3-235B-A22B \
  cluster.num_nodes=32 \
  cluster.gpus_per_node=8 \
  policy.megatron_cfg.tensor_model_parallel_size=4 \
  policy.megatron_cfg.expert_tensor_parallel_size=1 \
  policy.megatron_cfg.pipeline_model_parallel_size=16 \
  policy.megatron_cfg.expert_model_parallel_size=4 \
  policy.megatron_cfg.context_parallel_size=2 \
  policy.megatron_cfg.sequence_parallel=True \
  policy.megatron_cfg.num_layers_in_first_pipeline_stage=5 \
  policy.megatron_cfg.num_layers_in_last_pipeline_stage=5 \
  policy.generation.vllm_cfg.tensor_parallel_size=16 \
  policy.generation.vllm_cfg.pipeline_parallel_size=1 \
  policy.generation.vllm_cfg.enforce_eager=True \
  policy.max_total_sequence_length=8192 \
  policy.train_global_batch_size=512 \
  grpo.num_generations_per_prompt=16 \
  grpo.num_prompts_per_step=32
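
For anyone trying to reproduce this at a different scale, it may help to spell out what the overrides above imply. The following is a rough sketch, not taken from the NeMo RL code: it assumes the usual Megatron decomposition (world size = TP x PP x CP x DP) and that generation is colocated on the same GPUs as training, neither of which is confirmed in this issue. The variable names are illustrative only.

```python
# Values copied from the command-line overrides in this issue.
num_nodes = 32
gpus_per_node = 8
tp, pp, cp = 4, 16, 2            # tensor / pipeline / context parallel sizes
vllm_tp, vllm_pp = 16, 1         # vLLM generation parallel sizes
num_prompts_per_step = 32
num_generations_per_prompt = 16
train_global_batch_size = 512

world_size = num_nodes * gpus_per_node

# Assumption: standard Megatron layout, world_size = TP * PP * CP * DP.
model_parallel_size = tp * pp * cp
data_parallel_size = world_size // model_parallel_size

# Assumption: vLLM is colocated and uses all GPUs, so the number of
# independent generation engines is world_size / (vllm_tp * vllm_pp).
vllm_replicas = world_size // (vllm_tp * vllm_pp)

# Each vLLM engine spans this many nodes (TP=16 on 8-GPU nodes crosses nodes).
nodes_per_vllm_engine = (vllm_tp * vllm_pp) // gpus_per_node

# Rollouts produced per GRPO step.
samples_per_step = num_prompts_per_step * num_generations_per_prompt

print(f"world_size={world_size}")
print(f"data_parallel_size={data_parallel_size}")
print(f"vllm_replicas={vllm_replicas}, nodes_per_engine={nodes_per_vllm_engine}")
print(f"samples_per_step={samples_per_step}")
```

Under these assumptions the run uses 256 GPUs with a data-parallel size of 2, and the 512 rollouts per step match `policy.train_global_batch_size`, so each step would correspond to a single optimizer update. A scaled-down reproducer would presumably want to keep these ratios.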

Labels: bug
