DeepSeek-V3 intermittently fails with OOM #1512

@scsudhakaran

Description

DeepSeek-V3 runs fail with OOM, but at a different iteration each time, when doing performance benchmarking. The issue is observed with the 128-GPU + BF16 combination on GB200.

For a fixed configuration, where should we expect this randomness to come from?
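
Since the failing iteration varies from run to run, one way to narrow down the source is to log per-rank peak CUDA memory at each step; step-to-step variance (for example from uneven expert token routing in the MoE layers) then shows up directly in the numbers. A minimal sketch, assuming a plain PyTorch training loop; the train_step call below is a placeholder, not part of the NeMo recipe:

import torch
import torch.distributed as dist

def log_peak_memory(step: int) -> None:
    """Print this rank's peak allocated/reserved CUDA memory for the current step."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    peak_alloc = torch.cuda.max_memory_allocated() / 2**30     # GiB
    peak_reserved = torch.cuda.max_memory_reserved() / 2**30   # GiB
    print(f"[rank {rank}] step {step}: "
          f"peak_alloc={peak_alloc:.2f} GiB peak_reserved={peak_reserved:.2f} GiB")
    # Reset so the next step reports its own peak instead of a running maximum.
    torch.cuda.reset_peak_memory_stats()

# Hypothetical usage inside the training loop:
# for step, batch in enumerate(dataloader):
#     loss = train_step(batch)   # placeholder for the actual NeMo/Megatron step
#     log_peak_memory(step)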

Image: http://nvcr.io/nvidian/nemo:25.11.rc7
GPU: GB200
branch: scsudhakaran/llmb-r0.2.0

Command used

git checkout scsudhakaran/llmb-r0.2.0

python3 scripts/performance/setup_experiment.py \
    --container_image <container_image>  \
    --model_name deepseek \
    --model_size v3 \
    --domain llm \
    --task pretrain \
    --compute_dtype bf16 \
    --gpu "gb200" \
    --num_gpus 128 \
    --gpus_per_node 4 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 4 \
    --expert_model_parallel_size 32 \
    --virtual_pipeline_model_parallel_size None \
    --global_batch_size 1024 \
    --micro_batch_size 1 \
    --account <account> \
    --partition <partition> \
    --log_dir <log_dir>
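
For context, the batch and parallelism arithmetic implied by these flags, assuming the usual Megatron-style conventions (data-parallel size = num_gpus / (TP x PP), expert parallelism taken from within the data-parallel group):

# Derived sizes for the configuration above (Megatron-style conventions assumed).
num_gpus = 128
tp, pp, ep = 1, 4, 32
global_batch_size, micro_batch_size = 1024, 1

dp = num_gpus // (tp * pp)                                       # 128 / (1 * 4) = 32
num_microbatches = global_batch_size // (micro_batch_size * dp)  # 1024 / (1 * 32) = 32

print(f"data_parallel_size={dp}, microbatches_per_global_batch={num_microbatches}")
assert ep <= dp, "expert-parallel size must fit within the data-parallel group"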
