DeepSeek-V3 intermittently fails with OOM #1512

@scsudhakaran

Description

DeepSeek-V3 runs fail with OOM, but at a different iteration each time, when doing performance benchmarking. The issue is observed with the 128-GPU + BF16 combination on GB200.

For a fixed configuration, where should we expect this randomness to come from?
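
Since the failing iteration varies from run to run, one way to narrow down the source is to log per-rank peak CUDA memory at each step; step-to-step variance (for example from uneven expert token routing in the MoE layers) then shows up directly in the numbers. A minimal sketch, assuming a plain PyTorch training loop; the train_step call below is a placeholder, not part of the NeMo recipe:

import torch
import torch.distributed as dist

def log_peak_memory(step: int) -> None:
    """Print this rank's peak allocated/reserved CUDA memory for the current step."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    peak_alloc = torch.cuda.max_memory_allocated() / 2**30     # GiB
    peak_reserved = torch.cuda.max_memory_reserved() / 2**30   # GiB
    print(f"[rank {rank}] step {step}: "
          f"peak_alloc={peak_alloc:.2f} GiB peak_reserved={peak_reserved:.2f} GiB")
    # Reset so the next step reports its own peak instead of a running maximum.
    torch.cuda.reset_peak_memory_stats()

# Hypothetical usage inside the training loop:
# for step, batch in enumerate(dataloader):
#     loss = train_step(batch)   # placeholder for the actual NeMo/Megatron step
#     log_peak_memory(step)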

Image: http://nvcr.io/nvidian/nemo:25.11.rc7
GPU: GB200
branch: scsudhakaran/llmb-r0.2.0

Command used

git checkout scsudhakaran/llmb-r0.2.0

python3 scripts/performance/setup_experiment.py \
    --container_image <container_image>  \
    --model_name deepseek \
    --model_size v3 \
    --domain llm \
    --task pretrain \
    --compute_dtype bf16 \
    --gpu "gb200" \
    --num_gpus 128 \
    --gpus_per_node 4 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 4 \
    --expert_model_parallel_size 32 \
    --virtual_pipeline_model_parallel_size None \
    --global_batch_size 1024 \
    --micro_batch_size 1 \
    --account <account> \
    --partition <partition> \
    --log_dir <log_dir>
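
For context, the batch and parallelism arithmetic implied by these flags, assuming the usual Megatron-style conventions (data-parallel size = num_gpus / (TP x PP), expert parallelism taken from within the data-parallel group):

# Derived sizes for the configuration above (Megatron-style conventions assumed).
num_gpus = 128
tp, pp, ep = 1, 4, 32
global_batch_size, micro_batch_size = 1024, 1

dp = num_gpus // (tp * pp)                                       # 128 / (1 * 4) = 32
num_microbatches = global_batch_size // (micro_batch_size * dp)  # 1024 / (1 * 32) = 32

print(f"data_parallel_size={dp}, microbatches_per_global_batch={num_microbatches}")
assert ep <= dp, "expert-parallel size must fit within the data-parallel group"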
