Description
DeepSeek-V3 performance-benchmarking runs fail with OOM, but at different iterations across otherwise identical runs. The issue is observed with the 128-GPU + BF16 combination on GB200.
For a fixed config, where should we expect this randomness to come from?
Image: http://nvcr.io/nvidian/nemo:25.11.rc7
GPU: GB200
Branch: scsudhakaran/llmb-r0.2.0
Command used
git checkout scsudhakaran/llmb-r0.2.0
python3 scripts/performance/setup_experiment.py \
--container_image <container_image> \
--model_name deepseek \
--model_size v3 \
--domain llm \
--task pretrain \
--compute_dtype bf16 \
--gpu "gb200" \
--num_gpus 128 \
--gpus_per_node 4 \
--tensor_model_parallel_size 1 \
--pipeline_model_parallel_size 4 \
--expert_model_parallel_size 32 \
--virtual_pipeline_model_parallel_size None \
--global_batch_size 1024 \
--micro_batch_size 1 \
--account <account> \
--partition <partition> \
    --log_dir <log_dir>
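For reference, the parallelism layout implied by the flags above can be sanity-checked with a few lines of arithmetic. This is a sketch of the standard Megatron-style decomposition (TP × PP × DP = total GPUs, with expert parallelism nested inside the data-parallel dimension), not code from the framework itself:

```python
# Sanity-check of the parallelism arithmetic implied by the command above
# (Megatron-style layout; a sketch, not the framework's own validation code).
num_gpus = 128
tp, pp, ep = 1, 4, 32          # tensor / pipeline / expert parallel sizes
gbs, mbs = 1024, 1             # global / micro batch sizes

dp = num_gpus // (tp * pp)     # data-parallel size: 128 / (1 * 4) = 32
assert ep <= dp, "expert parallelism must fit inside the DP dimension"

# micro-batches accumulated per optimizer step: 1024 / (1 * 32) = 32
grad_accum = gbs // (mbs * dp)

print(f"DP={dp}, grad_accum={grad_accum}")
```

With this config every DP rank hosts exactly one expert-parallel shard (EP = DP = 32), so run-to-run variation in memory is more likely to come from MoE routing imbalance (token counts per expert vary by iteration) than from the static layout.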
Labels
bug (Something isn't working)