This recipe contains information and scripts to produce performance results for the Deepseek-v3 pre-training workload. The scripts help perform environment setup and launch benchmark jobs.
Weak scaling methodology is used in the configurations below.
| Size | Precision | GPUs | SeqLen | Layers | TP | PP | CP | EP | ETP | DP | VP | MBS | GBS | GA | RecomputeModule |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 671B | FP8 | 512 | 4096 | 61 | 2 | 8 | 1 | 32 | 1 | 32 | 4 | 1 | 8192 | 128 | mla_up_proj,mlp |
| 671B | BF16/FP8 | 1024 | 4096 | 61 | 2 | 8 | 1 | 64 | 1 | 64 | 4 | 1 | 8192 | 128 | mla_up_proj,mlp |
| Size | Precision | GPUs | SeqLen | Layers | TP | PP | CP | EP | ETP | DP | VP | MBS | GBS | GA | RecomputeModule |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 671B | BF16/FP8 | 256 | 4096 | 61 | 1 | 16 | 1 | 8 | 1 | 16 | 1 | 1 | 2048 | 128 | mla_up_proj,mlp |
| 671B | BF16/FP8 | 512 | 4096 | 61 | 1 | 16 | 1 | 8 | 1 | 32 | 1 | 1 | 4096 | 128 | mla_up_proj,mlp |
| Size | Precision | GPUs | SeqLen | Layers | TP | PP | CP | EP | ETP | DP | VP | MBS | GBS | GA | RecomputeModule |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 671B | BF16/FP8 | 256 | 4096 | 61 | 1 | 4 | 1 | 64 | 1 | 64 | 4 | 1 | 2048 | 32 | mla_up_proj |
| 671B | BF16/FP8 | 512 | 4096 | 61 | 1 | 4 | 1 | 64 | 1 | 128 | 4 | 1 | 4096 | 32 | mla_up_proj |
| Size | Precision | GPUs | SeqLen | Layers | TP | PP | CP | EP | ETP | DP | VP | MBS | GBS | GA | RecomputeModule |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 671B | BF16/FP8 | 128 | 4096 | 61 | 1 | 4 | 1 | 32 | 1 | 32 | 4 | 1 | 1024 | 32 | core_attn |
| 671B | BF16/FP8 | 256 | 4096 | 61 | 1 | 4 | 1 | 64 | 1 | 64 | 4 | 1 | 2048 | 32 | core_attn |
| 671B | BF16/FP8 | 512 | 4096 | 61 | 1 | 4 | 1 | 64 | 1 | 128 | 4 | 1 | 4096 | 32 | core_attn |
Performance for Deepseek-v3 training is measured in seconds per iteration, i.e., seconds per training step. This metric is logged for every training step in the main training log file (see Output Locations).
Since the early training steps typically take much longer (due to input prefetch, activation memory allocation, and JIT compilation), we use the parse_train_timing_mbridge.sh script to analyze iterations 35-44 and compute the mean and standard deviation for reliable performance metrics. The script also reports the achieved GPU throughput via the TFLOPS_per_GPU metric.
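As an illustration of what the script computes, here is a minimal sketch of the mean/standard-deviation calculation over the iteration window. The per-step timings below are made-up placeholder values, not real measurements:

```python
import statistics

# Made-up per-iteration elapsed times in ms; real values come from the training log.
step_times_ms = {i: 11000.0 + (i % 3) for i in range(30, 50)}

# Analysis window used by parse_train_timing_mbridge.sh (iterations 35-44 inclusive).
window = [step_times_ms[i] for i in range(35, 45)]

time_mean_ms = statistics.mean(window)
time_std_ms = statistics.stdev(window)
print(f"Time Mean (ms): {time_mean_ms:.3f}  Time Std (ms): {time_std_ms:.3f}")
```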
To analyze training timing from your experiment results, run the script from the workload directory. In an installed environment, recipe files are available under $LLMB_INSTALL/llmb_repo (a copy created by the installer).
# Basic usage - parses results in the directory named 'experiments' in the current folder
$LLMB_INSTALL/llmb_repo/common/parse_train_timing_mbridge.sh
# Specify a different experiments directory
$LLMB_INSTALL/llmb_repo/common/parse_train_timing_mbridge.sh /path/to/experiments
# Output in CSV format
$LLMB_INSTALL/llmb_repo/common/parse_train_timing_mbridge.sh --format=csv
# Output in JSON format
$LLMB_INSTALL/llmb_repo/common/parse_train_timing_mbridge.sh --format=json
# Show full filenames instead of shortened versions
$LLMB_INSTALL/llmb_repo/common/parse_train_timing_mbridge.sh --full-names
Example output:
Elapsed Time (ms) and TFLOPS/GPU Analysis (iterations 35-44)
================================================================================
Experiment Status Time Mean (ms) Time Std (ms) TFLOPS_per_GPU Mean TFLOPS_per_GPU Std
------------------------------------------------------------------------------------------ -------- ------------- ------------ ------------------- ------------------
pretrain_deepseek_v3_bf16_gpus256_tp1_pp4_cp1_vp4_ep64_mbs1_gbs2048_992591 Success 11071.480 8.236 769.50 0.58
To obtain throughput as a tokens-per-second measurement, use this formula:
(throughput in tokens per second) = (sequence length) * (global batch size) / (training step time in seconds)
E.g. 4096 * 2048 / 11.072 = 757641
To calculate time to train estimate:
(time to train in days) = (total tokens) / (throughput in tokens per second) / (number of seconds in a day)
E.g. 1e12 / 757641 / 86400 = 15.28 days
To calculate the model flops utilization (MFU):
MFU = (achieved TFLOPS_per_GPU) / (peak GPU FLOPS)
E.g. DeepSeek-V3 BF16 on 256x GB200 GPUs (GBS=2048):
peak FLOPS for GB200 BF16 = 2.45 PFLOPS
achieved TFLOPS_per_GPU = 769.50 TFLOPS
MFU = 769.50e+12 / 2.45e+15 = 31.41%
Peak theoretical throughput across GPUs and data types (in TFLOPS):
| Data Type | GB300 | GB200 | B200 | H100 |
|---|---|---|---|---|
| BF16 | 2450 | 2450 | 2250 | 989 |
| FP8 | 4900 | 4900 | 4500 | 1979 |
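The three formulas above can be combined into a short script. The numbers mirror the worked example (256x GB200, BF16, GBS=2048); substitute your own measurements:

```python
seq_len = 4096
global_batch_size = 2048
step_time_s = 11.072            # mean iteration time from parse_train_timing_mbridge.sh, in seconds
achieved_tflops_per_gpu = 769.50
peak_tflops = 2450.0            # GB200 BF16 peak from the table above
total_tokens = 1e12

throughput = seq_len * global_batch_size / step_time_s   # tokens per second
days = total_tokens / throughput / 86400                 # time to train for 1T tokens
mfu = achieved_tflops_per_gpu / peak_tflops              # model FLOPS utilization

print(f"{throughput:.0f} tokens/s, {days:.2f} days, MFU {mfu:.2%}")
```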
A HuggingFace account is required and you will need to create a HuggingFace access token. Add the generated token to your environment via export HF_TOKEN=<your token>.
Requires Python 3.12.x or conda.
No special access is required to run this benchmark.
We reference a number of Slurm commands and parameters in this document; a brief summary is included below. Note that these are a guide and might not apply to all environments. Consult your system administrator for the parameters specific to your system.
Common parameters:
- SBATCH_PARTITION or -p - Partition (or queue) to use.
- SBATCH_ACCOUNT or -A - Slurm account to associate with your job (different from your user; meant for accounting purposes).
- SBATCH_GPUS_PER_NODE or --gres=gpu:<num gpus> - If your cluster is configured with GRES, this should be set to all GPUs in a node. Ignore if not configured.
  - Errors such as 'GPUs not found' or 'Cannot submit to this partition without GPU resources' indicate this setting is required.
These parameters can be set either by exporting the environment variable or using the corresponding sbatch flag.
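For example, the environment-variable form can be set once before submitting jobs. The partition and account names below are placeholders; use the values provided by your system administrator:

```shell
# Placeholder partition/account names -- replace with your site's values.
export SBATCH_PARTITION=gpu_batch
export SBATCH_ACCOUNT=my_project
export SBATCH_GPUS_PER_NODE=4   # only if your cluster is configured with GRES

echo "Submitting to partition ${SBATCH_PARTITION} under account ${SBATCH_ACCOUNT}"
```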
Use the installer referenced in the main README (see installer documentation for details) to prepare the recipe environment:
The following directory layout and key variables are used in the recipe:
- LLMB_INSTALL: Top-level directory for all benchmarking artifacts (images, datasets, venvs, workloads, etc.).
- LLMB_WORKLOAD: Workload-specific directory, e.g. ${LLMB_INSTALL}/workloads/pretrain_deepseek-v3.
- Results, logs, and checkpoints are stored under subfolders of LLMB_WORKLOAD (see below).
Since Deepseek-v3 training uses only synthetic datasets, this step is omitted.
Once the environment has been prepared, it is time to train a model. The training runs for the first 50 steps and then stops. Log files and results are stored under the ${LLMB_WORKLOAD}/experiments/ folder (see Output Locations for details).
The easiest way to run benchmarks is using the llmb-run launcher tool. This method handles configuration automatically and provides a streamlined interface.
# Navigate to your installation directory
cd $LLMB_INSTALL
# Run a benchmark with llmb-run
llmb-run submit -w pretrain_deepseek-v3 -s 671b --dtype bf16 --scale 256
# Example with different scale, and precision
llmb-run submit -w pretrain_deepseek-v3 -s 671b --dtype fp8 --scale 512
# Example with additional SLURM parameters
ADDITIONAL_SLURM_PARAMS="nodelist=node001,node002" llmb-run submit -w pretrain_deepseek-v3 -s 671b --dtype bf16 --scale 256
For more details on llmb-run usage, see the llmb-run documentation.
Alternatively, you can run training directly using the launch script. This method provides more control over individual parameters and environment variables.
Important:
- Ensure your virtual environment is activated before running the training commands below. If you used the installer with conda, run conda activate $LLMB_INSTALL/venvs/<env_name>. If you used the installer with python venv, run source $LLMB_INSTALL/venvs/<env_name>/bin/activate.
- Run the launch script from the installed recipe directory:
cd $LLMB_INSTALL/llmb_repo/deepseek_v3/pretrain/megatron_bridge/
JOB_TOTAL_GPUS=<number> GPU_TYPE=<type> [DTYPE=<precision>] [MODEL_SIZE=<size>] [ADDITIONAL_SLURM_PARAMS=<params>] ./launch.sh
Required:
- JOB_TOTAL_GPUS: Number of GPUs to use
- GPU_TYPE: Type of GPU hardware
  - gb300 - NVIDIA GB300 GPUs
  - gb200 - NVIDIA GB200 GPUs
  - b200 - NVIDIA B200 GPUs
  - h100 - NVIDIA H100 GPUs
Optional:
- DTYPE: Precision format (default: bf16)
  - bf16 - BFloat16 precision
  - fp8 - FP8 precision
- MODEL_SIZE: Model variant (fixed: 671b)
  - 671b - 671 billion parameter model (only supported size)
- ADDITIONAL_SLURM_PARAMS: Additional SLURM parameters (optional)
  - Format: Semicolon-separated sbatch arguments; supports key=value pairs and bare flags. Use semicolons when values contain commas.
  - Examples:
    - "nodelist=node001,node002;constraint=gpu"
    - "constraint=gpu&memory;exclusive"
Train Deepseek-v3 with BF16 precision on 256 GB200 GPUs:
JOB_TOTAL_GPUS=256 GPU_TYPE=gb200 ./launch.sh
Train on 1024 H100 GPUs:
JOB_TOTAL_GPUS=1024 GPU_TYPE=h100 ./launch.sh
Train on specific nodes:
ADDITIONAL_SLURM_PARAMS="nodelist=node001,node002" JOB_TOTAL_GPUS=256 GPU_TYPE=gb200 ./launch.sh
Train with node constraints:
ADDITIONAL_SLURM_PARAMS="constraint=gpu&memory;exclusive" JOB_TOTAL_GPUS=256 GPU_TYPE=gb200 ./launch.sh
Train using a SLURM reservation:
ADDITIONAL_SLURM_PARAMS="reservation=my_reservation" JOB_TOTAL_GPUS=1024 GPU_TYPE=h100 ./launch.sh
All benchmark results are saved under $LLMB_WORKLOAD/experiments/ with the following structure:
experiments/
├── <experiment_name>/
│ └── <experiment_name>_<timestamp>/
│ ├── <experiment_name>/
│ │ ├── log-<experiment_name>.out # Main training log with performance data
│ │ ├── sbatch_<experiment_name>.out # Batch script output
│ │ └── nsys_profile/ # Profiling output (when enabled)
│ │ └── *.nsys-rep files
│ └── [batch scripts and other files]
The <experiment_name> typically follows the pattern: pretrain_deepseek_v3_671b_<dtype>_<scale>_<config>
Key files:
- log-<experiment_name>.out - Contains training step timing and performance metrics analyzed by parse_train_timing_mbridge.sh
- nsys_profile/ - Contains profiling traces when ENABLE_PROFILE=true
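A quick way to locate the most recent main training log for analysis, assuming the layout above, GNU find, and that LLMB_WORKLOAD is set as described in the installation section:

```shell
# Print the path of the newest log-*.out file under the experiments tree.
find "${LLMB_WORKLOAD:-.}/experiments" -name 'log-*.out' -printf '%T@ %p\n' 2>/dev/null \
  | sort -nr | head -n1 | cut -d' ' -f2-
```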
To enable profiling with Nsight Systems, set the variable ENABLE_PROFILE=true when submitting your job. The job runs for a total of 50 steps, and steps 45-50 are profiled.
In order to view the resulting profiles, ensure you have the latest version of Nsight Systems installed. For more information visit: Nsight Systems
- MPI Ranks: all
- Job Steps: 45-50
- Output Location: Profiling output saved alongside training results (see Output Locations)
- Filename format:
profile_${SLURM_JOB_ID}_nodeId_rankId.nsys-rep
Example command:
ENABLE_PROFILE=true JOB_TOTAL_GPUS=256 GPU_TYPE=gb200 ./launch.sh
- Specify job steps to profile:
  - PROFILE_START_STEP: start profiling on this job step (default: 45)
  - PROFILE_STOP_STEP: stop profiling on this job step (default: 50)
- Enable GPU metrics collection:
  - ENABLE_GPU_METRICS: Enable GPU metrics collection during Nsight profiling (default: false)
    - When set to true along with ENABLE_PROFILE=true, captures detailed GPU performance metrics
    - Provides additional GPU utilization, memory usage, and compute efficiency data
    - May require additional system configuration for GPU device metrics to work properly
Example command with GPU metrics:
ENABLE_PROFILE=true ENABLE_GPU_METRICS=true JOB_TOTAL_GPUS=256 GPU_TYPE=gb200 ./launch.sh
To view the profile traces (*.nsys-rep files) interactively:
- Install the latest Nsight Systems client on your preferred system
- Copy the generated .nsys-rep files to a folder on your preferred system. E.g., /home/nsight-traces/
- Open the Nsight Systems client, then click "File | Open" and select one or more .nsys-rep files from the /home/nsight-traces folder. For more details, see the Reading Your Report in GUI guide.
- Once loaded you can analyze the workload behavior to learn about any performance bottlenecks associated with the model or the job run.
Since most benchmarking jobs run on multiple GPUs, each run generates multiple .nsys-rep files. The Multi-Report Analysis Guide is very helpful for automating the analysis and reaching results quicker using Nsight recipes.
If you are new to Nsight profiling, see these tutorials for a quick start.