Skip to content

Latest commit

 

History

History
297 lines (212 loc) · 14.3 KB

File metadata and controls

297 lines (212 loc) · 14.3 KB

Overview

This recipe contains information and scripts to produce performance results for the Deepseek-v3 pre-training workload. The scripts help perform environment setup and launch benchmark jobs.

Weak scaling methodology is used in the configurations below.

H100

Size Precision GPUs SeqLen Layers TP PP CP EP ETP DP VP MBS GBS GA RecomputeModule
671B FP8 512 4096 61 2 8 1 32 1 32 4 1 8192 128 mla_up_proj,mlp
671B BF16/FP8 1024 4096 61 2 8 1 64 1 64 4 1 8192 128 mla_up_proj,mlp

B200

Size Precision GPUs SeqLen Layers TP PP CP EP ETP DP VP MBS GBS GA RecomputeModule
671B BF16/FP8 256 4096 61 1 16 1 8 1 16 1 1 2048 128 mla_up_proj,mlp
671B BF16/FP8 512 4096 61 1 16 1 8 1 32 1 1 4096 128 mla_up_proj,mlp

GB200

Size Precision GPUs SeqLen Layers TP PP CP EP ETP DP VP MBS GBS GA RecomputeModule
671B BF16/FP8 256 4096 61 1 4 1 64 1 64 4 1 2048 32 mla_up_proj
671B BF16/FP8 512 4096 61 1 4 1 64 1 128 4 1 4096 32 mla_up_proj

GB300

Size Precision GPUs SeqLen Layers TP PP CP EP ETP DP VP MBS GBS GA RecomputeModule
671B BF16/FP8 128 4096 61 1 4 1 32 1 32 4 1 1024 32 core_attn
671B BF16/FP8 256 4096 61 1 4 1 64 1 64 4 1 2048 32 core_attn
671B BF16/FP8 512 4096 61 1 4 1 64 1 128 4 1 4096 32 core_attn

Performance Measurement and Analysis

Performance for Deepseek-v3 training is measured by seconds per iteration, or in other words seconds per training step. This metric is logged for every training step in the main training log file see Output Locations.

Since the early training steps typically take much longer time (with input prefetch, activation memory allocation, and JIT compilation), we use the parse_train_timing_mbridge.sh script to analyze iterations 35-44 and calculate mean and standard deviation for reliable performance metrics. We also get the achieved GPU FLOPS via TFLOPS_per_GPU metric.

Running the parse_train_timing_mbridge.sh script

To analyze training timing from your experiment results, run the script from the workload directory. In an installed environment, recipe files are available under $LLMB_INSTALL/llmb_repo (a copy created by the installer).

# Basic usage - parses results in the directory named 'experiments' in the current folder
$LLMB_INSTALL/llmb_repo/common/parse_train_timing_mbridge.sh

# Specify a different experiments directory
$LLMB_INSTALL/llmb_repo/common/parse_train_timing_mbridge.sh /path/to/experiments

# Output in CSV format
$LLMB_INSTALL/llmb_repo/common/parse_train_timing_mbridge.sh --format=csv

# Output in JSON format
$LLMB_INSTALL/llmb_repo/common/parse_train_timing_mbridge.sh --format=json

# Show full filenames instead of shortened versions
$LLMB_INSTALL/llmb_repo/common/parse_train_timing_mbridge.sh --full-names

Example output:

Elapsed Time (ms) and TFLOPS/GPU Analysis (iterations 35-44)
================================================================================
Experiment                                                                                   Status Time Mean (ms) Time Std (ms) TFLOPS_per_GPU Mean TFLOPS_per_GPU Std
------------------------------------------------------------------------------------------ -------- ------------- ------------ ------------------- ------------------
pretrain_deepseek_v3_bf16_gpus256_tp1_pp4_cp1_vp4_ep64_mbs1_gbs2048_992591                  Success     11071.480        8.236              769.50               0.58

To obtain throughput as a tokens per second measurement, follow this formula:

(throughput in tokens per second) = (sequence length) * (global batch size) / training_step_timing

E.g. 4096 * 2048 / 11.072 = 757641

To calculate time to train estimate:

(time to train in days) = (total tokens) / (throughput in tokens per second) / (number of seconds in a day)

E.g. 1e12 / 757641 / 86400 = 15.28 days

To calculate the model flops utilization (MFU):

MFU = (achieved TFLOPS_per_GPU) / (peak GPU FLOPS)

E.g. DeepSeek-V3 BF16 on 256x GB200 GPUs (GBS=2048)

peak FLOPS for GB200 BF16 = 2.45 PFLOPS
achieved TFLOPS_per_GPU = 769.50 TFLOPS

MFU = 769.50e+12 / 2.45e+15 = 31.41%

Peak theoretical throughput across GPUs and Data Types (in TFLOPS)

Data Type GB300 GB200 B200 H100
BF16 2450 2450 2250 989
FP8 4900 4900 4500 1979

Prerequisites

A HuggingFace account is required and you will need to create a HuggingFace access token. Add the generated token to your environment via export HF_TOKEN=<your token>.

Requires Python 3.12.x, or conda.

Request Access

No special access is required to run this benchmark.

Slurm

We reference a number of Slurm commands and parameters in this document. A brief summary is included below. It's important to note these are a guide and might not be applicable to all environments. Please consult with your system administrator for the parameters that are specific to your system.

Common parameters:

  • SBATCH_PARTITION or -p - Partition (or queue) to use.
  • SBATCH_ACCOUNT or -A - Slurm account to associate with your job, different from your user. Meant for accounting purposes.
  • SBATCH_GPUS_PER_NODE or --gres=gpu:<num gpus> - If your cluster is configured with GRES this should be set to all GPUs in a node. Ignore if not configured.
    • Encountering errors such as 'GPUs not found' or 'Cannot submit to this partition without GPU resources' means this setting is required.

These parameters can be set either by exporting the environment variable or using the corresponding sbatch flag.

Prepare environment

Use the installer referenced in the main README (see installer documentation for details) to prepare the recipe environment:

The following directory layout and key variables are used in the recipe:

  • LLMB_INSTALL: Top-level directory for all benchmarking artifacts (images, datasets, venvs, workloads, etc).
  • LLMB_WORKLOAD: Workload-specific directory, e.g. ${LLMB_INSTALL}/workloads/pretrain_deepseek-v3.
  • Results, logs, and checkpoints are stored under subfolders of LLMB_WORKLOAD (see below).

Prepare Dataset

Since Deepseek-v3 training only uses synthetic datasets, this step is omitted.

Run Training

Once the environment has been prepared, it is time to train a model. The training runs for the first 50 steps and then stops. Log files and results are stored under the ${LLMB_WORKLOAD}/experiments/ folder (see Output Locations for details).

Using llmb-run (Recommended)

The easiest way to run benchmarks is using the llmb-run launcher tool. This method handles configuration automatically and provides a streamlined interface.

# Navigate to your installation directory
cd $LLMB_INSTALL

# Run a benchmark with llmb-run
llmb-run submit -w pretrain_deepseek-v3 -s 671b --dtype bf16 --scale 256

# Example with different scale, and precision
llmb-run submit -w pretrain_deepseek-v3 -s 671b --dtype fp8 --scale 512

# Example with additional SLURM parameters
ADDITIONAL_SLURM_PARAMS="nodelist=node001,node002" llmb-run submit -w pretrain_deepseek-v3 -s 671b --dtype bf16 --scale 256

For more details on llmb-run usage, see the llmb-run documentation.

Direct Method

Alternatively, you can run training directly using the launch script. This method provides more control over individual parameters and environment variables.

Important:

  • Ensure your virtual environment is activated before running the training commands below. If you used the installer with conda, run conda activate $LLMB_INSTALL/venvs/<env_name>. If you used the installer with python venv, run source $LLMB_INSTALL/venvs/<env_name>/bin/activate.
  • Run the launch script from the installed recipe directory: cd $LLMB_INSTALL/llmb_repo/deepseek_v3/pretrain/megatron_bridge/

Command Template

JOB_TOTAL_GPUS=<number> GPU_TYPE=<type> [DTYPE=<precision>] [MODEL_SIZE=<size>] [ADDITIONAL_SLURM_PARAMS=<params>] ./launch.sh

Environment Variables

Required:

  • JOB_TOTAL_GPUS: Number of GPUs to use
  • GPU_TYPE: Type of GPU hardware
    • gb300 - NVIDIA GB300 GPUs
    • gb200 - NVIDIA GB200 GPUs
    • b200 - NVIDIA B200 GPUs
    • h100 - NVIDIA H100 GPUs

Optional:

  • DTYPE: Precision format (default: bf16)
    • bf16 - BFloat16 precision
    • fp8 - FP8 precision
  • MODEL_SIZE: Model variant (fixed: 671b)
    • 671b - 671 billion parameter model (only supported size)
  • ADDITIONAL_SLURM_PARAMS: Additional SLURM parameters (optional)
    • Format: Semicolon-separated sbatch arguments; supports key=value pairs and bare flags. Use semicolons when values contain commas.
    • Examples:
      • "nodelist=node001,node002;constraint=gpu"
      • "constraint=gpu&memory;exclusive"

Example Commands

Train Deepseek-v3 with BF16 precision on 256 GB200 GPUs:

JOB_TOTAL_GPUS=256 GPU_TYPE=gb200 ./launch.sh

Train on 1024 H100 GPUs:

JOB_TOTAL_GPUS=1024 GPU_TYPE=h100 ./launch.sh

SLURM Node Specification Examples

Train on specific nodes:

ADDITIONAL_SLURM_PARAMS="nodelist=node001,node002" JOB_TOTAL_GPUS=256 GPU_TYPE=gb200 ./launch.sh

Train with node constraints:

ADDITIONAL_SLURM_PARAMS="constraint=gpu&memory;exclusive" JOB_TOTAL_GPUS=256 GPU_TYPE=gb200 ./launch.sh

Train using a SLURM reservation:

ADDITIONAL_SLURM_PARAMS="reservation=my_reservation" JOB_TOTAL_GPUS=1024 GPU_TYPE=h100 ./launch.sh

Output Locations

All benchmark results are saved under $LLMB_WORKLOAD/experiments/ with the following structure:

experiments/
├── <experiment_name>/
│   └── <experiment_name>_<timestamp>/
│       ├── <experiment_name>/
│       │   ├── log-<experiment_name>.out      # Main training log with performance data
│       │   ├── sbatch_<experiment_name>.out   # Batch script output  
│       │   └── nsys_profile/                  # Profiling output (when enabled)
│       │       └── *.nsys-rep files
│       └── [batch scripts and other files]

The <experiment_name> typically follows the pattern: pretrain_deepseek_v3_671b_<dtype>_<scale>_<config>

Key files:

  • log-<experiment_name>.out - Contains training step timing and performance metrics analyzed by parse_train_timing_mbridge.sh
  • nsys_profile/ - Contains profiling traces when ENABLE_PROFILE=true

Run Nsight Profiling

To enable profiling with Nsight Systems set variable ENABLE_PROFILE=true when submitting your job. The job will run for a total of 50 steps where steps 45-50 will be profiled.

In order to view the resulting profiles, ensure you have the latest version of Nsight Systems installed. For more information visit: Nsight Systems

Profiling job details:

  • MPI Ranks: all
  • Job Steps: 45-50
  • Output Location: Profiling output saved alongside training results (see Output Locations)
  • Filename format: profile_${SLURM_JOB_ID}_nodeId_rankId.nsys-rep

Example command:

ENABLE_PROFILE=true JOB_TOTAL_GPUS=256 GPU_TYPE=gb200 ./launch.sh

Customizing profiling behavior:

  • Specify job steps to profile:
    • PROFILE_START_STEP: start profiling on this job step.
    • Default: 45
    • PROFILE_STOP_STEP: stop profiling on this job step.
    • Default: 50
  • Enable GPU metrics collection:
    • ENABLE_GPU_METRICS: Enable GPU metrics collection during Nsight profiling (default: false)
    • When set to true along with ENABLE_PROFILE=true, captures detailed GPU performance metrics
    • Provides additional GPU utilization, memory usage, and compute efficiency data
    • May require additional system configuration for GPU device metrics to work properly

Example command with GPU metrics:

ENABLE_PROFILE=true ENABLE_GPU_METRICS=true JOB_TOTAL_GPUS=256 GPU_TYPE=gb200 ./launch.sh

Viewing results

In order to view the profile traces (*.nsys-rep files) interactively:

  • Install the latest Nsight Systems client on your preferred system
  • Copy the generated .nsys-rep files to a folder on your preferred system. E.g., /home/nsight-traces/
  • Open Nsight Systems client, then click "File | Open" and select one or more .nsys-rep files from /home/nsight-systems folder. For more details, see Reading Your Report in GUI guide.
  • Once loaded you can analyze the workload behavior to learn about any performance bottlenecks associated with the model or the job run.

Since most of the benchmarking jobs run on multiple GPUs, there will be multiple .nsys-rep files generated for each run. Multi-Report Analysis Guide will be very helpful to automate the analysis and get to results quicker by using Nsight recipes.

See these tutorials to get a quick start if you are new to Nsight profiling.