Commit f937055

Benchmark results for sdpa operation on Blackwell (#146)
The benchmarking script in this current directory profiles scaled dot product attention (SDPA) from various backends. Here we benchmark attention layer dimensions inspired by Llama-3.1-405B with sequence lengths ranging from 512 to 131,072.
1 parent 576e0ed commit f937055

File tree

8 files changed (+1146 lines, -3 lines)
Lines changed: 20 additions & 0 deletions
# Base image: NGC PyTorch container (PyTorch 2.8.0, CUDA 12.9)
FROM nvcr.io/nvidia/pytorch:25.05-py3

# Plotting dependency used for the benchmark figures
RUN pip install --upgrade pip && \
    pip install seaborn

# Install cuDNN (9.10.2 at the time of writing) from NVIDIA's CUDA apt repository
RUN apt-get update && \
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb && \
    dpkg -i cuda-keyring_1.1-1_all.deb && \
    apt-get update && \
    apt-get -y install cudnn

# Remove the pip-installed cuDNN so the apt-installed library above is the one loaded
RUN pip uninstall -y cudnn

COPY benchmark_sdpa.py .
COPY benchmark_single_sdpa.py .

# Ensure the apt-installed cuDNN in /usr/lib/x86_64-linux-gnu is found first
ENV LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH

WORKDIR /workspace
Lines changed: 127 additions & 0 deletions
# Scaled Dot Product Attention Benchmark
## Introduction

The benchmarking script in this directory profiles scaled dot product attention (SDPA) from various backends. Here we benchmark attention layer dimensions inspired by [Llama-3.1-405B](https://ai.meta.com/blog/meta-llama-3-1/) with sequence lengths ranging from 512 to 131,072.

The provided benchmark targets training use cases: causal masking is enabled and grouped query attention (GQA) is used. Layer dimensions and causal masking can be altered by modifying the preset parameters in `benchmark_sdpa.py`. Inference-specific attention optimizations such as paged attention are not benchmarked at this time.
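For reference, the sketch below shows what one of these attention layers looks like as a single PyTorch call, with the GQA head counts and causal masking described above. The tensor shapes and the short sequence length are illustrative only; this is not the exact code in `benchmark_sdpa.py`.

```
import torch
import torch.nn.functional as F

# Illustrative dimensions (Llama-3.1-405B-style heads; short sequence for brevity)
batch, seqlen, num_q_heads, num_kv_heads, head_dim = 1, 512, 128, 8, 128

q = torch.randn(batch, num_q_heads, seqlen, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn(batch, num_kv_heads, seqlen, head_dim, device="cuda", dtype=torch.bfloat16)
v = torch.randn(batch, num_kv_heads, seqlen, head_dim, device="cuda", dtype=torch.bfloat16)

# Grouped query attention (8 KV heads shared by 128 query heads) with causal masking;
# enable_gqa requires a recent PyTorch, otherwise repeat k/v up to 128 heads first.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
```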
## Contents

- `Dockerfile` to create a Docker container with the dependencies and run the benchmark.
- `benchmark_sdpa.py`, which benchmarks the cuDNN, PyTorch, and other backends at sequence lengths up to 128k.
- Benchmark results on B200 in the `artifacts` directory.
- Useful Python scripts for running single attention layers:
  - `benchmark_single_sdpa.py` for benchmarking a single flash attention instance from various backends.
  - See below for a usage example.

## Software versions

This benchmark code should run on any reasonably modern Python environment with a CUDA-enabled GPU. The results in `artifacts` were collected using the PyTorch docker image [from the NVIDIA GPU CLOUD (NGC) catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch), `nvcr.io/nvidia/pytorch:25.05-py3`, where cuDNN 9.10.2 was used. We provide a `Dockerfile` to reproduce the environment with the following library versions:

| Software       | Version |
|----------------|---------|
| Python         | 3.12.9  |
| CUDA           | 12.9.0  |
| cuDNN          | 9.10.2  |
| PyTorch        | 2.8.0   |
| FlashAttention | 2.7.3   |

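To confirm that the environment inside the container matches the table above, a quick check such as the following can be run (it mirrors the `[INFO]` lines printed by the benchmark scripts):

```
import torch
import flash_attn

# Expected in the provided container: PyTorch 2.8.0, CUDA 12.9, cuDNN 91002 (9.10.2), FlashAttention 2.7.3
print(f"{torch.__version__ = }")
print(f"{torch.version.cuda = }")
print(f"{torch.backends.cudnn.version() = }")
print(f"{flash_attn.__version__ = }")
```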
## Steps to run

### 0. *Optional*: Lock Clocks

Although the benchmarking code inserts dynamically sized delays to avoid GPU throttling, the most reproducible results are obtained when the clocks are locked. For example, use `nvidia-smi -q -d SUPPORTED_CLOCKS` to list the supported clocks, then:

```
sudo nvidia-smi -pm 1
nvidia-smi -lgc <min_clock>,<max_clock>
```

### 1. Build docker container

Launch the Docker build and run the container. We provide a simple `Dockerfile` to help run the benchmark:

```
docker build -t cudnn_attention_benchmark .
docker run -it --gpus all --rm -v $(pwd):/workspace cudnn_attention_benchmark
```

### 2. Run Benchmark script

The `benchmark_sdpa.py` script executes a predefined set of attention layers of various sequence lengths, where the transformer dimensions are inspired by [Llama-3.1-405B](https://ai.meta.com/blog/meta-llama-3-1/) (`num_q_heads=128; num_kv_heads=8; head_dim=128; is_causal=True; dtype=bfloat16`).

The following scaled dot product attention backends are benchmarked (a sketch of how each is invoked follows below):
- [PyTorch's SDPA backends](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html):
  - cuDNN (`CUDNN_ATTENTION`)
  - Standard Attention (`MATH`)
  - FlashAttention-2 (`FLASH_ATTENTION`; PyTorch FAv2)
- [FlashAttention-2](https://github.com/Dao-AILab/flash-attention)'s original implementation (native FAv2)

Please note that FlashAttention-3 is currently not supported on NVIDIA's Blackwell generation GPUs.
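The backend names that appear in the results (`pyt_math`, `pyt_cudnn`, `pyt_flash_attention`, `flash_attention`) correspond to the entries above. The sketch below shows one way each backend can be invoked; it is an illustration with small, assumed tensor shapes, not the benchmark code itself, and it assumes the `flash_attn` package is installed.

```
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel
from flash_attn import flash_attn_func

# Small illustrative tensors in (batch, heads, seqlen, head_dim) layout
q = torch.randn(1, 128, 512, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 512, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 512, 128, device="cuda", dtype=torch.bfloat16)

# pyt_math / pyt_cudnn / pyt_flash_attention: restrict PyTorch SDPA to one backend at a time
for backend in (SDPBackend.MATH, SDPBackend.CUDNN_ATTENTION, SDPBackend.FLASH_ATTENTION):
    with sdpa_kernel(backend):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)

# flash_attention: the native FlashAttention-2 kernel, which takes
# (batch, seqlen, heads, head_dim) layout and handles the GQA head-count mismatch itself
out_fa2 = flash_attn_func(q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), causal=True)
```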

Sample output:
```
$ python3 benchmark_sdpa.py
[INFO] torch.__version__ = '2.8.0a0+5228986c39.nv25.05'
[INFO] torch.version.cuda = '12.9'
[INFO] torch.cuda.is_available() = True
[INFO] torch.cuda.device_count() = 1
[INFO] torch.cuda.current_device() = 0
[INFO] torch.cuda.get_device_name(torch.cuda.current_device()) = 'NVIDIA B200'
[INFO] torch.backends.cudnn.version() = 91002
[INFO] torch.backends.cudnn.enabled = True
[INFO] flash_attn.__version__ = '2.7.3'
[INFO] Begin benchmark for layers (batch_size,q_seqlen,kv_seqlen,num_q_heads,num_kv_heads,head_dim)
[INFO] sdpa_configs = [(1, 512, 512, 128, 8, 128), (1, 1024, 1024, 128, 8, 128), (1, 2048, 2048, 128, 8, 128), (1, 4096, 4096, 128, 8, 128), (1, 8192, 8192, 128, 8, 128), (1, 16384, 16384, 128, 8, 128), (1, 32768, 32768, 128, 8, 128), (1, 65536, 65536, 128, 8, 128), (1, 131072, 131072, 128, 8, 128), (2, 131072, 131072, 128, 8, 128)]
[INFO] Running layer (1, 512, 512, 128, 8, 128)
...
[INFO] Saving results to ./artifacts/sdpa_benchmark_results_NVIDIA_B200.csv
[INFO] Saving plot to ./artifacts/sdpa_benchmark_results_NVIDIA_B200.png
```

Benchmarked performance numbers are stored in the [artifacts](artifacts) directory as CSV and PNG files.
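Because the per-layer numbers are also written to the CSV, the results can be re-plotted or analyzed outside the benchmark script. Below is a minimal sketch, assuming pandas, seaborn, and matplotlib are available in the container (the `Dockerfile` installs seaborn) and that the CSV filename matches the B200 run shown above:

```
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("./artifacts/sdpa_benchmark_results_NVIDIA_B200.csv")
df["config"] = df["batch_size"].astype(str) + " x " + df["q_seqlen"].astype(str)

# Forward throughput per backend vs. (batch size x sequence length)
sns.barplot(data=df, x="config", y="fwd_tflops_per_sec", hue="backend")
plt.xticks(rotation=45)
plt.ylabel("Forward TFLOPS")
plt.tight_layout()
plt.savefig("sdpa_fwd_tflops_replot.png")
```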

## Results

Below are the results of the benchmark run on a single B200 GPU.

For these runs, the following software versions were used:

- CUDA: 12.9 (from NGC container)
- PyTorch: 2.8.0 (from NGC container)
- cuDNN: 9.10.2 (installed via `apt-get`; see `Dockerfile`)

#### B200
![Comparison of pytorch and cudnn](artifacts/sdpa_benchmark_results_NVIDIA_B200.png)
- The following SDPA parameters were used: `num_q_heads=128; num_kv_heads=8; head_dim=128; is_causal=True; dtype=bfloat16`.
- Batch size and sequence lengths are shown on the x-axis.
- Results were obtained on an NVIDIA B200 GPU with unlocked (free-running) clocks.

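The TFLOPS figures reported above and in the CSV are consistent with the standard attention FLOP count of two matmuls per head (QK^T and PV), halved for causal masking, with the backward pass counted as 2.5x the forward. The snippet below reproduces one `pyt_cudnn` row of the CSV under that accounting; this convention is an assumption, but it matches the stored numbers.

```
# Assumed FLOP accounting (reproduces the CSV's TFLOPS columns):
#   forward  = 4 * batch * num_q_heads * q_seqlen * kv_seqlen * head_dim
#   causal masking computes only ~half the score matrix  -> divide by 2
#   backward = 2.5 * forward
batch, q_len, kv_len, heads, dim = 1, 32768, 32768, 128, 128
fwd_flops = 4 * batch * heads * q_len * kv_len * dim / 2
bwd_flops = 2.5 * fwd_flops

fwd_ms, bwd_ms = 24.518, 77.677  # pyt_cudnn forward/backward times from the CSV (milliseconds)
print(fwd_flops / (fwd_ms * 1e-3) / 1e12)  # ~1435 TFLOPS
print(bwd_flops / (bwd_ms * 1e-3) / 1e12)  # ~1132 TFLOPS
```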

## PyTorch adoption

As can be seen from the results, cuDNN v9 can achieve over 2x the performance of the comparable PyTorch eager implementation. Refer to the documentation for [PyTorch's scaled_dot_product_attention()](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) and the [sdpa_kernel](https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html#torch.nn.attention.sdpa_kernel) context manager for enabling the cuDNN backend for scaled dot product attention.
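A minimal sketch of opting in from training code follows; the tensors are illustrative placeholders for the query, key, and value produced inside your own model.

```
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(1, 128, 4096, 128, device="cuda", dtype=torch.bfloat16, requires_grad=True)
k = torch.randn(1, 8, 4096, 128, device="cuda", dtype=torch.bfloat16, requires_grad=True)
v = torch.randn(1, 8, 4096, 128, device="cuda", dtype=torch.bfloat16, requires_grad=True)

# Allow only the cuDNN backend for every SDPA call made inside the context
# (other SDPBackend members can be appended to the list as fallbacks).
with sdpa_kernel([SDPBackend.CUDNN_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
    out.sum().backward()  # both forward and backward then run through cuDNN
```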

## `benchmark_single_sdpa.py`

`benchmark_single_sdpa.py` is provided to conveniently run a single SDPA operation. Try running `python benchmark_single_sdpa.py --help` to see available flags.

Example commands and outputs:
```
## For running various PyTorch backends (FlashAttention, cuDNN, ...) or FlashAttention-2:
$ python benchmark_single_sdpa.py --batch_size 1 --q_seqlen 32768 --kv_seqlen 32768 --num_q_heads 128 --num_kv_heads 8 --head_dim 128 --is_causal --data_type bfloat16 --num_iterations 10 --sdpa_backend pyt_cudnn --fwd_bwd
pyt_cudnn:: Median (fwd, bwd) Execution Times: 24.602 ms (1430 TFLOPS), 78.140 ms (1126 TFLOPS) (max difference vs. pyt_reference: 0.007812 from 10 iterations)

## For directly running cuDNN via cuDNN Frontend
$ python benchmark_single_sdpa.py --batch_size 1 --q_seqlen 32768 --kv_seqlen 32768 --num_q_heads 128 --num_kv_heads 8 --head_dim 128 --is_causal --data_type bfloat16 --num_iterations 10 --sdpa_backend cudnn_fe --fwd_bwd
cudnn_fe:: Median (fwd, bwd) Execution Times: 24.480 ms (1437 TFLOPS), 73.519 ms (1196 TFLOPS) (max difference vs. pyt_reference: 0.007812 from 10 iterations)
```
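The reported times are medians over `--num_iterations` runs. Below is a minimal sketch of CUDA-event-based median timing under that assumption; the script's actual warm-up, synchronization, and throttling-avoidance delays are not reproduced here.

```
import statistics
import torch

def median_time_ms(fn, num_iterations=10):
    """Time a CUDA callable with events; return the median over iterations in milliseconds."""
    times = []
    for _ in range(num_iterations):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    return statistics.median(times)
```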

The cuDNN version used in the benchmark can be replaced by setting the `LD_LIBRARY_PATH` environment variable.
```
$ export LD_LIBRARY_PATH=<my_path_to_cuDNN_9.10.1>
$ python benchmark_single_sdpa.py --batch_size 1 --q_seqlen 16384 --kv_seqlen 16384 --num_q_heads 128 --num_kv_heads 8 --head_dim 128 --is_causal --data_type bfloat16 --num_iterations 10 --sdpa_backend cudnn_fe --fwd_bwd --verbose
[INFO] cuDNN Backend Version: cudnn.backend_version() = 91001
[INFO] cuDNN Frontend Version: cudnn.__version__ = '1.11.0'
[INFO] torch.__version__ = '2.8.0a0+5228986c39.nv25.05'
[INFO] torch.version.cuda = '12.9'
[INFO] torch.cuda.is_available() = True
[INFO] torch.cuda.device_count() = 1
[INFO] torch.cuda.current_device() = 0
[INFO] torch.cuda.get_device_name(torch.cuda.current_device()) = 'NVIDIA B200'
cudnn_fe:: Median (fwd, bwd) Execution Times: 6.421 ms (1370 TFLOPS), 19.367 ms (1135 TFLOPS) (max difference vs. pyt_reference: 0.007812 from 10 iterations)
```
Lines changed: 63 additions & 0 deletions

[INFO] torch.__version__ = '2.8.0a0+5228986c39.nv25.05'
[INFO] torch.version.cuda = '12.9'
[INFO] torch.cuda.is_available() = True
[INFO] torch.cuda.device_count() = 1
[INFO] torch.cuda.current_device() = 0
[INFO] torch.cuda.get_device_name(torch.cuda.current_device()) = 'NVIDIA B200'
[INFO] torch.backends.cudnn.version() = 91002
[INFO] torch.backends.cudnn.enabled = True
[INFO] flash_attn.__version__ = '2.7.3'
[INFO] Begin benchmark for layers (batch_size,q_seqlen,kv_seqlen,num_q_heads,num_kv_heads,head_dim)
[INFO] sdpa_configs = [(1, 512, 512, 128, 8, 128), (1, 1024, 1024, 128, 8, 128), (1, 2048, 2048, 128, 8, 128), (1, 4096, 4096, 128, 8, 128), (1, 8192, 8192, 128, 8, 128), (1, 16384, 16384, 128, 8, 128), (1, 32768, 32768, 128, 8, 128), (1, 65536, 65536, 128, 8, 128), (1, 131072, 131072, 128, 8, 128), (2, 131072, 131072, 128, 8, 128)]
[INFO] Running layer (1, 512, 512, 128, 8, 128)
[INFO] Benchmarking backend pyt_math
[INFO] Benchmarking backend pyt_cudnn
[INFO] Benchmarking backend pyt_flash_attention
[INFO] Benchmarking backend flash_attention
[INFO] Running layer (1, 1024, 1024, 128, 8, 128)
[INFO] Benchmarking backend pyt_math
[INFO] Benchmarking backend pyt_cudnn
[INFO] Benchmarking backend pyt_flash_attention
[INFO] Benchmarking backend flash_attention
[INFO] Running layer (1, 2048, 2048, 128, 8, 128)
[INFO] Benchmarking backend pyt_math
[INFO] Benchmarking backend pyt_cudnn
[INFO] Benchmarking backend pyt_flash_attention
[INFO] Benchmarking backend flash_attention
[INFO] Running layer (1, 4096, 4096, 128, 8, 128)
[INFO] Benchmarking backend pyt_math
[INFO] Benchmarking backend pyt_cudnn
[INFO] Benchmarking backend pyt_flash_attention
[INFO] Benchmarking backend flash_attention
[INFO] Running layer (1, 8192, 8192, 128, 8, 128)
[INFO] Benchmarking backend pyt_math
[INFO] Benchmarking backend pyt_cudnn
[INFO] Benchmarking backend pyt_flash_attention
[INFO] Benchmarking backend flash_attention
[INFO] Running layer (1, 16384, 16384, 128, 8, 128)
[INFO] Benchmarking backend pyt_math
[INFO] Benchmarking backend pyt_cudnn
[INFO] Benchmarking backend pyt_flash_attention
[INFO] Benchmarking backend flash_attention
[INFO] Running layer (1, 32768, 32768, 128, 8, 128)
[INFO] Benchmarking backend pyt_math
[INFO] Benchmarking backend pyt_cudnn
[INFO] Benchmarking backend pyt_flash_attention
[INFO] Benchmarking backend flash_attention
[INFO] Running layer (1, 65536, 65536, 128, 8, 128)
[INFO] Benchmarking backend pyt_math
[INFO] Benchmarking backend pyt_cudnn
[INFO] Benchmarking backend pyt_flash_attention
[INFO] Benchmarking backend flash_attention
[INFO] Running layer (1, 131072, 131072, 128, 8, 128)
[INFO] Benchmarking backend pyt_math
[INFO] Benchmarking backend pyt_cudnn
[INFO] Benchmarking backend pyt_flash_attention
[INFO] Benchmarking backend flash_attention
[INFO] Running layer (2, 131072, 131072, 128, 8, 128)
[INFO] Benchmarking backend pyt_math
[INFO] Benchmarking backend pyt_cudnn
[INFO] Benchmarking backend pyt_flash_attention
[INFO] Benchmarking backend flash_attention
[INFO] Saving results to ./artifacts/sdpa_benchmark_results_NVIDIA_B200.csv
[INFO] Saving plot to ./artifacts/sdpa_benchmark_results_NVIDIA_B200.png
Lines changed: 41 additions & 0 deletions

batch_size,q_seqlen,kv_seqlen,num_q_heads,num_kv_heads,head_dim,is_causal,backend,forward_time,backward_time,fwd_tflops_per_sec,bwd_tflops_per_sec
1,512,512,128,8,128,True,pyt_math,0.694,0.739,12.372,29.046
1,512,512,128,8,128,True,pyt_cudnn,0.130,0.494,66.280,43.512
1,512,512,128,8,128,True,pyt_flash_attention,0.142,0.514,60.602,41.808
1,512,512,128,8,128,True,flash_attention,0.147,0.629,58.451,34.139
1,1024,1024,128,8,128,True,pyt_math,1.823,1.356,18.844,63.354
1,1024,1024,128,8,128,True,pyt_cudnn,0.159,0.482,216.110,178.132
1,1024,1024,128,8,128,True,pyt_flash_attention,0.233,0.768,147.330,111.864
1,1024,1024,128,8,128,True,flash_attention,0.235,0.754,146.187,113.956
1,2048,2048,128,8,128,True,pyt_math,6.829,4.000,20.125,85.893
1,2048,2048,128,8,128,True,pyt_cudnn,0.262,0.895,525.475,383.705
1,2048,2048,128,8,128,True,pyt_flash_attention,0.548,1.765,250.977,194.686
1,2048,2048,128,8,128,True,flash_attention,0.530,1.635,259.139,210.091
1,4096,4096,128,8,128,True,pyt_math,27.183,14.169,20.225,96.998
1,4096,4096,128,8,128,True,pyt_cudnn,0.613,1.938,897.309,709.138
1,4096,4096,128,8,128,True,pyt_flash_attention,1.678,5.159,327.554,266.388
1,4096,4096,128,8,128,True,flash_attention,1.581,4.722,347.732,291.053
1,8192,8192,128,8,128,True,pyt_math,115.349,52.913,19.064,103.897
1,8192,8192,128,8,128,True,pyt_cudnn,1.816,5.815,1211.140,945.440
1,8192,8192,128,8,128,True,pyt_flash_attention,5.848,17.781,376.059,309.174
1,8192,8192,128,8,128,True,flash_attention,5.453,16.392,403.305,335.386
1,16384,16384,128,8,128,True,pyt_math,inf,inf,0.000,0.000
1,16384,16384,128,8,128,True,pyt_cudnn,6.475,20.466,1358.566,1074.499
1,16384,16384,128,8,128,True,pyt_flash_attention,22.158,66.431,396.971,331.025
1,16384,16384,128,8,128,True,flash_attention,20.522,62.001,428.611,354.678
1,32768,32768,128,8,128,True,pyt_math,inf,inf,0.000,0.000
1,32768,32768,128,8,128,True,pyt_cudnn,24.518,77.677,1435.072,1132.387
1,32768,32768,128,8,128,True,pyt_flash_attention,86.581,257.176,406.373,342.027
1,32768,32768,128,8,128,True,flash_attention,80.170,241.841,438.873,363.713
1,65536,65536,128,8,128,True,pyt_math,inf,inf,0.000,0.000
1,65536,65536,128,8,128,True,pyt_cudnn,98.489,327.462,1428.973,1074.456
1,65536,65536,128,8,128,True,pyt_flash_attention,342.696,1015.375,410.677,346.516
1,65536,65536,128,8,128,True,flash_attention,317.255,958.828,443.610,366.952
1,131072,131072,128,8,128,True,pyt_math,inf,inf,0.000,0.000
1,131072,131072,128,8,128,True,pyt_cudnn,417.410,1355.987,1348.674,1037.897
1,131072,131072,128,8,128,True,pyt_flash_attention,1366.320,4051.869,412.019,347.340
1,131072,131072,128,8,128,True,flash_attention,1264.455,3845.172,445.212,366.011
2,131072,131072,128,8,128,True,pyt_math,inf,inf,0.000,0.000
2,131072,131072,128,8,128,True,pyt_cudnn,854.100,2736.916,1318.230,1028.438
2,131072,131072,128,8,128,True,pyt_flash_attention,2731.190,8108.904,412.238,347.118
2,131072,131072,128,8,128,True,flash_attention,2527.642,7692.665,445.435,365.900
120 KB binary file (not rendered)
