Work in Progress
NOTE: Performance analysis of a reduction operation on the GPU is done here.
Performance data on an RTX 3050 Mobile for A(M x P) * B(P x N) = C(M x N), with M = N = P = 1024.
The maximum theoretical performance of an RTX 3050 Mobile is 5.501 TFLOPS (FP32 / float), with a global memory bandwidth of 192 GB/s. Source: https://www.techpowerup.com/gpu-specs/geforce-rtx-3050-mobile.c3788
The time complexity of a simple MM is O(N^3): every entry of the resulting matrix requires N multiplications and nearly as many additions, i.e. about 2N FLOP per entry, and therefore 2N^3 FLOP for the entire matrix.
Given the theoretical peak performance (the ROOF in the roofline model), we can compute the relative performance once the kernel's execution time is known:

`relative performance = (2 * N^3 / kernel_time) / peak_FLOPS`
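As an example, here is a minimal sketch of that bookkeeping (`report` is a hypothetical helper, not the repo's actual harness), timing a kernel with CUDA events and reporting the fraction of the 5.501 TFLOPS roof:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: converts a measured kernel time for an N x N matmul
// into achieved TFLOPS and the percentage of the 5.501 TFLOPS FP32 roof.
void report(float ms, int n) {
    double flop = 2.0 * n * n * n;              // 2*N^3 FLOP for an N x N matmul
    double tflops = flop / (ms * 1e-3) / 1e12;  // achieved TFLOPS
    printf("%.3f ms -> %.3f TFLOPS (%.1f%% of roof)\n",
           ms, tflops, 100.0 * tflops / 5.501);
}

// Usage with CUDA event timing around a kernel launch:
//   cudaEvent_t t0, t1;
//   cudaEventCreate(&t0); cudaEventCreate(&t1);
//   cudaEventRecord(t0);
//   matmulKernel<<<grid, block>>>(...);   // kernel under test
//   cudaEventRecord(t1);
//   cudaEventSynchronize(t1);
//   float ms; cudaEventElapsedTime(&ms, t0, t1);
//   report(ms, 1024);                     // e.g. 0.802 ms -> ~48.7% for CUBLAS
```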
Build and profiling commands:

```bash
# compile with cuBLAS linked in (RTX 3050 Mobile is compute capability 8.6;
# sm_80 code also runs on it via binary compatibility)
nvcc matmul.cu -o matmul -arch=sm_80 -lcublas

# timeline trace plus per-kernel summary stats
nsys profile -o nsys_matmul --stats=true ./matmul

# detailed per-kernel metrics (without locking the clocks)
ncu -o ncu_matmul -f --clock-control none ./matmul

# open the report in the Nsight Compute GUI
ncu-ui ncu_matmul.ncu-rep
```
CUBLAS takes 0.802 ms (it dispatches `ampere_sgemm_128x64_nn`, as seen in the nsys data). That works out to 2 * 1024^3 FLOP / 0.802 ms ≈ 2.68 TFLOPS, i.e. CUBLAS reaches roughly 50% of the roofline performance of this GPU.
For 2D register tiling, the number of threads per block is a function of the shared-memory block-tile size and the register-tile size (threads per block = block-tile elements / register-tile elements). Control these sizes carefully to land at 128-256 threads per block, which is usually a sweet spot for CUDA; see the sketch below.
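As an illustration, here is a minimal 2D register-tiling sketch (assumed tile sizes, not the exact kernel benchmarked below): a block tile of BM = BN = 128, BK = 8 with a register tile of TM = TN = 8 gives (128/8) * (128/8) = 256 threads per block. It assumes M, N, K are divisible by the tile sizes, which holds for 1024.

```cuda
#include <cuda_runtime.h>

constexpr int BM = 128, BN = 128, BK = 8; // shared-memory block tile
constexpr int TM = 8, TN = 8;             // register tile per thread

// Computes C = A * B for row-major float matrices.
// Threads per block: (BM / TM) * (BN / TN) = 256.
__global__ void sgemm2DTiling(int M, int N, int K,
                              const float *A, const float *B, float *C) {
    __shared__ float sA[BM][BK];
    __shared__ float sB[BK][BN];

    const int tid = threadIdx.x;
    const int tRow = tid / (BN / TN); // which TM x TN sub-tile this thread owns
    const int tCol = tid % (BN / TN);

    // advance the matrix pointers to this block's tile
    A += blockIdx.y * BM * K;
    B += blockIdx.x * BN;
    C += blockIdx.y * BM * N + blockIdx.x * BN;

    // indices for the cooperative shared-memory loads
    const int nThreads = (BM / TM) * (BN / TN);
    const int aRow = tid / BK, aCol = tid % BK, aStride = nThreads / BK;
    const int bRow = tid / BN, bCol = tid % BN, bStride = nThreads / BN;

    float acc[TM][TN] = {{0.0f}};
    float regA[TM], regB[TN];

    for (int k0 = 0; k0 < K; k0 += BK) {
        // stage one BM x BK tile of A and one BK x BN tile of B
        for (int r = 0; r < BM; r += aStride)
            sA[aRow + r][aCol] = A[(aRow + r) * K + aCol];
        for (int r = 0; r < BK; r += bStride)
            sB[bRow + r][bCol] = B[(bRow + r) * N + bCol];
        __syncthreads();
        A += BK;
        B += BK * N;

        // outer-product accumulation in registers
        for (int k = 0; k < BK; ++k) {
            for (int i = 0; i < TM; ++i) regA[i] = sA[tRow * TM + i][k];
            for (int j = 0; j < TN; ++j) regB[j] = sB[k][tCol * TN + j];
            for (int i = 0; i < TM; ++i)
                for (int j = 0; j < TN; ++j)
                    acc[i][j] += regA[i] * regB[j];
        }
        __syncthreads();
    }

    // write back this thread's TM x TN result tile
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            C[(tRow * TM + i) * N + tCol * TN + j] = acc[i][j];
}

// Launch (M, N, K divisible by the tile sizes):
//   dim3 grid(N / BN, M / BM);
//   sgemm2DTiling<<<grid, (BM / TM) * (BN / TN)>>>(M, N, K, dA, dB, dC);
```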
KERNEL | TIME (ms) | AGAINST_CUBLAS (%)
---|---|---
naiveMM | 71.00 | 1.13
coalescedMM | 8.04 | 10.02
sharedMM | 6.54 | 12.26
1D register tiling | 2.064 | 38.85
2D register tiling (no sA transpose) | 1.394 | 57.50
2D register tiling (sA transpose) | 1.737 | 46.17
sharedMM+2Dblocktiling+FLOAT4 | 1.105 | 72.50
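The FLOAT4 variant in the last row vectorizes the global-memory traffic: four consecutive floats are fetched with a single 128-bit load. A minimal sketch of the idea (the helper name is illustrative, not from this repo); it assumes 16-byte-aligned pointers and row lengths divisible by 4:

```cuda
// Illustrative only: reinterpret four consecutive floats as one float4 so the
// compiler can emit a single 128-bit load instead of four 32-bit loads.
__device__ __forceinline__ float4 loadFloat4(const float *p) {
    return *reinterpret_cast<const float4 *>(p); // p must be 16-byte aligned
}

// e.g. inside the cooperative shared-memory load of the tiled kernel:
//   float4 v = loadFloat4(&A[(aRow + r) * K + aCol]);
//   sA[aRow + r][aCol + 0] = v.x;
//   sA[aRow + r][aCol + 1] = v.y;
//   sA[aRow + r][aCol + 2] = v.z;
//   sA[aRow + r][aCol + 3] = v.w;
```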