A progressive, hands-on learning project for mastering GPU programming with CUDA and its ecosystem.
This project takes you from "what's a GPU?" to writing production-quality CUDA code. Each module builds on the previous, with exercises that reinforce concepts through practice.
Hardware: NVIDIA GPU (any recent GeForce/RTX or data center GPU)
Software:
CUDA Toolkit 12.x
CMake 3.20+
C++17 compiler
Python 3.10+ (for Triton modules)
Knowledge: Basic C/C++, some linear algebra helps
# Clone and setup
cd ~/code/cuda-learning-lab
# Check CUDA installation
nvcc --version
nvidia-smi
# Build all exercises
mkdir build && cd build
cmake ..
make -j$(nproc)
🟢 Level 1: Foundations (Start Here)

| Module | Topic | Exercises |
|--------|-------|-----------|
| 01 | GPU Architecture | CPU vs GPU, SMs, warps, memory hierarchy |
| 02 | First Kernel | Hello world, thread indexing, grid/block |
| 03 | Memory Basics | Global, shared, local memory |
| 04 | Thread Synchronization | __syncthreads, atomics, barriers |
| 05 | Error Handling | CUDA error checking, debugging |
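
As a taste of what Module 05 covers, here is a minimal sketch of an error-checking wrapper. The macro name `CUDA_CHECK` and the helper `launch_checked` are hypothetical, chosen for illustration; the actual contents of `common/cuda_utils.h` may differ.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the failing call's location and abort.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error %s at %s:%d: %s\n",      \
                         cudaGetErrorName(err_), __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Trivial kernel used only to demonstrate the checks.
__global__ void noop_kernel() {}

// Launches the kernel and checks both failure modes.
// Returns true on success; any failure exits before the return.
bool launch_checked() {
    noop_kernel<<<1, 32>>>();
    CUDA_CHECK(cudaGetLastError());       // invalid launch configuration
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran
    return true;
}
```

Kernel launches do not return an error code themselves, which is why the sketch checks `cudaGetLastError()` right after the launch and then `cudaDeviceSynchronize()` for errors that only surface during execution.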

| Module | Topic | Exercises |
|--------|-------|-----------|
| 06 | Memory Coalescing | Access patterns, bandwidth optimization |
| 07 | Shared Memory | Bank conflicts, tiling, cache usage |
| 08 | Occupancy | Registers, blocks, theoretical vs achieved |
| 09 | Streams & Async | Concurrent kernels, overlap compute/transfer |
| 10 | Unified Memory | Managed memory, page migration |

| Module | Topic | Exercises |
|--------|-------|-----------|
| 11 | Thrust | STL-like parallel algorithms |
| 12 | CUB | Block/warp/device primitives |
| 13 | cuBLAS | Dense linear algebra (GEMM, etc.) |
| 14 | cuSPARSE | Sparse matrix operations |
| 15 | cuRAND | Random number generation |
| 16 | cuFFT | Fast Fourier transforms |
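
To illustrate what "STL-like" means for Thrust (Module 11), the sketch below sums the integers 1..n on the GPU without writing a kernel. The function name `sum_1_to_n` is a hypothetical example, not part of the exercises.

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>

// Hypothetical example: sum 1 + 2 + ... + n on the device.
long long sum_1_to_n(int n) {
    thrust::device_vector<long long> v(n);            // storage in GPU global memory
    thrust::sequence(v.begin(), v.end(), 1LL);        // fill with 1, 2, ..., n
    return thrust::reduce(v.begin(), v.end(), 0LL);   // parallel sum, result copied to host
}
```

Thrust is header-only and ships with the CUDA Toolkit, so a file containing this function should need no extra link flags.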

| Module | Topic | Exercises |
|--------|-------|-----------|
| 17 | cuDNN | Convolution, pooling, batch norm |
| 18 | CUTLASS | Template library for GEMM |
| 19 | Tensor Cores | FP16/BF16/INT8 matrix ops |
| 20 | Custom DL Ops | Write your own PyTorch extension |
⚫ Level 5: Modern GPU Programming

| Module | Topic | Exercises |
|--------|-------|-----------|
| 21 | Triton | Python-based kernel programming |
| 22 | CUDA Graphs | Capture and replay workflows |
| 23 | Multi-GPU | NCCL, peer-to-peer, data parallelism |
| 24 | Profiling | Nsight, NCU, optimization workflow |

| Project | Description | Modules Used |
|---------|-------------|--------------|
| P1 | Vector Add → MatMul | 01-07 |
| P2 | Parallel Reduction | 06-08, 12 |
| P3 | Flash Attention | 07, 12, 18 |
| P4 | Custom Triton Kernel | 21 |
| P5 | Titans Memory Layer | 07, 12, 21 |
cuda-learning-lab/
├── README.md
├── CMakeLists.txt
├── modules/
│   ├── 01-gpu-architecture/
│   │   ├── README.md          # Concept explanation
│   │   ├── exercises/         # Hands-on exercises
│   │   └── solutions/         # Reference solutions
│   ├── 02-first-kernel/
│   └── ...
├── projects/
│   ├── p1-matmul/
│   └── ...
├── common/
│   ├── cuda_utils.h           # Error checking, timing
│   └── test_utils.h           # Verification helpers
└── scripts/
    └── profile.sh             # Profiling shortcuts
README.md — Concept explanation with diagrams
exercises/ — Progressive coding challenges
solutions/ — Reference implementations (peek only when stuck!)
tests/ — Verify your solutions
# 1. Read the module README
cd modules/02-first-kernel
cat README.md
# 2. Attempt exercises in order
cd exercises
nvcc 01_hello_gpu.cu -o hello && ./hello
# 3. Run tests to verify
cd ../tests
./run_tests.sh
# 4. Check solution only if stuck
cat ../solutions/01_hello_gpu.cu
# Basic compilation
nvcc kernel.cu -o kernel
# With optimizations, targeting compute capability 8.0 (adjust -arch for your GPU)
nvcc -O3 -arch=sm_80 kernel.cu -o kernel
# With Thrust
nvcc -std=c++17 thrust_example.cu -o thrust_example
# With cuBLAS
nvcc cublas_example.cu -lcublas -o cublas_example
# Generate PTX (inspect the intermediate assembly)
nvcc -ptx kernel.cu
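
The `-arch=sm_80` flag above only suits GPUs with compute capability 8.0. As a rough sketch, the matching flag for an installed GPU can be derived from the runtime API; the function name `arch_flag_for_device` is a hypothetical helper, not part of this repository.

```cuda
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper: returns e.g. "sm_86" for the given device,
// or an empty string if the device cannot be queried.
std::string arch_flag_for_device(int device = 0) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return "";
    }
    // Compute capability major.minor maps to sm_<major><minor>.
    return "sm_" + std::to_string(prop.major) + std::to_string(prop.minor);
}
```

Printing the returned string from a small `main` and passing it to `nvcc -arch=` is one way to avoid building for the wrong architecture.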
Thread Indexing Cheat Sheet
// 1D grid, 1D block
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// 2D grid, 2D block
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
// Global thread count (1D launch)
int total_threads = gridDim.x * blockDim.x;
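
To show the 1D index and the global thread count working together, here is a minimal sketch of a vector add using a grid-stride loop. The names `vector_add` and `vector_add_host` are illustrative, the inputs are assumed to have equal length, and error checking is omitted for brevity.

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Each thread starts at its global index and strides by the total thread
// count, so any grid size covers all n elements.
__global__ void vector_add(const float* a, const float* b, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total_threads = gridDim.x * blockDim.x;
    for (int i = idx; i < n; i += total_threads) {
        out[i] = a[i] + b[i];
    }
}

// Hypothetical host wrapper; assumes a.size() == b.size().
std::vector<float> vector_add_host(const std::vector<float>& a,
                                   const std::vector<float>& b) {
    int n = static_cast<int>(a.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    // Round up so every element is covered; cap the grid and let the
    // stride loop handle the remainder.
    int blocks = std::min((n + threads - 1) / threads, 1024);
    vector_add<<<blocks, threads>>>(d_a, d_b, d_out, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return out;
}
```

The loop bound `i < n` doubles as the bounds check that a one-element-per-thread kernel would need whenever n is not a multiple of the block size.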
Registers      → Per thread → ~1 cycle      → Up to 255 per thread
Shared Memory  → Per block  → ~20-30 cycles → 48-164 KB
L1 Cache       → Per SM     → ~30 cycles    → 128 KB
L2 Cache       → Global     → ~200 cycles   → 6-50 MB
Global Memory  → Device     → ~400 cycles   → 8-80 GB
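
The capacities above vary by GPU generation, so it can help to query the actual limits of the installed device. The following is a small sketch using `cudaGetDeviceProperties`; the function name `print_memory_limits` is a hypothetical helper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints capacity limits for one device.
// Returns false if the device cannot be queried.
bool print_memory_limits(int device = 0) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    std::printf("Device:                   %s\n", prop.name);
    std::printf("32-bit registers / block: %d\n", prop.regsPerBlock);
    std::printf("Shared memory / block:    %zu KB\n", prop.sharedMemPerBlock / 1024);
    std::printf("Shared memory / SM:       %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    std::printf("L2 cache:                 %d KB\n", prop.l2CacheSize / 1024);
    std::printf("Global memory:            %zu MB\n", prop.totalGlobalMem >> 20);
    return true;
}
```

Latencies are not exposed through this API; those figures are usually measured with microbenchmarks or a profiler.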
Built for learning 🧠 by Clawd 🦀