CUDA Learning Lab 🚀

A progressive, hands-on learning project for mastering GPU programming with CUDA and its ecosystem.

Overview

This project takes you from "what's a GPU?" to writing production-quality CUDA code. Each module builds on the previous one, with exercises that reinforce the concepts through practice.

Prerequisites

Hardware: NVIDIA GPU (any recent GeForce/RTX or data center GPU)
Software:
- CUDA Toolkit 12.x
- CMake 3.20+
- C++17 compiler
- Python 3.10+ (for Triton modules)
Knowledge: Basic C/C++, some linear algebra helps

Installation

# Enter your local clone of the repository
cd ~/code/cuda-learning-lab

# Check CUDA installation
nvcc --version
nvidia-smi

# Build all exercises
mkdir build && cd build
cmake ..
make -j$(nproc)

Curriculum

🟢 Level 1: Foundations (Start Here)

Module	Topic	Exercises
01	GPU Architecture	CPU vs GPU, SMs, warps, memory hierarchy
02	First Kernel	Hello world, thread indexing, grid/block
03	Memory Basics	Global, shared, local memory
04	Thread Synchronization	__syncthreads, atomics, barriers
05	Error Handling	CUDA error checking, debugging

🟡 Level 2: Core CUDA

Module	Topic	Exercises
06	Memory Coalescing	Access patterns, bandwidth optimization
07	Shared Memory	Bank conflicts, tiling, cache usage
08	Occupancy	Registers, blocks, theoretical vs achieved
09	Streams & Async	Concurrent kernels, overlap compute/transfer
10	Unified Memory	Managed memory, page migration

🟠 Level 3: Libraries

Module	Topic	Exercises
11	Thrust	STL-like parallel algorithms
12	CUB	Block/warp/device primitives
13	cuBLAS	Dense linear algebra (GEMM, etc.)
14	cuSPARSE	Sparse matrix operations
15	cuRAND	Random number generation
16	cuFFT	Fast Fourier transforms

🔴 Level 4: Deep Learning

Module	Topic	Exercises
17	cuDNN	Convolution, pooling, batch norm
18	CUTLASS	Template library for GEMM
19	Tensor Cores	FP16/BF16/INT8 matrix ops
20	Custom DL Ops	Write your own PyTorch extension

⚫ Level 5: Modern GPU Programming

Module	Topic	Exercises
21	Triton	Python-based kernel programming
22	CUDA Graphs	Capture and replay workflows
23	Multi-GPU	NCCL, peer-to-peer, data parallelism
24	Profiling	Nsight Systems, Nsight Compute (ncu), optimization workflow

🌟 Projects

Project	Description	Modules Used
P1	Vector Add → MatMul	01-07
P2	Parallel Reduction	06-08, 12
P3	Flash Attention	07, 12, 18
P4	Custom Triton Kernel	21
P5	Titans Memory Layer	07, 12, 21

Directory Structure

cuda-learning-lab/
├── README.md
├── CMakeLists.txt
├── modules/
│   ├── 01-gpu-architecture/
│   │   ├── README.md          # Concept explanation
│   │   ├── exercises/         # Hands-on exercises
│   │   ├── solutions/         # Reference solutions
│   │   └── tests/             # Verify your solutions
│   ├── 02-first-kernel/
│   └── ...
├── projects/
│   ├── p1-matmul/
│   └── ...
├── common/
│   ├── cuda_utils.h           # Error checking, timing
│   └── test_utils.h           # Verification helpers
└── scripts/
    └── profile.sh             # Profiling shortcuts

How to Use

Each Module Contains:

README.md — Concept explanation with diagrams
exercises/ — Progressive coding challenges
solutions/ — Reference implementations (peek only when stuck!)
tests/ — Verify your solutions

Workflow:

# 1. Read the module README
cd modules/02-first-kernel
cat README.md

# 2. Attempt exercises in order
cd exercises
nvcc 01_hello_gpu.cu -o hello && ./hello

# 3. Run tests to verify
cd ../tests
./run_tests.sh

# 4. Check solution only if stuck
cat ../solutions/01_hello_gpu.cu

Quick Reference

Compile Commands

# Basic compilation
nvcc kernel.cu -o kernel

# With optimizations (sm_80 targets A100-class GPUs; set -arch to match your GPU)
nvcc -O3 -arch=sm_80 kernel.cu -o kernel

# With Thrust
nvcc -std=c++17 thrust_example.cu -o thrust_example

# With cuBLAS
nvcc cublas_example.cu -lcublas -o cublas_example

# Generate PTX (intermediate assembly, not final machine code)
nvcc -ptx kernel.cu

Thread Indexing Cheat Sheet

// 1D grid, 1D block
int idx = blockIdx.x * blockDim.x + threadIdx.x;

// 2D grid, 2D block
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

// Total threads in a 1D launch (also the stride for a grid-stride loop)
int total_threads = gridDim.x * blockDim.x;
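
The index formulas above can be checked on the CPU before touching a GPU. The following is a minimal host-side C++ sketch, not code from this repository: `blocks_for`, `global_index`, `flat_index`, and `count_visits` are hypothetical helper names chosen for illustration. It emulates a 1D launch to show that ceil-division grid sizing plus a bounds guard visits each element exactly once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed so that blocks * block_dim >= n (ceil division).
// Hypothetical helper; in a real launch this value would be the grid size.
inline std::size_t blocks_for(std::size_t n, std::size_t block_dim) {
    assert(block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}

// Host-side mirror of: blockIdx.x * blockDim.x + threadIdx.x
inline std::size_t global_index(std::size_t block_idx, std::size_t block_dim,
                                std::size_t thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Row-major flattening of a (row, col) pair, as used with the 2D formulas.
inline std::size_t flat_index(std::size_t row, std::size_t col,
                              std::size_t width) {
    return row * width + col;
}

// Emulates a 1D launch over n elements and counts how often each element
// is visited. The `idx < n` guard plays the role of the bounds check a
// kernel needs when n is not a multiple of block_dim.
inline std::vector<int> count_visits(std::size_t n, std::size_t block_dim) {
    std::vector<int> visits(n, 0);
    const std::size_t grid_dim = blocks_for(n, block_dim);
    for (std::size_t b = 0; b < grid_dim; ++b) {
        for (std::size_t t = 0; t < block_dim; ++t) {
            const std::size_t idx = global_index(b, block_dim, t);
            if (idx < n) {
                ++visits[idx];
            }
        }
    }
    return visits;
}
```

For n = 1000 and a block size of 256, `blocks_for` gives 4 blocks (1024 threads), and the guard leaves the last 24 threads idle.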

Memory Hierarchy

Level         → Scope          → Latency (approx.) → Capacity (varies by GPU)
Registers     → Per thread     → ~1 cycle          → Up to 255 per thread
Shared Memory → Per block      → ~20-30 cycles     → 48-164 KB
L1 Cache      → Per SM         → ~30 cycles        → 128 KB
L2 Cache      → Global         → ~200 cycles       → 6-50 MB
Global Memory → Device         → ~400 cycles       → 8-80 GB
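
The shared-memory capacity in the table limits how large a tile a block can stage. The following is a rough host-side C++ sketch under stated assumptions: a tiled matrix multiply that keeps two square tiles per block in shared memory, and the 48 KB lower bound from the table as the budget. `kSharedBytesPerBlock`, `tile_bytes`, `fits_in_shared`, and `max_pow2_tile_dim` are hypothetical names, and real limits depend on the GPU and launch configuration.

```cpp
#include <cassert>
#include <cstddef>

// Assumed budget: the lower bound of the shared-memory range in the table.
constexpr std::size_t kSharedBytesPerBlock = 48 * 1024;

// Bytes needed for `num_tiles` square tiles of tile_dim x tile_dim elements,
// each `elem_size` bytes wide.
constexpr std::size_t tile_bytes(std::size_t tile_dim, std::size_t elem_size,
                                 std::size_t num_tiles) {
    return tile_dim * tile_dim * elem_size * num_tiles;
}

// True if the tiles fit within the assumed per-block budget.
constexpr bool fits_in_shared(std::size_t tile_dim, std::size_t elem_size,
                              std::size_t num_tiles) {
    return tile_bytes(tile_dim, elem_size, num_tiles) <= kSharedBytesPerBlock;
}

// Largest power-of-two tile dimension whose tiles still fit in the budget.
// Assumes that at least a 1 x 1 tile fits.
inline std::size_t max_pow2_tile_dim(std::size_t elem_size,
                                     std::size_t num_tiles) {
    assert(elem_size > 0 && num_tiles > 0);
    assert(fits_in_shared(1, elem_size, num_tiles));
    std::size_t dim = 1;
    while (fits_in_shared(dim * 2, elem_size, num_tiles)) {
        dim *= 2;
    }
    return dim;
}
```

Two 32 x 32 float tiles need 8 KB, well inside the budget. A 32 x 32 tile also matches the 1024-threads-per-block limit when each thread loads one element, which is one reason that size is a common starting point.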

Resources

Built for learning 🧠 by Clawd 🦀
