
hoshuaclawdbot/cuda-learning-lab


CUDA Learning Lab 🚀

A progressive, hands-on learning project for mastering GPU programming with CUDA and its ecosystem.

Overview

This project takes you from "what's a GPU?" to writing production-quality CUDA code. Each module builds on the previous, with exercises that reinforce concepts through practice.

Prerequisites

  • Hardware: NVIDIA GPU (any recent GeForce/RTX or data center GPU)
  • Software:
    • CUDA Toolkit 12.x
    • CMake 3.20+
    • C++17 compiler
    • Python 3.10+ (for Triton modules)
  • Knowledge: Basic C/C++, some linear algebra helps

Installation

# Enter your local clone of the repository (the path below is an example)
cd ~/code/cuda-learning-lab

# Check CUDA installation
nvcc --version
nvidia-smi

# Build all exercises
mkdir build && cd build
cmake ..
make -j$(nproc)
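Beyond `nvcc --version` and `nvidia-smi`, it can help to confirm that the CUDA runtime itself sees your GPU. The following is a minimal sketch, not part of this repository: the function name `print_cuda_devices` is hypothetical, and it only uses the runtime calls `cudaGetDeviceCount` and `cudaGetDeviceProperties`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints one line per visible CUDA device.
// Returns the device count, or -1 if the runtime reports an error.
int print_cuda_devices() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return -1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
        std::printf("Device %d: %s, compute capability %d.%d, %d SMs, %.1f GiB\n",
                    i, prop.name, prop.major, prop.minor,
                    prop.multiProcessorCount,
                    prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return count;
}
```

Call `print_cuda_devices()` from a `main` in a `.cu` file and compile it with `nvcc`. The compute capability it reports (for example 8.0) tells you which `-arch=sm_XX` value matches your GPU.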

Curriculum

🟢 Level 1: Foundations (Start Here)

| Module | Topic | Exercises |
|--------|-------|-----------|
| 01 | GPU Architecture | CPU vs GPU, SMs, warps, memory hierarchy |
| 02 | First Kernel | Hello world, thread indexing, grid/block |
| 03 | Memory Basics | Global, shared, local memory |
| 04 | Thread Synchronization | __syncthreads, atomics, barriers |
| 05 | Error Handling | CUDA error checking, debugging |
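To give a feel for what Level 1 builds toward, here is a rough sketch that combines a first kernel (module 02) with basic error checking (module 05). The `CUDA_CHECK` macro and the function names are hypothetical illustrations; they are not taken from `common/cuda_utils.h` or from the module solutions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with file and line on any CUDA error.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Each thread writes its own global index into out[idx].
__global__ void write_indices(int* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = idx;  // guard: the last block may be partly idle
}

// Launches the kernel for n elements and copies the result back to the host.
std::vector<int> run_write_indices(int n) {
    std::vector<int> host(n > 0 ? n : 0);
    if (n <= 0) return host;

    int* dev = nullptr;
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(int)));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // round up
    write_indices<<<blocks, threads>>>(dev, n);
    CUDA_CHECK(cudaGetLastError());       // catches invalid launch configurations
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(host.data(), dev, n * sizeof(int),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev));
    return host;
}
```

Kernel launches do not return an error code, which is why the sketch checks `cudaGetLastError()` right after the launch and again synchronizes before trusting the result.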

🟡 Level 2: Core CUDA

| Module | Topic | Exercises |
|--------|-------|-----------|
| 06 | Memory Coalescing | Access patterns, bandwidth optimization |
| 07 | Shared Memory | Bank conflicts, tiling, cache usage |
| 08 | Occupancy | Registers, blocks, theoretical vs achieved |
| 09 | Streams & Async | Concurrent kernels, overlap compute/transfer |
| 10 | Unified Memory | Managed memory, page migration |
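As a taste of modules 06 and 07, the sketch below transposes a matrix through a shared-memory tile so that both the global reads and the global writes are coalesced. The extra padding column is a common way to avoid shared-memory bank conflicts. All names here (`TILE`, `transpose_tiled`, `transpose`) are hypothetical, and error checking is left out to keep the example short.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;

// Transposes a row-major rows x cols matrix `in` into the cols x rows matrix `out`.
__global__ void transpose_tiled(const float* in, float* out, int rows, int cols) {
    // One padding column shifts each tile row to a different bank,
    // so reading the tile column-wise does not serialize.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // input column
    int y = blockIdx.y * TILE + threadIdx.y;  // input row
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];  // coalesced read
    __syncthreads();  // every thread in the block reaches this barrier

    x = blockIdx.y * TILE + threadIdx.x;  // output column (an input row)
    y = blockIdx.x * TILE + threadIdx.y;  // output row (an input column)
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}

// Host wrapper; error checking omitted for brevity.
std::vector<float> transpose(const std::vector<float>& in, int rows, int cols) {
    std::vector<float> out(in.size());
    if (in.empty()) return out;
    const size_t bytes = in.size() * sizeof(float);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transpose_tiled<<<grid, block>>>(d_in, d_out, rows, cols);

    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

A naive transpose reads rows and writes columns, so one of the two accesses is strided. Staging the data in shared memory moves the strided access on-chip, where it is far cheaper.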

🟠 Level 3: Libraries

| Module | Topic | Exercises |
|--------|-------|-----------|
| 11 | Thrust | STL-like parallel algorithms |
| 12 | CUB | Block/warp/device primitives |
| 13 | cuBLAS | Dense linear algebra (GEMM, etc.) |
| 14 | cuSPARSE | Sparse matrix operations |
| 15 | cuRAND | Random number generation |
| 16 | cuFFT | Fast Fourier transforms |
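To show what "STL-like" means for Thrust in practice, here is a small sketch that sorts and sums a vector on the GPU without writing a kernel. The function name `sort_and_sum` is hypothetical; the Thrust calls (`thrust::device_vector`, `thrust::sort`, `thrust::reduce`, `thrust::copy`) ship with the CUDA Toolkit.

```cuda
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>

// Sorts `values` in place (on the GPU) and returns the sum of its elements.
int sort_and_sum(std::vector<int>& values) {
    // Constructing a device_vector from host iterators copies host -> device.
    thrust::device_vector<int> d(values.begin(), values.end());

    thrust::sort(d.begin(), d.end());
    int total = thrust::reduce(d.begin(), d.end(), 0, thrust::plus<int>());

    // Copy the sorted data back, device -> host.
    thrust::copy(d.begin(), d.end(), values.begin());
    return total;
}
```

The file still needs a `.cu` extension and `nvcc`, but no extra link flags, since Thrust is header-only.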

🔴 Level 4: Deep Learning

| Module | Topic | Exercises |
|--------|-------|-----------|
| 17 | cuDNN | Convolution, pooling, batch norm |
| 18 | CUTLASS | Template library for GEMM |
| 19 | Tensor Cores | FP16/BF16/INT8 matrix ops |
| 20 | Custom DL Ops | Write your own PyTorch extension |

⚫ Level 5: Modern GPU Programming

| Module | Topic | Exercises |
|--------|-------|-----------|
| 21 | Triton | Python-based kernel programming |
| 22 | CUDA Graphs | Capture and replay workflows |
| 23 | Multi-GPU | NCCL, peer-to-peer, data parallelism |
| 24 | Profiling | Nsight, NCU, optimization workflow |

🌟 Projects

| Project | Description | Modules Used |
|---------|-------------|--------------|
| P1 | Vector Add → MatMul | 01-07 |
| P2 | Parallel Reduction | 06-08, 12 |
| P3 | Flash Attention | 07, 12, 18 |
| P4 | Custom Triton Kernel | 21 |
| P5 | Titans Memory Layer | 07, 12, 21 |

Directory Structure

cuda-learning-lab/
├── README.md
├── CMakeLists.txt
├── modules/
│   ├── 01-gpu-architecture/
│   │   ├── README.md          # Concept explanation
│   │   ├── exercises/         # Hands-on exercises
│   │   ├── solutions/         # Reference solutions
│   │   └── tests/             # Tests that verify your solutions
│   ├── 02-first-kernel/
│   └── ...
├── projects/
│   ├── p1-matmul/
│   └── ...
├── common/
│   ├── cuda_utils.h           # Error checking, timing
│   └── test_utils.h           # Verification helpers
└── scripts/
    └── profile.sh             # Profiling shortcuts

How to Use

Each Module Contains:

  1. README.md — Concept explanation with diagrams
  2. exercises/ — Progressive coding challenges
  3. solutions/ — Reference implementations (peek only when stuck!)
  4. tests/ — Verify your solutions

Workflow:

# 1. Read the module README
cd modules/02-first-kernel
cat README.md

# 2. Attempt exercises in order
cd exercises
nvcc 01_hello_gpu.cu -o hello && ./hello

# 3. Run tests to verify
cd ../tests
./run_tests.sh

# 4. Check solution only if stuck
cat ../solutions/01_hello_gpu.cu

Quick Reference

Compile Commands

# Basic compilation
nvcc kernel.cu -o kernel

# With optimizations (sm_80 targets Ampere, e.g. A100; set -arch to match your GPU)
nvcc -O3 -arch=sm_80 kernel.cu -o kernel

# With Thrust
nvcc -std=c++17 thrust_example.cu -o thrust_example

# With cuBLAS
nvcc cublas_example.cu -lcublas -o cublas_example

# Generate PTX, NVIDIA's intermediate assembly (writes kernel.ptx)
nvcc -ptx kernel.cu

Thread Indexing Cheat Sheet

// 1D grid, 1D block
int idx = blockIdx.x * blockDim.x + threadIdx.x;

// 2D grid, 2D block
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

// Total threads in a 1D launch (also the stride of a grid-stride loop)
int total_threads = gridDim.x * blockDim.x;
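The 1D index and the total thread count combine naturally into a grid-stride loop, which lets a fixed-size launch handle any array length. The sketch below is an illustration under that assumption; `saxpy` and `run_saxpy` are hypothetical names, `x` and `y` are assumed to have the same length, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// y[i] = a * x[i] + y[i] for all i in [0, n), using a grid-stride loop.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total_threads = gridDim.x * blockDim.x;
    // Each thread starts at its own index and jumps ahead by the grid size,
    // so the launch does not need one thread per element.
    for (int i = idx; i < n; i += total_threads)
        y[i] = a * x[i] + y[i];
}

// Host wrapper: returns a * x + y. Assumes x.size() == y.size().
std::vector<float> run_saxpy(float a, const std::vector<float>& x,
                             std::vector<float> y) {
    const int n = static_cast<int>(x.size());
    if (n == 0) return y;
    const size_t bytes = n * sizeof(float);

    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);

    // Deliberately small launch (2 blocks of 128 threads) to exercise the loop.
    saxpy<<<2, 128>>>(n, a, d_x, d_y);

    // A blocking cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return y;
}
```

In real code you would size the grid from the device's SM count rather than hard-coding 2 blocks; the tiny launch here only makes the striding visible.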

Memory Hierarchy

Memory        → Scope          → Latency (approx.) → Capacity (varies by GPU)
Registers     → Per thread     → ~1 cycle          → Up to 255 registers per thread
Shared Memory → Per block      → ~20-30 cycles     → 48-164 KB
L1 Cache      → Per SM         → ~30 cycles        → 128 KB
L2 Cache      → Device-wide    → ~200 cycles       → 6-50 MB
Global Memory → Device         → ~400 cycles       → 8-80 GB
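A block-level sum shows how these tiers are typically used together: each thread accumulates in a register, the block combines its values in shared memory, and only one partial result per block goes back to global memory. This is a simplified sketch with hypothetical names (`BLOCK`, `block_sum`, `gpu_sum`), not the reference solution for the reduction project; it assumes the block size is a power of two and omits error checking.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // threads per block; must be a power of two

// Each block sums a strided slice of `in` and writes one value to partial[blockIdx.x].
__global__ void block_sum(const float* in, float* partial, int n) {
    __shared__ float smem[BLOCK];  // shared memory: visible to the whole block

    float acc = 0.0f;              // register: private to this thread
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        acc += in[i];              // global memory read
    smem[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = smem[0];  // one global write per block
}

// Host wrapper: the per-block partial sums are added up on the CPU.
float gpu_sum(const std::vector<float>& data) {
    const int n = static_cast<int>(data.size());
    if (n == 0) return 0.0f;
    int blocks = (n + BLOCK - 1) / BLOCK;
    if (blocks > 64) blocks = 64;  // the grid-stride loop covers the remainder

    float *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, BLOCK>>>(d_in, d_partial, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

Most of the additions happen in registers and shared memory, so the slow global memory is touched once per input element and once per block on the way out.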

Resources


Built for learning 🧠 by Clawd 🦀

