Complete implementation progression from PyTorch to optimized CUDA: A step-by-step journey through 5 versions
This project implements a simple 2-layer MLP (Multi-Layer Perceptron) for MNIST digit classification, progressively optimizing from high-level PyTorch to low-level CUDA implementations. Each version demonstrates different optimization techniques and trade-offs between performance and complexity.
Architecture: 784 → 256 → 10 (input → hidden → output)
Dataset: 10,000 MNIST training samples, batch size 8, 10 epochs
Activation: ReLU, Loss: Cross-entropy, Optimizer: SGD (lr=0.01)
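For reference, these settings map onto constants like the following (identifier names are illustrative, not necessarily those used in the source files):

```cuda
// Illustrative constants matching the setup above (names are hypothetical).
#define INPUT_SIZE    784     // 28x28 MNIST pixels, flattened
#define HIDDEN_SIZE   256
#define OUTPUT_SIZE   10      // digit classes 0-9
#define BATCH_SIZE    8
#define TRAIN_SIZE    10000   // training samples used
#define EPOCHS        10
#define LEARNING_RATE 0.01f
```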
DISCLAIMER: Ensure you have a GPU with compute capability 5.0 or higher (Maxwell architecture or newer). See the compatibility guide: https://docs.nvidia.com/deeplearning/cudnn/latest/reference/support-matrix.html
```bash
git clone https://github.com/Infatoshi/mnist-cuda
cd mnist-cuda
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Download MNIST data (if not already present)
python downloader.py
```

For the CUDA versions (v4, v5), ensure the NVIDIA CUDA Toolkit is installed:
```bash
# Check CUDA installation
nvcc --version

# Compile v4 (naive CUDA kernels)
nvcc -o v4 v4.cu

# Compile v5 (cuBLAS optimized)
nvcc -o v5 v5.cu -lcublas
```

The five implementations, in order of increasing control over the hardware:

v1: PyTorch baseline (v1.py)
- Framework: PyTorch with CUDA tensors
- Features:
- High-level PyTorch operations (Linear, ReLU, CrossEntropyLoss)
- GPU tensors with automatic memory management
- Built-in optimizations (cuDNN, etc.)
- Purpose: Establishes baseline performance and correctness reference
v2: NumPy CPU (v2.py)
- Framework: Pure NumPy (CPU-only)
- Features:
- Manual forward/backward pass implementation
- Custom gradient computation and weight updates
- He initialization for weights (sketched after this list)
- Purpose: Demonstrates the underlying math without GPU acceleration
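He initialization draws each weight from a zero-mean Gaussian with standard deviation sqrt(2 / fan_in), which keeps activation variance roughly constant through ReLU layers. A minimal host-side sketch of the idea, with an illustrative Box-Muller sampler (not the exact code in v2.py, which uses NumPy):

```cuda
#include <math.h>
#include <stdlib.h>

// Standard-normal sample via Box-Muller; any Gaussian sampler works here.
static float gaussian(void) {
    float u1 = (rand() + 1.0f) / ((float)RAND_MAX + 1.0f);
    float u2 = (rand() + 1.0f) / ((float)RAND_MAX + 1.0f);
    return sqrtf(-2.0f * logf(u1)) * cosf(6.28318531f * u2);
}

// Fill a fan_out x fan_in weight matrix with He-initialized values.
void he_init(float *w, int fan_in, int fan_out) {
    float scale = sqrtf(2.0f / (float)fan_in);
    for (int i = 0; i < fan_in * fan_out; i++) {
        w[i] = scale * gaussian();
    }
}
```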
v3: C CPU (v3.c)
- Framework: Pure C with a per-operation timing breakdown
- Features:
- Manual memory management
- Detailed timing instrumentation per operation (pattern sketched below)
- CPU-optimized matrix operations
- Purpose: Shows CPU performance baseline and prepares for GPU porting
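The timing pattern is simple: read a monotonic clock before and after each operation and accumulate the difference into a per-operation total. A sketch of that pattern, assuming a hypothetical matmul helper (not the exact code in v3.c):

```cuda
#include <time.h>

// Elapsed seconds since an arbitrary fixed point (monotonic clock).
static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

// Usage inside the training loop, with one accumulator per operation:
//   double t0 = now_sec();
//   matmul(hidden, input, w1, BATCH_SIZE, INPUT_SIZE, HIDDEN_SIZE);
//   forward_time += now_sec() - t0;
```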
v4: Naive CUDA (v4.cu)
- Framework: CUDA C with custom kernels
- Features:
- Custom matrix multiplication kernels (example kernel after this list)
- Element-wise operations (ReLU, bias, softmax) on GPU
- Manual memory transfers between host and device
- Purpose: First GPU implementation with basic CUDA kernels
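A naive GEMM kernel assigns one thread per output element; each thread walks the shared dimension on its own, with no tiling or shared-memory reuse. A representative sketch (not necessarily the exact kernel in v4.cu):

```cuda
// C[M x N] = A[M x K] * B[K x N], row-major, one thread per output element.
__global__ void matmul_naive(const float *A, const float *B, float *C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; k++) {
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}

// Launch: 16x16 thread blocks, enough blocks to cover the output matrix.
// dim3 block(16, 16);
// dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
// matmul_naive<<<grid, block>>>(d_A, d_B, d_C, M, N, K);
```

Every thread re-reads its row of A and column of B from global memory, so there is no data reuse; the tiled, shared-memory GEMMs inside cuBLAS (v5) avoid exactly this.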
v5: cuBLAS optimized (v5.cu)
- Framework: CUDA with the cuBLAS library
- Features:
- cuBLAS optimized matrix operations (SGEMM, SAXPY); see the SGEMM sketch after this list
- Persistent memory buffers to reduce allocations
- Minimal host-device synchronization points
- Optimized memory access patterns
- Purpose: Production-quality implementation with maximum performance
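The core of the optimized version is a single cublasSgemm call over device buffers that are allocated once and reused for every batch. cuBLAS assumes column-major storage, so for row-major C arrays the operands are passed swapped, which yields the row-major product directly. A sketch under those assumptions (function and variable names are illustrative):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Handle and device buffers are created once and reused for every batch.
cublasHandle_t handle;
float *d_A, *d_B, *d_C;

void setup(int M, int N, int K) {
    cublasCreate(&handle);
    cudaMalloc(&d_A, M * K * sizeof(float));
    cudaMalloc(&d_B, K * N * sizeof(float));
    cudaMalloc(&d_C, M * N * sizeof(float));
}

// Row-major C[MxN] = A[MxK] * B[KxN]. cuBLAS is column-major, so pass the
// operands swapped: it computes C^T = B^T * A^T, whose memory layout is
// exactly the row-major C.
void gemm_row_major(int M, int N, int K) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, M, K,
                &alpha,
                d_B, N,   // leading dimension = row length of the row-major array
                d_A, K,
                &beta,
                d_C, N);
}

// SGD weight updates can use SAXPY on the same persistent buffers, e.g.:
//   float neg_lr = -0.01f;
//   cublasSaxpy(handle, n, &neg_lr, d_grad, 1, d_weights, 1);  // w += -lr * grad
```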
```bash
# Run each version
python v1.py                       # PyTorch baseline
python v2.py                       # NumPy CPU implementation
gcc -o v3 v3.c -lm && ./v3         # C CPU implementation
nvcc -o v4 v4.cu && ./v4           # Naive CUDA kernels
nvcc -o v5 v5.cu -lcublas && ./v5  # Optimized cuBLAS
```

Expected performance characteristics (timings will vary by hardware):
| Version | Implementation | Typical Training Time | Relative Performance |
|---|---|---|---|
| v1.py | PyTorch CUDA | ~2-5 seconds | 100% (baseline) |
| v2.py | NumPy CPU | ~60-120 seconds | 5-10% of baseline |
| v3.c | C CPU | ~40-80 seconds | 10-15% of baseline |
| v4.cu | Naive CUDA | ~8-15 seconds | 30-60% of baseline |
| v5.cu | cuBLAS CUDA | ~1-3 seconds | 120-200% of baseline |
Each implementation provides detailed timing breakdowns:
v1 (PyTorch): High-level operations with cuDNN optimization
- Forward pass: ~40% of time
- Backward pass: ~50% of time
- Weight updates: ~5% of time
- Data loading: ~5% of time
v5 (cuBLAS Optimized): Production-level performance
- GPU compute (forward + backward + updates): ~60% of time
- Memory transfers (H2D + D2H): ~30% of time
- Host computation (loss calculation): ~10% of time
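Breakdowns like the one above can be gathered with CUDA events for device work and wall-clock timers for host work; an event pair brackets a phase and is measured after a synchronize. An illustrative helper (not necessarily how v5.cu measures time):

```cuda
#include <cuda_runtime.h>

// Times one GPU phase (e.g. the forward pass) with CUDA events and returns
// milliseconds. Host-side phases would use a wall-clock timer instead.
float timed_phase(void (*launch)(void)) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launch();                        // enqueue the kernels for this phase
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);      // block until the phase has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```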
Key observations from the implementation progression:
- Memory Management: Reusing persistent buffers (v5) rather than allocating per batch (v4) significantly improves performance
- Library Optimization: cuBLAS provides highly optimized GEMM operations that outperform naive kernels
- CPU-GPU Transfer: Minimizing host-device synchronization is crucial for GPU performance
- Numerical Stability: Proper softmax implementation with max subtraction prevents overflow (see the softmax sketch after this list)
- Hardware Utilization: v5 achieves the best performance by maximizing GPU compute utilization
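The stable softmax subtracts the row maximum from each logit before exponentiating, so every exponent is at most zero and expf cannot overflow; with only 10 output classes this is cheap to do per row. A minimal sketch:

```cuda
#include <math.h>

// Numerically stable softmax over one row of logits (e.g. 10 classes).
// Subtracting the max makes every exponent <= 0, so expf cannot overflow.
void softmax_row(const float *logits, float *probs, int n) {
    float max_val = logits[0];
    for (int i = 1; i < n; i++)
        if (logits[i] > max_val) max_val = logits[i];

    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        probs[i] = expf(logits[i] - max_val);
        sum += probs[i];
    }
    for (int i = 0; i < n; i++)
        probs[i] /= sum;
}
```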
CUDA concepts touched on along the way:
- Row vs Column Major: Matrix layout considerations for GEMM
- Tensor Cores: Hardware acceleration for mixed-precision operations
- Memory Patterns: Coalesced access and persistent allocation strategies (bias-add example below)
- Kernel Launch: Grid/block dimensions and occupancy considerations
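Coalesced access means consecutive threads in a warp touch consecutive addresses, letting the hardware merge them into a handful of memory transactions. A bias-add kernel laid out this way, with an illustrative launch configuration (names are hypothetical):

```cuda
// out[b * n + j] += bias[j] for a row-major [batch x n] activation matrix.
// Thread i handles element i, so a warp touches 32 consecutive floats and
// the accesses coalesce into a minimal number of memory transactions.
__global__ void add_bias(float *out, const float *bias, int batch, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < batch * n) {
        out[i] += bias[i % n];
    }
}

// Launch with a block size that is a multiple of the 32-thread warp:
// int threads = 256;
// int blocks  = (batch * n + threads - 1) / threads;
// add_bias<<<blocks, threads>>>(d_out, d_bias, BATCH_SIZE, HIDDEN_SIZE);
```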
Possible further optimizations beyond v5:
- Unified Memory: Simplified memory management
- CUDA Streams: Overlapping computation with data transfer (see the streams sketch after this list)
- Mixed Precision: FP16 operations with Tensor Cores
- Kernel Fusion: Combining operations to reduce memory bandwidth
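CUDA streams let the copy and compute engines run concurrently, so the next batch's host-to-device transfer can overlap the current batch's kernels. A sketch of the double-buffered pattern with pinned host memory; train_step and the buffer layout are hypothetical, and this is not part of v5.cu:

```cuda
#include <cuda_runtime.h>

// Overlap host-to-device copies with compute using two streams.
// h_batch buffers must be pinned (cudaMallocHost) for truly async copies.
void train_overlapped(float *h_batch[2], float *d_batch[2],
                      size_t bytes, int num_batches) {
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    for (int i = 0; i < num_batches; i++) {
        int s = i & 1;  // ping-pong between the two streams/buffers
        cudaMemcpyAsync(d_batch[s], h_batch[s], bytes,
                        cudaMemcpyHostToDevice, stream[s]);
        // The training kernels for this batch go on the same stream, so
        // compute on one stream overlaps the copy queued on the other:
        // train_step<<<grid, block, 0, stream[s]>>>(d_batch[s], /* ... */);
    }

    cudaStreamSynchronize(stream[0]);
    cudaStreamSynchronize(stream[1]);
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
}
```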
