BareGrad

A minimal deep learning framework with automatic differentiation, built from scratch for educational purposes. BareGrad pairs a PyTorch-like Python API with custom CUDA tensor operations, exposed to Python via pybind11.

Overview

BareGrad is a "bare minimum" machine learning framework that demonstrates the core concepts behind modern deep learning libraries like PyTorch and JAX. It implements:

  • Automatic differentiation (autograd) via a dynamic computation graph
  • CUDA-accelerated tensor operations (matrix multiplication, element-wise ops, reductions)
  • Python bindings using pybind11 for seamless C++/Python interop
  • PyTorch-like API for intuitive model building

This project was developed as part of the NYU ML Systems course, combining Lab 1 (bare-metal CUDA tensor operations) and Lab 2 (Python autograd frontend).

Features

Tensor Operations

  • Element-wise operations: +, -, *, relu
  • Matrix operations: @ (matmul), transpose
  • Reductions: sum, mean, argmax
  • Loss functions: cross_entropy_loss
  • Comparison: ==
  • Broadcasting support for scalar and axis-wise operations
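
A short sketch of how these compose through the Python API (mean() as a method and the exact constructor signature are assumed from the examples later in this README; the real engine.py may differ in minor details):

import numpy as np
from mygrad.engine import AGTensor

x = AGTensor(np.ones((4, 3), dtype=np.float32), is_cuda=True)        # (4, 3)
w = AGTensor(np.full((3, 2), 0.5, dtype=np.float32), is_cuda=True)   # (3, 2)
b = AGTensor(np.zeros((1, 2), dtype=np.float32), is_cuda=True)       # (1, 2)

h = (x @ w + b).relu()   # matmul, axis-wise broadcast add, activation
loss = h.mean()          # reduce to a scalar
loss.backward()
print(w.grad.to_numpy())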

Autograd Engine

  • Dynamic computation graph construction
  • Reverse-mode automatic differentiation
  • Gradient accumulation with broadcasting
  • no_grad() context manager for inference
  • Efficient memory management

Backend

  • Custom CUDA kernels for GPU acceleration
  • Support for both CPU and CUDA tensors
  • Minimal overhead C++ tensor implementation
  • Float32 and UInt32 tensor types

Architecture

baregrad/
├── mygrad/
│   ├── engine.py          # AGTensor class & autograd logic
│   └── __init__.py
├── src/
│   ├── ops/               # CUDA kernel implementations
│   │   ├── op_mm.cuh           # Matrix multiplication
│   │   ├── op_elemwise.cuh     # Element-wise operations
│   │   ├── op_reduction.cuh    # Reduction operations
│   │   └── op_cross_entropy.cuh # Cross-entropy loss
│   ├── utils/
│   │   ├── tensor.cuh          # Tensor data structure
│   │   ├── check_error.cuh     # CUDA error checking
│   │   └── dataset_mnist.hh    # MNIST dataset utilities
│   ├── bindings.cu             # pybind11 bindings
│   ├── py_tensor_shim.hh       # Python-C++ interface
│   └── test.cu                 # C++ tests
├── train_mnist.py              # MNIST training example
├── test_ag_tensor.py           # Python test suite
├── CMakeLists.txt              # Build configuration
└── README.md

Quick Start

Prerequisites

  • CMake 3.20+
  • CUDA Toolkit (with cuRAND)
  • Python 3.7+
  • NumPy
  • PyTorch (for data loading only)
  • pytest (for testing)

Building

# Navigate to the baregrad directory
cd baregrad

# Create build directory and compile
mkdir -p build
cd build
cmake ..
make
cd ..

This will compile the bten Python extension module in the build/ directory.
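
A quick way to sanity-check the result (assuming you run Python from the repository root and the compiled extension lives directly in build/; adjust the path for your setup):

import sys
sys.path.insert(0, "build")   # make the freshly built extension importable
import bten                   # the pybind11 module compiled above
print(bten.__file__)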

Running Tests

# Run all tests
pytest -v test_ag_tensor.py

# Run specific test
pytest -v test_ag_tensor.py::test_matmul

# Run with detailed output
pytest -v -s test_ag_tensor.py

Expected output:

test_ag_tensor.py::test_sum PASSED           [  7%]
test_ag_tensor.py::test_transpose PASSED     [ 15%]
test_ag_tensor.py::test_mul PASSED           [ 23%]
test_ag_tensor.py::test_accum PASSED         [ 30%]
test_ag_tensor.py::test_mean PASSED          [ 38%]
test_ag_tensor.py::test_add PASSED           [ 46%]
test_ag_tensor.py::test_sub PASSED           [ 53%]
test_ag_tensor.py::test_matmul PASSED        [ 61%]
test_ag_tensor.py::test_relu PASSED          [ 69%]
test_ag_tensor.py::test_cross_entropy PASSED [ 76%]
test_ag_tensor.py::test_argmax PASSED        [ 84%]
test_ag_tensor.py::test_eq PASSED            [ 92%]
test_ag_tensor.py::test_2layer_mlp PASSED    [100%]

============= 13 passed in 1.83s ==============================

Training MNIST

Train a 2-layer MLP on MNIST:

python train_mnist.py

Expected output:

Epoch 1: train loss = 0.7655 (latency 16.58s)
  Test acc: 0.9022
Epoch 2: train loss = 0.3108 (latency 16.29s)
  Test acc: 0.9245
Epoch 3: train loss = 0.2559 (latency 16.34s)
  Test acc: 0.9348
...
Epoch 20: train loss = 0.0610 (latency 16.00s)
  Test acc: 0.9748

Usage Example

import numpy as np
from mygrad.engine import AGTensor

# Create tensors on GPU
a = AGTensor(np.array([[1, 2, 3]], dtype=np.float32), is_cuda=True)
b = AGTensor(np.array([[4], [5], [6]], dtype=np.float32), is_cuda=True)

# Forward pass with automatic graph construction
c = a @ b  # Matrix multiplication: (1,3) @ (3,1) = (1,1)
d = c.relu()  # ReLU activation
loss = d.sum()  # Reduce to scalar

# Backward pass - compute all gradients
loss.backward()

# Access gradients
print("Gradient of a:")
print(a.grad.to_numpy())
print("Gradient of b:")
print(b.grad.to_numpy())

Building a Simple Neural Network

import numpy as np
from mygrad.engine import AGTensor, no_grad

# Initialize parameters
def param(shape, is_cuda=True):
    h, w = shape
    arr = np.random.randn(h, w).astype(np.float32) * 0.01
    return AGTensor(arr, is_cuda=is_cuda)

# Create 2-layer MLP: 784 -> 128 -> 10
W1 = param((784, 128))
b1 = param((1, 128))
W2 = param((128, 10))
b2 = param((1, 10))

# Forward pass
x = AGTensor(np.random.randn(32, 784).astype(np.float32), is_cuda=True)
h = (x @ W1 + b1).relu()
logits = h @ W2 + b2

# Compute loss
y = np.random.randint(0, 10, 32).astype(np.int8)
loss = logits.cross_entropy_loss(y)

# Backprop
loss.backward()

# SGD update
lr = 0.01
for p in [W1, b1, W2, b2]:
    p.data -= p.grad * lr
    p.grad.fill(0.0)

How It Works

1. Tensor Operations (C++/CUDA Backend)

The low-level bten.TensorF class provides GPU-accelerated tensor operations:

  • Memory management: Automatic CUDA memory allocation/deallocation
  • Kernel implementations: Custom CUDA kernels for each operation
  • Python bindings: Exposed via pybind11 for Python access
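
From Python, the split is visible on any AGTensor: the wrapper carries the autograd bookkeeping, while its .data attribute is the underlying bten.TensorF that owns the GPU memory. A small illustration (using only attributes mentioned elsewhere in this README):

import numpy as np
from mygrad.engine import AGTensor

t = AGTensor(np.eye(2, dtype=np.float32), is_cuda=True)
print(type(t.data))   # the low-level bten.TensorF doing the actual work
print(t.shape)        # shape/metadata accessed through the Python wrapper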

2. Autograd Wrapper (Python Frontend)

The AGTensor class wraps bten.TensorF and tracks operations:

class AGTensor:
    def __init__(self, data, children=(), op=''):
        self.data = data           # bten.TensorF (actual tensor)
        self.grad = None           # Gradient accumulator
        self._prev = set(children) # Parent nodes in computation graph
        self._backward = lambda: None  # Gradient computation function

Each operation:

  1. Computes the forward result
  2. Stores references to input tensors
  3. Defines a closure for backward gradient computation
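
For illustration, here is a self-contained sketch of that three-step pattern using NumPy arrays in place of bten.TensorF. TinyTensor is a stand-in for exposition only; the real AGTensor in engine.py differs in details (CUDA tensors, grad initialized to None, no_grad handling):

import numpy as np

class TinyTensor:
    def __init__(self, data, children=(), op=''):
        self.data = np.asarray(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)
        self._prev = set(children)
        self._op = op
        self._backward = lambda: None

    def __add__(self, other):
        # 1. compute the forward result and 2. record the inputs
        out = TinyTensor(self.data + other.data, children=(self, other), op='+')

        # 3. closure that routes the upstream gradient to both inputs
        def _backward():
            self.grad += out.grad    # d(a+b)/da = 1
            other.grad += out.grad   # d(a+b)/db = 1
        out._backward = _backward
        return out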

3. Backward Pass

When you call .backward():

  1. Build graph: Topologically sort all operations from the computation graph
  2. Initialize gradient: Set output gradient to 1.0 (for scalar loss)
  3. Traverse in reverse: Call each node's _backward() function
  4. Accumulate gradients: Sum gradients for nodes used multiple times
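
Continuing the TinyTensor sketch from the previous section, those four steps map onto a short method (illustrative only, not the literal engine.py code):

# Attach to the TinyTensor sketch above: TinyTensor.backward = backward
def backward(self):
    # 1. topologically sort every node reachable from the loss
    topo, visited = [], set()
    def build(node):
        if node not in visited:
            visited.add(node)
            for child in node._prev:
                build(child)
            topo.append(node)
    build(self)

    # 2. seed the output gradient with 1.0 (the loss is a scalar)
    self.grad = np.ones_like(self.data)

    # 3. & 4. walk the graph in reverse; each closure adds its contribution
    # into its inputs' .grad, so tensors used more than once accumulate
    for node in reversed(topo):
        node._backward()

TinyTensor.backward = backward
a = TinyTensor([[1.0, 2.0]])
b = TinyTensor([[3.0, 4.0]])
(a + b + a).backward()
print(a.grad)   # [[2. 2.]] -- a appears twice, so its gradients accumulate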

4. Broadcasting

BareGrad supports NumPy-style broadcasting:

  • Scalar operations: tensor + 3.0
  • Axis-wise broadcasting: (N, D) + (1, D) broadcasts across batch dimension
  • Automatic gradient dimension handling in backward pass
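
The last bullet boils down to summing the incoming gradient over any axis that was broadcast in the forward pass. A NumPy sketch of the idea (unbroadcast is a hypothetical helper name, not part of the BareGrad API):

import numpy as np

def unbroadcast(grad, shape):
    # sum the gradient back down over axes where the original tensor had size 1
    for axis, (g, s) in enumerate(zip(grad.shape, shape)):
        if s == 1 and g != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

dy = np.ones((32, 128), dtype=np.float32)   # gradient flowing into (x @ W1 + b1)
db1 = unbroadcast(dy, (1, 128))             # bias gradient, shape (1, 128)
print(db1.shape)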

Performance

On a typical NVIDIA GPU (tested on RTX 3090/A100):

  • MNIST Training: ~15-17s per epoch (2-layer MLP, batch size 128)
  • Final Accuracy: ~97.5% after 20 epochs
  • Performance Gap: ~5x slower than a pure C++ implementation

The overhead comes from:

  • Python function call overhead
  • Frequent small memory allocations
  • Dynamic graph construction
  • Lack of kernel fusion optimizations

This is expected and acceptable for an educational framework!

Operator Gradient Implementations

| Operation      | Forward                    | Backward                      |
|----------------|----------------------------|-------------------------------|
| Addition       | y = a + b                  | da = dy, db = dy              |
| Subtraction    | y = a - b                  | da = dy, db = -dy             |
| Multiplication | y = a * b                  | da = dy * b, db = dy * a      |
| MatMul         | Y = A @ B                  | dA = dY @ B.T, dB = A.T @ dY  |
| ReLU           | y = relu(x) = max(0, x)    | dx = dy * (x > 0)             |
| Sum            | y = sum(x)                 | dx = dy * ones_like(x)        |
| Mean           | y = mean(x) = sum(x) / n   | dx = dy / n                   |
| Cross-Entropy  | Fused softmax + log + NLL  | Analytical gradient           |
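
As a quick sanity check of the MatMul row, the rule can be verified numerically with plain NumPy (independent of BareGrad):

import numpy as np

A = np.random.randn(2, 3)
B = np.random.randn(3, 4)
dY = np.random.randn(2, 4)                    # upstream gradient for Y = A @ B
dA = dY @ B.T                                 # analytic gradient from the table

eps = 1e-6
E = np.zeros_like(A)
E[0, 1] = eps                                 # perturb a single entry of A
loss = lambda M: float((M @ B * dY).sum())    # scalar proxy loss: sum(Y * dY)
numeric = (loss(A + E) - loss(A - E)) / (2 * eps)
print(np.isclose(numeric, dA[0, 1]))          # True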

Testing

The test suite (test_ag_tensor.py) verifies:

  • Correctness: Gradient checks via numerical differentiation
  • Operations: Individual ops (add, mul, matmul, relu, etc.)
  • Broadcasting: Dimension handling in forward and backward
  • Accumulation: Multiple uses of same tensor
  • End-to-end: 2-layer MLP training

Limitations

Current limitations (intentional for educational scope):

  • 2D tensors only: No support for 1D vectors or N-D tensors
  • Limited operators: No conv2d, pooling, normalization, etc.
  • No optimizer abstraction: Manual SGD updates required
  • No model serialization: Can't save/load trained models
  • Single precision only: Float32 operations only
  • Basic broadcasting: Only supports simple broadcasting patterns
  • Memory efficiency: No memory pooling or kernel fusion

Future Improvements

Potential extensions for learning:

  • Add 1D and N-D tensor support
  • Implement convolution and pooling operators
  • Add batch normalization and layer normalization
  • Create optimizer classes (Adam, RMSprop)
  • Implement model serialization
  • Add kernel fusion for common operation patterns
  • Support mixed precision training
  • Add gradient clipping and regularization
  • Implement data parallel training
  • Add more activation functions (tanh, sigmoid, gelu)

Learning Resources

Key concepts demonstrated:

  • Automatic differentiation: How frameworks like PyTorch compute gradients
  • Dynamic computation graphs: Building graphs on-the-fly during forward pass
  • Operator overloading: Creating intuitive Python APIs
  • CUDA programming: Writing efficient GPU kernels
  • Python/C++ interop: Using pybind11 for native extensions
  • Memory management: Handling GPU memory in C++
  • Gradient computation: Chain rule and backpropagation

Debugging Tips

# Print tensor values
print(tensor.to_numpy())

# Check tensor properties
print(tensor.shape)        # (rows, cols)
print(tensor.is_cuda)      # True/False
print(tensor._op)          # Operation that created this tensor

# Disable autograd for inference
from mygrad.engine import no_grad
with no_grad():
    output = model(input)  # No graph construction

# Verify gradients manually with central differences
# (f is a zero-argument closure that recomputes the scalar loss from x)
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old_val = x[idx]
        x[idx] = old_val + eps
        fxh = f()
        x[idx] = old_val - eps
        fxl = f()
        grad[idx] = (fxh - fxl) / (2 * eps)
        x[idx] = old_val
        it.iternext()
    return grad
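
For example, to cross-check the multiplication rule da = dy * b from the table above (with dy = 1 here, since the loss is a plain sum), recompute the loss inside a zero-argument closure that reads the temporarily perturbed array:

a_np = np.random.randn(2, 3)
b_np = np.random.randn(2, 3)

loss = lambda: float((a_np * b_np).sum())                   # reads a_np on every call
print(np.allclose(numerical_gradient(loss, a_np), b_np))    # True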

Comparison with Other Frameworks

| Feature       | BareGrad  | micrograd | PyTorch    | JAX          |
|---------------|-----------|-----------|------------|--------------|
| Tensor rank   | 2D only   | Scalar    | N-D        | N-D          |
| Backend       | CUDA      | CPU       | CUDA/CPU   | XLA          |
| Graph type    | Dynamic   | Dynamic   | Dynamic    | Static (JIT) |
| Target        | Education | Education | Production | Research     |
| Lines of code | ~1500     | ~200      | ~1M        | ~500k        |

Contributing

This is an educational project. Feel free to:

  • Add new operators
  • Optimize existing kernels
  • Improve documentation
  • Fix bugs
  • Add tests

Credits

Developed as part of NYU's ML Systems course (Fall 2024).

Inspired by:

  • PyTorch - Industry-standard deep learning framework
  • micrograd by Andrej Karpathy - Minimal scalar autograd engine
  • tinygrad by George Hotz - Minimal tensor library

Key concepts from:

  • CS231n (Stanford) - Backpropagation and neural networks
  • CUDA Programming Guide - GPU kernel optimization
  • pybind11 documentation - Python/C++ bindings

License

MIT License - Free to use for educational purposes!

Citation

If you use this code for educational purposes, please cite:

@misc{baregrad2024,
  title={BareGrad: A Minimal Deep Learning Framework with CUDA Acceleration},
  author={NYU ML Systems Course},
  year={2024},
  howpublished={\url{https://github.com/yourusername/baregrad}}
}

Built with ❤️ for ML Systems education
