A minimal deep learning framework with automatic differentiation, built from scratch for educational purposes. BareGrad features a PyTorch-like Python API backed by custom CUDA tensor operations via pybind11.
BareGrad is a "bare minimum" machine learning framework that demonstrates the core concepts behind modern deep learning libraries like PyTorch and JAX. It implements:
- Automatic differentiation (autograd) via dynamic computation graph
- CUDA-accelerated tensor operations (matrix multiplication, element-wise ops, reductions)
- Python bindings using pybind11 for seamless C++/Python interop
- PyTorch-like API for intuitive model building
This project was developed as part of the NYU ML Systems course, combining Lab 1 (bare-metal CUDA tensor operations) and Lab 2 (Python autograd frontend).
- Element-wise operations: `+`, `-`, `*`, `relu`
- Matrix operations: `@` (matmul), `transpose`
- Reductions: `sum`, `mean`, `argmax`
- Loss functions: `cross_entropy_loss`
- Comparison: `==`
- Broadcasting support for scalar and axis-wise operations (a short example combining several of these ops follows)
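For orientation, here is a tiny end-to-end snippet exercising a few of these ops through the `AGTensor` frontend. It mirrors the Usage examples later in this README; the method-style `mean()` call and the scalar broadcast are assumptions based on the feature list above rather than documented signatures.

```python
import numpy as np
from mygrad.engine import AGTensor

# Two small CUDA tensors; every op below records itself in the graph.
a = AGTensor(np.random.randn(4, 3).astype(np.float32), is_cuda=True)
w = AGTensor(np.random.randn(3, 2).astype(np.float32), is_cuda=True)

h = (a @ w).relu()      # matrix multiply followed by ReLU
y = (h + 1.0).mean()    # scalar broadcast, then reduce to a scalar

y.backward()            # reverse-mode autodiff through the whole chain
print(a.grad.to_numpy())
```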
- Dynamic computation graph construction
- Reverse-mode automatic differentiation
- Gradient accumulation with broadcasting
- `no_grad()` context manager for inference
- Efficient memory management
- Custom CUDA kernels for GPU acceleration
- Support for both CPU and CUDA tensors
- Minimal overhead C++ tensor implementation
- Float32 and UInt32 tensor types
baregrad/
├── mygrad/
│ ├── engine.py # AGTensor class & autograd logic
│ └── __init__.py
├── src/
│ ├── ops/ # CUDA kernel implementations
│ │ ├── op_mm.cuh # Matrix multiplication
│ │ ├── op_elemwise.cuh # Element-wise operations
│ │ ├── op_reduction.cuh # Reduction operations
│ │ └── op_cross_entropy.cuh # Cross-entropy loss
│ ├── utils/
│ │ ├── tensor.cuh # Tensor data structure
│ │ ├── check_error.cuh # CUDA error checking
│ │ └── dataset_mnist.hh # MNIST dataset utilities
│ ├── bindings.cu # pybind11 bindings
│ ├── py_tensor_shim.hh # Python-C++ interface
│ └── test.cu # C++ tests
├── train_mnist.py # MNIST training example
├── test_ag_tensor.py # Python test suite
├── CMakeLists.txt # Build configuration
└── README.md
- CMake 3.20+
- CUDA Toolkit (with cuRAND)
- Python 3.7+
- NumPy
- PyTorch (for data loading only)
- pytest (for testing)
# Navigate to the baregrad directory
cd baregrad
# Create build directory and compile
mkdir -p build
cd build
cmake ..
make
cd ..

This will compile the `bten` Python extension module in the `build/` directory.
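A quick way to confirm the build succeeded is to import the module from Python. The `sys.path` line below is an assumption about where the compiled extension ends up; adjust it if your setup differs.

```python
# Sanity check: import the freshly built extension.
# The path below assumes the shared library was left in build/.
import sys
sys.path.insert(0, "build")

import bten
print(bten.__file__)  # should point at the compiled extension in build/
```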
# Run all tests
pytest -v test_ag_tensor.py
# Run specific test
pytest -v test_ag_tensor.py::test_matmul
# Run with detailed output
pytest -v -s test_ag_tensor.py

Expected output:
test_ag_tensor.py::test_sum PASSED [ 7%]
test_ag_tensor.py::test_transpose PASSED [ 15%]
test_ag_tensor.py::test_mul PASSED [ 23%]
test_ag_tensor.py::test_accum PASSED [ 30%]
test_ag_tensor.py::test_mean PASSED [ 38%]
test_ag_tensor.py::test_add PASSED [ 46%]
test_ag_tensor.py::test_sub PASSED [ 53%]
test_ag_tensor.py::test_matmul PASSED [ 61%]
test_ag_tensor.py::test_relu PASSED [ 69%]
test_ag_tensor.py::test_cross_entropy PASSED [ 76%]
test_ag_tensor.py::test_argmax PASSED [ 84%]
test_ag_tensor.py::test_eq PASSED [ 92%]
test_ag_tensor.py::test_2layer_mlp PASSED [100%]
============= 13 passed in 1.83s ==============================
Train a 2-layer MLP on MNIST:
python train_mnist.py

Expected output:
Epoch 1: train loss = 0.7655 (latency 16.58s)
Test acc: 0.9022
Epoch 2: train loss = 0.3108 (latency 16.29s)
Test acc: 0.9245
Epoch 3: train loss = 0.2559 (latency 16.34s)
Test acc: 0.9348
...
Epoch 20: train loss = 0.0610 (latency 16.00s)
Test acc: 0.9748
import numpy as np
from mygrad.engine import AGTensor
# Create tensors on GPU
a = AGTensor(np.array([[1, 2, 3]], dtype=np.float32), is_cuda=True)
b = AGTensor(np.array([[4], [5], [6]], dtype=np.float32), is_cuda=True)
# Forward pass with automatic graph construction
c = a @ b # Matrix multiplication: (1,3) @ (3,1) = (1,1)
d = c.relu() # ReLU activation
loss = d.sum() # Reduce to scalar
# Backward pass - compute all gradients
loss.backward()
# Access gradients
print("Gradient of a:")
print(a.grad.to_numpy())
print("Gradient of b:")
print(b.grad.to_numpy())

A more complete example builds and trains a 2-layer MLP with manual SGD updates:

import numpy as np
from mygrad.engine import AGTensor, no_grad
# Initialize parameters
def param(shape, is_cuda=True):
    h, w = shape
    arr = np.random.randn(h, w).astype(np.float32) * 0.01
    return AGTensor(arr, is_cuda=is_cuda)
# Create 2-layer MLP: 784 -> 128 -> 10
W1 = param((784, 128))
b1 = param((1, 128))
W2 = param((128, 10))
b2 = param((1, 10))
# Forward pass
x = AGTensor(np.random.randn(32, 784).astype(np.float32), is_cuda=True)
h = (x @ W1 + b1).relu()
logits = h @ W2 + b2
# Compute loss
y = np.random.randint(0, 10, 32).astype(np.int8)
loss = logits.cross_entropy_loss(y)
# Backprop
loss.backward()
# SGD update
lr = 0.01
for p in [W1, b1, W2, b2]:
    p.data -= p.grad * lr
    p.grad.fill(0.0)

The low-level `bten.TensorF` class provides GPU-accelerated tensor operations:
- Memory management: Automatic CUDA memory allocation/deallocation
- Kernel implementations: Custom CUDA kernels for each operation
- Python bindings: Exposed via pybind11 for Python access
The AGTensor class wraps bten.TensorF and tracks operations:
class AGTensor:
    def __init__(self, data, children=(), op=''):
        self.data = data               # bten.TensorF (actual tensor)
        self.grad = None               # Gradient accumulator
        self._prev = set(children)     # Parent nodes in computation graph
        self._backward = lambda: None  # Gradient computation function

Each operation (a simplified sketch follows the list below):
- Computes the forward result
- Stores references to input tensors
- Defines a closure for backward gradient computation
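As an illustration of that pattern, here is a deliberately simplified stand-in that uses plain NumPy arrays instead of `bten.TensorF`. The class name `TinyTensor` is invented for the sketch; the real logic lives in `mygrad/engine.py`.

```python
import numpy as np

class TinyTensor:
    """Stripped-down stand-in for AGTensor, showing only the op pattern."""

    def __init__(self, data, children=(), op=''):
        self.data = np.asarray(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)
        self._prev = set(children)
        self._op = op
        self._backward = lambda: None

    def __mul__(self, other):
        # Forward result, with references to both inputs recorded in the graph
        out = TinyTensor(self.data * other.data, (self, other), '*')

        def _backward():
            # Chain rule for y = a * b: da = dy * b, db = dy * a
            self.grad += out.grad * other.data
            other.grad += out.grad * self.data

        out._backward = _backward
        return out
```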
When you call .backward():
- Build graph: Topologically sort all operations from the computation graph
- Initialize gradient: Set output gradient to 1.0 (for scalar loss)
- Traverse in reverse: Call each node's `_backward()` function
- Accumulate gradients: Sum gradients for nodes used multiple times (a sketch of the whole procedure follows)
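In the simplified NumPy terms used above, the procedure looks roughly like this; the authoritative version is `AGTensor.backward()` in `mygrad/engine.py`.

```python
import numpy as np

def backward(root):
    # 1. Topologically sort every node reachable from the output.
    topo, visited = [], set()
    def build(node):
        if node not in visited:
            visited.add(node)
            for child in node._prev:
                build(child)
            topo.append(node)
    build(root)

    # 2. Seed the output gradient with 1.0 (scalar loss).
    root.grad = np.ones_like(root.data)

    # 3. Walk the graph in reverse; each node pushes gradients to its
    #    parents, and `+=` accumulation handles tensors used twice.
    for node in reversed(topo):
        node._backward()
```

Calling `backward(loss)` on the scalar output node fills in `grad` for every tensor that contributed to it.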
BareGrad supports NumPy-style broadcasting:
- Scalar operations: `tensor + 3.0`
- Axis-wise broadcasting: `(N, D) + (1, D)` broadcasts across the batch dimension
- Automatic gradient dimension handling in the backward pass (sketched below)
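The backward-pass dimension handling amounts to summing the upstream gradient over any axis that was broadcast in the forward pass. Here is a NumPy sketch of the idea; the actual logic lives in `mygrad/engine.py`.

```python
import numpy as np

def unbroadcast(grad, shape):
    """Reduce an upstream gradient back to the shape of a broadcast input.

    If a (1, D) bias was added to a (N, D) activation, its gradient is the
    (N, D) upstream gradient summed over axis 0.
    """
    for axis, (g_dim, s_dim) in enumerate(zip(grad.shape, shape)):
        if s_dim == 1 and g_dim != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

# (N, D) + (1, D): the bias gradient collapses back to (1, D)
dy = np.ones((32, 128), dtype=np.float32)
db = unbroadcast(dy, (1, 128))
assert db.shape == (1, 128)
```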
On a typical NVIDIA GPU (tested on RTX 3090/A100):
- MNIST Training: ~15-17s per epoch (2-layer MLP, batch size 128)
- Final Accuracy: ~97.5% after 20 epochs
- Performance Gap: ~5x slower than the pure C++ implementation
The overhead comes from:
- Python function call overhead
- Frequent small memory allocations
- Dynamic graph construction
- Lack of kernel fusion optimizations
This is expected and acceptable for an educational framework!
| Operation | Forward | Backward |
|---|---|---|
| Addition `y = a + b` | `y = a + b` | `da = dy, db = dy` |
| Subtraction `y = a - b` | `y = a - b` | `da = dy, db = -dy` |
| Multiplication `y = a * b` | `y = a * b` | `da = dy * b, db = dy * a` |
| MatMul `Y = A @ B` | `Y = A @ B` | `dA = dY @ B.T, dB = A.T @ dY` |
| ReLU `y = relu(x)` | `y = max(0, x)` | `dx = dy * (x > 0)` |
| Sum `y = sum(x)` | `y = sum(x)` | `dx = dy * ones_like(x)` |
| Mean `y = mean(x)` | `y = sum(x) / n` | `dx = dy / n` |
| Cross-Entropy | Fused softmax + log + NLL | Analytical gradient |
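These rules are easy to spot-check numerically. For example, the MatMul row can be verified with plain NumPy against a central-difference gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))

# Scalar loss L = sum(A @ B); the MatMul row gives dA = dY @ B.T with dY = ones
dY = np.ones((4, 5))
dA_analytic = dY @ B.T

# Central-difference check, one entry of A at a time
eps = 1e-5
dA_numeric = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        Ap, Am = A.copy(), A.copy()
        Ap[i, j] += eps
        Am[i, j] -= eps
        dA_numeric[i, j] = ((Ap @ B).sum() - (Am @ B).sum()) / (2 * eps)

assert np.allclose(dA_analytic, dA_numeric, atol=1e-4)
```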
The test suite (test_ag_tensor.py) verifies:
- Correctness: Gradient checks via numerical differentiation
- Operations: Individual ops (add, mul, matmul, relu, etc.)
- Broadcasting: Dimension handling in forward and backward
- Accumulation: Multiple uses of same tensor
- End-to-end: 2-layer MLP training
Current limitations (intentional for educational scope):
- 2D tensors only: No support for 1D vectors or N-D tensors
- Limited operators: No conv2d, pooling, normalization, etc.
- No optimizer abstraction: Manual SGD updates required
- No model serialization: Can't save/load trained models
- Single precision only: Float32 operations only
- Basic broadcasting: Only supports simple broadcasting patterns
- Memory efficiency: No memory pooling or kernel fusion
Potential extensions for learning:
- Add 1D and N-D tensor support
- Implement convolution and pooling operators
- Add batch normalization and layer normalization
- Create optimizer classes (Adam, RMSprop); a minimal SGD sketch appears after this list
- Implement model serialization
- Add kernel fusion for common operation patterns
- Support mixed precision training
- Add gradient clipping and regularization
- Implement data parallel training
- Add more activation functions (tanh, sigmoid, gelu)
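As a starting point for the optimizer-abstraction item above, here is a minimal SGD class that simply wraps the manual update loop from the MLP example. It is a sketch only, built on the `data`, `grad`, and `grad.fill()` attributes shown earlier.

```python
class SGD:
    """Minimal optimizer sketch wrapping the manual SGD update."""

    def __init__(self, params, lr=0.01):
        self.params = list(params)
        self.lr = lr

    def step(self):
        for p in self.params:
            p.data -= p.grad * self.lr  # same update as the manual loop

    def zero_grad(self):
        for p in self.params:
            p.grad.fill(0.0)            # reset accumulated gradients

# Usage: opt = SGD([W1, b1, W2, b2], lr=0.01)
#        loss.backward(); opt.step(); opt.zero_grad()
```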
Key concepts demonstrated:
- Automatic differentiation: How frameworks like PyTorch compute gradients
- Dynamic computation graphs: Building graphs on-the-fly during forward pass
- Operator overloading: Creating intuitive Python APIs
- CUDA programming: Writing efficient GPU kernels
- Python/C++ interop: Using pybind11 for native extensions
- Memory management: Handling GPU memory in C++
- Gradient computation: Chain rule and backpropagation
# Print tensor values
print(tensor.numpy())
# Check tensor properties
print(tensor.shape) # (rows, cols)
print(tensor.is_cuda) # True/False
print(tensor._op) # Operation that created this tensor
# Disable autograd for inference
from mygrad.engine import no_grad
with no_grad():
    output = model(input)  # No graph construction
# Verify gradients manually
def numerical_gradient(f, x, eps=1e-5):
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old_val = x[idx]
        x[idx] = old_val + eps
        fxh = f()
        x[idx] = old_val - eps
        fxl = f()
        grad[idx] = (fxh - fxl) / (2 * eps)
        x[idx] = old_val
        it.iternext()
    return grad

| Feature | BareGrad | micrograd | PyTorch | JAX |
|---|---|---|---|---|
| Tensor rank | 2D only | Scalar | N-D | N-D |
| Backend | CUDA | CPU | CUDA/CPU | XLA |
| Graph type | Dynamic | Dynamic | Dynamic | Static (JIT) |
| Target | Education | Education | Production | Research |
| Lines of code | ~1500 | ~200 | ~1M | ~500k |
This is an educational project. Feel free to:
- Add new operators
- Optimize existing kernels
- Improve documentation
- Fix bugs
- Add tests
Developed as part of NYU's ML Systems course (Fall 2024).
Inspired by:
- PyTorch - Industry-standard deep learning framework
- micrograd by Andrej Karpathy - Minimal scalar autograd engine
- tinygrad by George Hotz - Minimal tensor library
Key concepts from:
- CS231n (Stanford) - Backpropagation and neural networks
- CUDA Programming Guide - GPU kernel optimization
- pybind11 documentation - Python/C++ bindings
MIT License - Free to use for educational purposes!
If you use this code for educational purposes, please cite:
@misc{baregrad2024,
  title={BareGrad: A Minimal Deep Learning Framework with CUDA Acceleration},
  author={NYU ML Systems Course},
  year={2024},
  howpublished={\url{https://github.com/yourusername/baregrad}}
}

Built with ❤️ for ML Systems education