A minimal deep learning framework with automatic differentiation, built from scratch for educational purposes. BareGrad features a PyTorch-like Python API backed by custom CUDA tensor operations via pybind11.
BareGrad is a "bare minimum" machine learning framework that demonstrates the core concepts behind modern deep learning libraries like PyTorch and JAX. It implements:
- Automatic differentiation (autograd) via dynamic computation graph
- CUDA-accelerated tensor operations (matrix multiplication, element-wise ops, reductions)
- Python bindings using pybind11 for seamless C++/Python interop
- PyTorch-like API for intuitive model building
This project was developed as part of the NYU ML Systems course, combining Lab 1 (bare-metal CUDA tensor operations) and Lab 2 (Python autograd frontend).
- Element-wise operations: `+`, `-`, `*`, `relu`
- Matrix operations: `@` (matmul), `transpose`
- Reductions: `sum`, `mean`, `argmax`
- Loss functions: `cross_entropy_loss`
- Comparison: `==`
- Broadcasting support for scalar and axis-wise operations (a short example combining several of these ops follows)
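For orientation, here is a tiny end-to-end snippet exercising a few of these ops through the `AGTensor` frontend. It mirrors the Usage examples later in this README; the method-style `mean()` call and the scalar broadcast are assumptions based on the feature list above rather than documented signatures.

```python
import numpy as np
from mygrad.engine import AGTensor

# Two small CUDA tensors; every op below records itself in the graph.
a = AGTensor(np.random.randn(4, 3).astype(np.float32), is_cuda=True)
w = AGTensor(np.random.randn(3, 2).astype(np.float32), is_cuda=True)

h = (a @ w).relu()      # matrix multiply followed by ReLU
y = (h + 1.0).mean()    # scalar broadcast, then reduce to a scalar

y.backward()            # reverse-mode autodiff through the whole chain
print(a.grad.to_numpy())
```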
- Dynamic computation graph construction
- Reverse-mode automatic differentiation
- Gradient accumulation with broadcasting
- `no_grad()` context manager for inference
- Efficient memory management
- Custom CUDA kernels for GPU acceleration
- Support for both CPU and CUDA tensors
- Minimal overhead C++ tensor implementation
- Float32 and UInt32 tensor types
baregrad/
├── mygrad/
│ ├── engine.py # AGTensor class & autograd logic
│ └── __init__.py
├── src/
│ ├── ops/ # CUDA kernel implementations
│ │ ├── op_mm.cuh # Matrix multiplication
│ │ ├── op_elemwise.cuh # Element-wise operations
│ │ ├── op_reduction.cuh # Reduction operations
│ │ └── op_cross_entropy.cuh # Cross-entropy loss
│ ├── utils/
│ │ ├── tensor.cuh # Tensor data structure
│ │ ├── check_error.cuh # CUDA error checking
│ │ └── dataset_mnist.hh # MNIST dataset utilities
│ ├── bindings.cu # pybind11 bindings
│ ├── py_tensor_shim.hh # Python-C++ interface
│ └── test.cu # C++ tests
├── train_mnist.py # MNIST training example
├── test_ag_tensor.py # Python test suite
├── CMakeLists.txt # Build configuration
└── README.md
- CMake 3.20+
- CUDA Toolkit (with cuRAND)
- Python 3.7+
- NumPy
- PyTorch (for data loading only)
- pytest (for testing)
# Navigate to the baregrad directory
cd baregrad
# Create build directory and compile
mkdir -p build
cd build
cmake ..
make
cd ..

This will compile the `bten` Python extension module in the `build/` directory.
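A quick way to confirm the build succeeded is to import the module from Python. The `sys.path` line below is an assumption about where the compiled extension ends up; adjust it if your setup differs.

```python
# Sanity check: import the freshly built extension.
# The path below assumes the shared library was left in build/.
import sys
sys.path.insert(0, "build")

import bten
print(bten.__file__)  # should point at the compiled extension in build/
```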
# Run all tests
pytest -v test_ag_tensor.py
# Run specific test
pytest -v test_ag_tensor.py::test_matmul
# Run with detailed output
pytest -v -s test_ag_tensor.py

Expected output:
test_ag_tensor.py::test_sum PASSED [ 7%]
test_ag_tensor.py::test_transpose PASSED [ 15%]
test_ag_tensor.py::test_mul PASSED [ 23%]
test_ag_tensor.py::test_accum PASSED [ 30%]
test_ag_tensor.py::test_mean PASSED [ 38%]
test_ag_tensor.py::test_add PASSED [ 46%]
test_ag_tensor.py::test_sub PASSED [ 53%]
test_ag_tensor.py::test_matmul PASSED [ 61%]
test_ag_tensor.py::test_relu PASSED [ 69%]
test_ag_tensor.py::test_cross_entropy PASSED [ 76%]
test_ag_tensor.py::test_argmax PASSED [ 84%]
test_ag_tensor.py::test_eq PASSED [ 92%]
test_ag_tensor.py::test_2layer_mlp PASSED [100%]
============= 13 passed in 1.83s ==============================
Train a 2-layer MLP on MNIST:
python train_mnist.py

Expected output:
Epoch 1: train loss = 0.7655 (latency 16.58s)
Test acc: 0.9022
Epoch 2: train loss = 0.3108 (latency 16.29s)
Test acc: 0.9245
Epoch 3: train loss = 0.2559 (latency 16.34s)
Test acc: 0.9348
...
Epoch 20: train loss = 0.0610 (latency 16.00s)
Test acc: 0.9748
import numpy as np
from mygrad.engine import AGTensor
# Create tensors on GPU
a = AGTensor(np.array([[1, 2, 3]], dtype=np.float32), is_cuda=True)
b = AGTensor(np.array([[4], [5], [6]], dtype=np.float32), is_cuda=True)
# Forward pass with automatic graph construction
c = a @ b # Matrix multiplication: (1,3) @ (3,1) = (1,1)
d = c.relu() # ReLU activation
loss = d.sum() # Reduce to scalar
# Backward pass - compute all gradients
loss.backward()
# Access gradients
print("Gradient of a:")
print(a.grad.to_numpy())
print("Gradient of b:")
print(b.grad.to_numpy())

A more complete example builds and trains a 2-layer MLP with manual SGD updates:

import numpy as np
from mygrad.engine import AGTensor, no_grad
# Initialize parameters
def param(shape, is_cuda=True):
    h, w = shape
    arr = np.random.randn(h, w).astype(np.float32) * 0.01
    return AGTensor(arr, is_cuda=is_cuda)
# Create 2-layer MLP: 784 -> 128 -> 10
W1 = param((784, 128))
b1 = param((1, 128))
W2 = param((128, 10))
b2 = param((1, 10))
# Forward pass
x = AGTensor(np.random.randn(32, 784).astype(np.float32), is_cuda=True)
h = (x @ W1 + b1).relu()
logits = h @ W2 + b2
# Compute loss
y = np.random.randint(0, 10, 32).astype(np.int8)
loss = logits.cross_entropy_loss(y)
# Backprop
loss.backward()
# SGD update
lr = 0.01
for p in [W1, b1, W2, b2]:
    p.data -= p.grad * lr
    p.grad.fill(0.0)

The low-level `bten.TensorF` class provides GPU-accelerated tensor operations:
- Memory management: Automatic CUDA memory allocation/deallocation
- Kernel implementations: Custom CUDA kernels for each operation
- Python bindings: Exposed via pybind11 for Python access
The AGTensor class wraps bten.TensorF and tracks operations:
class AGTensor:
    def __init__(self, data, children=(), op=''):
        self.data = data               # bten.TensorF (actual tensor)
        self.grad = None               # Gradient accumulator
        self._prev = set(children)     # Parent nodes in computation graph
        self._backward = lambda: None  # Gradient computation function

Each operation (a simplified sketch follows the list below):
- Computes the forward result
- Stores references to input tensors
- Defines a closure for backward gradient computation
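As an illustration of that pattern, here is a deliberately simplified stand-in that uses plain NumPy arrays instead of `bten.TensorF`. The class name `TinyTensor` is invented for the sketch; the real logic lives in `mygrad/engine.py`.

```python
import numpy as np

class TinyTensor:
    """Stripped-down stand-in for AGTensor, showing only the op pattern."""

    def __init__(self, data, children=(), op=''):
        self.data = np.asarray(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)
        self._prev = set(children)
        self._op = op
        self._backward = lambda: None

    def __mul__(self, other):
        # Forward result, with references to both inputs recorded in the graph
        out = TinyTensor(self.data * other.data, (self, other), '*')

        def _backward():
            # Chain rule for y = a * b: da = dy * b, db = dy * a
            self.grad += out.grad * other.data
            other.grad += out.grad * self.data

        out._backward = _backward
        return out
```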
When you call .backward():
- Build graph: Topologically sort all operations from the computation graph
- Initialize gradient: Set output gradient to 1.0 (for scalar loss)
- Traverse in reverse: Call each node's `_backward()` function
- Accumulate gradients: Sum gradients for nodes used multiple times (a sketch of the whole procedure follows)
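In the simplified NumPy terms used above, the procedure looks roughly like this; the authoritative version is `AGTensor.backward()` in `mygrad/engine.py`.

```python
import numpy as np

def backward(root):
    # 1. Topologically sort every node reachable from the output.
    topo, visited = [], set()
    def build(node):
        if node not in visited:
            visited.add(node)
            for child in node._prev:
                build(child)
            topo.append(node)
    build(root)

    # 2. Seed the output gradient with 1.0 (scalar loss).
    root.grad = np.ones_like(root.data)

    # 3. Walk the graph in reverse; each node pushes gradients to its
    #    parents, and `+=` accumulation handles tensors used twice.
    for node in reversed(topo):
        node._backward()
```

Calling `backward(loss)` on the scalar output node fills in `grad` for every tensor that contributed to it.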
BareGrad supports NumPy-style broadcasting:
- Scalar operations: `tensor + 3.0`
- Axis-wise broadcasting: `(N, D) + (1, D)` broadcasts across the batch dimension
- Automatic gradient dimension handling in the backward pass (sketched below)
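The backward-pass dimension handling amounts to summing the upstream gradient over any axis that was broadcast in the forward pass. Here is a NumPy sketch of the idea; the actual logic lives in `mygrad/engine.py`.

```python
import numpy as np

def unbroadcast(grad, shape):
    """Reduce an upstream gradient back to the shape of a broadcast input.

    If a (1, D) bias was added to a (N, D) activation, its gradient is the
    (N, D) upstream gradient summed over axis 0.
    """
    for axis, (g_dim, s_dim) in enumerate(zip(grad.shape, shape)):
        if s_dim == 1 and g_dim != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

# (N, D) + (1, D): the bias gradient collapses back to (1, D)
dy = np.ones((32, 128), dtype=np.float32)
db = unbroadcast(dy, (1, 128))
assert db.shape == (1, 128)
```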
On a typical NVIDIA GPU (tested on RTX 3090/A100):
- MNIST Training: ~15-17s per epoch (2-layer MLP, batch size 128)
- Final Accuracy: ~97.5% after 20 epochs
- Performance Gap: ~5x slower than the pure C++ implementation
The overhead comes from:
- Python function call overhead
- Frequent small memory allocations
- Dynamic graph construction
- Lack of kernel fusion optimizations
This is expected and acceptable for an educational framework!
| Operation | Forward | Backward |
|---|---|---|
| Addition `y = a + b` | `y = a + b` | `da = dy, db = dy` |
| Subtraction `y = a - b` | `y = a - b` | `da = dy, db = -dy` |
| Multiplication `y = a * b` | `y = a * b` | `da = dy * b, db = dy * a` |
| MatMul `Y = A @ B` | `Y = A @ B` | `dA = dY @ B.T, dB = A.T @ dY` |
| ReLU `y = relu(x)` | `y = max(0, x)` | `dx = dy * (x > 0)` |
| Sum `y = sum(x)` | `y = sum(x)` | `dx = dy * ones_like(x)` |
| Mean `y = mean(x)` | `y = sum(x) / n` | `dx = dy / n` |
| Cross-Entropy | Fused softmax + log + NLL | Analytical gradient |
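These rules are easy to spot-check numerically. For example, the MatMul row can be verified with plain NumPy against a central-difference gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))

# Scalar loss L = sum(A @ B); the MatMul row gives dA = dY @ B.T with dY = ones
dY = np.ones((4, 5))
dA_analytic = dY @ B.T

# Central-difference check, one entry of A at a time
eps = 1e-5
dA_numeric = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        Ap, Am = A.copy(), A.copy()
        Ap[i, j] += eps
        Am[i, j] -= eps
        dA_numeric[i, j] = ((Ap @ B).sum() - (Am @ B).sum()) / (2 * eps)

assert np.allclose(dA_analytic, dA_numeric, atol=1e-4)
```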
The test suite (test_ag_tensor.py) verifies:
- Correctness: Gradient checks via numerical differentiation
- Operations: Individual ops (add, mul, matmul, relu, etc.)
- Broadcasting: Dimension handling in forward and backward
- Accumulation: Multiple uses of same tensor
- End-to-end: 2-layer MLP training
Current limitations (intentional for educational scope):
- 2D tensors only: No support for 1D vectors or N-D tensors
- Limited operators: No conv2d, pooling, normalization, etc.
- No optimizer abstraction: Manual SGD updates required
- No model serialization: Can't save/load trained models
- Single precision only: Float32 operations only
- Basic broadcasting: Only supports simple broadcasting patterns
- Memory efficiency: No memory pooling or kernel fusion
Potential extensions for learning:
- Add 1D and N-D tensor support
- Implement convolution and pooling operators
- Add batch normalization and layer normalization
- Create optimizer classes (Adam, RMSprop); a minimal SGD sketch appears after this list
- Implement model serialization
- Add kernel fusion for common operation patterns
- Support mixed precision training
- Add gradient clipping and regularization
- Implement data parallel training
- Add more activation functions (tanh, sigmoid, gelu)
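As a starting point for the optimizer-abstraction item above, here is a minimal SGD class that simply wraps the manual update loop from the MLP example. It is a sketch only, built on the `data`, `grad`, and `grad.fill()` attributes shown earlier.

```python
class SGD:
    """Minimal optimizer sketch wrapping the manual SGD update."""

    def __init__(self, params, lr=0.01):
        self.params = list(params)
        self.lr = lr

    def step(self):
        for p in self.params:
            p.data -= p.grad * self.lr  # same update as the manual loop

    def zero_grad(self):
        for p in self.params:
            p.grad.fill(0.0)            # reset accumulated gradients

# Usage: opt = SGD([W1, b1, W2, b2], lr=0.01)
#        loss.backward(); opt.step(); opt.zero_grad()
```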
Key concepts demonstrated:
- Automatic differentiation: How frameworks like PyTorch compute gradients
- Dynamic computation graphs: Building graphs on-the-fly during forward pass
- Operator overloading: Creating intuitive Python APIs
- CUDA programming: Writing efficient GPU kernels
- Python/C++ interop: Using pybind11 for native extensions
- Memory management: Handling GPU memory in C++
- Gradient computation: Chain rule and backpropagation
# Print tensor values
print(tensor.numpy())
# Check tensor properties
print(tensor.shape) # (rows, cols)
print(tensor.is_cuda) # True/False
print(tensor._op) # Operation that created this tensor
# Disable autograd for inference
from mygrad.engine import no_grad
with no_grad():
    output = model(input)  # No graph construction
# Verify gradients manually
def numerical_gradient(f, x, eps=1e-5):
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old_val = x[idx]
        x[idx] = old_val + eps
        fxh = f()
        x[idx] = old_val - eps
        fxl = f()
        grad[idx] = (fxh - fxl) / (2 * eps)
        x[idx] = old_val
        it.iternext()
    return grad

| Feature | BareGrad | micrograd | PyTorch | JAX |
|---|---|---|---|---|
| Tensor rank | 2D only | Scalar | N-D | N-D |
| Backend | CUDA | CPU | CUDA/CPU | XLA |
| Graph type | Dynamic | Dynamic | Dynamic | Static (JIT) |
| Target | Education | Education | Production | Research |
| Lines of code | ~1500 | ~200 | ~1M | ~500k |
This is an educational project. Feel free to:
- Add new operators
- Optimize existing kernels
- Improve documentation
- Fix bugs
- Add tests
Developed as part of NYU's ML Systems course (Fall 2024).
Inspired by:
- PyTorch - Industry-standard deep learning framework
- micrograd by Andrej Karpathy - Minimal scalar autograd engine
- tinygrad by George Hotz - Minimal tensor library
Key concepts from:
- CS231n (Stanford) - Backpropagation and neural networks
- CUDA Programming Guide - GPU kernel optimization
- pybind11 documentation - Python/C++ bindings
MIT License - Free to use for educational purposes!
If you use this code for educational purposes, please cite:
@misc{baregrad2024,
  title={BareGrad: A Minimal Deep Learning Framework with CUDA Acceleration},
  author={NYU ML Systems Course},
  year={2024},
  howpublished={\url{https://github.com/yourusername/baregrad}}
}

Built with ❤️ for ML Systems education