Titans CUDA Framework 🚀

A high-performance CUDA implementation of the Titans neural memory architecture from "Titans: Learning to Memorize at Test Time" (Behrouz et al., 2024).

Overview

Titans introduces a novel neural memory module that learns at test time via gradient descent. This framework provides optimized CUDA implementations using:

CUDA Kernels — Custom kernels for memory operations
CUB — Block/warp-level primitives for reductions
Thrust — High-level parallel algorithms
cuBLAS — Optimized matrix operations

Architecture

Titans Memory Module
├── Memory: MLP that updates at test time
├── Surprise: Gradient + momentum signal
├── Forgetting: Weight decay for memory management
└── Variants: MAC, MAG, MAL

The Core Update Rule

S_t = η · S_{t-1} - θ · ∇ℓ(M_{t-1}; x_t)    # Surprise (momentum + gradient)
M_t = (1 - α) · M_{t-1} + S_t                # Memory update with forgetting

Where:

M = Neural memory (MLP weights)
S = Surprise signal (accumulated gradients)
η = Momentum coefficient
θ = Learning rate
α = Forgetting rate (weight decay)
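
The two equations above can be checked against a small CPU reference before any kernel is written. The sketch below is an illustration rather than the framework's implementation: `UpdateParams` and `titans_update_step` are hypothetical names, the weights are treated as one flattened array, and the gradient is assumed to have been computed elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameter bundle; field names mirror the symbols above.
struct UpdateParams {
    float eta;    // η, momentum coefficient
    float theta;  // θ, learning rate
    float alpha;  // α, forgetting rate
};

// One update step over flattened memory weights M and surprise state S.
// `grad` holds ∇ℓ(M_{t-1}; x_t), computed by the caller.
inline void titans_update_step(std::vector<float>& M,
                               std::vector<float>& S,
                               const std::vector<float>& grad,
                               const UpdateParams& p) {
    assert(M.size() == S.size() && M.size() == grad.size());
    for (std::size_t i = 0; i < M.size(); ++i) {
        S[i] = p.eta * S[i] - p.theta * grad[i];  // S_t = η·S_{t-1} − θ·∇ℓ
        M[i] = (1.0f - p.alpha) * M[i] + S[i];    // M_t = (1 − α)·M_{t-1} + S_t
    }
}
```

Because every element is updated independently, the same loop body maps directly onto one GPU thread per weight.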

Project Structure

titans-cuda/
├── include/
│   ├── titans/
│   │   ├── memory.cuh          # Neural memory module
│   │   ├── surprise.cuh        # Surprise computation
│   │   ├── projections.cuh     # Key-value projections
│   │   ├── variants.cuh        # MAC, MAG, MAL variants
│   │   └── utils.cuh           # Utilities
│   └── common/
│       ├── cuda_utils.cuh      # Error checking, timing
│       └── tensor.cuh          # Simple tensor wrapper
├── src/
│   ├── memory.cu               # Memory module implementation
│   ├── surprise.cu             # Surprise kernels
│   ├── projections.cu          # Projection kernels
│   └── variants.cu             # Variant implementations
├── kernels/
│   ├── naive/                  # Baseline implementations
│   ├── optimized/              # Optimized versions
│   └── experimental/           # Cutting-edge optimizations
├── tests/
│   ├── test_memory.cu          # Memory module tests
│   ├── test_surprise.cu        # Surprise computation tests
│   └── benchmarks.cu           # Performance benchmarks
├── examples/
│   ├── simple_memory.cu        # Basic usage example
│   ├── sequence_modeling.cu    # Sequence task example
│   └── benchmark_vs_pytorch.py # Compare with PyTorch
├── python/
│   └── titans_cuda/            # Python bindings (optional)
├── CMakeLists.txt
└── README.md

Building

Prerequisites

CUDA Toolkit 12.x
CMake 3.20+
C++17 compiler
(Optional) PyTorch for Python bindings

Build

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

Run Tests

cd build
ctest --verbose

Run Benchmarks

./benchmarks --all

Implementation Roadmap

Phase 1: Core Components ✅

Phase 2: Optimizations 🔄

Phase 3: Variants

MAC (Memory as Context)
MAG (Memory as Gate)
MAL (Memory as Layer)

Phase 4: Advanced

Chunked parallel training
Tensor Core support (FP16/BF16)
Multi-GPU support
Python/PyTorch bindings

Usage Example

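#include <titans/memory.cuh>
#include <titans/projections.cuh>

int main() {
    // Create memory module
    titans::NeuralMemory memory(
        /*dim=*/256,
        /*depth=*/2,
        /*hidden_mult=*/4.0f
    );

    // Create update config. Member assignment keeps this valid C++17;
    // designated initializers (.lr = ...) are a C++20 feature.
    titans::UpdateConfig config{};
    config.lr = 0.01f;          // learning rate (θ)
    config.momentum = 0.9f;     // momentum coefficient (η)
    config.forgetting = 0.01f;  // forgetting rate (α)

    // `projections`, `input`, and `seq_len` are assumed to be set up by the
    // caller; their construction is not shown here.

    // Process sequence
    for (int t = 0; t < seq_len; t++) {
        // Project input to key-value
        auto [key, value] = projections.forward(input[t]);

        // Query memory
        auto output = memory.forward(key);

        // Update memory (test-time learning!)
        memory.update(key, value, config);
    }

    return 0;
}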

Benchmarks

Performance relative to a PyTorch baseline (measurements pending):

Operation          PyTorch   Ours   Speedup
Memory Forward     TBD       TBD    -
Surprise Compute   TBD       TBD    -
Full Update        TBD       TBD    -
1K Sequence        TBD       TBD    -
10K Sequence       TBD       TBD    -

Key Optimizations

1. Fused Forward-Backward

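Instead of launching separate forward and backward passes, compute both in one kernel so that intermediate activations stay in registers or shared memory rather than being written to global memory and read back.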
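
The fusion idea can be illustrated on the CPU for a single linear layer. This is a sketch under stated assumptions, not the framework's kernel: `fused_forward_backward` is a hypothetical name, the memory is one row-major matrix W, and the loss is assumed to be the squared error ℓ = ||W·k − v||². Each row of W is read once, and the per-row error never leaves a local variable between the forward and backward halves.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y = W·k, the loss ||y − v||², and dW = 2·(y − v)·kᵀ in one pass.
// W and dW are row-major with v.size() rows and k.size() columns.
inline float fused_forward_backward(const std::vector<float>& W,
                                    const std::vector<float>& k,
                                    const std::vector<float>& v,
                                    std::vector<float>& y,
                                    std::vector<float>& dW) {
    const std::size_t rows = v.size();
    const std::size_t cols = k.size();
    assert(W.size() == rows * cols);
    y.assign(rows, 0.0f);
    dW.assign(rows * cols, 0.0f);
    float loss = 0.0f;
    for (std::size_t i = 0; i < rows; ++i) {
        // Forward: dot product of row i with the key.
        float acc = 0.0f;
        for (std::size_t j = 0; j < cols; ++j) acc += W[i * cols + j] * k[j];
        y[i] = acc;
        // Backward: reuse the error immediately while it is still local.
        const float err = acc - v[i];
        loss += err * err;
        for (std::size_t j = 0; j < cols; ++j) dW[i * cols + j] = 2.0f * err * k[j];
    }
    return loss;
}
```

In a GPU version, one natural mapping would assign rows to warps or thread blocks so that `err` lives in a register; that mapping is a design assumption and is not shown here.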

2. Shared Memory MLP

For small memory MLPs, keep weights in shared memory during the update step.

3. Warp-Level Gradient Accumulation

Use CUB's warp-level primitives for efficient gradient accumulation across threads.

4. Vectorized Access

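Use float4 loads/stores so each thread moves 128 bits per memory instruction. Pointers must be 16-byte aligned, and consecutive threads should still touch consecutive addresses so that accesses coalesce.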

5. Chunked Processing

Process sequences in chunks for better parallelism, following the TTT paper's approach.
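
A minimal sketch of the chunking idea, assuming the mini-batch formulation in which every gradient inside a chunk is taken at the weights from the start of that chunk. The scalar memory (`w`, with prediction w·k and loss (w·k − v)²) and the names `grad_scalar` and `chunked_update` are hypothetical and chosen only to keep the example short. Under that assumption the gradients within a chunk do not depend on each other, so only the momentum and forgetting recurrence remains sequential.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Gradient of (w·k − v)² with respect to w.
inline float grad_scalar(float w, float k, float v) {
    return 2.0f * (w * k - v) * k;
}

// Processes the sequence chunk by chunk and returns the final weight.
inline float chunked_update(float w,
                            const std::vector<float>& keys,
                            const std::vector<float>& vals,
                            std::size_t chunk,
                            float eta, float theta, float alpha) {
    assert(chunk > 0 && keys.size() == vals.size());
    float S = 0.0f;
    std::vector<float> g(chunk);
    for (std::size_t start = 0; start < keys.size(); start += chunk) {
        const std::size_t n = std::min(chunk, keys.size() - start);
        const float w0 = w;  // weight frozen at the start of the chunk
        // Independent per-token gradients: this loop could run in parallel.
        for (std::size_t i = 0; i < n; ++i)
            g[i] = grad_scalar(w0, keys[start + i], vals[start + i]);
        // Sequential recurrence over the chunk's gradients.
        for (std::size_t i = 0; i < n; ++i) {
            S = eta * S - theta * g[i];
            w = (1.0f - alpha) * w + S;
        }
    }
    return w;
}
```

With a chunk size of 1 this reduces to the token-by-token update rule; larger chunks expose more parallel work but use gradients that are stale by up to one chunk, so the results differ from the sequential rule.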

References

Titans: Learning to Memorize at Test Time (Behrouz et al., 2024)
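Learning to (Learn at Test Time): RNNs with Expressive Hidden States (Sun et al., 2024) - TTT
It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization (Behrouz et al., 2025) - Miras Framework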

License

MIT

Built with 🦀 for learning and research
