This repository contains a PyTorch reimplementation of HOPE (High-order Optimization & Perception Engine), based on the paper "Nested Learning: The Illusion of Deep Learning".
Unlike traditional Deep Learning which stacks layers, Nested Learning (NL) decomposes the architecture into "levels" based on update frequency.
- Code Location: `src/utils/trainer.py`
- Implementation Note: The training loop must handle Multi-Time Scale Updates. Parameters in different levels are updated at different frequencies, so you cannot call a single `optimizer.step()` for all parameters every iteration. The trainer tracks the global step and only updates a given parameter group (e.g., "High Frequency Neurons" vs. "Low Frequency Neurons") when `step % frequency == 0`.
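For illustration, here is a minimal sketch of that scheduling logic, assuming one optimizer per level; the function name and dict keys are hypothetical, not the repository's actual API:

```python
import torch

# Hedged sketch: one optimizer per level. Gradients accumulate between a level's
# updates and are applied only when the global step hits that level's frequency.
def multi_timescale_step(global_step: int, levels: list) -> None:
    """levels: dicts like {"optimizer": torch.optim.Optimizer, "frequency": int} (illustrative)."""
    for level in levels:
        if global_step % level["frequency"] == 0:
            level["optimizer"].step()       # update only this parameter group
            level["optimizer"].zero_grad()  # clear its accumulated gradients
```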
The CMS (Continuum Memory System) replaces the traditional Feed-Forward Network (FFN). It is a chain of MLP blocks where each block is updated at its own frequency.
- Code Location: `src/memory/cms.py`
- Key Equation: Implement Equation 30 for the forward pass and Equation 31 for the update logic.
- Structure: An `nn.ModuleList` of MLPs. The forward pass is $y_t = \mathrm{MLP}^{(f_k)}(\cdots \mathrm{MLP}^{(f_1)}(x_t))$.
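A minimal sketch of that chained forward pass, assuming plain two-layer MLP blocks; the real `src/memory/cms.py` also attaches the per-level update frequencies described above:

```python
import torch.nn as nn

class CMS(nn.Module):
    """Chain of MLP blocks: y_t = MLP^{(f_k)}( ... MLP^{(f_1)}(x_t) )."""
    def __init__(self, dim: int, num_levels: int, hidden_mult: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Linear(dim, hidden_mult * dim),
                nn.GELU(),
                nn.Linear(hidden_mult * dim, dim),
            )
            for _ in range(num_levels)
        ])

    def forward(self, x):
        for block in self.blocks:  # apply levels in order f_1 ... f_k
            x = block(x)
        return x
```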
This is the sequence-mixing core. It uses a "Neural Learning Module" that learns to modify itself by learning its own update algorithm.
- Code Location: `src/models/titans.py`
- Internal Optimizer: The model uses a variant of gradient descent with a regression loss as its internal forward-pass mechanism.
- Key Equation: The update rule for the weight $W_{t+1}$ follows Equation 28:
  $$W_{t+1} = W_t(I - x_t x_t^\top) - \eta_{t+1} \nabla_{y_t} \mathcal{L}(W_t; x_t) \otimes x_t$$
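As a reference point, here is a hedged, single-sample sketch of that update in PyTorch, assuming `W` has shape `(d_out, d_in)`, `x_t` has shape `(d_in,)`, and the regression loss is squared error against a target `v_t` (the target name is an assumption):

```python
import torch

def titans_update(W, x_t, v_t, eta):
    # Internal "forward pass" and regression loss L = ||W x_t - v_t||^2
    y_t = W @ x_t
    grad_y = 2.0 * (y_t - v_t)                        # ∇_{y_t} L
    outer = torch.outer(grad_y, x_t)                  # ∇_{y_t} L ⊗ x_t, shape (d_out, d_in)
    eye = torch.eye(x_t.numel(), dtype=x_t.dtype, device=x_t.device)
    return W @ (eye - torch.outer(x_t, x_t)) - eta * outer   # Equation 28
```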
The paper argues that optimizers (like Adam or Momentum) are associative memory modules.
- Code Location: `src/optimizers/deep_opt.py`
- Deep Momentum Gradient Descent (DMGD): Implement Equation 23, where the momentum term is replaced by a neural network (e.g., an MLP) that compresses gradients.
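An illustrative sketch of the idea, assuming the momentum buffer is replaced by a small MLP over the flattened gradient; the class and function names are hypothetical and the exact parameterization of Equation 23 may differ:

```python
import torch
import torch.nn as nn

class DeepMomentum(nn.Module):
    """Learned 'momentum': an MLP that compresses the gradient into an update direction."""
    def __init__(self, num_params: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_params, hidden),
            nn.SiLU(),
            nn.Linear(hidden, num_params),
        )

    def forward(self, grad_flat):
        return self.net(grad_flat)

def dmgd_step(param, momentum_net, lr):
    # Replace the usual momentum buffer with the network's output (DMGD-style update).
    with torch.no_grad():
        update = momentum_net(param.grad.flatten()).view_as(param)
        param.add_(update, alpha=-lr)
```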
You need to define the frequency hierarchy for the CMS layers:

```yaml
model:
  dim: 768
  depth: 12
  vocab_size: 50257
  cms:
    # Define levels, ordered from most to least frequently updated
    # Frequencies are relative to the unit time step
    levels:
      - name: "level_1"
        frequency: 1     # Updates every step (High Freq)
        chunk_size: 16
      - name: "level_2"
        frequency: 16    # Updates every 16 steps
        chunk_size: 1000 # 1M in paper, scaled down for testing
      - name: "level_3"
        frequency: 1000  # Updates rarely (Low Freq)
        chunk_size: 0    # 0 implies rarely updated (pre-training knowledge)
```

To train the HOPE model on Wikitext:

```bash
python train.py --config config/hope_config.yaml --dataset wikitext-103
```

Note: If you are on a less powerful machine, you can train with `config/hope_mini_config.yaml` instead.
When implementing `src/models/hope.py`, ensure you distinguish it from a standard Transformer (a minimal block sketch follows this list):
- No fixed FFN: Replace standard FFN blocks with the `CMS` module.
- Dynamic Projections: The $Q, K, V$ projections are not static; they are part of the optimization flow described in the "Self-Modifying Titans" section.
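A hypothetical block skeleton showing where those two changes land; `mixer` and `cms` stand in for the Self-Modifying Titans and CMS modules described above, and the pre-norm residual layout is an assumption rather than the repository's actual design:

```python
import torch.nn as nn

class HOPEBlock(nn.Module):
    def __init__(self, dim: int, mixer: nn.Module, cms: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = mixer   # e.g. the Self-Modifying Titans module (Q/K/V live in its inner optimization flow)
        self.norm2 = nn.LayerNorm(dim)
        self.cms = cms       # the CMS chain replaces the standard FFN

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        x = x + self.cms(self.norm2(x))
        return x
```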