Commit ba25dc0

submission: SFT — Random Feature Adapters (Learning adapters on random linear maps)
- RandomFeatureMLP: fixed random projection R (seed 314159+layer, 0 bytes stored), only output projection W_out is learned; halves MLP parameter storage
- SP8192, 11L×512d, GQA 8H/4KV, RandomFeatureMLP 4×, partial RoPE 16/64
- Depth recurrence L3-5 ×2 (17 virtual layers), parallel residuals L7+
- MuonEq-R (momentum=0.99, WD=0.095), EMA (decay=0.9965), warmdown 72%
- GPTQ int6 (SDClip k=12.85) + int8 embeddings + Brotli-11
- Score-first SGD TTT (lr=0.005, 3ep/32K, Issue #1017 compliant)
- Sliding window eval (seq_len=2048, stride=64)
- ~24.4M stored params + ~11.5M free RF params → ~13.4MB artifact

Implements 'Learning adapters on random linear maps' from Requests for PRs
1 parent 75700cb commit ba25dc0

5 files changed

Lines changed: 1626 additions & 0 deletions

Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
# SFT — Stochastic Feature Transformer (Random Feature Adapters)

## Key Innovation: Learning Adapters on Random Linear Maps

This submission implements **RandomFeatureMLP**, the specific approach OpenAI requested
in "Requests for PRs" (Issue #942): *"Learning adapters on random linear maps"*.

### Core Idea

In a standard Transformer MLP:
```
y = W_out(σ(W_in(x)))  # W_in and W_out are BOTH learned
```

In our RandomFeatureMLP:
```
R = random_gaussian(seed=314159 + layer_idx)  # Fixed, 0 bytes stored
y = W_out(σ(R @ x))                           # Only W_out is learned
```

The random projection matrix `R` is regenerated from a deterministic seed at load time,
costing **exactly 0 bytes** in the artifact. This halves MLP parameter storage per layer.
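
A minimal PyTorch sketch of this idea, not the code in `train_gpt.py`. It assumes the seed rule `314159 + layer_idx` quoted above, a `1/sqrt(dim)` scaling of `R`, and GELU standing in for the unspecified activation σ. The attribute name `w_out` is also illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RandomFeatureMLP(nn.Module):
    """MLP whose input projection R is a fixed, seed-derived Gaussian matrix."""

    def __init__(self, dim: int, hidden: int, layer_idx: int, base_seed: int = 314159):
        super().__init__()
        # A dedicated CPU generator makes R reproducible from the seed alone.
        gen = torch.Generator().manual_seed(base_seed + layer_idx)
        # Assumption: scale by 1/sqrt(dim) to keep pre-activations near unit variance.
        R = torch.randn(hidden, dim, generator=gen) / dim ** 0.5
        # persistent=False keeps R out of state_dict, so it is never serialized.
        self.register_buffer("R", R, persistent=False)
        # The only learned weights: the output projection (the "adapter").
        self.w_out = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GELU is a placeholder for the activation written as sigma above.
        return self.w_out(F.gelu(F.linear(x, self.R)))


mlp = RandomFeatureMLP(dim=16, hidden=64, layer_idx=3)
print(list(mlp.state_dict().keys()))  # only the learned output projection
```

Because `R` is a non-persistent buffer, rebuilding the module with the same `layer_idx` reproduces it exactly, and a checkpoint only needs to carry `w_out`.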

### Why This Works

- **Random Features Theory** (Rahimi & Recht, NeurIPS 2007): Random projections followed
  by nonlinear activations approximate kernel functions. The learned output projection
  `W_out` acts as an adapter that maps these random features to useful representations.
- **Parameter Efficiency**: With 50% fewer stored MLP parameters per layer, we can fit more
  layers or spend the saved bytes on other model quality improvements.
- **Training Speed**: `R` receives no weight gradient (the gradient with respect to the
  input still passes through it), so the MLP backward pass needs three matmuls instead
  of four, making it roughly 25% cheaper per MLP layer. This allows more training steps
  within the 10-minute budget.
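
A rough accounting of where a figure of about 25% can come from. It counts only the dense matmuls of one MLP and assumes the input gradient must still be propagated through `R` to earlier layers. The function name and the 2·d·h FLOPs-per-matmul convention are illustrative assumptions, not measurements from the training script.

```python
def mlp_matmul_flops(dim: int, hidden: int, frozen_input_proj: bool) -> dict:
    """Per-token matmul FLOPs for a two-matrix MLP (activation cost ignored)."""
    one_matmul = 2 * dim * hidden  # one multiply-add counted as 2 FLOPs
    forward = 2 * one_matmul       # input projection + output projection
    # Backward: each linear layer needs an input gradient and a weight gradient.
    backward_matmuls = 4
    if frozen_input_proj:
        backward_matmuls -= 1      # no weight gradient for the fixed matrix R
    return {"forward": forward, "backward": backward_matmuls * one_matmul}


learned = mlp_matmul_flops(512, 2048, frozen_input_proj=False)
random_feat = mlp_matmul_flops(512, 2048, frozen_input_proj=True)
backward_saving = 1 - random_feat["backward"] / learned["backward"]
total_saving = 1 - sum(random_feat.values()) / sum(learned.values())
print(f"backward saving: {backward_saving:.1%}")          # 25.0%
print(f"forward+backward saving: {total_saving:.1%}")     # 16.7%
```

Under this count the saving applies to the MLP backward pass only. Across forward and backward together it is closer to one sixth, and attention cost is unchanged.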

### Architecture

| Component | Detail |
|-----------|--------|
| Tokenizer | SentencePiece BPE 8192 |
| Layers | 11 physical (17 virtual with depth recurrence) |
| Model dim | 512 |
| Heads | 8 query / 4 KV (GQA) |
| Head dim | 64 |
| MLP | RandomFeatureMLP 4× expansion (2048 hidden) |
| RoPE | Partial (16/64 dims), base=10000 |
| QK-Gain | 5.25 (learnable per head) |
| Logit softcap | 30.0 |
| Embeddings | Tied input/output |
| Depth recurrence | Loop layers 3-5, 2 extra iterations, enable at 35% |
| Parallel residuals | From layer 7 (GPT-J style) |
| U-net skip | Sigmoid-gated skip connections |
| LN scale | 1/√(2L+1) residual scaling |
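
The "17 virtual layers" figure follows from the depth-recurrence row: 11 physical layers plus 3 looped layers × 2 extra iterations. The sketch below lists one possible execution order. It assumes 0-based layer indices and that the block 3-5 is repeated as a unit; the actual ordering in `train_gpt.py` may differ.

```python
def virtual_layer_order(num_layers=11, loop_start=3, loop_end=5, extra_iterations=2):
    """Return the physical layer indices executed in one forward pass.

    Assumption: the block [loop_start..loop_end] runs (1 + extra_iterations)
    times in a row, reusing the same weights on every pass.
    """
    before = list(range(0, loop_start))
    block = list(range(loop_start, loop_end + 1))
    after = list(range(loop_end + 1, num_layers))
    return before + block * (1 + extra_iterations) + after


order = virtual_layer_order()
print(len(order))  # 17
print(order)
```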

### Training

| Hyperparameter | Value |
|----------------|-------|
| Optimizer | Muon (row-normalized, WD=0.095) + Adam |
| Matrix LR | 0.022 |
| Scalar LR | 0.02 |
| Embed LR | 0.03 (tied) |
| Muon momentum | 0.99 (warmup from 0.92 over 1500 steps) |
| Batch tokens | 786,432 |
| Seq len | 2048 |
| Warmdown | 72% of wallclock |
| EMA | decay=0.9965 |
| Grad clip | 0.3 |
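
Two rows of this table describe schedules rather than constants. A small sketch of both, assuming the momentum warmup is linear (the table does not state its shape) and using hypothetical helper names:

```python
def muon_momentum(step: int, start=0.92, end=0.99, warmup_steps=1500) -> float:
    """Interpolate momentum from start to end; the linear shape is an assumption."""
    frac = min(step / warmup_steps, 1.0)
    return (1 - frac) * start + frac * end


def ema_update(ema: list, weights: list, decay=0.9965) -> list:
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1 - decay) * w for e, w in zip(ema, weights)]


for step in (0, 750, 1500, 5000):
    print(step, round(muon_momentum(step), 4))
```

With decay 0.9965 the EMA averages over roughly 1 / (1 − 0.9965) ≈ 286 recent steps.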

### Quantization & Compression

- **GPTQ int6** for weight matrices (SDClip k=12.85)
- **int8** for embeddings (SDClip k=20.0)
- **Brotli-11** compression (fallback to zlib)
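
The Brotli-with-zlib-fallback step can be sketched as follows. The helper names and the codec tag are hypothetical, and the zlib level is an assumption. `brotli` here is the third-party Python package of that name.

```python
import zlib

try:
    import brotli  # optional third-party package
except ImportError:
    brotli = None


def compress_blob(data: bytes) -> tuple:
    """Compress with Brotli quality 11 when available, else zlib level 9."""
    if brotli is not None:
        return "brotli", brotli.compress(data, quality=11)
    return "zlib", zlib.compress(data, 9)


def decompress_blob(codec: str, blob: bytes) -> bytes:
    """Invert compress_blob using the codec tag it returned."""
    if codec == "brotli":
        if brotli is None:
            raise RuntimeError("blob was written with Brotli, but the package is missing")
        return brotli.decompress(blob)
    return zlib.decompress(blob)


payload = bytes(range(64)) * 1024
codec, blob = compress_blob(payload)
print(codec, len(payload), "->", len(blob))
```

Storing the codec tag alongside the blob lets the loader pick the matching decompressor instead of guessing.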

### Evaluation

- **Sliding window**: seq_len=2048, stride=64
- **Score-first TTT**: SGD lr=0.005, momentum=0.9, 3 epochs, 32K token chunks
  (legal per Issue #1017 Condition 3: score-before-update)
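
A simplified sketch of how sliding-window spans with these settings can be laid out so that each token is scored exactly once, with up to `seq_len - stride` tokens of left context. The function name is hypothetical, and the exact bookkeeping in `train_gpt.py` (for example target shifting) is not reproduced here.

```python
def sliding_windows(num_tokens: int, seq_len: int = 2048, stride: int = 64):
    """Yield (start, end, score_from) spans covering num_tokens.

    Tokens in [score_from, end) are scored; tokens in [start, score_from)
    serve only as context.
    """
    start = 0
    scored_until = 0
    while scored_until < num_tokens:
        end = min(start + seq_len, num_tokens)
        yield start, end, scored_until
        scored_until = end
        start += stride


spans = list(sliding_windows(5000))
print(spans[:3])  # [(0, 2048, 0), (64, 2112, 2048), (128, 2176, 2112)]
```

After the first window, each window scores only its final 64 tokens, which is why a small stride costs many more forward passes than non-overlapping evaluation.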

### Parameter Budget

| Component | Params (stored) | Params (free) |
|-----------|-----------------|---------------|
| Embedding | 4,194,304 | - |
| Attention (×11) | 8,650,752 | - |
| MLP output proj (×11) | 11,534,336 | - |
| MLP random proj (×11) | - | 11,534,336 |
| Controls/norms | ~50,000 | - |
| **Total** | **~24.4M** | **~11.5M** |

Stored parameters at int6+Brotli ≈ 13.4 MB (2.6 MB headroom under 16 MB cap).
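
The table entries can be reproduced from the architecture settings. This sketch assumes bias-free Q, K, V and O projections, which is what the attention figure implies, and takes the ~50,000 controls/norms count as given.

```python
vocab, dim, layers = 8192, 512, 11
n_heads, n_kv_heads, head_dim = 8, 4, 64
hidden = 4 * dim  # 2048

embedding = vocab * dim                       # tied input/output, counted once
q_and_o = 2 * dim * (n_heads * head_dim)      # query and output projections
k_and_v = 2 * dim * (n_kv_heads * head_dim)   # GQA: 4 KV heads instead of 8
attention = layers * (q_and_o + k_and_v)
mlp_out = layers * hidden * dim               # learned and stored
mlp_random = layers * dim * hidden            # regenerated from seed, not stored
controls = 50_000                             # approximate, from the table

stored = embedding + attention + mlp_out + controls
print(f"embedding   {embedding:>12,}")
print(f"attention   {attention:>12,}")
print(f"mlp_out     {mlp_out:>12,}")
print(f"stored      {stored:>12,}")
print(f"free (R)    {mlp_random:>12,}")
```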

### How to Run

```bash
# Smoke test (1 GPU)
torchrun --nproc_per_node=1 train_gpt.py

# Full run (8×H100 SXM)
torchrun --nproc_per_node=8 train_gpt.py
```
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
#!/usr/bin/env bash
# SFT Random Feature Adapters: 8×H100 SXM leaderboard run
# Requires: brotli sentencepiece torch-cuda flash-attn-3
#   pip install brotli sentencepiece
# Data download (run once):
#   MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
#   python3 data/cached_challenge_fineweb.py --variant sp8192

set -euo pipefail
# This script sits in records/track_10min_16mb/<record>/, three levels below the repo root.
cd "$(dirname "$0")/../../.."  # repo root

SCRIPT="records/track_10min_16mb/2026-04-17_SFT_RandomFeatureAdapters/train_gpt.py"
export DATASETS_DIR="${DATASETS_DIR:-./data/datasets/fineweb10B_sp8192}"
export TOKENIZER_PATH="${TOKENIZER_PATH:-./data/tokenizers/fineweb_8192_bpe.model}"
export VOCAB_SIZE=8192
export TTT_ENABLED=1
export SEED="${SEED:-42}"

echo "=== SFT Random Feature Adapters seed=${SEED} ==="
torchrun \
  --standalone \
  --nproc_per_node=8 \
  "$SCRIPT" \
  2>&1 | tee "train_seed${SEED}.log"

echo "=== Submission size ==="
wc -c final_model.int6.ptz
# Sum compressed model and training script, then compare against the 16,000,000-byte cap.
python3 -c "
sz = $(wc -c < final_model.int6.ptz)
code_sz = $(wc -c < "$SCRIPT")
total = sz + code_sz
print(f'model: {sz:,} bytes')
print(f'code: {code_sz:,} bytes')
print(f'total: {total:,} bytes [{\"PASS\" if total <= 16_000_000 else \"FAIL\"}]')
"
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
#!/usr/bin/env bash
# Quick smoke test on 1 GPU (60 second wall clock, no TTT, just trains + validates)
# Run from repo root:
#   bash records/track_10min_16mb/2026-04-17_SFT_RandomFeatureAdapters/run_smoke_1gpu.sh

set -euo pipefail
# This script sits in records/track_10min_16mb/<record>/, three levels below the repo root.
cd "$(dirname "$0")/../../.."  # repo root

SCRIPT="records/track_10min_16mb/2026-04-17_SFT_RandomFeatureAdapters/train_gpt.py"
export DATASETS_DIR="${DATASETS_DIR:-./data/datasets/fineweb10B_sp8192}"
export TOKENIZER_PATH="${TOKENIZER_PATH:-./data/tokenizers/fineweb_8192_bpe.model}"
export VOCAB_SIZE=8192
export MAX_WALLCLOCK_SECONDS=60
export TTT_ENABLED=0
export SLIDING_WINDOW_ENABLED=0
export SEED=42
export WARMUP_STEPS=5
export VAL_LOSS_EVERY=25
export TRAIN_LOG_EVERY=10
export GPTQ_CALIBRATION_BATCHES=4
export GPTQ_RESERVE_SECONDS=8
export NUM_LOOPS=0

echo "=== Smoke test (1 GPU, 60s) ==="
torchrun \
  --standalone \
  --nproc_per_node=1 \
  "$SCRIPT" \
  2>&1 | tee smoke_test.log
echo "=== Smoke test done ==="
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
{
  "author": "AVINASH0052",
  "github_id": "AVINASH0052",
  "name": "SFT — Random Feature Adapters",
  "blurb": "Stochastic Feature Transformer: replaces learned MLP input projections with FIXED random Gaussian matrices (regenerated from seed, 0 bytes stored). Only the output projection is learned, halving MLP storage cost. Based on Random Features theory (Rahimi & Recht 2007). First submission implementing 'Learning adapters on random linear maps' as requested in Requests for PRs. Architecture: SP8192, 11L×512d, GQA 8H/4KV, RandomFeatureMLP 4×, depth recurrence L3-5, parallel residuals, EMA, GPTQ int6 + Brotli.",
  "date": "2026-04-17T00:00:00Z",
  "val_loss": 0.0,
  "val_bpb": 0.0,
  "bytes_total": 0,
  "bytes_code": 0
}
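
The result fields in this metadata are still zero placeholders. A small sketch of how they could be checked and the byte counts filled in after a run: the helper names are hypothetical, and it assumes `bytes_total` means compressed model plus training script, as in the size check of the leaderboard run script.

```python
import json
import os

RESULT_FIELDS = ("val_loss", "val_bpb", "bytes_total", "bytes_code")


def with_measured_sizes(metadata: dict, model_path: str, code_path: str) -> dict:
    """Return a copy with bytes_code and bytes_total read from files on disk."""
    code_bytes = os.path.getsize(code_path)
    updated = dict(metadata)
    updated["bytes_code"] = code_bytes
    # Assumption: total = compressed model + training script.
    updated["bytes_total"] = code_bytes + os.path.getsize(model_path)
    return updated


def unfilled_fields(metadata: dict) -> list:
    """List result fields that are missing or still at their zero placeholder."""
    return [key for key in RESULT_FIELDS if not metadata.get(key)]


placeholder = json.loads(
    '{"val_loss": 0.0, "val_bpb": 0.0, "bytes_total": 0, "bytes_code": 0}'
)
print(unfilled_fields(placeholder))
# ['val_loss', 'val_bpb', 'bytes_total', 'bytes_code']
```

The validation loss and bits-per-byte still have to come from an actual training log; nothing here computes them.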