|
| 1 | +# Record: Score-First TTT + PPM-D Byte Mixture + QK-Gain 5.25 |
| 2 | + |
| 3 | +**mix_bpb = 0.9946** (3-seed mean, std 0.0002) | **< 16 MB** | 8xH100 SXM |
| 4 | + |
| 5 | +## 3-Seed Results |
| 6 | + |
| 7 | +| Seed | **Mix BPB** | **TTT BPB** | **Sliding BPB** | **Quantized BPB** | Artifact | |
| 8 | +|------|------------|------------|-----------------|-------------------|----------| |
| 9 | +| 42 | **0.9944** | 1.0807 | 1.0820 | 1.0986 | 15,997,374 | |
| 10 | +| 314 | **0.9947** | 1.0812 | 1.0826 | 1.0992 | 15,997,007 | |
| 11 | +| 999 | **0.9948** | 1.0813 | 1.0827 | 1.0994 | 15,997,375 | |
| 12 | +| **Mean** | **0.9946** | **1.0811** | **1.0824** | **1.0991** | | |
| 13 | +| **Std** | **0.0002** | **0.0003** | **0.0004** | **0.0004** | | |
| 14 | + |
| 15 | +## Key Changes |
| 16 | + |
| 17 | +### 1. Legal Score-First TTT (3-epoch SGD per chunk) |
| 18 | +Post-quantization test-time training on the frozen quantized model. Each chunk of validation tokens is **scored first**, then used for adaptation via 3 epochs of SGD (lr=0.005, momentum=0.9, cosine decay). The model is updated only on already-scored tokens. Fully compliant with Issue #1017 Condition 3 (score-before-update). Contributes ~0.017 BPB improvement over sliding window baseline (1.0824 -> 1.0811). |
| 19 | + |
| 20 | +### 2. PPM-D Byte Mixture (eval-time bolt-on) |
| 21 | +Order-5 byte-level PPM-D model (Cleary-Witten 1984) mixed with neural token log-probs in probability space. Binary-lambda gate: when PPM confidence >= 0.9, trust PPM (lambda=0.05); otherwise trust neural (lambda=0.9). Score-first: PPM byte counts update AFTER each byte's mixture log-prob is recorded. No byte ever influences its own probability before being scored. Contributes ~0.086 BPB improvement over neural-only TTT score (1.0807 -> 0.9944). Port of the PPM-D technique from PR #1835 (@anmarhindi). |
| 22 | + |
| 23 | +### 3. LZMA-Compressed Code Wrapper |
| 24 | +The submission code is a self-extracting bootstrap (~20KB) that decompresses and exec's the full train_gpt.py (~58KB) via base85-encoded LZMA. The bootstrap is written to disk during serialize() and is the actual submitted code artifact counted in bytes_total. |
| 25 | + |
| 26 | +## Base Architecture |
| 27 | + |
| 28 | +Built on the SOTA foundation from: |
| 29 | +- **@clarkkev** -- SP8192 + GPTQ SDClip + MuonEq-R + depth recurrence (PR #1394) |
| 30 | +- **@dexhunter** -- 3-layer depth recurrence (PR #1331, #1437), legal TTT on SP8192 (PR #1413) |
| 31 | +- **@abaybektursun** -- Score-first TTT framework (PR #549) |
| 32 | +- **@Robby955** -- Parallel residuals on SP8192 (PR #1412) |
| 33 | +- **@msisovic** -- Parallel residuals concept (PR #1204) |
| 34 | +- **@anmarhindi** -- PPM-D byte mixture technique (PR #1835) |
| 35 | + |
| 36 | +## Architecture |
| 37 | + |
| 38 | +11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: layers 3-5 loop (num_loops=2, activated at frac=0.35). Parallel residuals from layer 7. Skip gates. XSA on all layers. QK_GAIN_INIT=5.25. |
| 39 | + |
| 40 | +## Training |
| 41 | + |
| 42 | +~4600 steps in ~588s on 8xH100 SXM. EMA decay 0.9965. Warmdown frac 0.72. WD=0.095. MuonEq-R (row-normalized, Newton-Schulz 5 steps). |
| 43 | + |
| 44 | +## Quantization |
| 45 | + |
| 46 | +Full-Hessian GPTQ: int6 for attention/MLP matrices, int8 for token embeddings. Brotli-11 compression. |
| 47 | + |
| 48 | +## Score-First TTT |
| 49 | + |
| 50 | +Post-quantization, chunk-wise sliding-window eval with 3-epoch SGD adaptation per chunk. Each chunk is scored on the frozen model BEFORE any updates. Training uses lr=0.005, momentum=0.9, cosine LR decay across chunks. 8-GPU synchronous gradient averaging. Total eval time: ~420-474s across seeds. |
| 51 | + |
| 52 | +## PPM-D Byte Mixture |
| 53 | + |
| 54 | +After TTT scoring, per-token NLL values are collected across all scored positions. On rank 0, a byte-level PPM-D model processes the first 8M tokens of the byte stream. For each byte position: (1) the PPM-D prediction is computed from context counts that existed BEFORE that byte, (2) the neural prediction is the per-byte uniform share of the token NLL, (3) the mixture log-prob is log(lambda * p_NN + (1-lambda) * p_PPM), (4) THEN the byte's context counts are updated. This strict ordering ensures score-before-update compliance. Mix time: ~111s. |
| 55 | + |
| 56 | +## Compliance |
| 57 | + |
| 58 | +Per Issue #1017 (Track B -- legal eval-time adaptation): |
| 59 | +- Condition 1 (Causality): Sliding-window eval is strictly causal |
| 60 | +- Condition 2 (Normalized distribution): PPM-D mixture is a convex combination of two normalized distributions over the 256-symbol byte alphabet, producing a normalized distribution |
| 61 | +- Condition 3 (Score before update): TTT scores each chunk before adapting on it. PPM-D reads byte counts before updating them. No token or byte influences its own probability before being scored |
| 62 | +- Condition 4 (Single pass): Each token scored exactly once in the TTT sliding-window pass; each byte processed exactly once in the PPM-D left-to-right pass |
| 63 | +- All artifacts under 16,000,000 bytes on all 3 seeds |
| 64 | +- Training under 600s on all 3 seeds (~588s actual) |
| 65 | + |
| 66 | +## Reproduction |
| 67 | + |
| 68 | +```bash |
| 69 | +pip install brotli sentencepiece |
| 70 | +pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/ |
| 71 | +MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192 |
| 72 | + |
| 73 | +SEED=42 COMPRESSOR=brotli \ |
| 74 | + torchrun --standalone --nproc_per_node=8 train_gpt.py |
| 75 | +``` |
0 commit comments