Commit 9a1a055

submission: OmniClaw v2 — 54.8M params, GQA, depth recurrence, mixed int8/int6+brotli
Architecture: 640d, 11L, 10Q/5KV heads, MLP_MULT=4, depth recurrence L3-5×2
Quantization: Mixed int8 (embeddings) + int6 (weights) + brotli
No QAT (LATE_QAT_THRESHOLD=0.0), GPTQ-lite post-hoc
Score-first TTT (rollback if worse), SmearGate, partial RoPE, XSA
val_bpb: ~1.16 (WIP, training in progress)
1 parent b86e3fa commit 9a1a055

4 files changed

Lines changed: 541 additions & 1342 deletions

File tree

.gitignore

Lines changed: 13 additions & 1 deletion
@@ -8,4 +8,16 @@ data/manifest.json
 data/docs_selected.jsonl
 .mypy_cache/
 .venv
-logs/
+logs/
+# Infrastructure/experimental scripts (not needed for submission)
+azure_*.sh
+run_h100.sh
+runpod_*.sh
+modal_train*.py
+parameter_golf_modal.py
+train_t4_modal.py
+upload_data_modal.py
+train_gpt_mlx_kl.py
+Parameter-Golf-OmniVersion/
+kaggle/
+*.ipynb

SUBMISSION.md

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
# OmniClaw Submission — Parameter Golf

## Result
- **val_bpb**: ~1.16 (work in progress)
- **Model size**: 54.8M params → ~14.8MB int6+brotli (under 16MB limit)
- **Format**: int6+brotli (primary competition format)
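
As a sanity check on the 54.8M figure, the weight-matrix count can be reproduced from the hyperparameters listed under Architecture. This is a minimal sketch that assumes a plain two-matrix MLP (up and down projection), one tied embedding matrix, and no biases; norm gains and small gate parameters are ignored. The helper name `count_params` is hypothetical and is not taken from the training script.

```python
def count_params(dim=640, layers=11, q_heads=10, kv_heads=5, mlp_mult=4, vocab=8192):
    """Count weight-matrix parameters only (assumes no biases, tied embeddings)."""
    head_dim = dim // q_heads            # 64
    kv_dim = kv_heads * head_dim         # 320: GQA shrinks the K and V projections
    attn = dim * dim + 2 * dim * kv_dim + dim * dim  # Q, K and V, output projection
    mlp = 2 * dim * (mlp_mult * dim)     # assumed: up and down projections only
    embed = vocab * dim                  # tied tok_emb/lm_head counted once
    return layers * (attn + mlp) + embed

total = count_params()
raw_int6_mb = total * 6 / 8 / 1e6        # size if every weight took exactly 6 bits
print(total)                   # 54804480
print(round(raw_int6_mb, 1))   # 41.1
```

Under these assumptions the count lands on about 54.8M. Packed at 6 bits per weight that is roughly 41.1MB before compression, so the reported ~14.8MB implies the brotli stage accounts for the rest of the reduction.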

## Architecture
- **Model dim**: 640, **Layers**: 11, **Heads**: 10 (Q), 5 (KV) — GQA
- **MLP mult**: 4 (with tied embeddings)
- **Vocab**: 8192 (SP8192 BPE tokenizer)
- **Depth recurrence**: Layers 3-5 looped 2× (effective 14 layers, no extra params)
- **Parallel residuals**: Layers 7+ (GPT-J style)
- **SmearGate**: Blends each token embedding with its predecessor's embedding
- **Partial RoPE**: dim=16 (instead of full)
- **QK-Gain**: init=5.25, stabilizes attention with GQA
- **Logit softcap**: 30.0
- **XSA**: Cross-sequence attention on last 4 layers (7-10)
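
The "effective 14 layers" figure follows from reusing layers 3-5 once more in the forward pass. The sketch below builds one possible execution order; it assumes 0-indexed layers and that the repeated pass runs immediately after the first pass through the block, neither of which is confirmed by the submission. `layer_schedule` is a hypothetical helper.

```python
def layer_schedule(num_layers=11, loop_start=3, loop_end=5, repeats=2):
    """Return the order in which layer indices run (assumed ordering)."""
    order = []
    for i in range(num_layers):
        order.append(i)
        if i == loop_end:
            # Re-run the looped block; the same weights are reused each time.
            for _ in range(repeats - 1):
                order.extend(range(loop_start, loop_end + 1))
    return order

schedule = layer_schedule()
print(schedule)       # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
print(len(schedule))  # 14
```

Only 11 distinct layers appear in the schedule, which is why the extra depth adds compute but no parameters.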

## Quantization
- **Mixed int8/int6 + brotli**: int8 for embedding matrices (tok_emb/lm_head), int6 packed per-row for all other weights
- **GPTQ-lite**: Per-row clip-percentile search that picks each row's quantization range
- **Competition format**: int6+brotli roundtrip (primary); also outputs int8+zlib and int6+zstd for comparison
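
To illustrate what a per-row clip-percentile search can look like, here is a pure-Python sketch. It is not the submission's GPTQ-lite code: the candidate percentiles, the symmetric range [-31, 31], and the squared-error criterion are all assumptions chosen for the example, and `quantize_row_int6` is a hypothetical name.

```python
def quantize_row_int6(row, percentiles=(100.0, 99.9, 99.5, 99.0)):
    """Quantize one weight row to int6 codes, trying several clip values.

    Returns (codes, scale) for the candidate with the lowest squared error.
    """
    qmax = 31  # assumed symmetric int6 range [-31, 31]
    mags = sorted(abs(x) for x in row)
    best = None
    for p in percentiles:
        # Clip value = the p-th percentile of absolute weights in this row.
        idx = min(len(mags) - 1, int(round(p / 100.0 * (len(mags) - 1))))
        clip = mags[idx]
        if clip == 0.0:
            continue
        scale = clip / qmax
        codes = [max(-qmax, min(qmax, round(x / scale))) for x in row]
        err = sum((c * scale - x) ** 2 for c, x in zip(codes, row))
        if best is None or err < best[0]:
            best = (err, codes, scale)
    if best is None:  # all-zero row: any scale works
        return [0] * len(row), 1.0
    return best[1], best[2]

codes, scale = quantize_row_int6([-1.0, -0.5, 0.0, 0.5, 1.0])
dequantized = [c * scale for c in codes]
```

A lower clip percentile trades error on a few outliers for finer resolution on the bulk of the row; the search simply keeps whichever candidate reconstructs the row best.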

## Training
- **Optimizer**: Muon with EMA (decay=0.9965)
- **QAT**: Disabled (LATE_QAT_THRESHOLD=0.0); relies on GPTQ-lite post-hoc quantization instead
- **Warmup**: 20 steps, warmdown over last 3500 iterations
- **Batch**: 786K tokens/step, seq_len=2048
- **Ortho init**: Enabled
- **EMA start**: 50% of training
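
The schedule settings above can be sketched as a learning-rate multiplier plus an EMA update. The linear ramp shapes, the decay to zero, and the `total_steps` value used in the usage line are assumptions for illustration; the submission only states the step counts, the EMA decay, and the 50% start point. All function names here are hypothetical.

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_steps=3500):
    """Learning-rate multiplier in [0, 1] (assumed linear warmup and warmdown)."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    warmdown_start = total_steps - warmdown_steps
    if step >= warmdown_start:
        return max(0.0, (total_steps - step) / warmdown_steps)
    return 1.0

def ema_active(step, total_steps, start_frac=0.5):
    """EMA averaging begins once start_frac of training has elapsed."""
    return step >= int(start_frac * total_steps)

def ema_update(ema, weight, decay=0.9965):
    """One EMA step for a single scalar weight."""
    return decay * ema + (1.0 - decay) * weight

# Usage with a hypothetical run length of 10000 steps.
scales = [lr_scale(s, 10000) for s in (0, 19, 5000, 8250)]
```

With a decay of 0.9965 the average has an effective horizon of roughly 1 / (1 - 0.9965) ≈ 286 steps, so starting it halfway through keeps early, poorly trained weights out of the average.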

## Innovations
1. **Depth recurrence** — Layers 3-5 are looped 2×, adding compute without unique parameters
2. **Score-first TTT** — Doc-independent LoRA TTT at eval time that keeps only adaptations that improve loss (rolls back otherwise)
3. **Mixed int8/int6 quantization** — int8 for the embedding matrices preserves embedding quality better than int6
4. **SmearGate** — Smooths token representations at boundaries
5. **XSA** — Cross-sequence attention allows information flow across sequence boundaries
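
The score-first, rollback-if-worse control flow from item 2 can be shown on a toy problem. In this sketch a single scalar parameter stands in for the LoRA weights and mean squared error stands in for the language-model loss; comparing the loss on the same chunk before and after the update is an assumption about how "improve loss" is measured. None of these names come from the submission.

```python
def mse(param, chunk):
    """Toy loss: mean squared error of a constant prediction."""
    return sum((x - param) ** 2 for x in chunk) / len(chunk)

def score_first_ttt(chunks, param=0.0, lr=0.25):
    """Score each chunk before adapting on it; keep an update only if it helps."""
    scores = []
    for chunk in chunks:
        before = mse(param, chunk)
        scores.append(before)  # reported score uses weights that have not seen this chunk
        grad = sum(2 * (param - x) for x in chunk) / len(chunk)
        candidate = param - lr * grad
        if mse(candidate, chunk) < before:
            param = candidate  # adaptation improved the loss: keep it
        # otherwise roll back by leaving param unchanged
    return scores, param

chunks = [[1.0, 1.0], [1.0, 1.0]]
print(score_first_ttt(chunks, lr=0.25))  # ([1.0, 0.25], 0.75)
print(score_first_ttt(chunks, lr=1.5))   # ([1.0, 1.0], 0.0)
```

With the small step size each update is kept and the second chunk scores better; with the oversized step every update overshoots, gets rolled back, and the scores stay where they started.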
