| 1 | +# OmniClaw Submission — Parameter Golf |
| 2 | + |
| 3 | +## Result |
| 4 | +- **val_bpb**: ~1.16 (work in progress) |
| 5 | +- **Model size**: 54.8M params → ~14.8MB int6+brotli (under 16MB limit) |
| 6 | +- **Format**: int6+brotli roundtrip (the primary competition format)
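
As a sanity check on the 54.8M figure, the count can be reproduced from the dimensions listed under Architecture. This is a rough sketch, not the submission's own accounting: it assumes bias-free linear layers, a two-matrix MLP, a head dimension of `dim / q_heads`, and it ignores norms, gates, and other small scalar parameters. The function name `count_params` is hypothetical.

```python
def count_params(dim=640, layers=11, q_heads=10, kv_heads=5, mlp_mult=4, vocab=8192):
    """Approximate parameter count for a GQA transformer with tied embeddings."""
    head_dim = dim // q_heads            # 64, assuming head_dim = dim / q_heads
    kv_dim = kv_heads * head_dim         # 320: K and V are narrower under GQA
    attn = dim * dim + 2 * dim * kv_dim + dim * dim   # Q, K+V, output projection
    mlp = 2 * dim * (mlp_mult * dim)     # up and down projections (assumed, no gate)
    embed = vocab * dim                  # counted once because embeddings are tied
    return embed + layers * (attn + mlp)


total = count_params()
print(f"{total:,}")  # 54,804,480
```

Under these assumptions the matrices alone account for about 54.8M parameters, which agrees with the stated model size.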
| 7 | + |
| 8 | +## Architecture |
| 9 | +- **Model dim**: 640, **Layers**: 11, **Heads**: 10 (Q), 5 (KV) — GQA |
| 10 | +- **MLP mult**: 4; input and output embeddings are tied
| 11 | +- **Vocab**: 8192 (SP8192 BPE tokenizer) |
| 12 | +- **Depth recurrence**: Layers 3-5 looped 2× (effective 14 layers, no extra params) |
| 13 | +- **Parallel residuals**: Layers 7+ (GPT-J style) |
| 14 | +- **SmearGate**: Blends each token's embedding with its predecessor's embedding
| 15 | +- **Partial RoPE**: Rotary embedding applied to only 16 dimensions (dim=16) instead of the full head dimension
| 16 | +- **QK-Gain**: init=5.25, stabilizes attention with GQA |
| 17 | +- **Logit softcap**: 30.0 |
| 18 | +- **XSA**: Cross-sequence attention on last 4 layers (7-10) |
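
The depth-recurrence entry above can be illustrated with the order in which layers run in one forward pass. This is a minimal sketch: it assumes layers 3-5 are repeated as a block (3, 4, 5, 3, 4, 5) rather than each layer being repeated in place, which the text does not specify. The function name `layer_schedule` is hypothetical.

```python
def layer_schedule(num_layers=11, loop_start=3, loop_end=5, repeats=2):
    """Return the layer indices in execution order for one forward pass."""
    before = list(range(0, loop_start))
    # Assumption: the looped span is replayed as a whole block.
    looped = list(range(loop_start, loop_end + 1)) * repeats
    after = list(range(loop_end + 1, num_layers))
    return before + looped + after


schedule = layer_schedule()
print(schedule)       # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
print(len(schedule))  # 14
```

The schedule has 14 entries but only 11 distinct indices, which is where "effective 14 layers, no extra params" comes from.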
| 19 | + |
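
For the logit softcap, a common formulation squashes logits smoothly into `(-cap, cap)` with `tanh`. The sketch below assumes that form; the text only gives the cap value of 30.0, so the exact function used in the submission may differ.

```python
import math


def softcap(logits, cap=30.0):
    """Smoothly bound each logit to (-cap, cap); assumed tanh formulation."""
    return [cap * math.tanh(x / cap) for x in logits]


# Small logits pass through almost unchanged; large ones saturate near the cap.
print(softcap([0.1, 5.0, 100.0]))
```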
| 20 | +## Quantization |
| 21 | +- **Mixed int8/int6 + brotli**: int8 for embedding matrices (tok_emb/lm_head), int6 packed per-row for all other weights |
| 22 | +- **GPTQ-lite**: Per-row clip percentile search for optimal quantization |
| 23 | +- **Competition format**: int6+brotli roundtrip (primary); int8+zlib and int6+zstd outputs are also written for comparison
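
The per-row clip percentile search can be sketched as follows. This is an illustration under stated assumptions, not the submission's code: it assumes a symmetric int6 range of [-31, 31], squared reconstruction error as the selection criterion, and a hypothetical set of candidate percentiles. The function name `quantize_row` is also hypothetical, and rows are assumed non-empty.

```python
def quantize_row(row, bits=6, percentiles=(0.90, 0.95, 0.99, 1.0)):
    """Quantize one weight row, picking the clip percentile with lowest error.

    Returns (codes, scale) such that code * scale approximates each weight.
    """
    qmax = 2 ** (bits - 1) - 1           # 31 for int6 (symmetric range assumed)
    mags = sorted(abs(w) for w in row)
    best = None
    for p in percentiles:
        clip = mags[int(p * (len(mags) - 1))]      # magnitude at this percentile
        scale = clip / qmax if clip > 0 else 1.0   # avoid dividing by zero
        codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
        err = sum((w - c * scale) ** 2 for w, c in zip(row, codes))
        if best is None or err < best[0]:
            best = (err, codes, scale)
    return best[1], best[2]


row = [0.01, -0.02, 0.015, 0.9, -0.005, 0.02, -0.01, 0.0]
codes, scale = quantize_row(row)
print(codes, scale)
```

Clipping below the row maximum trades error on outliers for finer resolution on the remaining weights; the search keeps whichever candidate reconstructs the row best.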
| 24 | + |
| 25 | +## Training |
| 26 | +- **Optimizer**: Muon with EMA (decay=0.9965) |
| 27 | +- **QAT**: Disabled (LATE_QAT_THRESHOLD=0.0); relies on GPTQ-lite post-training quantization instead
| 28 | +- **Warmup**: 20 steps; warmdown over the last 3500 steps
| 29 | +- **Batch**: 786K tokens/step, seq_len=2048 |
| 30 | +- **Ortho init**: Enabled |
| 31 | +- **EMA start**: 50% of training |
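
The schedule and EMA settings above can be sketched as two small helpers. Several details here are assumptions: the warmup and warmdown are taken to be linear, the warmdown is taken to end at zero, and the EMA is taken to simply copy the weights until its start point. The total step count is not given in the text, so it is a required argument, and the names `lr_scale` and `ema_update` are hypothetical.

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_steps=3500):
    """Learning-rate multiplier in [0, 1]; linear ramps are an assumption."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    remaining = total_steps - step
    if remaining < warmdown_steps:
        return max(remaining, 0) / warmdown_steps
    return 1.0


def ema_update(ema, params, step, total_steps, decay=0.9965, start_frac=0.5):
    """Update EMA weights; before the start point the EMA just tracks params."""
    if step < start_frac * total_steps:
        return list(params)
    return [decay * e + (1 - decay) * p for e, p in zip(ema, params)]


# total_steps=10000 is a placeholder, not a value from the submission.
print(lr_scale(0, 10000), lr_scale(5000, 10000), lr_scale(8250, 10000))
```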
| 32 | + |
| 33 | +## Innovations |
| 34 | +1. **Depth recurrence**: Layers 3-5 are looped 2×, adding compute without adding parameters
| 35 | +2. **Score-first TTT**: Doc-independent LoRA test-time training (TTT) at eval time; an adaptation is kept only if it improves the loss and is rolled back otherwise
| 36 | +3. **Mixed int8/int6 quantization**: Keeping the embedding matrices at int8 better preserves embedding quality
| 37 | +4. **SmearGate**: Smooths token representations at boundaries
| 38 | +5. **XSA**: Cross-sequence attention allows information flow across sequence boundaries
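
The keep-or-rollback behavior described for score-first TTT can be sketched as a small control-flow helper. This shows only the decision logic under simplifying assumptions: the state is an arbitrary copyable object rather than LoRA weights, and the text does not say which tokens the loss is measured on, so `loss_fn` and `adapt_fn` are hypothetical callables supplied by the caller.

```python
import copy


def adapt_with_rollback(state, loss_fn, adapt_fn):
    """Try one adaptation; keep it only if the loss strictly improves."""
    baseline = loss_fn(state)
    candidate = adapt_fn(copy.deepcopy(state))   # never mutate the original
    adapted = loss_fn(candidate)
    if adapted < baseline:
        return candidate, adapted, True
    return state, baseline, False                # rollback: discard candidate


# Toy example: the state is one number and the loss is distance to 3.0.
def loss(s):
    return (s["w"] - 3.0) ** 2


def helpful(s):
    return {**s, "w": s["w"] + 1.0}


def harmful(s):
    return {**s, "w": s["w"] - 1.0}


print(adapt_with_rollback({"w": 1.0}, loss, helpful))  # ({'w': 2.0}, 1.0, True)
print(adapt_with_rollback({"w": 1.0}, loss, harmful))  # ({'w': 1.0}, 4.0, False)
```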