| 1 | +# Scylla Sub-1.05 Design |
| 2 | + |
| 3 | +## Objective |
| 4 | + |
| 5 | +Reach a defensible sub-`1.05` `val_bpb` submission by reproducing the measured |
| 6 | +Scylla frontier before adding any new ideas. Local March records top out at |
| 7 | +`1.12278022`; the SP8192 frontier in PR #1797 reaches `1.06157`; the only |
| 8 | +measured sub-`1.05` lane found in the current repo/PR landscape is PR #1813 at |
| 9 | +`0.94166052` over three seeds. |
| 10 | + |
| 11 | +## Decision |
| 12 | + |
| 13 | +Use the PR #1813 Scylla lane as the primary path. Keep the current SP1024/SP8192 |
| 14 | +scripts intact and create an isolated Scylla reproduction lane with explicit |
| 15 | +provenance, asset checks, artifact checks, and launcher scripts. Do not mutate |
| 16 | +`train_gpt_kl.py` until the Scylla lane has reproduced the reference behavior. |
| 17 | + |
| 18 | +## Reference Configuration |
| 19 | + |
| 20 | +The target record is: |
| 21 | + |
| 22 | +- Record path: `records/track_10min_16mb/2026-04-25_Scylla_QK525_DepthRecurrence_Experiment` |
| 23 | +- Score: `0.94166052` 3-seed mean, std `0.00066536` |
| 24 | +- Seeds: `1337`, `42`, `2025` |
| 25 | +- Worst artifact: `15,868,157` bytes, leaving only `131,843` bytes under the |
| 26 | + decimal `16,000,000` byte cap. |
| 27 | +- Architecture: 11 physical layers, 512 dim, 8 query heads, 4 KV heads, Scylla |
| 28 | + vocab size `998`, tied embeddings, train seq len `2048`, XSA on all layers. |
| 29 | +- Core knobs: `QK_GAIN_INIT=5.25`, `NUM_LOOPS=2`, `LOOP_START=3`, |
| 30 | + `LOOP_END=5`, `ENABLE_LOOPING_AT=0.35`, `BIGRAM_VOCAB_SIZE=2816`, |
| 31 | + `BIGRAM_DIM=40`, `USE_GPTQ=1`, `GPTQ_RESERVE_MS=9000`, `TTT_ENABLED=0`. |
| 32 | +- Compression: full GPTQ int6, `torch.save` quant payload, `lzma.compress` |
| 33 | + preset 6. |
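The record above can be captured as plain data so that a launcher or dry-run check has a single source of truth. The following is a minimal sketch, not the PR #1813 launcher: the names `REFERENCE_ENV`, `render_env`, and `headroom_bytes` are hypothetical, and only the knob names and values come from the record itself. The seed is left out because the env var that carries it is not stated here.

```python
# Decimal byte cap and worst-case artifact size, both taken from the record above.
SIZE_CAP_BYTES = 16_000_000
WORST_ARTIFACT_BYTES = 15_868_157

# Core knobs from the reference record, as the string values a launcher would export.
REFERENCE_ENV = {
    "QK_GAIN_INIT": "5.25",
    "NUM_LOOPS": "2",
    "LOOP_START": "3",
    "LOOP_END": "5",
    "ENABLE_LOOPING_AT": "0.35",
    "BIGRAM_VOCAB_SIZE": "2816",
    "BIGRAM_DIM": "40",
    "USE_GPTQ": "1",
    "GPTQ_RESERVE_MS": "9000",
    "TTT_ENABLED": "0",
}

CANONICAL_SEEDS = (1337, 42, 2025)


def headroom_bytes(artifact_bytes, cap=SIZE_CAP_BYTES):
    """Bytes left under the decimal size cap."""
    return cap - artifact_bytes


def render_env(env):
    """Render the knobs as shell `export` lines for a dry-run comparison."""
    return "\n".join(f"export {key}={value}" for key, value in env.items())


if __name__ == "__main__":
    print(headroom_bytes(WORST_ARTIFACT_BYTES))  # 131843
    print(render_env(REFERENCE_ENV))
```

Keeping the values as strings mirrors how env vars reach the training script, so a dry-run diff compares like with like.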
| 34 | + |
| 35 | +## Why This Lane |
| 36 | + |
| 37 | +Compression-only work cannot close the gap from `1.1228` to sub-`1.05`. The |
| 38 | +useful compression findings are mostly negative: byte shuffle makes real |
| 39 | +artifacts larger, FP16 last-layer escape hatches exceed the cap, and INT4 hurts |
| 40 | +quality too much. PR #1813 crosses the target by changing the tokenizer/data |
| 41 | +regime and architecture schedule while still fitting under the size cap. |
| 42 | + |
| 43 | +## Architecture |
| 44 | + |
| 45 | +Create a separate lane with four responsibilities: |
| 46 | + |
| 47 | +1. **Provenance capture**: copy the PR #1813 `train_gpt.py`, logs, and |
| 48 | + `submission.json` into `frontier_sources/scylla_pr1813/` for local review. |
| 49 | +2. **Asset validation**: add a script that verifies the Scylla tokenizer and |
| 50 | + dataset assets exist before any paid GPU launch. |
| 51 | +3. **Run launch**: add a shell launcher that runs the exact reference config |
| 52 | + for one or all canonical seeds. |
| 53 | +4. **Artifact validation**: add a checker that sums code bytes + model bytes, |
| 54 | + compares the total against `16,000,000`, and fails when the remaining headroom falls below a configurable safety margin. |
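As an illustration of the asset-validation responsibility, here is a minimal sketch of a preflight check that can run on a CPU-only machine. The function names and the file names in the self-check are hypothetical; the real Scylla tokenizer and shard paths are not specified in this document, so the caller must supply them.

```python
import tempfile
from pathlib import Path


def missing_assets(required_paths):
    """Return the required paths that are absent or empty, in input order."""
    missing = []
    for raw in required_paths:
        path = Path(raw)
        # A zero-byte file usually means an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(str(path))
    return missing


def check_assets(required_paths):
    """Raise before launch if any required asset is missing or empty."""
    missing = missing_assets(required_paths)
    if missing:
        raise FileNotFoundError("missing or empty assets: " + ", ".join(missing))


if __name__ == "__main__":
    # Self-check with temporary fake files; no GPU or real dataset needed.
    with tempfile.TemporaryDirectory() as tmp:
        tokenizer = Path(tmp) / "tokenizer.model"  # hypothetical file name
        shard = Path(tmp) / "train_000.bin"        # hypothetical file name
        tokenizer.write_bytes(b"fake")
        print(missing_assets([tokenizer, shard]))  # only the shard is reported
```

Returning the list of missing paths separately from raising keeps the check easy to unit test with fake files.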
| 55 | + |
| 56 | +## Data Flow |
| 57 | + |
| 58 | +The launcher passes explicit env vars into the copied Scylla `train_gpt.py`. |
| 59 | +Training reads only training shards during optimization and GPTQ calibration. |
| 60 | +Validation remains disabled during training via `VAL_LOSS_EVERY=0`; scoring |
| 61 | +runs only after the wallclock stop. The artifact checker runs after training and |
| 62 | +validates the generated compressed model plus script size. |
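The post-training artifact check described above could look roughly like the sketch below. It assumes the submission size is simply the training script's bytes plus the compressed model's bytes; the function names and the `min_margin` parameter are hypothetical, and the cap is the decimal `16,000,000` used throughout this design.

```python
import tempfile
from pathlib import Path

SIZE_CAP_BYTES = 16_000_000  # decimal bytes, not MiB


def artifact_headroom(code_path, model_path, cap=SIZE_CAP_BYTES):
    """Bytes remaining under the cap after counting script plus model."""
    total = Path(code_path).stat().st_size + Path(model_path).stat().st_size
    return cap - total


def check_artifact(code_path, model_path, min_margin=0, cap=SIZE_CAP_BYTES):
    """Return the headroom, or raise if it is smaller than the safety margin."""
    headroom = artifact_headroom(code_path, model_path, cap)
    if headroom < min_margin:
        raise ValueError(
            f"headroom {headroom} bytes is below the required margin {min_margin}"
        )
    return headroom


if __name__ == "__main__":
    # Fake files stand in for the real script and compressed model.
    with tempfile.TemporaryDirectory() as tmp:
        code = Path(tmp) / "train_gpt.py"
        model = Path(tmp) / "model.bin"  # hypothetical artifact name
        code.write_bytes(b"c" * 100)
        model.write_bytes(b"m" * 200)
        print(check_artifact(code, model, min_margin=500, cap=1_000))  # 700
```

An over-cap artifact yields negative headroom, so it fails even with the default margin of zero.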
| 63 | + |
| 64 | +## Compliance Guardrails |
| 65 | + |
| 66 | +- Keep `TTT_ENABLED=0` for the first reproduction. |
| 67 | +- Reject cache/PPM/SLOT/ETLB-style additions in the Scylla preflight. |
| 68 | +- Allow Scylla recurrence + GPTQ only with the exact proven loop schedule. |
| 69 | +- Use decimal bytes, not MiB. |
| 70 | +- Treat the `131,843` byte PR #1813 margin as fragile; any code or serializer |
| 71 | + growth must be offset by measured artifact savings. |
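One way to enforce these guardrails is a pure function over the launch environment, sketched below. This is an interpretation, not the actual preflight: the substring match on env var names for cache/PPM/SLOT/ETLB features is a hypothetical heuristic (the real flag names are not given here), and the loop-schedule rule is read as "when `USE_GPTQ=1`, the loop knobs must equal the reference values".

```python
# Hypothetical heuristic: any enabled env var whose name contains one of these
# tokens is treated as a forbidden addition.
FORBIDDEN_TOKENS = ("CACHE", "PPM", "SLOT", "ETLB")

# Loop schedule from the reference record.
PROVEN_LOOP = {
    "NUM_LOOPS": "2",
    "LOOP_START": "3",
    "LOOP_END": "5",
    "ENABLE_LOOPING_AT": "0.35",
}


def preflight_violations(env):
    """Return human-readable guardrail violations for a launch environment."""
    violations = []
    if env.get("TTT_ENABLED", "0") != "0":
        violations.append("TTT_ENABLED must be 0")
    for key, value in env.items():
        flagged = any(token in key.upper() for token in FORBIDDEN_TOKENS)
        if flagged and value not in ("", "0"):
            violations.append(f"forbidden feature enabled: {key}")
    if env.get("USE_GPTQ") == "1":
        for key, expected in PROVEN_LOOP.items():
            if env.get(key) != expected:
                violations.append(f"{key} must be {expected} when GPTQ is on")
    return violations


if __name__ == "__main__":
    env = dict(PROVEN_LOOP, USE_GPTQ="1", TTT_ENABLED="0")
    print(preflight_violations(env))  # []
```

Returning a list rather than raising on the first problem lets the preflight report every violation in one pass.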
| 72 | + |
| 73 | +## Testing Strategy |
| 74 | + |
| 75 | +Local tests should run without GPU and without the Scylla dataset: |
| 76 | + |
| 77 | +- Python compile checks for new scripts. |
| 78 | +- Asset-check tests using temporary fake files. |
| 79 | +- Artifact-check tests using temporary fake model/code files. |
| 80 | +- Launcher dry-run or env-render check that confirms exact PR #1813 defaults. |
| 81 | + |
| 82 | +GPU validation is staged: |
| 83 | + |
| 84 | +1. Run one seed exactly, no ablations. |
| 85 | +2. Compare steps, artifact bytes, and final BPB to the PR #1813 logs. |
| 86 | +3. Run all three seeds only after the one-seed reproduction is within expected |
| 87 | + tolerance. |
| 88 | +4. Only then test narrow ablations: `BIGRAM_DIM=36/40/44`, |
| 89 | + `QK_GAIN_INIT=5.0/5.25`, loop `3-5` vs `4-5`, and LZMA preset `6/9`. |
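The stage-4 ablations can be enumerated mechanically. The sketch below assumes a one-at-a-time design, in which each run changes a single knob from the reference; the design does not state this, so treat it as an assumption. `LZMA_PRESET` is a hypothetical env var name for the compression preset, and the loop `3-5` vs `4-5` comparison is expressed as `LOOP_START` 3 vs 4 with `LOOP_END` fixed at 5.

```python
# Reference values for the knobs under ablation.
BASELINE = {
    "BIGRAM_DIM": "40",
    "QK_GAIN_INIT": "5.25",
    "LOOP_START": "3",
    "LOOP_END": "5",
    "LZMA_PRESET": "6",  # hypothetical env var name
}

# Candidate values listed in the staged plan above.
ABLATION_VALUES = {
    "BIGRAM_DIM": ("36", "40", "44"),
    "QK_GAIN_INIT": ("5.0", "5.25"),
    "LOOP_START": ("3", "4"),
    "LZMA_PRESET": ("6", "9"),
}


def one_at_a_time(baseline, values):
    """Return (name, env) pairs that each change exactly one knob."""
    runs = []
    for key, options in values.items():
        for option in options:
            if option == baseline[key]:
                continue  # the baseline itself is covered by the reproduction
            override = dict(baseline)
            override[key] = option
            runs.append((f"{key}={option}", override))
    return runs


if __name__ == "__main__":
    for name, _env in one_at_a_time(BASELINE, ABLATION_VALUES):
        print(name)
```

With the values above this yields five runs, which keeps each result attributable to a single change.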
| 90 | + |
| 91 | +## Rejected Approaches |
| 92 | + |
| 93 | +- **Compression-only local cleanup**: useful for hygiene, but the BPB gain is too small to reach the target. |
| 94 | +- **SP8192 as primary**: lower compliance risk, but the measured frontier is |
| 95 | + around `1.06157`, still above target. |
| 96 | +- **PPM/cache lane**: potentially strong but compliance-risky; keep it as a |
| 97 | + separate research lane, not the primary submission path. |