| 1 | +# Hardik Top5 Run |
| 2 | + |
| 3 | +Draft submission package for the OpenAI Parameter Golf `track_10min_16mb` track. |
| 4 | + |
| 5 | +This folder contains a self-contained copy of the current best local training script, prepared in the expected `records/...` format for pull request submission. The copied `train_gpt.py` has been patched so its default dataset and tokenizer paths resolve from the repository root even when executed from inside this records folder. |
| 6 | + |
| 7 | +## Status |
| 8 | + |
| 9 | +- Submission structure: ready |
| 10 | +- Script packaging: ready |
| 11 | +- Relative-path cleanup: ready |
| 12 | +- Reproducibility notes: ready |
| 13 | +- Final leaderboard claim: pending a fresh logged run for this exact script |
| 14 | + |
| 15 | +## Architecture Summary |
| 16 | + |
| 17 | +The current model is a Parameter Golf "podium build" based on the LeakyReLU^2 + TTT + Parallel Muon family, with the following default stack: |
| 18 | + |
| 19 | +- Vocabulary: SentencePiece 8192-token model |
| 20 | +- Backbone: 11 transformer layers, 512 hidden dim, 8 attention heads, 4 KV heads |
| 21 | +- MLP: 3.0x expansion with LeakyReLU(0.5)^2 activation |
| 22 | +- Residual layout: parallel residual attention + MLP path |
| 23 | +- Recurrence: block recurrence enabled by default on layers `4,5` with `RECURRENCE_LOOPS=3` |
| 24 | +- Attention extras: QK gain, partial RoPE (`ROPE_DIMS=16`), XSA on the last 4 layers |
| 25 | +- Token enrichments: Bigram hash embedding and shared value embeddings |
| 26 | +- Optimizers: AdamW for token/scalar groups + custom Parallel Muon for matrix banks |
| 27 | +- Averaging: EMA by default, optional SWA/LAWA |
| 28 | +- Compression path: mixed int6 / int8 quantization with lzma export |
| 29 | +- Eval extras: sliding-window validation and optional legal score-first TTT |
| 30 | + |
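As a rough illustration of the MLP settings listed above, the sketch below reads `LeakyReLU(0.5)^2` as a leaky ReLU with negative slope 0.5 followed by squaring, and derives the MLP width from the 3.0x expansion of the 512-wide hidden state. That reading is an assumption, since the packaged `train_gpt.py` is not reproduced here; `leaky_relu_squared` and `mlp_hidden_dim` are hypothetical names used only for this example.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.5) -> float:
    """Leaky ReLU followed by squaring (assumed reading of LeakyReLU(0.5)^2)."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y


def mlp_hidden_dim(model_dim: int = 512, expansion: float = 3.0) -> int:
    """Width of the MLP inner layer for a given expansion factor."""
    return int(model_dim * expansion)


if __name__ == "__main__":
    for value in (-2.0, 0.0, 3.0):
        print(value, leaky_relu_squared(value))
    print("MLP hidden dim:", mlp_hidden_dim())
```

Because the output is squared, negative inputs still produce a small positive activation instead of being zeroed out.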
| 31 | +## Default Model Size |
| 32 | + |
| 33 | +- Parameter count (default config): `31,581,276` |
| 34 | +- Code bytes for this packaged `train_gpt.py`: `97,310` |
| 35 | + |
| 36 | +Note: the contest artifact limit counts code bytes plus compressed model bytes. This folder does not include a verified compressed artifact size for the packaged script, because no fresh training/eval run of this copy has been logged. |
| 37 | + |
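To make the size note concrete, here is a small sketch of the byte budget arithmetic. The parameter count and code bytes come from this README; the limit of 16,000,000 bytes is an assumption based on reading "16mb" in the track name as decimal megabytes, so check the contest rules for the exact figure. The function names are hypothetical.

```python
PARAM_COUNT = 31_581_276          # from this README
CODE_BYTES = 97_310               # from this README
ASSUMED_LIMIT_BYTES = 16_000_000  # ASSUMPTION: "16mb" read as decimal megabytes


def model_byte_budget(limit_bytes: int = ASSUMED_LIMIT_BYTES,
                      code_bytes: int = CODE_BYTES) -> int:
    """Bytes left for the compressed model once the script itself is counted."""
    return limit_bytes - code_bytes


def max_bits_per_param(budget_bytes: int, param_count: int = PARAM_COUNT) -> float:
    """Average number of compressed bits each parameter may occupy."""
    return budget_bytes * 8 / param_count


if __name__ == "__main__":
    budget = model_byte_budget()
    print("model byte budget:", budget)
    print("average bits per parameter:", round(max_bits_per_param(budget), 3))
```

Under the assumed limit, the compressed model may average roughly 4 bits per parameter, which is why the verified compressed size matters before any claim is made.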
| 38 | +## Innovations Used |
| 39 | + |
| 40 | +1. SP8192 tokenizer defaults |
| 41 | +2. Depth recurrence through repeated middle blocks |
| 42 | +3. Parallel residual transformer blocks |
| 43 | +4. Learned QK gain scaling |
| 44 | +5. Parallel Muon / MuonEq-R optimizer path |
| 45 | +6. Hessian SDClip for GPTQ-style clipping |
| 46 | +7. GPTQ-style embedding quantization |
| 47 | +8. Optional legal score-first TTT with SGD or Adam |
| 48 | + |
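One way to picture depth recurrence through repeated middle blocks is as a layer execution schedule. The sketch below assumes 0-indexed layers, a contiguous recurrent group, and that `RECURRENCE_LOOPS=3` means the group runs three times in total. None of these details are confirmed by this README, and `layer_schedule` is a hypothetical helper, not a function from `train_gpt.py`.

```python
def layer_schedule(num_layers: int, recurrent_layers: list[int], loops: int) -> list[int]:
    """Return the order in which layer indices would execute.

    ASSUMPTION: the recurrent layers are contiguous and the whole group is
    run `loops` times in a row. The real script may define this differently.
    """
    if not recurrent_layers:
        return list(range(num_layers))
    first, last = recurrent_layers[0], recurrent_layers[-1]
    if recurrent_layers != list(range(first, last + 1)):
        raise ValueError("recurrent layers must be contiguous")
    before = list(range(first))
    after = list(range(last + 1, num_layers))
    return before + recurrent_layers * loops + after


if __name__ == "__main__":
    schedule = layer_schedule(num_layers=11, recurrent_layers=[4, 5], loops=3)
    print(schedule)
    print("effective depth:", len(schedule))
```

Under these assumptions the 11 stored layers would be applied 15 times per forward pass, adding depth without adding parameters.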
| 49 | +## Hardware Used |
| 50 | + |
| 51 | +- Packaging/validation of this submission folder: local Windows machine |
| 52 | +- Target contest hardware: 8x H100 80GB SXM |
| 53 | +- Final authoritative leaderboard run hardware for this exact script: `TBD` |
| 54 | + |
| 55 | +## Training Time |
| 56 | + |
| 57 | +- Default script wallclock cap: `600` seconds (`MAX_WALLCLOCK_SECONDS=600`) |
| 58 | +- Fresh measured 8xH100 runtime for this packaged copy: `TBD` |
| 59 | + |
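A wallclock cap of this kind is typically enforced by checking elapsed time before each step. The following is a minimal single-process sketch, assuming the cap is read from `MAX_WALLCLOCK_SECONDS` with a default of 600. The loop structure and the name `train_with_wallclock_cap` are hypothetical and not taken from the packaged script.

```python
import os
import time


def train_with_wallclock_cap(step_fn, max_steps: int, clock=time.perf_counter) -> int:
    """Run step_fn until max_steps is reached or the wallclock cap expires.

    Returns the number of steps that actually ran.
    """
    max_seconds = float(os.environ.get("MAX_WALLCLOCK_SECONDS", "600"))
    start = clock()
    steps_done = 0
    for _ in range(max_steps):
        # Check the budget before starting a step, so no step begins late.
        if clock() - start >= max_seconds:
            break
        step_fn()
        steps_done += 1
    return steps_done


if __name__ == "__main__":
    print(train_with_wallclock_cap(lambda: None, max_steps=5))
```

In a multi-GPU run, the ranks would also need to agree on when to stop; that coordination is omitted here.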
| 60 | +## Achieved Score |
| 61 | + |
| 62 | +- Fresh logged `val_bpb` for this packaged copy: `TBD` |
| 63 | +- Fresh logged `val_loss` for this packaged copy: `TBD` |
| 64 | +- Verified total submission bytes for this packaged copy: `TBD` |
| 65 | + |
| 66 | +Do not claim a leaderboard score from this folder until `train.log` and `submission.json` are updated from a real run of the included script. |
| 67 | + |
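For reference when filling in the two score fields above, bits per byte relates to a per-token loss as follows. This sketch assumes `val_loss` is the mean cross-entropy per token in nats and that the token and byte counts refer to the same validation text. How the packaged script computes `val_bpb` is not shown in this README, and `bits_per_byte` is a hypothetical helper.

```python
import math


def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert a mean per-token loss in nats into bits per byte of raw text."""
    total_bits = mean_loss_nats * num_tokens / math.log(2)
    return total_bits / num_bytes


if __name__ == "__main__":
    # Toy numbers only: one bit per token, four bytes per token.
    print(bits_per_byte(math.log(2), num_tokens=100, num_bytes=400))
```

The ratio of tokens to bytes depends on the tokenizer, so `val_loss` values from runs with different vocabularies are not directly comparable, while `val_bpb` values are.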
| 68 | +## Reproducibility |
| 69 | + |
| 70 | +The script is configured through environment variables and avoids absolute, machine-specific paths. |
| 71 | + |
| 72 | +- Seed default: `1337` |
| 73 | +- Dataset default: resolved from repo root as `data/datasets/fineweb10B_sp8192` |
| 74 | +- Tokenizer default: resolved from repo root as `data/tokenizers/fineweb_8192_bpe.model` |
| 75 | +- Optional acceleration imports (`flash_attn_interface`, `zstandard`) have safe fallbacks |
| 76 | +- No network calls are made during training or evaluation |
| 77 | + |
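The path and import behavior described above could be implemented along these lines. This is a sketch under stated assumptions: it assumes the script sits three directories below the repository root (`records/<track>/<run>/train_gpt.py`) and that `DATA_PATH` overrides the default. `repo_root_from`, `default_data_path`, and `HAVE_ZSTD` are hypothetical names, and what the real script does when an optional import is missing is not shown here.

```python
import os
from pathlib import Path


def repo_root_from(script_path: Path, levels_up: int = 3) -> Path:
    """Walk up from the script's folder to the assumed repository root."""
    return script_path.resolve().parents[levels_up]


def default_data_path(script_path: Path) -> Path:
    """Use DATA_PATH if set, else the dataset folder under the repo root."""
    default = repo_root_from(script_path) / "data" / "datasets" / "fineweb10B_sp8192"
    return Path(os.environ.get("DATA_PATH", str(default)))


# Optional dependency: record availability instead of failing at import time.
try:
    import zstandard  # noqa: F401
    HAVE_ZSTD = True
except ImportError:
    HAVE_ZSTD = False


if __name__ == "__main__":
    print(default_data_path(Path(__file__)))
    print("zstandard available:", HAVE_ZSTD)
```

Resolving from the script's own location rather than the current working directory is what lets the same defaults work from both the repository root and this records folder.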
| 78 | +## Run From This Folder |
| 79 | + |
| 80 | +From repository root: |
| 81 | + |
| 82 | +```bash |
| 83 | +cd records/track_10min_16mb/hardik_top5_run |
| 84 | +torchrun --standalone --nproc_per_node=8 train_gpt.py |
| 85 | +``` |
| 86 | + |
| 87 | +Example with explicit paths: |
| 88 | + |
| 89 | +```bash |
| 90 | +cd records/track_10min_16mb/hardik_top5_run |
| 91 | +DATA_PATH=../../../data/datasets/fineweb10B_sp8192 \ |
| 92 | +TOKENIZER_PATH=../../../data/tokenizers/fineweb_8192_bpe.model \ |
| 93 | +RUN_ID=hardik_top5_run_seed1337 \ |
| 94 | +SEED=1337 \ |
| 95 | +torchrun --standalone --nproc_per_node=8 train_gpt.py |
| 96 | +``` |
| 97 | + |
| 98 | +## What To Update Before PR |
| 99 | + |
| 100 | +1. Run the packaged script on the intended hardware. |
| 101 | +2. Replace the placeholder `train.log` with the real training log. |
| 102 | +3. Update `submission.json` with real `val_loss`, `val_bpb`, and `bytes_total`. |
| 103 | +4. If submitting as a new record, include enough independent seeds to satisfy the repo significance rule. |
| 104 | + |
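Step 3 above can be scripted so the three fields are always written together. The sketch below uses the field names given in this README (`val_loss`, `val_bpb`, `bytes_total`). It assumes `submission.json` is a flat JSON object and keeps any other keys unchanged. `update_submission` is a hypothetical helper, and the values must come from a real logged run.

```python
import json
from pathlib import Path


def update_submission(path: Path, val_loss: float, val_bpb: float, bytes_total: int) -> dict:
    """Merge measured results into submission.json, preserving other keys."""
    data = json.loads(path.read_text()) if path.exists() else {}
    data.update({
        "val_loss": val_loss,
        "val_bpb": val_bpb,
        "bytes_total": bytes_total,
    })
    path.write_text(json.dumps(data, indent=2) + "\n")
    return data
```

Call it with the numbers read from the real `train.log`. Parsing the log is left out because its format is not described here.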
| 105 | +## Notes |
| 106 | + |
| 107 | +- This folder intentionally mirrors the structure of existing successful records in `records/track_10min_16mb/`. |
| 108 | +- The root-level `train_gpt.py` remains the active development file; this folder is the frozen submission copy for PR review. |