OpenAI Parameter Golf challenge: train the best small LM that fits in a 16MB artifact (code + compressed model) and trains in ≤10 minutes on 8×H100 SXM GPUs. Submissions are evaluated by val_bpb (bits-per-byte) on the FineWeb validation set. Lower is better.
Challenge: March 18 – April 30, 2026. Prize: $1M in OpenAI compute credits.
- Conda env: `conda activate opg` (MUST be activated before any command)
- Python deps: see `requirements.txt`
- Data: `./data/datasets/fineweb10B_sp1024/`, tokenizer at `./data/tokenizers/fineweb_1024_bpe.model`
- `train_gpt.py`: Base training script (reference). Agent modifies a working copy.
- `data/`: Dataset and tokenizer (READ-ONLY, never modify)
- `records/`: Historical leaderboard submissions (READ-ONLY reference)
- `results.tsv`: Experiment log (untracked by git)
val_bpb = 1.1428 (thwu1, 2026-03-20)
This is the consensus of the top 3 leaderboard entries. Use as starting point.
| Parameter | Value |
|---|---|
| num_layers | 10 |
| model_dim | 512 |
| num_heads | 8 |
| num_kv_heads | 4 |
| mlp_mult | 3 (hidden=1536) |
| train_seq_len | 2048 |
| train_batch_tokens | 786,432 |
| vocab_size | 1024 |
| tie_embeddings | yes |
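
As a rough sanity check on the architecture table above, the parameter count can be estimated with a few lines of arithmetic. This is only a sketch: the function name `estimate_params` is hypothetical, and it assumes bias-free linear layers, a plain two-matrix MLP (up and down projection), and no extra parameters such as norms or gates.

```python
def estimate_params(num_layers=10, model_dim=512, num_heads=8, num_kv_heads=4,
                    mlp_mult=3, vocab_size=1024, tie_embeddings=True):
    """Approximate weight count (assumes no biases, norms, or gates)."""
    head_dim = model_dim // num_heads           # 64
    kv_dim = num_kv_heads * head_dim            # 256 (GQA shares K/V heads)
    attn = 2 * model_dim * model_dim            # Q and output projections
    attn += 2 * model_dim * kv_dim              # K and V projections
    mlp = 2 * model_dim * (mlp_mult * model_dim)  # up + down, hidden=1536
    embed = vocab_size * model_dim * (1 if tie_embeddings else 2)
    return num_layers * (attn + mlp) + embed


total = estimate_params()
print(total)                        # 24117248
print(round(total * 6 / 8 / 1e6, 2))  # 18.09 (MB at 6 bits/weight, before zstd)
```

Under these assumptions the raw 6-bit payload is slightly above 16,000,000 bytes, so the final artifact size depends on how well the quantized weights compress.
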
| Parameter | Value |
|---|---|
| matrix_lr | 0.02 |
| scalar_lr | 0.02 |
| tied_embed_lr | 0.03 |
| muon_momentum | 0.99 |
| momentum_warmup_start | 0.92 |
| momentum_warmup_steps | 1500 |
| muon_weight_decay | 0.04 |
| grad_clip_norm | 0.3 |
| warmdown_iters | 3000 |
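
The three momentum entries in the table above can be read as a warmup schedule from 0.92 to 0.99 over the first 1500 steps. The sketch below assumes linear interpolation, which the table does not state; `muon_momentum_at` is a hypothetical helper name.

```python
def muon_momentum_at(step, start=0.92, end=0.99, warmup_steps=1500):
    """Momentum at a given step, assuming a linear warmup then a constant."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end


for step in (0, 750, 1500, 3000):
    print(step, muon_momentum_at(step))
```
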
- int6 per-row quantization + zstd-22 compression
- FP16 tied embeddings
- SmearGate + BigramHash(4096+) + OrthoInit
- SWA every 50 steps, start_frac=0.4-0.5
- Sliding window eval (stride=64)
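
To illustrate the first item in the list above, here is a minimal pure-Python sketch of per-row int6 quantization. It assumes symmetric quantization with one scale per row and integer codes in [-31, 31]; the actual scheme in the leaderboard entries may differ, and the zstd compression step is not shown. The helper names are hypothetical.

```python
def quantize_row_int6(row):
    """Quantize one weight row to integers in [-31, 31] with a single scale."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 31.0 if max_abs > 0 else 1.0  # avoid division by zero
    q = [max(-31, min(31, round(v / scale))) for v in row]
    return q, scale


def dequantize_row(q, scale):
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in q]


row = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_row_int6(row)
restored = dequantize_row(q, scale)
# Round-trip error per weight is bounded by half a quantization step.
print(max(abs(a - b) for a, b in zip(row, restored)) <= scale / 2 + 1e-9)
```
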
conda activate opg
# Single GPU:
python train_gpt.py
# 2 GPUs:
torchrun --standalone --nproc_per_node=2 train_gpt.py

conda activate opg
torchrun --standalone --nproc_per_node=8 train_gpt.py

grep "val_bpb:" run.log
grep "peak_vram_mb:\|artifact.*bytes" run.log

- Create branch: `git checkout -b autoresearch/<tag>` from main
- Read `train_gpt.py`, `results.tsv`, and `git log` for full context
- Verify data exists in `./data/datasets/fineweb10B_sp1024/`
- Initialize `results.tsv` with header row
- Run iteration 0 (converged baseline) to establish baseline val_bpb
- Read git state: `git log --oneline -20` + `results.tsv`
- Make ONE focused change to `train_gpt.py`
- Write/update tests (TDD: tests BEFORE implementation)
- `git commit` the change
- Run: redirect output to `run.log` (do NOT flood context)
- Read results: `grep "^val_bpb:\|^peak_vram_mb:" run.log`
- If grep empty -> crash. Run `tail -n 50 run.log`, attempt fix
- Log to `results.tsv` (do NOT commit results.tsv)
- If val_bpb improved AND artifact <= 16MB -> run `/simplify`, then keep
- If val_bpb equal or worse -> `git revert` to previous good state
- NEVER STOP: run indefinitely until manually interrupted
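
The read-results and keep/discard steps of the loop above can be sketched in Python. This assumes the log contains lines of the form `val_bpb: <number>` at the start of a line, as the grep pattern suggests; `parse_metric` and `decide` are hypothetical helper names, not part of `train_gpt.py`.

```python
import re


def parse_metric(log_text, name):
    """Return the first '<name>: <float>' value at a line start, or None."""
    pattern = rf"^{re.escape(name)}:\s*([0-9.eE+-]+)"
    match = re.search(pattern, log_text, flags=re.MULTILINE)
    return float(match.group(1)) if match else None


def decide(log_text, best_bpb, artifact_bytes, limit=16_000_000):
    """Classify a run as 'crash', 'keep', or 'discard'."""
    bpb = parse_metric(log_text, "val_bpb")
    if bpb is None:
        return "crash"                      # no metric line: treat as a crash
    if bpb < best_bpb and artifact_bytes <= limit:
        return "keep"                       # strictly better and within budget
    return "discard"                        # equal, worse, or too large


log = "step 100\nval_bpb: 1.1400\npeak_vram_mb: 20000\n"
print(decide(log, best_bpb=1.1428, artifact_bytes=15_800_000))  # keep
```
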
- 10 minutes max per experiment (wall clock training)
- If a run exceeds 15 minutes, kill and treat as failure
- ~6 experiments/hour on dev hardware
commit val_bpb artifact_bytes status description
a1b2c3d 1.142760 15900000 keep baseline converged config
b2c3d4e 1.140000 15800000 keep added RevDEQ fixed-point layer
c3d4e5f 1.150000 16100000 discard MLA attention (artifact too large)
d4e5f6g 0.000000 0 crash soft routing OOM
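
A row in the format shown above could be produced with the standard `csv` module. This is a sketch that assumes tab-separated columns and six decimal places for val_bpb, matching the example rows; `format_row` is a hypothetical helper.

```python
import csv
import io

HEADER = ["commit", "val_bpb", "artifact_bytes", "status", "description"]


def format_row(commit, val_bpb, artifact_bytes, status, description):
    """Render one results.tsv line (tab-separated, newline-terminated)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow([commit, f"{val_bpb:.6f}", artifact_bytes, status, description])
    return buf.getvalue()


print("\t".join(HEADER))
print(format_row("a1b2c3d", 1.14276, 15900000, "keep",
                 "baseline converged config"), end="")
```
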
- Paper: https://arxiv.org/abs/2509.12917
- The main transformer backbone MUST use RevDEQ: model output defined as fixed point of a learned function
- Enables exact gradients, no regularization needed, fewer function evaluations
- Reversible design allows memory-efficient training
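
To make "output defined as the fixed point of a learned function" concrete, here is a scalar toy example. It uses plain damped fixed-point iteration, not the reversible solver from the paper, and the function `f`, the damping factor `beta`, and the tolerance are all illustrative assumptions.

```python
import math


def fixed_point(f, z0, beta=0.5, tol=1e-10, max_iter=1000):
    """Iterate z <- (1 - beta) * z + beta * f(z) until the update is tiny."""
    z = z0
    for n in range(1, max_iter + 1):
        z_next = (1.0 - beta) * z + beta * f(z)
        if abs(z_next - z) < tol:
            return z_next, n
        z = z_next
    return z, max_iter


# Toy "layer": a contraction, so a unique fixed point z* = f(z*) exists.
def f(z):
    return math.tanh(0.5 * z + 1.0)


z_star, iters = fixed_point(f, z0=0.0)
print(abs(z_star - f(z_star)) < 1e-8, iters < 1000)
```

In a real model `z` would be the hidden-state tensor and `f` a transformer block conditioned on the input.
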
- Inspired by Soft MoE (arxiv:2308.00951) but fully dense — ALL experts process ALL tokens
- No top-k selection, no token dropping, no sparse gating
- Routing weights via softmax over experts, with additional non-linearities encouraged:
  - Sigmoid gating on routing weights (like gated attention, arxiv:2505.06708)
  - Learned gate scalars per expert
- Fully differentiable, no discrete routing decisions
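
One possible reading of the dense routing described above is sketched below for a single token. How the softmax weight, sigmoid gate, and learned scalar are combined (here, multiplied) is an assumption, as are all names; experts are plain Python callables standing in for MLPs.

```python
import math


def softmax(xs):
    m = max(xs)                                   # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense_route(token, experts, router_logits, gate_logits, gate_scalars):
    """Weighted sum over ALL experts: no top-k, no token dropping."""
    weights = softmax(router_logits)
    out = [0.0] * len(token)
    for expert, w, g, s in zip(experts, weights, gate_logits, gate_scalars):
        y = expert(token)                         # every expert sees the token
        coeff = w * sigmoid(g) * s                # softmax * sigmoid gate * scalar
        out = [o + coeff * yi for o, yi in zip(out, y)]
    return out


experts = [lambda t: [2.0 * v for v in t], lambda t: [-v for v in t]]
print(dense_route([4.0, 8.0], experts, [0.0, 0.0], [0.0, 0.0], [1.0, 1.0]))
```

Every operation is smooth, so gradients reach all experts and the router on every token.
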
- Replace standard MHA/GQA with MLA
- Low-rank KV compression: project to latent space, cache compressed, decompress on-the-fly
- Decoupled RoPE: split heads into RoPE and non-RoPE components
- Absorb decompression into subsequent linear layers where possible
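
The effect of low-rank KV compression can be estimated with simple arithmetic. In the sketch below, `latent_dim=128` and `rope_dim=32` are hypothetical values chosen only for illustration, and the parameter count ignores the decoupled RoPE key projection.

```python
def gqa_cache_values(num_kv_heads=4, head_dim=64):
    """Values cached per token per layer with GQA (K and V)."""
    return 2 * num_kv_heads * head_dim


def mla_cache_values(latent_dim=128, rope_dim=32):
    """Values cached per token per layer with MLA (latent + shared RoPE key)."""
    return latent_dim + rope_dim


def gqa_kv_params(model_dim=512, num_kv_heads=4, head_dim=64):
    """Weights in the K and V projections under GQA."""
    return 2 * model_dim * num_kv_heads * head_dim


def mla_kv_params(model_dim=512, latent_dim=128, num_heads=8, head_dim=64):
    """Down-projection plus K and V up-projections (RoPE projection ignored)."""
    return model_dim * latent_dim + 2 * latent_dim * num_heads * head_dim


print(gqa_cache_values(), mla_cache_values())   # 512 160
print(gqa_kv_params(), mla_kv_params())         # 262144 196608
```
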
- Gated Attention (arxiv:2505.06708): apply head-specific sigmoid gate after SDPA
- Query-dependent sparse gating scores modulate attention output per head
- Mitigates attention sinks, improves long-context performance
- Enables larger learning rates and better training stability
- Negligible parameter overhead (one gate vector per head)
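
The head-specific gate described above can be sketched for one token as follows. It assumes one gate weight vector per head, dotted with the token's hidden state to give a scalar sigmoid gate for that head's SDPA output; names and shapes are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_heads(x, head_outputs, gate_weights):
    """Scale each head's attention output by a query-dependent sigmoid gate.

    x:            hidden state of the token, length model_dim
    head_outputs: per-head SDPA outputs, each of length head_dim
    gate_weights: one weight vector (length model_dim) per head
    """
    gated = []
    for head_out, w in zip(head_outputs, gate_weights):
        score = sigmoid(sum(xi * wi for xi, wi in zip(x, w)))
        gated.append([score * v for v in head_out])
    return gated


x = [1.0, 0.0]
heads = [[2.0, 4.0], [2.0, 4.0]]
gates = [[0.0, 0.0], [100.0, 0.0]]      # first head half-open, second fully open
print(gate_heads(x, heads, gates))
```

A gate near zero lets a head contribute almost nothing for a given token, which is one way to avoid dumping attention mass on a sink token.
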
- Artifact size <= 16,000,000 bytes (code + compressed model)
- Training time <= 600 seconds on 8xH100 SXM
- Must use FineWeb validation set for evaluation
- Tokenizer: SentencePiece BPE, vocab=1024
- All code MUST work with both single-GPU and multi-GPU (torchrun DDP)
- Dev on 1-2x L40S, validate on 8xH100
- Never use GPU-count-specific logic without proper world_size handling
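
A minimal sketch of world_size handling that works both under plain `python` and under `torchrun`, which sets the `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` environment variables. The helper names and the even-split rule for the batch are assumptions, not taken from `train_gpt.py`.

```python
import os


def dist_env(environ=None):
    """Return (rank, world_size, local_rank), defaulting to single-process."""
    env = os.environ if environ is None else environ
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    local_rank = int(env.get("LOCAL_RANK", 0))
    return rank, world_size, local_rank


def tokens_per_rank(train_batch_tokens, world_size):
    """Split the global batch evenly; fail loudly if it does not divide."""
    if train_batch_tokens % world_size != 0:
        raise ValueError(
            f"{train_batch_tokens} tokens not divisible by world_size={world_size}")
    return train_batch_tokens // world_size


for ws in (1, 2, 8):
    _, world_size, _ = dist_env({"WORLD_SIZE": str(ws)})
    print(world_size, tokens_per_rank(786_432, world_size))
```
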
- Write tests BEFORE implementation for each architectural change
- Test categories:
  - Shape tests: verify tensor dimensions through the model
  - Gradient tests: verify gradients flow (especially through RevDEQ fixed-point)
  - Quantization roundtrip tests: verify model survives int6+zstd
  - DDP tests: verify multi-GPU correctness
  - Artifact size tests: verify <= 16MB after compression
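
As an example of the last category, an artifact size check might look like the sketch below. It assumes the artifact is the sum of the on-disk sizes of the training script and the compressed model file; the file names in the usage comment are hypothetical.

```python
import os

LIMIT_BYTES = 16_000_000  # code + compressed model


def artifact_bytes(paths):
    """Total on-disk size of all files that make up the artifact."""
    return sum(os.path.getsize(p) for p in paths)


def fits_limit(paths, limit=LIMIT_BYTES):
    """True if the combined artifact is within the byte budget."""
    return artifact_bytes(paths) <= limit


# Example (paths are placeholders):
# assert fits_limit(["train_gpt.py", "model.int6.zst"])
```
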
- Run `/simplify` on every successful experiment before committing
- Keep code clean and minimal; complexity must earn its keep
- All else equal, simpler is better
- A 0.001 bpb improvement adding 20 lines of hacky code? Probably not worth it
- Removing code and getting equal results? Definitely keep
- Favor removing complexity over adding it
- Run 3 seeds (e.g., 42, 1337, 2024) on 8xH100
- Compute mean and std of val_bpb
- Create folder in `records/track_10min_16mb/YYYY-MM-DD_<name>/`
- Include: README.md, submission.json, train_gpt.py, training logs
- Submit PR to main
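
For the mean and std step above, the standard library is enough. The sketch assumes the sample standard deviation (n - 1 denominator) is wanted; the input values below are placeholders, not real results.

```python
import statistics


def summarize(val_bpbs):
    """Return (mean, sample std) of val_bpb across seeds."""
    return statistics.mean(val_bpbs), statistics.stdev(val_bpbs)


# Placeholder numbers standing in for the three seed runs.
mean, std = summarize([1.0, 2.0, 3.0])
print(mean, std)  # 2.0 1.0
```
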