Parameter Golf — Autoresearch Project

Project Overview

OpenAI Parameter Golf challenge: train the best small LM that fits in a 16MB artifact (code + compressed model) and trains in ≤10 minutes on 8×H100 SXM GPUs. Submissions are evaluated by val_bpb (bits per byte) on the FineWeb validation set. Lower is better.
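As a reminder of what the metric measures, here is a minimal sketch of converting a mean per-token cross-entropy loss (in nats) into bits per byte. The function name and the example numbers are hypothetical; the actual computation in train_gpt.py may be organized differently.

```python
import math

def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean per-token cross-entropy (nats) to bits per byte of raw text."""
    bits_per_token = mean_loss_nats / math.log(2)  # nats -> bits
    return bits_per_token * num_tokens / num_bytes  # rescale tokens -> bytes

# Hypothetical numbers: 2.0 nats/token, 1000 tokens covering 2500 bytes of text.
example_bpb = bits_per_byte(2.0, 1000, 2500)
```

Because the metric is normalized by bytes rather than tokens, results stay comparable across tokenizers with different compression ratios.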

Challenge: March 18 – April 30, 2026. Prize: $1M in OpenAI compute credits.

Environment

Conda env: conda activate opg (MUST be activated before any command)
Python deps: see requirements.txt
Data: ./data/datasets/fineweb10B_sp1024/, tokenizer at ./data/tokenizers/fineweb_1024_bpe.model

Key Files

train_gpt.py — Base training script (reference). Agent modifies a working copy.
data/ — Dataset and tokenizer (READ-ONLY, never modify)
records/ — Historical leaderboard submissions (READ-ONLY reference)
results.tsv — Experiment log (untracked by git)

Current SOTA

val_bpb = 1.1428 (thwu1, 2026-03-20)

Converged Best-Known Config (Iteration 0 Baseline)

This is the consensus of the top 3 leaderboard entries. Use it as the starting point.

Architecture

Parameter	Value
num_layers	10
model_dim	512
num_heads	8
num_kv_heads	4
mlp_mult	3 (hidden=1536)
train_seq_len	2048
train_batch_tokens	786,432
vocab_size	1024
tie_embeddings	yes
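A rough parameter count for the table above can be sketched as follows. It assumes bias-free linear layers, a two-matrix (ungated) MLP, and GQA with head_dim = model_dim / num_heads; it ignores norms, scalar parameters, and extras such as SmearGate or BigramHash, so treat the numbers as an estimate rather than the script's exact count.

```python
def estimate_params(num_layers=10, model_dim=512, num_heads=8,
                    num_kv_heads=4, mlp_mult=3, vocab_size=1024):
    head_dim = model_dim // num_heads
    kv_dim = num_kv_heads * head_dim
    # Q and O are model_dim x model_dim; K and V are model_dim x kv_dim (GQA).
    attn = 2 * model_dim * model_dim + 2 * model_dim * kv_dim
    # Assumed ungated MLP: one up projection and one down projection.
    mlp = 2 * model_dim * (mlp_mult * model_dim)
    embed = vocab_size * model_dim  # tied input/output embedding, counted once
    matrices = num_layers * (attn + mlp)
    return {"matrices": matrices, "embed": embed, "total": matrices + embed}

def raw_bytes(counts, matrix_bits=6, embed_bits=16):
    """Size before zstd: int6 matrices plus FP16 embeddings (scales ignored)."""
    return counts["matrices"] * matrix_bits // 8 + counts["embed"] * embed_bits // 8

counts = estimate_params()
size_before_zstd = raw_bytes(counts)
```

Under these assumptions the model has roughly 24.1M parameters, and the packed size before zstd is above 16,000,000 bytes, which is why the compression step matters for the artifact limit.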

Optimizer

Parameter	Value
matrix_lr	0.02
scalar_lr	0.02
tied_embed_lr	0.03
muon_momentum	0.99
momentum_warmup_start	0.92
momentum_warmup_steps	1500
muon_weight_decay	0.04
grad_clip_norm	0.3
warmdown_iters	3000
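The momentum warmup and warmdown entries can be read as schedules. The sketch below assumes a linear ramp for Muon momentum and a linear learning-rate decay over the final warmdown_iters steps; both shapes, and the total_steps argument, are assumptions for illustration and may not match the schedule implemented in train_gpt.py.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly ramp momentum from start to end over warmup_steps (assumed shape)."""
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end

def lr_scale(step, total_steps, warmdown_iters=3000):
    """Multiplier on the base LR: 1.0, then linear decay to 0 (assumed shape)."""
    warmdown_start = max(total_steps - warmdown_iters, 0)
    if step < warmdown_start:
        return 1.0
    return max(total_steps - step, 0) / warmdown_iters

# Hypothetical run length of 10,000 steps.
schedule_preview = [(s, muon_momentum(s), lr_scale(s, 10_000)) for s in (0, 750, 8500)]
```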

Quantization & Techniques

int6 per-row quantization + zstd-22 compression
FP16 tied embeddings
SmearGate + BigramHash(4096+) + OrthoInit
SWA every 50 steps, start_frac=0.4-0.5
Sliding window eval (stride=64)
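To make the first item concrete, here is a minimal pure-Python sketch of symmetric per-row int6 quantization. The integer range [-31, 31] and the max-abs scale are assumptions; the leaderboard entries may use a different range or clipping rule, and the zstd step is omitted here.

```python
def quantize_row_int6(row):
    """Map one weight row to integers in [-31, 31] plus a single float scale."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    q = [max(-31, min(31, round(w / scale))) for w in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

# Toy weight matrix with two rows of very different magnitude.
weights = [[0.10, -0.50, 0.25, 0.0], [2.0, -1.0, 0.5, 0.031]]
quantized = [quantize_row_int6(row) for row in weights]
restored = [dequantize_row(q, scale) for q, scale in quantized]
```

Each row gets its own scale, so a row of small weights is not crushed by a large outlier elsewhere in the matrix. The rounding error per weight is at most half of that row's scale.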

How to Run

Dev mode (1-2x L40S)

conda activate opg
# Single GPU:
python train_gpt.py
# 2 GPUs:
torchrun --standalone --nproc_per_node=2 train_gpt.py

Full mode (8xH100 — final validation only)

conda activate opg
torchrun --standalone --nproc_per_node=8 train_gpt.py

Evaluate results

grep "val_bpb:" run.log
grep "peak_vram_mb:\|artifact.*bytes" run.log
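The same check can be scripted. The sketch below pulls the last val_bpb value out of a log; the exact log line layout is an assumption (a "val_bpb:" key followed by a number), so adjust the pattern to whatever train_gpt.py actually prints.

```python
import re

# Assumed format: "val_bpb:" followed by optional spaces and a decimal number.
VAL_BPB_RE = re.compile(r"val_bpb:\s*([0-9]*\.?[0-9]+)")

def last_val_bpb(log_text):
    """Return the final val_bpb in the log, or None if the run never reported one."""
    matches = VAL_BPB_RE.findall(log_text)
    return float(matches[-1]) if matches else None

# Hypothetical log excerpt for illustration.
sample_log = (
    "step:100 val_loss:2.1 val_bpb:1.2500\n"
    "step:200 val_loss:2.0 val_bpb: 1.1428\n"
)
final_bpb = last_val_bpb(sample_log)
```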

Autoresearch Protocol

Setup

Create branch: git checkout -b autoresearch/<tag> from main
Read train_gpt.py, results.tsv, and git log for full context
Verify data exists in ./data/datasets/fineweb10B_sp1024/
Initialize results.tsv with header row
Run iteration 0 (converged baseline) to establish baseline val_bpb

Experiment Loop (LOOP FOREVER)

Read git state: git log --oneline -20 + results.tsv
Pick ONE focused change to train_gpt.py
Write/update tests first (TDD: tests BEFORE implementation), then implement the change
git commit the change
Run: redirect output to run.log (do NOT flood context)
Read results: grep "^val_bpb:\|^peak_vram_mb:" run.log
If the grep output is empty, the run crashed: run tail -n 50 run.log and attempt a fix
Log to results.tsv (do NOT commit results.tsv)
If val_bpb improved AND artifact <= 16,000,000 bytes -> run /simplify, then keep
If val_bpb is equal or worse, or the artifact is over the limit -> git revert to previous good state
NEVER STOP — run indefinitely until manually interrupted
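The keep/discard rule in the loop can be written down as a small function. This is a sketch of the decision only; the function name is hypothetical, and the status strings follow the results.tsv examples.

```python
ARTIFACT_LIMIT_BYTES = 16_000_000

def decide(val_bpb, artifact_bytes, best_val_bpb):
    """Return the results.tsv status for one experiment."""
    if val_bpb is None:  # nothing was reported in the log, so the run crashed
        return "crash"
    if artifact_bytes > ARTIFACT_LIMIT_BYTES:
        return "discard"
    if val_bpb < best_val_bpb:  # strictly better; equal counts as no improvement
        return "keep"
    return "discard"
```

For example, decide(1.140000, 15_800_000, 1.142760) returns "keep", while a better score with a 16,100,000-byte artifact is still discarded.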

Time Budget

10 minutes max per experiment (wall clock training)
If a run exceeds 15 minutes, kill and treat as failure
~6 experiments/hour on dev hardware
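One way to enforce the 15-minute kill rule is to launch the run through subprocess with a timeout. The helper below is a sketch with a hypothetical name; the demo uses a stand-in command, and a real call would pass the python or torchrun command line instead.

```python
import os
import subprocess
import sys
import tempfile

def run_with_timeout(cmd, log_path, timeout_s=15 * 60):
    """Run cmd with stdout and stderr sent to log_path.

    Returns the exit code, or None if the run hit the timeout and was killed.
    """
    with open(log_path, "w") as log:
        try:
            done = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT,
                                  timeout=timeout_s)
            return done.returncode
        except subprocess.TimeoutExpired:
            return None  # treat as a failed experiment

# Stand-in command for illustration only.
demo_log = os.path.join(tempfile.mkdtemp(), "run.log")
demo_code = run_with_timeout([sys.executable, "-c", "print('val_bpb: 1.0')"], demo_log)
```

subprocess.run kills the direct child on timeout. Worker processes spawned by a launcher may need additional cleanup, which this sketch does not attempt.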

results.tsv Format (tab-separated)

commit	val_bpb	artifact_bytes	status	description
a1b2c3d	1.142760	15900000	keep	baseline converged config
b2c3d4e	1.140000	15800000	keep	added RevDEQ fixed-point layer
c3d4e5f	1.150000	16100000	discard	MLA attention (artifact too large)
d4e5f6a	0.000000	0	crash	soft routing OOM
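Appending rows in this format can be done with the csv module. This is a minimal sketch; the helper name is hypothetical and the demo writes to a temporary directory rather than the real results.tsv.

```python
import csv
import os
import tempfile

HEADER = ["commit", "val_bpb", "artifact_bytes", "status", "description"]

def append_result(path, commit, val_bpb, artifact_bytes, status, description):
    """Append one tab-separated row, writing the header if the file is new."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        if new_file:
            writer.writerow(HEADER)
        writer.writerow([commit, f"{val_bpb:.6f}", artifact_bytes, status, description])

demo_path = os.path.join(tempfile.mkdtemp(), "results.tsv")
append_result(demo_path, "a1b2c3d", 1.14276, 15_900_000, "keep", "baseline converged config")
with open(demo_path) as f:
    demo_lines = f.read().splitlines()
```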

Architectural Constraints (MUST SATISFY)

1. RevDEQ (Reversible Deep Equilibrium Model)

Paper: https://arxiv.org/abs/2509.12917
The main transformer backbone MUST use RevDEQ: the model output is defined as the fixed point of a learned function
Enables exact gradients, no regularization needed, fewer function evaluations
Reversible design allows memory-efficient training
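The idea of a reversible fixed-point solve can be illustrated on a scalar toy problem. The coupled, relaxed update below is one way to make the iteration algebraically invertible; whether it matches the paper's exact solver is an assumption, and f here is a stand-in contraction, not a transformer block.

```python
import math

def f(x):
    """Toy contraction standing in for the learned function."""
    return 0.5 * math.cos(x)

def forward(y, z, beta=0.5, steps=8):
    """Coupled relaxed iteration; both states approach the fixed point of f."""
    for _ in range(steps):
        y = (1 - beta) * y + beta * f(z)
        z = (1 - beta) * z + beta * f(y)
    return y, z

def backward(y, z, beta=0.5, steps=8):
    """Undo forward() step by step, without having stored intermediate states."""
    for _ in range(steps):
        z = (z - beta * f(y)) / (1 - beta)
        y = (y - beta * f(z)) / (1 - beta)
    return y, z

y_end, z_end = forward(1.0, 1.0)
y_start, z_start = backward(y_end, z_end)  # recovers (1.0, 1.0) up to rounding
y_star, z_star = forward(1.0, 1.0, steps=60)  # close to the fixed point x = f(x)
```

Because each step can be inverted in closed form, intermediate states can be recomputed during the backward pass instead of being kept in memory. The inverse divides by (1 - beta), so rounding error grows with the number of reversed steps.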

2. Soft Dense Routing (Dense MoE — no sparsity)

Inspired by Soft MoE (arxiv:2308.00951) but fully dense — ALL experts process ALL tokens
No top-k selection, no token dropping, no sparse gating
Routing weights come from a softmax over experts, with additional non-linearities encouraged:
- Sigmoid gating on routing weights (like gated attention, arxiv:2505.06708)
- Learned gate scalars per expert
Fully differentiable, no discrete routing decisions
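A scalar toy version of this routing is sketched below: every expert runs on the token, and the outputs are mixed with softmax weights multiplied by a sigmoid gate and a learned per-expert scalar. How the three factors are combined is an assumption for illustration, and the experts are placeholder functions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense_route(token, experts, router_logits, gate_logits, gate_scalars):
    """Mix ALL expert outputs; no top-k, so every weight stays nonzero."""
    probs = softmax(router_logits)
    weights = [p * sigmoid(g) * s
               for p, g, s in zip(probs, gate_logits, gate_scalars)]
    output = sum(w * expert(token) for w, expert in zip(weights, experts))
    return output, weights

# Placeholder experts and neutral routing parameters.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x]
output, weights = dense_route(3.0, experts, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

With uniform logits and zero gate logits, each expert receives weight 1/3 * 0.5 = 1/6.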

3. Multi-head Latent Attention (MLA) with Gated Attention — DeepSeek

Replace standard MHA/GQA with MLA
Low-rank KV compression: project to latent space, cache compressed, decompress on-the-fly
Decoupled RoPE: split heads into RoPE and non-RoPE components
Absorb decompression into subsequent linear layers where possible
Gated Attention (arxiv:2505.06708): apply head-specific sigmoid gate after SDPA
- Query-dependent sparse gating scores modulate attention output per head
- Mitigates attention sinks, improves long-context performance
- Enables larger learning rates and better training stability
- Negligible parameter overhead (one gate vector per head)
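The effect of caching a compressed latent instead of full keys and values can be shown with simple arithmetic. The latent_dim and rope_dim values below are hypothetical, and the gate count assumes one model_dim-sized vector per head producing a scalar gate.

```python
def kv_cache_floats_per_token(num_heads=8, num_kv_heads=4, head_dim=64,
                              latent_dim=128, rope_dim=32):
    """Cached values per token per layer for each attention variant."""
    return {
        "mha": 2 * num_heads * head_dim,     # full K and V for every head
        "gqa": 2 * num_kv_heads * head_dim,  # K and V shared across head groups
        "mla": latent_dim + rope_dim,        # compressed latent + shared RoPE key
    }

def headwise_gate_params(model_dim=512, num_heads=8):
    """Assumed gate: one model_dim vector per head, giving one scalar per head."""
    return model_dim * num_heads

cache = kv_cache_floats_per_token()
gate_params = headwise_gate_params()
```

With the baseline dimensions and these hypothetical latent sizes, the per-token cache drops from 512 values (GQA) to 160 (MLA), and the gate adds 4,096 parameters per layer.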

4. Parameter Golf Hard Constraints (ENFORCED)

Artifact size <= 16,000,000 bytes (code + compressed model)
Training time <= 600 seconds on 8xH100 SXM
Must use FineWeb validation set for evaluation
Tokenizer: SentencePiece BPE, vocab=1024
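The two numeric limits can be checked mechanically. The sketch below sums file sizes for the artifact and compares against both limits; the helper name and the file names are hypothetical, and the demo uses small temporary files.

```python
import os
import tempfile

ARTIFACT_LIMIT_BYTES = 16_000_000
TRAIN_LIMIT_SECONDS = 600

def check_constraints(artifact_paths, train_seconds):
    """Sum on-disk sizes (code + compressed model) and test both hard limits."""
    total = sum(os.path.getsize(p) for p in artifact_paths)
    return {
        "artifact_bytes": total,
        "size_ok": total <= ARTIFACT_LIMIT_BYTES,
        "time_ok": train_seconds <= TRAIN_LIMIT_SECONDS,
    }

# Stand-in files: 1,000 bytes of "code" and 2,000 bytes of "model".
demo_dir = tempfile.mkdtemp()
demo_files = []
for name, size in (("train_gpt.py", 1000), ("model.bin", 2000)):
    path = os.path.join(demo_dir, name)
    with open(path, "wb") as f:
        f.write(b"\0" * size)
    demo_files.append(path)
report = check_constraints(demo_files, train_seconds=590.0)
```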

5. DDP Compatibility

All code MUST work with both single-GPU and multi-GPU (torchrun DDP)
Dev on 1-2x L40S, validate on 8xH100
Never use GPU-count-specific logic without proper world_size handling
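A minimal sketch of world_size handling: torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment, and a plain python launch sets none of them, so defaulting to a single process covers both cases. The helper name and the even split of the global batch are assumptions; train_gpt.py may use gradient accumulation instead.

```python
import os

def ddp_config(train_batch_tokens=786_432, train_seq_len=2048, env=None):
    """Derive per-rank batch sizes from torchrun-style environment variables."""
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    local_rank = int(env.get("LOCAL_RANK", 0))
    if train_batch_tokens % (world_size * train_seq_len) != 0:
        raise ValueError(
            f"batch of {train_batch_tokens} tokens does not split evenly "
            f"across {world_size} ranks at seq_len {train_seq_len}"
        )
    tokens_per_rank = train_batch_tokens // world_size
    return {
        "rank": rank,
        "world_size": world_size,
        "local_rank": local_rank,
        "tokens_per_rank": tokens_per_rank,
        "seqs_per_rank": tokens_per_rank // train_seq_len,
    }

single_gpu = ddp_config(env={})
eight_gpu = ddp_config(env={"RANK": "3", "WORLD_SIZE": "8", "LOCAL_RANK": "3"})
```

The global batch stays fixed at 786,432 tokens regardless of GPU count, so results on 1-2 GPUs remain comparable in terms of tokens per step.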

Development Practices

TDD (Test-Driven Development)

Write tests BEFORE implementation for each architectural change
Test categories:
- Shape tests: verify tensor dimensions through the model
- Gradient tests: verify gradients flow (especially through RevDEQ fixed-point)
- Quantization roundtrip tests: verify model survives int6+zstd
- DDP tests: verify multi-GPU correctness
- Artifact size tests: verify <= 16MB after compression

Simplify Before Commit

Run /simplify on every successful experiment before committing
Keep code clean and minimal — complexity must earn its keep

Simplicity Criterion

All else equal, simpler is better
A 0.001 bpb improvement adding 20 lines of hacky code? Probably not worth it
Removing code and getting equal results? Definitely keep
Favor removing complexity over adding it

Submission Process (when ready)

Run 3 seeds (e.g., 42, 1337, 2024) on 8xH100
Compute mean and std of val_bpb
Create folder in records/track_10min_16mb/YYYY-MM-DD_<name>/
Include: README.md, submission.json, train_gpt.py, training logs
Submit PR to main
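The mean and standard deviation step can be done with the statistics module. The helper name is hypothetical, and the values in the demo are placeholders, not real results.

```python
import statistics

def summarize(val_bpbs):
    """Mean and sample standard deviation of val_bpb across seeds."""
    return {
        "mean": statistics.mean(val_bpbs),
        "std": statistics.stdev(val_bpbs),  # sample std (n - 1 denominator)
        "n": len(val_bpbs),
    }

# Placeholder values for three seeds; replace with the measured val_bpb of each run.
summary = summarize([1.0, 2.0, 3.0])
```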
