This record submission is called Long Context Seq2048 v2.

Configuration:

  • Layout: VOCAB_SIZE=1024 NUM_LAYERS=9 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2
  • Tied output/input embeddings: TIE_EMBEDDINGS=1
  • Sequence length: TRAIN_SEQ_LEN=2048
  • Batching: TRAIN_BATCH_TOKENS=524288
  • Learning rates: TIED_EMBED_LR=0.04 MATRIX_LR=0.032 SCALAR_LR=0.032
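
For orientation, here is a minimal sketch of how these environment variables plausibly map onto a model config (names and defaults are hypothetical; the authoritative parsing lives in train_gpt.py). Note that NUM_KV_HEADS=4 against NUM_HEADS=8 implies grouped-query attention with two query heads per KV head, and MLP_MULT=2 gives a 1024-wide MLP hidden layer.

import os
from dataclasses import dataclass

def env_int(name: str, default: int) -> int:
    # read an integer knob from the environment, falling back to the record default
    return int(os.environ.get(name, default))

@dataclass
class GPTConfig:
    # hypothetical mirror of the layout knobs above; defaults match this record
    vocab_size: int = env_int("VOCAB_SIZE", 1024)
    num_layers: int = env_int("NUM_LAYERS", 9)
    model_dim: int = env_int("MODEL_DIM", 512)
    num_heads: int = env_int("NUM_HEADS", 8)
    num_kv_heads: int = env_int("NUM_KV_HEADS", 4)  # < num_heads => grouped-query attention
    mlp_mult: int = env_int("MLP_MULT", 2)          # MLP hidden dim = 2 * 512 = 1024
    tie_embeddings: bool = env_int("TIE_EMBEDDINGS", 1) == 1
    train_seq_len: int = env_int("TRAIN_SEQ_LEN", 2048)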

Command:

NCCL_IB_DISABLE=1 \
RUN_ID=seq2048_sxm28_full_20260319a \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
MAX_WALLCLOCK_SECONDS=600 \
torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-03-18_LongContextSeq2048/train_gpt.py

Verification environment:

  • 8x H100 80GB HBM3
  • all-to-all NV18 topology
  • torch 2.8.0+cu128
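
Before rerunning, a quick sanity check that a box matches this environment (a convenience sketch, not part of the record script):

import subprocess
import torch

assert torch.__version__.startswith("2.8.0"), torch.__version__
assert torch.cuda.device_count() == 8, torch.cuda.device_count()
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))  # expect NVIDIA H100 80GB HBM3
# on an all-to-all SXM box, every GPU pair should report NV18 here
subprocess.run(["nvidia-smi", "topo", "-m"], check=True)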

Key metrics (from train.log in this folder, rerun on the target SXM-class box):

  • Timed training stopped at 11564/20000 steps due to the wallclock cap.
  • Pre-quant eval at stop: val_loss:2.0269, val_bpb:1.2005
  • Post-quant roundtrip eval: val_loss:2.0359, val_bpb:1.2058
  • Exact printed metric: final_int8_zlib_roundtrip_exact val_bpb:1.20576485
  • Train time: 600038ms (step_avg:51.89ms)
  • Peak memory: 10247 MiB allocated, 10488 MiB reserved
  • Serialized model int8+zlib: 15819554 bytes (see the sketch after this list)
  • Code size for this standalone record script: 47716 bytes
  • Total submission size int8+zlib: 15867270 bytes
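
For context on the byte counts above, a minimal sketch of an int8+zlib size measurement, assuming per-tensor symmetric scales; the exact quantizer behind final_int8_zlib_roundtrip_exact lives in train_gpt.py and may differ:

import zlib
import torch

def int8_zlib_bytes(state_dict) -> int:
    # compressed size of an int8 quantization of every tensor (scale metadata omitted for brevity)
    total = 0
    for tensor in state_dict.values():
        t = tensor.detach().float().cpu()
        scale = t.abs().max().clamp(min=1e-8) / 127.0       # per-tensor symmetric scale
        q = (t / scale).round().clamp(-127, 127).to(torch.int8)
        total += len(zlib.compress(q.numpy().tobytes()))
        # a roundtrip eval would reload q.float() * scale and re-run validation
    return total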

Additional full-run reproducibility logs included in this folder:

  • train.log: canonical SXM rerun, SEED=1337, val_bpb=1.20576485
  • train_seed1338.log: SXM rerun, SEED=1338, val_bpb=1.20617460
  • train_seed1339.log: SXM rerun, SEED=1339, val_bpb=1.20715923
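
Assuming the val_bpb:<value> line format shown in the metrics above, the final number in each log can be pulled out mechanically (a hypothetical helper, not shipped in this folder):

import pathlib
import re

for log in ["train.log", "train_seed1338.log", "train_seed1339.log"]:
    matches = re.findall(r"val_bpb:([0-9.]+)", pathlib.Path(log).read_text())
    print(log, matches[-1] if matches else "no val_bpb found")  # last printed value is the final eval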

Record-track significance note:

  • The public repo state for this submission lists the Naive Baseline at 1.2243657.
  • The challenge therefore requires beating 1.2193657 (0.005 below the baseline) to claim a new record.
  • All three included SXM full runs clear that threshold:
    • SEED=1337: 1.20576485
    • SEED=1338: 1.20617460
    • SEED=1339: 1.20715923
  • Sample mean across the three runs: 1.20636623
  • Sample standard deviation: 0.00071667
  • One-sided one-sample t-test against 1.2193657: t=31.42 with df=2, which gives p=0.00051
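
These statistics can be checked directly from the three values above (scipy's one-sided alternative tests whether the true mean val_bpb sits below the threshold; the reported t=31.42 is the magnitude of the signed statistic):

import numpy as np
from scipy import stats

runs = np.array([1.20576485, 1.20617460, 1.20715923])  # SEED=1337, 1338, 1339
threshold = 1.2193657                                  # baseline 1.2243657 minus 0.005

print(runs.mean(), runs.std(ddof=1))                   # 1.20636623, 0.00071667
t, p = stats.ttest_1samp(runs, threshold, alternative="less")
print(t, p)                                            # t ≈ -31.42, p ≈ 0.00051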

Why this folder is standalone:

  • train_gpt.py compiles and runs from inside this record folder and was used for the canonical rerun whose output is saved as train.log.
  • No extra Python source files are required for the training path.
  • The only inputs expected at runtime are the cached dataset and tokenizer paths described in the main repo README.

Included files:

  • train_gpt.py (standalone winning recipe with defaults baked in)
  • README.md (this file)
  • submission.json (leaderboard metadata)
  • train.log (canonical full log from the standalone record script)
  • train_seed1338.log, train_seed1339.log (extra full reruns for reproducibility)
  • logs/seq2048_sxm28_* (raw per-run tee output and trainer text logs from the SXM verification box)