This record captures the Simple Baseline.
Trainer changes in this snapshot:
- current repository train_gpt.py snapshot copied into the record folder
- published fineweb10B_sp1024 dataset and tokenizer loaded from the new Hugging Face export
- 10-minute wallclock cap on 8xH100
- periodic validation every 200 steps on the full fineweb_val_* split (see the loop sketch after this list)
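For orientation, here is a minimal sketch of how the wallclock cap and periodic validation could be enforced in the training loop. The helper functions and control flow are illustrative assumptions, not the actual train_gpt.py code; only the env var names and the step/second budgets come from the launch command and log below.

```python
import os
import time

MAX_WALLCLOCK_SECONDS = int(os.environ.get("MAX_WALLCLOCK_SECONDS", "600"))  # 10-minute cap
VAL_LOSS_EVERY = int(os.environ.get("VAL_LOSS_EVERY", "200"))
NUM_STEPS = 20000  # planned step budget; this run stopped at step 13780 due to the cap

def train_one_step(step: int) -> None:
    """Placeholder for one optimizer step over a 524288-token global batch."""

def run_validation(step: int) -> None:
    """Placeholder for a full pass over the fineweb_val_* split."""

start = time.monotonic()
for step in range(1, NUM_STEPS + 1):
    train_one_step(step)
    if step % VAL_LOSS_EVERY == 0:
        run_validation(step)
    if time.monotonic() - start >= MAX_WALLCLOCK_SECONDS:
        break  # timed training ends here; final eval and serialization follow
```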
Configuration:
- Layout: VOCAB_SIZE=1024 NUM_LAYERS=9 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2
- Tied output/input embeddings: TIE_EMBEDDINGS=1
- Tied embedding LR: TIED_EMBED_LR=0.05
- Batching: TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=1024 (see the config sketch after this list)
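A minimal sketch of how these settings could be gathered into a single config object; the class name, helper, and field handling are assumptions for illustration, not the structures actually used in train_gpt.py.

```python
import os
from dataclasses import dataclass

def _env_int(name: str, default: int) -> int:
    return int(os.environ.get(name, str(default)))

@dataclass
class GPTConfig:
    vocab_size: int = _env_int("VOCAB_SIZE", 1024)
    num_layers: int = _env_int("NUM_LAYERS", 9)
    model_dim: int = _env_int("MODEL_DIM", 512)
    num_heads: int = _env_int("NUM_HEADS", 8)
    num_kv_heads: int = _env_int("NUM_KV_HEADS", 4)        # fewer KV heads than query heads
    mlp_mult: int = _env_int("MLP_MULT", 2)                # MLP hidden dim = mlp_mult * model_dim
    tie_embeddings: bool = os.environ.get("TIE_EMBEDDINGS", "1") == "1"
    tied_embed_lr: float = float(os.environ.get("TIED_EMBED_LR", "0.05"))
    train_batch_tokens: int = _env_int("TRAIN_BATCH_TOKENS", 524288)
    train_seq_len: int = _env_int("TRAIN_SEQ_LEN", 1024)
```

With TIE_EMBEDDINGS=1 the output projection and the token embedding share one weight matrix, which is why a separate learning rate (TIED_EMBED_LR) is specified for that shared parameter.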
Command (track-relevant params):
NCCL_IB_DISABLE=1 \
RUN_ID=hf_verify_sp1024_8gpu \
DATA_PATH=/root/code/parameter-golf/data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=/root/code/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
MAX_WALLCLOCK_SECONDS=600 \
TRAIN_LOG_EVERY=50 \
VAL_LOSS_EVERY=200 \
torchrun --standalone --nproc_per_node=8 /root/code/parameter-golf/train_gpt.py

Key metrics (from train.log):
- Timed training stopped at 13780/20000 steps due to the wallclock cap.
- Pre-quant eval at stop: val_loss: 2.0606, val_bpb: 1.2172
- Post-quant roundtrip eval: val_loss: 2.0727, val_bpb: 1.2244
- Exact printed metric: final_int8_zlib_roundtrip_exact val_bpb: 1.22436570
- Train time: 600038 ms (step_avg: 43.54 ms)
- Peak memory: 10184 MiB allocated, 10200 MiB reserved
- Serialized model int8+zlib: 15815847 bytes (see the roundtrip sketch after this list)
- Code size: 47642 bytes
- Total submission size int8+zlib: 15863489 bytes
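The "roundtrip" metrics and the int8+zlib byte count indicate the checkpoint is quantized to int8, compressed with zlib, then restored and re-evaluated. A rough sketch of such a size measurement, assuming per-tensor symmetric quantization; the exact scheme in train_gpt.py may differ.

```python
import io
import zlib
import torch

def int8_zlib_size(state_dict: dict) -> int:
    """Return the zlib-compressed size of an int8-quantized state dict.
    Per-tensor symmetric quantization is an assumed scheme for illustration only."""
    packed = {}
    for name, tensor in state_dict.items():
        if tensor.is_floating_point():
            scale = tensor.abs().max().clamp(min=1e-8) / 127.0
            q = torch.clamp((tensor / scale).round(), -127, 127).to(torch.int8)
            packed[name] = (q, scale)  # dequantize later via q.float() * scale
        else:
            packed[name] = tensor
    buffer = io.BytesIO()
    torch.save(packed, buffer)
    return len(zlib.compress(buffer.getvalue()))
```

The reported total submission size is the serialized model plus the code snapshot: 15815847 + 47642 = 15863489 bytes.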
Training volume:
- Global batch: 524288 tokens/step
- Total train tokens seen: 7224688640
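The total is consistent with the completed step count: 13780 steps × 524288 tokens/step = 7224688640 tokens.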
Included files:
- train_gpt.py (code snapshot used for the run)
- train.log (exact remote training log)
- submission.json (leaderboard metadata)