This record implements sliding window evaluation, showing that eval strategies alone can provide significant improvements.

Note on train_gpt.py: The included script contains some unused experimental code paths (QAT, looped architectures) that are all disabled by default and were not active during this run. Only the sliding window evaluation code (eval_val_sliding, forward_logits, EVAL_STRIDE, EVAL_BATCH_SEQS) is used. The command below shows the exact invocation.

Key Idea: Sliding Window Evaluation

The baseline evaluates by chopping the validation set into non-overlapping 1024-token chunks. The problem is that the first token in each chunk has zero context, and the i-th token in a chunk sees only i preceding tokens, so on average each token is scored with only ~512 tokens of context.

Sliding window evaluation uses overlapping windows with a configurable stride. With EVAL_STRIDE=64 and TRAIN_SEQ_LEN=1024, each window advances by 64 tokens, but only the rightmost 64 tokens (which have 960+ tokens of context) are scored. Every token in the validation set is scored exactly once, but with near-maximum context.
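
The core loop can be sketched as follows. This is a minimal illustration of the technique, not the exact eval_val_sliding implementation from train_gpt.py; it assumes model(x) maps a (batch, seq_len) tensor of token ids to (batch, seq_len, vocab) logits (the real code routes this through forward_logits) and simplifies the handling of the first window of the stream.

import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_eval(model, tokens, seq_len=1024, stride=64, batch_seqs=1024):
    # Sketch only: interfaces and edge handling are simplified assumptions.
    device = next(model.parameters()).device
    windows, targets = [], []
    total_nll, total_scored = 0.0, 0

    def flush():
        nonlocal total_nll, total_scored, windows, targets
        if not windows:
            return
        x = torch.stack(windows).to(device)   # (B, seq_len) input windows
        y = torch.stack(targets).to(device)   # (B, stride) tokens to score
        logits = model(x)                     # (B, seq_len, vocab)
        # Score only the rightmost `stride` positions: each sees at least
        # seq_len - stride (here 960) tokens of left context.
        scored = logits[:, -stride - 1:-1, :]
        nll = F.cross_entropy(scored.reshape(-1, scored.size(-1)),
                              y.reshape(-1), reduction="sum")
        total_nll += nll.item()
        total_scored += y.numel()
        windows, targets = [], []

    # Advance by `stride`, so each token past the first window is scored
    # exactly once, with near-maximum context.
    for start in range(0, len(tokens) - seq_len + 1, stride):
        window = tokens[start:start + seq_len]
        windows.append(window)
        targets.append(window[-stride:])
        if len(windows) == batch_seqs:
            flush()
    flush()
    return total_nll / max(total_scored, 1)  # mean NLL per scored token, in nats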

Results

Metric                  Naive Baseline      This Submission
Pre-quant val_bpb       1.2172              1.2196
Post-quant val_bpb      1.2244              1.1925
Improvement             --                  -0.0319
Training steps          13,780              13,450
Eval time (8xH100)      ~16s                70s
Artifact size           15,863,489 bytes    15,874,829 bytes

The pre-quant BPB is nearly identical (training is the same). The 0.032 improvement comes entirely from scoring tokens with richer context during evaluation.

Configuration

Architecture and training are identical to the Naive Baseline:

  • Layout: VOCAB_SIZE=1024 NUM_LAYERS=9 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2
  • Tied output/input embeddings: TIE_EMBEDDINGS=1
  • Tied embedding LR: TIED_EMBED_LR=0.05
  • Batching: TRAIN_BATCH_TOKENS=524288 TRAIN_SEQ_LEN=1024

Evaluation-specific parameters:

  • EVAL_STRIDE=64 (sliding window stride; the baseline's non-overlapping chunks correspond to a stride of 1024)
  • EVAL_BATCH_SEQS=1024 (number of windows per forward pass, to keep GPU utilization high; see the cost sketch below)
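
For a rough sense of the extra evaluation compute these settings imply, the sketch below counts model forwards. The validation-set size is a hypothetical placeholder, not the size of the split actually used in this run.

SEQ_LEN, STRIDE, BATCH_SEQS = 1024, 64, 1024
VAL_TOKENS = 250_000_000  # placeholder, not the real validation split size

baseline_seqs   = VAL_TOKENS // SEQ_LEN              # non-overlapping 1024-token chunks
sliding_windows = VAL_TOKENS // STRIDE               # one window per 64 newly scored tokens
forward_batches = -(-sliding_windows // BATCH_SEQS)  # ceil: windows grouped into batched forwards

print(f"baseline sequences : {baseline_seqs:,}")
print(f"sliding windows    : {sliding_windows:,} ({SEQ_LEN // STRIDE}x more model forwards)")
print(f"batched forwards   : {forward_batches:,} of up to {BATCH_SEQS} windows each")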

Command

RUN_ID=8xh100_slide64_v2 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
NUM_LOOPS=1 \
LORA_RANK=0 \
QAT=0 \
EVAL_STRIDE=64 \
EVAL_BATCH_SEQS=1024 \
MAX_WALLCLOCK_SECONDS=600 \
TRAIN_LOG_EVERY=200 \
VAL_LOSS_EVERY=1000 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

The NUM_LOOPS=1, LORA_RANK=0, and QAT=0 flags explicitly disable all unused code paths (these are also the defaults).

Key Metrics (from train.log)

  • Timed training stopped at 13450/20000 steps due to the wallclock cap.
  • Pre-quant eval at stop: val_loss:2.0592, val_bpb:1.2196
  • Post-quant sliding window eval: val_loss:2.0135, val_bpb:1.1925 (see the conversion check after this list)
  • Exact printed metric: final_int8_zlib_roundtrip_exact val_bpb:1.19250007
  • Train time: 600028ms (step_avg:44.61ms)
  • Peak memory: 10119 MiB allocated, 10294 MiB reserved
  • Eval time: 69881ms (sliding window, stride=64, batch_seqs=1024)
  • Serialized model int8+zlib: 15816489 bytes
  • Code size: 58340 bytes
  • Total submission size int8+zlib: 15874829 bytes
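
The val_loss → val_bpb pairs above are mutually consistent under the usual bits-per-byte conversion, bpb = nats-per-token / ln(2) × tokens-per-byte. A minimal check, with the token-to-byte ratio back-solved from the logged pre-quant pair rather than measured from the dataset:

import math

# Assumes the standard conversion val_bpb = val_loss / ln(2) * (num_tokens / num_bytes).
# The ratio below is inferred from the logged numbers, not read from the data.
tokens_per_byte = 1.2196 / (2.0592 / math.log(2))

post_quant_bpb = 2.0135 / math.log(2) * tokens_per_byte
print(f"inferred tokens/byte  : {tokens_per_byte:.4f}")  # ~0.41
print(f"implied post-quant bpb: {post_quant_bpb:.4f}")   # ~1.1925, matching the log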

Training Volume

  • Global batch: 524288 tokens/step
  • Total train tokens seen: 7,055,769,600

Included Files

  • train_gpt.py (code snapshot used for the run, includes eval_val_sliding function)
  • train.log (exact remote training log)
  • submission.json (leaderboard metadata)