10 changes: 10 additions & 0 deletions .github/dependabot.yml
@@ -0,0 +1,10 @@
version: 2
updates:
  - package-ecosystem: "pip"
    directory: "/"
    schedule:
      interval: "weekly"
    groups:
      all-updates:
        patterns:
          - "*"
14 changes: 14 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,14 @@
name: CI
on:
  push:
    branches: [main]
  pull_request:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
@@ -0,0 +1,57 @@
This record captures a finished non-record smoke submission built from the current root `train_gpt.py`, with a small CUDA compatibility patch in the local copy used for the run.

This run is not intended for the 10-minute leaderboard. It is a short, fully completed non-record baseline on a single RTX 4090 using the fixed full FineWeb validation split and a single training shard. The main purpose is to document a clean, reproducible CUDA submission path with final metrics, artifact bytes, and logs.

## Why this script differs slightly from root

The Vast.ai image used for this run shipped with a PyTorch 2.4 build, which does not accept the `enable_gqa=` argument on `scaled_dot_product_attention`. To keep the run reproducible on that image, the copied `train_gpt.py` expands the KV heads manually when `num_kv_heads != num_heads` and then calls `scaled_dot_product_attention` without `enable_gqa`.

The model, tokenizer, data, and training setup otherwise follow the baseline configuration.
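
The fallback described above can be sketched roughly as follows. This is an illustration, not the code from the copied `train_gpt.py`: the function name `sdpa_with_manual_gqa` and the `(batch, heads, seq, head_dim)` tensor layout are assumptions. It repeats each KV head along the head dimension so that K and V match the query head count, which mirrors the expansion the PyTorch documentation describes for `enable_gqa=True`.

```python
import torch
import torch.nn.functional as F


def sdpa_with_manual_gqa(q, k, v, is_causal=True):
    """Grouped-query attention without the `enable_gqa` argument.

    Assumed shapes (hypothetical, for illustration only):
      q:    (batch, num_heads,    seq, head_dim)
      k, v: (batch, num_kv_heads, seq, head_dim)
    """
    num_heads, num_kv_heads = q.size(1), k.size(1)
    if num_kv_heads != num_heads:
        if num_heads % num_kv_heads != 0:
            raise ValueError("num_heads must be a multiple of num_kv_heads")
        repeats = num_heads // num_kv_heads
        # Each KV head is shared by `repeats` consecutive query heads.
        k = k.repeat_interleave(repeats, dim=1)
        v = v.repeat_interleave(repeats, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
```

With the baseline layout (`NUM_HEADS=8`, `NUM_KV_HEADS=4`), each KV head would be repeated twice. The expansion materializes extra K/V copies, so it trades some memory for compatibility with the older build.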

## Configuration

- Track: `non-record`, unlimited compute, still under the `16,000,000` byte artifact cap
- Hardware: `1x RTX 4090` on Vast.ai
- Tokenizer / dataset: `sp1024`, full fixed `fineweb_val_*`, `1` training shard
- Layout: `VOCAB_SIZE=1024 NUM_LAYERS=9 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=2`
- Tied embeddings: `TIE_EMBEDDINGS=1`
- Batching: `TRAIN_BATCH_TOKENS=8192 TRAIN_SEQ_LEN=1024`
- Validation cadence: final-only validation on the full fixed validation split
- Training length: `ITERATIONS=50`
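
A few derived quantities follow from the values listed above. The short calculation below is a sketch under two assumptions that the list itself does not state: that the per-head dimension is `MODEL_DIM / NUM_HEADS`, and that `TRAIN_BATCH_TOKENS` is the total number of tokens consumed per optimizer step.

```python
# Values copied from the Configuration list.
MODEL_DIM = 512
NUM_HEADS = 8
NUM_KV_HEADS = 4
TRAIN_BATCH_TOKENS = 8192
TRAIN_SEQ_LEN = 1024
ITERATIONS = 50

# Assumption: heads split the model dimension evenly.
head_dim = MODEL_DIM // NUM_HEADS                          # 64
# Number of query heads that share one KV head.
queries_per_kv_head = NUM_HEADS // NUM_KV_HEADS            # 2
# Assumption: one step consumes TRAIN_BATCH_TOKENS tokens in total.
sequences_per_step = TRAIN_BATCH_TOKENS // TRAIN_SEQ_LEN   # 8
total_train_tokens = TRAIN_BATCH_TOKENS * ITERATIONS       # 409600

print(head_dim, queries_per_kv_head, sequences_per_step, total_train_tokens)
```

Under those assumptions the run sees roughly 0.4M training tokens, which is consistent with its role as a smoke test rather than a competitive run.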

## Command

```bash
RUN_ID=stukenov_4090_smoke50 \
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
TRAIN_BATCH_TOKENS=8192 \
TRAIN_SEQ_LEN=1024 \
VAL_BATCH_SIZE=65536 \
VAL_LOSS_EVERY=0 \
TRAIN_LOG_EVERY=25 \
ITERATIONS=50 \
MAX_WALLCLOCK_SECONDS=0 \
OMP_NUM_THREADS=1 \
TORCH_NCCL_ASYNC_ERROR_HANDLING=1 \
python train_gpt.py
```
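
The command passes every setting through environment variables. The sketch below shows one way a script could read such variables. It is not taken from `train_gpt.py`: the `RunConfig` class, the `load_config` helper, and all default values are hypothetical. It also assumes that `VAL_LOSS_EVERY=0` turns off periodic validation (matching the final-only cadence listed under Configuration) and that `MAX_WALLCLOCK_SECONDS=0` disables the wallclock cap.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for a subset of the variables in the command."""
    run_id: str
    iterations: int
    train_batch_tokens: int
    train_seq_len: int
    val_loss_every: int
    max_wallclock_seconds: float

    @property
    def periodic_validation(self):
        # Assumption: 0 means "validate only once, at the end".
        return self.val_loss_every > 0

    @property
    def wallclock_capped(self):
        # Assumption: 0 means "no wallclock limit".
        return self.max_wallclock_seconds > 0


def load_config(env=None):
    """Read settings from a mapping (defaults to os.environ).

    The fallback defaults here are placeholders, not the script's real ones.
    """
    env = os.environ if env is None else env
    return RunConfig(
        run_id=env.get("RUN_ID", "debug"),
        iterations=int(env.get("ITERATIONS", "50")),
        train_batch_tokens=int(env.get("TRAIN_BATCH_TOKENS", "8192")),
        train_seq_len=int(env.get("TRAIN_SEQ_LEN", "1024")),
        val_loss_every=int(env.get("VAL_LOSS_EVERY", "0")),
        max_wallclock_seconds=float(env.get("MAX_WALLCLOCK_SECONDS", "0")),
    )
```

Calling `load_config()` with no arguments would read the live environment, and passing a plain dictionary makes the parsing easy to check in isolation.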

## Key Metrics

- Training stopped at `50/50` steps.
- Pre-quant eval at stop: `val_loss:5.3102`, `val_bpb:3.1450`
- Post-quant int8+zlib roundtrip: `val_loss:5.7139`, `val_bpb:3.3841`
- Exact printed metric: `final_int8_zlib_roundtrip_exact val_loss:5.71391837 val_bpb:3.38410431`
- Train time: `12070ms` (`step_avg:241.40ms`)
- Eval time: `28444ms`
- Peak memory: `565 MiB allocated`, `750 MiB reserved`
- Serialized model int8+zlib: `5121054 bytes`
- Code size: `47999 bytes`
- Total submission size int8+zlib: `5169053 bytes`
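
The reported figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes the step average, the total submission size, and the headroom under the cap from the numbers above. The last two lines additionally assume the common definition `val_bpb = (val_loss / ln 2) * tokens_per_byte`; whether `train_gpt.py` computes `val_bpb` exactly this way is not confirmed here.

```python
import math

# Values copied from the Key Metrics list and the Configuration section.
train_time_ms = 12070
steps = 50
model_bytes = 5_121_054
code_bytes = 47_999
cap_bytes = 16_000_000
val_loss = 5.71391837   # nats per token, post-quant
val_bpb = 3.38410431    # bits per byte, post-quant

step_avg_ms = train_time_ms / steps          # 241.4
total_bytes = model_bytes + code_bytes       # 5169053
headroom_bytes = cap_bytes - total_bytes     # 10830947

# Assumption: val_bpb = (val_loss / ln 2) * tokens_per_byte.
tokens_per_byte = val_bpb * math.log(2) / val_loss
bytes_per_token = 1.0 / tokens_per_byte

print(f"step_avg_ms={step_avg_ms:.2f}")
print(f"total_bytes={total_bytes} headroom_bytes={headroom_bytes}")
print(f"implied bytes per token ~ {bytes_per_token:.2f}")
```

The step average and byte totals match the logged values. Under the stated assumption, the loss-to-bpb ratio implies a little over two bytes per token on the validation split, and the pre-quant pair gives the same ratio up to rounding.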

## Included Files

- `train_gpt.py` (exact code snapshot used for the run)
- `train.log` (exact training log)
- `submission.json` (metadata for this non-record run)
@@ -0,0 +1,18 @@
{
  "author": "Saken Tukenov",
  "github_id": "stukenov",
  "name": "1x RTX 4090 Compat Smoke (50 steps)",
  "blurb": "Finished non-record smoke run on 1x RTX 4090 using the baseline 9x512 SP-1024 architecture, one FineWeb training shard, and the full fixed validation split. Uses a small compatibility fallback in train_gpt.py to expand KV heads manually on a PyTorch 2.4 image that lacks enable_gqa support. Post-quant int8+zlib artifact remains under the 16,000,000-byte cap.",
  "date": "2026-03-20",
  "track": "non-record-unlimited-compute-16mb",
  "val_loss": 5.71391837,
  "val_bpb": 3.38410431,
  "pre_quant_val_loss": 5.3102,
  "pre_quant_val_bpb": 3.1450,
  "step_stop": 50,
  "wallclock_seconds": 12.070,
  "eval_time_seconds": 28.444,
  "bytes_total": 5169053,
  "bytes_model_int8_zlib": 5121054,
  "bytes_code": 47999
}