Commit 2443851

Merge pull request #1019 from abaybektursun/record/ar-selfgen-gptq-xsa-bigramhash3072
Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean)
2 parents 50390d6 + d7fbe3d commit 2443851

7 files changed

Lines changed: 2546 additions & 0 deletions

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
# Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112

**val_bpb: 1.1147** (3-seed mean, std 0.0004) | **~15.91 MB** | 8×H100 SXM, 600s | No TTT

**This submission uses only AR (autoregressive) self-generated calibration data.** After training, the model autoregressively generates its own calibration tokens (64 seqs × 2048 tokens, temp=0.8). No val data and no train data are accessed during quantization.

**Improvement over current SOTA ([PR #549](https://github.com/openai/parameter-golf/pull/549), 1.1194 BPB):** −0.0078 nats (−0.0046 BPB)

## Results

| Seed | Steps | ms/step | Pre-quant BPB | **Sliding BPB** | Artifact (bytes) |
|------|-------|---------|---------------|-----------------|------------------|
| 314 | 6,927 | 86.6 | 1.1354 | **1.1151** | 15,863,278 |
| 42 | 6,922 | 86.7 | 1.1349 | **1.1144** | 15,984,850 |
| 999 | 6,917 | 86.8 | 1.1353 | **1.1148** | 15,876,310 |
| **Mean** | | | | **1.1147** | |

Current SOTA (PR #549, exact 3-seed mean): **1.11937967 BPB** (**1.89002068 nats**). This run's exact 3-seed mean is **1.11473509 BPB** (**1.88217853 nats**). Delta: **−0.00784215 nats** (**−0.00464458 BPB**).

Using the exact per-seed scores from the PR #549 logs (`1.11922988`, `1.12002032`, `1.11888882`) and this run (`1.11508120`, `1.11437394`, `1.11475014`), Welch's t-test gives **t = -11.83**, **df ≈ 3.31**.
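
The statistic above can be reproduced from the six per-seed scores with the standard library alone. This is a minimal sketch, assuming sample variances (n − 1 denominator) and the Welch-Satterthwaite degrees of freedom; the helper name `welch` is illustrative and is not taken from the submission's scripts.

```python
import math
import statistics

# Exact per-seed val_bpb scores quoted in the paragraph above.
pr549 = [1.11922988, 1.12002032, 1.11888882]
this_run = [1.11508120, 1.11437394, 1.11475014]

def welch(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on mean(a) - mean(b)."""
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)  # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, welch_df = welch(this_run, pr549)
print(f"t = {t_stat:.2f}, df = {welch_df:.2f}")
```

With only three seeds per arm the degrees of freedom are small, which is why the fractional Welch value is reported rather than a pooled df of 4.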

---

## Main Changes

The comparison baseline is [PR #549](https://github.com/openai/parameter-golf/pull/549), the current legal leaderboard entry at **1.1194 BPB**. The implementation lineage is closer to [PR #609](https://github.com/openai/parameter-golf/pull/609): this run keeps the XSA-all + Full GPTQ + selective-pruning stack, but uses AR self-generated GPTQ calibration (no external data), bumps BigramHash to **3072 × 112**, and uses `lzma preset=9`.

### 1. AR Self-Generated Full Hessian GPTQ

PR #549 used GPTQ-lite (diagonal Hessian approximation). We use Full Hessian GPTQ with Cholesky error compensation and column reordering, which also uses the off-diagonal Hessian terms that the diagonal approximation discards.

The calibration problem: prior Full Hessian GPTQ implementations (PRs #535, #569, #593, #609) calibrated on training data, and accessing training data after the 600s window was ruled illegal. We solve this by having the model generate its own calibration data. After training completes, the model autoregressively generates 64 sequences of 2048 tokens (temperature=0.8, fixed seed). Hessians H = X^T X are collected from the layer inputs X on these self-generated sequences. No val data and no train data are accessed during quantization.
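
The calibration loop described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `train_gpt.py`: the toy model (`EMBED`, `next_logits`), the sizes `VOCAB` and `DIM`, and every function name are hypothetical stand-ins, and the demo uses 4 sequences of 16 tokens instead of 64 sequences of 2048.

```python
import math
import random

VOCAB = 8  # toy vocabulary size (hypothetical)
DIM = 4    # toy hidden size (hypothetical)

_init = random.Random(0)
EMBED = [[_init.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB)]

def next_logits(tokens):
    """Toy stand-in for the trained model: score every token against the last one."""
    h = EMBED[tokens[-1]]
    return [sum(a * b for a, b in zip(h, EMBED[v])) for v in range(VOCAB)]

def sample_token(logits, temperature, rng):
    """Draw one token id from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exp for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def self_generate(num_seqs, seq_len, temperature, seed, bos=0):
    """Autoregressively sample num_seqs sequences of seq_len tokens each."""
    rng = random.Random(seed)  # fixed seed keeps the calibration set reproducible
    seqs = []
    for _ in range(num_seqs):
        tokens = [bos]
        while len(tokens) < seq_len:
            tokens.append(sample_token(next_logits(tokens), temperature, rng))
        seqs.append(tokens)
    return seqs

def accumulate_hessian(seqs):
    """Accumulate H = X^T X, where each row of X is one layer-input vector."""
    H = [[0.0] * DIM for _ in range(DIM)]
    for tokens in seqs:
        for tok in tokens:
            x = EMBED[tok]  # in this toy, the layer input is just the embedding
            for i in range(DIM):
                for j in range(DIM):
                    H[i][j] += x[i] * x[j]
    return H

# The README describes 64 sequences of 2048 tokens; smaller here to keep the demo fast.
calib = self_generate(num_seqs=4, seq_len=16, temperature=0.8, seed=1234)
H = accumulate_hessian(calib)
```

The only inputs are the model's own samples and a seed, so neither validation nor training tokens enter the Hessian. In a real transformer one such H would be accumulated per quantized linear layer from that layer's input activations.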

### 2. BigramHash 3072 × dim=112 (up from 1536)

Lineage: [PR #549](https://github.com/openai/parameter-golf/pull/549) (1536) → [PR #609](https://github.com/openai/parameter-golf/pull/609) (2048) → this run (**3072 × dim=112**). This setting fits under 16MB; going wider pushed artifact size past the break-even point.
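
To make the size trade-off concrete, the sketch below counts the entries in a 3072 × 112 table and shows one way a (previous, current) token pair could be mapped to a bucket. The hash function, the pad token and the helper names are assumptions for illustration; the submission's actual hashing scheme is not described here and may differ.

```python
BIGRAM_VOCAB_SIZE = 3072  # hash buckets (BIGRAM_VOCAB_SIZE in the run command)
BIGRAM_DIM = 112          # embedding width (BIGRAM_DIM in the run command)

def bigram_bucket(prev_token, cur_token, num_buckets=BIGRAM_VOCAB_SIZE):
    """Map a (previous, current) token pair to a bucket id.

    Hypothetical hash for illustration; the submission's real hash may differ.
    """
    return (prev_token * 1_000_003 + cur_token) % num_buckets

def bucket_ids(tokens, pad_token=0):
    """Bucket id per position; position 0 pairs with a hypothetical pad token."""
    prev = [pad_token] + list(tokens[:-1])
    return [bigram_bucket(p, c) for p, c in zip(prev, tokens)]

table_params = BIGRAM_VOCAB_SIZE * BIGRAM_DIM  # entries in the embedding table
ids = bucket_ids([17, 942, 5, 5, 1023])
print(table_params)  # 344064
```

Every additional bucket adds `BIGRAM_DIM` weights that must survive quantization and compression, which is the artifact cost weighed against the BPB gain.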

### 3. XSA on all 11 layers (up from last 4)

PR #549 applied XSA to the last 4 layers. Extending to all 11 layers forces cross-position information mixing from layer 0 at zero parameter cost. Source: [PR #478](https://github.com/openai/parameter-golf/pull/478) by @gowtham0992.

### Dropped: TTT

PR #549 used Legal Score-First TTT for −0.0025 BPB. On this stack, TTT is neutral or negative (25 failed attempts across two stacks; see our [PR #756](https://github.com/openai/parameter-golf/pull/756)). The Full Hessian GPTQ improvement more than compensates for dropping TTT.

---

## Architecture

| Component | Setting | First introduced by |
|-----------|---------|---------------------|
| Layers | 11 (512d, 8 GQA heads, 4 KV heads) | Baseline |
| MLP | 3× (1536) with LeakyReLU(0.5)² | [#493](https://github.com/openai/parameter-golf/pull/493) @parinzee |
| Attention | XSA on all 11 layers | [#478](https://github.com/openai/parameter-golf/pull/478) @gowtham0992 |
| BigramHash | **3072 × dim=112** | **This work** (concept: [#162](https://github.com/openai/parameter-golf/pull/162) @raahilshah) |
| RoPE | Partial (16/64 dims) | [#315](https://github.com/openai/parameter-golf/pull/315) @jfprincz |
| LN Scale | 1/√(layer+1) | [#315](https://github.com/openai/parameter-golf/pull/315) @jfprincz |
| VE128 | Layers 9-10 | [#374](https://github.com/openai/parameter-golf/pull/374) @unnir |
| SmearGate | Position-mixing gate | [#65](https://github.com/openai/parameter-golf/pull/65) @aquariouseworkman |
| U-Net skips | Encoder-decoder connections | [#289](https://github.com/openai/parameter-golf/pull/289) |
| Weight avg | EMA(0.997) + Tight SWA(every 50) | [#401](https://github.com/openai/parameter-golf/pull/401) @newjordan |
| Quantization | **Full Hessian GPTQ int6 (AR self-gen calibration)** | **This work** (GPTQ: [#535](https://github.com/openai/parameter-golf/pull/535) @raahilshah) |
| Compression | LZMA preset=9 | [#160](https://github.com/openai/parameter-golf/pull/160) @ChaseWNorton |
| Warmdown | 4000 iterations | [#364](https://github.com/openai/parameter-golf/pull/364) @shikhar1729 |
| Optimizer | **Parallel Muon + Parameter Banking** | **[#399](https://github.com/openai/parameter-golf/pull/399) @abaybektursun** |
| Late QAT | STE at LR scale < 0.15 | [#286](https://github.com/openai/parameter-golf/pull/286) @chris-buckley |
| Selective pruning | ±1 values by reconstruction error | [#609](https://github.com/openai/parameter-golf/pull/609) @saml212 |
| Flash Attention 3 | Hopper warp-specialized kernels | [#122](https://github.com/openai/parameter-golf/pull/122) @mtybadger |
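
A few of the settings in the table reduce to one-line formulas. The sketch below spells them out under assumptions that the table does not confirm: layers are taken to be zero-indexed, LeakyReLU(0.5)² is read as a leaky ReLU with negative slope 0.5 followed by squaring, and EMA(0.997) is read as the usual exponential moving average with decay 0.997. All names are illustrative.

```python
import math

NUM_LAYERS = 11
MODEL_DIM = 512
NUM_HEADS = 8

# 512 / 8 = 64 dims per head, consistent with "Partial (16/64 dims)" for RoPE.
head_dim = MODEL_DIM // NUM_HEADS

# LN scale 1/sqrt(layer + 1), assuming layers are indexed from 0.
ln_scales = [1.0 / math.sqrt(layer + 1) for layer in range(NUM_LAYERS)]

def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU with the given negative slope, then squared (assumed reading)."""
    y = x if x >= 0 else negative_slope * x
    return y * y

def ema_update(avg, value, decay=0.997):
    """One step of an exponential moving average over a weight value."""
    return decay * avg + (1.0 - decay) * value

print(head_dim, [round(s, 3) for s in ln_scales[:4]])
```

Under the zero-indexing assumption the first layer is left unscaled and the fourth layer is scaled by 0.5.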

## Requirements

**Flash Attention 3 (Hopper) is required.** The script imports `flash_attn_interface` directly and was run with PyTorch 2.9.1+cu128.

```bash
pip install --break-system-packages flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
pip install sentencepiece zstandard
python3 -c "from flash_attn_interface import flash_attn_func; import sentencepiece, zstandard; print('deps OK')"
```

## Run Command

```bash
BIGRAM_VOCAB_SIZE=3072 BIGRAM_DIM=112 WARMDOWN_ITERS=4000 \
TARGET_MB=15.9 SEED=314 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Lineage

```
PR #549 (Legal SOTA, 1.1194): our Parallel Muon base with LeakyReLU² + legal TTT
    └── This work adds:
        ├── AR self-gen GPTQ calibration (no external data during quantization)
        ├── BigramHash 3072 × 112 (wider setting that still fits under 16MB)
        ├── XSA-all (from #478/@gowtham0992, applied via #609/@saml212)
        ├── Selective ±1 pruning (from #609/@saml212)
        ├── warmdown=4000, LZMA=9 (from #364/@shikhar1729, #160/@ChaseWNorton)
        └── Guided by PR #670 negative results (30+ failed experiments)
```
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# FlashAttention 3 must be installed separately; see README.md
sentencepiece
zstandard
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
{
  "author": "abaybektursun",
  "github_id": "abaybektursun",
  "name": "AR Self-Gen GPTQ + XSA-all + BigramHash 3072x112",
  "blurb": "11L XSA-all + Full Hessian GPTQ with autoregressive self-generated calibration (no val/train data accessed during quantization) + selective-pruning stack. BigramHash(3072,112), warmdown=4000, lzma preset=9. 3-seed exact mean: 1.11473509 BPB / 1.88217853 nats, beating PR549's exact 3-seed mean 1.11937967 BPB / 1.89002068 nats by 0.00784215 nats (Welch t=-11.83, df=3.31).",
  "date": "2026-03-25",
  "track": "10min_16mb",
  "val_loss": 1.88217853,
  "val_bpb": 1.11473509,
  "val_loss_std": 0.00059750,
  "val_bpb_std": 0.00035387,
  "seeds": [314, 42, 999],
  "seed_results": {
    "314": {
      "val_loss": 1.88276292,
      "val_bpb": 1.11508120,
      "artifact_bytes": 15863278,
      "steps": 6927,
      "step_avg_ms": 86.6
    },
    "42": {
      "val_loss": 1.88156874,
      "val_bpb": 1.11437394,
      "artifact_bytes": 15984850,
      "steps": 6922,
      "step_avg_ms": 86.7
    },
    "999": {
      "val_loss": 1.88220393,
      "val_bpb": 1.11475014,
      "artifact_bytes": 15876310,
      "steps": 6917,
      "step_avg_ms": 86.8
    }
  },
  "comparison_baseline_pr": 549,
  "implementation_lineage_pr": 609,
  "negative_results_pr": 670,
  "delta_vs_pr549_nats": -0.00784215,
  "delta_vs_pr549_bpb": -0.00464458,
  "t_statistic": -11.8339,
  "welch_df": 3.3063,
  "artifact_bytes_mean": 15908146,
  "artifact_bytes_max": 15984850,
  "bytes_total": 15984850,
  "train_steps_mean": 6922.00,
  "step_avg_ms_mean": 86.69,
  "hardware": "8xH100 80GB SXM",
  "pytorch_version": "2.9.1+cu128",
  "cuda_version": "12.8",
  "flash_attn_version": "2.8.3 (FA3 Hopper kernels)",
  "calibration": "AR self-generated (64 seqs x 2048 tokens, temp=0.8, no external data)",
  "technique_summary": "AR self-gen GPTQ calibration + XSA-all + BigramHash 3072x112 + Parallel Muon + LZMA9"
}
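
The summary fields in this JSON can be re-derived from its `seed_results` entries. The following is a small consistency check, assuming the reported standard deviation is the sample standard deviation (n − 1 denominator); the variable names mirror the JSON keys, but the script itself is illustrative and not part of the commit.

```python
import statistics

# Per-seed values copied from "seed_results" above.
seed_results = {
    "314": {"val_loss": 1.88276292, "val_bpb": 1.11508120, "artifact_bytes": 15863278, "steps": 6927},
    "42": {"val_loss": 1.88156874, "val_bpb": 1.11437394, "artifact_bytes": 15984850, "steps": 6922},
    "999": {"val_loss": 1.88220393, "val_bpb": 1.11475014, "artifact_bytes": 15876310, "steps": 6917},
}

runs = list(seed_results.values())
val_bpb = statistics.mean(r["val_bpb"] for r in runs)
val_loss = statistics.mean(r["val_loss"] for r in runs)
val_bpb_std = statistics.stdev(r["val_bpb"] for r in runs)  # sample std (n - 1)
artifact_bytes_mean = sum(r["artifact_bytes"] for r in runs) / len(runs)
artifact_bytes_max = max(r["artifact_bytes"] for r in runs)
train_steps_mean = sum(r["steps"] for r in runs) / len(runs)

print(f"val_bpb={val_bpb:.8f} val_loss={val_loss:.8f} std={val_bpb_std:.8f}")
print(f"artifact mean={artifact_bytes_mean:.0f} max={artifact_bytes_max}")
```

The largest artifact (seed 42) is the one recorded as `bytes_total`, so that is the figure that has to stay under the track's size limit.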
