records/track_non_record_16mb/2026-04-06_StateSpaceHybrid_AttentionAnchors/README.md
Lines changed: 94 additions & 81 deletions
@@ -1,10 +1,10 @@
1
1
# Non-Record: State-Space Hybrid with Attention Anchors
2
2
3
-
This folder records a local continuation of the README wishlist `state-space models` lane.
3
+
This folder records a V8 promotion of the README wishlist `state-space models` lane.
4
4
5
5
This is **not** a leaderboard record attempt.
6
6
This is **not** an official 8xH100 / 10-minute lane run.
7
-
This is **not** a full-train-shards claim.
7
+
This is **not** a full-train-shards claim for the kept run.
8
8
This is **not** a statistical-significance claim for a record.
9
9
10
10
Track label for this folder:
@@ -18,8 +18,9 @@ What it is:
18
18
- A scorer-clean hybrid architecture study on the standard `train_gpt.py` path.
19
19
- Standard primary metric: `final_int8_zlib_roundtrip_exact val_bpb`.
20
20
- Full official validation split (`fineweb_val_*.bin`, `62,021,632` scored tokens).
21
-
- Local one-shard training on the single available `fineweb_train_000000.bin` shard.
22
-
- A Blackwell workstation continuation that refreshed the strongest legal all-attention control package before promoting a stronger hybrid point.
21
+
- The kept promoted result still trains on the single locally available `fineweb_train_000000.bin` shard.
22
+
- Phase 0 now includes a three-seed rerun package for the current kept hybrid family and the refreshed strongest legal all-attention control family on the same Blackwell lane.
23
+
- Phase 1 now includes a bounded Modal H100 continuation over an 80-shard cached train view, improving realism without changing the non-record status of the kept point.
23
24
- A compile-friendly architecture family, while the kept run below explicitly used `ENABLE_TORCH_COMPILE=0`.
24
25
25
26
## Kept Result
@@ -32,94 +33,91 @@ What it is:
32
33
- Kept transfer: `SmearGate`, retained as a fixed-predictor one-at-a-time transfer
- Training budget: `2200` steps on the single available train shard
38
+
- Training budget: `2200` steps on the single locally available train shard
38
39
- Kept run compile setting: `ENABLE_TORCH_COMPILE=0`, `SDP_BACKEND=math`
39
40
- GPU lane: single `NVIDIA RTX PRO 6000 Blackwell Workstation Edition`
40
-
- Primary score: `val_bpb = 1.50465667`
41
-
- Primary loss: `val_loss = 2.54054976`
41
+
- Primary score: `val_bpb = 1.50126339`
42
+
- Primary loss: `val_loss = 2.53482035`
42
43
- Model params: `17,119,784`
43
44
44
45
## Controlled Comparison
45
46
46
-
All rows below use the same scorer path, the same tokenizer, the same full validation split, and the same one-shard local training data.
47
+
All rows below use the same scorer path, the same tokenizer, the same full validation split, and the same one-shard local training data unless explicitly labeled as the separate H100 realism probe.
47
48
48
-
The strongest retained legal all-attention control in this continuation is now the leaner `top1blockfp16` family that preserves `tok_emb` and only the top attention block in `float16`, then spends the recovered byte budget on more Blackwell training steps.
49
+
The strongest retained legal all-attention control in this continuation remains the leaner `top1blockfp16` family that preserves `tok_emb` and only the top attention block in `float16`, then spends the recovered byte budget on more Blackwell training steps.
49
50
50
51
| Run | Layout | Train time | Eval time | Total bytes | val_bpb | val_loss | Legality |
Delta vs the refreshed strongest legal all-attention control: `-0.06192494` BPB.
59
+
Delta vs the strongest legal all-attention control: `-0.06531822` BPB.
59
60
60
-
Delta vs the previous promoted kept result: `-0.09209028` BPB.
61
+
Delta vs the previous public winner: `-0.00339328` BPB.
61
62
62
63
Important legality note:
63
64
64
-
- the retained legal control is now the `1420`-step `top1blockfp16` point at `1.56658161` BPB
65
+
- the strongest retained legal control remains the `1420`-step `top1blockfp16` point at `1.56658161` BPB
66
+
- the additional `4242` rerun stayed legal but was weaker at `1.56838339` BPB
65
67
- the nearby `1425`-step control was slightly better on raw BPB but crossed the cap at `16,006,424` bytes
66
-
- the older `740`-step and `800`-step `top2blocksfp16` points remain retained legality references only
67
-
- all higher-score illegal controls are documentation only and are not admissible as counted controls
68
+
- all higher-score illegal controls remain documentation only and are not admissible as counted controls
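
The legality rule in the notes above can be read as a byte-cap filter on the exported artifact. The following is a minimal sketch, not the scorer's actual check. It assumes the cap is `16,000,000` total bytes (inferred from the `16mb` track name and from `16,006,424` bytes being described as crossing the cap) and that a run exactly at the cap would count as legal. `BYTE_CAP`, `is_legal`, and `reported_bytes` are hypothetical names. The byte counts are the ones reported in this README.

```python
# Assumption: the track cap is 16,000,000 total bytes, and "at the cap" is legal.
BYTE_CAP = 16_000_000


def is_legal(total_bytes: int, cap: int = BYTE_CAP) -> bool:
    """Return True when an exported artifact fits under the byte cap."""
    return total_bytes <= cap


# Total-byte figures quoted in this README (labels are informal).
reported_bytes = {
    "control top1blockfp16, 1425 steps": 16_006_424,
    "hybrid, 2200 steps, seed 2027": 15_260_268,
    "hybrid, 2200 steps, seed 4242": 15_272_426,
}

legality = {name: is_legal(size) for name, size in reported_bytes.items()}

for name, size in reported_bytes.items():
    status = "legal" if legality[name] else "over cap"
    print(f"{name}: {size:,} bytes -> {status}")
```

Under these assumptions the `1425`-step control is excluded regardless of its raw BPB, which is why the `1420`-step point is the one counted as the control.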
68
69
69
-
## Variance / Stability Package
70
+
## Phase 0 Variance Package
70
71
71
-
Before promoting a new winner, the previous public winner `AAAASASSS` + `SSM_KERNEL_SIZE=96` + `SmearGate` at `1200` steps was rerun on the same Blackwell lane to verify that its gain survived more seeds.
72
+
The current kept hybrid family was rerun twice more on the same Blackwell lane before promoting V8.
72
73
73
-
Retained reruns for the previous public winner at `1200` steps:
74
+
Retained reruns for the current kept hybrid family at `2200` steps:
74
75
75
-
- seed `2027`: `1.59674695` BPB
76
-
- seed `1337`: `1.60406053` BPB
77
-
- seed `4242`: `1.59435882` BPB
78
-
- seed `9001`: `1.60437754` BPB
79
-
- mean: `1.59988596`
80
-
- stddev: `0.00509915`
81
-
82
-
The earlier strongest control package at `730` steps was also rerun:
83
-
84
-
- seed `2027`: `1.65376228` BPB
85
-
- seed `4242`: `1.64550320` BPB
86
-
- seed `9001`: `1.65947334` BPB
87
-
- mean: `1.65291294`
88
-
- stddev: `0.00573482`
89
-
90
-
Mean edge for the previous public winner over that prior control package: `-0.05302698` BPB.
76
+
- seed `2027`: `1.50465667` BPB
77
+
- seed `1337`: `1.50615600` BPB
78
+
- seed `4242`: `1.50126339` BPB
79
+
- mean: `1.50402535`
80
+
- sample stddev: `0.00250666`
91
81
92
-
The refreshed strongest control family also has a retained rerun package:
82
+
The refreshed strongest legal control family at `1420` steps was also rerun to three seeds:
93
83
94
84
- seed `2027`: `1.56658161` BPB
95
85
- seed `1337`: `1.56865945` BPB
96
-
- mean: `1.56762053`
97
-
- stddev: `0.00146925`
86
+
- seed `4242`: `1.56838339` BPB
87
+
- mean: `1.56787482`
88
+
- sample stddev: `0.00112842`
98
89
99
-
The promoted `2200`-step hybrid continuation has a retained rerun package:
90
+
Mean paired hybrid-minus-control edge across the three matching seeds: `-0.06384946` BPB.
100
91
101
-
- seed `2027`: `1.50465667` BPB
102
-
- seed `1337`: `1.50615600` BPB
103
-
- mean: `1.50540634`
104
-
- stddev: `0.00106019`
105
-
106
-
Mean edge for the promoted candidate over the refreshed control mean: `-0.06221419` BPB.
92
+
Paired edge sample stddev: `0.00284710`.
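
The summary statistics in this package can be recomputed from the per-seed scores listed above. This is a minimal sketch using Python's standard `statistics` module. The variable names are illustrative, and it assumes "sample stddev" means the n-1 (Bessel-corrected) estimator and that the paired edge is hybrid minus control on the matching seed.

```python
from statistics import mean, stdev

# Per-seed val_bpb values as reported in this README, keyed by seed.
hybrid = {2027: 1.50465667, 1337: 1.50615600, 4242: 1.50126339}
control = {2027: 1.56658161, 1337: 1.56865945, 4242: 1.56838339}

hybrid_mean = mean(hybrid.values())
hybrid_std = stdev(hybrid.values())  # sample stddev (divides by n - 1)

control_mean = mean(control.values())
control_std = stdev(control.values())

# Paired edge: hybrid minus control on the same seed (negative is better).
edges = [hybrid[seed] - control[seed] for seed in hybrid]
edge_mean = mean(edges)
edge_std = stdev(edges)

print(f"hybrid  mean={hybrid_mean:.8f} stddev={hybrid_std:.8f}")
print(f"control mean={control_mean:.8f} stddev={control_std:.8f}")
print(f"paired  mean={edge_mean:.8f} stddev={edge_std:.8f}")
```

With only three seeds per family, these standard deviations are rough spread indicators rather than tight estimates, which is consistent with this folder making no statistical-significance claim.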
107
93
108
94
## Data / Scale Reality
109
95
110
-
The biggest realism bottleneck in this local campaign remains unchanged:
96
+
The biggest local realism bottleneck remains the same:
111
97
112
-
- detected local train shards: `1`
98
+
- local dataset directory: `C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf\data\datasets\fineweb10B_sp1024`
99
+
- local train shards detected: `1`
113
100
- available local shard: `fineweb_train_000000.bin`
101
+
- manifest-declared train shards for the full dataset: `195`
114
102
115
-
This continuation again checked bounded alternate-machine options before accepting the one-shard limit:
103
+
This continuation again checked bounded alternate-machine options before accepting the local one-shard limit:
116
104
117
105
- `vm-ubuntu-pitlab`: reachable, zero visible `fineweb_train_*.bin` shards, no visible `nvidia-smi`
118
106
- `ubuntu-dev`: reachable, zero visible `fineweb_train_*.bin` shards, no visible `nvidia-smi`
119
107
- `widelab-mac`: reachable, Apple `M4`, zero visible `fineweb_train_*.bin` shards
120
-
- `runpodctl`: installed locally but not configured with an API key, so no usable remote H100 lane was available from this workspace
108
+
- `runpodctl`: installed locally but not configured with an API key, so no usable RunPod path was available from this workspace
109
+
110
+
A usable remote H100 path did exist through Modal without new human setup:
121
111
122
-
No additional local or alternate-machine multi-shard continuation path was accessible during this run, so the kept result is still a one-shard non-record Blackwell result.
This improves the realism package because the same fixed-predictor hybrid recipe was exercised on a real multi-shard H100 path. It does **not** convert the kept result into an official-lane claim, and it does **not** replace the local Blackwell kept run as the promoted non-record point.
119
+
120
+
Phase 6 official-lane feasibility was not triggered in this campaign because the raw improvement over `1.50465667` was `0.00339328` BPB, below the `0.01` threshold required to force an official-lane feasibility attempt.
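
The Phase 6 gate described above is a simple arithmetic comparison. This is a minimal sketch with hypothetical names (`raw_improvement`, `triggers_official_lane`). It assumes the gate fires when the raw improvement is at least the threshold, and the exact comparison used by the campaign is not stated here.

```python
PREVIOUS_BPB = 1.50465667        # previous public winner (seed 2027)
NEW_BPB = 1.50126339             # promoted kept result (seed 4242)
OFFICIAL_LANE_THRESHOLD = 0.01   # BPB improvement required to force Phase 6


def raw_improvement(previous: float, new: float) -> float:
    """Positive when the new score is lower (better) than the previous one."""
    return previous - new


def triggers_official_lane(previous: float, new: float,
                           threshold: float = OFFICIAL_LANE_THRESHOLD) -> bool:
    # Assumption: "at least the threshold" triggers the feasibility attempt.
    return raw_improvement(previous, new) >= threshold


improvement = raw_improvement(PREVIOUS_BPB, NEW_BPB)
triggered = triggers_official_lane(PREVIOUS_BPB, NEW_BPB)
print(f"improvement={improvement:.8f} BPB, phase 6 triggered: {triggered}")
```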
123
121
124
122
## Refreshed Control Frontier
125
123
@@ -136,7 +134,8 @@ Retained `top1blockfp16` controls on the same Blackwell lane with `tok_emb,block
136
134
This matters for interpretation:
137
135
138
136
- the public lane is no longer being compared only against the older `730`-step `top2blocksfp16` baseline
139
-
- the kept hybrid now clears a much stronger legal all-attention control by `0.06192494` BPB
137
+
- the kept hybrid now clears a much stronger legal all-attention control by `0.06531822` BPB
138
+
- the strongest legal control was stable across three seeds (sample stddev `0.00112842` BPB), so the hybrid keeps a material edge on the refreshed package (mean paired edge `-0.06384946` BPB)
140
139
141
140
## Export Granularity Study
142
141
@@ -165,9 +164,10 @@ The longer `128`-tap kernel regressed. A modest rank increase to `14` was slight
165
164
166
165
Scaling the stronger rank-14 point on the same lane produced:
167
166
168
-
- `1800` steps: `1.53097696` BPB, `14,765,396` total bytes
169
-
- `2000` steps: `1.51685767` BPB, `15,051,906` total bytes
170
-
- `2200` steps: `1.50465667` BPB, `15,260,268` total bytes
167
+
- `1800` steps, seed `2027`: `1.53097696` BPB, `14,765,396` total bytes
168
+
- `2000` steps, seed `2027`: `1.51685767` BPB, `15,051,906` total bytes
169
+
- `2200` steps, seed `2027`: `1.50465667` BPB, `15,260,268` total bytes
170
+
- `2200` steps, seed `4242`: `1.50126339` BPB, `15,272,426` total bytes
171
171
172
172
The kept result therefore spends the remaining legal budget on recurrent capacity plus more same-lane Blackwell scale, not on more attention.
173
173
@@ -189,12 +189,12 @@ Smear tuning check:
189
189
190
190
The lighter smear init was negative.
191
191
192
-
New transfer check on the stronger rank-14 branch:
192
+
QK-gain tuning check on the stronger rank-14 branch: