Commit bb6cc10

Author: Codex
Promote state-space hybrid V8 non-record keep
1 parent cc191ad commit bb6cc10

7 files changed

Lines changed: 641 additions & 765 deletions

README.md

Lines changed: 4 additions & 1 deletion
@@ -58,6 +58,7 @@ Happy training!
5858
|-----|------:|--------|---------|------|------|
5959
| 1 Bit Quantization | 1.1239 | Ciprian-Florin Ifrim | 106M params quantized to 1 bit + misc arch changes + 2hr training | 2026-03-24 | [info](records/track_non_record_16mb/2026-03-24_106M_Binary_Asymmetric_UNet_FP8_15L_8192BPE_YaRN_NeoMuon_Smear/README.md) |
6060
| 4-Hour Baseline | 1.2074 | Will DePue | Testing unlimited compute, 4 hours on 8xH100 | 2026-03-18 | [info](records/track_non_record_16mb/2026-03-18_Quasi10Bfrom50B_SP1024_9x512_KV4_4h_pgut3/README.md) |
61+
| State-Space Hybrid + Attention Anchors | 1.5013 | Greg / Codex AutoResearch | Fixed-predictor AAAASASSS S4D hybrid on the standard scorer path; local one-shard Blackwell keep plus bounded 80-shard H100 continuation evidence | 2026-04-09 | [info](records/track_non_record_16mb/2026-04-06_StateSpaceHybrid_AttentionAnchors/README.md) |
6162

6263
#### Requests for PRs
6364

@@ -72,7 +73,9 @@ We'd love to see weird & creative ideas in the challenge, since you never know w
7273
- [ ] H-net tokenization
7374
- [ ] Universal transformer - [We have lots of depth recurrence submissions, but I'd love to see one 4 hour
7475
- [ ] Megakernels
75-
- [ ] State-space models, E2E TTT, super long context for evaluation or training
76+
- [x] State-space models - [implementation](records/track_non_record_16mb/2026-04-06_StateSpaceHybrid_AttentionAnchors/README.md)
77+
- [ ] E2E TTT
78+
- [ ] Super long context for evaluation or training
7679
- [ ] Learning adapters on random linear maps
7780

7881
## Getting Started

records/track_non_record_16mb/2026-04-06_StateSpaceHybrid_AttentionAnchors/README.md

Lines changed: 94 additions & 81 deletions
@@ -1,10 +1,10 @@
11
# Non-Record: State-Space Hybrid with Attention Anchors
22

3-
This folder records a local continuation of the README wishlist `state-space models` lane.
3+
This folder records a V8 promotion of the README wishlist `state-space models` lane.
44

55
This is **not** a leaderboard record attempt.
66
This is **not** an official 8xH100 / 10-minute lane run.
7-
This is **not** a full-train-shards claim.
7+
This is **not** a full-train-shards claim for the kept run.
88
This is **not** a statistical-significance claim for a record.
99

1010
Track label for this folder:
@@ -18,8 +18,9 @@ What it is:
1818
- A scorer-clean hybrid architecture study on the standard `train_gpt.py` path.
1919
- Standard primary metric: `final_int8_zlib_roundtrip_exact val_bpb`.
2020
- Full official validation split (`fineweb_val_*.bin`, `62,021,632` scored tokens).
21-
- Local one-shard training on the single available `fineweb_train_000000.bin` shard.
22-
- A Blackwell workstation continuation that refreshed the strongest legal all-attention control package before promoting a stronger hybrid point.
21+
- The kept promoted result still trains on the single locally available `fineweb_train_000000.bin` shard.
22+
- Phase 0 now includes a three-seed rerun package for the current kept hybrid family and the refreshed strongest legal all-attention control family on the same Blackwell lane.
23+
- Phase 1 now includes a bounded Modal H100 continuation over an 80-shard cached train view, improving realism without changing the non-record status of the kept point.
2324
- A compile-friendly architecture family, although the kept run below explicitly ran with `ENABLE_TORCH_COMPILE=0`.
2425

2526
## Kept Result
@@ -32,94 +33,91 @@ What it is:
3233
- Kept transfer: `SmearGate`, retained as a fixed-predictor one-at-a-time transfer
3334
- Kept attention tuning: default `QK_GAIN_INIT=1.5`
3435
- Kept export policy: keep only `ssm_coeff`, `ssm_log_decay`, and `ssm_d` in `float16`; quantize the rest with the standard int8+zlib path
35-
- Seed: `2027`
36+
- Seed: `4242`
3637
- Fixed config: `TRAIN_BATCH_TOKENS=32768`, `TRAIN_SEQ_LEN=1024`, `VAL_BATCH_SIZE=262144`
37-
- Training budget: `2200` steps on the single available train shard
38+
- Training budget: `2200` steps on the single locally available train shard
3839
- Kept run compile setting: `ENABLE_TORCH_COMPILE=0`, `SDP_BACKEND=math`
3940
- GPU lane: single `NVIDIA RTX PRO 6000 Blackwell Workstation Edition`
40-
- Primary score: `val_bpb = 1.50465667`
41-
- Primary loss: `val_loss = 2.54054976`
41+
- Primary score: `val_bpb = 1.50126339`
42+
- Primary loss: `val_loss = 2.53482035`
4243
- Model params: `17,119,784`
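
The export policy above can be pictured as a name-based split of parameters. The sketch below is illustrative only: it assumes the patterns are matched as plain substrings of the parameter name, which may differ from what `train_gpt.py` actually does, and the parameter names in `example_names` are hypothetical. Only the pattern string itself comes from the recorded command.

```python
def parse_patterns(raw: str) -> list[str]:
    """Split a comma-separated pattern string, dropping empty entries."""
    return [p.strip() for p in raw.split(",") if p.strip()]


def split_export_policy(param_names, patterns):
    """Partition names into (keep in float16, quantize to int8).

    ASSUMPTION: substring matching; the real matching rule is not shown here.
    """
    keep_float16, quantize_int8 = [], []
    for name in param_names:
        if any(pattern in name for pattern in patterns):
            keep_float16.append(name)
        else:
            quantize_int8.append(name)
    return keep_float16, quantize_int8


# Pattern string as recorded in INT8_FORCE_FLOAT_NAME_PATTERNS for the kept run.
patterns = parse_patterns("ssm_coeff,ssm_log_decay,ssm_d")

# Hypothetical parameter names, for illustration only.
example_names = [
    "blocks.4.ssm.ssm_coeff",
    "blocks.4.ssm.ssm_log_decay",
    "blocks.4.ssm.ssm_d",
    "blocks.4.mlp.fc.weight",
    "tok_emb.weight",
]

keep_float16, quantize_int8 = split_export_policy(example_names, patterns)
print("float16:", keep_float16)
print("int8+zlib:", quantize_int8)
```

Under this reading, only the small recurrent tensors stay in `float16`, while the attention, MLP, and embedding weights all go through the standard int8+zlib path.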
4344

4445
## Controlled Comparison
4546

46-
All rows below use the same scorer path, the same tokenizer, the same full validation split, and the same one-shard local training data.
47+
All rows below use the same scorer path, the same tokenizer, the same full validation split, and the same one-shard local training data unless explicitly labeled as the separate H100 realism probe.
4748

48-
The strongest retained legal all-attention control in this continuation is now the leaner `top1blockfp16` family that preserves `tok_emb` and only the top attention block in `float16`, then spends the recovered byte budget on more Blackwell training steps.
49+
The strongest retained legal all-attention control in this continuation remains the leaner `top1blockfp16` family that preserves `tok_emb` and only the top attention block in `float16`, then spends the recovered byte budget on more Blackwell training steps.
4950

5051
| Run | Layout | Train time | Eval time | Total bytes | val_bpb | val_loss | Legality |
5152
|---|---|---:|---:|---:|---:|---:|---|
52-
| Previous promoted hybrid | `AAAASASSS` | `312,374 ms` | `124,259 ms` | `13,249,267` | `1.59674695` | `2.69604033` | legal |
53-
| Previous strongest legal control | `AAAAAAAAA` | `293,093 ms` | `233,089 ms` | `15,979,462` | `1.65376228` | `2.79230833` | legal |
54-
| **Refreshed strongest legal control** | **`AAAAAAAAA`** | **`442,368 ms`** | **`222,490 ms`** | **`15,993,409`** | **`1.56658161`** | **`2.64510742`** | **legal** |
55-
| Nearest refreshed byte-cap boundary control | `AAAAAAAAA` | `492,617 ms` | `212,123 ms` | `16,006,424` | `1.56591641` | `2.64398426` | illegal |
56-
| **Kept hybrid** | **`AAAASASSS`** | **`609,292 ms`** | **`141,595 ms`** | **`15,260,268`** | **`1.50465667`** | **`2.54054976`** | **legal** |
53+
| Previous public winner | `AAAASASSS` | `609,292 ms` | `141,595 ms` | `15,260,268` | `1.50465667` | `2.54054976` | legal |
54+
| Strongest legal all-attention control | `AAAAAAAAA` | `442,368 ms` | `222,490 ms` | `15,993,409` | `1.56658161` | `2.64510742` | legal |
55+
| Control rerun on same lane | `AAAAAAAAA` | `431,256 ms` | `190,006 ms` | `15,996,880` | `1.56838339` | `2.64814966` | legal |
56+
| Nearest byte-cap boundary control | `AAAAAAAAA` | `492,617 ms` | `212,123 ms` | `16,006,424` | `1.56591641` | `2.64398426` | illegal |
57+
| **Kept hybrid** | **`AAAASASSS`** | **`1,207,089 ms`** | **`124,606 ms`** | **`15,272,426`** | **`1.50126339`** | **`2.53482035`** | **legal** |
5758

58-
Delta vs the refreshed strongest legal all-attention control: `-0.06192494` BPB.
59+
Delta vs the strongest legal all-attention control: `-0.06531822` BPB.
5960

60-
Delta vs the previous promoted kept result: `-0.09209028` BPB.
61+
Delta vs the previous public winner: `-0.00339328` BPB.
6162

6263
Important legality note:
6364

64-
- the retained legal control is now the `1420`-step `top1blockfp16` point at `1.56658161` BPB
65+
- the strongest retained legal control remains the `1420`-step `top1blockfp16` point at `1.56658161` BPB
66+
- the additional `4242` rerun stayed legal but was weaker at `1.56838339` BPB
6567
- the nearby `1425`-step control was slightly better on raw BPB but crossed the cap at `16,006,424` bytes
66-
- the older `740`-step and `800`-step `top2blocksfp16` points remain retained legality references only
67-
- all higher-score illegal controls are documentation only and are not admissible as counted controls
68+
- all higher-score illegal controls remain documentation only and are not admissible as counted controls
6869

69-
## Variance / Stability Package
70+
## Phase 0 Variance Package
7071

71-
Before promoting a new winner, the previous public winner `AAAASASSS` + `SSM_KERNEL_SIZE=96` + `SmearGate` at `1200` steps was rerun on the same Blackwell lane to verify that its gain survived more seeds.
72+
The current kept hybrid family was rerun twice more on the same Blackwell lane before promoting V8.
7273

73-
Retained reruns for the previous public winner at `1200` steps:
74+
Retained reruns for the current kept hybrid family at `2200` steps:
7475

75-
- seed `2027`: `1.59674695` BPB
76-
- seed `1337`: `1.60406053` BPB
77-
- seed `4242`: `1.59435882` BPB
78-
- seed `9001`: `1.60437754` BPB
79-
- mean: `1.59988596`
80-
- stddev: `0.00509915`
81-
82-
The earlier strongest control package at `730` steps was also rerun:
83-
84-
- seed `2027`: `1.65376228` BPB
85-
- seed `4242`: `1.64550320` BPB
86-
- seed `9001`: `1.65947334` BPB
87-
- mean: `1.65291294`
88-
- stddev: `0.00573482`
89-
90-
Mean edge for the previous public winner over that prior control package: `-0.05302698` BPB.
76+
- seed `2027`: `1.50465667` BPB
77+
- seed `1337`: `1.50615600` BPB
78+
- seed `4242`: `1.50126339` BPB
79+
- mean: `1.50402535`
80+
- sample stddev: `0.00250666`
9181

92-
The refreshed strongest control family also has a retained rerun package:
82+
The refreshed strongest legal control family at `1420` steps was also rerun to three seeds:
9383

9484
- seed `2027`: `1.56658161` BPB
9585
- seed `1337`: `1.56865945` BPB
96-
- mean: `1.56762053`
97-
- stddev: `0.00146925`
86+
- seed `4242`: `1.56838339` BPB
87+
- mean: `1.56787482`
88+
- sample stddev: `0.00112842`
9889

99-
The promoted `2200`-step hybrid continuation has a retained rerun package:
90+
Mean paired hybrid-minus-control edge across the three matching seeds: `-0.06384946` BPB.
10091

101-
- seed `2027`: `1.50465667` BPB
102-
- seed `1337`: `1.50615600` BPB
103-
- mean: `1.50540634`
104-
- stddev: `0.00106019`
105-
106-
Mean edge for the promoted candidate over the refreshed control mean: `-0.06221419` BPB.
92+
Paired edge sample stddev: `0.00284710`.
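
The summary statistics in this section can be rechecked from the per-seed values alone. This is a minimal sketch using only the Python standard library. The dictionary and function names are illustrative and are not part of the repository. "Sample stddev" is taken to mean the `n - 1` denominator, which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

# val_bpb per seed, copied from the rerun lists in this section.
hybrid = {2027: 1.50465667, 1337: 1.50615600, 4242: 1.50126339}
control = {2027: 1.56658161, 1337: 1.56865945, 4242: 1.56838339}


def summarize(values):
    """Return (mean, sample standard deviation with the n-1 denominator)."""
    values = list(values)
    return mean(values), stdev(values)


# Pair runs by seed so each edge compares a hybrid run with its matching control.
paired_edges = [hybrid[seed] - control[seed] for seed in hybrid]

hybrid_mean, hybrid_sd = summarize(hybrid.values())
control_mean, control_sd = summarize(control.values())
edge_mean, edge_sd = summarize(paired_edges)

print(f"hybrid:  mean={hybrid_mean:.8f} sample_sd={hybrid_sd:.8f}")
print(f"control: mean={control_mean:.8f} sample_sd={control_sd:.8f}")
print(f"edge:    mean={edge_mean:.8f} sample_sd={edge_sd:.8f}")
```

With only three seeds per family, these figures describe spread rather than significance, which is consistent with the folder making no statistical-significance claim.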
10793

10894
## Data / Scale Reality
10995

110-
The biggest realism bottleneck in this local campaign remains unchanged:
96+
The biggest local realism bottleneck remains the same:
11197

112-
- detected local train shards: `1`
98+
- local dataset directory: `C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf\data\datasets\fineweb10B_sp1024`
99+
- local train shards detected: `1`
113100
- available local shard: `fineweb_train_000000.bin`
101+
- manifest-declared train shards for the full dataset: `195`
114102

115-
This continuation again checked bounded alternate-machine options before accepting the one-shard limit:
103+
This continuation again checked bounded alternate-machine options before accepting the local one-shard limit:
116104

117105
- `vm-ubuntu-pitlab`: reachable, zero visible `fineweb_train_*.bin` shards, no visible `nvidia-smi`
118106
- `ubuntu-dev`: reachable, zero visible `fineweb_train_*.bin` shards, no visible `nvidia-smi`
119107
- `widelab-mac`: reachable, Apple `M4`, zero visible `fineweb_train_*.bin` shards
120-
- `runpodctl`: installed locally but not configured with an API key, so no usable remote H100 lane was available from this workspace
108+
- `runpodctl`: installed locally but not configured with an API key, so no usable RunPod path was available from this workspace
109+
110+
A usable remote H100 path did exist through Modal without new human setup:
121111

122-
No additional local or alternate-machine multi-shard continuation path was accessible during this run, so the kept result is still a one-shard non-record Blackwell result.
112+
- cached volume: `pg-data`
113+
- cached train view: `fineweb10B_sp1024_train080`
114+
- train shards visible on that view: `80`
115+
- bounded H100 continuation: `modal_hybrid_aaaasasss_rank14_k96_smear_train080_400steps_seed4242_v8`
116+
- exact result: `val_bpb = 1.84995753`, `val_loss = 3.12357579`, `9,265,387` total bytes
117+
118+
This improves the realism package because the same fixed-predictor hybrid recipe was exercised on a real multi-shard H100 path. It does **not** convert the kept result into an official-lane claim, and it does **not** replace the local Blackwell kept run as the promoted non-record point.
119+
120+
Phase 6 official-lane feasibility was not triggered in this campaign because the raw improvement over `1.50465667` was `0.00339328` BPB, below the `0.01` threshold required to force an official-lane feasibility attempt.
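
The Phase 6 gate described above reduces to a single comparison. The sketch below assumes the gate is inclusive (`>=`), which the text does not specify, and the function name is hypothetical. Only the two BPB values and the `0.01` threshold come from this README.

```python
PHASE6_THRESHOLD_BPB = 0.01  # threshold quoted in the paragraph above


def phase6_triggered(previous_bpb, new_bpb, threshold=PHASE6_THRESHOLD_BPB):
    """Return (raw improvement, whether it forces an official-lane attempt).

    ASSUMPTION: the gate is inclusive; the README only says "below the threshold".
    """
    improvement = previous_bpb - new_bpb  # lower BPB is better
    return improvement, improvement >= threshold


improvement, triggered = phase6_triggered(1.50465667, 1.50126339)
print(f"improvement={improvement:.8f} BPB, triggered={triggered}")
```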
123121

124122
## Refreshed Control Frontier
125123

@@ -136,7 +134,8 @@ Retained `top1blockfp16` controls on the same Blackwell lane with `tok_emb,block
136134
This matters for interpretation:
137135

138136
- the public lane is no longer being compared only against the older `730`-step `top2blocksfp16` baseline
139-
- the kept hybrid now clears a much stronger legal all-attention control by `0.06192494` BPB
137+
- the kept hybrid now clears a much stronger legal all-attention control by `0.06531822` BPB
138+
- the strongest legal control remained stable enough across three seeds that the hybrid still keeps a material edge on the refreshed package
140139

141140
## Export Granularity Study
142141

@@ -165,9 +164,10 @@ The longer `128`-tap kernel regressed. A modest rank increase to `14` was slight
165164

166165
Scaling the stronger rank-14 point on the same lane produced:
167166

168-
- `1800` steps: `1.53097696` BPB, `14,765,396` total bytes
169-
- `2000` steps: `1.51685767` BPB, `15,051,906` total bytes
170-
- `2200` steps: `1.50465667` BPB, `15,260,268` total bytes
167+
- `1800` steps, seed `2027`: `1.53097696` BPB, `14,765,396` total bytes
168+
- `2000` steps, seed `2027`: `1.51685767` BPB, `15,051,906` total bytes
169+
- `2200` steps, seed `2027`: `1.50465667` BPB, `15,260,268` total bytes
170+
- `2200` steps, seed `4242`: `1.50126339` BPB, `15,272,426` total bytes
171171

172172
The kept result therefore spends the remaining legal budget on recurrent capacity plus more same-lane Blackwell scale, not on more attention.
173173

@@ -189,12 +189,12 @@ Smear tuning check:
189189

190190
The lighter smear init was negative.
191191

192-
New transfer check on the stronger rank-14 branch:
192+
QK-gain tuning check on the stronger rank-14 branch:
193193

194-
- `2200`-step rank-14 hybrid, default `QK_GAIN_INIT=1.5`: `1.50465667` BPB
194+
- `1800`-step rank-14 hybrid, default `QK_GAIN_INIT=1.5`: `1.53097696` BPB
195195
- same branch + `QK_GAIN_INIT=1.7`: `1.53303405` BPB
196196

197-
This QK-gain increase was strongly negative on the stronger branch, so the kept configuration stays on the default attention-gain setting.
197+
This QK-gain increase was negative on the stronger branch, so the kept configuration stays on the default attention-gain setting.
198198

199199
Negative reference transfer retained from earlier in the lane:
200200

@@ -212,48 +212,55 @@ Passed for this non-record folder:
212212
- Artifact byte audit under the decimal `16,000,000` byte cap for the kept promoted run
213213
- No validation-data training
214214
- No evaluation-time downloads or external services
215-
- Quantization/export policy explicitly accounted for in bytes
215+
- Recurrent export policy explicitly accounted for separately from the attention/MLP export policy
216216
- Kept run configuration explicitly recorded with `ENABLE_TORCH_COMPILE=0`
217217
- Fixed-predictor labeling explicit; no eval-time adaptation or TTT
218-
- Expanded rerun package for the previous public winner completed
219-
- Refreshed legal all-attention control frontier completed, including legal and illegal byte-boundary points
220-
- SSM-side headroom study completed
221-
- Fixed-predictor transfer study completed
222-
- Alternate-machine realism probe completed
223-
- Local H100 / official-lane feasibility was not possible from the accessible environments during this run
218+
- Phase 0 expanded rerun package completed
219+
- Phase 1 realism package completed with a bounded Modal H100 multi-shard continuation
220+
- Phase 2 refreshed legal all-attention control frontier completed, including legal and illegal byte-boundary points
221+
- Phase 3 fixed-predictor transfer study completed
222+
- Phase 4 SSM-side headroom study completed
223+
- Phase 6 official-lane feasibility was not triggered by this promotion
224224

225225
Not claimed here:
226226

227-
- full training set usage
227+
- full training set usage for the kept run
228228
- official record-lane legality
229-
- H100 or 8xH100 confirmation
229+
- official-lane feasibility confirmation
230230
- statistical significance for a record claim
231231

232232
## Artifact Size
233233

234234
- Code bytes: `57,941`
235-
- Model bytes (`final_model.int8.ptz`): `15,202,327`
236-
- Total bytes: `15,260,268`
235+
- Model bytes (`final_model.int8.ptz`): `15,214,485`
236+
- Total bytes: `15,272,426`
237+
- Remaining legal headroom: `727,574`
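
The byte audit is plain arithmetic and can be rechecked as follows. This is a sketch with an illustrative helper name. It treats a total exactly equal to the cap as legal, which is an assumption, since none of the runs recorded here lands exactly on the boundary.

```python
BYTE_CAP = 16_000_000  # decimal byte cap named in the checklist


def audit_total(total_bytes, cap=BYTE_CAP):
    """Return (headroom in bytes, whether the artifact fits under the cap).

    ASSUMPTION: a total equal to the cap counts as legal.
    """
    headroom = cap - total_bytes
    return headroom, headroom >= 0


code_bytes = 57_941
model_bytes = 15_214_485  # final_model.int8.ptz for the kept run
kept_total = code_bytes + model_bytes

kept_headroom, kept_legal = audit_total(kept_total)
# Total bytes of the nearby 1425-step control that crossed the cap.
boundary_headroom, boundary_legal = audit_total(16_006_424)

print(kept_total, kept_headroom, kept_legal)
print(boundary_headroom, boundary_legal)
```

The same check applied to the `1425`-step boundary control shows it exceeding the cap by `6,424` bytes, which is why it is listed as illegal despite its slightly better raw BPB.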
237238

238239
## Wallclock Breakdown
239240

240241
From the kept promoted run:
241242

242-
- Training time: `609,292 ms`
243-
- Evaluation time: `141,595 ms`
244-
- Export / serialization / roundtrip overhead: about `5,135 ms`
245-
- End-to-end run duration: `756.02 s`
243+
- Training time: `1,207,089 ms`
244+
- Evaluation time: `124,606 ms`
245+
- Export / serialization / roundtrip overhead: about `5,764 ms`
246+
- End-to-end run duration: `1,337.46 s`
246247

247-
Refreshed strongest legal all-attention control:
248+
Strongest legal all-attention control:
248249

249250
- Training time: `442,368 ms`
250251
- Evaluation time: `222,490 ms`
251252
- Export / serialization / roundtrip overhead: about `8,179 ms`
252253
- End-to-end run duration: `673.04 s`
253254

254-
## Exact Command
255+
Modal H100 realism continuation:
256+
257+
- Training time: `88,650 ms`
258+
- Evaluation time: `79,800 ms`
259+
- End-to-end run duration: `264.98 s`
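
The overhead figures in this section are the part of the end-to-end duration left after subtracting training and evaluation time. The sketch below uses an illustrative helper name. Because the end-to-end durations are reported to 0.01 s, the recomputed residuals can differ from the listed "about" values by a few milliseconds.

```python
def residual_overhead_ms(end_to_end_s, train_ms, eval_ms):
    """End-to-end duration minus training and evaluation time, in milliseconds."""
    return round(end_to_end_s * 1000) - train_ms - eval_ms


kept_overhead = residual_overhead_ms(1337.46, 1_207_089, 124_606)
control_overhead = residual_overhead_ms(673.04, 442_368, 222_490)
modal_overhead = residual_overhead_ms(264.98, 88_650, 79_800)

print(f"kept hybrid: {kept_overhead} ms")
print(f"control:     {control_overhead} ms")
print(f"Modal H100:  {modal_overhead} ms")
```

For the Modal continuation, the residual is far larger than on the Blackwell lane. The breakdown above does not itemize that remainder, so it should be read only as time outside the training and evaluation timers.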
260+
261+
## Exact Commands
255262

256-
PowerShell command used for the kept run from the research workspace:
263+
PowerShell command used for the kept promoted run from the research workspace:
257264

258265
```powershell
259266
$env:CUDA_VISIBLE_DEVICES='1'
@@ -278,7 +285,13 @@ $env:SSM_RANK='14'
278285
$env:PARALLEL_ATTN_BIAS_INIT='1.5'
279286
$env:SMEAR_ENABLED='1'
280287
$env:INT8_FORCE_FLOAT_NAME_PATTERNS='ssm_coeff,ssm_log_decay,ssm_d'
281-
$env:SEED='2027'
282-
$env:RUN_ID='full_anchor_s4d_aaaasasss_rank14_k96_corefp16_smear_2200steps_blackwell_seed2027'
288+
$env:SEED='4242'
289+
$env:RUN_ID='full_anchor_s4d_aaaasasss_rank14_k96_corefp16_smear_2200steps_blackwell_seed4242'
283290
C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf\.venv\Scripts\python.exe C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf-ssm-hybrid-research-scale\train_gpt.py
284291
```
292+
293+
PowerShell command used for the bounded Modal H100 realism continuation:
294+
295+
```powershell
296+
C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf\.venv\Scripts\python.exe C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf-ssm-hybrid-research-scale\experiments\state_space_hybrid\modal_phase1_probe.py --mode train --run-name modal_hybrid_aaaasasss_rank14_k96_smear_train080_400steps_seed4242_v8 --iterations 400 --seed 4242
297+
```
