Commit b4cc281

Author: Codex
docs: clarify state-space hybrid campaign outcome
1 parent bb6cc10 commit b4cc281

1 file changed

Lines changed: 65 additions & 260 deletions

File tree

  • records/track_non_record_16mb/2026-04-06_StateSpaceHybrid_AttentionAnchors
@@ -1,297 +1,102 @@
11
# Non-Record: State-Space Hybrid with Attention Anchors
22

3-
This folder records a V8 promotion of the README wishlist `state-space models` lane.
3+
This folder records a wishlist-aligned `state-space models` lane for the non-record 16 MB track.
44

55
This is **not** a leaderboard record attempt.
66
This is **not** an official 8xH100 / 10-minute lane run.
77
This is **not** a full-train-shards claim for the kept run.
8-
This is **not** a statistical-significance claim for a record.
8+
This is **not** a claim that the currently promoted public checkpoint is the latest strongest internal finding.
99

10-
Track label for this folder:
10+
Track label:
1111

1212
- fixed-predictor state-space hybrid
13-
- not an adaptive-compression result
14-
- no eval-time adaptation or TTT
13+
- no adaptive compression
14+
- no eval-time adaptation
15+
- no TTT
1516

16-
What it is:
17+
## Summary
1718

18-
- A scorer-clean hybrid architecture study on the standard `train_gpt.py` path.
19-
- Standard primary metric: `final_int8_zlib_roundtrip_exact val_bpb`.
20-
- Full official validation split (`fineweb_val_*.bin`, `62,021,632` scored tokens).
21-
- The kept promoted result still trains on the single locally available `fineweb_train_000000.bin` shard.
22-
- Phase 0 now includes a three-seed rerun package for the current kept hybrid family and the refreshed strongest legal all-attention control family on the same Blackwell lane.
23-
- Phase 1 now includes a bounded Modal H100 continuation over an 80-shard cached train view, improving realism without changing the non-record status of the kept point.
24-
- A compile-friendly architecture family, while the kept run below explicitly used `ENABLE_TORCH_COMPILE=0`.
19+
This PR keeps a conservative, non-record state-space sign-of-life on the public lane. The current public promoted checkpoint remains `STRONGER_VALID_STATE_SPACE_HYBRID_NON_RECORD_V8`, using the `AAAASASSS` fixed-predictor hybrid layout on the standard `train_gpt.py` scorer path.
2520

26-
## Kept Result
21+
A later long local campaign refreshed the legal all-attention control frontier and did **not** produce a reviewer-defensible new public promotion. The public checkpoint below should therefore be read as a historical promoted checkpoint for this lane, not as the strongest known internal result after the latest control refresh.
2722

28-
- Kept layout: `AAAASASSS`
23+
## Current Promoted Public Checkpoint
24+
25+
- Classification: `STRONGER_VALID_STATE_SPACE_HYBRID_NON_RECORD_V8`
26+
- Layout: `AAAASASSS`
2927
- `A` = exact attention block
3028
- `S` = compile-friendly S4D-style state-space block
3129
- Interpretation: four early exact-attention blocks, then four SSM blocks with a single attention anchor placed after the first of them (`S A S S S`)
32-
- SSM core: `S4D-Lin` descendant using learned exponential depthwise conv kernels
33-
- Kept transfer: `SmearGate`, retained as a fixed-predictor one-at-a-time transfer
34-
- Kept attention tuning: default `QK_GAIN_INIT=1.5`
35-
- Kept export policy: keep only `ssm_coeff`, `ssm_log_decay`, and `ssm_d` in `float16`; quantize the rest with the standard int8+zlib path
30+
- SSM core: `s4d`
31+
- SSM kernel size: `96`
32+
- SSM rank: `14`
33+
- Fixed-predictor transfer: `SMEAR_ENABLED=1`
34+
- Export policy: keep only `ssm_coeff`, `ssm_log_decay`, and `ssm_d` in `float16`; quantize the rest with the standard int8+zlib path
3635
- Seed: `4242`
37-
- Fixed config: `TRAIN_BATCH_TOKENS=32768`, `TRAIN_SEQ_LEN=1024`, `VAL_BATCH_SIZE=262144`
38-
- Training budget: `2200` steps on the single locally available train shard
39-
- Kept run compile setting: `ENABLE_TORCH_COMPILE=0`, `SDP_BACKEND=math`
40-
- GPU lane: single `NVIDIA RTX PRO 6000 Blackwell Workstation Edition`
41-
- Primary score: `val_bpb = 1.50126339`
42-
- Primary loss: `val_loss = 2.53482035`
43-
- Model params: `17,119,784`
44-
45-
## Controlled Comparison
46-
47-
All rows below use the same scorer path, the same tokenizer, the same full validation split, and the same one-shard local training data unless explicitly labeled as the separate H100 realism probe.
48-
49-
The strongest retained legal all-attention control in this continuation remains the leaner `top1blockfp16` family that preserves `tok_emb` and only the top attention block in `float16`, then spends the recovered byte budget on more Blackwell training steps.
50-
51-
| Run | Layout | Train time | Eval time | Total bytes | val_bpb | val_loss | Legality |
52-
|---|---|---:|---:|---:|---:|---:|---|
53-
| Previous public winner | `AAAASASSS` | `609,292 ms` | `141,595 ms` | `15,260,268` | `1.50465667` | `2.54054976` | legal |
54-
| Strongest legal all-attention control | `AAAAAAAAA` | `442,368 ms` | `222,490 ms` | `15,993,409` | `1.56658161` | `2.64510742` | legal |
55-
| Control rerun on same lane | `AAAAAAAAA` | `431,256 ms` | `190,006 ms` | `15,996,880` | `1.56838339` | `2.64814966` | legal |
56-
| Nearest byte-cap boundary control | `AAAAAAAAA` | `492,617 ms` | `212,123 ms` | `16,006,424` | `1.56591641` | `2.64398426` | illegal |
57-
| **Kept hybrid** | **`AAAASASSS`** | **`1,207,089 ms`** | **`124,606 ms`** | **`15,272,426`** | **`1.50126339`** | **`2.53482035`** | **legal** |
58-
59-
Delta vs the strongest legal all-attention control: `-0.06531822` BPB.
60-
61-
Delta vs the previous public winner: `-0.00339328` BPB.
62-
63-
Important legality note:
64-
65-
- the strongest retained legal control remains the `1420`-step `top1blockfp16` point at `1.56658161` BPB
66-
- the additional `4242` rerun stayed legal but was weaker at `1.56838339` BPB
67-
- the nearby `1425`-step control was slightly better on raw BPB but crossed the cap at `16,006,424` bytes
68-
- all higher-score illegal controls remain documentation only and are not admissible as counted controls
69-
70-
## Phase 0 Variance Package
71-
72-
The current kept hybrid family was rerun twice more on the same Blackwell lane before promoting V8.
73-
74-
Retained reruns for the current kept hybrid family at `2200` steps:
75-
76-
- seed `2027`: `1.50465667` BPB
77-
- seed `1337`: `1.50615600` BPB
78-
- seed `4242`: `1.50126339` BPB
79-
- mean: `1.50402535`
80-
- sample stddev: `0.00250666`
81-
82-
The refreshed strongest legal control family at `1420` steps was also rerun to three seeds:
83-
84-
- seed `2027`: `1.56658161` BPB
85-
- seed `1337`: `1.56865945` BPB
86-
- seed `4242`: `1.56838339` BPB
87-
- mean: `1.56787482`
88-
- sample stddev: `0.00112842`
89-
90-
Mean paired hybrid-minus-control edge across the three matching seeds: `-0.06384946` BPB.
91-
92-
Paired edge sample stddev: `0.00284710`.
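
The Phase 0 summary statistics above can be re-derived from the six per-seed scores. The following is a minimal sketch using only Python's standard `statistics` module. The dictionary and variable names are illustrative and are not taken from `train_gpt.py`.

```python
from statistics import mean, stdev

# Per-seed val_bpb values copied from the two Phase 0 lists above.
hybrid = {2027: 1.50465667, 1337: 1.50615600, 4242: 1.50126339}
control = {2027: 1.56658161, 1337: 1.56865945, 4242: 1.56838339}

hybrid_scores = list(hybrid.values())
control_scores = list(control.values())

# Paired edge: hybrid minus control on the same seed (negative favors the hybrid).
edges = [hybrid[seed] - control[seed] for seed in hybrid]

summary = {
    "hybrid mean": mean(hybrid_scores),
    "hybrid sample stddev": stdev(hybrid_scores),    # n - 1 denominator
    "control mean": mean(control_scores),
    "control sample stddev": stdev(control_scores),  # n - 1 denominator
    "paired edge mean": mean(edges),
    "paired edge sample stddev": stdev(edges),
}

for name, value in summary.items():
    print(f"{name}: {value:.8f}")
```

With only three seeds per family, these figures describe spread rather than support a significance claim, which matches the scope stated at the top of this README.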
93-
94-
## Data / Scale Reality
95-
96-
The biggest local realism bottleneck remains the same:
97-
98-
- local dataset directory: `C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf\data\datasets\fineweb10B_sp1024`
99-
- local train shards detected: `1`
100-
- available local shard: `fineweb_train_000000.bin`
101-
- manifest-declared train shards for the full dataset: `195`
102-
103-
This continuation again checked bounded alternate-machine options before accepting the local one-shard limit:
104-
105-
- `vm-ubuntu-pitlab`: reachable, zero visible `fineweb_train_*.bin` shards, no visible `nvidia-smi`
106-
- `ubuntu-dev`: reachable, zero visible `fineweb_train_*.bin` shards, no visible `nvidia-smi`
107-
- `widelab-mac`: reachable, Apple `M4`, zero visible `fineweb_train_*.bin` shards
108-
- `runpodctl`: installed locally but not configured with an API key, so no usable RunPod path was available from this workspace
109-
110-
A usable remote H100 path did exist through Modal without new human setup:
111-
112-
- cached volume: `pg-data`
113-
- cached train view: `fineweb10B_sp1024_train080`
114-
- train shards visible on that view: `80`
115-
- bounded H100 continuation: `modal_hybrid_aaaasasss_rank14_k96_smear_train080_400steps_seed4242_v8`
116-
- exact result: `val_bpb = 1.84995753`, `val_loss = 3.12357579`, `9,265,387` total bytes
117-
118-
This improves the realism package because the same fixed-predictor hybrid recipe was exercised on a real multi-shard H100 path. It does **not** convert the kept result into an official-lane claim, and it does **not** replace the local Blackwell kept run as the promoted non-record point.
119-
120-
Phase 6 official-lane feasibility was not triggered in this campaign because the raw improvement over `1.50465667` was `0.00339328` BPB, below the `0.01` threshold required to force an official-lane feasibility attempt.
121-
122-
## Refreshed Control Frontier
123-
124-
This continuation materially strengthened the all-attention control package before promoting a new hybrid point.
125-
126-
Retained `top1blockfp16` controls on the same Blackwell lane with `tok_emb,blocks.8. -> float16`:
127-
128-
- `900` steps: `1.61215746` BPB, `14,397,629` total bytes, legal
129-
- `1200` steps: `1.60052451` BPB, `15,371,862` total bytes, legal
130-
- `1400` steps: `1.56884979` BPB, `15,941,727` total bytes, legal
131-
- `1420` steps: `1.56658161` BPB, `15,993,409` total bytes, legal
132-
- `1425` steps: `1.56591641` BPB, `16,006,424` total bytes, illegal
36+
- Training budget: `2200` steps
37+
- Training data actually used by the kept run: the single locally available `fineweb_train_000000.bin` shard
38+
- Primary metric: `final_int8_zlib_roundtrip_exact val_bpb`
39+
- Public promoted score: `1.50126339`
40+
- Public promoted loss: `2.53482035`
41+
- Artifact bytes: `57,941` code + `15,214,485` model = `15,272,426` total bytes
42+
- Wallclock: `1,207,089 ms` training + `124,606 ms` evaluation + about `5,764 ms` export / roundtrip overhead
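
The layout string can be read one block at a time. The sketch below assumes one character per block in depth order, with 0-based block indices. The `describe_layout` helper is hypothetical and is not part of `train_gpt.py`.

```python
from collections import Counter

# Block codes as defined in this README: A = exact attention, S = S4D-style SSM.
BLOCK_KINDS = {"A": "attention", "S": "ssm"}


def describe_layout(layout: str) -> dict:
    """Count block kinds and report where the SSM blocks sit (0-based)."""
    unknown = set(layout) - set(BLOCK_KINDS)
    if unknown:
        raise ValueError(f"unknown block codes: {sorted(unknown)}")
    counts = Counter(layout)
    return {
        "num_blocks": len(layout),
        "attention_blocks": counts["A"],
        "ssm_blocks": counts["S"],
        "ssm_positions": [i for i, kind in enumerate(layout) if kind == "S"],
    }


info = describe_layout("AAAASASSS")
print(info)
```

For `AAAASASSS` this reports nine blocks in total, five attention and four SSM, with the lone attention anchor sitting between the first SSM block and the remaining three.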
13343

134-
This matters for interpretation:
44+
At the time V8 was promoted, the strongest retained legal all-attention matched control was:
13545

136-
- the public lane is no longer being compared only against the older `730`-step `top2blocksfp16` baseline
137-
- the kept hybrid now clears a much stronger legal all-attention control by `0.06531822` BPB
138-
- the strongest legal control remained stable enough across three seeds that the hybrid still keeps a material edge on the refreshed package
46+
- Control: `full_baseline_1420steps_blackwell_seed2027_top1blockfp16_v7`
47+
- Control score: `1.56658161`
48+
- Control artifact bytes: `15,993,409`
49+
- Historical V8 margin vs that control: `-0.06531822` BPB
13950

140-
## Export Granularity Study
51+
That historical control comparison is retained for provenance. It is no longer the strongest known internal legal control after the later local campaign described below.
14152

142-
This continuation revisited recurrent export granularity on the stronger `AAAASASSS` branch before committing to more scale.
53+
## Latest Internal Local-Campaign Finding
14354

144-
At `1200` steps on `AAAASASSS`, `SSM_RANK=12`, `SSM_KERNEL_SIZE=96`, `SMEAR_ENABLED=1`:
55+
After V8, a longer local-only Blackwell campaign refreshed the legal all-attention frontier and blocked a new public promotion.
14556

146-
- narrow recurrent-core fp16 (`ssm_coeff,ssm_log_decay,ssm_d`): `1.59674695` BPB, `13,249,267` total bytes
147-
- topmost full recurrent block fp16 (`ssm_coeff,ssm_log_decay,ssm_d,blocks.8.ssm.`): `1.59674681` BPB, `14,114,480` total bytes
57+
The strongest known internal legal all-attention control from that campaign is:
14858

149-
The broader policy was effectively neutral on score while costing `865,213` extra bytes.
59+
- Control: `full_baseline_2600steps_blackwell_seed2027_allint8_v9r3`
60+
- Layout: `AAAAAAAAA`
61+
- Score: `1.48142748`
62+
- Artifact bytes: `57,941` code + `15,319,767` model = `15,377,708` total bytes
63+
- Wallclock: `895,800 ms` training + `216,725 ms` evaluation + about `5,176 ms` export / roundtrip overhead
15064

151-
The kept export policy therefore remains the narrow recurrent-core allowlist.
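
To make the difference between the two export policies concrete, the sketch below applies both allowlists to a few parameter names. It assumes that a parameter is kept in `float16` when any comma-separated pattern is a substring of its name. That matching rule and the parameter names are assumptions for illustration and are not confirmed against `train_gpt.py`.

```python
def parse_patterns(raw: str) -> list[str]:
    """Split a comma-separated allowlist into non-empty patterns."""
    return [p.strip() for p in raw.split(",") if p.strip()]


def keeps_float16(param_name: str, patterns: list[str]) -> bool:
    # Assumption: substring match; everything else goes down the int8+zlib path.
    return any(p in param_name for p in patterns)


# The two allowlists compared in the export granularity study.
narrow = parse_patterns("ssm_coeff,ssm_log_decay,ssm_d")
broad = parse_patterns("ssm_coeff,ssm_log_decay,ssm_d,blocks.8.ssm.")

# Hypothetical parameter names, for illustration only.
names = [
    "blocks.4.ssm.ssm_coeff",
    "blocks.8.ssm.ssm_log_decay",
    "blocks.8.ssm.in_proj.weight",
    "blocks.0.attn.qkv.weight",
]

narrow_kept = [n for n in names if keeps_float16(n, narrow)]
broad_kept = [n for n in names if keeps_float16(n, broad)]

print("narrow keeps:", narrow_kept)
print("broad keeps: ", broad_kept)
```

Under this assumption the broader policy additionally pulls every tensor under `blocks.8.ssm.` into `float16`, which is consistent with it costing extra bytes while leaving the score essentially unchanged.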
65+
The best unpromoted hybrid candidate from that campaign was:
15266

153-
## SSM Headroom Study
67+
- Candidate: `full_anchor_s4d_aaaasasss_rank14_k96_corefp16_smear_2600steps_blackwell_seed4242_v9r3`
68+
- Layout: `AAAASASSS`
69+
- Score: `1.48097508`
70+
- Artifact bytes: `57,941` code + `15,471,811` model = `15,529,752` total bytes
71+
- Wallclock: `736,141 ms` training + `124,986 ms` evaluation + about `9,076 ms` export / roundtrip overhead
15472

155-
This continuation then spent the remaining legal headroom on the SSM side before adding any new public architectural claims.
73+
The candidate was legal and lower on raw BPB than the refreshed control, but only by `0.00045240` BPB. That is too small to satisfy the promotion rule requiring either a large matched-control advantage or a clearly documented control package proving that the hybrid still materially matters.
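
The two margins quoted in this README can be reproduced directly from the reported scores. This is a minimal sketch with illustrative variable names. No numeric promotion threshold is encoded because the promotion rule above is stated qualitatively.

```python
# Scores (val_bpb) copied from this README; lower is better.
v8_hybrid = 1.50126339        # current promoted public checkpoint
v7_control = 1.56658161       # strongest legal control when V8 was promoted
v9r3_candidate = 1.48097508   # best unpromoted hybrid candidate
v9r3_control = 1.48142748     # refreshed all-attention control

# Margins are hybrid minus control, so negative values favor the hybrid.
historical_margin = v8_hybrid - v7_control
refreshed_margin = v9r3_candidate - v9r3_control

print(f"historical V8 margin:  {historical_margin:+.8f} BPB")
print(f"refreshed v9r3 margin: {refreshed_margin:+.8f} BPB")
print(f"shrink factor: {historical_margin / refreshed_margin:.1f}x")
```

The refreshed margin is more than two orders of magnitude smaller than the historical one, which is why the candidate was not promoted.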
15674

157-
At `1200` steps on `AAAASASSS`:
75+
No new public promotion was made from that campaign.
15876

159-
- `SSM_RANK=12`, `SSM_KERNEL_SIZE=96`: `1.59674695` BPB, `13,249,267` total bytes
160-
- `SSM_RANK=12`, `SSM_KERNEL_SIZE=128`: `1.60650831` BPB, `13,247,332` total bytes
161-
- `SSM_RANK=14`, `SSM_KERNEL_SIZE=96`: `1.59638095` BPB, `13,279,961` total bytes
77+
## Validity / Scope
16278

163-
The longer `128`-tap kernel regressed. A modest rank increase to `14` was slightly positive.
79+
Passed for the current public non-record checkpoint:
16480

165-
Scaling the stronger rank-14 point on the same lane produced:
166-
167-
- `1800` steps, seed `2027`: `1.53097696` BPB, `14,765,396` total bytes
168-
- `2000` steps, seed `2027`: `1.51685767` BPB, `15,051,906` total bytes
169-
- `2200` steps, seed `2027`: `1.50465667` BPB, `15,260,268` total bytes
170-
- `2200` steps, seed `4242`: `1.50126339` BPB, `15,272,426` total bytes
171-
172-
The kept result therefore spends the remaining legal budget on recurrent capacity plus more same-lane Blackwell scale, not on more attention.
173-
174-
## Fixed-Predictor Transfer Study
175-
176-
This continuation kept the transfer study strictly fixed-predictor and one-at-a-time.
177-
178-
Retained positive transfer from earlier in the lane:
179-
180-
- `AAAASASSS`, `SSM_KERNEL_SIZE=96`, `1000` steps, no SmearGate: `1.62386862` BPB
181-
- same branch + `SMEAR_ENABLED=1`: `1.61118819` BPB
182-
183-
`SmearGate` improved score by `0.01268043` BPB and remains part of the kept branch family.
184-
185-
Smear tuning check:
186-
187-
- default `SMEAR_INIT=0.0`: `1.61118819` BPB
188-
- lighter `SMEAR_INIT=-0.5`: `1.61534070` BPB
189-
190-
The lighter smear init was negative.
191-
192-
QK-gain tuning check on the stronger rank-14 branch:
193-
194-
- `1800`-step rank-14 hybrid, default `QK_GAIN_INIT=1.5`: `1.53097696` BPB
195-
- same branch + `QK_GAIN_INIT=1.7`: `1.53303405` BPB
196-
197-
This QK-gain increase was negative on the stronger branch, so the kept configuration stays on the default attention-gain setting.
198-
199-
Negative reference transfer retained from earlier in the lane:
200-
201-
- `AAAASASSS`, `SSM_RANK=8`, `600` steps, no BigramHash: `1.71464907` BPB
202-
- same branch + `BIGRAM_VOCAB_SIZE=1024`, `BIGRAM_DIM=64`: `1.71975209` BPB
203-
204-
The small fixed-predictor BigramHash side path remains negative evidence in this lane.
205-
206-
## Validity Notes
207-
208-
Passed for this non-record folder:
209-
210-
- Same scorer path for control and hybrid (`train_gpt.py`, `final_int8_zlib_roundtrip_exact`)
211-
- Full official validation split, standard `val_bpb`
212-
- Artifact byte audit under the decimal `16,000,000` byte cap for the kept promoted run
81+
- Same scorer path for control and hybrid: `train_gpt.py`, `final_int8_zlib_roundtrip_exact`
82+
- Full official validation split for the promoted public checkpoint
83+
- Artifact byte audit under the decimal `16,000,000` byte cap
84+
- All counted code for the artifact lives in `train_gpt.py`
21385
- No validation-data training
214-
- No evaluation-time downloads or external services
215-
- Recurrent export policy explicitly accounted for separately from the attention/MLP export policy
216-
- Kept run configuration explicitly recorded with `ENABLE_TORCH_COMPILE=0`
217-
- Fixed-predictor labeling explicit; no eval-time adaptation or TTT
218-
- Phase 0 expanded rerun package completed
219-
- Phase 1 realism package completed with a bounded Modal H100 multi-shard continuation
220-
- Phase 2 refreshed legal all-attention control frontier completed, including legal and illegal byte-boundary points
221-
- Phase 3 fixed-predictor transfer study completed
222-
- Phase 4 SSM-side headroom study completed
223-
- Phase 6 official-lane feasibility was not triggered by this promotion
224-
225-
Not claimed here:
226-
227-
- full training set usage for the kept run
228-
- official record-lane legality
229-
- official-lane feasibility confirmation
230-
- statistical significance for a record claim
231-
232-
## Artifact Size
233-
234-
- Code bytes: `57,941`
235-
- Model bytes (`final_model.int8.ptz`): `15,214,485`
236-
- Total bytes: `15,272,426`
237-
- Remaining legal headroom: `727,574`
238-
239-
## Wallclock Breakdown
240-
241-
From the kept promoted run:
242-
243-
- Training time: `1,207,089 ms`
244-
- Evaluation time: `124,606 ms`
245-
- Export / serialization / roundtrip overhead: about `5,764 ms`
246-
- End-to-end run duration: `1,337.46 s`
247-
248-
Strongest legal all-attention control:
249-
250-
- Training time: `442,368 ms`
251-
- Evaluation time: `222,490 ms`
252-
- Export / serialization / roundtrip overhead: about `8,179 ms`
253-
- End-to-end run duration: `673.04 s`
254-
255-
Modal H100 realism continuation:
256-
257-
- Training time: `88,650 ms`
258-
- Evaluation time: `79,800 ms`
259-
- End-to-end run duration: `264.98 s`
86+
- No evaluation-time downloads or hidden services
87+
- Fixed-predictor labeling remains explicit
88+
- No eval-time adaptation or TTT
89+
- Recurrent export policy is accounted for separately from the attention / MLP export policy
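
The byte audit in the list above is simple arithmetic over the code and model sizes reported in this README. The sketch below repeats it for the three runs whose sizes are given. It assumes that "under the cap" means strictly fewer than `16,000,000` total bytes, and the `audit` helper is hypothetical.

```python
BYTE_CAP = 16_000_000   # decimal cap stated in this README
CODE_BYTES = 57_941     # counted code size reported for each run below

# Model bytes per run, copied from this README.
model_bytes_by_run = {
    "V8 promoted hybrid": 15_214_485,
    "v9r3 all-attention control": 15_319_767,
    "v9r3 hybrid candidate": 15_471_811,
}


def audit(model_bytes: int, code_bytes: int = CODE_BYTES, cap: int = BYTE_CAP):
    """Return (total bytes, remaining headroom, legal?) for one artifact."""
    total = code_bytes + model_bytes
    # Assumption: strictly under the cap counts as legal.
    return total, cap - total, total < cap


for run, model_bytes in model_bytes_by_run.items():
    total, headroom, legal = audit(model_bytes)
    status = "legal" if legal else "illegal"
    print(f"{run}: total={total:,} headroom={headroom:,} {status}")
```

All three totals match the figures quoted in this README, and each run stays under the cap.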
26090

261-
## Exact Commands
91+
Main scope limits:
26292

263-
PowerShell command used for the kept promoted run from the research workspace:
93+
- The kept promoted result is still one-shard local: only `fineweb_train_000000.bin` was locally usable.
94+
- The local dataset manifest reports `195` train shards, so one-shard training remains the biggest realism bottleneck.
95+
- A bounded remote realism package was run earlier through Modal on an 80-shard cached view, but a cloud-credit-backed continuation is no longer available and is not part of any current promotion gate.
96+
- No official-lane H100 feasibility result is claimed.
26497

265-
```powershell
266-
$env:CUDA_VISIBLE_DEVICES='1'
267-
$env:DATA_PATH='C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf\data\datasets\fineweb10B_sp1024'
268-
$env:TOKENIZER_PATH='C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf\data\tokenizers\fineweb_1024_bpe.model'
269-
$env:TRAIN_BATCH_TOKENS='32768'
270-
$env:VAL_BATCH_SIZE='262144'
271-
$env:TRAIN_SEQ_LEN='1024'
272-
$env:ITERATIONS='2200'
273-
$env:TRAIN_LOG_EVERY='20'
274-
$env:VAL_LOSS_EVERY='0'
275-
$env:MAX_WALLCLOCK_SECONDS='0'
276-
$env:WARMUP_STEPS='0'
277-
$env:ENABLE_TORCH_COMPILE='0'
278-
$env:SDP_BACKEND='math'
279-
$env:SAVE_RAW_MODEL='0'
280-
$env:FINAL_PREQUANT_EVAL='0'
281-
$env:BLOCK_LAYOUT='AAAASASSS'
282-
$env:SSM_CORE='s4d'
283-
$env:SSM_KERNEL_SIZE='96'
284-
$env:SSM_RANK='14'
285-
$env:PARALLEL_ATTN_BIAS_INIT='1.5'
286-
$env:SMEAR_ENABLED='1'
287-
$env:INT8_FORCE_FLOAT_NAME_PATTERNS='ssm_coeff,ssm_log_decay,ssm_d'
288-
$env:SEED='4242'
289-
$env:RUN_ID='full_anchor_s4d_aaaasasss_rank14_k96_corefp16_smear_2200steps_blackwell_seed4242'
290-
C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf\.venv\Scripts\python.exe C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf-ssm-hybrid-research-scale\train_gpt.py
291-
```
98+
## Notes
29299

293-
PowerShell command used for the bounded Modal H100 realism continuation:
100+
This PR intentionally remains a conservative draft.
294101

295-
```powershell
296-
C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf\.venv\Scripts\python.exe C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf-ssm-hybrid-research-scale\experiments\state_space_hybrid\modal_phase1_probe.py --mode train --run-name modal_hybrid_aaaasasss_rank14_k96_smear_train080_400steps_seed4242_v8 --iterations 400 --seed 4242
297-
```
102+
The lane remains interesting as a wishlist-aligned, non-record `state-space models` sign-of-life. However, the latest internal evidence says the next public promotion likely needs at least one of the following: stronger realism, a more orthogonal state-space contribution, or a control package showing that the SSM tail matters by more than the `0.00045240` BPB refreshed-control margin found in the long local campaign.
