|
1 | 1 | # Non-Record: State-Space Hybrid with Attention Anchors |
2 | 2 |
|
3 | | -This folder records a V8 promotion of the README wishlist `state-space models` lane. |
| 3 | +This folder records a wishlist-aligned `state-space models` lane for the non-record 16 MB track. |
4 | 4 |
|
5 | 5 | This is **not** a leaderboard record attempt. |
6 | 6 | This is **not** an official 8xH100 / 10-minute lane run. |
7 | 7 | This is **not** a full-train-shards claim for the kept run. |
8 | | -This is **not** a statistical-significance claim for a record. |
| 8 | +This is **not** a claim that the currently promoted public checkpoint is the latest strongest internal finding. |
9 | 9 |
|
10 | | -Track label for this folder: |
| 10 | +Track label: |
11 | 11 |
|
12 | 12 | - fixed-predictor state-space hybrid |
13 | | -- not an adaptive-compression result |
14 | | -- no eval-time adaptation or TTT |
| 13 | +- no adaptive compression |
| 14 | +- no eval-time adaptation |
| 15 | +- no TTT |
15 | 16 |
|
16 | | -What it is: |
| 17 | +## Summary |
17 | 18 |
|
18 | | -- A scorer-clean hybrid architecture study on the standard `train_gpt.py` path. |
19 | | -- Standard primary metric: `final_int8_zlib_roundtrip_exact val_bpb`. |
20 | | -- Full official validation split (`fineweb_val_*.bin`, `62,021,632` scored tokens). |
21 | | -- The kept promoted result still trains on the single locally available `fineweb_train_000000.bin` shard. |
22 | | -- Phase 0 now includes a three-seed rerun package for the current kept hybrid family and the refreshed strongest legal all-attention control family on the same Blackwell lane. |
23 | | -- Phase 1 now includes a bounded Modal H100 continuation over an 80-shard cached train view, improving realism without changing the non-record status of the kept point. |
24 | | -- A compile-friendly architecture family, while the kept run below explicitly used `ENABLE_TORCH_COMPILE=0`. |
| 19 | +This PR keeps a conservative, non-record state-space sign-of-life on the public lane. The current public promoted checkpoint remains `STRONGER_VALID_STATE_SPACE_HYBRID_NON_RECORD_V8`, using the `AAAASASSS` fixed-predictor hybrid layout on the standard `train_gpt.py` scorer path. |
25 | 20 |
|
26 | | -## Kept Result |
| 21 | +A later long local campaign refreshed the legal all-attention control frontier and did **not** produce a reviewer-defensible new public promotion. The public checkpoint below should therefore be read as a historical promoted checkpoint for this lane, not as the strongest known internal result after the latest control refresh. |
27 | 22 |
|
28 | | -- Kept layout: `AAAASASSS` |
| 23 | +## Current Promoted Public Checkpoint |
| 24 | + |
| 25 | +- Classification: `STRONGER_VALID_STATE_SPACE_HYBRID_NON_RECORD_V8` |
| 26 | +- Layout: `AAAASASSS` |
29 | 27 | - `A` = exact attention block |
30 | 28 | - `S` = compile-friendly S4D-style state-space block |
31 | 29 | - Interpretation: four early exact-attention blocks, one mid attention anchor, and a four-block SSM tail |
32 | | -- SSM core: `S4D-Lin` descendant using learned exponential depthwise conv kernels |
33 | | -- Kept transfer: `SmearGate`, retained as a fixed-predictor one-at-a-time transfer |
34 | | -- Kept attention tuning: default `QK_GAIN_INIT=1.5` |
35 | | -- Kept export policy: keep only `ssm_coeff`, `ssm_log_decay`, and `ssm_d` in `float16`; quantize the rest with the standard int8+zlib path |
| 30 | +- SSM core: `s4d` |
| 31 | +- SSM kernel size: `96` |
| 32 | +- SSM rank: `14` |
| 33 | +- Fixed-predictor transfer: `SMEAR_ENABLED=1` |
| 34 | +- Export policy: keep only `ssm_coeff`, `ssm_log_decay`, and `ssm_d` in `float16`; quantize the rest with the standard int8+zlib path |
36 | 35 | - Seed: `4242` |
37 | | -- Fixed config: `TRAIN_BATCH_TOKENS=32768`, `TRAIN_SEQ_LEN=1024`, `VAL_BATCH_SIZE=262144` |
38 | | -- Training budget: `2200` steps on the single locally available train shard |
39 | | -- Kept run compile setting: `ENABLE_TORCH_COMPILE=0`, `SDP_BACKEND=math` |
40 | | -- GPU lane: single `NVIDIA RTX PRO 6000 Blackwell Workstation Edition` |
41 | | -- Primary score: `val_bpb = 1.50126339` |
42 | | -- Primary loss: `val_loss = 2.53482035` |
43 | | -- Model params: `17,119,784` |
44 | | - |
45 | | -## Controlled Comparison |
46 | | - |
47 | | -All rows below use the same scorer path, the same tokenizer, the same full validation split, and the same one-shard local training data unless explicitly labeled as the separate H100 realism probe. |
48 | | - |
49 | | -The strongest retained legal all-attention control in this continuation remains the leaner `top1blockfp16` family that preserves `tok_emb` and only the top attention block in `float16`, then spends the recovered byte budget on more Blackwell training steps. |
50 | | - |
51 | | -| Run | Layout | Train time | Eval time | Total bytes | val_bpb | val_loss | Legality | |
52 | | -|---|---|---:|---:|---:|---:|---:|---| |
53 | | -| Previous public winner | `AAAASASSS` | `609,292 ms` | `141,595 ms` | `15,260,268` | `1.50465667` | `2.54054976` | legal | |
54 | | -| Strongest legal all-attention control | `AAAAAAAAA` | `442,368 ms` | `222,490 ms` | `15,993,409` | `1.56658161` | `2.64510742` | legal | |
55 | | -| Control rerun on same lane | `AAAAAAAAA` | `431,256 ms` | `190,006 ms` | `15,996,880` | `1.56838339` | `2.64814966` | legal | |
56 | | -| Nearest byte-cap boundary control | `AAAAAAAAA` | `492,617 ms` | `212,123 ms` | `16,006,424` | `1.56591641` | `2.64398426` | illegal | |
57 | | -| **Kept hybrid** | **`AAAASASSS`** | **`1,207,089 ms`** | **`124,606 ms`** | **`15,272,426`** | **`1.50126339`** | **`2.53482035`** | **legal** | |
58 | | - |
59 | | -Delta vs the strongest legal all-attention control: `-0.06531822` BPB. |
60 | | - |
61 | | -Delta vs the previous public winner: `-0.00339328` BPB. |
62 | | - |
63 | | -Important legality note: |
64 | | - |
65 | | -- the strongest retained legal control remains the `1420`-step `top1blockfp16` point at `1.56658161` BPB |
66 | | -- the additional `4242` rerun stayed legal but was weaker at `1.56838339` BPB |
67 | | -- the nearby `1425`-step control was slightly better on raw BPB but crossed the cap at `16,006,424` bytes |
68 | | -- all higher-score illegal controls remain documentation only and are not admissible as counted controls |
69 | | - |
70 | | -## Phase 0 Variance Package |
71 | | - |
72 | | -The current kept hybrid family was rerun twice more on the same Blackwell lane before promoting V8. |
73 | | - |
74 | | -Retained reruns for the current kept hybrid family at `2200` steps: |
75 | | - |
76 | | -- seed `2027`: `1.50465667` BPB |
77 | | -- seed `1337`: `1.50615600` BPB |
78 | | -- seed `4242`: `1.50126339` BPB |
79 | | -- mean: `1.50402535` |
80 | | -- sample stddev: `0.00250666` |
81 | | - |
82 | | -The refreshed strongest legal control family at `1420` steps was also rerun to three seeds: |
83 | | - |
84 | | -- seed `2027`: `1.56658161` BPB |
85 | | -- seed `1337`: `1.56865945` BPB |
86 | | -- seed `4242`: `1.56838339` BPB |
87 | | -- mean: `1.56787482` |
88 | | -- sample stddev: `0.00112842` |
89 | | - |
90 | | -Mean paired hybrid-minus-control edge across the three matching seeds: `-0.06384946` BPB. |
91 | | - |
92 | | -Paired edge sample stddev: `0.00284710`. |
93 | | - |
94 | | -## Data / Scale Reality |
95 | | - |
96 | | -The biggest local realism bottleneck remains the same: |
97 | | - |
98 | | -- local dataset directory: `C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf\data\datasets\fineweb10B_sp1024` |
99 | | -- local train shards detected: `1` |
100 | | -- available local shard: `fineweb_train_000000.bin` |
101 | | -- manifest-declared train shards for the full dataset: `195` |
102 | | - |
103 | | -This continuation again checked bounded alternate-machine options before accepting the local one-shard limit: |
104 | | - |
105 | | -- `vm-ubuntu-pitlab`: reachable, zero visible `fineweb_train_*.bin` shards, no visible `nvidia-smi` |
106 | | -- `ubuntu-dev`: reachable, zero visible `fineweb_train_*.bin` shards, no visible `nvidia-smi` |
107 | | -- `widelab-mac`: reachable, Apple `M4`, zero visible `fineweb_train_*.bin` shards |
108 | | -- `runpodctl`: installed locally but not configured with an API key, so no usable RunPod path was available from this workspace |
109 | | - |
110 | | -A usable remote H100 path did exist through Modal without new human setup: |
111 | | - |
112 | | -- cached volume: `pg-data` |
113 | | -- cached train view: `fineweb10B_sp1024_train080` |
114 | | -- train shards visible on that view: `80` |
115 | | -- bounded H100 continuation: `modal_hybrid_aaaasasss_rank14_k96_smear_train080_400steps_seed4242_v8` |
116 | | -- exact result: `val_bpb = 1.84995753`, `val_loss = 3.12357579`, `9,265,387` total bytes |
117 | | - |
118 | | -This improves the realism package because the same fixed-predictor hybrid recipe was exercised on a real multi-shard H100 path. It does **not** convert the kept result into an official-lane claim, and it does **not** replace the local Blackwell kept run as the promoted non-record point. |
119 | | - |
120 | | -Phase 6 official-lane feasibility was not triggered in this campaign because the raw improvement over `1.50465667` was `0.00339328` BPB, below the `0.01` threshold required to force an official-lane feasibility attempt. |
121 | | - |
122 | | -## Refreshed Control Frontier |
123 | | - |
124 | | -This continuation materially strengthened the all-attention control package before promoting a new hybrid point. |
125 | | - |
126 | | -Retained `top1blockfp16` controls on the same Blackwell lane with `tok_emb,blocks.8. -> float16`: |
127 | | - |
128 | | -- `900` steps: `1.61215746` BPB, `14,397,629` total bytes, legal |
129 | | -- `1200` steps: `1.60052451` BPB, `15,371,862` total bytes, legal |
130 | | -- `1400` steps: `1.56884979` BPB, `15,941,727` total bytes, legal |
131 | | -- `1420` steps: `1.56658161` BPB, `15,993,409` total bytes, legal |
132 | | -- `1425` steps: `1.56591641` BPB, `16,006,424` total bytes, illegal |
| 36 | +- Training budget: `2200` steps |
| 37 | +- Training data actually used by the kept run: the single locally available `fineweb_train_000000.bin` shard |
| 38 | +- Primary metric: `final_int8_zlib_roundtrip_exact val_bpb` |
| 39 | +- Public promoted score: `1.50126339` |
| 40 | +- Public promoted loss: `2.53482035` |
| 41 | +- Artifact bytes: `57,941` code + `15,214,485` model = `15,272,426` total bytes |
| 42 | +- Wallclock: `1,207,089 ms` training + `124,606 ms` evaluation + about `5,764 ms` export / roundtrip overhead |
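
The export policy listed above (keep only `ssm_coeff`, `ssm_log_decay`, and `ssm_d` in `float16`, quantize everything else) can be pictured as a name-based partition of the model's tensors. The sketch below is illustrative only. It assumes the allowlist is a comma-separated set of substring patterns. The function name `split_by_export_policy` and the example tensor names are hypothetical and are not taken from `train_gpt.py`.

```python
# Hypothetical sketch of a name-pattern export allowlist.
# Assumption: each pattern is matched as a plain substring of the tensor name.
FLOAT16_PATTERNS = "ssm_coeff,ssm_log_decay,ssm_d"


def split_by_export_policy(param_names, patterns=FLOAT16_PATTERNS):
    """Partition tensor names into (kept in float16, sent to int8+zlib)."""
    keep = [p for p in patterns.split(",") if p]
    fp16, int8 = [], []
    for name in param_names:
        # A tensor stays in float16 if any allowlist pattern occurs in its name.
        if any(p in name for p in keep):
            fp16.append(name)
        else:
            int8.append(name)
    return fp16, int8


# Illustrative tensor names only; the real model's names may differ.
example_names = [
    "blocks.8.ssm.ssm_coeff",
    "blocks.8.ssm.ssm_log_decay",
    "blocks.8.ssm.ssm_d",
    "blocks.0.attn.qkv.weight",
    "tok_emb.weight",
]
fp16_names, int8_names = split_by_export_policy(example_names)
print("float16:", fp16_names)
print("int8+zlib:", int8_names)
```

With a narrow allowlist like this, only the small recurrent-core tensors bypass quantization, so most of the artifact budget is still governed by the standard int8+zlib path.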
133 | 43 |
|
134 | | -This matters for interpretation: |
| 44 | +At the time V8 was promoted, the strongest retained legal all-attention matched control was: |
135 | 45 |
|
136 | | -- the public lane is no longer being compared only against the older `730`-step `top2blocksfp16` baseline |
137 | | -- the kept hybrid now clears a much stronger legal all-attention control by `0.06531822` BPB |
138 | | -- the strongest legal control remained stable enough across three seeds that the hybrid still keeps a material edge on the refreshed package |
| 46 | +- Control: `full_baseline_1420steps_blackwell_seed2027_top1blockfp16_v7` |
| 47 | +- Control score: `1.56658161` |
| 48 | +- Control artifact bytes: `15,993,409` |
| 49 | +- Historical V8 margin vs that control: `-0.06531822` BPB |
139 | 50 |
|
140 | | -## Export Granularity Study |
| 51 | +That historical control comparison is retained for provenance. It is no longer the strongest known internal legal control after the later local campaign described below. |
141 | 52 |
|
142 | | -This continuation revisited recurrent export granularity on the stronger `AAAASASSS` branch before committing to more scale. |
| 53 | +## Latest Internal Local-Campaign Finding |
143 | 54 |
|
144 | | -At `1200` steps on `AAAASASSS`, `SSM_RANK=12`, `SSM_KERNEL_SIZE=96`, `SMEAR_ENABLED=1`: |
| 55 | +After V8, a longer local-only Blackwell campaign refreshed the legal all-attention frontier and blocked a new public promotion. |
145 | 56 |
|
146 | | -- narrow recurrent-core fp16 (`ssm_coeff,ssm_log_decay,ssm_d`): `1.59674695` BPB, `13,249,267` total bytes |
147 | | -- topmost full recurrent block fp16 (`ssm_coeff,ssm_log_decay,ssm_d,blocks.8.ssm.`): `1.59674681` BPB, `14,114,480` total bytes |
| 57 | +The strongest known internal legal all-attention control from that campaign is: |
148 | 58 |
|
149 | | -The broader policy was effectively neutral on score while costing `865,213` extra bytes. |
| 59 | +- Control: `full_baseline_2600steps_blackwell_seed2027_allint8_v9r3` |
| 60 | +- Layout: `AAAAAAAAA` |
| 61 | +- Score: `1.48142748` |
| 62 | +- Artifact bytes: `57,941` code + `15,319,767` model = `15,377,708` total bytes |
| 63 | +- Wallclock: `895,800 ms` training + `216,725 ms` evaluation + about `5,176 ms` export / roundtrip overhead |
150 | 64 |
|
151 | | -The kept export policy therefore remains the narrow recurrent-core allowlist. |
| 65 | +The best unpromoted hybrid candidate from that campaign was: |
152 | 66 |
|
153 | | -## SSM Headroom Study |
| 67 | +- Candidate: `full_anchor_s4d_aaaasasss_rank14_k96_corefp16_smear_2600steps_blackwell_seed4242_v9r3` |
| 68 | +- Layout: `AAAASASSS` |
| 69 | +- Score: `1.48097508` |
| 70 | +- Artifact bytes: `57,941` code + `15,471,811` model = `15,529,752` total bytes |
| 71 | +- Wallclock: `736,141 ms` training + `124,986 ms` evaluation + about `9,076 ms` export / roundtrip overhead |
154 | 72 |
|
155 | | -This continuation then spent the remaining legal headroom on the SSM side before adding any new public architectural claims. |
| 73 | +The candidate was legal and lower on raw BPB than the refreshed control, but only by `0.00045240` BPB. That is too small to satisfy the promotion rule requiring either a large matched-control advantage or a clearly documented control package proving that the hybrid still materially matters. |
156 | 74 |
|
157 | | -At `1200` steps on `AAAASASSS`: |
| 75 | +No new public promotion was made from that campaign. |
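
The margin arithmetic behind that decision can be reproduced directly from the scores quoted in this README. This is a minimal sketch using exact decimal arithmetic. The variable names are illustrative, and no numeric promotion threshold is encoded because the text does not state one.

```python
from decimal import Decimal

# Scores copied from this README (lower val_bpb is better).
v8_hybrid_bpb = Decimal("1.50126339")      # promoted public checkpoint
v7_control_bpb = Decimal("1.56658161")     # strongest legal control at V8 time
v9r3_hybrid_bpb = Decimal("1.48097508")    # best unpromoted hybrid candidate
v9r3_control_bpb = Decimal("1.48142748")   # refreshed all-attention control

# Hybrid-minus-control margins; a negative value favors the hybrid.
historical_margin = v8_hybrid_bpb - v7_control_bpb
refreshed_margin = v9r3_hybrid_bpb - v9r3_control_bpb

print("historical V8 margin:", historical_margin)
print("refreshed v9r3 margin:", refreshed_margin)
# How many times smaller the refreshed edge is than the historical one.
print("shrink factor:", historical_margin / refreshed_margin)
```

The refreshed edge is more than a hundred times smaller than the historical V8 margin, which is why the candidate is treated as unpromoted rather than as a new public result.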
158 | 76 |
|
159 | | -- `SSM_RANK=12`, `SSM_KERNEL_SIZE=96`: `1.59674695` BPB, `13,249,267` total bytes |
160 | | -- `SSM_RANK=12`, `SSM_KERNEL_SIZE=128`: `1.60650831` BPB, `13,247,332` total bytes |
161 | | -- `SSM_RANK=14`, `SSM_KERNEL_SIZE=96`: `1.59638095` BPB, `13,279,961` total bytes |
| 77 | +## Validity / Scope |
162 | 78 |
|
163 | | -The longer `128`-tap kernel regressed. A modest rank increase to `14` was slightly positive. |
| 79 | +Passed for the current public non-record checkpoint: |
164 | 80 |
|
165 | | -Scaling the stronger rank-14 point on the same lane produced: |
166 | | - |
167 | | -- `1800` steps, seed `2027`: `1.53097696` BPB, `14,765,396` total bytes |
168 | | -- `2000` steps, seed `2027`: `1.51685767` BPB, `15,051,906` total bytes |
169 | | -- `2200` steps, seed `2027`: `1.50465667` BPB, `15,260,268` total bytes |
170 | | -- `2200` steps, seed `4242`: `1.50126339` BPB, `15,272,426` total bytes |
171 | | - |
172 | | -The kept result therefore spends the remaining legal budget on recurrent capacity plus more same-lane Blackwell scale, not on more attention. |
173 | | - |
174 | | -## Fixed-Predictor Transfer Study |
175 | | - |
176 | | -This continuation kept the transfer study strictly fixed-predictor and one-at-a-time. |
177 | | - |
178 | | -Retained positive transfer from earlier in the lane: |
179 | | - |
180 | | -- `AAAASASSS`, `SSM_KERNEL_SIZE=96`, `1000` steps, no SmearGate: `1.62386862` BPB |
181 | | -- same branch + `SMEAR_ENABLED=1`: `1.61118819` BPB |
182 | | - |
183 | | -`SmearGate` improved score by `0.01268043` BPB and remains part of the kept branch family. |
184 | | - |
185 | | -Smear tuning check: |
186 | | - |
187 | | -- default `SMEAR_INIT=0.0`: `1.61118819` BPB |
188 | | -- lighter `SMEAR_INIT=-0.5`: `1.61534070` BPB |
189 | | - |
190 | | -The lighter smear init was negative. |
191 | | - |
192 | | -QK-gain tuning check on the stronger rank-14 branch: |
193 | | - |
194 | | -- `1800`-step rank-14 hybrid, default `QK_GAIN_INIT=1.5`: `1.53097696` BPB |
195 | | -- same branch + `QK_GAIN_INIT=1.7`: `1.53303405` BPB |
196 | | - |
197 | | -This QK-gain increase was negative on the stronger branch, so the kept configuration stays on the default attention-gain setting. |
198 | | - |
199 | | -Negative reference transfer retained from earlier in the lane: |
200 | | - |
201 | | -- `AAAASASSS`, `SSM_RANK=8`, `600` steps, no BigramHash: `1.71464907` BPB |
202 | | -- same branch + `BIGRAM_VOCAB_SIZE=1024`, `BIGRAM_DIM=64`: `1.71975209` BPB |
203 | | - |
204 | | -The small fixed-predictor BigramHash side path remains negative evidence in this lane. |
205 | | - |
206 | | -## Validity Notes |
207 | | - |
208 | | -Passed for this non-record folder: |
209 | | - |
210 | | -- Same scorer path for control and hybrid (`train_gpt.py`, `final_int8_zlib_roundtrip_exact`) |
211 | | -- Full official validation split, standard `val_bpb` |
212 | | -- Artifact byte audit under the decimal `16,000,000` byte cap for the kept promoted run |
| 81 | +- Same scorer path for control and hybrid: `train_gpt.py`, `final_int8_zlib_roundtrip_exact` |
| 82 | +- Full official validation split for the promoted public checkpoint |
| 83 | +- Artifact byte audit under the decimal `16,000,000` byte cap |
| 84 | +- All counted code for the artifact lives in `train_gpt.py` |
213 | 85 | - No validation-data training |
214 | | -- No evaluation-time downloads or external services |
215 | | -- Recurrent export policy explicitly accounted for separately from the attention/MLP export policy |
216 | | -- Kept run configuration explicitly recorded with `ENABLE_TORCH_COMPILE=0` |
217 | | -- Fixed-predictor labeling explicit; no eval-time adaptation or TTT |
218 | | -- Phase 0 expanded rerun package completed |
219 | | -- Phase 1 realism package completed with a bounded Modal H100 multi-shard continuation |
220 | | -- Phase 2 refreshed legal all-attention control frontier completed, including legal and illegal byte-boundary points |
221 | | -- Phase 3 fixed-predictor transfer study completed |
222 | | -- Phase 4 SSM-side headroom study completed |
223 | | -- Phase 6 official-lane feasibility was not triggered by this promotion |
224 | | - |
225 | | -Not claimed here: |
226 | | - |
227 | | -- full training set usage for the kept run |
228 | | -- official record-lane legality |
229 | | -- official-lane feasibility confirmation |
230 | | -- statistical significance for a record claim |
231 | | - |
232 | | -## Artifact Size |
233 | | - |
234 | | -- Code bytes: `57,941` |
235 | | -- Model bytes (`final_model.int8.ptz`): `15,214,485` |
236 | | -- Total bytes: `15,272,426` |
237 | | -- Remaining legal headroom: `727,574` |
238 | | - |
239 | | -## Wallclock Breakdown |
240 | | - |
241 | | -From the kept promoted run: |
242 | | - |
243 | | -- Training time: `1,207,089 ms` |
244 | | -- Evaluation time: `124,606 ms` |
245 | | -- Export / serialization / roundtrip overhead: about `5,764 ms` |
246 | | -- End-to-end run duration: `1,337.46 s` |
247 | | - |
248 | | -Strongest legal all-attention control: |
249 | | - |
250 | | -- Training time: `442,368 ms` |
251 | | -- Evaluation time: `222,490 ms` |
252 | | -- Export / serialization / roundtrip overhead: about `8,179 ms` |
253 | | -- End-to-end run duration: `673.04 s` |
254 | | - |
255 | | -Modal H100 realism continuation: |
256 | | - |
257 | | -- Training time: `88,650 ms` |
258 | | -- Evaluation time: `79,800 ms` |
259 | | -- End-to-end run duration: `264.98 s` |
| 86 | +- No evaluation-time downloads or hidden services |
| 87 | +- Fixed-predictor labeling remains explicit |
| 88 | +- No eval-time adaptation or TTT |
| 89 | +- Recurrent export policy is accounted for separately from the attention / MLP export policy |
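
The artifact byte audit in the list above is a simple sum checked against the decimal cap. The sketch below reruns it on the three artifacts whose code and model byte counts are quoted in this README. The helper name `audit_artifact` is hypothetical, and treating a total exactly equal to the cap as legal is an assumption, since the text only shows totals strictly below and strictly above it.

```python
BYTE_CAP = 16_000_000  # decimal cap, as stated in this README


def audit_artifact(code_bytes, model_bytes, cap=BYTE_CAP):
    """Return (total bytes, remaining headroom, legal?) for one artifact."""
    total = code_bytes + model_bytes
    # Assumption: a total exactly at the cap counts as legal.
    return total, cap - total, total <= cap


# (code bytes, model bytes) copied from this README.
artifacts = {
    "V8 promoted hybrid": (57_941, 15_214_485),
    "v9r3 all-attention control": (57_941, 15_319_767),
    "v9r3 hybrid candidate": (57_941, 15_471_811),
}

results = {name: audit_artifact(*sizes) for name, sizes in artifacts.items()}
for name, (total, headroom, legal) in results.items():
    print(f"{name}: total={total:,} headroom={headroom:,} legal={legal}")
```

All three artifacts fit under the cap on this check, so legality is not what blocked the v9r3 candidate.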
260 | 90 |
|
261 | | -## Exact Commands |
| 91 | +Main scope limits: |
262 | 92 |
|
263 | | -PowerShell command used for the kept promoted run from the research workspace: |
| 93 | +- The kept promoted result is still one-shard local: only `fineweb_train_000000.bin` was locally usable. |
| 94 | +- The local dataset manifest reports `195` train shards, so one-shard training remains the biggest realism bottleneck. |
| 95 | +- A bounded remote realism package was run earlier through Modal on an 80-shard cached view, but a cloud-credit-backed continuation is currently unavailable and is not part of any current promotion gate.
| 96 | +- No official-lane H100 feasibility result is claimed. |
264 | 97 |
|
265 | | -```powershell |
266 | | -$env:CUDA_VISIBLE_DEVICES='1' |
267 | | -$env:DATA_PATH='C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf\data\datasets\fineweb10B_sp1024' |
268 | | -$env:TOKENIZER_PATH='C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf\data\tokenizers\fineweb_1024_bpe.model' |
269 | | -$env:TRAIN_BATCH_TOKENS='32768' |
270 | | -$env:VAL_BATCH_SIZE='262144' |
271 | | -$env:TRAIN_SEQ_LEN='1024' |
272 | | -$env:ITERATIONS='2200' |
273 | | -$env:TRAIN_LOG_EVERY='20' |
274 | | -$env:VAL_LOSS_EVERY='0' |
275 | | -$env:MAX_WALLCLOCK_SECONDS='0' |
276 | | -$env:WARMUP_STEPS='0' |
277 | | -$env:ENABLE_TORCH_COMPILE='0' |
278 | | -$env:SDP_BACKEND='math' |
279 | | -$env:SAVE_RAW_MODEL='0' |
280 | | -$env:FINAL_PREQUANT_EVAL='0' |
281 | | -$env:BLOCK_LAYOUT='AAAASASSS' |
282 | | -$env:SSM_CORE='s4d' |
283 | | -$env:SSM_KERNEL_SIZE='96' |
284 | | -$env:SSM_RANK='14' |
285 | | -$env:PARALLEL_ATTN_BIAS_INIT='1.5' |
286 | | -$env:SMEAR_ENABLED='1' |
287 | | -$env:INT8_FORCE_FLOAT_NAME_PATTERNS='ssm_coeff,ssm_log_decay,ssm_d' |
288 | | -$env:SEED='4242' |
289 | | -$env:RUN_ID='full_anchor_s4d_aaaasasss_rank14_k96_corefp16_smear_2200steps_blackwell_seed4242' |
290 | | -C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf\.venv\Scripts\python.exe C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf-ssm-hybrid-research-scale\train_gpt.py |
291 | | -``` |
| 98 | +## Notes |
292 | 99 |
|
293 | | -PowerShell command used for the bounded Modal H100 realism continuation: |
| 100 | +This PR intentionally remains a conservative draft.
294 | 101 |
|
295 | | -```powershell |
296 | | -C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf\.venv\Scripts\python.exe C:\Users\GreQ\.codex_playground\OpenAIGolf\parameter-golf-ssm-hybrid-research-scale\experiments\state_space_hybrid\modal_phase1_probe.py --mode train --run-name modal_hybrid_aaaasasss_rank14_k96_smear_train080_400steps_seed4242_v8 --iterations 400 --seed 4242 |
297 | | -``` |
| 102 | +The lane remains interesting as a wishlist-aligned, non-record `state-space models` sign-of-life. However, the latest internal evidence suggests that the next public promotion likely needs one of three things: stronger realism, a more orthogonal state-space contribution, or a control package showing that the SSM tail matters by more than the `0.00045240` BPB refreshed-control margin found in the long local campaign.