`.swarm/STATUS.md`: 127 additions, 0 deletions
@@ -160,3 +160,130 @@ Then the user must still teardown the VM via nebius CLI after re-auth.
- Re-run cuda-fused on additional sm classes (sm_89 Ada / sm_90 H100 / sm_100 datacenter Blackwell) to confirm no arch regression in `CMAKE_CUDA_ARCHITECTURES=90a;90;89;86;80;100;120a`.
- `llama-bench` is not in the fused-target list; adding it would unblock `runtime_graph_smoke.sh --gen-check` against the fused install. Non-blocking; the cuda-verify-fused parity + e2e_loop_bench publish-gate pass cover the substance.
# H200-MONITOR-4 status — UPDATE 2026-05-13 03:36 UTC
## v4 RUN TERMINATED at step ~1241 due to driver's built-in 6h watchdog
### What happened
- 2026-05-13T03:33:34Z (6h after `run_remote` started polling): the driver hit `scripts/train_nebius.sh` line 439 cap: `if [ "$i" -gt 360 ]; then echo "ERROR: still running after 6h — bailing"; return 1; fi`.
- `run_remote` returned 1 → bash's `set -euo pipefail` aborted the `full` flow → `fetch` was SKIPPED → EXIT trap ran only `teardown`.
- The EXIT trap's `teardown` function (line 544) called `instance_id_by_name` → `nebius compute v1 instance list` → expired-auth hang → I had to kill the driver process (3652060) manually at 03:35:51Z.
- Note: the local launch.log showed step 1241 in the local tail — that's because my Ctrl-C-via-tmux send-keys at 03:34:48Z killed the python training between the 1003 eval finish and the next eval/save. Eval at step 1000 finished: **eval_loss=1.145 at epoch 0.104**.
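
The failure chain in this list can be reproduced in a few lines. This is a minimal sketch, not the real `train_nebius.sh`: `run_remote`, `fetch`, and `teardown` are stand-in stubs that only echo, and `full_flow` is a hypothetical wrapper that runs a miniature flow in a child bash. It illustrates that under `set -e` a failing `run_remote` skips `fetch` while the EXIT trap still runs, and that whatever the trap contains is the only cleanup that happens.

```shell
# Hypothetical helper: run a miniature "full" flow in a child bash.
# $1 is the command string installed as the EXIT trap.
full_flow() {
  bash -c '
    set -euo pipefail
    # Stand-in stubs; the real functions live in scripts/train_nebius.sh.
    run_remote() { echo "run_remote: still running after cap"; return 1; }
    fetch()      { echo "fetch: pulling checkpoints"; }
    teardown()   { echo "teardown: deleting VM"; }
    trap "$1" EXIT
    run_remote   # returns 1, so errexit aborts the script here
    fetch        # never reached
  ' _ "$1" || true
}

echo "--- trap runs teardown only (the v4 behaviour)"
full_flow 'teardown || true'

echo "--- trap runs fetch, then teardown"
full_flow 'fetch || true; teardown || true'
```

In the first call only the `teardown` stub prints after the failure; in the second, the `fetch` stub prints as well, because the trap itself calls it.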
### Artifacts on remote VM (89.169.122.196 still up)
### State of training quality at the artifact step
- ~1000 steps done out of 9615 (10.4% of 1 epoch).
- eval_loss 1.145: meaningful but very early. The gate's `format_ok ≥ 0.70` threshold cannot be evaluated until the SFT pipeline runs its eval gate (`run_pipeline.py` post-train eval). Because the driver's 6h cap hit before SFT completed, the eval gate never ran and **no gate_report.json was produced**.
- This is a **Case 2 outcome** per the agent brief (partial checkpoint, no gate_report).
### Decision: relaunch as v5 with `--max-steps` is NOT yet possible
- Reasons:
- Need nebius auth restored before any new VM can be provisioned.
- `train_local.py` has no `--max-steps` flag (only `--epochs` / `--max-samples`); patching `train_nebius.sh` to plumb max-steps requires source edits to `scripts/run_pipeline.py` too.
- Bigger structural problem: the driver hit 6h not 12h. The `train_nebius.sh` 6h cap is hardcoded at line 439. A successful 1500-step run within the cap needs either: shorter eval (saw 50min eval at step 500), smaller test set, or per-step rate above ~6 it/s.
- The recommended v5 plan is documented in the "v5 plan" section below.
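
The trade-off between the wall-clock cap, eval time, and step rate can be sanity-checked with integer arithmetic. A rough sketch only: `max_secs_per_step` is a hypothetical helper, and the inputs in the example calls (two mid-run evals of 50 minutes each) are illustrative assumptions, not measured v4 numbers.

```shell
# Hypothetical helper: largest whole number of seconds per step that still
# fits inside the cap once eval time is subtracted.
# Args: cap_hours  target_steps  eval_count  eval_minutes_each
max_secs_per_step() {
  cap_s=$(( $1 * 3600 ))
  eval_s=$(( $3 * $4 * 60 ))
  echo $(( (cap_s - eval_s) / $2 ))
}

# Assumed scenario: 6h cap, 1500 steps, two evals of 50 min each.
max_secs_per_step 6 1500 2 50

# Same run under a 12h cap.
max_secs_per_step 12 1500 2 50
```

Plugging in a measured seconds-per-step figure from `launch.log` would show directly whether a given cap is reachable.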
## Cleanup state
- Driver (3652060): killed at 03:35:51Z
- Watcher v4 (3652514): self-terminated at 22:28:02Z (false positive)
- Watcher v4b (3768788): killed at 03:36:11Z (would fail teardown)
- Manual rsync (4024662): in progress
- v4 launch.log: still on disk at `/tmp/q35-0_8b-v4-launch.log` for forensics
- VM `eliza-train-h200-0_8b-v4`: **STILL UP, NEEDS USER MANUAL TEARDOWN**
## v5 plan (post user re-auth)
1. Patch `train_nebius.sh` line 439 to honor an `ELIZA_REMOTE_RUN_TIMEOUT_H` env var (default 12 to match watcher).
2. Patch `scripts/run_pipeline.py` (or `train_local.py`) to honor `MAX_STEPS` env (the trainer.Trainer supports `max_steps` kwarg).
3. Reduce eval frequency: change `save_steps` from 500 → 1500 (so a single mid-run eval doesn't burn 50 min), or keep `save_steps=500` but reduce the test set size.
4. Relaunch v5 with `MAX_STEPS=1500 ELIZA_REMOTE_RUN_TIMEOUT_H=12 NEBIUS_VM_NAME=eliza-train-h200-0_8b-v5 bash scripts/train_nebius.sh full --registry-key qwen3.5-0.8b ...`
5. Create proper v5 watcher with SSH-based liveness (use `/tmp/nebius-finish-q35-0_8b-v4b.sh` as template).
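
Item 1 of this plan could look roughly like the following. A minimal sketch, assuming the driver polls about once per minute (which is what a 360-iteration cap for 6h implies). `poll_until_done` and `remote_done` are hypothetical names, and `POLL_INTERVAL_S` is an assumed knob that exists here only so the loop can be exercised quickly.

```shell
# Hypothetical stand-in for the driver's remote-completion check.
remote_done() { return 1; }   # pretend the remote job never finishes

poll_until_done() {
  # Cap in hours from the environment; defaults to 12 to match the watcher.
  timeout_h="${ELIZA_REMOTE_RUN_TIMEOUT_H:-12}"
  max_polls=$(( timeout_h * 60 ))   # one poll per minute
  i=0
  while ! remote_done; do
    i=$(( i + 1 ))
    if [ "$i" -gt "$max_polls" ]; then
      echo "ERROR: still running after ${timeout_h}h, bailing" >&2
      return 1
    fi
    sleep "${POLL_INTERVAL_S:-60}"
  done
}
```

With `ELIZA_REMOTE_RUN_TIMEOUT_H` unset the loop allows 720 polls; setting it to 6 reproduces the current 360-poll behaviour.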

I am DONE for this agent run; handing off to the next H200-MONITOR or to the user for nebius re-auth.
# H200-MONITOR-4 — FINAL UPDATE 2026-05-13 05:14 UTC
## Manual rsync fetch COMPLETED
- Both checkpoints fully local at `/home/shaw/milady/eliza/packages/training/checkpoints/eliza-1-0_8b-apollo-fullcorpus-h200-1778619044/`
- **step 490 → 500**: train loss 8.95 → 8.82, eval_loss=1.255 (eval ran 37 min)
- **step 990 → 1000**: train loss 7.22 → 7.06, eval_loss=1.145 (eval ran 15 min, eval cache warm)
- LR schedule: linear warmup from 1e-5, currently 9.86e-6 at step 1000
- grad_norm 83→100→125→144 (volatile, expected at very early epoch)
- Conclusion: the model is clearly learning, but eval_loss=1.145 at 10.4% of epoch 1 is too early to expect the `format_ok ≥ 0.70` quality gate to clear. The loss curve trajectory looks sane and matches the 0.6b reference.
## Gate eval: NOT RUN
- The pipeline's gate eval (`run_pipeline.py --eval-mode full`) only runs AFTER training completes. We hit the driver's 6h cap mid-training. No `gate_report.json` exists.
- Per Case 2 in the brief, this is a "partial checkpoint, no gate_report" outcome → iterate (not publish).
## v5 cannot start until USER re-auths nebius
### Steps for user
1. `~/.nebius/bin/nebius iam get-access-token` (opens browser, complete federation OAuth)
1. **Patch `scripts/train_nebius.sh` line 439**: raise the hardcoded 6h cap to `${ELIZA_REMOTE_RUN_TIMEOUT_H:-12}*60`. Without this, any retry will hit the same 6h wall.
2. **Patch the EXIT trap (line 582)**: change `teardown || true` to `fetch || true; teardown || true` so a 6h-cap bail still pulls partial checkpoints back before attempting nebius teardown. Right now `set -euo pipefail` causes `fetch` to be skipped after `run_remote` returns 1.
3. **Patch `instance_up()` in watcher scripts**: don't swallow nebius CLI failures as "no" (the v4 watcher bug). Use SSH-based liveness as primary, nebius CLI as confirmation only.
4. Consider plumbing `MAX_STEPS` from env → `run_pipeline.py` → `train_local.py` for budget-bound runs (1500 steps in 12h target).
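
Patch 3 can be sketched as a tri-state check. This is an assumption-heavy outline, not the watcher's actual code: the function name `instance_state`, the `SSH_PROBE` and `CLI_PROBE` override hooks, and the "absent" output convention are all hypothetical. The point it illustrates is that a failed or hung CLI call is reported as "unknown" rather than "down", so expired auth cannot be mistaken for a missing VM.

```shell
# Hypothetical tri-state liveness check: prints "up", "down", or "unknown".
# SSH_PROBE and CLI_PROBE are assumed hooks (command strings run via sh -c).
# CLI_PROBE must exit 0 and print exactly "absent" to confirm the VM is gone.
instance_state() {
  host="$1"
  ssh_probe="${SSH_PROBE:-ssh -o BatchMode=yes -o ConnectTimeout=10 $host true}"

  # Primary signal: if SSH answers, the VM is up regardless of CLI auth state.
  if timeout 20 sh -c "$ssh_probe" >/dev/null 2>&1; then
    echo up
    return 0
  fi

  # Confirmation only: bounded by a timeout so an auth hang cannot block us.
  if [ -n "${CLI_PROBE:-}" ] && cli_out=$(timeout 30 sh -c "$CLI_PROBE" 2>/dev/null); then
    if [ "$cli_out" = "absent" ]; then
      echo down
      return 0
    fi
  fi

  # CLI failed, hung, or was inconclusive: never report this as "down".
  echo unknown
}
```

A watcher built on this would stop only on "down" and keep retrying on "unknown".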
### v5 launch (after patches + auth)
```
NEBIUS_VM_NAME=eliza-train-h200-0_8b-v5 \
ELIZA_REMOTE_RUN_TIMEOUT_H=12 \
bash packages/training/scripts/train_nebius.sh full \
```