
Commit 44a1aa3

lalalune and claude committed
chore(swarm): H200-MONITOR-4 final report — v4 fetch complete, v5 plan ready
Manual rsync fetch of both partial checkpoints (500/1000, 7.14 GB total) completed successfully. trainer_state confirms loss is descending sanely (8.82 → 7.06 train, 1.255 → 1.145 eval over 500 steps). The driver's hardcoded 6h watchdog hit before the SFT eval gate ran, so no gate_report.json was produced; this is a Case 2 outcome (iterate, not publish). Detailed v5 patch plan written for the next agent. VM teardown still blocked on user nebius re-auth (federation token expired 22:17:56Z, no non-interactive refresh path available).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 279500e commit 44a1aa3

1 file changed

Lines changed: 64 additions & 0 deletions

.swarm/STATUS.md

@@ -223,3 +223,67 @@ Then the user must still teardown the VM via nebius CLI after re-auth.
I am DONE for this agent run; handing off to next H200-MONITOR or to user for nebius re-auth.

# H200-MONITOR-4: FINAL UPDATE 2026-05-13 05:14 UTC

## Manual rsync fetch COMPLETED
- Both checkpoints fully local at `/home/shaw/milady/eliza/packages/training/checkpoints/eliza-1-0_8b-apollo-fullcorpus-h200-1778619044/`
- `checkpoint-500/`: 3.3 GB (model.safetensors 1.5 GB + optimizer.pt 2.0 GB + tokenizer/config)
- `checkpoint-1000/`: 3.4 GB (same shape)
- Total fetched: 7.14 GB at ~1.3 MB/s avg (1h35m wall time)
- rsync log: `/tmp/q35-0_8b-v4-manual-fetch.log` (PID 4024662 exited cleanly)

## Training loss curve from trainer_state.json
- **step 490 → 500**: train loss 8.95 → 8.82, eval_loss=1.255 (eval ran 37 min)
- **step 990 → 1000**: train loss 7.22 → 7.06, eval_loss=1.145 (eval ran 15 min, eval cache warm)
- LR schedule: linear with warmup, peak 1e-5; 9.86e-6 at step 1000 (below peak, so already in the decay phase)
- grad_norm: 83 → 100 → 125 → 144 (volatile, expected this early in the epoch)
- Conclusion: the model is clearly learning, but at 10.4% of epoch 1 (eval_loss=1.145) it is too early to expect the `format_ok ≥ 0.70` quality gate to clear. The loss trajectory looks sane and matches the 0.6b reference.
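
The figures above can be sanity-checked with a little arithmetic. This is a minimal sketch, assuming `awk` is available; the helper names `pct_drop` and `steps_per_epoch` are made up for illustration and are not part of the training scripts.

```shell
# Hypothetical helpers for sanity-checking the numbers quoted above.

# Percentage drop from $1 to $2, to one decimal place.
pct_drop() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / a * 100 }'
}

# Implied steps per epoch, given a step count ($1) and the fraction of
# the epoch those steps cover ($2). Truncated to a whole number.
steps_per_epoch() {
  awk -v s="$1" -v f="$2" 'BEGIN { printf "%d\n", s / f }'
}

echo "train loss drop, step 500 -> 1000: $(pct_drop 8.82 7.06)%"
echo "eval loss drop,  step 500 -> 1000: $(pct_drop 1.255 1.145)%"
echo "implied steps per epoch: $(steps_per_epoch 1000 0.104)"
```

With the reported values this gives a 20.0% drop in train loss, an 8.8% drop in eval loss, and roughly 9615 steps per epoch.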

## Gate eval: NOT RUN
- The pipeline's gate eval (`run_pipeline.py --eval-mode full`) only runs AFTER training completes. We hit the driver's 6h cap mid-training. No `gate_report.json` exists.
- Per Case 2 in the brief, this is a "partial checkpoint, no gate_report" outcome → iterate (not publish).
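
The triage rule above could be automated along these lines. This is only a sketch: `classify_run` and its output labels are hypothetical, and it assumes `gate_report.json` would be written into the same run directory as the `checkpoint-*` folders, which the notes do not confirm. Only the "partial checkpoint, no gate_report → iterate" branch comes from the brief.

```shell
# Hypothetical triage helper. Only the "iterate" branch reflects Case 2
# as described above; the other labels are placeholders.
classify_run() {
  run_dir="$1"
  if [ -f "$run_dir/gate_report.json" ]; then
    echo "gate-report-present"   # the gate ran: read the report before deciding
    return 0
  fi
  for d in "$run_dir"/checkpoint-*; do
    if [ -d "$d" ]; then
      echo "iterate"             # partial checkpoints, no gate report (Case 2)
      return 0
    fi
  done
  echo "no-artifacts"            # nothing fetched at all
}
```

For the v4 run, calling `classify_run` on the fetched checkpoint directory would be expected to print `iterate`.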

## v5 cannot start until USER re-auths nebius

### Steps for user
1. `~/.nebius/bin/nebius iam get-access-token` (opens browser, complete federation OAuth)
2. `~/.nebius/bin/nebius iam whoami` (verify)
3. Teardown v4 VM:
   ```
   export PATH="$HOME/.nebius/bin:$PATH"
   export NEBIUS_PROJECT_ID=project-e00kfz6cpr00q21z892vec
   cd /home/shaw/milady/eliza/packages/training
   NEBIUS_VM_NAME=eliza-train-h200-0_8b-v4 bash scripts/train_nebius.sh teardown
   ```
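
Steps 2 and 3 can be chained so that teardown is only attempted once the auth check passes. A minimal sketch, assuming `nebius iam whoami` exits non-zero when the token is expired (not verified here); `NEBIUS_BIN`, `nebius_authed`, and `guarded_teardown` are illustrative names, and the teardown call is expected to be run from `packages/training` as in step 3.

```shell
# NEBIUS_BIN is an illustrative variable; the default path matches the steps above.
NEBIUS_BIN="${NEBIUS_BIN:-$HOME/.nebius/bin/nebius}"

# Succeeds only if the CLI can identify the current user.
# Assumption: an expired token makes `iam whoami` exit non-zero.
nebius_authed() {
  "$NEBIUS_BIN" iam whoami > /dev/null 2>&1
}

# Refuse to start teardown while unauthenticated, so the failure is
# reported up front instead of partway through the teardown script.
guarded_teardown() {
  if ! nebius_authed; then
    echo "nebius auth check failed; complete the re-auth steps first" >&2
    return 1
  fi
  NEBIUS_VM_NAME="${NEBIUS_VM_NAME:-eliza-train-h200-0_8b-v4}" \
    bash scripts/train_nebius.sh teardown
}
```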

### v5 patch required first (before relaunch)
1. **Patch `scripts/train_nebius.sh` line 439**: raise the 6h hardcoded cap to `${ELIZA_REMOTE_RUN_TIMEOUT_H:-12}*60`. Without this, any retry will hit the same 6h wall.
2. **Patch the EXIT trap (line 582)**: change `teardown || true` to `fetch || true; teardown || true` so a 6h-cap bail still pulls partial checkpoints back before attempting nebius teardown. Right now `set -euo pipefail` causes `fetch` to be skipped after `run_remote` returns 1.
3. **Patch `instance_up()` in watcher scripts**: don't swallow nebius CLI failures as "no" (the v4 watcher bug). Use SSH-based liveness as primary, nebius CLI as confirmation only.
4. Consider plumbing `MAX_STEPS` from env → `run_pipeline.py` → `train_local.py` for budget-bound runs (1500 steps in 12h target).
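
Patches 1 and 2 can be illustrated in isolation. The sketch below is not the code in `train_nebius.sh`: `fetch`, `teardown`, and `run_remote` are stubs, `remote_timeout_min` and `driver_flow` are hypothetical names, the cap is assumed to be expressed in minutes, and the `set -e` bail is simulated with an explicit `exit` so the example behaves the same however it is invoked.

```shell
# Patch 1: cap in minutes, overridable from the environment (default 12h).
remote_timeout_min() {
  echo $(( ${ELIZA_REMOTE_RUN_TIMEOUT_H:-12} * 60 ))
}

# Stubs standing in for the real driver functions (hypothetical bodies).
fetch()      { echo "fetch"; }
teardown()   { echo "teardown"; }
run_remote() { return 1; }   # simulate the watchdog cap firing mid-training

# Patch 2: the EXIT trap fetches first, then tears down. Each step is
# followed by `|| true` so a failed fetch cannot block the teardown.
# The body runs in a subshell so the simulated bail stays contained.
driver_flow() (
  trap 'fetch || true; teardown || true' EXIT
  run_remote || exit "$?"   # stands in for the `set -e` bail
  fetch                     # normal-path fetch; never reached after a bail
)
```

Running `driver_flow` with these stubs prints `fetch` and then `teardown`, even though the normal-path `fetch` is skipped. `remote_timeout_min` prints 720 by default and 360 with `ELIZA_REMOTE_RUN_TIMEOUT_H=6`.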

### v5 launch (after patches + auth)
```
NEBIUS_VM_NAME=eliza-train-h200-0_8b-v5 \
ELIZA_REMOTE_RUN_TIMEOUT_H=12 \
bash packages/training/scripts/train_nebius.sh full \
  --registry-key qwen3.5-0.8b \
  --run-name eliza-1-0_8b-apollo-fullcorpus-h200-v5-$(date +%s)
```
Then arm a fresh watcher copied from `/tmp/nebius-finish-q35-0_8b-v4b.sh` (SSH-based liveness).
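
The SSH-first liveness idea can be sketched as follows. This does not reproduce the v4b watcher, whose contents are not shown here. `ssh_alive` uses standard OpenSSH options; `cli_state` is a placeholder with an assumed exit-code contract (0 = running, 1 = not running, anything else = CLI or auth failure) that would need to be backed by a real nebius CLI query.

```shell
# Primary probe: can a non-interactive SSH session be opened to host $1?
ssh_alive() {
  ssh -o BatchMode=yes -o ConnectTimeout=10 "$1" true > /dev/null 2>&1
}

# Confirmation probe: placeholder for a nebius CLI lookup (hypothetical).
# Assumed contract: 0 = running, 1 = not running, 2 = CLI/auth failure.
cli_state() {
  return 2   # models the expired-token case until a real query is wired in
}

# Prints yes / no / unknown. A CLI failure is never reported as "no".
instance_up() {
  if ssh_alive "$1"; then
    echo "yes"
    return 0
  fi
  rc=0
  cli_state "$1" || rc=$?
  case "$rc" in
    0) echo "yes" ;;      # VM exists even though SSH is unreachable
    1) echo "no" ;;       # both probes agree the VM is gone
    *) echo "unknown" ;;  # CLI failed: do not conclude the VM is down
  esac
}
```

The three-valued result lets a watcher keep polling on `unknown` instead of treating an expired token as proof that the VM is gone.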

## Sibling agents NOT TOUCHED
- CUDA-FINISH-3's failed cuda-fused build (no retry from me).
- ACTION-PERSONALITY-BENCH's local llama-server (3712834): left running.

## Files for next agent
- `/home/shaw/milady/eliza/.swarm/STATUS.md`: this file (state of affairs)
- `/tmp/URGENT-NEBIUS-TEARDOWN-NEEDED.md`: user-facing teardown instructions
- `/tmp/q35-0_8b-v4-launch.log`: full v4 driver log
- `/tmp/q35-0_8b-v4-watcher.log`: original (broken) v4 watcher log
- `/tmp/q35-0_8b-v4b-watcher.log`: fallback watcher log
- `/tmp/q35-0_8b-v4-manual-fetch.log`: manual rsync log
- `/tmp/nebius-finish-q35-0_8b-v4b.sh`: SSH-based watcher template for v5
- `packages/training/checkpoints/eliza-1-0_8b-apollo-fullcorpus-h200-1778619044/`: both partial checkpoints (500 + 1000) with full trainer state
