
Commit 26f64a9

Merge branch 'develop' of https://github.com/elizaOS/eliza into develop
2 parents ad5f5bd + 8958e3d commit 26f64a9

23 files changed

Lines changed: 725 additions & 41 deletions


.github/workflows/local-inference-matrix.yml

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -204,12 +204,12 @@ jobs:
204204
205205
- name: Install Hugging Face CLI
206206
run: |
207-
python3 -m pip install --user --upgrade "huggingface_hub[cli]"
208-
if ! command -v huggingface-cli >/dev/null 2>&1; then
207+
python3 -m pip install --user --upgrade "huggingface_hub[hf_transfer]"
208+
if ! command -v hf >/dev/null 2>&1; then
209209
export PATH="$HOME/.local/bin:$PATH"
210210
echo "$HOME/.local/bin" >> "$GITHUB_PATH"
211211
fi
212-
huggingface-cli --help >/dev/null
212+
hf --help >/dev/null
213213
214214
- name: Cache buun-llama-cpp checkout + build
215215
uses: actions/cache@v5
@@ -250,7 +250,7 @@ jobs:
250250
ls -lh "$MODEL_DIR/${SMOKE_MODEL_FILE}"
251251
exit 0
252252
fi
253-
huggingface-cli download "${SMOKE_MODEL_REPO}" "${SMOKE_MODEL_FILE}" \
253+
hf download "${SMOKE_MODEL_REPO}" "${SMOKE_MODEL_FILE}" \
254254
--local-dir "$MODEL_DIR" \
255255
--local-dir-use-symlinks False
256256
ls -lh "$MODEL_DIR/${SMOKE_MODEL_FILE}"

.swarm/STATUS.md

Lines changed: 127 additions & 0 deletions
@@ -160,3 +160,130 @@ Then the user must still teardown the VM via nebius CLI after re-auth.
160160
- Re-run cuda-fused on additional sm classes (sm_89 Ada / sm_90 H100 / sm_100 datacenter Blackwell) to confirm no arch regression in `CMAKE_CUDA_ARCHITECTURES=90a;90;89;86;80;100;120a`.
161161
- `llama-bench` is not in the fused-target list — adding it would unblock `runtime_graph_smoke.sh --gen-check` against the fused install. Non-blocking; the cuda-verify-fused parity + e2e_loop_bench publish-gate pass cover the substance.
162162

163+
164+
# H200-MONITOR-4 status — UPDATE 2026-05-13 03:36 UTC
165+
166+
## v4 RUN TERMINATED at step ~1241 due to driver's built-in 6h watchdog
167+
168+
### What happened
169+
- 2026-05-13T03:33:34Z (6h after `run_remote` started polling): the driver hit `scripts/train_nebius.sh` line 439 cap: `if [ "$i" -gt 360 ]; then echo "ERROR: still running after 6h — bailing"; return 1; fi`.
170+
- `run_remote` returned 1 → bash's `set -euo pipefail` aborted the `full` flow → `fetch` was SKIPPED → EXIT trap ran only `teardown`.
171+
- The EXIT trap's `teardown` function (line 544) called `instance_id_by_name` → `nebius compute v1 instance list` → expired-auth hang → I had to kill the driver process (3652060) manually at 03:35:51Z.
172+
- **Final remote step: 1003** (per remote `run_eliza-1-0_8b-apollo-fullcorpus-h200-1778619044.log`).
173+
- Note: the local launch.log showed step 1241 in the local tail — that's because my Ctrl-C-via-tmux send-keys at 03:34:48Z killed the python training between the 1003 eval finish and the next eval/save. Eval at step 1000 finished: **eval_loss=1.145 at epoch 0.104**.
174+
175+
### Artifacts on remote VM (89.169.122.196 still up)
176+
- `/opt/training/checkpoints/eliza-1-0_8b-apollo-fullcorpus-h200-1778619044/checkpoint-500/` (3.4 GB, eval done)
177+
- `/opt/training/checkpoints/eliza-1-0_8b-apollo-fullcorpus-h200-1778619044/checkpoint-1000/` (eval done, eval_loss=1.145)
178+
- No checkpoint-1500 (only got to step 1241ish before my Ctrl-C; would have needed step 1500).
179+
- README.md, environment.json present.
180+
181+
### Manual rsync fetch in progress
182+
- PID 4024662: `rsync -avhz ubuntu@89.169.122.196:/opt/training/checkpoints/eliza-1-0_8b-apollo-fullcorpus-h200-1778619044/` → local.
183+
- Network speed thrashing 300-1500 KB/s. At ~500 KB/s sustained, the two 3.4GB checkpoints = ~7 GB total → **~4h fetch time worst case**.
184+
- Log: `/tmp/q35-0_8b-v4-manual-fetch.log`
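The fetch-time estimate above can be checked with a small calculation. This is only a sketch: the function name `estimateTransferHours` is hypothetical, and decimal units (1 GB = 1e9 bytes, 1 KB = 1e3 bytes) are an assumption, since the note does not say which convention the rsync figures follow.

```typescript
// Hypothetical helper: wall-clock hours needed to move `totalGb` at a
// sustained rate of `rateKbPerSec`. Decimal units are an assumption.
function estimateTransferHours(totalGb: number, rateKbPerSec: number): number {
  if (rateKbPerSec <= 0) throw new RangeError("rate must be positive");
  const totalBytes = totalGb * 1e9;
  const bytesPerSec = rateKbPerSec * 1e3;
  return totalBytes / bytesPerSec / 3600;
}

// Figures taken from the note: ~7 GB total, 300-1500 KB/s observed,
// ~500 KB/s assumed as the sustained rate.
const sustainedHours = estimateTransferHours(7, 500);
const slowestHours = estimateTransferHours(7, 300);
const fastestHours = estimateTransferHours(7, 1500);

console.log(
  sustainedHours.toFixed(1), // 3.9
  slowestHours.toFixed(1), // 6.5
  fastestHours.toFixed(1), // 1.3
);
```

Under these assumptions the "~4h" figure corresponds to the 500 KB/s sustained rate; if throughput stayed at the 300 KB/s floor the fetch would take closer to 6.5 hours.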
185+
186+
### CRITICAL: VM is up and billing, nebius CLI auth still broken
187+
- VM `eliza-train-h200-0_8b-v4` (instance + boot disk) is still active. GPU idle (0%) since 03:34:48Z, but VM compute is billing at ~$3-4/h.
188+
- Driver's EXIT trap teardown hung indefinitely on expired-token `nebius compute v1 instance list`.
189+
- My fallback watcher (`/tmp/nebius-finish-q35-0_8b-v4b.sh`) cannot teardown either (same auth issue).
190+
- **USER ACTION REQUIRED**: complete nebius OAuth federation (browser at PID 3643529 from 13:42Z, expired URL — likely needs `nebius iam get-access-token` triggered fresh). Then run:
191+
```
192+
export NEBIUS_PROJECT_ID=project-e00kfz6cpr00q21z892vec
193+
cd /home/shaw/milady/eliza/packages/training
194+
NEBIUS_VM_NAME=eliza-train-h200-0_8b-v4 bash scripts/train_nebius.sh teardown
195+
```
196+
197+
### State of training quality at the artifact step
198+
- ~1000 steps done out of 9615 (10.4% of 1 epoch).
199+
- eval_loss 1.145 — meaningful but very early. Gate's format_ok ≥ 0.70 threshold cannot be evaluated without the SFT pipeline running its eval gate (`run_pipeline.py` post-train eval). Since we hit the driver's 6h cap before SFT completed and triggered the eval gate, **no gate_report.json was produced**.
200+
- This is a **Case 2 outcome** per the agent brief (partial checkpoint, no gate_report).
201+
202+
### Decision: relaunch as v5 with `--max-steps` is NOT yet possible
203+
- Reasons:
204+
- Need nebius auth restored before any new VM can be provisioned.
205+
- `train_local.py` has no `--max-steps` flag (only `--epochs` / `--max-samples`); patching `train_nebius.sh` to plumb max-steps requires source edits to `scripts/run_pipeline.py` too.
206+
- Bigger structural problem: the driver hit 6h not 12h. The `train_nebius.sh` 6h cap is hardcoded at line 439. A successful 1500-step run within the cap needs either: shorter eval (saw 50min eval at step 500), smaller test set, or per-step rate above ~6 it/s.
207+
- The recommended v5 plan is documented in the "v5 plan" section below.
208+
209+
## Cleanup state
210+
- Driver (3652060): killed at 03:35:51Z
211+
- Watcher v4 (3652514): self-terminated at 22:28:02Z (false positive)
212+
- Watcher v4b (3768788): killed at 03:36:11Z (would fail teardown)
213+
- Manual rsync (4024662): in progress
214+
- v4 launch.log: still on disk at `/tmp/q35-0_8b-v4-launch.log` for forensics
215+
- VM `eliza-train-h200-0_8b-v4`: **STILL UP, NEEDS USER MANUAL TEARDOWN**
216+
217+
## v5 plan (post user re-auth)
218+
1. Patch `train_nebius.sh` line 439 to honor an `ELIZA_REMOTE_RUN_TIMEOUT_H` env var (default 12 to match watcher).
219+
2. Patch `scripts/run_pipeline.py` (or `train_local.py`) to honor `MAX_STEPS` env (the trainer.Trainer supports `max_steps` kwarg).
220+
3. Reduce eval frequency: change `save_steps` from 500 → 1500 (so single mid-run eval doesn't burn 50 min). OR keep save_steps=500 but reduce test set size.
221+
4. Relaunch v5 with `MAX_STEPS=1500 ELIZA_REMOTE_RUN_TIMEOUT_H=12 NEBIUS_VM_NAME=eliza-train-h200-0_8b-v5 bash scripts/train_nebius.sh full --registry-key qwen3.5-0.8b ...`
222+
5. Create proper v5 watcher with SSH-based liveness (use `/tmp/nebius-finish-q35-0_8b-v4b.sh` as template).
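Step 1 of the plan replaces a hardcoded iteration cap with an hour budget read from the environment. The real patch would live in the bash driver; the TypeScript sketch below only illustrates the arithmetic. The helper name `maxPollIterations` is hypothetical, and the 60-second poll interval is an assumption inferred from the note that 360 iterations correspond to 6 hours.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: one poll per minute, inferred from "360 iterations = 6h".
const POLL_INTERVAL_SEC = 60;

// Hypothetical helper: turn ELIZA_REMOTE_RUN_TIMEOUT_H (hours) into the
// maximum number of poll iterations, defaulting to 12h to match the watcher.
function maxPollIterations(env: Env, defaultHours = 12): number {
  const raw = env.ELIZA_REMOTE_RUN_TIMEOUT_H?.trim();
  const hours = raw ? Number(raw) : defaultHours;
  if (!Number.isFinite(hours) || hours <= 0) {
    throw new RangeError(`invalid ELIZA_REMOTE_RUN_TIMEOUT_H: ${raw}`);
  }
  return Math.floor((hours * 3600) / POLL_INTERVAL_SEC);
}

console.log(maxPollIterations({})); // 720 (12h default)
console.log(maxPollIterations({ ELIZA_REMOTE_RUN_TIMEOUT_H: "6" })); // 360
```

Rejecting non-numeric or non-positive values up front avoids a silent zero-iteration cap, which would make the driver bail immediately.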
223+
224+
I am DONE for this agent run — handing off to next H200-MONITOR or to user for nebius re-auth.
225+
226+
227+
# H200-MONITOR-4 — FINAL UPDATE 2026-05-13 05:14 UTC
228+
229+
## Manual rsync fetch COMPLETED
230+
- Both checkpoints fully local at `/home/shaw/milady/eliza/packages/training/checkpoints/eliza-1-0_8b-apollo-fullcorpus-h200-1778619044/`
231+
- `checkpoint-500/`: 3.3 GB (model.safetensors 1.5GB + optimizer.pt 2.0GB + tokenizer/config)
232+
- `checkpoint-1000/`: 3.4 GB (same shape)
233+
- Total fetched: 7.14 GB at ~1.3 MB/s avg (1h35m wall)
234+
- rsync log: `/tmp/q35-0_8b-v4-manual-fetch.log` (PID 4024662 exited cleanly)
235+
236+
## Training loss curve from trainer_state.json
237+
- **step 490 → 500**: train loss 8.95 → 8.82, eval_loss=1.255 (eval ran 37 min)
238+
- **step 990 → 1000**: train loss 7.22 → 7.06, eval_loss=1.145 (eval ran 15 min, eval cache warm)
239+
- LR schedule: linear warmup from 1e-5, currently 9.86e-6 at step 1000
240+
- grad_norm 83→100→125→144 (volatile, expected at very early epoch)
241+
- Conclusion: the model is clearly learning, but eval_loss=1.145 at 10.4% of epoch 1 is too early to expect the `format_ok ≥ 0.70` quality gate to pass. The loss curve trajectory looks sane and matches the 0.6b reference.
242+
243+
## Gate eval: NOT RUN
244+
- The pipeline's gate eval (`run_pipeline.py --eval-mode full`) only runs AFTER training completes. We hit the driver's 6h cap mid-training. No `gate_report.json` exists.
245+
- Per Case 2 in the brief, this is a "partial checkpoint, no gate_report" outcome → iterate (not publish).
246+
247+
## v5 cannot start until USER re-auths nebius
248+
249+
### Steps for user
250+
1. `~/.nebius/bin/nebius iam get-access-token` (opens browser, complete federation OAuth)
251+
2. `~/.nebius/bin/nebius iam whoami` (verify)
252+
3. Teardown v4 VM:
253+
```
254+
export PATH="$HOME/.nebius/bin:$PATH"
255+
export NEBIUS_PROJECT_ID=project-e00kfz6cpr00q21z892vec
256+
cd /home/shaw/milady/eliza/packages/training
257+
NEBIUS_VM_NAME=eliza-train-h200-0_8b-v4 bash scripts/train_nebius.sh teardown
258+
```
259+
260+
### v5 patch required first (before relaunch)
261+
1. **Patch `scripts/train_nebius.sh` line 439** — raise the 6h hardcoded cap to `${ELIZA_REMOTE_RUN_TIMEOUT_H:-12}*60`. Without this, any retry will hit the same 6h wall.
262+
2. **Patch the EXIT trap (line 582)** — change `teardown || true` to `fetch || true; teardown || true` so a 6h-cap bail still pulls partial checkpoints back before attempting nebius teardown. Right now `set -euo pipefail` causes `fetch` to be skipped after `run_remote` returns 1.
263+
3. **Patch `instance_up()` in watcher scripts** — don't swallow nebius CLI failures as "no" (the v4 watcher bug). Use SSH-based liveness as primary, nebius CLI as confirmation only.
264+
4. Consider plumbing `MAX_STEPS` from env → `run_pipeline.py` → `train_local.py` for budget-bound runs (1500 steps in 12h target).
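The EXIT-trap change in item 2 amounts to running every cleanup step in order and tolerating individual failures, so a failed fetch never prevents the teardown attempt. The sketch below shows that pattern in TypeScript rather than bash; `runCleanup` and the `Step` type are hypothetical names, and the steps here are stand-ins, not the driver's real functions.

```typescript
type Step = { name: string; run: () => void };

// Run each cleanup step in order. A failing step is recorded rather than
// rethrown, mirroring the shell idiom `fetch || true; teardown || true`.
function runCleanup(steps: Step[]): string[] {
  const failed: string[] = [];
  for (const step of steps) {
    try {
      step.run();
    } catch {
      failed.push(step.name);
    }
  }
  return failed;
}

// Stand-in steps: the fetch fails, yet teardown is still attempted.
const executed: string[] = [];
const failedSteps = runCleanup([
  {
    name: "fetch",
    run: () => {
      executed.push("fetch");
      throw new Error("network down");
    },
  },
  { name: "teardown", run: () => executed.push("teardown") },
]);

console.log(executed, failedSteps);
```

Ordering matters: fetch comes first because teardown destroys the VM that holds the partial checkpoints.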
265+
266+
### v5 launch (after patches + auth)
267+
```
268+
NEBIUS_VM_NAME=eliza-train-h200-0_8b-v5 \
269+
ELIZA_REMOTE_RUN_TIMEOUT_H=12 \
270+
bash packages/training/scripts/train_nebius.sh full \
271+
--registry-key qwen3.5-0.8b \
272+
--run-name eliza-1-0_8b-apollo-fullcorpus-h200-v5-$(date +%s)
273+
```
274+
Then arm a fresh watcher copied from `/tmp/nebius-finish-q35-0_8b-v4b.sh` (SSH-based liveness).
275+
276+
## Sibling agents NOT TOUCHED
277+
- CUDA-FINISH-3's failed cuda-fused build (no retry from me).
278+
- ACTION-PERSONALITY-BENCH's local llama-server (3712834) — left running.
279+
280+
## Files for next agent
281+
- `/home/shaw/milady/eliza/.swarm/STATUS.md` — this file (state of affairs)
282+
- `/tmp/URGENT-NEBIUS-TEARDOWN-NEEDED.md` — user-facing teardown instructions
283+
- `/tmp/q35-0_8b-v4-launch.log` — full v4 driver log
284+
- `/tmp/q35-0_8b-v4-watcher.log` — original (broken) v4 watcher log
285+
- `/tmp/q35-0_8b-v4b-watcher.log` — fallback watcher log
286+
- `/tmp/q35-0_8b-v4-manual-fetch.log` — manual rsync log
287+
- `/tmp/nebius-finish-q35-0_8b-v4b.sh` — SSH-based watcher template for v5
288+
- `packages/training/checkpoints/eliza-1-0_8b-apollo-fullcorpus-h200-1778619044/` — both partial checkpoints (500 + 1000) with full trainer state
289+

packages/app-core/deploy/Dockerfile.cloud

Lines changed: 23 additions & 8 deletions
@@ -24,6 +24,8 @@ WORKDIR /src
2424

2525
# Copy only the manifests Bun needs to resolve the production dependency graph.
2626
COPY package.json bun.lock ./
27+
# Root-level patches (e.g. @capacitor/barcode-scanner) referenced by package.json patchedDependencies.
28+
COPY patches ./patches
2729
COPY packages/app-core/patches ./packages/app-core/patches
2830
COPY packages/core/package.json ./packages/core/package.json
2931
COPY packages/agent/package.json ./packages/agent/package.json
@@ -42,17 +44,24 @@ ENV NODE_LLAMA_CPP_SKIP_DOWNLOAD=true
4244
# Materialize the standalone cloud-agent template at the Docker deploy
4345
# boundary. Source manifests stay workspace-local; the image install uses the
4446
# local package versions as exact npm dependencies.
45-
RUN node <<'EOF'
47+
# Note: bun's node shim does not support reading scripts from stdin (heredoc),
48+
# so we write to a temp file and execute with bun instead of `node <<'EOF'`.
49+
RUN <<'SCRIPT'
50+
set -e
51+
cat > /tmp/prune-workspace.mjs << 'JS'
4652
const fs = require("fs");
4753

4854
const KEEP = [
55+
"packages/core",
4956
"packages/agent",
5057
"packages/app-core",
5158
"packages/shared",
5259
"packages/ui",
5360
"packages/app-core/platforms/electrobun",
5461
"packages/app-core/deploy/cloud-agent-template",
5562
"packages/workflows",
63+
"plugins/plugin-sql",
64+
"plugins/plugin-elizacloud",
5665
"plugins/plugin-workflow",
5766
"packages/app",
5867
];
@@ -117,7 +126,7 @@ function materializeWorkspaceDeps(path, dependencyNames, versions) {
117126
const spec = pkg.dependencies?.[name];
118127
if (typeof spec !== "string" || !spec.startsWith("workspace:")) continue;
119128
const version = versions.get(name);
120-
if (!version) throw new Error(`No local package version found for ${name}`);
129+
if (!version) throw new Error("No local package version found for " + name);
121130
pkg.dependencies[name] = version;
122131
changed = true;
123132
}
@@ -131,10 +140,11 @@ function prune(path) {
131140
const deps = pkg[section];
132141
if (!deps) continue;
133142
for (const [name, spec] of Object.entries(deps)) {
134-
const isWorkspace =
135-
typeof spec === "string" && spec.startsWith("workspace:");
143+
const isLocal =
144+
typeof spec === "string" &&
145+
(spec.startsWith("workspace:") || spec.startsWith("file:"));
136146
const isStripName = STRIP_NAMES.test(name);
137-
if ((isWorkspace && !ALLOW.has(name)) || isStripName) {
147+
if ((isLocal && !ALLOW.has(name)) || isStripName) {
138148
delete deps[name];
139149
}
140150
}
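The hunk above widens the prune rule from `workspace:` specs to any local spec, including `file:`. A minimal standalone sketch of that rule follows. `pruneDeps` is a hypothetical name, the package names in the example are invented for illustration, and unlike the script in the diff it returns a new map instead of deleting in place.

```typescript
type DepMap = Record<string, string>;

// Drop a dependency when it points at a local source (workspace: or file:)
// and is not allow-listed, or when its name matches the strip pattern.
// Use a non-global RegExp so `.test` carries no state between calls.
function pruneDeps(deps: DepMap, allow: Set<string>, stripNames: RegExp): DepMap {
  const kept: DepMap = {};
  for (const [name, spec] of Object.entries(deps)) {
    const isLocal = spec.startsWith("workspace:") || spec.startsWith("file:");
    const isStripName = stripNames.test(name);
    if ((isLocal && !allow.has(name)) || isStripName) continue;
    kept[name] = spec;
  }
  return kept;
}

// Invented example data, for illustration only.
const pruned = pruneDeps(
  {
    "@example/core": "workspace:*", // local and allow-listed: kept
    "@example/local-tool": "file:../tool", // local, not allow-listed: dropped
    "@example/native-addon": "^1.0.0", // matches the strip pattern: dropped
    "left-pad": "^1.3.0", // ordinary registry dependency: kept
  },
  new Set(["@example/core"]),
  /native-addon$/,
);

console.log(Object.keys(pruned));
```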
@@ -162,7 +172,10 @@ writeJson("package.json", root);
162172
"packages/app-core/deploy/cloud-agent-template/package.json",
163173
].forEach(prune);
164174
if (fs.existsSync("bun.lock")) fs.unlinkSync("bun.lock");
165-
EOF
175+
JS
176+
bun /tmp/prune-workspace.mjs
177+
rm /tmp/prune-workspace.mjs
178+
SCRIPT
166179
# Drop --frozen-lockfile: bun.lock has been removed above so bun regenerates
167180
# fresh against the pruned workspace set.
168181
RUN bun install --ignore-scripts --no-progress
@@ -174,6 +187,7 @@ RUN npm install --prefix /opt/tsx --ignore-scripts --no-save tsx@4.21.0 \
174187
FROM node:${NODE_VERSION}-slim AS prep
175188
WORKDIR /src
176189
ARG VERSION_CLEAN
190+
ARG APP_ENTRYPOINT=""
177191
# Re-declare so it's in scope for this build stage.
178192
ARG VITE_ASSET_BASE_URL=""
179193
ENV VITE_ASSET_BASE_URL=${VITE_ASSET_BASE_URL}
@@ -185,7 +199,7 @@ COPY packages/app-core ./packages/app-core
185199
COPY packages/shared ./packages/shared
186200
COPY packages/workflows ./packages/workflows
187201
COPY plugins/plugin-workflow ./plugins/plugin-workflow
188-
COPY ${APP_ENTRYPOINT} package.json plugins.json ./
202+
COPY package.json plugins.json ./
189203
COPY packages/app-core/scripts/docker-entrypoint.sh ./scripts/docker-entrypoint.sh
190204

191205
RUN set -eux; \
@@ -204,7 +218,8 @@ RUN set -eux; \
204218
done; \
205219
cp -a packages/app/dist /runtime/packages/app/dist; \
206220
cp -a packages/app-core/scripts/docker-entrypoint.sh /runtime/scripts/docker-entrypoint.sh; \
207-
cp -a "${APP_ENTRYPOINT}" package.json plugins.json /runtime/; \
221+
cp -a package.json plugins.json /runtime/; \
222+
if [ -n "${APP_ENTRYPOINT}" ]; then cp -a "${APP_ENTRYPOINT}" /runtime/; fi; \
208223
sed -i 's/const stats = service.getCatalogStats();/const stats = typeof service.getCatalogStats === "function" ? service.getCatalogStats() : { loaded: 0, total: 0, storageType: "unknown" };/g' /runtime/node_modules/@elizaos/plugin-agent-skills/dist/index.js 2>/dev/null || true; \
209224
if [ -n "${VERSION_CLEAN}" ]; then \
210225
node -e "const fs=require('fs'); const v=process.argv[1]; const p='/runtime/node_modules/@elizaos/agent/package.json'; try { const pkg=JSON.parse(fs.readFileSync(p,'utf8')); pkg.version=v; fs.writeFileSync(p, JSON.stringify(pkg, null, 2)); console.log('Patched autonomous to', v); } catch (err) { console.log('skip:', err.message); }" "${VERSION_CLEAN}"; \
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
1+
// Node.js ESM can't resolve extensionless directory imports without an index
2+
// file. Re-export everything from the parent module; rewriteRelativeImportExtensions
3+
// rewrites the `.ts` suffix to `.js` in the compiled output so the runtime finds it.
4+
export * from "../structured-output.ts";

packages/app-core/vitest.node-sqlite.config.ts

Lines changed: 9 additions & 4 deletions
@@ -7,15 +7,20 @@
77
* CI usage (test.yml — "Run node:sqlite tests" step):
88
* node node_modules/vitest/vitest.mjs run --config vitest.node-sqlite.config.ts
99
*/
10-
import { mergeConfig } from "vitest/config";
1110
import base from "./vitest.config.ts";
1211

13-
export default mergeConfig(base, {
12+
// mergeConfig concatenates arrays (include, exclude) rather than replacing
13+
// them, so the training-benchmarks.test.ts file would end up in both include
14+
// and exclude — and exclude wins, leaving no test files. Use object spread
15+
// to replace the arrays instead of appending to them.
16+
export default {
17+
...base,
1418
test: {
19+
...base.test,
1520
include: ["src/api/training-benchmarks.test.ts"],
16-
exclude: (base.test?.exclude ?? []).filter(
21+
exclude: ((base.test?.exclude ?? []) as unknown[]).filter(
1722
(p: unknown) =>
1823
typeof p !== "string" || !p.includes("training-benchmarks"),
1924
),
2025
},
21-
});
26+
};
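The comment in this change explains why concatenating arrays empties the test set while object spread does not. The sketch below reproduces the effect with a simplified stand-in, `concatMerge`, which is not vitest's `mergeConfig` and only mimics its array-concatenation behavior. The base config values are assumed for illustration.

```typescript
type TestConfig = { include: string[]; exclude: string[] };

// Simplified stand-in for a merge that appends arrays instead of replacing them.
function concatMerge(base: TestConfig, override: TestConfig): TestConfig {
  return {
    include: [...base.include, ...override.include],
    exclude: [...base.exclude, ...override.exclude],
  };
}

const target = "src/api/training-benchmarks.test.ts";

// Assumed base config: the benchmark test is excluded by default.
const baseTest: TestConfig = {
  include: ["src/**/*.test.ts"],
  exclude: [target],
};

const overrideTest: TestConfig = {
  include: [target],
  exclude: baseTest.exclude.filter((p) => !p.includes("training-benchmarks")),
};

// Concatenation keeps the base exclude entry, so the target stays excluded.
const merged = concatMerge(baseTest, overrideTest);
// Spread replaces both arrays, so only the filtered exclude list survives.
const spread: TestConfig = { ...baseTest, ...overrideTest };

console.log(merged.exclude.includes(target), spread.exclude.includes(target));
```

With concatenation the filtered `exclude` is appended to the unfiltered base list, so the filter has no effect; spread makes the override authoritative.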

packages/core/src/utils/state-dir.test.ts

Lines changed: 7 additions & 2 deletions
@@ -64,13 +64,18 @@ describe("resolveStateDir", () => {
6464
expect(resolveStateDir({}, fakeHomedir)).toBe(join(FAKE_HOME, ".eliza"));
6565
});
6666

67-
it("treats whitespace-only env values as unset", () => {
67+
it("treats whitespace-only env values as unset, falling through to aliases", () => {
68+
// ELIZA_STATE_DIR is whitespace-only (treated as unset) → alias consulted.
6869
expect(
6970
resolveStateDir(
7071
{ ELIZA_STATE_DIR: " ", MILADY_STATE_DIR: "/tmp/bar" },
7172
fakeHomedir,
7273
),
73-
).toBe(join(FAKE_HOME, ".eliza"));
74+
).toBe("/tmp/bar");
75+
// When neither canonical nor alias is set the default applies.
76+
expect(resolveStateDir({ ELIZA_STATE_DIR: " " }, fakeHomedir)).toBe(
77+
join(FAKE_HOME, ".eliza"),
78+
);
7479
});
7580

7681
it("expands a leading ~ in env overrides via the real homedir", () => {
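The updated test encodes a fall-through rule: a whitespace-only canonical variable counts as unset, so the alias is consulted before the default. A minimal sketch of that rule is shown below. `resolveStateDirSketch` and `firstNonBlank` are hypothetical names, not the real `resolveStateDir`; the sketch skips `~` expansion and joins paths with a plain `/`.

```typescript
type EnvVars = Record<string, string | undefined>;

// Return the first variable whose trimmed value is non-empty.
function firstNonBlank(env: EnvVars, keys: string[]): string | undefined {
  for (const key of keys) {
    const value = env[key]?.trim();
    if (value) return value;
  }
  return undefined;
}

// Canonical variable first, then the alias, then the default under the home dir.
function resolveStateDirSketch(env: EnvVars, homedir: () => string): string {
  return (
    firstNonBlank(env, ["ELIZA_STATE_DIR", "MILADY_STATE_DIR"]) ??
    `${homedir()}/.eliza`
  );
}

const fakeHome = () => "/home/fake";

console.log(
  resolveStateDirSketch({ ELIZA_STATE_DIR: " ", MILADY_STATE_DIR: "/tmp/bar" }, fakeHome),
); // /tmp/bar
console.log(resolveStateDirSketch({ ELIZA_STATE_DIR: " " }, fakeHome)); // /home/fake/.eliza
```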

packages/shared/src/contracts/__tests__/schema-vs-type-drift.test.ts

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
1515
* Adding a new schema: import both sides + assert membership equality.
1616
*/
1717

18-
import { describe, expect, it } from "bun:test";
18+
import { describe, expect, it } from "vitest";
1919
import {
2020
ConversationAutomationTypeSchema,
2121
ConversationScopeSchema,

packages/shared/src/contracts/conversation-routes.test.ts

Lines changed: 3 additions & 3 deletions
@@ -16,16 +16,16 @@ describe("PostConversationRequestSchema", () => {
1616
title: "Hello",
1717
includeGreeting: true,
1818
lang: "en",
19-
metadata: { scope: "user", taskId: "t1" },
19+
metadata: { scope: "general", taskId: "t1" },
2020
});
2121
expect(parsed.title).toBe("Hello");
22-
expect(parsed.metadata?.scope).toBe("user");
22+
expect(parsed.metadata?.scope).toBe("general");
2323
});
2424

2525
it("rejects unknown metadata field", () => {
2626
expect(() =>
2727
PostConversationRequestSchema.parse({
28-
metadata: { scope: "user", custom: 1 },
28+
metadata: { scope: "general", custom: 1 },
2929
}),
3030
).toThrow();
3131
});

plugins.json

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
{ "plugins": [] }
