elizaOS
diff --git a/docs/audits/lifeops-2026-05-11/INDEX.md b/docs/audits/lifeops-2026-05-11/INDEX.md
Lines changed: 3 additions & 3 deletions
diff --git a/docs/audits/lifeops-2026-05-11/REPORT.md b/docs/audits/lifeops-2026-05-11/REPORT.md
Lines changed: 8 additions & 4 deletions
diff --git a/docs/audits/lifeops-2026-05-11/RUNBOOK.md b/docs/audits/lifeops-2026-05-11/RUNBOOK.md
Lines changed: 202 additions & 0 deletions
diff --git a/docs/audits/lifeops-2026-05-11/app-lifeops-typecheck-cleanup.md b/docs/audits/lifeops-2026-05-11/app-lifeops-typecheck-cleanup.md
Lines changed: 67 additions & 0 deletions
@@ -87,15 +87,15 @@ Unified metrics, prompt optimization, multi-tier model e2e, native DSPy-style op
 
 ### Wave 5 — verify + close gaps
 - [~] W5-A full multi-tier run (small / mid / large / frontier) — concurrent
+  - [`wave-5a-gap-list.md`](./wave-5a-gap-list.md) — post-rebuild gap inventory
 - [~] W5-B delta vs. baseline, optimizer improvement >= 20pp — concurrent
 - [x] W5-C final REPORT.md + INDEX.md close-out
   - `docs/audits/lifeops-2026-05-11/REPORT.md` (this commit)
 
 ## Follow-ups
 
-- **W5-A gap list** — `docs/audits/lifeops-2026-05-11/wave-5a-gap-list.md`
-  lands when the W5-A multi-tier validation run completes. Link will be
-  added under "Wave 5 follow-ups" once committed.
+- **W5-A gap list** — [`wave-5a-gap-list.md`](./wave-5a-gap-list.md)
+  (committed 2026-05-11; P0=0, P1=3 real fixes + 4 no-ops, P2=5, P3=6).
 - **Wave-3 P0/P1 follow-ups** — full list in
   [`REPORT.md`](./REPORT.md) "Known issues + follow-ups" and
   [`rebaseline-report.md`](./rebaseline-report.md). Headline items:
 
@@ -146,7 +146,7 @@ Full per-wave detail in [`INDEX.md`](./INDEX.md). Headline commits:
 |----------------------------------------|------:|:-------|
 | Cache stability                        |    10 | pass   |
 | Benchmarks lib (TS)                    |    44 | pass   |
-| DSPy primitives (TS)                   |    11 | pass   |
+| DSPy primitives (TS)                   |     9 | pass   |
 | Action-retrieval measurement (TS)      |     7 | pass   |
 | Retrieval defaults (TS)                |    10 | pass   |
 | Retrieval funnel script (TS)           |     3 | pass   |
@@ -211,9 +211,13 @@ From [`rebaseline-report.md`](./rebaseline-report.md) Wave-3 follow-ups:
 - **[P2] Plumb hermes per-turn `cost_usd` and `latency_ms`** into
   `MessageTurn` for granular debugging.
 
-`docs/audits/lifeops-2026-05-11/wave-5a-gap-list.md` lands separately
-as W5-A completes — it will be linked from [`INDEX.md`](./INDEX.md)
-under "Wave 5 follow-ups" once the multi-tier validation run finishes.
+[`wave-5a-gap-list.md`](./wave-5a-gap-list.md) is the post-rebuild gap
+inventory (committed 2026-05-11 after the rate-limit-delayed W5-A run
+resumed). Headline: P0=0, P1=3 real fixes + 4 no-op confirmations,
+P2=5 document-only, P3=6 follow-up tracked. The four items W5-B was
+pre-assigned (`browser.ts`, `plugin-music` test, `test_hermes_agent`,
+`action-retrieval` regex namespace) all confirmed green on `develop`
+under `6ef80720a9`.
 
 Wave 4-B residuals (full text in
 [`known-typecheck-failures.md`](./known-typecheck-failures.md)):
 
@@ -0,0 +1,202 @@
+# LifeOps pipeline — operator runbook (2026-05-11)
+
+What's left after the rebuild that the **operator** (not Claude) must do.
+Each item lists the command, prerequisites, expected runtime, and what to
+do with the output.
+
+Cross-links: [`REPORT.md`](./REPORT.md) (canonical summary),
+[`INDEX.md`](./INDEX.md) (per-wave deliverables),
+[`wave-5a-gap-list.md`](./wave-5a-gap-list.md) (gap inventory).
+
+---
+
+## 1. First measured retrieval-funnel run (gap-list P3#1)
+
+**Why**: `retrieval-funnel.{md,json}` is structurally correct but reports
+`counted samples: 0` because no full run has yet emitted measurement
+trajectories. The per-tier defaults in
+`packages/benchmarks/lib/src/retrieval-defaults.ts` are heuristic until
+real measurements land.
+
+**Prereq**: `CEREBRAS_API_KEY` in `.env`.
+
+**Command**:
+```bash
+cd /Users/shawwalters/milaidy/eliza
+MILADY_RETRIEVAL_MEASUREMENT=1 bun run lifeops:multi-tier:core
+bun run lifeops:retrieval:funnel
+bun run lifeops:retrieval:pareto
+```
+
+**Estimated runtime**: 30–90 min (Cerebras throughput dependent).
+
+**Action after**: review `retrieval-funnel.md` and `retrieval-pareto.md`.
+Either update the constants in `retrieval-defaults.ts` with measured
+top-K + stage weights, or document the measured deltas if the heuristics
+held up.
+
+---
+
+## 2. Anthropic re-bench with DSPy-optimized planner (P3#2)
+
+**Why**: the current rebaseline is Cerebras-only because `ANTHROPIC_API_KEY`
+was unset at W2-9 time.
+
+**Prereq**: `ANTHROPIC_API_KEY` in `.env`.
+
+**Command**:
+```bash
+bun run lifeops:multi-tier:smoke --tiers frontier
+# After it completes, diff vs the Cerebras baseline:
+bun run lifeops:delta -- \
+  --baseline runs/<cerebras-runId> \
+  --candidate runs/<anthropic-runId> \
+  --out runs/anthropic-vs-cerebras
+```
+
+**Estimated runtime**: 10–20 min (Anthropic Opus 4.7).
+
+**Action after**: confirm pass-rate and cost deltas; if Anthropic regresses
+materially on a scenario the planner improved on Cerebras, that's a sign
+the DSPy-optimized planner over-fit to the Cerebras teacher.
+
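The over-fit check described in item 2 can be sketched as a comparison of per-scenario deltas. This is a minimal sketch under stated assumptions: the function name `overfit_suspects`, the input shape (scenario name mapped to pass-rate delta versus baseline), and the `tolerance` default are all hypothetical, since the runbook does not define what counts as a "material" regression and the output format of `lifeops:delta` is not shown here.

```python
# Hypothetical helper; not part of the lifeops tooling.
def overfit_suspects(
    cerebras_delta: dict[str, float],
    anthropic_delta: dict[str, float],
    tolerance: float = 0.05,  # assumed cutoff for a "material" regression
) -> list[str]:
    """Scenarios that improved on Cerebras but regressed past `tolerance` on Anthropic."""
    return sorted(
        name
        for name, gain in cerebras_delta.items()
        if gain > 0 and anthropic_delta.get(name, 0.0) < -tolerance
    )
```

A non-empty result would be a prompt to inspect those scenarios by hand, not proof of over-fitting on its own.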
+---
+
+## 3. Run other lifeops domains (P3#3)
+
+**Why**: the W2-9 rebaseline is calendar-only (25/25 scenarios). The full
+suite has 100+ scenarios across mail, reminders, contacts, finance,
+travel, health, sleep, etc. Per-domain numbers diverge significantly —
+hermes peaked at 0.494 on `mail` in W1-3 while `calendar` is much harder.
+
+**Prereq**: `CEREBRAS_API_KEY`.
+
+**Command**:
+```bash
+# Single domain
+python -m eliza_lifeops_bench --agent hermes --suite core --domain mail
+python -m eliza_lifeops_bench --agent hermes --suite core --domain reminders
+python -m eliza_lifeops_bench --agent hermes --suite core --domain contacts
+python -m eliza_lifeops_bench --agent hermes --suite core --domain finance
+python -m eliza_lifeops_bench --agent hermes --suite core --domain travel
+python -m eliza_lifeops_bench --agent hermes --suite core --domain health
+python -m eliza_lifeops_bench --agent hermes --suite core --domain sleep
+
+# Or the full core suite in one shot:
+bun run lifeops:multi-tier:core
+```
+
+**Estimated runtime**: ~5–10 min per domain × 7 domains = 35–70 min.
+
+**Action after**: update `rebaseline-report.md` with per-domain numbers.
+Anything < 0.30 pass-rate is a candidate for targeted scenario-level
+investigation.
+
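The 0.30 threshold from item 3 could be applied with a small triage helper once per-domain numbers exist. This is only a sketch: `PASS_RATE_FLOOR`, `flag_weak_domains`, and the input shape (domain name mapped to pass-rate) are assumptions for illustration and are not part of `eliza_lifeops_bench`. In the usage example, only `mail` at 0.494 comes from the runbook; the other entries are invented placeholders.

```python
# Hypothetical triage helper; not part of eliza_lifeops_bench.
PASS_RATE_FLOOR = 0.30  # threshold named in runbook item 3


def flag_weak_domains(
    pass_rates: dict[str, float], floor: float = PASS_RATE_FLOOR
) -> list[str]:
    """Return domains whose pass-rate is strictly below the floor, worst first."""
    weak = [(rate, domain) for domain, rate in pass_rates.items() if rate < floor]
    return [domain for _rate, domain in sorted(weak)]


if __name__ == "__main__":
    # placeholder_a / placeholder_b are made-up values, not measured results.
    sample = {"mail": 0.494, "placeholder_a": 0.12, "placeholder_b": 0.28}
    print(flag_weak_domains(sample))
```

The comparison is strict, so a domain sitting exactly at 0.30 is not flagged, matching the "< 0.30" wording above.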
+---
+
+## 4. Plumb hermes per-turn cost+latency (P3#4 / F4)
+
+**Status**: in progress under Wave 6-F4 (`wave-6-f4` commit). After it
+lands, confirm with:
+```bash
+cd packages/benchmarks/lifeops-bench && python -m pytest tests/test_unified_telemetry.py -v
+```
+
+---
+
+## 5. smoke_static_calendar_01 "scheduled, deep work" re-baseline (P3#5)
+
+**Why**: the combination of this scenario's required-output substring and
+W4-D's `BLOCK` simile fix should now unblock it, but the rebaseline did
+not include a dedicated re-run.
+
+**Command**:
+```bash
+python -m eliza_lifeops_bench \
+  --agent hermes \
+  --scenario smoke_static_calendar_01 \
+  --seeds 5
+```
+
+**Estimated runtime**: 1–2 min.
+
+**Action after**: if it still fails, dump the agent transcript and check
+whether the substring match is too strict (look at `scorer.py`'s
+substring-match logic).
+
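When judging whether the match in item 5 is too strict, it may help to compare an exact substring check against a normalized one on the dumped transcript. The sketch below is not the logic in `scorer.py`, which is not reproduced in this runbook; both function names and the normalization rules (lowercasing, punctuation stripped, whitespace collapsed) are assumptions chosen for illustration.

```python
import re


def strict_match(output: str, required: str) -> bool:
    """Exact, case-sensitive substring check."""
    return required in output


def normalized_match(output: str, required: str) -> bool:
    """Case-insensitive check that ignores punctuation and collapses whitespace."""

    def norm(text: str) -> str:
        # Replace every non-word, non-space character with a space, then
        # collapse runs of whitespace into single spaces.
        text = re.sub(r"[^\w\s]", " ", text.lower())
        return " ".join(text.split())

    return norm(required) in norm(output)
```

If a transcript passes `normalized_match` but fails `strict_match`, the failure is more likely a formatting mismatch than a planning error.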
+---
+
+## 6. eliza-1-* bundle `final` flips (P3#6)
+
+**Why**: all 5 eliza-1 bundles are currently `releaseState=local-standin`,
+`publishEligible=false`, `final.weights=false`. The aggregator stamps a
+PRE-RELEASE banner on every report that uses them. To remove the banner,
+each bundle must clear its per-bundle checklist.
+
+**Per-bundle checklist** — see [`eliza-1-status.md`](./eliza-1-status.md).
+Per bundle, the operator must:
+- Ship final weights (not local-standin).
+- Validate `sha256`.
+- Set `releaseState: "final"` in the bundle `manifest.json`.
+- Flip `publishEligible: true` and `final.weights: true`.
+
+**Verification per bundle**:
+```bash
+bun -e "import('@elizaos-benchmarks/lib').then(m =>
+  m.readElizaOneBundle('~/.eliza/local-inference/models/eliza-1-0.6b.bundle')
+    .then(b => console.log({bundleId: b.bundleId, preRelease: m.bundleIsPreRelease(b)})))"
+```
+
+**Owner**: eliza-1 inference team.
+
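The per-bundle checklist in item 6 can be read as three manifest fields that all have to flip. The following is a minimal sketch, not the implementation of `bundleIsPreRelease`: the function name `remaining_flips` is hypothetical, and it assumes `final.weights` is stored as a nested `final` object with a `weights` key in `manifest.json`, which the runbook does not confirm. The `sha256` validation step is out of scope here.

```python
# Hypothetical checklist helper; the manifest layout is an assumption.
def remaining_flips(manifest: dict) -> list[str]:
    """List the checklist fields that still keep a bundle out of the `final` state."""
    pending = []
    if manifest.get("releaseState") != "final":
        pending.append("releaseState")
    if manifest.get("publishEligible") is not True:
        pending.append("publishEligible")
    if manifest.get("final", {}).get("weights") is not True:
        pending.append("final.weights")
    return pending
```

An empty list means all three flags have been flipped; anything else names the fields the operator still has to change.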
+---
+
+## 7. DFlash drafters for 0.6B and 1.7B (P2#2)
+
+**Why**: per [`eliza-1-status.md`](./eliza-1-status.md), the dflash
+server falls back to base weights for the 0.6B and 1.7B bundles, which
+forfeits the throughput gain from speculative decoding.
+
+**Owner**: eliza-1 inference team. Track in
+[`eliza-1-status.md`](./eliza-1-status.md) per bundle.
+
+---
+
+## 8. Personality model-level gaps (P2#5)
+
+**Why**: two scenarios fail across all agents:
+- `hold_style.aggressive.code.004`
+- `escalation.aggressive.code.004`
+
+Per [`rebaseline-report.md`](./rebaseline-report.md), this is a Cerebras
+gpt-oss-120b instruction-following limitation under aggressive register.
+Not a harness bug.
+
+**Action**: document as a known model limitation. Revisit on next model
+upgrade (e.g. when gpt-oss-180b ships, or when the Cerebras-served
+fine-tune of gpt-oss arrives).
+
+---
+
+## Verification commands after every runbook item
+
+```bash
+bun run test:cache-stability
+bun test packages/benchmarks/lib/src/__tests__/
+bun test plugins/app-training/src/dspy/__tests__/
+cd packages/benchmarks/lifeops-bench && python -m pytest tests/ -q
+```
+
+All must remain green. If anything regresses, fix before continuing.
+
+---
+
+## Final close-out checklist
+
+- [x] All P1 items committed (`wave-6-f1` manifest + validator,
+  `wave-6-f2` DSPy count reconcile).
+- [x] All addressable P2 items committed (`wave-6-f3` Cerebras endpoint).
+- [x] All addressable P3 items committed (`wave-6-f4` per-turn cost).
+- [x] Runbook published (this file, `wave-6-f5`).
+- [ ] Runbook items 1–3, 5, 6, 7, 8 acknowledged + scheduled by operator.
+- [ ] `git push origin develop` once operator approves.
@@ -0,0 +1,67 @@
+# app-lifeops typecheck cleanup (W4-J, 2026-05-11)
+
+## Summary
+
+The W4-J brief listed ~51 pre-existing typecheck errors in
+`plugins/app-lifeops/src/` (readonly tuple mismatch in `owner-surfaces.ts`,
+missing `State` imports in `health.ts` / `screen-time.ts`, missing `bun:ffi`
+in the two Apple connectors, and a stale `websiteBlockAction` re-export in
+`website-blocker/public.ts`).
+
+When W4-J (retry) checked out `develop` and ran `bunx tsc --noEmit -p
+tsconfig.build.json`, the typecheck reported **0 errors** and exited 0.
+Every category in the brief had already been fixed by intervening commits
+on `develop`. The work was a no-op; this doc records the verification and
+the commits that landed the fixes.
+
+## Verification
+
+```text
+$ cd plugins/app-lifeops
+$ rm -f tsconfig.owned-mixins.tmp.tsbuildinfo
+$ bunx tsc --noEmit -p tsconfig.build.json ; echo "exit: $?"
+exit: 0
+```
+
+A second run used a temporary tsconfig under `/tmp` that includes the
+source tree directly (no `exclude`), checked against the paths from the
+shared `tsconfig.build.shared.json`. It also reported 0 errors.
+
+Scoped test suite:
+
+```text
+$ bun x vitest run --config vitest.config.ts
+ Test Files  56 passed (56)
+      Tests  546 passed | 1 skipped (547)
+```
+
+No regressions.
+
+## Per-category status
+
+| Brief item | File | Status | Where it was fixed |
+| ---------- | ---- | ------ | ------------------ |
+| TS2304 `Cannot find name 'State'` | `src/actions/health.ts` | already imported (line 21, inside the `@elizaos/core` `import type {...}` block) | predates W4-J |
+| TS2304 `Cannot find name 'State'` | `src/actions/screen-time.ts` | already imported (line 21) | predates W4-J |
+| TS2792/TS2307 `Cannot find module 'bun:ffi'` | `src/lifeops/apple-calendar.ts` | `/// <reference types="bun-types" />` on line 1; dynamic `await import("bun:ffi")` on line 151 | predates W4-J |
+| TS2792/TS2307 `Cannot find module 'bun:ffi'` | `src/lifeops/apple-reminders.ts` | `/// <reference types="bun-types" />` on line 1; dynamic `await import("bun:ffi")` on line 155 | predates W4-J |
+| TS2322 readonly `LIFE_TAGS` | `src/actions/owner-surfaces.ts:158` | line 158 is now `description: args.description,`; the `LIFE_TAGS` assignment in this builder typechecks against the `Action` shape on current `develop` | predates W4-J |
+| TS2305 missing `websiteBlockAction` export | `src/website-blocker/public.ts:6` | `export { blockAction as websiteBlockAction } from "../actions/block.js";` (line 9) — fixed in `e7c3136a91` ("chore(lifeops): fix dangling websiteBlockAction re-export") | predates W4-J |
+
+## Remaining errors
+
+None in `plugins/app-lifeops/src/` against `tsconfig.build.json`.
+
+## Follow-ups (Wave 5)
+
+- The plugin still has no `typecheck` script in `package.json`, so the
+  workspace-level `bun run typecheck` (turbo) skips it entirely. The
+  in-package `verify` script uses `tsc --noCheck -p tsconfig.build.json`
+  which intentionally does not surface diagnostics. Consider adding
+  `"typecheck": "tsc --noEmit -p tsconfig.build.json"` so regressions in
+  this package gate the workspace check.
+- The two Apple FFI files keep `/// <reference types="bun-types" />` as a
+  triple-slash directive instead of pulling `bun-types` in via tsconfig
+  `types`. Either form works; tsconfig-level inclusion would let the
+  directive be dropped if other files in the package also start using
+  `bun:ffi` / `Bun` globals.