Commit 97c0d3c

Author: lalalune
merge: sync develop with origin
2 parents f1f0642 + 2431bee commit 97c0d3c

39 files changed

Lines changed: 3635 additions & 441 deletions

docs/audits/lifeops-2026-05-11/INDEX.md

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -87,15 +87,15 @@ Unified metrics, prompt optimization, multi-tier model e2e, native DSPy-style op
8787

8888
### Wave 5 — verify + close gaps
8989
- [~] W5-A full multi-tier run (small / mid / large / frontier) — concurrent
90+
- [`wave-5a-gap-list.md`](./wave-5a-gap-list.md) — post-rebuild gap inventory
9091
- [~] W5-B delta vs. baseline, optimizer improvement >= 20pp — concurrent
9192
- [x] W5-C final REPORT.md + INDEX.md close-out
9293
- `docs/audits/lifeops-2026-05-11/REPORT.md` (this commit)
9394

9495
## Follow-ups
9596

96-
- **W5-A gap list**: `docs/audits/lifeops-2026-05-11/wave-5a-gap-list.md`
97-
lands when the W5-A multi-tier validation run completes. Link will be
98-
added under "Wave 5 follow-ups" once committed.
97+
- **W5-A gap list**: [`wave-5a-gap-list.md`](./wave-5a-gap-list.md)
98+
(committed 2026-05-11; P0=0, P1=3 real fixes + 4 no-ops, P2=5, P3=6).
9999
- **Wave-3 P0/P1 follow-ups** — full list in
100100
[`REPORT.md`](./REPORT.md) "Known issues + follow-ups" and
101101
[`rebaseline-report.md`](./rebaseline-report.md). Headline items:

docs/audits/lifeops-2026-05-11/REPORT.md

Lines changed: 8 additions & 4 deletions
@@ -146,7 +146,7 @@ Full per-wave detail in [`INDEX.md`](./INDEX.md). Headline commits:
146146
|----------------------------------------|------:|:-------|
147147
| Cache stability | 10 | pass |
148148
| Benchmarks lib (TS) | 44 | pass |
149-
| DSPy primitives (TS) | 11 | pass |
149+
| DSPy primitives (TS) | 9 | pass |
150150
| Action-retrieval measurement (TS) | 7 | pass |
151151
| Retrieval defaults (TS) | 10 | pass |
152152
| Retrieval funnel script (TS) | 3 | pass |
@@ -211,9 +211,13 @@ From [`rebaseline-report.md`](./rebaseline-report.md) Wave-3 follow-ups:
211211
- **[P2] Plumb hermes per-turn `cost_usd` and `latency_ms`** into
212212
`MessageTurn` for granular debugging.
213213

214-
`docs/audits/lifeops-2026-05-11/wave-5a-gap-list.md` lands separately
215-
as W5-A completes — it will be linked from [`INDEX.md`](./INDEX.md)
216-
under "Wave 5 follow-ups" once the multi-tier validation run finishes.
214+
[`wave-5a-gap-list.md`](./wave-5a-gap-list.md) is the post-rebuild gap
215+
inventory (committed 2026-05-11 after the rate-limit-delayed W5-A run
216+
resumed). Headline: P0=0, P1=3 real fixes + 4 no-op confirmations,
217+
P2=5 document-only, P3=6 follow-up tracked. The four items W5-B was
218+
pre-assigned (`browser.ts`, `plugin-music` test, `test_hermes_agent`,
219+
`action-retrieval` regex namespace) all confirmed green on `develop`
220+
under `6ef80720a9`.
217221

218222
Wave 4-B residuals (full text in
219223
[`known-typecheck-failures.md`](./known-typecheck-failures.md)):
Lines changed: 202 additions & 0 deletions
@@ -0,0 +1,202 @@
1+
# LifeOps pipeline — operator runbook (2026-05-11)
2+
3+
Remaining post-rebuild tasks that the **operator** (not Claude) must do.
4+
Each item lists the command, prerequisites, expected runtime, and what to
5+
do with the output.
6+
7+
Cross-links: [`REPORT.md`](./REPORT.md) (canonical summary),
8+
[`INDEX.md`](./INDEX.md) (per-wave deliverables),
9+
[`wave-5a-gap-list.md`](./wave-5a-gap-list.md) (gap inventory).
10+
11+
---
12+
13+
## 1. First measured retrieval-funnel run (gap-list P3#1)
14+
15+
**Why**: `retrieval-funnel.{md,json}` is structurally correct but reports
16+
`counted samples: 0` because no full run has yet emitted measurement
17+
trajectories. The per-tier defaults in
18+
`packages/benchmarks/lib/src/retrieval-defaults.ts` are heuristic until
19+
real measurements land.
20+
21+
**Prereq**: `CEREBRAS_API_KEY` in `.env`.
22+
23+
**Command**:
24+
```bash
25+
cd /Users/shawwalters/milaidy/eliza
26+
MILADY_RETRIEVAL_MEASUREMENT=1 bun run lifeops:multi-tier:core
27+
bun run lifeops:retrieval:funnel
28+
bun run lifeops:retrieval:pareto
29+
```
30+
31+
**Estimated runtime**: 30–90 min (Cerebras throughput dependent).
32+
33+
**Action after**: review `retrieval-funnel.md` and `retrieval-pareto.md`.
34+
Either update the constants in `retrieval-defaults.ts` with measured
35+
top-K + stage weights, or document the measured deltas if the heuristics
36+
held up.
37+
38+
---
39+
40+
## 2. Anthropic re-bench with DSPy-optimized planner (P3#2)
41+
42+
**Why**: the current rebaseline is Cerebras-only because `ANTHROPIC_API_KEY`
43+
was unset at W2-9 time.
44+
45+
**Prereq**: `ANTHROPIC_API_KEY` in `.env`.
46+
47+
**Command**:
48+
```bash
49+
bun run lifeops:multi-tier:smoke --tiers frontier
50+
# After it completes, diff vs the Cerebras baseline:
51+
bun run lifeops:delta -- \
52+
--baseline runs/<cerebras-runId> \
53+
--candidate runs/<anthropic-runId> \
54+
--out runs/anthropic-vs-cerebras
55+
```
56+
57+
**Estimated runtime**: 10–20 min (Anthropic Opus 4.7).
58+
59+
**Action after**: confirm pass-rate and cost deltas; if Anthropic regresses
60+
materially on a scenario the planner improved on Cerebras, that's a sign
61+
the DSPy-optimized planner over-fit to the Cerebras teacher.
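
The over-fit check described above can be sketched as a per-scenario comparison. This is a minimal illustration, not the `lifeops:delta` implementation: the function name, the dict-of-pass-rates input shape, and both thresholds (`min_gain`, `max_drop`) are assumptions chosen for the example, and the scenario ids are hypothetical.

```python
from typing import Dict, List


def overfit_suspects(
    cerebras_before: Dict[str, float],
    cerebras_after: Dict[str, float],
    anthropic_after: Dict[str, float],
    min_gain: float = 0.10,   # assumed: what counts as "improved on Cerebras"
    max_drop: float = 0.10,   # assumed: what counts as a "material" regression
) -> List[str]:
    """Return scenarios that improved on Cerebras but lag on Anthropic."""
    suspects = []
    for scenario, before in cerebras_before.items():
        after = cerebras_after.get(scenario)
        other = anthropic_after.get(scenario)
        if after is None or other is None:
            continue  # scenario missing from one of the runs; skip it
        improved_on_cerebras = (after - before) >= min_gain
        regressed_on_anthropic = (after - other) >= max_drop
        if improved_on_cerebras and regressed_on_anthropic:
            suspects.append(scenario)
    return sorted(suspects)


# Hypothetical pass rates keyed by hypothetical scenario ids.
before = {"a": 0.4, "b": 0.5, "c": 0.2}
after = {"a": 0.8, "b": 0.5, "c": 0.6}
anthropic = {"a": 0.5, "b": 0.2, "c": 0.7}
print(overfit_suspects(before, after, anthropic))
```

Only scenarios that clear both thresholds are reported, so a scenario that is simply hard for every model does not show up as an over-fit signal.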
62+
63+
---
64+
65+
## 3. Run other lifeops domains (P3#3)
66+
67+
**Why**: the W2-9 rebaseline is calendar-only (25/25 scenarios). The full
68+
suite has 100+ scenarios across mail, reminders, contacts, finance,
69+
travel, health, sleep, etc. Per-domain numbers diverge significantly —
70+
hermes peaked at 0.494 on `mail` in W1-3 while `calendar` is much harder.
71+
72+
**Prereq**: `CEREBRAS_API_KEY`.
73+
74+
**Command**:
75+
```bash
76+
# Single domain
77+
python -m eliza_lifeops_bench --agent hermes --suite core --domain mail
78+
python -m eliza_lifeops_bench --agent hermes --suite core --domain reminders
79+
python -m eliza_lifeops_bench --agent hermes --suite core --domain contacts
80+
python -m eliza_lifeops_bench --agent hermes --suite core --domain finance
81+
python -m eliza_lifeops_bench --agent hermes --suite core --domain travel
82+
python -m eliza_lifeops_bench --agent hermes --suite core --domain health
83+
python -m eliza_lifeops_bench --agent hermes --suite core --domain sleep
84+
85+
# Or the full core suite in one shot:
86+
bun run lifeops:multi-tier:core
87+
```
88+
89+
**Estimated runtime**: ~5–10 min per domain × 7 domains = 35–70 min.
90+
91+
**Action after**: update `rebaseline-report.md` with per-domain numbers.
92+
Anything < 0.30 pass-rate is a candidate for targeted scenario-level
93+
investigation.
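
As a rough sketch of the triage rule above, the per-domain numbers can be filtered against the 0.30 floor. The input shape (a plain dict of domain to pass rate) and the function name are assumptions; apart from the `mail` figure quoted earlier, the values below are made up for illustration.

```python
from typing import Dict, List

PASS_RATE_FLOOR = 0.30  # threshold stated in the runbook


def domains_below_floor(
    pass_rates: Dict[str, float], floor: float = PASS_RATE_FLOOR
) -> List[str]:
    """Return domains whose pass rate is strictly below the floor."""
    return sorted(d for d, rate in pass_rates.items() if rate < floor)


# 0.494 for mail is the W1-3 figure; the other values are illustrative only.
example = {"mail": 0.494, "travel": 0.25, "sleep": 0.30}
print(domains_below_floor(example))
```

The comparison is strict, so a domain sitting exactly at 0.30 is not flagged.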
94+
95+
---
96+
97+
## 4. Plumb hermes per-turn cost+latency (P3#4 / F4)
98+
99+
**Status**: in progress under Wave 6-F4 (`wave-6-f4` commit). After it
100+
lands, confirm with:
101+
```bash
102+
cd packages/benchmarks/lifeops-bench && python -m pytest tests/test_unified_telemetry.py -v
103+
```
104+
105+
---
106+
107+
## 5. smoke_static_calendar_01 "scheduled, deep work" re-baseline (P3#5)
108+
109+
**Why**: this scenario's required-output substring + W4-D's `BLOCK`
110+
simile fix should now unblock it, but the rebaseline didn't include a
111+
specific re-run.
112+
113+
**Command**:
114+
```bash
115+
python -m eliza_lifeops_bench \
116+
--agent hermes \
117+
--scenario smoke_static_calendar_01 \
118+
--seeds 5
119+
```
120+
121+
**Estimated runtime**: 1–2 min.
122+
123+
**Action after**: if it still fails, dump the agent transcript and check
124+
whether the substring match is too strict (look at `scorer.py`'s
125+
substring-match logic).
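
To judge whether a substring match is "too strict", it can help to compare an exact match against a normalized one on the dumped transcript. The sketch below does not reproduce `scorer.py`; both functions are hypothetical helpers, and the normalization rules (lowercasing, dropping punctuation, collapsing whitespace) are one possible relaxation among several.

```python
import re


def strict_match(required: str, output: str) -> bool:
    """Exact, case-sensitive substring test."""
    return required in output


def _normalize(text: str) -> str:
    # Lowercase, replace punctuation with spaces, collapse runs of whitespace.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def relaxed_match(required: str, output: str) -> bool:
    """Substring test after normalizing both sides."""
    return _normalize(required) in _normalize(output)


transcript = "Scheduled: Deep  Work block at 9am"  # made-up agent output
print(strict_match("deep work", transcript))   # False: case and spacing differ
print(relaxed_match("deep work", transcript))  # True after normalization
```

If the relaxed form passes where the strict form fails, the failure is more likely a formatting mismatch than a planning error.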
126+
127+
---
128+
129+
## 6. eliza-1-* bundle `final` flips (P3#6)
130+
131+
**Why**: all 5 eliza-1 bundles are currently `releaseState=local-standin`,
132+
`publishEligible=false`, `final.weights=false`. The aggregator stamps a
133+
PRE-RELEASE banner on every report that uses them. To remove the banner,
134+
each bundle must clear its per-bundle checklist.
135+
136+
**Per-bundle checklist** — see [`eliza-1-status.md`](./eliza-1-status.md).
137+
Per bundle, the operator must:
138+
- Ship final weights (not local-standin).
139+
- Validate `sha256`.
140+
- Set `releaseState: "final"` in the bundle `manifest.json`.
141+
- Flip `publishEligible: true` and `final.weights: true`.
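
The checklist above could be checked mechanically along these lines. This is a sketch under assumptions: the field names come from the checklist, but the nesting of `final.weights` as `{"final": {"weights": ...}}`, the function names, and the idea of hashing a single weights file are guesses about the manifest layout, not the documented bundle format.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unmet_checklist_items(manifest: dict) -> List[str]:
    """List the checklist flags that are not yet in their final state."""
    problems = []
    if manifest.get("releaseState") != "final":
        problems.append("releaseState is not 'final'")
    if manifest.get("publishEligible") is not True:
        problems.append("publishEligible is not true")
    final = manifest.get("final")  # assumed to be a nested object
    weights = final.get("weights") if isinstance(final, dict) else None
    if weights is not True:
        problems.append("final.weights is not true")
    return problems


# The current state described in this section.
standin = {
    "releaseState": "local-standin",
    "publishEligible": False,
    "final": {"weights": False},
}
print(unmet_checklist_items(standin))
```

An empty list means the three flags are set; the `sha256_of` result still has to be compared by hand against whatever digest the bundle publishes.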
142+
143+
**Verification per bundle**:
144+
```bash
145+
bun -e "import('@elizaos-benchmarks/lib').then(m =>
146+
m.readElizaOneBundle('$HOME/.eliza/local-inference/models/eliza-1-0.6b.bundle')
147+
.then(b => console.log({bundleId: b.bundleId, preRelease: m.bundleIsPreRelease(b)})))"
148+
```
149+
150+
**Owner**: eliza-1 inference team.
151+
152+
---
153+
154+
## 7. DFlash drafters for 0.6B and 1.7B (P2#2)
155+
156+
**Why**: per [`eliza-1-status.md`](./eliza-1-status.md), the dflash
157+
server falls back to base weights for the 0.6B and 1.7B bundles,
158+
which forfeits the speculative-decoding throughput gain.
159+
160+
**Owner**: eliza-1 inference team. Track in
161+
[`eliza-1-status.md`](./eliza-1-status.md) per bundle.
162+
163+
---
164+
165+
## 8. Personality model-level gaps (P2#5)
166+
167+
**Why**: two scenarios fail across all agents:
168+
- `hold_style.aggressive.code.004`
169+
- `escalation.aggressive.code.004`
170+
171+
Per [`rebaseline-report.md`](./rebaseline-report.md), this is a Cerebras
172+
gpt-oss-120b instruction-following limitation under aggressive register.
173+
Not a harness bug.
174+
175+
**Action**: document as a known model limitation. Revisit on next model
176+
upgrade (e.g. when gpt-oss-180b ships, or when the Cerebras-served
177+
fine-tune of gpt-oss arrives).
178+
179+
---
180+
181+
## Verification commands after every runbook item
182+
183+
```bash
184+
bun run test:cache-stability
185+
bun test packages/benchmarks/lib/src/__tests__/
186+
bun test plugins/app-training/src/dspy/__tests__/
187+
cd packages/benchmarks/lifeops-bench && python -m pytest tests/ -q
188+
```
189+
190+
All must remain green. If anything regresses, fix before continuing.
191+
192+
---
193+
194+
## Final close-out checklist
195+
196+
- [x] All P1 items committed (`wave-6-f1` manifest + validator,
197+
`wave-6-f2` DSPy count reconcile).
198+
- [x] All addressable P2 items committed (`wave-6-f3` Cerebras endpoint).
199+
- [x] All addressable P3 items committed (`wave-6-f4` per-turn cost).
200+
- [x] Runbook published (this file, `wave-6-f5`).
201+
- [ ] Runbook items 1–3, 5, 6, 7, 8 acknowledged + scheduled by operator.
202+
- [ ] `git push origin develop` once operator approves.
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
1+
# app-lifeops typecheck cleanup (W4-J, 2026-05-11)
2+
3+
## Summary
4+
5+
The W4-J brief listed ~51 pre-existing typecheck errors in
6+
`plugins/app-lifeops/src/` (readonly tuple mismatch in `owner-surfaces.ts`,
7+
missing `State` imports in `health.ts` / `screen-time.ts`, missing `bun:ffi`
8+
in the two Apple connectors, and a stale `websiteBlockAction` re-export in
9+
`website-blocker/public.ts`).
10+
11+
When W4-J (retry) checked out `develop` and ran `bunx tsc --noEmit -p
12+
tsconfig.build.json`, the typecheck exited **0 errors**. Every category in
13+
the brief had already been fixed by intervening commits on `develop`. The
14+
work was a no-op; this doc records the verification and the commits that
15+
landed the fixes.
16+
17+
## Verification
18+
19+
```text
20+
$ cd plugins/app-lifeops
21+
$ rm -f tsconfig.owned-mixins.tmp.tsbuildinfo
22+
$ bunx tsc --noEmit -p tsconfig.build.json ; echo "exit: $?"
23+
exit: 0
24+
```
25+
26+
Also exercised with a /tmp tsconfig that includes the source tree directly
27+
(no `exclude`) against the shared `tsconfig.build.shared.json` paths — also
28+
0 errors.
29+
30+
Scoped test suite:
31+
32+
```text
33+
$ bun x vitest run --config vitest.config.ts
34+
Test Files 56 passed (56)
35+
Tests 546 passed | 1 skipped (547)
36+
```
37+
38+
No regressions.
39+
40+
## Per-category status
41+
42+
| Brief item | File | Status | Where it was fixed |
43+
| ---------- | ---- | ------ | ------------------ |
44+
| TS2304 `Cannot find name 'State'` | `src/actions/health.ts` | already imported (line 21, inside the `@elizaos/core` `import type {...}` block) | predates W4-J |
45+
| TS2304 `Cannot find name 'State'` | `src/actions/screen-time.ts` | already imported (line 21) | predates W4-J |
46+
| TS2792/TS2307 `Cannot find module 'bun:ffi'` | `src/lifeops/apple-calendar.ts` | `/// <reference types="bun-types" />` on line 1; dynamic `await import("bun:ffi")` on line 151 | predates W4-J |
47+
| TS2792/TS2307 `Cannot find module 'bun:ffi'` | `src/lifeops/apple-reminders.ts` | `/// <reference types="bun-types" />` on line 1; dynamic `await import("bun:ffi")` on line 155 | predates W4-J |
48+
| TS2322 readonly `LIFE_TAGS` | `src/actions/owner-surfaces.ts:158` | line 158 is now `description: args.description,`; the `LIFE_TAGS` assignment in this builder typechecks against the `Action` shape on current `develop` | predates W4-J |
49+
| TS2305 missing `websiteBlockAction` export | `src/website-blocker/public.ts:6` | `export { blockAction as websiteBlockAction } from "../actions/block.js";` (line 9) — fixed in `e7c3136a91` ("chore(lifeops): fix dangling websiteBlockAction re-export") | predates W4-J |
50+
51+
## Remaining errors
52+
53+
None in `plugins/app-lifeops/src/` against `tsconfig.build.json`.
54+
55+
## Followups (Wave 5)
56+
57+
- The plugin still has no `typecheck` script in `package.json`, so the
58+
workspace-level `bun run typecheck` (turbo) skips it entirely. The
59+
in-package `verify` script uses `tsc --noCheck -p tsconfig.build.json`
60+
which intentionally does not surface diagnostics. Consider adding
61+
`"typecheck": "tsc --noEmit -p tsconfig.build.json"` so regressions in
62+
this package gate the workspace check.
63+
- The two Apple FFI files keep `/// <reference types="bun-types" />` as a
64+
triple-slash directive instead of pulling `bun-types` in via tsconfig
65+
`types`. Either form works; tsconfig-level inclusion would let the
66+
directive be dropped if other files in the package also start using
67+
`bun:ffi` / `Bun` globals.
