| 1 | +# LifeOps pipeline — operator runbook (2026-05-11) |
| 2 | + |
| 3 | +Remaining post-rebuild tasks that the **operator** (not Claude) must carry out. |
| 4 | +Each item lists the command, prerequisites, expected runtime, and what to |
| 5 | +do with the output. |
| 6 | + |
| 7 | +Cross-links: [`REPORT.md`](./REPORT.md) (canonical summary), |
| 8 | +[`INDEX.md`](./INDEX.md) (per-wave deliverables), |
| 9 | +[`wave-5a-gap-list.md`](./wave-5a-gap-list.md) (gap inventory). |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## 1. First measured retrieval-funnel run (gap-list P3#1) |
| 14 | + |
| 15 | +**Why**: `retrieval-funnel.{md,json}` is structurally correct but reports |
| 16 | +`counted samples: 0` because no full run has yet emitted measurement |
| 17 | +trajectories. The per-tier defaults in |
| 18 | +`packages/benchmarks/lib/src/retrieval-defaults.ts` are heuristic until |
| 19 | +real measurements land. |
| 20 | + |
| 21 | +**Prereq**: `CEREBRAS_API_KEY` in `.env`. |
| 22 | + |
| 23 | +**Command**: |
| 24 | +```bash |
| 25 | +cd /Users/shawwalters/milaidy/eliza |
| 26 | +MILADY_RETRIEVAL_MEASUREMENT=1 bun run lifeops:multi-tier:core |
| 27 | +bun run lifeops:retrieval:funnel |
| 28 | +bun run lifeops:retrieval:pareto |
| 29 | +``` |
| 30 | + |
| 31 | +**Estimated runtime**: 30–90 min (Cerebras throughput dependent). |
| 32 | + |
| 33 | +**Action after**: review `retrieval-funnel.md` and `retrieval-pareto.md`. |
| 34 | +Either update the constants in `retrieval-defaults.ts` with measured |
| 35 | +top-K + stage weights, or document the measured deltas if the heuristics |
| 36 | +held up. |
| 37 | + |
| 38 | +--- |
| 39 | + |
| 40 | +## 2. Anthropic re-bench with DSPy-optimized planner (P3#2) |
| 41 | + |
| 42 | +**Why**: the current rebaseline is Cerebras-only because `ANTHROPIC_API_KEY` |
| 43 | +was unset at W2-9 time. |
| 44 | + |
| 45 | +**Prereq**: `ANTHROPIC_API_KEY` in `.env`. |
| 46 | + |
| 47 | +**Command**: |
| 48 | +```bash |
| 49 | +bun run lifeops:multi-tier:smoke --tiers frontier |
| 50 | +# After it completes, diff vs the Cerebras baseline: |
| 51 | +bun run lifeops:delta -- \ |
| 52 | + --baseline runs/<cerebras-runId> \ |
| 53 | + --candidate runs/<anthropic-runId> \ |
| 54 | + --out runs/anthropic-vs-cerebras |
| 55 | +``` |
| 56 | + |
| 57 | +**Estimated runtime**: 10–20 min (Anthropic Opus 4.7). |
| 58 | + |
| 59 | +**Action after**: confirm pass-rate and cost deltas; if Anthropic regresses |
| 60 | +materially on a scenario the planner improved on Cerebras, that's a sign |
| 61 | +the DSPy-optimized planner over-fit to the Cerebras teacher. |
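The over-fit check described above can be sketched as a small filter over per-scenario deltas. This is only an illustration: the function name, the input shape (scenario id mapped to pass-rate change), the scenario ids, and the `material` threshold of 0.05 are all assumptions, not something `lifeops:delta` is known to emit.

```python
from typing import Dict, List


def overfit_suspects(
    cerebras_gain: Dict[str, float],
    anthropic_delta: Dict[str, float],
    material: float = 0.05,  # hypothetical cut-off for a "material" change
) -> List[str]:
    """Return scenarios that improved on Cerebras but regressed on Anthropic."""
    suspects = []
    for scenario, gain in cerebras_gain.items():
        delta = anthropic_delta.get(scenario)
        if delta is None:
            continue  # scenario missing from the Anthropic run; nothing to compare
        if gain >= material and delta <= -material:
            suspects.append(scenario)
    return sorted(suspects)


if __name__ == "__main__":
    # Placeholder numbers and scenario ids, for illustration only.
    gains = {"scenario_a": 0.20, "scenario_b": 0.20, "scenario_c": 0.00}
    deltas = {"scenario_a": -0.10, "scenario_b": 0.10, "scenario_c": -0.30}
    print(overfit_suspects(gains, deltas))
```

Only scenarios that moved in opposite directions on the two providers are reported. A scenario that regressed on Anthropic without having improved on Cerebras points at something other than teacher over-fitting.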
| 62 | + |
| 63 | +--- |
| 64 | + |
| 65 | +## 3. Run other lifeops domains (P3#3) |
| 66 | + |
| 67 | +**Why**: the W2-9 rebaseline is calendar-only (25/25 scenarios). The full |
| 68 | +suite has 100+ scenarios across mail, reminders, contacts, finance, |
| 69 | +travel, health, sleep, etc. Per-domain numbers diverge significantly — |
| 70 | +hermes peaked at 0.494 on `mail` in W1-3 while `calendar` is much harder. |
| 71 | + |
| 72 | +**Prereq**: `CEREBRAS_API_KEY`. |
| 73 | + |
| 74 | +**Command**: |
| 75 | +```bash |
| 76 | +# One domain per invocation |
| 77 | +python -m eliza_lifeops_bench --agent hermes --suite core --domain mail |
| 78 | +python -m eliza_lifeops_bench --agent hermes --suite core --domain reminders |
| 79 | +python -m eliza_lifeops_bench --agent hermes --suite core --domain contacts |
| 80 | +python -m eliza_lifeops_bench --agent hermes --suite core --domain finance |
| 81 | +python -m eliza_lifeops_bench --agent hermes --suite core --domain travel |
| 82 | +python -m eliza_lifeops_bench --agent hermes --suite core --domain health |
| 83 | +python -m eliza_lifeops_bench --agent hermes --suite core --domain sleep |
| 84 | + |
| 85 | +# Or the full core suite in one shot: |
| 86 | +bun run lifeops:multi-tier:core |
| 87 | +``` |
| 88 | + |
| 89 | +**Estimated runtime**: ~5–10 min per domain × 7 domains = 35–70 min. |
| 90 | + |
| 91 | +**Action after**: update `rebaseline-report.md` with per-domain numbers. |
| 92 | +Anything < 0.30 pass-rate is a candidate for targeted scenario-level |
| 93 | +investigation. |
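A minimal sketch of the triage rule above, assuming the per-domain pass rates have already been collected into a dictionary. The helper name and the input shape are hypothetical. Only the 0.30 threshold and the 0.494 `mail` figure come from this runbook; the other numbers are placeholders.

```python
from typing import Dict, List

INVESTIGATE_BELOW = 0.30  # threshold stated in this runbook


def domains_to_investigate(
    pass_rates: Dict[str, float],
    threshold: float = INVESTIGATE_BELOW,
) -> List[str]:
    """Return domains whose pass rate is strictly below the threshold."""
    return sorted(d for d, rate in pass_rates.items() if rate < threshold)


if __name__ == "__main__":
    # 0.494 for mail is the W1-3 figure quoted above; the rest are made up.
    rates = {"mail": 0.494, "reminders": 0.31, "finance": 0.12}
    print(domains_to_investigate(rates))
```

The comparison is strict, matching the `< 0.30` wording: a domain sitting exactly at 0.30 is not flagged.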
| 94 | + |
| 95 | +--- |
| 96 | + |
| 97 | +## 4. Plumb hermes per-turn cost+latency (P3#4 / F4) |
| 98 | + |
| 99 | +**Status**: in progress under Wave 6-F4 (`wave-6-f4` commit). After it |
| 100 | +lands, confirm with: |
| 101 | +```bash |
| 102 | +cd packages/benchmarks/lifeops-bench && python -m pytest tests/test_unified_telemetry.py -v |
| 103 | +``` |
| 104 | + |
| 105 | +--- |
| 106 | + |
| 107 | +## 5. smoke_static_calendar_01 "scheduled, deep work" re-baseline (P3#5) |
| 108 | + |
| 109 | +**Why**: this scenario's required-output substring, together with W4-D's |
| 110 | +`BLOCK` simile fix, should now let it pass, but the rebaseline did not |
| 111 | +re-run it specifically. |
| 112 | + |
| 113 | +**Command**: |
| 114 | +```bash |
| 115 | +python -m eliza_lifeops_bench \ |
| 116 | + --agent hermes \ |
| 117 | + --scenario smoke_static_calendar_01 \ |
| 118 | + --seeds 5 |
| 119 | +``` |
| 120 | + |
| 121 | +**Estimated runtime**: 1–2 min. |
| 122 | + |
| 123 | +**Action after**: if it still fails, dump the agent transcript and check |
| 124 | +whether the substring match is too strict (look at `scorer.py`'s |
| 125 | +substring-match logic). |
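To judge whether a match is too strict, it can help to compare an exact substring test against a normalized one on the dumped transcript. The sketch below is not `scorer.py`'s logic. The function names are hypothetical, the transcript line is invented, and treating the quoted string from the heading as the required substring is an assumption.

```python
import re


def strict_match(required: str, transcript: str) -> bool:
    """Exact, case-sensitive substring test."""
    return required in transcript


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def relaxed_match(required: str, transcript: str) -> bool:
    """Substring test after normalizing both sides."""
    return normalize(required) in normalize(transcript)


if __name__ == "__main__":
    required = "scheduled, deep work"  # assumed from the section heading
    transcript = "Scheduled: Deep Work block from 9 to 11."  # invented example
    print(strict_match(required, transcript), relaxed_match(required, transcript))
```

If the strict test fails while the relaxed one passes on the real transcript, the failure is probably about casing or punctuation rather than agent behaviour.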
| 126 | + |
| 127 | +--- |
| 128 | + |
| 129 | +## 6. eliza-1-* bundle `final` flips (P3#6) |
| 130 | + |
| 131 | +**Why**: all 5 eliza-1 bundles are currently `releaseState=local-standin`, |
| 132 | +`publishEligible=false`, `final.weights=false`. The aggregator stamps a |
| 133 | +PRE-RELEASE banner on every report that uses them. To remove the banner, |
| 134 | +each bundle must clear its per-bundle checklist. |
| 135 | + |
| 136 | +**Per-bundle checklist** — see [`eliza-1-status.md`](./eliza-1-status.md). |
| 137 | +Per bundle, the operator must: |
| 138 | +- Ship final weights (not local-standin). |
| 139 | +- Validate `sha256`. |
| 140 | +- Set `releaseState: "final"` in the bundle `manifest.json`. |
| 141 | +- Flip `publishEligible: true` and `final.weights: true`. |
| 142 | + |
| 143 | +**Verification per bundle**: |
| 144 | +```bash |
| 145 | +bun -e "import('@elizaos-benchmarks/lib').then(m => |
| 146 | + m.readElizaOneBundle('$HOME/.eliza/local-inference/models/eliza-1-0.6b.bundle') |
| 147 | + .then(b => console.log({bundleId: b.bundleId, preRelease: m.bundleIsPreRelease(b)})))" |
| 148 | +``` |
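The manifest-side items of the per-bundle checklist could also be checked offline before running the command above. This is a sketch under stated assumptions: that all three flags live in `manifest.json`, and that `final.weights` is stored as a nested `final` object with a `weights` key. The helper names are hypothetical. How the expected `sha256` is recorded is not specified here, so the hash helper only computes a digest for manual comparison.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_blockers(manifest: dict) -> List[str]:
    """List the checklist flags that still block a `final` flip."""
    blockers = []
    if manifest.get("releaseState") != "final":
        blockers.append("releaseState is not 'final'")
    if manifest.get("publishEligible") is not True:
        blockers.append("publishEligible is not true")
    final = manifest.get("final")  # assumed nested layout: {"final": {"weights": ...}}
    weights = final.get("weights") if isinstance(final, dict) else None
    if weights is not True:
        blockers.append("final.weights is not true")
    return blockers


if __name__ == "__main__":
    # Mirrors the current state described above for every eliza-1 bundle.
    current = {
        "releaseState": "local-standin",
        "publishEligible": False,
        "final": {"weights": False},
    }
    for line in manifest_blockers(current):
        print(line)
```

An empty list means the manifest flags are consistent with a final release. It says nothing about whether the weights themselves are final.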
| 149 | + |
| 150 | +**Owner**: eliza-1 inference team. |
| 151 | + |
| 152 | +--- |
| 153 | + |
| 154 | +## 7. DFlash drafters for 0.6B and 1.7B (P2#2) |
| 155 | + |
| 156 | +**Why**: per [`eliza-1-status.md`](./eliza-1-status.md), the dflash |
| 157 | +server falls back to base weights for the 0.6B and 1.7B bundles, so |
| 158 | +those bundles lose speculative-decoding throughput. |
| 159 | + |
| 160 | +**Owner**: eliza-1 inference team. Track in |
| 161 | +[`eliza-1-status.md`](./eliza-1-status.md) per bundle. |
| 162 | + |
| 163 | +--- |
| 164 | + |
| 165 | +## 8. Personality model-level gaps (P2#5) |
| 166 | + |
| 167 | +**Why**: two scenarios fail across all agents: |
| 168 | +- `hold_style.aggressive.code.004` |
| 169 | +- `escalation.aggressive.code.004` |
| 170 | + |
| 171 | +Per [`rebaseline-report.md`](./rebaseline-report.md), this is a Cerebras |
| 172 | +gpt-oss-120b instruction-following limitation under aggressive register. |
| 173 | +Not a harness bug. |
| 174 | + |
| 175 | +**Action**: document as a known model limitation. Revisit on next model |
| 176 | +upgrade (e.g. when gpt-oss-180b ships, or when the Cerebras-served |
| 177 | +fine-tune of gpt-oss arrives). |
| 178 | + |
| 179 | +--- |
| 180 | + |
| 181 | +## Verification commands after every runbook item |
| 182 | + |
| 183 | +```bash |
| 184 | +bun run test:cache-stability |
| 185 | +bun test packages/benchmarks/lib/src/__tests__/ |
| 186 | +bun test plugins/app-training/src/dspy/__tests__/ |
| 187 | +cd packages/benchmarks/lifeops-bench && python -m pytest tests/ -q |
| 188 | +``` |
| 189 | + |
| 190 | +All must remain green. If anything regresses, fix before continuing. |
| 191 | + |
| 192 | +--- |
| 193 | + |
| 194 | +## Final close-out checklist |
| 195 | + |
| 196 | +- [x] All P1 items committed (`wave-6-f1` manifest + validator, |
| 197 | + `wave-6-f2` DSPy count reconcile). |
| 198 | +- [x] All addressable P2 items committed (`wave-6-f3` Cerebras endpoint). |
| 199 | +- [x] All addressable P3 items committed (`wave-6-f4` per-turn cost). |
| 200 | +- [x] Runbook published (this file, `wave-6-f5`). |
| 201 | +- [ ] Runbook items 1–3, 5, 6, 7, 8 acknowledged + scheduled by operator. |
| 202 | +- [ ] `git push origin develop` once operator approves. |