v0.0.2: two-pass eval, +4.0pp lift, no regressions

The cross-domain bias from v0.0.1 is gone. Memory now comes from baseline
runs of the same task suite, so the consolidator is being measured on
what it actually has to learn instead of guessing at a foreign codebase.
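
A rough sketch of the two-pass idea described above: collect baseline runs, consolidate memory from those same runs, then rerun the identical tasks with that memory. This is not the project's `run_two_pass_eval`, whose signature this commit does not show. `Trial`, `two_pass`, `pass_rate`, and the runner and consolidator callables are all hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Trial:
    task: str
    passed: bool
    transcript: str


# Hypothetical signatures: a runner executes one task (optionally with memory),
# and a consolidator turns baseline trials into a memory string.
Runner = Callable[[str, Optional[str]], Trial]
Consolidator = Callable[[Sequence[Trial]], str]


def pass_rate(trials: Sequence[Trial]) -> float:
    """Percentage of trials that passed."""
    return 100.0 * sum(t.passed for t in trials) / len(trials)


def two_pass(
    tasks: Sequence[str], trials: int, run: Runner, consolidate: Consolidator
) -> Tuple[float, float]:
    # Pass 1: collect baseline trials with no memory.
    baseline = [run(task, None) for task in tasks for _ in range(trials)]
    # Build memory from baseline runs of the same suite (domain-matched).
    memory = consolidate(baseline)
    # Pass 2: rerun the identical tasks with the consolidated memory.
    dreamed = [run(task, memory) for task in tasks for _ in range(trials)]
    return pass_rate(baseline), pass_rate(dreamed)
```

Because both passes use the same task list, any difference between the two rates reflects what the memory adds on those tasks, not how well it transfers to an unrelated codebase.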

Result on the 15-task suite (5 trials per task per condition, 150 trials
total): baseline 92%, dreamed 96%, lift +4.0pp.

- +20pp on 07_bulk_create_members, 12_generic_repository_base, and
  14_test_translate_function
- 0 regressions (v0.0.1 had two, tasks 7 and 9, both at -20pp)
- 12 of 15 tasks are already at 100% at baseline, so they have nowhere to
  go, which is what dilutes the aggregate. Suite redesign is v0.0.3.
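
The dilution can be checked with a few lines of arithmetic. The sketch below assumes the aggregate is the unweighted mean over tasks (equivalent to pooling trials, since every task has the same trial count) and that the reported 92% and 96% are exact. The variable names are illustrative only.

```python
import math

TASKS = 15
TRIALS_PER_TASK = 5  # per condition

# Per-task deltas in percentage points, as reported: three tasks at +20pp,
# twelve unchanged (already at 100% at baseline, no regressions).
deltas = [20.0] * 3 + [0.0] * 12
aggregate_lift = sum(deltas) / TASKS

# The same result expressed in trial counts for one condition.
trials = TASKS * TRIALS_PER_TASK            # 75 per condition, 150 total
baseline_passes = round(0.92 * trials)      # passes behind the 92% baseline
dreamed_passes = round(0.96 * trials)       # passes behind the 96% dreamed run

# Extra passes over baseline needed to clear a 5pp bar.
passes_needed = baseline_passes + math.ceil(5 * trials / 100)

print(f"aggregate lift: {aggregate_lift:.1f}pp")
print(f"baseline {baseline_passes}/{trials}, dreamed {dreamed_passes}/{trials}")
print(f"passes needed for +5pp: {passes_needed}/{trials}")
```

Under these assumptions the run sits one passing trial short of the 5pp bar, and the most the current suite could ever show is +8pp, since the baseline is already at 92%.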

Missed SPEC §3's 5pp bar by 1pp. Honest read: the consolidator does its
job on every task with room to lift; the bottleneck is the suite, not
the pipeline.

What landed:

- --two-pass flag and run_two_pass_eval orchestrator (collect, consolidate,
  rerun); eval state lives in .opendream-eval/store.sqlite, never touches
  the user's ~/.opendream/db.sqlite
- ClaudeCodeRunner captures stream-json transcripts to a per-trial file;
  empty or malformed captures halt the run
- probe_claude_capture pre-flight catches CLI drift before 150 trials
  burn an afternoon
- Adapter accepts both sessionId and session_id so stream-json output
  ingests the same way as project-dir jsonl
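
Two of the items above, halting on empty or malformed captures and accepting both `sessionId` and `session_id`, can be sketched as follows. This is an illustration under assumptions, not the project's adapter. `CaptureError`, `session_id_of`, and `parse_capture` are hypothetical names, and the only record detail taken from the commit is the pair of key spellings.

```python
from __future__ import annotations

import json
from typing import Any


class CaptureError(RuntimeError):
    """Raised when a transcript capture is empty or not valid JSONL."""


def session_id_of(record: dict[str, Any]) -> str | None:
    # Accept either spelling. Checking camelCase first is an arbitrary choice.
    for key in ("sessionId", "session_id"):
        value = record.get(key)
        if isinstance(value, str) and value:
            return value
    return None


def parse_capture(text: str) -> list[dict[str, Any]]:
    """Parse one JSON object per line, failing loudly instead of skipping."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise CaptureError("empty capture")
    records: list[dict[str, Any]] = []
    for number, line in enumerate(lines, start=1):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            raise CaptureError(f"malformed JSON on line {number}") from exc
        if not isinstance(obj, dict):
            raise CaptureError(f"line {number} is not a JSON object")
        records.append(obj)
    return records
```

Raising on the first bad line, rather than silently dropping it, matches the stated intent that a broken capture should stop the run before more trials are spent.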

189 tests, ruff + mypy clean. ~$2 of API spend, ~3h wall.

SPEC.md (1 addition, 1 deletion)
@@ -52,7 +52,7 @@ Existing OSS memory layers — Letta (formerly MemGPT) and mem0 — handle stora
 3. Eval harness reports a measurable lift on the 15-task suite (baseline vs. dreamed agent, 5 trials each per task). Target ≥ 5 percentage points.
 4. README polished + 60-second demo recorded + GitHub repo public + MIT license.

-> **Note (2026-05-10):** Criterion 3 was not met by v0.0.1-alpha — measured cross-domain lift was +0.0pp aggregate (+40 / −20 / −20 / 12 ceilings per-task). The cross-domain test is itself a flawed measurement of the consolidation pass; v0.0.2 commits to the domain-matched two-pass eval (collect baselines → dream → re-run on identical tasks). Original criterion preserved here for historical record. See [`CHANGELOG.md`](CHANGELOG.md) and [`README.md`](README.md#known-limitations) for the per-task breakdown.
+> **Note (2026-05-10):** Criterion 3 was not met by v0.0.1-alpha or v0.0.2 — but the gap shrank and the failure mode changed. v0.0.1-alpha's cross-domain eval measured +0.0pp aggregate (+40 / −20 / −20 / 12 ceilings per-task) — the cross-domain test was itself flawed. v0.0.2 introduced the domain-matched two-pass eval (collect baselines → dream → re-run on identical tasks) and measured **+4.0pp aggregate, +20pp on each of 3 tasks, no regressions** — missed the 5pp bar by 1pp because **12 of 15 tasks are ceiling-effected** at 100% baseline (suite-design issue, not consolidator issue). v0.0.3 will replace the ceiling-effected tasks with harder discriminators so the 5pp bar becomes reachable. Original criterion preserved here for historical record. See [`CHANGELOG.md`](CHANGELOG.md) `[0.0.2]` and [`README.md`](README.md#v002-eval-result-2026-05-10--domain-matched-two-pass) for per-task breakdowns.