Add raw episodic read-time-judge control by CynthiaOmovoiye · Pull Request #17 · CynthiaOmovoiye/recall-lab

CynthiaOmovoiye · 2026-06-02T14:15:11Z

Adds the "just keep everything" baseline, motivated by a paper that cuts directly at Recall Lab's premise.

Why

arXiv 2605.12978, "Useful Memories Become Faulty When Continuously Updated by LLMs", tested repeated LLM rewriting of memory and found that the rewritten memory degrades over time; keeping raw traces and deciding at read time beat the rewriting approaches. Recall Lab's sleep job is a rewriting loop (compress, supersede, re-render the brief daily), so this baseline is the sharpest test of whether consolidation earns its place. The brief's one protection is that the raw episodic log is never discarded and the brief is only derived from it. This control turns that argument into a measurement.

What

controls/episodic.py (EpisodicJudgeAgent): keep every statement verbatim, inject the whole log each turn, decide the current answer at read time. No compression, no supersede, no consolidation. Reports last_input_tokens so the growing raw bill charts against the brief's flat one.
Wired into multiday_trial (--agents episodic, folded into all) via run_episodic_trial, and into variance.py's lineup.
tests/test_episodic_control.py: verbatim retention, oldest-first order, nothing dropped/compressed, API-key guard. 41 tests pass, ruff clean.
README controls list + status, protocol.md conditions, and research-log.md (paper + rationale + next step) all updated.
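A minimal sketch of the read-time-judge idea described above, for readers who have not opened the diff. It is not the code in controls/episodic.py: the class name EpisodicJudgeSketch, the observe/answer methods, the prompt wording, and the fake_model stub are all hypothetical, and the whitespace word count only stands in for whatever token usage the real provider reports.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EpisodicJudgeSketch:
    """Hypothetical stand-in for EpisodicJudgeAgent (assumed shape, not the real class)."""

    # Assumed interface: any callable mapping a prompt string to an answer string.
    complete: Callable[[str], str]
    log: List[str] = field(default_factory=list)
    last_input_tokens: int = 0

    def observe(self, statement: str) -> None:
        # Append verbatim: no compression, no supersede, no consolidation.
        self.log.append(statement)

    def build_prompt(self, question: str) -> str:
        # Inject the whole log, oldest first, and leave the judgement to read time.
        history = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(self.log))
        return (
            "Statements in the order they were made (oldest first):\n"
            f"{history}\n\n"
            f"Question: {question}\n"
            "Answer with what is true now."
        )

    def answer(self, question: str) -> str:
        prompt = self.build_prompt(question)
        # Crude whitespace count as a placeholder for a provider-reported token count.
        self.last_input_tokens = len(prompt.split())
        return self.complete(prompt)


def fake_model(prompt: str) -> str:
    # Offline stub so the sketch runs without an API key: returns the newest logged statement.
    numbered = [ln for ln in prompt.splitlines() if ln[:1].isdigit()]
    return numbered[-1].split(". ", 1)[1]


agent = EpisodicJudgeSketch(complete=fake_model)
agent.observe("I live in Lagos.")
agent.observe("I moved to Accra last week.")
first = agent.answer("Where do I live?")
tokens_before = agent.last_input_tokens

agent.observe("Correction: I am back in Lagos.")
second = agent.answer("Where do I live?")
tokens_after = agent.last_input_tokens
```

The point of the sketch is the shape of the cost: nothing is ever removed from the log, so last_input_tokens can only grow from one turn to the next.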

The comparison it sets up

Does validity-state consolidation beat keeping everything raw?

Accuracy: if raw history confuses the model on a long correction chain, the brief's explicit "Past, no longer current" labelling wins.
Cost: even if raw ties on accuracy on the short relocation chain, it pays a growing input-token bill while the brief stays bounded. The crossover is the Chapter 3 result.
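One way to picture the cost argument, as a sketch under stated assumptions rather than a measured result. It assumes one new statement per day, a fixed per-statement token cost, a fixed prompt overhead, and a brief whose size is capped; the function names and every number below are illustrative, not figures from a run.

```python
from typing import Optional


def raw_input_tokens(day: int, per_statement: int, overhead: int) -> int:
    # Raw control: every statement so far is re-sent, so input grows linearly with the log.
    return overhead + day * per_statement


def brief_input_tokens(day: int, per_statement: int, overhead: int, brief_cap: int) -> int:
    # Assumed model of the brief: it grows like the raw log until it reaches a cap, then stays flat.
    return overhead + min(day * per_statement, brief_cap)


def first_day_raw_costs_more(
    per_statement: int, overhead: int, brief_cap: int, horizon: int
) -> Optional[int]:
    # Scan forward and return the first day on which the raw prompt is strictly larger.
    for day in range(1, horizon + 1):
        raw = raw_input_tokens(day, per_statement, overhead)
        brief = brief_input_tokens(day, per_statement, overhead, brief_cap)
        if raw > brief:
            return day
    return None


# Illustrative numbers only: 40 tokens per statement, 200 tokens of overhead, brief capped at 300.
crossover = first_day_raw_costs_more(per_statement=40, overhead=200, brief_cap=300, horizon=30)
short_run = first_day_raw_costs_more(per_statement=40, overhead=200, brief_cap=300, horizon=5)
```

Under these made-up numbers the raw prompt first exceeds the brief on day 8, and a 5-day scenario never reaches the crossover, which is why a short relocation chain alone may not show the cost difference.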

Not done in this PR

Live run. Runner and control are in place; the --agents episodic campaign on the relocation chain (pinned provider) is the next job. Report the result either way before widening a claim: if raw beats the brief on accuracy here, the paper is right for this setup and the sleep job must justify itself on cost or on a longer scenario.

🤖 Generated with Claude Code


CynthiaOmovoiye wants to merge 1 commit into main from feat/episodic-baseline-control
