Add raw episodic read-time-judge control #17

Open
CynthiaOmovoiye wants to merge 1 commit into main from feat/episodic-baseline-control

@CynthiaOmovoiye

Adds the "just keep everything" baseline, motivated by a paper that cuts directly at Recall Lab's premise.

Why

arXiv 2605.12978, "Useful Memories Become Faulty When Continuously Updated by LLMs", tested repeated LLM rewriting of memory and found that the memory degrades over time; keeping raw traces and deciding at read time beat the rewriting approaches. Recall Lab's sleep job is a rewriting loop (compress, supersede, re-render the brief daily), so this baseline is the sharpest test of whether consolidation earns its place. The brief has one protection: the raw episodic log is never discarded, and the brief is derived from it. This control makes that claim defensible by measuring it.

What

  • controls/episodic.py (EpisodicJudgeAgent): keep every statement verbatim, inject the whole log each turn, decide the current answer at read time. No compression, no supersede, no consolidation. Reports last_input_tokens so the growing raw-history bill can be charted against the brief's flat one.
  • Wired into multiday_trial (--agents episodic, folded into all) via run_episodic_trial, and into variance.py's lineup.
  • tests/test_episodic_control.py: verbatim retention, oldest-first order, nothing dropped/compressed, API-key guard. 41 tests pass, ruff clean.
  • README controls list + status, protocol.md conditions, and research-log.md (paper + rationale + next step) all updated.
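
A minimal sketch of the raw-history idea described above, not the code in controls/episodic.py. The class name RawEpisodicLog, the methods remember and build_prompt, the prompt wording, and the whitespace token estimate are all assumptions made for illustration; only the last_input_tokens name comes from this PR.

```python
from dataclasses import dataclass, field


@dataclass
class RawEpisodicLog:
    """Hypothetical stand-in for the control's memory: an append-only list."""

    statements: list[str] = field(default_factory=list)
    # Name borrowed from the PR; how it is counted here is an assumption.
    last_input_tokens: int = 0

    def remember(self, statement: str) -> None:
        # Store verbatim: no compression, no supersede, no consolidation.
        self.statements.append(statement)

    def build_prompt(self, question: str) -> str:
        # Inject the whole log, oldest first, and leave the judgement
        # about what is current to read time.
        history = "\n".join(
            f"{i}. {s}" for i, s in enumerate(self.statements, start=1)
        )
        prompt = (
            "Statements in the order they were made (oldest first):\n"
            f"{history}\n\n"
            "Later statements may correct earlier ones. "
            f"Answer with what is current.\nQuestion: {question}"
        )
        # Crude whitespace estimate (assumption); a live run would read
        # the provider's reported usage instead.
        self.last_input_tokens = len(prompt.split())
        return prompt


log = RawEpisodicLog()
log.remember("I live in Lisbon.")
first = log.build_prompt("Where do I live?")
tokens_after_one = log.last_input_tokens

log.remember("Correction: I moved to Porto last week.")
second = log.build_prompt("Where do I live?")
tokens_after_two = log.last_input_tokens
```

In this sketch the superseded statement stays in the prompt next to its correction, and the estimated input size grows with every statement added. Those are the two properties the offline tests and the cost chart are meant to exercise.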

The comparison it sets up

Does validity-state consolidation beat keeping everything raw?

  • Accuracy: if raw history confuses the model on a long correction chain, the brief's explicit "Past, no longer current" labelling wins.
  • Cost: even if raw ties on accuracy on the short relocation chain, it pays a growing input-token bill while the brief stays bounded. The crossover is the Chapter 3 result.
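
The cost argument can be illustrated with a toy model. Everything here is an assumption for illustration: one statement is added per turn, each statement costs a fixed number of tokens, the brief is treated as exactly flat rather than merely bounded, and the sleep job's own consolidation cost is ignored. The function names and the numbers are hypothetical, not measurements from Recall Lab.

```python
def per_turn_input_tokens(turns, tokens_per_statement, brief_tokens):
    """Return (raw, brief) per-turn input sizes under the toy assumptions."""
    # Raw control: the whole log is injected every turn, so the input
    # grows linearly with the number of statements kept.
    raw = [turn * tokens_per_statement for turn in range(1, turns + 1)]
    # Brief: idealized as a constant-size input on every turn.
    brief = [brief_tokens] * turns
    return raw, brief


def first_crossover_turn(raw, brief):
    """Return the first 1-based turn where raw input exceeds the brief's."""
    for turn, (raw_cost, brief_cost) in enumerate(zip(raw, brief), start=1):
        if raw_cost > brief_cost:
            return turn
    return None  # no crossover within the simulated horizon


# Illustrative numbers only: 20 tokens per statement, a 300-token brief.
raw, brief = per_turn_input_tokens(30, 20, 300)
crossover = first_crossover_turn(raw, brief)  # turn 16: 16 * 20 = 320 > 300
```

On a short chain the function returns None, which matches the case above where raw may tie or win before the log has grown enough for the bill to matter.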

Not done in this PR

  • Live run. Runner and control are in place; the --agents episodic campaign on the relocation chain (pinned provider) is the next job. Report the result either way before widening a claim: if raw beats the brief on accuracy here, the paper is right for this setup and the sleep job must justify itself on cost or a longer scenario.

🤖 Generated with Claude Code

Motivated by arxiv 2605.12978, "Useful Memories Become Faulty When Continuously
Updated by LLMs", which found that repeatedly rewriting memory degrades it and
that keeping raw traces with read-time decision beat the rewriting approaches.
Recall Lab's sleep job is a rewriting loop, so this is the sharpest test of
whether consolidation earns its place.

controls/episodic.py (EpisodicJudgeAgent): keep every statement verbatim,
inject the whole log each turn, decide the current answer at read time. No
compression, no supersede, no consolidation. Reports last_input_tokens so the
growing raw-history bill can be charted against the brief's flat one.

Wired into multiday_trial (--agents episodic, folded into all) via
run_episodic_trial, and into variance.py's lineup. Tests cover the raw-history
plumbing offline: verbatim retention, oldest-first order, nothing dropped, and
the API-key guard. 41 tests pass, ruff clean.

The comparison this answers: does validity-state consolidation beat keeping
everything raw, on accuracy and on input-token cost as the log grows? The
crossover is the Chapter 3 result. README, protocol.md conditions, and the
research log updated. Live run not done yet; runner and control are in place.