Add raw episodic read-time-judge control #17
Open
CynthiaOmovoiye wants to merge 1 commit into
Motivated by arxiv 2605.12978, "Useful Memories Become Faulty When Continuously Updated by LLMs", which found that repeatedly rewriting memory degrades it and that keeping raw traces and deciding at read time beat the rewriting approaches. Recall Lab's sleep job is a rewriting loop, so this is the sharpest test of whether consolidation earns its place.
controls/episodic.py (EpisodicJudgeAgent): keep every statement verbatim, inject the whole log each turn, decide the current answer at read time. No compression, no supersede, no consolidation. Reports last_input_tokens so the growing raw-history bill can be charted against the brief's flat one.
Wired into multiday_trial (--agents episodic, folded into all) via run_episodic_trial, and into variance.py's lineup.
Tests cover the raw-history plumbing offline: verbatim retention, oldest-first order, nothing dropped, and the API-key guard. 41 tests pass, ruff clean.
The comparison this answers: does validity-state consolidation beat keeping everything raw, on accuracy and on input-token cost as the log grows? The crossover is the Chapter 3 result.
README, protocol.md conditions, and the research log updated. Live run not done yet; runner and control are in place.
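The control described above can be sketched roughly as follows. This is an illustrative stand-in, not the code in controls/episodic.py: the class name EpisodicJudgeSketch, the injectable judge callable, the observe/build_prompt/answer methods, and the whitespace word count used as a token proxy are all assumptions made for the example. Only the behaviour (verbatim retention, oldest-first injection of the whole log, a last_input_tokens report) comes from the description.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EpisodicJudgeSketch:
    """Illustrative stand-in for a raw-log, read-time-judge control."""

    # Hypothetical hook: receives the full prompt and returns an answer.
    # The real control presumably calls a model provider here.
    judge: Callable[[str], str]
    log: List[str] = field(default_factory=list)
    last_input_tokens: int = 0

    def observe(self, statement: str) -> None:
        # Append verbatim: no compression, no supersede, no consolidation.
        self.log.append(statement)

    def build_prompt(self, question: str) -> str:
        # Inject the whole log, oldest first, ahead of the question.
        history = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(self.log))
        return f"History (oldest first):\n{history}\n\nQuestion: {question}"

    def answer(self, question: str) -> str:
        prompt = self.build_prompt(question)
        # Assumption: whitespace word count as a crude proxy for input tokens.
        self.last_input_tokens = len(prompt.split())
        return self.judge(prompt)


# Usage with a stub judge, so the sketch runs offline without an API key.
agent = EpisodicJudgeSketch(judge=lambda prompt: "stub answer")
agent.observe("I live in Lagos.")
agent.observe("I moved to Accra last week.")
reply = agent.answer("Where do I live now?")
```

Because the judge is passed in, the plumbing (what is stored, in what order, how large the prompt gets) can be checked without a live model, which mirrors the offline tests the description mentions.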
Adds the "just keep everything" baseline, motivated by a paper that cuts directly at Recall Lab's premise.
Why
arxiv 2605.12978, "Useful Memories Become Faulty When Continuously Updated by LLMs", tested repeated LLM rewriting of memory and found that the memory degrades over time; keeping raw traces and deciding at read time beat the rewriting approaches. Recall Lab's sleep job is a rewriting loop (compress, supersede, re-render the brief daily), so this baseline is the sharpest test of whether consolidation earns its place. The brief has one protection: the raw episodic log is never discarded, and the brief is only derived from it. This control makes that design defensible by measuring it.
What
- controls/episodic.py (EpisodicJudgeAgent): keep every statement verbatim, inject the whole log each turn, decide the current answer at read time. No compression, no supersede, no consolidation. Reports last_input_tokens so the growing raw bill charts against the brief's flat one.
- Wired into multiday_trial (--agents episodic, folded into all) via run_episodic_trial, and into variance.py's lineup.
- tests/test_episodic_control.py: verbatim retention, oldest-first order, nothing dropped/compressed, API-key guard. 41 tests pass, ruff clean.
- README, protocol.md conditions, and research-log.md (paper + rationale + next step) all updated.
The comparison it sets up
Does validity-state consolidation beat keeping everything raw?
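To make the cost side of that question concrete, here is a minimal sketch of how a raw log's per-turn input size grows while a fixed-size brief stays flat. The function names and every number below are placeholders chosen only to exercise the code; they are not measurements from Recall Lab, and the "first turn where raw exceeds the brief" is just one hypothetical way to read the cost comparison.

```python
from typing import List, Optional


def raw_input_tokens(statement_tokens: List[int]) -> List[int]:
    """Per-turn input size when the whole log is injected every turn."""
    totals: List[int] = []
    running = 0
    for n in statement_tokens:
        running += n  # each new statement is kept, so the prompt only grows
        totals.append(running)
    return totals


def first_turn_raw_exceeds_brief(raw: List[int], brief_tokens: int) -> Optional[int]:
    """1-based turn at which the raw log first costs more than the flat brief."""
    for turn, tokens in enumerate(raw, start=1):
        if tokens > brief_tokens:
            return turn
    return None  # the raw log never overtook the brief in this window


# Placeholder inputs: five statements of 40 tokens each, a 100-token brief.
raw = raw_input_tokens([40, 40, 40, 40, 40])
print(raw)                                     # [40, 80, 120, 160, 200]
print(first_turn_raw_exceeds_brief(raw, 100))  # 3
```

Cost is only half of the comparison; accuracy still has to come from a live run, which a token count cannot stand in for.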
Past, no longer current labelling wins.
Not done in this PR
The live --agents episodic campaign on the relocation chain (pinned provider) is the next job. Report either way before widening a claim: if raw beats the brief on accuracy here, the paper is right for this setup and the sleep job must justify itself on cost or a longer scenario.
🤖 Generated with Claude Code