Environment
- claude-mem 13.8.0 (observed; appears to be a regression vs 13.6.2, which captured tool work fine)
- Windows 11, bun runtime, worker HTTP on 127.0.0.1:37777
- Generator model: claude-sonnet-4-5
Summary
Across an entire multi-hour session, the observer/generator captured only user prompts and produced essentially zero observations of actual work (every generated observation had files_read: []). The worker logs show the generator session being "poisoned" and respawned every ~30–60s for hours.
Root cause: the parser treats a benign generator response (the model saying "nothing to record yet" as plain prose, or returning an empty/idle response) as an invalid output. After Y9 = 3 consecutive such outputs the session is "poisoned": conversationHistory is reset to [] and the SDK session respawns (losing context, Issue #817, preservedPending: 0). Because the wipe destroys the very context the observer needs to accumulate, the next checkpoint again looks like "just a user prompt → nothing to record → prose", and the loop sustains itself indefinitely. No work is ever captured.
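The feedback loop described above can be reproduced in a toy model. This is only a sketch: the function name, the "needs 4 accumulated events" rule, and the counter reset after a wipe are illustrative assumptions, not claude-mem code. Only the threshold value of 3 comes from the report.

```javascript
// Toy model of the poison loop. All names are hypothetical.
// Only the threshold value (3) is taken from the report.
const THRESHOLD = 3;

function simulate(checkpoints, poisonOnCount) {
  let history = [];
  let consecutiveInvalid = 0;
  let poisonEvents = 0;
  let observations = 0;

  for (let i = 0; i < checkpoints; i++) {
    history.push(`event-${i}`);

    // Toy assumption: the generator only has something worth recording
    // once it has accumulated at least 4 events; before that it answers
    // in plain prose, which the parser counts as invalid.
    const valid = history.length >= 4;
    if (valid) {
      observations++;
      consecutiveInvalid = 0;
      continue;
    }

    consecutiveInvalid++;
    if (poisonOnCount && consecutiveInvalid >= THRESHOLD) {
      // Poison: wipe the accumulated context and start over.
      history = [];
      consecutiveInvalid = 0;
      poisonEvents++;
    }
  }
  return { poisonEvents, observations };
}

console.log(simulate(12, true));  // count-based poisoning (current behavior)
console.log(simulate(12, false)); // poisoning disabled for benign output
```

In this toy model, 12 checkpoints with count-based poisoning give 4 wipes and 0 observations, because the history never grows past 3 events. Without it, the same 12 checkpoints give 0 wipes and 9 observations.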
Code (worker-service.cjs, function Bc)
// M9(t): output is "valid" only if it contains <observation>/<summary>/<skip_summary/>
// D9(t): outputClass -> "idle" | "poisoned" (matches Cwe API-failure markers) | "xml" | "prose"
if (!d.valid) {
  let k = D9(t), I = N9(t);
  if (e.consecutiveInvalidOutputs = (e.consecutiveInvalidOutputs ?? 0) + 1,
      g.warn("PARSER", `${a} returned non-XML ${k} response — ignoring queued batch`, {...}),
      k === "poisoned" || e.consecutiveInvalidOutputs >= Y9) { // <-- BUG
    // poison: respawnPoisonedSession -> conversationHistory=[] wipe
  }
  ...
}
The problem is the clause || e.consecutiveInvalidOutputs >= Y9: benign prose/idle outputs (the model legitimately having nothing to record) count toward the threshold and trigger a context-destroying poison. The Cwe markers (the genuine poison markers) indicate real API failures ("context window", "session exhausted", "prompt is too long", …); prose and idle outputs are not failures.
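To make the distinction concrete, here is a hedged sketch of a classifier in the spirit of D9. The function names, the order of the checks, and the exact matching rules are assumptions for illustration. The minified source may differ, and the marker list holds only the three examples quoted in this report, not the full Cwe set.

```javascript
// Hypothetical stand-in for the Cwe markers: only the three examples
// quoted in this report, not the real (longer) list.
const FAILURE_MARKERS = ["context window", "session exhausted", "prompt is too long"];

// Sketch of an output classifier returning the four classes named above.
function classifyOutput(text) {
  const trimmed = (text ?? "").trim();
  if (trimmed === "") return "idle";

  const lower = trimmed.toLowerCase();
  if (FAILURE_MARKERS.some((marker) => lower.includes(marker))) return "poisoned";

  // Assumption: anything carrying one of the expected tags counts as "xml".
  if (/<(observation|summary|skip_summary)\b/.test(trimmed)) return "xml";

  return "prose";
}

// Proposed rule: only a genuine API-failure marker justifies a respawn.
function shouldPoison(outputClass) {
  return outputClass === "poisoned";
}
```

Under these assumptions, the prose preview from the logs ("I'm observing the primary session, but I don't see any technical work being performed yet...") classifies as prose and never triggers a respawn, however many times it repeats.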
Evidence (logs)
# Every cycle, for hours:
[WARN ] [PARSER] SDK returned non-XML prose response — ignoring queued batch {outputClass=prose, consecutiveInvalidOutputs=3}
[ERROR] [SESSION] SDK session poisoned — killing and respawning {outputClass=prose, consecutiveInvalidOutputs=3, threshold=3}
[WARN ] [SESSION] Respawning poisoned SDK session, preserving pending messages {preservedPending=0}
[WARN ] [SESSION] Discarding stale memory_session_id from previous worker instance (Issue #817)
# prose previews show the model only ever sees the prompt:
"I'm observing the primary session, but I don't see any technical work being performed yet - only the user's request."
# Net result: only prompt-echo observations, all with files_read: []
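For anyone checking whether they are affected, a small helper along these lines could tally poison events per outputClass from a worker log. It is a sketch that assumes the log lines look exactly like the excerpt above. The function name is hypothetical and the helper is not part of claude-mem.

```javascript
// Counts "SDK session poisoned" log lines, grouped by outputClass.
// Assumes the line format shown in the excerpt above.
function summarizePoisonEvents(logText) {
  const counts = {};
  for (const line of logText.split(/\r?\n/)) {
    const match = line.match(/SDK session poisoned[^{]*\{[^}]*outputClass=(\w+)/);
    if (match) {
      counts[match[1]] = (counts[match[1]] ?? 0) + 1;
    }
  }
  return counts;
}
```

A log dominated by outputClass=prose (or idle) rather than poisoned would suggest the loop described in this report rather than genuine API failures.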
Suggested fix (one condition)
Only poison on a genuine API-failure marker, never on accumulated benign prose/idle:
- k === "poisoned" || e.consecutiveInvalidOutputs >= Y9
+ k === "poisoned"
(Or, more conservatively: do not increment consecutiveInvalidOutputs for k === "prose" || k === "idle"; treat them as a no-op skip, equivalent to <skip_summary/>.)
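The conservative variant could look roughly like the sketch below. The function name, the "respawn"/"skip" return values, and the constant name are hypothetical; the real handler lives inside the minified Bc function and does more than this.

```javascript
// Hypothetical name for the Y9 threshold (value 3 per the report).
const INVALID_OUTPUT_LIMIT = 3;

// Decides what to do with an invalid (non-XML) generator output.
// Returns "respawn" to poison the session, or "skip" to ignore the batch.
function handleInvalidOutput(session, outputClass) {
  // Genuine API-failure marker: respawn immediately.
  if (outputClass === "poisoned") return "respawn";

  // Benign output: treat as a no-op skip and leave the counter untouched.
  if (outputClass === "prose" || outputClass === "idle") return "skip";

  // Any other invalid class still counts toward the threshold.
  session.consecutiveInvalidOutputs = (session.consecutiveInvalidOutputs ?? 0) + 1;
  return session.consecutiveInvalidOutputs >= INVALID_OUTPUT_LIMIT ? "respawn" : "skip";
}
```

This keeps the threshold as a safety net for other kinds of invalid output, while prose and idle responses can repeat indefinitely without wiping conversationHistory.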
Verification of the fix (applied locally, before/after on the same session 327)
| | Before patch | After patch |
| --- | --- | --- |
| Poison events | every ~30–60s for hours | none (count frozen) |
| consecutiveInvalidOutputs=3/4 | → poison + context wipe | ignored, session survives |
| Observations stored | 1 in hours (prompt echo) | 3 in minutes, real content |
| files_read | [] | populated (actual files read) |
After the patch, the generator survives the early "idle/prose" checkpoints, accumulates the real tool activity, and emits and stores genuine <observation> XML ([DB] STORING | obsCount=1).
Related
Distinct from the worker-recycle/zombie issue (#3031); this is the observation-generation pipeline. Also touches Issue #817 (SDK context lost on respawn) — the poison loop makes #817 fire continuously.