Observation generation stuck in a self-sustaining poison→respawn loop: benign "prose"/"idle" SDK output counts toward the poison threshold, wiping context and dropping all captured work #3032

Description

@SvetZitrka

Environment

  • claude-mem 13.8.0 (observed; appears to be a regression vs 13.6.2, which captured tool work fine)
  • Windows 11, bun runtime, worker HTTP on 127.0.0.1:37777
  • Generator model: claude-sonnet-4-5

Summary

Across an entire multi-hour session, the observer/generator captured only user prompts and produced essentially zero observations of actual work (every generated observation had files_read: []). The worker logs show the generator session being "poisoned" and respawned every ~30–60s for hours.

Root cause: the parser treats a benign generator response (the model saying "nothing to record yet" as plain prose, or returning an empty/idle response) as invalid output. After Y9 = 3 consecutive such outputs the session is "poisoned": conversationHistory is reset to [] and the SDK session respawns, losing its context (Issue #817, preservedPending: 0). Because the wipe destroys the very context the observer needs to accumulate, the next checkpoint again looks like "just a user prompt → nothing to record → prose", and the loop sustains itself indefinitely. No work is ever captured.
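To make the self-sustaining part concrete, here is a toy model of the loop. It is not claude-mem code: `simulate`, `THRESHOLD`, and the assumption that the observer needs a fixed number of accumulated checkpoints before it can emit an observation are all hypothetical. Only the threshold value (3) and the wipe-on-poison behavior are taken from this report.

```javascript
// Toy model of the poison -> respawn loop. All names are hypothetical.
const THRESHOLD = 3; // stands in for Y9

function simulate(checkpoints, contextNeeded, poisonOnCount) {
  let history = []; // stands in for conversationHistory
  let consecutiveInvalid = 0;
  let poisonEvents = 0;
  let observations = 0;

  for (let i = 0; i < checkpoints; i++) {
    history.push(`checkpoint ${i}`);
    // Assumption: the observer only sees enough work to emit
    // <observation> XML once `contextNeeded` entries have accumulated.
    const valid = history.length >= contextNeeded;
    if (valid) {
      observations++;
      consecutiveInvalid = 0;
      continue;
    }
    consecutiveInvalid++;
    if (poisonOnCount && consecutiveInvalid >= THRESHOLD) {
      history = []; // the context wipe that restarts the cycle
      consecutiveInvalid = 0;
      poisonEvents++;
    }
  }
  return { poisonEvents, observations };
}

const buggy = simulate(30, 4, true);  // counting benign output toward poison
const fixed = simulate(30, 4, false); // never poisoning on the count alone
console.log(buggy); // 10 poison events, 0 observations
console.log(fixed); // 0 poison events, 27 observations
```

Whenever the context needed exceeds the threshold, the wipe always lands before the observer has enough history, so the buggy run never produces an observation no matter how long it runs.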

Code (worker-service.cjs, function Bc)

// M9(t): output is "valid" only if it contains <observation>/<summary>/<skip_summary/>
// D9(t): outputClass -> "idle" | "poisoned" (matches Cwe API-failure markers) | "xml" | "prose"
if (!d.valid) {
  let k = D9(t), I = N9(t);
  if (e.consecutiveInvalidOutputs = (e.consecutiveInvalidOutputs ?? 0) + 1,
      g.warn("PARSER", `${a} returned non-XML ${k} response — ignoring queued batch`, {...}),
      k === "poisoned" || e.consecutiveInvalidOutputs >= Y9) {   // <-- BUG
        // poison: respawnPoisonedSession -> conversationHistory=[] wipe
  }
  ...
}

The problem is the second clause, || e.consecutiveInvalidOutputs >= Y9: benign prose/idle output (the model legitimately having nothing to record) accumulates toward the threshold and triggers a context-destroying poison. The Cwe markers (the genuine poison markers) match real API failures ("context window", "session exhausted", "prompt is too long", …); prose/idle outputs are not failures.
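For readers who do not want to dig through the minified bundle, a rough sketch of the four-way classification follows. This is a hypothetical re-implementation, not the actual D9: `classifyOutput` and `FAILURE_MARKERS` are invented names, the marker list contains only the three strings quoted above (the real Cwe list is longer), and the check order, the case-insensitive match, and the "any tag-like text is xml" rule are assumptions.

```javascript
// Hypothetical stand-in for D9; only meant for outputs that already failed
// the validity check (no <observation>/<summary>/<skip_summary/>).
const FAILURE_MARKERS = [
  "context window",
  "session exhausted",
  "prompt is too long",
]; // partial: only the markers quoted in this report

function classifyOutput(text) {
  const trimmed = (text ?? "").trim();
  if (trimmed === "") return "idle"; // empty response
  const lower = trimmed.toLowerCase();
  if (FAILURE_MARKERS.some((marker) => lower.includes(marker))) {
    return "poisoned"; // genuine API-failure text
  }
  // Assumption: anything tag-like that still failed validation counts as "xml".
  if (/<[a-z_]+[^>]*>/i.test(trimmed)) return "xml";
  return "prose"; // plain-language "nothing to record yet"
}

console.log(classifyOutput(""));                          // "idle"
console.log(classifyOutput("Error: prompt is too long")); // "poisoned"
console.log(classifyOutput("I don't see any technical work yet.")); // "prose"
```

Under this reading, only the "poisoned" class says anything about the health of the SDK session; "idle" and "prose" only say the model had nothing to report.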

Evidence (logs)

# Every cycle, for hours:
[WARN ] [PARSER] SDK returned non-XML prose response — ignoring queued batch {outputClass=prose, consecutiveInvalidOutputs=3}
[ERROR] [SESSION] SDK session poisoned — killing and respawning {outputClass=prose, consecutiveInvalidOutputs=3, threshold=3}
[WARN ] [SESSION] Respawning poisoned SDK session, preserving pending messages {preservedPending=0}
[WARN ] [SESSION] Discarding stale memory_session_id from previous worker instance (Issue #817)
# prose previews show the model only ever sees the prompt:
"I'm observing the primary session, but I don't see any technical work being performed yet - only the user's request."
# Net result: only prompt-echo observations, all with files_read: []

Suggested fix (one condition)

Only poison on a genuine API-failure marker, never on accumulated benign prose/idle:

-  k === "poisoned" || e.consecutiveInvalidOutputs >= Y9
+  k === "poisoned"

(Or, more conservatively: do not increment consecutiveInvalidOutputs for k === "prose" || k === "idle"; treat them as a no-op skip, equivalent to <skip_summary/>.)
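A minimal sketch of the conservative variant, written with readable names instead of the minified ones. `handleInvalidOutput`, `POISON_THRESHOLD`, and the returned action strings are hypothetical; the real code logs a warning and calls respawnPoisonedSession rather than returning a value.

```javascript
// Sketch only: prose/idle become a no-op skip, everything else keeps
// the existing counting behavior.
const POISON_THRESHOLD = 3; // stands in for Y9

function handleInvalidOutput(session, outputClass) {
  if (outputClass === "prose" || outputClass === "idle") {
    // Benign: equivalent to <skip_summary/>. Counter and history untouched.
    return "skip";
  }
  session.consecutiveInvalidOutputs =
    (session.consecutiveInvalidOutputs ?? 0) + 1;
  if (
    outputClass === "poisoned" ||
    session.consecutiveInvalidOutputs >= POISON_THRESHOLD
  ) {
    return "respawn"; // genuine failure, or repeated malformed output
  }
  return "ignore"; // drop this batch, keep the session alive
}

const session = {};
for (let i = 0; i < 10; i++) handleInvalidOutput(session, "prose");
console.log(session.consecutiveInvalidOutputs ?? 0); // 0
console.log(handleInvalidOutput(session, "poisoned")); // "respawn"
```

Compared with the one-condition fix, this keeps the threshold as a guard against repeated malformed XML while still preventing benign checkpoints from ever wiping context.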

Verification of the fix (applied locally, before/after on the same session 327)

|                               | Before patch              | After patch                    |
|-------------------------------|---------------------------|--------------------------------|
| Poison events                 | every ~30–60s for hours   | none (count frozen)            |
| consecutiveInvalidOutputs=3/4 | → poison + context wipe   | ignored, session survives      |
| Observations stored           | 1 in hours (prompt echo)  | 3 in minutes, real content     |
| files_read                    | []                        | populated (actual files read)  |

After the patch, the generator survives the early "idle/prose" checkpoints, accumulates the real tool activity, and emits + stores genuine <observation> XML ([DB] STORING | obsCount=1).

Related

Distinct from the worker-recycle/zombie issue (#3031); this is the observation-generation pipeline. Also touches Issue #817 (SDK context lost on respawn) — the poison loop makes #817 fire continuously.
