refactor(conversation): delete now-dead stripOlderReasoning (QA follow-up)

mgoldsborough · mgoldsborough · commit 4293e3317469 · 2026-06-09T09:07:23.000-10:00
Per QA on #402: now that reasoning is retained, stripOlderReasoning has no production caller. The "keep for compaction" rationale doesn't hold: compaction summarizes old turns into a reasoning-free block, so it obviates per-turn stripping rather than resurrecting it. Delete the function and its dedicated tests; git history preserves them if ever needed.

Keep applyReasoningReplayPolicy as the documented chokepoint for the invariant "retain reasoning, never strip per-turn". It guards both Anthropic cache stability and OpenAI/Gemini replay correctness, and it is the anchor the byte-stability regression test asserts against. Tighten its doc comment to stop referencing the removed helper, and add a note on the regression test so a future reader doesn't mistake it for a tautology and delete it.
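For reviewers who want the invariant in miniature, here is a hedged sketch of why a per-turn strip breaks prefix byte-stability while a passthrough does not. It is illustrative only and is not the repository's code: `Msg`, `stripPerTurn`, `retain`, and `prefixIsStable` are hypothetical names, the message shape is a simplified stand-in for `LanguageModelV3Message`, and the strip omits the reasoning-only placeholder case that the removed helper handled.

```typescript
// Simplified, hypothetical message shape (the real type is LanguageModelV3Message).
type Part = { type: "reasoning" | "text"; text: string };
type Msg = { role: "user" | "assistant"; content: Part[] };

// Illustrative per-turn strip: drop reasoning from every assistant message
// except the latest one. Assumption: this mirrors the removed behavior only
// roughly; it ignores the reasoning-only placeholder edge case.
function stripPerTurn(messages: Msg[]): Msg[] {
  const last = messages.map((m) => m.role).lastIndexOf("assistant");
  return messages.map((m, i) =>
    m.role === "assistant" && i !== last
      ? { ...m, content: m.content.filter((p) => p.type !== "reasoning") }
      : m,
  );
}

// Retain policy: a passthrough, as the commit describes for every provider.
function retain(messages: Msg[]): Msg[] {
  return messages;
}

// True when every message sent at the earlier step serializes to the same
// bytes at the later step, i.e. the earlier request is a stable prefix.
function prefixIsStable(
  policy: (m: Msg[]) => Msg[],
  before: Msg[],
  after: Msg[],
): boolean {
  const a = policy(before).map((m) => JSON.stringify(m));
  const b = policy(after).map((m) => JSON.stringify(m));
  return a.every((bytes, i) => bytes === b[i]);
}

const step1: Msg[] = [
  { role: "user", content: [{ type: "text", text: "first" }] },
  {
    role: "assistant",
    content: [
      { type: "reasoning", text: "thinking 1" },
      { type: "text", text: "answer 1" },
    ],
  },
];

// The engine appends a new step; step1 should remain an unchanged prefix.
const step2: Msg[] = [
  ...step1,
  { role: "user", content: [{ type: "text", text: "second" }] },
  {
    role: "assistant",
    content: [
      { type: "reasoning", text: "thinking 2" },
      { type: "text", text: "answer 2" },
    ],
  },
];

const retainStable = prefixIsStable(retain, step1, step2);
const stripStable = prefixIsStable(stripPerTurn, step1, step2);
console.log({ retainStable, stripStable });
```

Under the strip, the first assistant message carries its reasoning at step 1 and loses it at step 2, so its serialized bytes differ between the two requests. That is the shape of failure the byte-stability regression test is meant to catch.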
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -116,7 +116,7 @@
 ### Fixed
 
 - **Prompt cache no longer thrashes on long agentic runs.** The single cache breakpoint sat on the last *user* message, which is fixed for a run — so every iteration's freshly-appended assistant/tool content lived past it, uncached, and the growing prefix was re-sent at full input rate (measured: cache-write/non-cached dominated long-run cost, effective hit rate 14–40%). Cache-control placement moves out of the engine into a provider-scoped seam (`src/model/cache-policy.ts`, mirroring `applyReasoningReplayPolicy`); the Anthropic strategy places a rolling step-anchor breakpoint on each prior step's tail so the whole prior prefix is read back instead of re-written. Validated against the live API: expensive (non-cached + write) tokens drop ~70–100% on the within-run pattern. Other providers pass through (OpenAI caches prefixes automatically; the append-only prefix is all it needs).
-- **Reasoning blocks are retained on replay instead of stripped per turn.** The replay policy stripped older Anthropic thinking blocks keyed on "is this the latest assistant message" — so the instant a turn stopped being latest its bytes changed, invalidating the cached prefix from that point. With the rolling step-anchor that bust landed just behind the anchor on every iteration, forcing a full re-write of the growing prefix. Under prompt caching, retained reasoning is written to cache once and read back cheaply, so the strip's token savings were dwarfed by the re-writes it forced. Stripping is only cache-safe beyond a stable frozen boundary (compaction); `stripOlderReasoning` is kept for that future use. OpenAI/Gemini already retained reasoning (correctness); now all providers do.
+- **Reasoning blocks are retained on replay instead of stripped per turn.** The replay policy stripped older Anthropic thinking blocks keyed on "is this the latest assistant message" — so the instant a turn stopped being latest its bytes changed, invalidating the cached prefix from that point. With the rolling step-anchor that bust landed just behind the anchor on every iteration, forcing a full re-write of the growing prefix. Under prompt caching, retained reasoning is written to cache once and read back cheaply, so the strip's token savings were dwarfed by the re-writes it forced. Stripping is only cache-safe beyond a stable frozen boundary (compaction), and compaction summarizes old turns into a reasoning-free block anyway, so the per-turn strip helper is removed rather than kept dormant. OpenAI/Gemini already retained reasoning (correctness); now all providers do.
 - User- and workspace-scope `type: skill` skills now load instead of silently doing nothing: the trigger/keyword matcher runs per-request over the merged org+workspace+user pool (it was boot-only and never scanned those tiers), strategy-less skills created via `skills__create` auto-resolve to `loading-strategy: always` written to disk, and `skills__list` flags any skill that would never load. ([#391](https://github.com/NimbleBrainInc/nimblebrain/issues/391))
 - Automation field edits no longer fail silently: a rejected update (e.g. an out-of-range `maxIterations`) now surfaces the error in the detail view instead of swallowing it and snapping the field back to its old value.
 - Automation `maxIterations` cap raised from 15 to 50 (the engine's absolute ceiling), and the create-form default from 5 to 25. The old 5-step default amputated analytical automations mid-run — they hit the cap, produced no final output, and reported "No output captured." Iterations are a runaway backstop, not a cost budget; `maxInputTokens` and `maxRunDurationMs` bound per-run cost directly.
diff --git a/src/conversation/window.ts b/src/conversation/window.ts
@@ -73,71 +73,31 @@ function groupMessages(messages: LanguageModelV3Message[]): LanguageModelV3Messa
   return groups;
 }
 
-/**
- * Strip reasoning blocks from assistant messages older than the most recent
- * assistant turn.
- *
- * Anthropic's guidance for extended thinking: pass thinking blocks from the
- * most recent turn back to the API unchanged; strip thinking blocks from
- * older turns to reduce token usage. The reasoning blocks attached to the
- * last assistant message are still load-bearing — they pair with any
- * tool-use chain currently in flight — but every earlier assistant message's
- * reasoning is historical and replays as opaque signature bytes that bloat
- * the prompt linearly with turn count.
- *
- * In production conv_e00606c7aab7423d we saw 100+ KB `llm.response` events
- * dominated by Anthropic signatures with empty `text`. This is the seam
- * where that growth is cut.
- *
- * Edge case: an assistant message that contains ONLY reasoning blocks is a
- * legitimate placeholder for a turn that produced reasoning-only output
- * (see `event-reconstructor.ts` step 4a). Stripping its only content would
- * leave an empty assistant message that Anthropic rejects on replay, so
- * those placeholders are kept intact.
- */
-export function stripOlderReasoning(messages: LanguageModelV3Message[]): LanguageModelV3Message[] {
-  let lastAssistantIdx = -1;
-  for (let i = messages.length - 1; i >= 0; i--) {
-    if (messages[i]?.role === "assistant") {
-      lastAssistantIdx = i;
-      break;
-    }
-  }
-  if (lastAssistantIdx <= 0) return messages;
-
-  let changed = false;
-  const out = messages.map((msg, idx) => {
-    if (idx === lastAssistantIdx) return msg;
-    if (msg.role !== "assistant") return msg;
-    if (typeof msg.content === "string") return msg;
-    const nonReasoning = msg.content.filter((part) => part.type !== "reasoning");
-    if (nonReasoning.length === msg.content.length) return msg;
-    if (nonReasoning.length === 0) return msg; // pure-reasoning placeholder
-    changed = true;
-    return { ...msg, content: nonReasoning };
-  });
-  return changed ? out : messages;
-}
-
 /**
  * Apply provider-specific replay policy for reasoning blocks.
  *
- * Reasoning blocks are RETAINED for every provider. OpenAI and Gemini require
- * reasoning/thought metadata to stay paired with replayed tool calls, so their
- * history must be intact. Anthropic *permits* stripping older thinking blocks
- * (an optional token optimization), but doing it here — per request, keyed on
- * "is this the latest assistant message" — is incompatible with prompt caching:
- * the moment a turn stops being the latest, its reasoning bytes change, which
- * invalidates the cached prefix from that point. With the rolling step-anchor
- * (see `model/cache-policy.ts`) that bust lands just behind the anchor on EVERY
- * iteration, forcing a full re-write of the growing prefix — the exact pathology
- * this whole effort removes. Under caching, retained reasoning is written to
- * cache once and read back at the cache-read rate; the strip's token savings are
- * dwarfed by the re-writes it forces.
+ * This is the single chokepoint for one invariant: **reasoning blocks are
+ * RETAINED on replay — never stripped per turn.** It protects two providers for
+ * two different reasons, so it's worth one named seam (and the regression test
+ * that guards it):
+ *   - OpenAI / Gemini *require* reasoning/thought metadata to stay paired with
+ *     replayed tool calls (a correctness constraint).
+ *   - Anthropic merely *permits* stripping older thinking blocks (a token
+ *     optimization), but stripping here — per request, keyed on "is this the
+ *     latest assistant message" — is incompatible with prompt caching: the
+ *     moment a turn stops being the latest, its reasoning bytes change, which
+ *     invalidates the cached prefix from that point. With the rolling
+ *     step-anchor (see `model/cache-policy.ts`) that bust lands just behind the
+ *     anchor on EVERY iteration, forcing a full re-write of the growing prefix —
+ *     the exact pathology this effort removes. Retained reasoning is written to
+ *     cache once and read back cheaply; the strip's savings are dwarfed by the
+ *     re-writes it forced.
  *
- * Stripping is only cache-safe beyond a STABLE frozen boundary that advances
- * rarely (i.e. at compaction). `stripOlderReasoning` is kept for that future
- * use — applied once to a compacted prefix, not per turn. Until then, retain.
+ * The policy is uniform today (retain, regardless of `provider`), so this is a
+ * passthrough — but it stays provider-keyed because that's the dispatch point a
+ * future provider-specific replay transform plugs into. Bounded stripping, if it
+ * ever returns, belongs at a stable compaction boundary (applied once to a
+ * frozen prefix), not here per turn.
  */
 export function applyReasoningReplayPolicy(
   messages: LanguageModelV3Message[],
diff --git a/test/unit/window.test.ts b/test/unit/window.test.ts
@@ -3,7 +3,6 @@ import type { LanguageModelV3Message } from "@ai-sdk/provider";
 import {
 	applyReasoningReplayPolicy,
 	sliceHistory,
-	stripOlderReasoning,
 	windowMessages,
 } from "../../src/conversation/window.ts";
 
@@ -21,14 +20,6 @@ function assistantWithReasoning(
 	};
 }
 
-/** Helper: create an assistant message with only reasoning (placeholder turn). */
-function assistantReasoningOnly(reasoningText: string): LanguageModelV3Message {
-	return {
-		role: "assistant",
-		content: [{ type: "reasoning" as const, text: reasoningText }],
-	};
-}
-
 /** Helper: create a simple text message. */
 function textMsg(role: "user" | "assistant", text: string): LanguageModelV3Message {
 	return { role, content: [{ type: "text" as const, text }] };
@@ -359,109 +350,6 @@ describe("sliceHistory", () => {
 	});
 });
 
-describe("stripOlderReasoning", () => {
-	it("keeps reasoning on the most recent assistant message", () => {
-		const msgs: LanguageModelV3Message[] = [
-			textMsg("user", "first"),
-			assistantWithReasoning("thinking 1", "answer 1"),
-			textMsg("user", "second"),
-			assistantWithReasoning("thinking 2", "answer 2"),
-		];
-		const result = stripOlderReasoning(msgs);
-
-		// First assistant message: reasoning stripped, text kept.
-		expect(result[1]).toEqual({
-			role: "assistant",
-			content: [{ type: "text", text: "answer 1" }],
-		});
-		// Last assistant message: untouched.
-		expect(result[3]).toEqual(msgs[3]!);
-	});
-
-	it("returns the input unchanged when there is nothing to strip", () => {
-		const msgs: LanguageModelV3Message[] = [
-			textMsg("user", "hi"),
-			assistantWithReasoning("thinking", "hello"),
-		];
-		const result = stripOlderReasoning(msgs);
-		// Only assistant message is the latest — no older turns to strip.
-		// Returns the same reference (identity) for the common no-op path.
-		expect(result).toBe(msgs);
-	});
-
-	it("preserves reasoning-only placeholder messages instead of leaving an empty assistant turn", () => {
-		const msgs: LanguageModelV3Message[] = [
-			textMsg("user", "first"),
-			assistantReasoningOnly("opaque signature"),
-			textMsg("user", "second"),
-			assistantWithReasoning("thinking", "answer"),
-		];
-		const result = stripOlderReasoning(msgs);
-
-		// The earlier reasoning-only assistant message is kept as-is —
-		// stripping would leave an empty content array, which Anthropic
-		// rejects on replay.
-		expect(result[1]).toEqual(msgs[1]!);
-		expect(result[3]).toEqual(msgs[3]!);
-	});
-
-	it("strips reasoning from earlier assistant messages even when they also contain tool-calls", () => {
-		const msgs: LanguageModelV3Message[] = [
-			textMsg("user", "do something"),
-			{
-				role: "assistant",
-				content: [
-					{ type: "reasoning" as const, text: "considering options" },
-					{
-						type: "tool-call" as const,
-						toolCallId: "call_1",
-						toolName: "search",
-						input: { q: "x" },
-					},
-				],
-			},
-			toolResultMsg("call_1"),
-			assistantWithReasoning("now reasoning again", "done"),
-		];
-		const result = stripOlderReasoning(msgs);
-
-		// Older assistant message: reasoning stripped, tool-call retained
-		// so the tool_result it pairs with still has its tool_use anchor.
-		expect(result[1]).toEqual({
-			role: "assistant",
-			content: [
-				{
-					type: "tool-call",
-					toolCallId: "call_1",
-					toolName: "search",
-					input: { q: "x" },
-				},
-			],
-		});
-		// Tool-result unchanged.
-		expect(result[2]).toEqual(msgs[2]!);
-		// Latest assistant unchanged.
-		expect(result[3]).toEqual(msgs[3]!);
-	});
-
-	it("is a no-op when there are no assistant messages", () => {
-		const msgs: LanguageModelV3Message[] = [textMsg("user", "hi")];
-		const result = stripOlderReasoning(msgs);
-		expect(result).toBe(msgs);
-	});
-
-	it("is a no-op when no reasoning blocks exist", () => {
-		const msgs: LanguageModelV3Message[] = [
-			textMsg("user", "hi"),
-			textMsg("assistant", "hello"),
-			textMsg("user", "again"),
-			textMsg("assistant", "hello again"),
-		];
-		const result = stripOlderReasoning(msgs);
-		expect(result).toBe(msgs);
-	});
-});
-
 describe("applyReasoningReplayPolicy", () => {
 	const replayHistoryWithToolCall = (): LanguageModelV3Message[] => [
 		textMsg("user", "do something"),
@@ -498,6 +386,12 @@ describe("applyReasoningReplayPolicy", () => {
 	});
 
 	it("prefix is byte-stable as the latest assistant advances (no per-turn re-strip)", () => {
+		// REGRESSION GUARD (not a tautology): trivially true while the policy is a
+		// passthrough, but it fails loudly the moment anyone re-introduces per-turn
+		// reasoning stripping here — which would change a turn's bytes the instant
+		// it stops being the latest assistant and bust the rolling cache anchor
+		// just behind it. Keep it.
+		//
 		// Simulate the engine appending a new step: the previously-latest
 		// assistant must NOT change bytes when a newer assistant arrives, or the
 		// rolling cache anchor just behind it misses every turn.