Commit e0476c7

Fix parseTranscript: real user messages dropped, injected blocks treated as signals
Two related bugs in `parseTranscript` cause Soul signals to be derived from the wrong content:

1. **String-shaped content was unhandled.** Claude Code's JSONL emits `message.content` as a bare string for genuine human-typed user turns (e.g. `{ "role": "user", "content": "Please fix the bug." }`), and as an array of content blocks for assistant turns and for system-injected user-role messages (tool_result blocks, system reminders, slash-command artifacts, Skill tool result bodies). The `TranscriptEntry` type declared `content: ContentBlock[]` unconditionally; calling `.filter()` on a string threw `TypeError: content.filter is not a function`, swallowed by the surrounding try/catch. Result: every real user message was silently dropped.

2. **Injected user-role blocks were treated as user intent.** The only "user" content surviving the bug above was array-shaped: overwhelmingly `tool_result` blocks (correctly filtered by the `c.type === "text"` guard) plus the occasional `{ type: "text", ... }` block injected by Claude Code: system reminders, slash-command wrappers (`<command-name>/clear</command-name>` etc.), and Skill tool result bodies (which carry the invoked skill's SKILL.md content as a text block). These pattern-matched against the `CORRECTION_PATTERN` / `GRATITUDE_PATTERN` / `COMPLETION_PATTERN` regexes, producing self-referential false signals (e.g. the `task-observer` skill body was repeatedly registering as user "gratitude" and "correction"). Soul's own QUICK reflection self-diagnosed this as *"signal data is corrupted (contains system marker fragments and truncation)"*.

Fix:

- Widen `TranscriptEntry.message.content` to `string | ContentBlock[]` (matches reality).
- In `parseTranscript`, branch on `typeof content`. For strings: trim and use directly. For arrays: keep the existing text-block filter.
- Apply a shared `INJECTED_PREFIX` regex in both branches that strips system-reminders, slash-command artifacts (`<local-command-caveat>`, `<command-name>`, `<command-message>`, `<command-args>`), Skill tool result bodies (`Base directory for this skill:`), the request-interrupted marker, and the post-compact continuation banner. These are all framework-injected, not user-typed.

Tests: 6 new cases in `signal-extractor.test.ts` covering string content, array content with intermixed tool_use blocks, both injection shapes (system-reminder array block + slash-command string), Skill-body injection, mixed-shape transcripts (regression guard against the swallowed-TypeError pattern), and malformed-line resilience.

Verification against a live JSONL: pre-fix returned 2 array-text blocks (both injected noise); post-fix returns 9 real user messages with zero injection leakage.

Impact: this is upstream of every signal-based feature: framework seeding, lesson selection, reflection cadence triggers. Pre-fix, those were all running on injected system text instead of user dialogue.
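The first bug can be reproduced in isolation. The sketch below is a minimal illustration, not the code from this commit: the types are simplified, the names `RawEntry`, `extractPreFix`, `extractPostFix`, and `sampleLines` are hypothetical, and it assumes the try/catch wraps each JSONL line individually, as the message above suggests. Injected-prefix filtering is left out to keep the focus on the string-versus-array branch.

```typescript
// Simplified, hypothetical types; the real ContentBlock union has more variants.
type ContentBlock = { type: string; text?: string };
type RawEntry = { message?: { content: string | ContentBlock[] } };

// Pre-fix shape: content is assumed to always be an array. The cast mirrors
// the old type declaration, so the compiler cannot flag the string case.
function extractPreFix(lines: string[]): string[] {
  const out: string[] = [];
  for (const raw of lines) {
    try {
      const entry = JSON.parse(raw) as { message?: { content: ContentBlock[] } };
      if (!entry.message?.content) continue;
      const text = entry.message.content
        .filter((c) => c.type === "text")
        .map((c) => c.text ?? "")
        .join("\n")
        .trim();
      if (text) out.push(text);
    } catch {
      // Swallows malformed JSON and also "content.filter is not a function".
    }
  }
  return out;
}

// Post-fix shape: branch on the runtime type before touching array methods.
function extractPostFix(lines: string[]): string[] {
  const out: string[] = [];
  for (const raw of lines) {
    try {
      const content = (JSON.parse(raw) as RawEntry).message?.content;
      if (!content) continue;
      let text: string;
      if (typeof content === "string") {
        text = content.trim();
      } else if (Array.isArray(content)) {
        text = content
          .filter((c) => c.type === "text" && typeof c.text === "string")
          .map((c) => c.text as string)
          .join("\n")
          .trim();
      } else {
        continue;
      }
      if (text) out.push(text);
    } catch {
      // Malformed lines are still skipped.
    }
  }
  return out;
}

// One string-shaped user turn, one array-shaped assistant turn, one bad line.
const sampleLines = [
  JSON.stringify({ message: { content: "Please fix the bug." } }),
  JSON.stringify({ message: { content: [{ type: "text", text: "Fixed." }] } }),
  "{not valid json",
];

console.log(extractPreFix(sampleLines));  // only "Fixed." survives
console.log(extractPostFix(sampleLines)); // "Please fix the bug." and "Fixed."
```

Because the catch block cannot tell a parse error from a shape error, the pre-fix version fails silently instead of loudly, which is why the dropped messages went unnoticed.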
1 parent 2626391 commit e0476c7

2 files changed

Lines changed: 227 additions & 6 deletions

packages/server/src/engine/signal-extractor.ts

Lines changed: 34 additions & 5 deletions
@@ -212,6 +212,12 @@ type ContentBlock =
 
 /**
  * A single entry from the transcript JSONL file.
+ *
+ * `message.content` is a discriminated union: a bare string for genuine
+ * human-typed user turns, and an array of content blocks for assistant
+ * turns and for system-injected user-role messages (tool_result blocks,
+ * system reminders, slash-command artifacts, Skill tool result bodies).
+ * Treating it as always-array dropped every real user message.
  */
 type TranscriptEntry = {
   type: "user" | "assistant" | "system" | "summary" | string;
@@ -220,10 +226,19 @@ type TranscriptEntry = {
   sessionId: string;
   message?: {
     role: "user" | "assistant";
-    content: ContentBlock[];
+    content: string | ContentBlock[];
   };
 };
 
+/**
+ * Prefixes identifying user-role text that was injected by Claude Code
+ * rather than typed by the user. Keeping these out of the signal stream
+ * is what prevents Skill tool result bodies (e.g. SKILL.md content) from
+ * pattern-matching as user gratitude / corrections / completion signals.
+ */
+const INJECTED_PREFIX =
+  /^\s*(<system-reminder>|<local-command-caveat>|<command-name>|<command-message>|<command-args>|Base directory for this skill:|\[Request interrupted by user|This session is being continued from a previous conversation)/;
+
 /**
  * Parse a Claude Code transcript JSONL file into messages suitable for signal extraction.
  */
@@ -238,11 +253,25 @@ export function parseTranscript(jsonlContent: string): TranscriptMessage[] {
       if (entry.type !== "user" && entry.type !== "assistant") continue;
       if (!entry.message?.content) continue;
 
-      const textParts = entry.message.content
-        .filter((c): c is { type: "text"; text: string } => c.type === "text")
-        .map((c) => c.text);
+      const content = entry.message.content;
+      let text: string;
+
+      if (typeof content === "string") {
+        if (INJECTED_PREFIX.test(content)) continue;
+        text = content.trim();
+      } else if (Array.isArray(content)) {
+        const textParts = content
+          .filter((c): c is { type: "text"; text: string } =>
+            c.type === "text" && typeof (c as { text?: unknown }).text === "string",
+          )
+          .map((c) => c.text)
+          .filter((t) => !INJECTED_PREFIX.test(t));
+
+        text = textParts.join("\n").trim();
+      } else {
+        continue;
+      }
 
-      const text = textParts.join("\n").trim();
       if (!text) continue;
 
       messages.push({
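To see what `INJECTED_PREFIX` does and does not catch, the sketch below runs the regex from the diff against a few illustrative strings. The helpers `isInjected` and `joinGenuineBlocks` and the sample strings are hypothetical and exist only for this example; `joinGenuineBlocks` approximates the array branch by filtering text blocks one at a time before joining.

```typescript
// Regex copied verbatim from the signal-extractor.ts diff.
const INJECTED_PREFIX =
  /^\s*(<system-reminder>|<local-command-caveat>|<command-name>|<command-message>|<command-args>|Base directory for this skill:|\[Request interrupted by user|This session is being continued from a previous conversation)/;

// Hypothetical helper, not part of the commit.
const isInjected = (text: string): boolean => INJECTED_PREFIX.test(text);

// Hypothetical helper approximating the array branch: drop injected text
// blocks individually, then join whatever is left.
function joinGenuineBlocks(blocks: string[]): string {
  return blocks.filter((t) => !isInjected(t)).join("\n").trim();
}

// Illustrative inputs paired with the expected classification.
const checks: Array<[string, boolean]> = [
  ["<system-reminder>\nTodos updated.\n</system-reminder>", true],
  ["  <command-name>/clear</command-name>", true], // leading whitespace is allowed
  ["Base directory for this skill: ~/.claude/skills/example", true],
  ["[Request interrupted by user]", true],
  ["Please fix the bug.", false],
  // The pattern is anchored at the start, so a marker later in the text does not match.
  ["Thanks! <system-reminder>x</system-reminder>", false],
];

for (const [text, expected] of checks) {
  console.log(JSON.stringify(text), isInjected(text) === expected ? "ok" : "MISMATCH");
}

// An injected block next to a genuine one: only the genuine text remains.
console.log(joinGenuineBlocks(["<system-reminder>x</system-reminder>", "Real question?"]));
```

Two consequences of the start anchor seem worth keeping in mind: a user message that happens to begin with one of these markers would be dropped whole, and injected text appended after genuine user text inside a single string would pass through unfiltered.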

packages/server/tests/signal-extractor.test.ts

Lines changed: 193 additions & 1 deletion
@@ -1,5 +1,9 @@
 import { describe, it, expect } from "vitest";
-import { extractSignalsFromMessages, type TranscriptMessage } from "../src/engine/signal-extractor.js";
+import {
+  extractSignalsFromMessages,
+  parseTranscript,
+  type TranscriptMessage,
+} from "../src/engine/signal-extractor.js";
 
 describe("signal-extractor", () => {
   const sessionKey = "test-session";
@@ -102,3 +106,191 @@ describe("signal-extractor", () => {
     expect(corrections).toHaveLength(1);
   });
 });
+
+describe("parseTranscript", () => {
+  // Helper: build a single JSONL line for a transcript entry.
+  const line = (entry: Record<string, unknown>) => JSON.stringify(entry);
+
+  it("extracts genuine user messages stored as bare strings", () => {
+    // Claude Code emits message.content as a string for user-typed turns.
+    // Before the fix, .filter() was called unconditionally on content,
+    // throwing TypeError which the surrounding try/catch swallowed —
+    // every real user message was silently dropped.
+    const jsonl = [
+      line({
+        type: "user",
+        uuid: "u1",
+        timestamp: "2026-05-25T10:00:00Z",
+        sessionId: "s1",
+        message: { role: "user", content: "Please fix the bug." },
+      }),
+      line({
+        type: "assistant",
+        uuid: "a1",
+        timestamp: "2026-05-25T10:00:01Z",
+        sessionId: "s1",
+        message: { role: "assistant", content: [{ type: "text", text: "Fixed." }] },
+      }),
+    ].join("\n");
+
+    const messages = parseTranscript(jsonl);
+    expect(messages).toHaveLength(2);
+    expect(messages[0]).toEqual({ role: "user", text: "Please fix the bug." });
+    expect(messages[1]).toEqual({ role: "assistant", text: "Fixed." });
+  });
+
+  it("extracts text blocks from array-shaped content", () => {
+    const jsonl = line({
+      type: "assistant",
+      uuid: "a1",
+      timestamp: "2026-05-25T10:00:00Z",
+      sessionId: "s1",
+      message: {
+        role: "assistant",
+        content: [
+          { type: "text", text: "Running the test." },
+          { type: "tool_use", id: "t1", name: "Bash", input: {} },
+          { type: "text", text: "Done." },
+        ],
+      },
+    });
+
+    const messages = parseTranscript(jsonl);
+    expect(messages).toHaveLength(1);
+    expect(messages[0].text).toBe("Running the test.\nDone.");
+  });
+
+  it("filters out system-reminder injections (array shape)", () => {
+    // System reminders arrive as {type:'text'} blocks inside an
+    // array-shaped user-role message. They are injected by Claude Code,
+    // not typed by the user, and must not pose as user intent.
+    const jsonl = line({
+      type: "user",
+      uuid: "u1",
+      timestamp: "2026-05-25T10:00:00Z",
+      sessionId: "s1",
+      message: {
+        role: "user",
+        content: [
+          { type: "text", text: "<system-reminder>\nTodos updated.\n</system-reminder>" },
+        ],
+      },
+    });
+
+    const messages = parseTranscript(jsonl);
+    expect(messages).toHaveLength(0);
+  });
+
+  it("filters out slash-command artifacts (string shape)", () => {
+    // Slash commands like /clear arrive as bare strings with
+    // <local-command-caveat> / <command-name> wrappers.
+    const jsonl = [
+      line({
+        type: "user",
+        uuid: "u1",
+        timestamp: "2026-05-25T10:00:00Z",
+        sessionId: "s1",
+        message: {
+          role: "user",
+          content:
+            "<local-command-caveat>Caveat: ...</local-command-caveat>",
+        },
+      }),
+      line({
+        type: "user",
+        uuid: "u2",
+        timestamp: "2026-05-25T10:00:01Z",
+        sessionId: "s1",
+        message: {
+          role: "user",
+          content: "<command-name>/clear</command-name>",
+        },
+      }),
+    ].join("\n");
+
+    const messages = parseTranscript(jsonl);
+    expect(messages).toHaveLength(0);
+  });
+
+  it("filters out Skill tool result bodies posing as user content", () => {
+    // The Skill tool returns SKILL.md content as a {type:'text'} block
+    // attached to a user-role message. Pre-fix this surfaced as user
+    // gratitude / correction signals — the source of the "all my Soul
+    // signals are corrupted self-references" symptom.
+    const jsonl = line({
+      type: "user",
+      uuid: "u1",
+      timestamp: "2026-05-25T10:00:00Z",
+      sessionId: "s1",
+      message: {
+        role: "user",
+        content: [
+          {
+            type: "text",
+            text:
+              "Base directory for this skill: ~/.claude/skills/example\n\nThanks for using this skill, perfect work!",
+          },
+        ],
+      },
+    });
+
+    const messages = parseTranscript(jsonl);
+    expect(messages).toHaveLength(0);
+  });
+
+  it("handles mixed-shape transcript without throwing", () => {
+    // The pre-fix bug surfaced as a swallowed TypeError on the FIRST
+    // string-content entry — subsequent entries on that same parse
+    // call were unaffected only because the try/catch ate the throw.
+    // This test guards against any regression that would crash on
+    // shape mismatch.
+    const jsonl = [
+      line({
+        type: "user",
+        uuid: "u1",
+        timestamp: "2026-05-25T10:00:00Z",
+        sessionId: "s1",
+        message: { role: "user", content: "Real user message." },
+      }),
+      line({
+        type: "user",
+        uuid: "u2",
+        timestamp: "2026-05-25T10:00:01Z",
+        sessionId: "s1",
+        message: {
+          role: "user",
+          content: [{ type: "tool_result", tool_use_id: "t1", content: "ok" }],
+        },
+      }),
+      line({
+        type: "assistant",
+        uuid: "a1",
+        timestamp: "2026-05-25T10:00:02Z",
+        sessionId: "s1",
+        message: { role: "assistant", content: [{ type: "text", text: "Acknowledged." }] },
+      }),
+    ].join("\n");
+
+    const messages = parseTranscript(jsonl);
+    expect(messages).toHaveLength(2);
+    expect(messages[0].text).toBe("Real user message.");
+    expect(messages[1].text).toBe("Acknowledged.");
+  });
+
+  it("skips malformed JSONL lines without crashing", () => {
+    const jsonl = [
+      "{not valid json",
+      line({
+        type: "user",
+        uuid: "u1",
+        timestamp: "2026-05-25T10:00:00Z",
+        sessionId: "s1",
+        message: { role: "user", content: "valid" },
+      }),
+    ].join("\n");
+
+    const messages = parseTranscript(jsonl);
+    expect(messages).toHaveLength(1);
+    expect(messages[0].text).toBe("valid");
+  });
+});
