Improve GSD Journal with targeted OpenTelemetry concepts for better forensics #4
Summary
The GSD Journal captures structured events for auto-mode iterations but is missing several OpenTelemetry-inspired concepts that would significantly improve forensics diagnosis quality. This issue tracks five targeted improvements, ordered by value.
1. Correlate journal events to pi session (highest value)
Problem: The journal (unit-start/unit-end) and the pi session JSONL (LLM calls, tool executions) are completely disconnected. Forensics infers the link by timestamp proximity, which is fragile and wastes LLM context on full-file scanning.
Fix: Add sessionId and messageOffset to unit-start:
```typescript
{
eventType: "unit-start",
data: {
unitId, unitType,
sessionId: "abc123", // pi session file identifier
messageOffset: 42 // message count at unit start
}
}
```
Impact: Forensics can jump directly from unit-end { status: "error" } to the exact tool call that failed, without scanning the whole session file.
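A minimal sketch of how forensics might use these two fields, assuming the pi session JSONL stores one message per line and that messageOffset counts those lines. The UnitStartData interface and the messagesForUnit helper are hypothetical names for illustration, not part of the existing code.

```typescript
// Hypothetical shape: only sessionId and messageOffset come from the proposal above.
interface UnitStartData {
  unitId: string;
  unitType: string;
  sessionId: string;     // pi session file identifier
  messageOffset: number; // message count at unit start
}

// Return the session messages recorded at or after the unit started.
// Assumption: `sessionLines` holds the raw lines of the session file named by sessionId,
// one JSON message per line.
function messagesForUnit(sessionLines: string[], start: UnitStartData): unknown[] {
  return sessionLines
    .filter((line) => line.trim().length > 0) // ignore blank lines such as a trailing newline
    .slice(start.messageOffset)               // skip everything recorded before this unit
    .map((line) => JSON.parse(line) as unknown);
}
```

With the offset recorded, forensics would only need to parse the tail of the session file belonging to the failing unit rather than the whole file.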
2. Explicit durations on unit-end
Problem: Duration must be computed by pairing the ts timestamps of each unit-start and its matching unit-end. Forensics can't query slow units directly.
Fix: Add durationMs to unit-end:
```typescript
{ eventType: "unit-end", data: { unitId, status, artifactVerified, durationMs: 142000 } }
```
Impact: Timeout anomaly detection (queryJournal({ eventType: "unit-end" }) + filter on durationMs) works from journal alone without cross-referencing activity logs.
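The filter step could look roughly like the sketch below. It works on an in-memory array of unit-end entries, because the actual return type of queryJournal is not shown here; UnitEndEntry and slowUnits are hypothetical names, and durationMs is treated as optional so that older journal entries without it are simply skipped.

```typescript
// Hypothetical entry shape, limited to the fields named in the proposal.
interface UnitEndEntry {
  eventType: "unit-end";
  data: { unitId: string; status: string; durationMs?: number };
}

// Return the unit-end entries slower than `thresholdMs`, slowest first.
// Entries without durationMs (written before this change) never match.
function slowUnits(entries: UnitEndEntry[], thresholdMs: number): UnitEndEntry[] {
  return entries
    .filter((e) => (e.data.durationMs ?? 0) > thresholdMs)
    .sort((a, b) => (b.data.durationMs ?? 0) - (a.data.durationMs ?? 0));
}
```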
3. Structured error detail on unit-end
Problem: unit-end { status: "error" } carries no error detail in the journal. Forensics must parse the pi session JSONL to find what went wrong.
Fix:
```typescript
{
eventType: "unit-end",
data: {
unitId, status: "error",
error: "Bash tool failed: permission denied on /etc/hosts",
errorType: "tool-error" // one of: "tool-error" | "timeout" | "context-overflow" | "unknown"
}
}
```
Impact: Forensics can classify failure modes and generate a summary section from journal-only data.
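As a rough illustration of journal-only classification, the sketch below tallies failed units by errorType. The ErrorType union mirrors the four values listed above; UnitEndData and countFailureModes are hypothetical names, and a missing errorType is counted as "unknown" by assumption.

```typescript
type ErrorType = "tool-error" | "timeout" | "context-overflow" | "unknown";

// Hypothetical unit-end payload, limited to the fields named in the proposal.
interface UnitEndData {
  unitId: string;
  status: string;
  error?: string;
  errorType?: ErrorType;
}

// Count failed units per error type; non-error units are ignored.
function countFailureModes(ends: UnitEndData[]): Record<ErrorType, number> {
  const counts: Record<ErrorType, number> = {
    "tool-error": 0,
    "timeout": 0,
    "context-overflow": 0,
    "unknown": 0,
  };
  for (const end of ends) {
    if (end.status !== "error") continue;
    counts[end.errorType ?? "unknown"] += 1; // assumption: missing type counts as "unknown"
  }
  return counts;
}
```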
4. Resource attributes on iteration-start
Problem: Journal entries carry no metadata about the GSD version or model in use. Forensics fetches this from GSD_VERSION env and metrics.json separately, making regression correlation manual.
Fix: Add a resource block to iteration-start:
```typescript
{
eventType: "iteration-start",
data: { iteration },
resource: { gsdVersion: "2.48.0", model: "anthropic/claude-sonnet-4-20250514", cwd: "/..." }
}
```
Impact: Forensics can answer "did this regression start after the model changed?" from journal alone.
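One way to answer that question from the journal is sketched below, assuming iteration-start entries are passed in chronological order. IterationStart and modelChanges are hypothetical names; the resource fields are the ones proposed above.

```typescript
// Hypothetical entry shape based on the proposed resource block.
interface IterationStart {
  eventType: "iteration-start";
  data: { iteration: number };
  resource: { gsdVersion: string; model: string; cwd: string };
}

interface ModelChange {
  iteration: number; // first iteration that ran with the new model
  from: string;
  to: string;
}

// List every point where the model differs from the previous iteration's model.
// Assumption: `starts` is already sorted by time.
function modelChanges(starts: IterationStart[]): ModelChange[] {
  const changes: ModelChange[] = [];
  let prev: IterationStart | undefined;
  for (const cur of starts) {
    if (prev && prev.resource.model !== cur.resource.model) {
      changes.push({ iteration: cur.data.iteration, from: prev.resource.model, to: cur.resource.model });
    }
    prev = cur;
  }
  return changes;
}
```

Comparing the iteration numbers returned here with the first failing iteration would show whether a regression lines up with a model switch; the same loop could compare gsdVersion instead.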
5. Cross-iteration causal links for recovery chains
Problem: causedBy only works within a single flowId. When stuck detection fires and the next iteration is a recovery attempt, there is no journal link between them.
Fix: Emit causedBy on the recovery iteration's iteration-start pointing to the stuck-detected event:
```typescript
// iteration N+1 recovery
{ flowId: "flow-bbb", seq: 1, eventType: "iteration-start",
causedBy: { flowId: "flow-aaa", seq: 5 } // points to stuck-detected
}
```
Impact: Forensics reconstructs the full recovery chain (stuck → cache-invalidate → retry → still-stuck → hard-stop) from the data model rather than inferring it from timestamps.
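Following the links could be done along the lines of the sketch below, which walks causedBy references backwards across flows until it reaches an event with no cause. EventRef, LinkedEvent, and causalChain are hypothetical names; the only assumption about the data model is that (flowId, seq) identifies an event uniquely.

```typescript
interface EventRef {
  flowId: string;
  seq: number;
}

// Hypothetical minimal event shape; real journal entries carry more fields.
interface LinkedEvent extends EventRef {
  eventType: string;
  causedBy?: EventRef;
}

// Walk causedBy links backwards from `from`, returning the chain oldest-first.
// Stops at a missing event or a repeated reference, so malformed data cannot loop forever.
function causalChain(events: LinkedEvent[], from: EventRef): LinkedEvent[] {
  const byKey = new Map<string, LinkedEvent>();
  for (const e of events) byKey.set(`${e.flowId}#${e.seq}`, e);

  const chain: LinkedEvent[] = [];
  const seen = new Set<string>();
  let ref: EventRef | undefined = from;
  while (ref) {
    const key = `${ref.flowId}#${ref.seq}`;
    if (seen.has(key)) break;
    seen.add(key);
    const event = byKey.get(key);
    if (!event) break;
    chain.push(event);
    ref = event.causedBy;
  }
  return chain.reverse();
}
```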
What NOT to add
- OTLP export / external collectors: GSD is a local tool
- Sampling: 100% event capture is correct for auto-mode frequency
- Combined span-style start/end: the two-event model is better for crash forensics (a crash leaves unit-start with no matching unit-end, which is precisely the signal)
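The crash signal from the two-event model can be detected with a simple pairing pass, sketched below. UnitEvent and unfinishedUnits are hypothetical names, and the sketch assumes each unitId appears in at most one start/end pair within the events passed in.

```typescript
// Hypothetical minimal event shape for pairing.
interface UnitEvent {
  eventType: "unit-start" | "unit-end";
  data: { unitId: string };
}

// Return the unitIds that have a unit-start but no later unit-end,
// i.e. units that were running when the process died.
function unfinishedUnits(events: UnitEvent[]): string[] {
  const open = new Set<string>();
  for (const e of events) {
    if (e.eventType === "unit-start") open.add(e.data.unitId);
    else open.delete(e.data.unitId);
  }
  return Array.from(open);
}
```

A combined span-style event would only be written when the unit finishes, so a crashed unit would leave no trace at all.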
Affected files
- src/resources/extensions/gsd/journal.ts: JournalEntry type + emitJournalEvent
- src/resources/extensions/gsd/auto/loop.ts: iteration-start emit
- src/resources/extensions/gsd/auto/phases.ts: unit-start, unit-end emits
- src/resources/extensions/gsd/auto/loop-deps.ts: LoopDeps.emitJournalEvent signature
- src/resources/extensions/gsd/forensics.ts: update journal summary section to use new fields
- src/resources/extensions/gsd/tests/journal*.test.ts: update fixtures