Improve GSD Journal with targeted OpenTelemetry concepts for better forensics #4

@igouss

Summary

The GSD Journal captures structured events for auto-mode iterations, but it is missing several OpenTelemetry-inspired concepts that would significantly improve the quality of forensics diagnosis. This issue tracks five targeted improvements, ordered by value.


1. Correlate journal events to pi session (highest value)

Problem: The journal (unit-start/unit-end) and the pi session JSONL (LLM calls, tool executions) are completely disconnected. Forensics infers the link by timestamp proximity, which is fragile and wastes LLM context on full-file scanning.

Fix: Add sessionId and messageOffset to unit-start:

```typescript
{
  eventType: "unit-start",
  data: {
    unitId, unitType,
    sessionId: "abc123", // pi session file identifier
    messageOffset: 42    // message count at unit start
  }
}
```

Impact: Forensics can jump directly from unit-end { status: "error" } to the exact tool call that failed, without scanning the whole session file.
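
A minimal sketch of how forensics might use the two new fields, assuming the session JSONL has already been parsed into an array. The `SessionMessage` shape, the `isError` flag, and both helper names are hypothetical; only `sessionId` and `messageOffset` come from the proposal.

```typescript
// Hypothetical shapes: only sessionId and messageOffset are part of the proposal.
interface UnitStartData {
  unitId: string;
  unitType: string;
  sessionId: string;
  messageOffset: number; // message count at unit start
}

// Assumed shape of one parsed session line; the real pi format may differ.
interface SessionMessage {
  role: string;
  text: string;
  isError?: boolean;
}

// Messages belonging to one unit: from its offset up to the next unit's
// offset in the same session, or to the end of the session.
function messagesForUnit(
  unit: UnitStartData,
  nextUnit: UnitStartData | undefined,
  session: SessionMessage[],
): SessionMessage[] {
  const end =
    nextUnit !== undefined && nextUnit.sessionId === unit.sessionId
      ? nextUnit.messageOffset
      : session.length;
  return session.slice(unit.messageOffset, end);
}

// First failing message inside the unit's window, if any.
function firstFailure(window: SessionMessage[]): SessionMessage | undefined {
  return window.find((m) => m.isError === true);
}
```

With this, forensics only needs to load the slice for the failed unit instead of handing the whole session file to the LLM.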


2. Explicit durations on unit-end

Problem: Duration must be computed by pairing unit-start.ts and unit-end.ts timestamps. Forensics can't query slow units directly.

Fix: Add durationMs to unit-end:

```typescript
{ eventType: "unit-end", data: { unitId, status, artifactVerified, durationMs: 142000 } }
```

Impact: Timeout anomaly detection (queryJournal({ eventType: "unit-end" }) + filter on durationMs) works from journal alone without cross-referencing activity logs.
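
As a rough illustration of the filter step, assuming the `unit-end` entries have already been fetched (for example via `queryJournal`, whose exact signature is not shown here). The `UnitEndEntry` type and `slowUnits` helper are hypothetical names for this sketch.

```typescript
// Hypothetical entry shape; durationMs is optional so older journals still parse.
interface UnitEndEntry {
  eventType: "unit-end";
  data: { unitId: string; status: string; durationMs?: number };
}

// Return unit-end entries at or above the threshold, slowest first.
// Entries without durationMs (written before this change) are skipped.
function slowUnits(entries: UnitEndEntry[], thresholdMs: number): UnitEndEntry[] {
  return entries
    .filter((e) => e.data.durationMs !== undefined && e.data.durationMs >= thresholdMs)
    .sort((a, b) => (b.data.durationMs ?? 0) - (a.data.durationMs ?? 0));
}
```

Treating a missing `durationMs` as "unknown" rather than zero keeps old journal files from producing misleading results.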


3. Structured error detail on unit-end

Problem: unit-end { status: "error" } carries no error detail in the journal. Forensics must parse the pi session JSONL to find what went wrong.

Fix:

```typescript
{
  eventType: "unit-end",
  data: {
    unitId, status: "error",
    error: "Bash tool failed: permission denied on /etc/hosts",
    // one of: "tool-error" | "timeout" | "context-overflow" | "unknown"
    errorType: "tool-error"
  }
}
```

Impact: Forensics can classify failure modes and generate a summary section from journal-only data.
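
One way the classification could look on the forensics side, as a sketch only. The `ErrorType` union mirrors the four values listed in the fix; `UnitEndData` and `countByErrorType` are hypothetical names, and both new fields are assumed optional so existing entries remain valid.

```typescript
// The four failure classes proposed above, as a string-literal union.
type ErrorType = "tool-error" | "timeout" | "context-overflow" | "unknown";

// Hypothetical unit-end payload; error and errorType are the proposed additions.
interface UnitEndData {
  unitId: string;
  status: string;
  error?: string;
  errorType?: ErrorType;
}

// Count failed units per error type. Failed entries that carry no
// errorType (older journals) are counted as "unknown".
function countByErrorType(ends: UnitEndData[]): Record<ErrorType, number> {
  const counts: Record<ErrorType, number> = {
    "tool-error": 0,
    "timeout": 0,
    "context-overflow": 0,
    "unknown": 0,
  };
  for (const e of ends) {
    if (e.status !== "error") continue;
    counts[e.errorType ?? "unknown"] += 1;
  }
  return counts;
}
```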


4. Resource attributes on iteration-start

Problem: Journal entries carry no metadata about the GSD version or model in use. Forensics fetches this from GSD_VERSION env and metrics.json separately, making regression correlation manual.

Fix: Add a resource block to iteration-start:

```typescript
{
  eventType: "iteration-start",
  data: { iteration },
  resource: { gsdVersion: "2.48.0", model: "anthropic/claude-sonnet-4-20250514", cwd: "/..." }
}
```

Impact: Forensics can answer "did this regression start after the model changed?" from journal alone.
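
A sketch of the regression-correlation query this enables, assuming `iteration-start` entries are read in journal order. The `ResourceChange` type and `resourceChanges` helper are hypothetical; the `resource` fields are the ones from the proposal.

```typescript
interface Resource {
  gsdVersion: string;
  model: string;
  cwd: string;
}

// Hypothetical entry shape; resource is optional for pre-change journals.
interface IterationStart {
  eventType: "iteration-start";
  data: { iteration: number };
  resource?: Resource;
}

interface ResourceChange {
  iteration: number; // first iteration that ran with the new value
  field: "gsdVersion" | "model";
  from: string;
  to: string;
}

// Report every point where gsdVersion or model differs from the
// previous iteration that carried a resource block.
function resourceChanges(starts: IterationStart[]): ResourceChange[] {
  const changes: ResourceChange[] = [];
  let prev: Resource | undefined;
  for (const s of starts) {
    const cur = s.resource;
    if (cur === undefined) continue;
    if (prev !== undefined) {
      for (const field of ["gsdVersion", "model"] as const) {
        if (prev[field] !== cur[field]) {
          changes.push({ iteration: s.data.iteration, field, from: prev[field], to: cur[field] });
        }
      }
    }
    prev = cur;
  }
  return changes;
}
```

Comparing the reported iteration numbers against the first failing iteration tells forensics whether a model or version change preceded the regression.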


5. Cross-iteration causal links for recovery chains

Problem: causedBy only works within a single flowId. When stuck detection fires and the next iteration is a recovery attempt, there is no journal link between them.

Fix: Emit causedBy on the recovery iteration's iteration-start pointing to the stuck-detected event:

```typescript
// iteration N+1 recovery
{
  flowId: "flow-bbb", seq: 1, eventType: "iteration-start",
  causedBy: { flowId: "flow-aaa", seq: 5 } // points to stuck-detected
}
```

Impact: Forensics reconstructs the full recovery chain (stuck → cache-invalidate → retry → still-stuck → hard-stop) from the data model rather than inferring it from timestamps.
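
To illustrate the reconstruction, here is a sketch that follows `causedBy` references across flows. It assumes `(flowId, seq)` uniquely identifies an entry; the `causalChain` helper and the trimmed-down `JournalEntry` shape are hypothetical.

```typescript
interface EventRef {
  flowId: string;
  seq: number;
}

// Trimmed-down entry shape for this sketch; the real JournalEntry has more fields.
interface JournalEntry {
  flowId: string;
  seq: number;
  eventType: string;
  causedBy?: EventRef;
}

const keyOf = (ref: EventRef): string => `${ref.flowId}#${ref.seq}`;

// Walk causedBy links backwards from `start` and return the chain in
// causal order (root cause first). Stops on a missing target or a cycle.
function causalChain(entries: JournalEntry[], start: JournalEntry): JournalEntry[] {
  const byKey = new Map<string, JournalEntry>();
  for (const e of entries) byKey.set(keyOf(e), e);

  const chain: JournalEntry[] = [];
  const seen = new Set<string>();
  let cur: JournalEntry | undefined = start;
  while (cur !== undefined) {
    const key = keyOf(cur);
    if (seen.has(key)) break; // guard against malformed cyclic links
    seen.add(key);
    chain.push(cur);
    cur = cur.causedBy !== undefined ? byKey.get(keyOf(cur.causedBy)) : undefined;
  }
  return chain.reverse();
}
```

Because the lookup is keyed on `flowId` plus `seq`, the same walk works for links inside one flow and for the new cross-iteration links.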


What NOT to add

  • OTLP export / external collectors — GSD is a local tool
  • Sampling — 100% event capture is correct for auto-mode frequency
  • Combined span-style start/end — the two-event model is better for crash forensics (a crash leaves unit-start with no matching unit-end, which is precisely the signal)
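
The crash signal from the last point can be sketched as a simple pairing pass. This assumes `unitId` is unique per unit attempt within the scanned journal; the `UnitEvent` type and `unmatchedStarts` helper are hypothetical.

```typescript
// Minimal shape for the two events involved in the pairing.
interface UnitEvent {
  eventType: "unit-start" | "unit-end";
  data: { unitId: string };
}

// Return unitIds that have a unit-start but no later unit-end.
// In the two-event model these are the units that were running at a crash.
function unmatchedStarts(events: UnitEvent[]): string[] {
  const open = new Set<string>();
  for (const e of events) {
    if (e.eventType === "unit-start") {
      open.add(e.data.unitId);
    } else {
      open.delete(e.data.unitId);
    }
  }
  return Array.from(open);
}
```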

Affected files

  • src/resources/extensions/gsd/journal.ts: JournalEntry type + emitJournalEvent
  • src/resources/extensions/gsd/auto/loop.ts: iteration-start emit
  • src/resources/extensions/gsd/auto/phases.ts: unit-start, unit-end emits
  • src/resources/extensions/gsd/auto/loop-deps.ts: LoopDeps.emitJournalEvent signature
  • src/resources/extensions/gsd/forensics.ts: update journal summary section to use new fields
  • src/resources/extensions/gsd/tests/journal*.test.ts: update fixtures
