Commit ac78183

Merge pull request #15 from lewisnsmith/claude/clever-torvalds-30cabc
feat: Flight Verification Layer — Trace v2, causal attribution, assertions, trust records, experiments
2 parents ef00df0 + e6b725f commit ac78183

160 files changed

Lines changed: 9827 additions & 27817 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -5,3 +5,5 @@ dist/
55
coverage/
66
project_master_roadmap copy 2.md
77
.claude/worktrees/
8+
.claude/all-nighter-log.md
9+
.claude/all-nighter-rules.md

ARCHITECTURE.md

Lines changed: 234 additions & 437 deletions
Large diffs are not rendered by default.

CLAUDE.md

Lines changed: 125 additions & 80 deletions
@@ -1,13 +1,24 @@
1-
# Flight — Agent Observability Platform
1+
# Flight — Agent Verification & Trust Layer
22

33
## What This Is
44

5-
An agent observability platform that provides structured tracing, audit, and replay for AI agent systems. Flight supports multiple ingestion paths: TypeScript SDK (direct file I/O), Python SDK (HTTP client), HTTP collector, MCP stdio proxy, and Claude Code hooks. All paths write to the same JSONL format at `~/.flight/logs/`.
5+
An agent verification and trust layer that evaluates whether what AI agents do is correct, expected, and safe. Flight captures causally-linked traces of agent runs and applies four core capabilities:
6+
7+
1. **Causal trace attribution** — every event is linked back to what caused it (tool results to tool calls, tool calls to LLM outputs, LLM calls to agent decisions)
8+
2. **Inline behavioral assertions** — declarative YAML rules evaluated post-event (non-blocking)
9+
3. **Comparative experiment harness** — YAML spec + `flight experiment run` across variants and repetitions
10+
4. **Structured trust records** — unsigned JSON summaries of every run, content-hashed for integrity
11+
12+
All ingress paths emit Trace v2. Data lives in `~/.flight/traces/` and `~/.flight/trust/`.
613

714
```
8-
Agents (TS SDK, Python SDK, HTTP, MCP Proxy, Hooks)
15+
Agents (TS SDK, Python SDK, MCP Proxy, Claude Code Hooks)
16+
17+
TraceSession (causal attribution, event append)
18+
19+
~/.flight/traces/<run_id>.trace.json
920
10-
~/.flight/logs/ (session JSONL + alerts)
21+
~/.flight/trust/<run_id>.trust.json (built at run end)
1122
```
1223

1324
## Stack
@@ -18,92 +29,128 @@ Python SDK: stdlib only (no external deps), Python 3.9+
1829
## Project Structure
1930

2031
```
21-
src/
22-
cli.ts — Commander CLI (subcommands: status, proxy, serve, setup, sprites, session, annotate, log, claude, hook)
23-
proxy.ts — stdio proxy: spawn upstream, bidirectional JSON-RPC forwarding
24-
json-rpc.ts — streaming JSON-RPC parser (readline + JSON.parse per line)
25-
logger.ts — session logger with async write queue and alert detection
26-
sdk.ts — TypeScript SDK: createFlightClient() for programmatic logging (schema-first)
27-
ingest.ts — HTTP collector server (flight serve)
28-
query.ts — SQLite-backed query layer (FlightDB, indexing, aggregation, trends)
29-
progressive-disclosure.ts — PD handler: phase logic, usage tracking, tool filtering
30-
pd-schema.ts — pure schema compression utilities
31-
file-lock.ts — advisory file locking (O_CREAT|O_EXCL)
32-
retry.ts — automatic retry manager for transient MCP errors
33-
hooks.ts — Claude Code hook installation/removal (SessionStart/End, UserPromptSubmit, PostToolUse)
34-
init.ts — Claude/Claude Code config file management (wraps mcpServers)
35-
setup.ts — interactive setup wizard (wraps servers + installs hooks)
36-
shared.ts — shared constants (DEFAULT_LOG_DIR, C colors, McpServerEntry type)
37-
summary.ts — session summary computation
38-
stats.ts — usage statistics
39-
lifecycle.ts — log compression and garbage collection
40-
export.ts — CSV/JSONL export
41-
replay.ts — tool call replay from logs
42-
log-commands.ts — CLI subcommands for log inspection (list, tail, view, filter, inspect, audit, verbose)
43-
index.ts — public API re-exports
32+
packages/flight-proxy/src/
33+
schema/
34+
trace-v2.ts — Trace v2 TS types (Trace, Event, CausalLink, AssertionResult, TrustRecord)
35+
trace-v2.schema.json — JSON Schema (canonical; used by Python SDK + tests)
36+
write.ts — TraceSession: append-event + finalize-trace writers
37+
read.ts — load + ajv-validate Trace v2 files
38+
causal.ts — CausalContext: cause-id derivation + parent span tracking
39+
proxy.ts — MCP stdio proxy emitting Trace v2 events
40+
json-rpc.ts — streaming JSON-RPC parser
41+
ingest.ts — HTTP collector accepting Trace v2 events
42+
sdk.ts — TS SDK (Trace v2 native)
43+
hooks.ts — Claude Code hooks emitting Trace v2 lifecycle events
44+
init.ts — Claude config wrapping
45+
assert/
46+
load.ts — YAML rule loader + validator
47+
rules.ts — built-in rule kinds (sequence, threshold, precondition, regex)
48+
evaluator.ts — post-event async evaluator; appends AssertionResult to trace
49+
cli.ts — `flight assert check|watch` handlers
50+
trust/
51+
record.ts — TrustRecord builder: aggregates events + assertion outcomes + anomalies
52+
cli.ts — `flight trust show|list` handlers
53+
experiment/
54+
spec.ts — Experiment YAML schema + loader
55+
runner.ts — run a spec across N variants × M repetitions
56+
compare.ts — pairwise + N-way trust-record comparison
57+
cli.ts — `flight experiment run|compare` handlers
58+
trace/
59+
cli.ts — `flight trace show|ls` (causal-tree pretty printer)
60+
cli.ts — top-level Commander wiring (delegates to subcommand cli.ts files)
61+
shared.ts — DEFAULT_TRACE_DIR, DEFAULT_TRUST_DIR, color constants
62+
index.ts — public API re-exports (createFlightClient + Trace v2 types)
4463
4564
sdk/python/
4665
flight_sdk/
47-
__init__.py — Package exports (FlightClient, LogEntry, ModelConfig)
48-
client.py — Buffered HTTP client for flight serve
49-
types.py — LogEntry, ModelConfig dataclasses
66+
__init__.py — Package exports
67+
client.py — Buffered HTTP client emitting Trace v2 events
68+
trace.py — Python dataclasses mirroring trace-v2.schema.json
69+
causal.py — Python-side cause-id helpers
5070
tests/
51-
test_client.py — Integration tests (starts flight serve, posts events, verifies JSONL)
52-
pyproject.toml — Package config (flight-sdk, Python 3.9+)
71+
test_client.py — Integration tests (starts flight serve, posts events, verifies v2)
72+
pyproject.toml — Package config (flight-sdk, Python 3.9+)
73+
74+
examples/verified-agent/ — End-to-end example
75+
agent.mjs — Runnable agent exercising direct vs probe strategies
76+
flight.assertions.yaml — Behavioral rules (sequence, threshold, regex)
77+
experiment.yaml — 2-strategy comparison spec
78+
expected-trust.json — Reference trust record shape
5379
```
5480

5581
## CLI Structure
5682

5783
```bash
58-
# Top-level commands
59-
flight serve [--port 4242] [--log-dir] # HTTP collector
60-
flight proxy --cmd <server> -- <args> # MCP stdio proxy
61-
flight status # One-line summary of active sessions
62-
flight setup # Interactive configuration wizard
63-
flight session start|end # Explicit session lifecycle
64-
flight annotate <target-id> --label <l> # Attach annotation to a run/session/turn/tool_call
65-
66-
# Log commands
67-
flight log list|tail|view|filter|inspect|alerts|summary|tools|audit|verbose
68-
flight log stats|export|replay|replay-call|gc|prune|query
69-
70-
# Claude Code integration
71-
flight claude setup # Interactive wizard
72-
flight claude hooks install|remove # Hook management
73-
flight claude init desktop|code # MCP server wrapping
74-
75-
# Internal (used by hooks)
84+
# Ingress
85+
flight proxy --cmd <server> -- <args> # MCP stdio proxy → Trace v2
86+
flight serve [--port 4242] # HTTP collector → Trace v2 (Python SDK ingest)
87+
88+
# Assertions
89+
flight assert check <trace> # Run YAML assertions against a recorded trace
90+
flight assert watch # Live-evaluate assertions on incoming events
91+
92+
# Trust records
93+
flight trust show <run_id> # Print the trust record for a run
94+
flight trust list # List trust records
95+
96+
# Experiments
97+
flight experiment run <spec.yaml> # Run an experiment spec; emit comparison record
98+
flight experiment compare <run_a> <run_b> # Pairwise comparison of two trust records
99+
100+
# Trace inspection
101+
flight trace show <trace> # Inspect a Trace v2 file (causal tree)
102+
flight trace ls # List traces under ~/.flight/traces/
103+
104+
# Claude Code integration (internal + management)
76105
flight hook session-start|session-end|user-prompt-submit|post-tool-use
106+
flight claude install|uninstall # Hook + slash command management
77107
```
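
The `flight experiment compare` command above works on trust records, and Key Patterns describes the comparison output as per-metric min/max/mean/stddev. A rough sketch of that aggregation follows. It is not the project's `compareResults()` code: `MetricSummary` and `summarizeMetric` are hypothetical names, and population standard deviation is an assumption, since the source does not say which variant is used.

```typescript
// Hypothetical summary shape; field names follow the "min/max/mean/stddev" wording.
interface MetricSummary {
  min: number;
  max: number;
  mean: number;
  stddev: number;
}

// Summarizes one metric across the repetitions of a single variant.
// Assumption: population standard deviation (divide by n, not n - 1).
function summarizeMetric(values: number[]): MetricSummary {
  if (values.length === 0) {
    throw new Error("summarizeMetric needs at least one value");
  }
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / values.length;
  return {
    min: Math.min(...values),
    max: Math.max(...values),
    mean,
    stddev: Math.sqrt(variance),
  };
}
```

A caller could run this once per metric (for example latency or cost) and per variant, then compare the summaries side by side.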
78108

79-
## Log Schema
109+
## Trace v2 Schema
110+
111+
The canonical schema lives at `packages/flight-proxy/src/schema/trace-v2.schema.json`. All ingress paths must emit documents that validate against it.
112+
113+
**Top-level `Trace` document:**
114+
115+
| Field | Required | Description |
116+
|-------|:--------:|-------------|
117+
| `schema_version` | yes | `"2.0"` |
118+
| `trace_id` | yes | ULID — unique per trace file |
119+
| `run_id` | yes | ULID — groups related traces (e.g., experiment repetitions) |
120+
| `started_at` | yes | ISO 8601 |
121+
| `ended_at` | no | set by `finalize()` |
122+
| `input_context` | yes | prompt or task context |
123+
| `events` | yes | ordered `Event[]` |
124+
| `assertions` | yes | `AssertionResult[]` appended post-evaluation |
125+
| `anomalies` | yes | `Anomaly[]` (loop, error_recovery, assertion_fail, schema_violation) |
126+
| `metadata` | yes | model, agent_id, provider, token_counts, cost_usd |
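
The table above can be read as a TypeScript shape plus a list of required keys. The sketch below is illustrative only: the canonical definition is the JSON Schema file, and `TraceV2`, `REQUIRED_TRACE_FIELDS`, and `missingTraceFields` are hypothetical names, not the types exported by `trace-v2.ts`. Nested types are left as `unknown` because the table does not define them.

```typescript
// Hypothetical mirror of the top-level Trace table; nested shapes are not modeled.
interface TraceV2 {
  schema_version: "2.0";
  trace_id: string; // ULID, unique per trace file
  run_id: string; // ULID, groups related traces
  started_at: string; // ISO 8601
  ended_at?: string; // optional until the trace is finalized
  input_context: unknown;
  events: unknown[];
  assertions: unknown[];
  anomalies: unknown[];
  metadata: Record<string, unknown>;
}

// Every field marked "yes" in the Required column.
const REQUIRED_TRACE_FIELDS = [
  "schema_version",
  "trace_id",
  "run_id",
  "started_at",
  "input_context",
  "events",
  "assertions",
  "anomalies",
  "metadata",
] as const;

// Returns the required field names that are absent from a candidate document.
function missingTraceFields(doc: Partial<TraceV2>): string[] {
  return REQUIRED_TRACE_FIELDS.filter((field) => !(field in doc));
}
```

This only checks key presence. Per the project structure listing, real validation is done with ajv against the schema file.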
80127

81-
Required fields: `session_id`, `timestamp`, `event_type`
82-
Event types: `tool_call`, `tool_result`, `agent_action`, `evaluation`, `lifecycle`
83-
Optional fields: `run_id`, `agent_id`, `model_config`, `chosen_action`, `execution_outcome`, `evaluator_score`, `labels`, `metadata`, `call_id`, `direction`, `method`, `tool_name`, `payload`, `error`, `latency_ms`, `error_recovery_anomaly`, `pd_active`, `schema_tokens_saved`
128+
**`Event` fields:** `event_id`, `span_id`, `parent_span_id?`, `kind`, `timestamp`, `causal_link?`, `payload`, `latency_ms?`, `error?`
129+
130+
**Event kinds:** `lifecycle.run_start`, `lifecycle.run_end`, `llm.call`, `llm.result`, `tool_call`, `tool_result`, `agent.decision`, `assertion.evaluated`
131+
132+
**`CausalLink`:** `caused_by_event_id`, `reason` (`tool_result_consumed | llm_output_emitted | user_input | scheduled | explicit`), `notes?`
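
Because each event may carry a `causal_link` that points at an earlier event, a trace can be walked backwards from any event to its root cause. The sketch below shows one way to do that. `TraceEvent` and `causalChain` are hypothetical names, and only the fields needed for the walk are modeled.

```typescript
type CausalReason =
  | "tool_result_consumed"
  | "llm_output_emitted"
  | "user_input"
  | "scheduled"
  | "explicit";

interface CausalLink {
  caused_by_event_id: string;
  reason: CausalReason;
  notes?: string;
}

// Reduced, hypothetical Event shape: just what the walk needs.
interface TraceEvent {
  event_id: string;
  kind: string;
  causal_link?: CausalLink;
}

// Follows causal_link pointers from one event back to its root cause.
// Returns event ids ordered from the starting event to the root.
function causalChain(events: TraceEvent[], startId: string): string[] {
  const byId: Record<string, TraceEvent> = Object.create(null);
  for (const e of events) byId[e.event_id] = e;

  const chain: string[] = [];
  let current: TraceEvent | undefined = byId[startId];
  // The indexOf guard stops the walk if malformed data contains a cycle.
  while (current !== undefined && chain.indexOf(current.event_id) === -1) {
    chain.push(current.event_id);
    const link: CausalLink | undefined = current.causal_link;
    current = link === undefined ? undefined : byId[link.caused_by_event_id];
  }
  return chain;
}
```

A pretty printer such as `flight trace show` presumably builds on similar links, though its actual implementation is not shown in this diff.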
133+
134+
**Data locations:**
135+
- `~/.flight/traces/<run_id>.trace.json` — Trace v2 files
136+
- `~/.flight/trust/<run_id>.trust.json` — trust records
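
Trust records are described elsewhere in this file as content-hashed over the canonical JSON of the trace. Hashing is only stable if serialization is deterministic, so the sketch below shows one simple canonical form: object keys sorted, no whitespace. This is an assumption about what "canonical JSON" means here, since the source does not specify the scheme, and `canonicalJson` is a hypothetical helper. The input is assumed to be plain JSON data.

```typescript
// Serializes a JSON value with object keys sorted, so equal data yields equal text.
// Assumption: input is plain JSON data (no functions, symbols, or class instances).
function canonicalJson(value: unknown): string {
  if (value === undefined) return "null"; // matches JSON.stringify inside arrays
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalJson(item)).join(",") + "]";
  }
  const obj = value as Record<string, unknown>;
  const keys = Object.keys(obj)
    .filter((key) => obj[key] !== undefined) // JSON.stringify drops undefined members
    .sort();
  const members = keys.map((key) => JSON.stringify(key) + ":" + canonicalJson(obj[key]));
  return "{" + members.join(",") + "}";
}
```

A SHA-256 digest of the resulting string (for example via Node's `crypto.createHash("sha256")`) would then give a hash that does not depend on key order.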
84137

85138
## Claude Code Integration
86139

87140
### Hooks (always active)
88-
Installed in `~/.claude/settings.json` by `flight claude setup`:
89-
- **SessionStart** → `flight hook session-start` → creates active session marker
90-
- **SessionEnd** → `flight hook session-end` → outputs summary, triggers compression/GC
91-
- **UserPromptSubmit** → `flight hook user-prompt-submit` → logs user prompt submissions
92-
- **PostToolUse** → `flight hook post-tool-use` → logs tool calls to `<session>_tools.jsonl`
141+
Installed in `~/.claude/settings.json` by `flight claude install`:
142+
- **SessionStart** → `flight hook session-start` → opens a TraceSession with a `lifecycle.run_start` event
143+
- **SessionEnd** → `flight hook session-end` → finalizes trace, builds trust record
144+
- **UserPromptSubmit** → `flight hook user-prompt-submit` → records user input as `user_input` event
145+
- **PostToolUse** → `flight hook post-tool-use` → records `tool_call` + `tool_result` events
93146

94147
### MCP Proxy Wrapping (optional)
95-
`flight claude init code --apply` rewrites `~/.claude.json` mcpServers.
148+
`flight proxy --cmd <server> -- <args>` wraps any MCP server transparently.
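
The proxy sits between client and server and reads newline-delimited JSON-RPC messages from each side. The framing step can be sketched without any stream library, as below. `LineJsonBuffer` is a hypothetical class, not the project's `parseJsonRpcStream`, and silently skipping malformed lines is an assumption about error handling.

```typescript
// Accumulates text chunks and returns one parsed message per complete line.
class LineJsonBuffer {
  private pending = "";

  push(chunk: string): unknown[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is an incomplete line (or "") and waits for more data.
    this.pending = lines.pop() ?? "";

    const messages: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed.length === 0) continue;
      try {
        messages.push(JSON.parse(trimmed));
      } catch {
        // Assumption: malformed lines are skipped rather than aborting the stream.
      }
    }
    return messages;
  }
}
```

Each chunk from the upstream process's stdout would be passed to `push()`, and each returned message forwarded and recorded. Messages split across chunk boundaries are held until their newline arrives.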
96149

97150
### Slash Commands
98-
Installed in `~/.claude/commands/` by `flight claude setup`:
99-
- **`/flight`** — quick session audit (runs `flight log audit`)
100-
- **`/flight-log`** — comprehensive view (runs `flight log verbose`)
101-
102-
### Data Locations
103-
- `~/.flight/logs/session_*.jsonl` — session recordings
104-
- `~/.flight/logs/<session>_tools.jsonl` — tool call metadata from hooks
105-
- `~/.flight/alerts.jsonl` — hallucination hints, loops, errors
106-
- `~/.flight/usage/` — token usage statistics
151+
Installed in `~/.claude/commands/` by `flight claude install`:
152+
- **`/flight`**: `flight trace show` for the current session
153+
- **`/flight-log`**: `flight trust show` for the current run
107154

108155
## Commands
109156

@@ -119,20 +166,18 @@ Python SDK tests: `cd sdk/python && python3 -m pytest tests/ -v`
119166

120167
## Key Patterns
121168

122-
- **Handler result objects**: `PDResponseResult` carries rewritten responses, log metadata, and status messages in one return value.
123-
- **Async write queue** — Logger batches writes with a flush timer; `closeSync()` drains synchronously for signal handlers.
124-
- **SDK uses logEntry**: `createFlightClient()` constructs `LogEntry` objects directly via `logger.logEntry()` (schema-first, no fake JSON-RPC).
125-
- **HTTP collector**: `startCollector()` uses Node built-in `http.createServer`, validates entries, batches writes per session.
126-
- **Python SDK buffering** — entries buffer in memory, flush every 1s or 100 entries via `urllib.request` POST to `/ingest`.
127-
- **JSON-RPC streaming**: `parseJsonRpcStream` is a newline-delimited JSON parser on Node readable streams.
128-
- **Progressive disclosure** — Phase 1 (observation), Phase 2 (schema compression), Phase 3 (compression + filtering).
129-
- **SQLite query layer**: `FlightDB` indexes JSONL files into SQLite for cross-session queries, aggregation by tool, and daily trends.
130-
- **Alert detection** — Error-recovery anomalies (different tool called after error), loop detection (same tool 5x in 60s).
169+
- **TraceSession** — unified API wrapping the writer + `CausalContext`; call `recordEvent()`, get back a fully linked `Event` with `causal_link` filled automatically.
170+
- **Causal attribution rules**: `tool_result` links to matching `tool_call` by `call_id`; `tool_call` links to most recent `llm.result`; `llm.call` links to most recent `agent.decision` or `lifecycle.run_start`.
171+
- **Async post-event assertion evaluation** — evaluator runs on a microtask queue; results are appended to `trace.assertions`; failures also append an `Anomaly` of kind `assertion_fail`.
172+
- **Content-hashed trust records**: `buildTrustRecord(trace)` produces a `TrustRecord` with `content_hash = sha256(canonical-JSON(trace))`; unsigned in v2 (attestation deferred).
173+
- **YAML-driven experiments**: `experiment.yaml` specifies variants + repetitions; runner spawns child processes, collects trust records, `compareResults()` produces per-metric min/max/mean/stddev.
174+
- **JSON-RPC streaming**: `parseJsonRpcStream` is a newline-delimited JSON parser on Node readable streams (unchanged from v1).
175+
- **HTTP collector**: `startCollector()` uses Node built-in `http.createServer`; accepts `POST /ingest` batches of v2 events; `POST /finalize` closes the run.
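
The causal attribution rules in the list above amount to a newest-first scan over prior events. A minimal sketch follows, assuming `call_id` is available as a top-level field on the event summary (in Trace v2 it may live inside `payload`). `PriorEvent` and `deriveCauseId` are hypothetical names, not the `CausalContext` API.

```typescript
// Hypothetical reduced view of an already-recorded event.
interface PriorEvent {
  event_id: string;
  kind: string;
  call_id?: string; // assumption: surfaced here for matching tool results
}

// Returns the event_id that would populate causal_link.caused_by_event_id,
// or undefined when none of the three rules applies.
function deriveCauseId(
  history: PriorEvent[],
  kind: string,
  callId?: string,
): string | undefined {
  // Scan newest-first so that "most recent" wins.
  for (const prior of history.slice().reverse()) {
    if (kind === "tool_result") {
      // Rule 1: a tool_result links to the tool_call with the same call_id.
      if (prior.kind === "tool_call" && callId !== undefined && prior.call_id === callId) {
        return prior.event_id;
      }
    } else if (kind === "tool_call") {
      // Rule 2: a tool_call links to the most recent llm.result.
      if (prior.kind === "llm.result") return prior.event_id;
    } else if (kind === "llm.call") {
      // Rule 3: an llm.call links to the most recent agent.decision or run start.
      if (prior.kind === "agent.decision" || prior.kind === "lifecycle.run_start") {
        return prior.event_id;
      }
    }
  }
  return undefined;
}
```

Matching tool results by `call_id` rather than by recency matters when several tool calls are in flight at once, since results can arrive out of order.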
131176

132177
## Testing
133178

134-
- Tests live in `test/` alongside source
135-
- Mock MCP server pattern: spawn a test server, connect via proxy, assert on JSON-RPC messages
179+
- Tests live in `packages/flight-proxy/test/` alongside source
180+
- Mock MCP server pattern: spawn a test server, connect via proxy, assert on Trace v2 output
136181
- `test/simulate/` contains validation harnesses for Claude API compatibility
137182
- `sdk/python/tests/` — Python integration tests (require built CLI for `flight serve`)
138183
- Run `npm run test` — all tests should pass before any PR

FAQ.md

Lines changed: 0 additions & 73 deletions
This file was deleted.
