lewisnsmith
diff --git a/.gitignore b/.gitignore
Lines changed: 2 additions & 0 deletions
diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
Lines changed: 234 additions & 437 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 125 additions & 80 deletions
diff --git a/FAQ.md b/FAQ.md
Lines changed: 0 additions & 73 deletions
@@ -5,3 +5,5 @@ dist/
 coverage/
 project_master_roadmap copy 2.md
 .claude/worktrees/
+.claude/all-nighter-log.md
+.claude/all-nighter-rules.md
@@ -1,13 +1,24 @@
-# Flight — Agent Observability Platform
+# Flight — Agent Verification & Trust Layer
 
 ## What This Is
 
-An agent observability platform that provides structured tracing, audit, and replay for AI agent systems. Flight supports multiple ingestion paths: TypeScript SDK (direct file I/O), Python SDK (HTTP client), HTTP collector, MCP stdio proxy, and Claude Code hooks. All paths write to the same JSONL format at `~/.flight/logs/`.
+An agent verification and trust layer that evaluates whether what AI agents do is correct, expected, and safe. Flight captures causally-linked traces of agent runs and applies four core capabilities:
+
+1. **Causal trace attribution** — every event is linked back to what caused it (tool results to tool calls, tool calls to LLM outputs, LLM calls to agent decisions)
+2. **Inline behavioral assertions** — declarative YAML rules evaluated post-event (non-blocking)
+3. **Comparative experiment harness** — YAML spec + `flight experiment run` across variants and repetitions
+4. **Structured trust records** — unsigned JSON summaries of every run, content-hashed for integrity
+
+All ingress paths emit Trace v2. Data lives in `~/.flight/traces/` and `~/.flight/trust/`.
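
The trust records listed above are described as content-hashed. Here is a minimal sketch of how such a hash could be computed, assuming SHA-256 over a key-sorted JSON serialization. Key Patterns gives the formula as `sha256(canonical-JSON(trace))` but does not define the canonical form, so the sorting rule is an assumption. `canonicalJson` and `contentHash` are hypothetical names, not Flight's API:

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: serialize a JSON value with object keys sorted
// recursively, so that key order does not affect the resulting hash.
// The real canonical form used by Flight is not specified in this section.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalJson).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .filter((key) => obj[key] !== undefined) // mirror JSON.stringify: skip undefined
      .sort()
      .map((key) => JSON.stringify(key) + ":" + canonicalJson(obj[key]));
    return "{" + entries.join(",") + "}";
  }
  // Primitives; undefined inside arrays becomes null, as in JSON.stringify.
  return JSON.stringify(value) ?? "null";
}

// Hypothetical helper: SHA-256 hex digest of the canonical serialization.
function contentHash(trace: unknown): string {
  return createHash("sha256").update(canonicalJson(trace), "utf8").digest("hex");
}

// Same content, different key order: the two hashes match.
const hashA = contentHash({ run_id: "r1", events: [{ kind: "tool_call" }] });
const hashB = contentHash({ events: [{ kind: "tool_call" }], run_id: "r1" });
console.log(hashA === hashB); // true
```

Sorting keys before hashing means two traces that differ only in field order produce the same hash, while any change to a value produces a different one. Because the record is unsigned, the hash can reveal accidental modification but does not by itself prove who produced the record.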
 
 ```
-Agents (TS SDK, Python SDK, HTTP, MCP Proxy, Hooks)
+Agents (TS SDK, Python SDK, MCP Proxy, Claude Code Hooks)
+                    ↓
+            TraceSession (causal attribution, event append)
+                    ↓
+     ~/.flight/traces/<run_id>.trace.json
                     ↓
-              ~/.flight/logs/  (session JSONL + alerts)
+     ~/.flight/trust/<run_id>.trust.json  (built at run end)
 ```
 
 ## Stack
@@ -18,92 +29,128 @@ Python SDK: stdlib only (no external deps), Python 3.9+
 ## Project Structure
 
 ```
-src/
-  cli.ts               — Commander CLI (subcommands: status, proxy, serve, setup, sprites, session, annotate, log, claude, hook)
-  proxy.ts             — stdio proxy: spawn upstream, bidirectional JSON-RPC forwarding
-  json-rpc.ts          — streaming JSON-RPC parser (readline + JSON.parse per line)
-  logger.ts            — session logger with async write queue and alert detection
-  sdk.ts               — TypeScript SDK: createFlightClient() for programmatic logging (schema-first)
-  ingest.ts            — HTTP collector server (flight serve)
-  query.ts             — SQLite-backed query layer (FlightDB, indexing, aggregation, trends)
-  progressive-disclosure.ts — PD handler: phase logic, usage tracking, tool filtering
-  pd-schema.ts         — pure schema compression utilities
-  file-lock.ts         — advisory file locking (O_CREAT|O_EXCL)
-  retry.ts             — automatic retry manager for transient MCP errors
-  hooks.ts             — Claude Code hook installation/removal (SessionStart/End, UserPromptSubmit, PostToolUse)
-  init.ts              — Claude/Claude Code config file management (wraps mcpServers)
-  setup.ts             — interactive setup wizard (wraps servers + installs hooks)
-  shared.ts            — shared constants (DEFAULT_LOG_DIR, C colors, McpServerEntry type)
-  summary.ts           — session summary computation
-  stats.ts             — usage statistics
-  lifecycle.ts         — log compression and garbage collection
-  export.ts            — CSV/JSONL export
-  replay.ts            — tool call replay from logs
-  log-commands.ts      — CLI subcommands for log inspection (list, tail, view, filter, inspect, audit, verbose)
-  index.ts             — public API re-exports
+packages/flight-proxy/src/
+  schema/
+    trace-v2.ts          — Trace v2 TS types (Trace, Event, CausalLink, AssertionResult, TrustRecord)
+    trace-v2.schema.json — JSON Schema (canonical; used by Python SDK + tests)
+    write.ts             — TraceSession: append-event + finalize-trace writers
+    read.ts              — load + ajv-validate Trace v2 files
+    causal.ts            — CausalContext: cause-id derivation + parent span tracking
+  proxy.ts               — MCP stdio proxy emitting Trace v2 events
+  json-rpc.ts            — streaming JSON-RPC parser
+  ingest.ts              — HTTP collector accepting Trace v2 events
+  sdk.ts                 — TS SDK (Trace v2 native)
+  hooks.ts               — Claude Code hooks emitting Trace v2 lifecycle events
+  init.ts                — Claude config wrapping
+  assert/
+    load.ts              — YAML rule loader + validator
+    rules.ts             — built-in rule kinds (sequence, threshold, precondition, regex)
+    evaluator.ts         — post-event async evaluator; appends AssertionResult to trace
+    cli.ts               — `flight assert check|watch` handlers
+  trust/
+    record.ts            — TrustRecord builder: aggregates events + assertion outcomes + anomalies
+    cli.ts               — `flight trust show|list` handlers
+  experiment/
+    spec.ts              — Experiment YAML schema + loader
+    runner.ts            — run a spec across N variants × M repetitions
+    compare.ts           — pairwise + N-way trust-record comparison
+    cli.ts               — `flight experiment run|compare` handlers
+  trace/
+    cli.ts               — `flight trace show|ls` (causal-tree pretty printer)
+  cli.ts                 — top-level Commander wiring (delegates to subcommand cli.ts files)
+  shared.ts              — DEFAULT_TRACE_DIR, DEFAULT_TRUST_DIR, color constants
+  index.ts               — public API re-exports (createFlightClient + Trace v2 types)
 
 sdk/python/
   flight_sdk/
-    __init__.py        — Package exports (FlightClient, LogEntry, ModelConfig)
-    client.py          — Buffered HTTP client for flight serve
-    types.py           — LogEntry, ModelConfig dataclasses
+    __init__.py          — Package exports
+    client.py            — Buffered HTTP client emitting Trace v2 events
+    trace.py             — Python dataclasses mirroring trace-v2.schema.json
+    causal.py            — Python-side cause-id helpers
   tests/
-    test_client.py     — Integration tests (starts flight serve, posts events, verifies JSONL)
-  pyproject.toml       — Package config (flight-sdk, Python 3.9+)
+    test_client.py       — Integration tests (starts flight serve, posts events, verifies v2)
+  pyproject.toml         — Package config (flight-sdk, Python 3.9+)
+
+examples/verified-agent/   — End-to-end example
+  agent.mjs              — Runnable agent exercising direct vs probe strategies
+  flight.assertions.yaml — Behavioral rules (sequence, threshold, regex)
+  experiment.yaml        — 2-strategy comparison spec
+  expected-trust.json    — Reference trust record shape
 ```
 
 ## CLI Structure
 
 ```bash
-# Top-level commands
-flight serve [--port 4242] [--log-dir]   # HTTP collector
-flight proxy --cmd <server> -- <args>     # MCP stdio proxy
-flight status                             # One-line summary of active sessions
-flight setup                              # Interactive configuration wizard
-flight session start|end                  # Explicit session lifecycle
-flight annotate <target-id> --label <l>   # Attach annotation to a run/session/turn/tool_call
-
-# Log commands
-flight log list|tail|view|filter|inspect|alerts|summary|tools|audit|verbose
-flight log stats|export|replay|replay-call|gc|prune|query
-
-# Claude Code integration
-flight claude setup                       # Interactive wizard
-flight claude hooks install|remove        # Hook management
-flight claude init desktop|code           # MCP server wrapping
-
-# Internal (used by hooks)
+# Ingress
+flight proxy --cmd <server> -- <args>     # MCP stdio proxy → Trace v2
+flight serve [--port 4242]                # HTTP collector → Trace v2 (Python SDK ingest)
+
+# Assertions
+flight assert check <trace>               # Run YAML assertions against a recorded trace
+flight assert watch                       # Live-evaluate assertions on incoming events
+
+# Trust records
+flight trust show <run_id>                # Print the trust record for a run
+flight trust list                         # List trust records
+
+# Experiments
+flight experiment run <spec.yaml>         # Run an experiment spec; emit comparison record
+flight experiment compare <run_a> <run_b> # Pairwise comparison of two trust records
+
+# Trace inspection
+flight trace show <trace>                 # Inspect a Trace v2 file (causal tree)
+flight trace ls                           # List traces under ~/.flight/traces/
+
+# Claude Code integration (internal + management)
 flight hook session-start|session-end|user-prompt-submit|post-tool-use
+flight claude install|uninstall           # Hook + slash command management
 ```
 
-## Log Schema
+## Trace v2 Schema
+
+The canonical schema lives at `packages/flight-proxy/src/schema/trace-v2.schema.json`. All ingress paths must emit documents that validate against it.
+
+**Top-level `Trace` document:**
+
+| Field | Required | Description |
+|-------|:--------:|-------------|
+| `schema_version` | yes | `"2.0"` |
+| `trace_id` | yes | ULID — unique per trace file |
+| `run_id` | yes | ULID — groups related traces (e.g., experiment repetitions) |
+| `started_at` | yes | ISO 8601 |
+| `ended_at` | — | set by `finalize()` |
+| `input_context` | yes | prompt or task context |
+| `events` | yes | ordered `Event[]` |
+| `assertions` | yes | `AssertionResult[]` appended post-evaluation |
+| `anomalies` | yes | `Anomaly[]` (loop, error_recovery, assertion_fail, schema_violation) |
+| `metadata` | yes | model, agent_id, provider, token_counts, cost_usd |
 
-Required fields: `session_id`, `timestamp`, `event_type`
-Event types: `tool_call`, `tool_result`, `agent_action`, `evaluation`, `lifecycle`
-Optional fields: `run_id`, `agent_id`, `model_config`, `chosen_action`, `execution_outcome`, `evaluator_score`, `labels`, `metadata`, `call_id`, `direction`, `method`, `tool_name`, `payload`, `error`, `latency_ms`, `error_recovery_anomaly`, `pd_active`, `schema_tokens_saved`
+**`Event` fields:** `event_id`, `span_id`, `parent_span_id?`, `kind`, `timestamp`, `causal_link?`, `payload`, `latency_ms?`, `error?`
+
+**Event kinds:** `lifecycle.run_start`, `lifecycle.run_end`, `llm.call`, `llm.result`, `tool_call`, `tool_result`, `agent.decision`, `assertion.evaluated`
+
+**`CausalLink`:** `caused_by_event_id`, `reason` (`tool_result_consumed | llm_output_emitted | user_input | scheduled | explicit`), `notes?`
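
The field lists above can be read as TypeScript types. The following is a sketch reconstructed only from the names in this section. The authoritative definitions live in `schema/trace-v2.ts` and `trace-v2.schema.json` and may differ. `TraceEvent`, `EVENT_KINDS`, and `isEventKind` are hypothetical names, and the `payload`/`error` types are assumptions because their shapes are not given here:

```typescript
// Reasons a CausalLink may carry, as listed in this section.
type CausalReason =
  | "tool_result_consumed"
  | "llm_output_emitted"
  | "user_input"
  | "scheduled"
  | "explicit";

interface CausalLink {
  caused_by_event_id: string;
  reason: CausalReason;
  notes?: string;
}

// Event kinds, as listed in this section.
const EVENT_KINDS = [
  "lifecycle.run_start",
  "lifecycle.run_end",
  "llm.call",
  "llm.result",
  "tool_call",
  "tool_result",
  "agent.decision",
  "assertion.evaluated",
] as const;

type EventKind = (typeof EVENT_KINDS)[number];

interface TraceEvent {
  event_id: string;
  span_id: string;
  parent_span_id?: string;
  kind: EventKind;
  timestamp: string; // assumed ISO 8601, like started_at
  causal_link?: CausalLink;
  payload: unknown; // assumption: payload shape is not specified here
  latency_ms?: number;
  error?: unknown; // assumption: error shape is not specified here
}

// Hypothetical runtime guard for untyped input (e.g. events parsed from JSON).
function isEventKind(value: string): value is EventKind {
  return (EVENT_KINDS as readonly string[]).includes(value);
}

// Illustrative event; IDs and the chosen reason are placeholders.
const example: TraceEvent = {
  event_id: "evt_2",
  span_id: "span_2",
  parent_span_id: "span_1",
  kind: "tool_result",
  timestamp: "2025-01-01T00:00:00.000Z",
  causal_link: { caused_by_event_id: "evt_1", reason: "explicit" },
  payload: { ok: true },
  latency_ms: 12,
};

console.log(isEventKind(example.kind)); // true
```

Note that `user_input` appears in the list of causal reasons but not in the list of event kinds, so a guard like `isEventKind("user_input")` returns `false` under this reading.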
+
+**Data locations:**
+- `~/.flight/traces/<run_id>.trace.json` — Trace v2 files
+- `~/.flight/trust/<run_id>.trust.json` — trust records
 
 ## Claude Code Integration
 
 ### Hooks (always active)
-Installed in `~/.claude/settings.json` by `flight claude setup`:
-- **SessionStart** → `flight hook session-start` — creates active session marker
-- **SessionEnd** → `flight hook session-end` — outputs summary, triggers compression/GC
-- **UserPromptSubmit** → `flight hook user-prompt-submit` — logs user prompt submissions
-- **PostToolUse** → `flight hook post-tool-use` — logs tool calls to `<session>_tools.jsonl`
+Installed in `~/.claude/settings.json` by `flight claude install`:
+- **SessionStart** → `flight hook session-start` — opens a TraceSession with a `lifecycle.run_start` event
+- **SessionEnd** → `flight hook session-end` — finalizes trace, builds trust record
+- **UserPromptSubmit** → `flight hook user-prompt-submit` — records user input as a `user_input` event
+- **PostToolUse** → `flight hook post-tool-use` — records `tool_call` + `tool_result` events
 
 ### MCP Proxy Wrapping (optional)
-`flight claude init code --apply` rewrites `~/.claude.json` mcpServers.
+`flight proxy --cmd <server> -- <args>` wraps any MCP server transparently.
 
 ### Slash Commands
-Installed in `~/.claude/commands/` by `flight claude setup`:
-- **`/flight`** — quick session audit (runs `flight log audit`)
-- **`/flight-log`** — comprehensive view (runs `flight log verbose`)
-
-### Data Locations
-- `~/.flight/logs/session_*.jsonl` — session recordings
-- `~/.flight/logs/<session>_tools.jsonl` — tool call metadata from hooks
-- `~/.flight/alerts.jsonl` — hallucination hints, loops, errors
-- `~/.flight/usage/` — token usage statistics
+Installed in `~/.claude/commands/` by `flight claude install`:
+- **`/flight`** — `flight trace show` for the current session
+- **`/flight-log`** — `flight trust show` for the current run
 
 ## Commands
 
@@ -119,20 +166,18 @@ Python SDK tests: `cd sdk/python && python3 -m pytest tests/ -v`
 
 ## Key Patterns
 
-- **Handler result objects** — `PDResponseResult` carries rewritten responses, log metadata, and status messages in one return value.
-- **Async write queue** — Logger batches writes with a flush timer; `closeSync()` drains synchronously for signal handlers.
-- **SDK uses logEntry** — `createFlightClient()` constructs `LogEntry` objects directly via `logger.logEntry()` (schema-first, no fake JSON-RPC).
-- **HTTP collector** — `startCollector()` uses Node built-in `http.createServer`, validates entries, batches writes per session.
-- **Python SDK buffering** — entries buffer in memory, flush every 1s or 100 entries via `urllib.request` POST to `/ingest`.
-- **JSON-RPC streaming** — `parseJsonRpcStream` is a newline-delimited JSON parser on Node readable streams.
-- **Progressive disclosure** — Phase 1 (observation), Phase 2 (schema compression), Phase 3 (compression + filtering).
-- **SQLite query layer** — `FlightDB` indexes JSONL files into SQLite for cross-session queries, aggregation by tool, and daily trends.
-- **Alert detection** — Error-recovery anomalies (different tool called after error), loop detection (same tool 5x in 60s).
+- **TraceSession** — unified API wrapping the writer + `CausalContext`; call `recordEvent()`, get back a fully linked `Event` with `causal_link` filled automatically.
+- **Causal attribution rules** — `tool_result` links to matching `tool_call` by `call_id`; `tool_call` links to most recent `llm.result`; `llm.call` links to most recent `agent.decision` or `lifecycle.run_start`.
+- **Async post-event assertion evaluation** — evaluator runs on a microtask queue; results are appended to `trace.assertions`; failures also append an `Anomaly` of kind `assertion_fail`.
+- **Content-hashed trust records** — `buildTrustRecord(trace)` produces a `TrustRecord` with `content_hash = sha256(canonical-JSON(trace))`; unsigned in v2 (attestation deferred).
+- **YAML-driven experiments** — `experiment.yaml` specifies variants + repetitions; runner spawns child processes, collects trust records, `compareResults()` produces per-metric min/max/mean/stddev.
+- **JSON-RPC streaming** — `parseJsonRpcStream` is a newline-delimited JSON parser on Node readable streams (unchanged from v1).
+- **HTTP collector** — `startCollector()` uses Node built-in `http.createServer`; accepts `POST /ingest` batches of v2 events; `POST /finalize` closes the run.
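
The causal attribution rules in the list above can be sketched as a lookup over the events recorded so far. This is an illustration under stated assumptions, not the code in `causal.ts`: `Ev`, `lastMatching`, and `deriveCauseId` are hypothetical names, `call_id` is assumed to sit directly on the event, and "most recent `agent.decision` or `lifecycle.run_start`" is read as the latest event of either kind. The sketch returns only the cause ID, because the mapping from rule to `reason` value is not specified here:

```typescript
type Kind =
  | "lifecycle.run_start"
  | "llm.call"
  | "llm.result"
  | "tool_call"
  | "tool_result"
  | "agent.decision";

interface Ev {
  event_id: string;
  kind: Kind;
  call_id?: string; // assumption: where call_id is stored is not specified
}

// Hypothetical helper: most recent prior event satisfying a predicate.
function lastMatching(
  prior: readonly Ev[],
  pred: (e: Ev) => boolean,
): Ev | undefined {
  for (let i = prior.length - 1; i >= 0; i--) {
    const e = prior[i];
    if (e !== undefined && pred(e)) return e;
  }
  return undefined;
}

// Returns the event_id that caused `next`, or undefined if no rule applies.
function deriveCauseId(prior: readonly Ev[], next: Ev): string | undefined {
  switch (next.kind) {
    case "tool_result":
      // tool_result -> matching tool_call, by call_id
      return lastMatching(
        prior,
        (e) =>
          e.kind === "tool_call" &&
          e.call_id !== undefined &&
          e.call_id === next.call_id,
      )?.event_id;
    case "tool_call":
      // tool_call -> most recent llm.result
      return lastMatching(prior, (e) => e.kind === "llm.result")?.event_id;
    case "llm.call":
      // llm.call -> most recent agent.decision or lifecycle.run_start
      return lastMatching(
        prior,
        (e) => e.kind === "agent.decision" || e.kind === "lifecycle.run_start",
      )?.event_id;
    default:
      return undefined;
  }
}

const log: Ev[] = [
  { event_id: "e1", kind: "lifecycle.run_start" },
  { event_id: "e2", kind: "llm.call" },
  { event_id: "e3", kind: "llm.result" },
  { event_id: "e4", kind: "tool_call", call_id: "c1" },
];

const result: Ev = { event_id: "e5", kind: "tool_result", call_id: "c1" };
console.log(deriveCauseId(log, result)); // "e4"
```

A linear scan is enough for a sketch. An implementation handling long runs might instead keep a map from `call_id` to event ID and a pointer to the latest event of each kind.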
 
 ## Testing
 
-- Tests live in `test/` alongside source
-- Mock MCP server pattern: spawn a test server, connect via proxy, assert on JSON-RPC messages
+- Tests live in `packages/flight-proxy/test/` alongside source
+- Mock MCP server pattern: spawn a test server, connect via proxy, assert on Trace v2 output
 - `test/simulate/` contains validation harnesses for Claude API compatibility
 - `sdk/python/tests/` — Python integration tests (require built CLI for `flight serve`)
 - Run `npm run test` — all tests should pass before any PR