lewisnsmith
diff --git a/.claude/commands/flight-log.md b/.claude/commands/flight-log.md
Lines changed: 2 additions & 2 deletions
diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
Lines changed: 39 additions & 3 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 15 additions & 3 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 17 additions & 4 deletions
diff --git a/FAQ.md b/FAQ.md
Lines changed: 4 additions & 4 deletions
diff --git a/README.md b/README.md
Lines changed: 80 additions & 23 deletions
@@ -1,4 +1,4 @@
-Run `flight log audit` to display a full audit of all tool calls from the current session.
+Run `flight logs audit` to display a full audit of all tool calls from the current session.
 
 Read the output carefully. Present a concise summary to the user:
 
@@ -9,4 +9,4 @@ Read the output carefully. Present a concise summary to the user:
 
 If there are errors or suspicious patterns, offer to investigate the specific tool calls or help fix the underlying issues.
 
-If the user asks about a specific tool call, you can run `flight log tools` with `--tool <name>` to filter, or read the session's `_tools.jsonl` file directly from `~/.flight/logs/` for full details.
+If the user asks about a specific tool call, you can run `flight logs tools` with `--tool <name>` to filter, or read the session's `_tools.jsonl` file directly from `~/.flight/logs/` for full details.
@@ -386,7 +386,7 @@ every subsequent call.
 
 **Trigger:** Any `server->client` response containing an `error` field.
 Every tool error is recorded as an alert for cross-session querying
-via `flight log alerts`.
+via `flight logs alerts`.
 
 ### Auto-Retry (Transparent)
 
@@ -408,7 +408,7 @@ Active session
     |
     | (after 24h default, or --compress-after)
     v
-flight log gc --compress-after 24
+flight logs gc --compress-after 24
     |
     v
 session_*.jsonl.gz                   <-- gzip compressed, original deleted
@@ -447,7 +447,43 @@ Oldest sessions deleted (FIFO)
 
 ---
 
-## 10. Key Design Decisions
+## 10. Experiment Registry
+
+The experiment registry (`src/experiments.ts`) provides a lightweight, file-per-experiment store at `~/.flight/experiments/<name>.json`. Each file is a JSON object conforming to `ExperimentEntry`:
+
+```ts
+type ExperimentEntry = {
+  name: string;
+  created_at: string;
+  description?: string;
+  tags: string[];
+  baseline_run_id?: string;
+  model_config?: Record<string, unknown>;
+  notes?: string;
+}
+```
+
+### Key properties
+
+- **Race-safe creation** — `ensureExperimentRegistered` writes with `{ flag: "wx" }` (O_EXCL). Concurrent callers are safe; exactly one wins `created: true`.
+- **Idempotent** — calling `ensureExperimentRegistered` a second time returns `{ created: false }` without rewriting the file.
+- **Merge semantics** — `createOrUpdateExperiment` merges patch fields (arrays replace, not append) while preserving `created_at`.
+- **Graceful reads** — `listExperiments` skips files with invalid JSON or wrong shape, logging a warning per skipped file.
+- **Auto-creation** — `flight run --experiment <name>` calls `ensureExperimentRegistered` before `runSessionStart`; on `{ created: true }` it prints a one-line stderr hint.
+
+### CLI verbs
+
+| Command | Description |
+|---------|-------------|
+| `flight experiment new <name>` | Register or update an experiment (description, tags, baseline, model, notes) |
+| `flight experiment list` | Table of all experiments with run counts from SQLite |
+| `flight experiment show <name>` | Metadata + recent runs |
+| `flight experiment diff <a> <b>` | Delegates to `compareCommand` with sessions from both experiments |
+| `flight experiment export <name>` | Streams research JSONL per run to stdout (unbuffered) |
+
+---
+
+## 11. Key Design Decisions
 
 ### Why STDIO (not HTTP)
 
 
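The race-safe creation described in the ARCHITECTURE.md hunk above can be illustrated with a minimal sketch. It is not the code in `src/experiments.ts`: the helper name `registerOnce` and the temporary directory are hypothetical, and only the `{ flag: "wx" }` write and the `{ created }` result shape come from the diff.

```typescript
import { mkdirSync, mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper for illustration; not the real ensureExperimentRegistered.
function registerOnce(dir: string, name: string): { created: boolean } {
  mkdirSync(dir, { recursive: true });
  const entry = { name, created_at: new Date().toISOString(), tags: [] as string[] };
  try {
    // "wx" opens the file for writing and fails with EEXIST if it already exists,
    // so only one of several concurrent callers can create it.
    writeFileSync(join(dir, `${name}.json`), JSON.stringify(entry, null, 2), { flag: "wx" });
    return { created: true };
  } catch (err) {
    // Someone else created the file first: report that without rewriting it.
    if ((err as { code?: string }).code === "EEXIST") return { created: false };
    throw err;
  }
}

// Usage against a throwaway directory (assumed location, not ~/.flight/experiments).
const demoDir = mkdtempSync(join(tmpdir(), "flight-exp-"));
console.log(registerOnce(demoDir, "bench-a")); // first call creates the file
console.log(registerOnce(demoDir, "bench-a")); // second call leaves it untouched
```

Relying on the exclusive-create flag avoids a separate "does it exist?" check, which would leave a window between the check and the write.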
@@ -7,6 +7,18 @@
 - `/flight-compare` slash command: 3-bullet experiment diff (winner, biggest delta, suggested next test) via `flight experiment diff`.
 - `/flight-annotate` slash command: per-turn labelling with strict one-command-per-turn output for persisting annotations via `flight annotate`.
 
+## 1.5.0
+
+### Breaking
+- **`flight log` renamed to `flight logs`** (plural). All subcommands are unchanged. There is no deprecation shim — update any scripts that call `flight log <subcommand>`. Re-run `flight claude setup` to update installed slash commands.
+
+### Added
+- `flight run --agent <agent> [--experiment <id>] [--model <name>]` — start a run with a human-friendly output (`Started run <runId> session <sessionId>`). Mirrors `session start` options.
+- `flight show <session-id>` — view a recorded session (alias for `flight logs view`).
+- `flight logs` bare invocation (no subcommand) now behaves like `flight logs list`.
+- `flight experiment {new,list,show,diff,export}` — experiment registry and cross-run analysis commands backed by `~/.flight/experiments/<name>.json`.
+- Auto-registration: `flight run --experiment <name>` creates the experiment registry file on first use and prints a one-line stderr hint to add description/tags.
+
 ## 1.4.0
 
 ### Removed
@@ -27,7 +39,7 @@
 ## 1.2.0
 
 ### Added
-- `flight log audit` — rich audit view of tool calls for the current session (powers `/flight-log` slash command)
+- `flight logs audit` — rich audit view of tool calls for the current session (powers `/flight-log` slash command)
 - `/flight-log` slash command installed by `flight setup`
 - Active session marker (`~/.flight/logs/.active_session`) for hook-aware session resolution
 - `mergeSessionUsage()` exported for programmatic usage tracking in progressive disclosure
@@ -62,8 +74,8 @@
 - `flight proxy --cmd <command>` — start the proxy
 - `flight init claude` / `flight init claude-code` — auto-configure MCP clients
 - `flight setup` / `flight setup --remove` — zero-config setup with Claude Code hooks
-- `flight log list|view|tail|filter|inspect|alerts|summary` — log inspection
-- `flight log gc|prune` — log lifecycle management
+- `flight logs list|view|tail|filter|inspect|alerts|summary` — log inspection
+- `flight logs gc|prune` — log lifecycle management
 - `flight export <session> --format csv|jsonl` — research export
 - `flight replay <call-id> --cmd <server>` — call replay
 - `flight stats [session]` — token metrics and tool breakdown
 
@@ -19,7 +19,7 @@ Python SDK: stdlib only (no external deps), Python 3.9+
 
 ```
 src/
-  cli.ts               — Commander CLI (subcommands: serve, proxy, log, claude, hook)
+  cli.ts               — Commander CLI (subcommands: serve, proxy, run, show, logs, claude, hook)
   proxy.ts             — stdio proxy: spawn upstream, bidirectional JSON-RPC forwarding
   json-rpc.ts          — streaming JSON-RPC parser (readline + JSON.parse per line)
   logger.ts            — session logger with async write queue and alert detection
@@ -40,6 +40,7 @@ src/
   export.ts            — CSV/JSONL export
   replay.ts            — tool call replay from logs
   log-commands.ts      — CLI subcommands for log inspection (list, tail, view, filter, inspect, audit, verbose)
+  experiments.ts       — Experiment registry: per-file JSON store at ~/.flight/experiments/<name>.json
   index.ts             — public API re-exports
 
 sdk/python/
@@ -58,10 +59,20 @@ sdk/python/
 # Top-level commands
 flight serve [--port 4242] [--log-dir]   # HTTP collector
 flight proxy --cmd <server> -- <args>     # MCP stdio proxy
+flight run --agent <agent> [--experiment <id>] [--model <name>]  # Start a run
+flight show <session-id>                  # View a recorded session
+flight logs                               # List sessions (same as flight logs list)
 
 # Log commands
-flight log list|tail|view|filter|inspect|alerts|summary|tools|audit|verbose
-flight log stats|export|replay|gc|prune|query
+flight logs list|tail|view|filter|inspect|alerts|summary|tools|audit|verbose
+flight logs stats|export|replay|gc|prune|query
+
+# Experiment registry
+flight experiment new <name>         # Register/update experiment
+flight experiment list               # Table with run counts
+flight experiment show <name>        # Metadata + runs
+flight experiment diff <a> <b>       # Cross-experiment comparison
+flight experiment export <name>      # Research JSONL to stdout
 
 # Claude Code integration
 flight claude setup                       # Interactive wizard
@@ -72,7 +83,7 @@ flight claude init desktop|code           # MCP server wrapping
 flight hook session-start|session-end|post-tool-use
 ```
 
-Old command paths (`flight setup`, `flight hooks`, `flight init`, `flight stats`, `flight export`, `flight replay`) are deprecated aliases that print a warning and delegate.
+Note: `flight log` (singular) was renamed to `flight logs` (plural) in 1.5.0 — update any scripts accordingly. There is no deprecation shim.
 
 ## Log Schema
 
@@ -100,6 +111,7 @@ Installed in `~/.claude/commands/` by `flight claude setup`:
 - **`/flight-annotate`** — labels each turn and emits `flight annotate` shell commands to persist labels (runs `flight logs verbose`)
 
 ### Data Locations
+- `~/.flight/experiments/<name>.json` — experiment registry (one JSON file per experiment)
 - `~/.flight/logs/session_*.jsonl` — session recordings
 - `~/.flight/logs/<session>_tools.jsonl` — tool call metadata from hooks
 - `~/.flight/alerts.jsonl` — hallucination hints, loops, errors
@@ -128,6 +140,7 @@ Python SDK tests: `cd sdk/python && python3 -m pytest tests/ -v`
 - **Progressive disclosure** — Phase 1 (observation), Phase 2 (schema compression), Phase 3 (compression + filtering).
 - **SQLite query layer** — `FlightDB` indexes JSONL files into SQLite for cross-session queries, aggregation by tool, and daily trends.
 - **Alert detection** — Error-recovery anomalies (different tool called after error), loop detection (same tool 5x in 60s).
+- **Experiment registry** — `src/experiments.ts` stores one JSON file per experiment at `~/.flight/experiments/<name>.json`. Race-safe creation via O_EXCL (`flag: "wx"`); `createOrUpdateExperiment` merges patches (arrays replace). `flight run --experiment` auto-registers and prints a hint on first use.
 
 ## Testing
 
 
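The merge rule mentioned in the CLAUDE.md hunk above (patches merge, arrays replace, `created_at` is preserved) might look roughly like the following. This is a sketch under assumptions: `mergeExperiment` is a hypothetical pure function, whereas the real `createOrUpdateExperiment` presumably also reads and writes the registry file. Skipping `undefined` patch values is likewise an assumption, not something the diff states.

```typescript
type ExperimentEntry = {
  name: string;
  created_at: string;
  description?: string;
  tags: string[];
  baseline_run_id?: string;
  model_config?: Record<string, unknown>;
  notes?: string;
};

// Hypothetical pure merge for illustration only.
function mergeExperiment(
  existing: ExperimentEntry,
  patch: Partial<ExperimentEntry>,
): ExperimentEntry {
  // Assumption: fields explicitly set to undefined do not erase existing values.
  const defined = Object.fromEntries(
    Object.entries(patch).filter(([, value]) => value !== undefined),
  );
  return {
    ...existing,
    ...defined, // arrays such as tags are replaced wholesale, not appended
    created_at: existing.created_at, // the original creation time always wins
  };
}

const merged = mergeExperiment(
  { name: "bench-a", created_at: "2026-04-17T12:00:00.000Z", tags: ["fast", "cheap"] },
  { tags: ["streaming"], notes: "Compare against bench-b" },
);
console.log(merged);
```

Replacing arrays rather than appending keeps repeated updates predictable: running the same update twice yields the same file.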
@@ -20,7 +20,7 @@ All session logs are stored locally at:
 ~/.flight/logs/<session_id>.jsonl
 ```
 
-Each session produces one append-only JSONL file. You can list sessions with `flight log list` and inspect them with `flight log view <session>` or `flight log inspect <call-id>`.
+Each session produces one append-only JSONL file. You can list sessions with `flight logs list` and inspect them with `flight logs view <session>` or `flight logs inspect <call-id>`.
 
 ## How do I set it up with Claude Desktop?
 
@@ -52,8 +52,8 @@ Progressive Disclosure is a token optimization feature. Instead of sending full
 Use the export command to extract session data in CSV or JSONL format:
 
 ```bash
-flight log export --format csv --session <session_id> > output.csv
-flight log export --format jsonl --session <session_id> > output.jsonl
+flight logs export --format csv --session <session_id> > output.csv
+flight logs export --format jsonl --session <session_id> > output.jsonl
 ```
 
 You can also work with the raw JSONL files directly using `jq`, Python, or any tool that reads newline-delimited JSON.
@@ -67,7 +67,7 @@ Yes. All data stays on your local machine. Flight never sends data to any extern
 Flight includes a heuristic hallucination hint detector. It flags cases where the client proceeds after a server error without retrying -- a pattern that often indicates the agent is operating on assumptions rather than real data. View flagged entries with:
 
 ```bash
-flight log filter --hallucinations
+flight logs filter --hallucinations
 ```
 
 These hints are investigative leads, not definitive verdicts. They tell you where to look, not what happened.
@@ -191,36 +191,48 @@ flight serve [--port 4242] [--log-dir ~/.flight/logs]
 flight proxy --cmd <server> -- <args>
 flight proxy --cmd <server> --pd           # With progressive disclosure
 
+# Happy-path commands
+flight run --agent <agent> [--experiment <id>] [--model <name>]  # Start a run
+flight show <session-id>            # View a recorded session
+flight logs                         # List all sessions (same as flight logs list)
+
 # Session lifecycle + annotation
 flight session start --agent <agent> [--run <run-id>]
 flight session end [--session <id>] [--status completed|failed]
 flight annotate <target-id> --type run|session|turn|tool_call --label <label>
 
 # Log inspection and analysis
-flight log list                     # List all sessions
-flight log tail [--session <id>]    # Live stream a session
-flight log view <session>           # Full timeline with summary
-flight log filter --tool <name>     # Filter by tool name
-flight log filter --errors          # Show only failed calls
-flight log filter --anomalies       # Show error-recovery anomalies
-flight log inspect <call-id>        # Full request/response payload
-flight log alerts                   # Anomaly/loop/error alerts
-flight log summary [--session <id>] # Session summary statistics
-flight log tools                    # Tool call frequency breakdown
-flight log compare --run-id <id>    # Compare sessions/models within a run
-flight log stats [session]          # Usage statistics across sessions
-flight log export [session] --format research|raw|csv|jsonl
-flight log replay <call-id> --cmd <server> -- <args>
-flight log gc                       # Compress old sessions, collect garbage
-flight log prune --before <date>    # Delete sessions before a date
-flight log prune --keep <n>         # Keep only N most recent sessions
+flight logs list                     # List all sessions
+flight logs tail [--session <id>]    # Live stream a session
+flight logs view <session>           # Full timeline with summary
+flight logs filter --tool <name>     # Filter by tool name
+flight logs filter --errors          # Show only failed calls
+flight logs filter --anomalies       # Show error-recovery anomalies
+flight logs inspect <call-id>        # Full request/response payload
+flight logs alerts                   # Anomaly/loop/error alerts
+flight logs summary [--session <id>] # Session summary statistics
+flight logs tools                    # Tool call frequency breakdown
+flight logs compare --run-id <id>    # Compare sessions/models within a run
+flight logs stats [session]          # Usage statistics across sessions
+flight logs export [session] --format research|raw|csv|jsonl
+flight logs replay <call-id> --cmd <server> -- <args>
+flight logs gc                       # Compress old sessions, collect garbage
+flight logs prune --before <date>    # Delete sessions before a date
+flight logs prune --keep <n>         # Keep only N most recent sessions
 
 # Cross-session queries (SQLite-backed)
-flight log query --aggregate        # Error rates + latency percentiles by tool
-flight log query --trend            # Daily trend (totals, errors, anomalies)
-flight log query --tool <name>      # Filter by tool name
-flight log query --anomalies        # Show only error-recovery anomalies
-flight log query --after <date>     # Filter by time range
+flight logs query --aggregate        # Error rates + latency percentiles by tool
+flight logs query --trend            # Daily trend (totals, errors, anomalies)
+flight logs query --tool <name>      # Filter by tool name
+flight logs query --anomalies        # Show only error-recovery anomalies
+flight logs query --after <date>     # Filter by time range
+
+# Experiment registry
+flight experiment new <name> [--description <desc>] [--tags <csv>] [--baseline <run-id>] [--model <name>] [--notes <text>]
+flight experiment list               # Table of all experiments + run counts
+flight experiment show <name>        # Metadata + recent runs for an experiment
+flight experiment diff <name1> <name2>  # Compare runs across two experiments
+flight experiment export <name>      # Stream all runs as research JSONL to stdout
 
 # Claude Code integration
 flight claude setup                 # Interactive setup wizard
@@ -263,6 +275,51 @@ flight hook session-start|session-end|user-prompt-submit|post-tool-use
 
 ---
 
+## Experiment Registry
+
+The experiment registry provides a lightweight, file-per-experiment store at `~/.flight/experiments/<name>.json`. It lets you group and compare runs across multiple sessions.
+
+### Schema
+
+```json
+{
+  "name": "bench-a",
+  "created_at": "2026-04-17T12:00:00.000Z",
+  "description": "Baseline throughput test",
+  "tags": ["fast", "cheap"],
+  "baseline_run_id": "run_1713355200_abcd1234",
+  "model_config": { "model": "claude-sonnet-4-20250514" },
+  "notes": "Compare against bench-b with streaming enabled"
+}
+```
+
+### Workflow
+
+```bash
+# Register an experiment with metadata
+flight experiment new bench-a --description "Baseline" --tags fast,cheap --model claude-sonnet-4
+
+# Start runs that belong to this experiment
+flight run --agent my-agent --experiment bench-a
+flight run --agent my-agent --experiment bench-b
+
+# List all experiments with run counts
+flight experiment list
+
+# Inspect a specific experiment and its runs
+flight experiment show bench-a
+
+# Compare two experiments head-to-head
+flight experiment diff bench-a bench-b
+
+# Export all runs as research JSONL (for offline analysis)
+flight experiment export bench-a | jq .
+```
+
+Unknown experiments are **auto-registered** on first `flight run --experiment <name>`, with a one-line stderr hint pointing to `flight experiment new` for adding metadata. The registry files are plain JSON and fully human-editable.
+
+---
+
 ## Performance
 
 - **<5ms** added latency per tool call (streaming NDJSON, fire-and-forget log writes)
@@ -278,7 +335,7 @@ flight hook session-start|session-end|user-prompt-submit|post-tool-use
 - **One file per session**, append-only
 - **Auto-compression:** sessions older than 24h are gzip-compressed (`.jsonl.gz`)
 - **Garbage collection:** configurable max sessions (100) and max size (2 GB)
-- **Pruning:** `flight log prune --before <date>` or `--keep <n>`
+- **Pruning:** `flight logs prune --before <date>` or `--keep <n>`
 
 ---
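Since the README hunk above describes the registry files as human-editable, and the ARCHITECTURE.md hunk says `listExperiments` skips files with invalid JSON or the wrong shape, a shape check along these lines may be useful when reading the files from other tooling. It is a sketch only: `isExperimentEntry` and `parseEntry` are hypothetical names, and the exact validation Flight performs is not shown in the diff.

```typescript
type ExperimentEntry = {
  name: string;
  created_at: string;
  description?: string;
  tags: string[];
  baseline_run_id?: string;
  model_config?: Record<string, unknown>;
  notes?: string;
};

// Hypothetical type guard mirroring the ExperimentEntry fields from the diff.
function isExperimentEntry(value: unknown): value is ExperimentEntry {
  if (typeof value !== "object" || value === null || Array.isArray(value)) return false;
  const v = value as Record<string, unknown>;
  const optionalString = (x: unknown): boolean => x === undefined || typeof x === "string";
  const cfg = v.model_config;
  return (
    typeof v.name === "string" &&
    typeof v.created_at === "string" &&
    Array.isArray(v.tags) &&
    v.tags.every((t) => typeof t === "string") &&
    optionalString(v.description) &&
    optionalString(v.baseline_run_id) &&
    optionalString(v.notes) &&
    (cfg === undefined || (typeof cfg === "object" && cfg !== null && !Array.isArray(cfg)))
  );
}

// Returns undefined for invalid JSON or a wrong shape instead of throwing,
// so a caller can skip the file and carry on with the rest.
function parseEntry(text: string): ExperimentEntry | undefined {
  try {
    const parsed: unknown = JSON.parse(text);
    return isExperimentEntry(parsed) ? parsed : undefined;
  } catch {
    return undefined;
  }
}

const sample = '{"name":"bench-a","created_at":"2026-04-17T12:00:00.000Z","tags":["fast","cheap"]}';
console.log(parseEntry(sample)?.name);
console.log(parseEntry("{ not json"));
```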