Commit 3ab0d9c

Copilot and lewisnsmith authored
chore: merge origin/main and resolve docs conflicts
Co-authored-by: lewisnsmith <247513455+lewisnsmith@users.noreply.github.com>
2 parents 322d48b + 4807161 commit 3ab0d9c

14 files changed

Lines changed: 1128 additions & 80 deletions

.claude/commands/flight-log.md

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-Run `flight log audit` to display a full audit of all tool calls from the current session.
+Run `flight logs audit` to display a full audit of all tool calls from the current session.
 
 Read the output carefully. Present a concise summary to the user:
 
@@ -9,4 +9,4 @@ Read the output carefully. Present a concise summary to the user:
 
 If there are errors or suspicious patterns, offer to investigate the specific tool calls or help fix the underlying issues.
 
-If the user asks about a specific tool call, you can run `flight log tools` with `--tool <name>` to filter, or read the session's `_tools.jsonl` file directly from `~/.flight/logs/` for full details.
+If the user asks about a specific tool call, you can run `flight logs tools` with `--tool <name>` to filter, or read the session's `_tools.jsonl` file directly from `~/.flight/logs/` for full details.

ARCHITECTURE.md

Lines changed: 39 additions & 3 deletions
@@ -386,7 +386,7 @@ every subsequent call.
 
 **Trigger:** Any `server->client` response containing an `error` field.
 Every tool error is recorded as an alert for cross-session querying
-via `flight log alerts`.
+via `flight logs alerts`.
 
 ### Auto-Retry (Transparent)
 
@@ -408,7 +408,7 @@ Active session
 |
 | (after 24h default, or --compress-after)
 v
-flight log gc --compress-after 24
+flight logs gc --compress-after 24
 |
 v
 session_*.jsonl.gz <-- gzip compressed, original deleted
@@ -447,7 +447,43 @@ Oldest sessions deleted (FIFO)
 
 ---
 
-## 10. Key Design Decisions
+## 10. Experiment Registry
+
+The experiment registry (`src/experiments.ts`) provides a lightweight, file-per-experiment store at `~/.flight/experiments/<name>.json`. Each file is a JSON object conforming to `ExperimentEntry`:
+
+```ts
+type ExperimentEntry = {
+  name: string;
+  created_at: string;
+  description?: string;
+  tags: string[];
+  baseline_run_id?: string;
+  model_config?: Record<string, unknown>;
+  notes?: string;
+}
+```
+
+### Key properties
+
+- **Race-safe creation** — `ensureExperimentRegistered` writes with `{ flag: "wx" }` (O_EXCL). Concurrent callers are safe; exactly one wins `created: true`.
+- **Idempotent** — calling `ensureExperimentRegistered` a second time returns `{ created: false }` without rewriting the file.
+- **Merge semantics** — `createOrUpdateExperiment` merges patch fields (arrays replace, not append) while preserving `created_at`.
+- **Graceful reads** — `listExperiments` skips files with invalid JSON or wrong shape, logging a warning per skipped file.
+- **Auto-creation** — `flight run --experiment <name>` calls `ensureExperimentRegistered` before `runSessionStart`; on `{ created: true }` it prints a one-line stderr hint.
+
+### CLI verbs
+
+| Command | Description |
+|---------|-------------|
+| `flight experiment new <name>` | Register or update an experiment (description, tags, baseline, model, notes) |
+| `flight experiment list` | Table of all experiments with run counts from SQLite |
+| `flight experiment show <name>` | Metadata + recent runs |
+| `flight experiment diff <a> <b>` | Delegates to `compareCommand` with sessions from both experiments |
+| `flight experiment export <name>` | Streams research JSONL per run to stdout (unbuffered) |
+
+---
+
+## 11. Key Design Decisions
 
 ### Why STDIO (not HTTP)
 
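The race-safe, idempotent creation described in the new ARCHITECTURE.md section could look roughly like the sketch below. It is not taken from `src/experiments.ts`: the `dir` parameter, the default field values, and the error handling are assumptions made for illustration, and only the `{ flag: "wx" }` write and the `{ created }` return shape come from the diff.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

type ExperimentEntry = {
  name: string;
  created_at: string;
  description?: string;
  tags: string[];
  baseline_run_id?: string;
  model_config?: Record<string, unknown>;
  notes?: string;
};

// Hypothetical signature: the real function may resolve the directory itself.
function ensureExperimentRegistered(dir: string, name: string): { created: boolean } {
  fs.mkdirSync(dir, { recursive: true });
  const entry: ExperimentEntry = { name, created_at: new Date().toISOString(), tags: [] };
  try {
    // "wx" opens with O_CREAT | O_EXCL, so the write fails if the file already exists.
    fs.writeFileSync(path.join(dir, `${name}.json`), JSON.stringify(entry, null, 2), { flag: "wx" });
    return { created: true };
  } catch (err) {
    // Another caller (or an earlier call) created the file first: leave it untouched.
    if ((err as { code?: string }).code === "EEXIST") return { created: false };
    throw err;
  }
}

// Usage: a throwaway directory stands in for ~/.flight/experiments.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "flight-exp-"));
const first = ensureExperimentRegistered(demoDir, "bench-a");
const second = ensureExperimentRegistered(demoDir, "bench-a");
console.log(first.created, second.created);
```

Because the existence check and the creation happen in a single system call, there is no window between "does the file exist?" and "write it" in which a second process could slip in.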

CHANGELOG.md

Lines changed: 15 additions & 3 deletions
@@ -7,6 +7,18 @@
 - `/flight-compare` slash command: 3-bullet experiment diff (winner, biggest delta, suggested next test) via `flight experiment diff`.
 - `/flight-annotate` slash command: per-turn labelling with strict one-command-per-turn output for persisting annotations via `flight annotate`.
 
+## 1.5.0
+
+### Breaking
+- **`flight log` renamed to `flight logs`** (plural). All subcommands are unchanged. There is no deprecation shim — update any scripts that call `flight log <subcommand>`. Re-run `flight claude setup` to update installed slash commands.
+
+### Added
+- `flight run --agent <agent> [--experiment <id>] [--model <name>]` — start a run with a human-friendly output (`Started run <runId> session <sessionId>`). Mirrors `session start` options.
+- `flight show <session-id>` — view a recorded session (alias for `flight logs view`).
+- `flight logs` bare invocation (no subcommand) now behaves like `flight logs list`.
+- `flight experiment {new,list,show,diff,export}` — experiment registry and cross-run analysis commands backed by `~/.flight/experiments/<name>.json`.
+- Auto-registration: `flight run --experiment <name>` creates the experiment registry file on first use and prints a one-line stderr hint to add description/tags.
+
 ## 1.4.0
 
 ### Removed
@@ -27,7 +39,7 @@
 ## 1.2.0
 
 ### Added
-- `flight log audit` — rich audit view of tool calls for the current session (powers `/flight-log` slash command)
+- `flight logs audit` — rich audit view of tool calls for the current session (powers `/flight-log` slash command)
 - `/flight-log` slash command installed by `flight setup`
 - Active session marker (`~/.flight/logs/.active_session`) for hook-aware session resolution
 - `mergeSessionUsage()` exported for programmatic usage tracking in progressive disclosure
@@ -62,8 +74,8 @@
 - `flight proxy --cmd <command>` — start the proxy
 - `flight init claude` / `flight init claude-code` — auto-configure MCP clients
 - `flight setup` / `flight setup --remove` — zero-config setup with Claude Code hooks
-- `flight log list|view|tail|filter|inspect|alerts|summary` — log inspection
-- `flight log gc|prune` — log lifecycle management
+- `flight logs list|view|tail|filter|inspect|alerts|summary` — log inspection
+- `flight logs gc|prune` — log lifecycle management
 - `flight export <session> --format csv|jsonl` — research export
 - `flight replay <call-id> --cmd <server>` — call replay
 - `flight stats [session]` — token metrics and tool breakdown

CLAUDE.md

Lines changed: 17 additions & 4 deletions
@@ -19,7 +19,7 @@ Python SDK: stdlib only (no external deps), Python 3.9+
 
 ```
 src/
-cli.ts — Commander CLI (subcommands: serve, proxy, log, claude, hook)
+cli.ts — Commander CLI (subcommands: serve, proxy, run, show, logs, claude, hook)
 proxy.ts — stdio proxy: spawn upstream, bidirectional JSON-RPC forwarding
 json-rpc.ts — streaming JSON-RPC parser (readline + JSON.parse per line)
 logger.ts — session logger with async write queue and alert detection
@@ -40,6 +40,7 @@ src/
 export.ts — CSV/JSONL export
 replay.ts — tool call replay from logs
 log-commands.ts — CLI subcommands for log inspection (list, tail, view, filter, inspect, audit, verbose)
+experiments.ts — Experiment registry: per-file JSON store at ~/.flight/experiments/<name>.json
 index.ts — public API re-exports
 
 sdk/python/
@@ -58,10 +59,20 @@ sdk/python/
 # Top-level commands
 flight serve [--port 4242] [--log-dir] # HTTP collector
 flight proxy --cmd <server> -- <args> # MCP stdio proxy
+flight run --agent <agent> [--experiment <id>] [--model <name>] # Start a run
+flight show <session-id> # View a recorded session
+flight logs # List sessions (same as flight logs list)
 
 # Log commands
-flight log list|tail|view|filter|inspect|alerts|summary|tools|audit|verbose
-flight log stats|export|replay|gc|prune|query
+flight logs list|tail|view|filter|inspect|alerts|summary|tools|audit|verbose
+flight logs stats|export|replay|gc|prune|query
+
+# Experiment registry
+flight experiment new <name> # Register/update experiment
+flight experiment list # Table with run counts
+flight experiment show <name> # Metadata + runs
+flight experiment diff <a> <b> # Cross-experiment comparison
+flight experiment export <name> # Research JSONL to stdout
 
 # Claude Code integration
 flight claude setup # Interactive wizard
@@ -72,7 +83,7 @@ flight claude init desktop|code # MCP server wrapping
 flight hook session-start|session-end|post-tool-use
 ```
 
-Old command paths (`flight setup`, `flight hooks`, `flight init`, `flight stats`, `flight export`, `flight replay`) are deprecated aliases that print a warning and delegate.
+Note: `flight log` (singular) was renamed to `flight logs` (plural) in 1.5.0 — update any scripts accordingly. There is no deprecation shim.
 
 ## Log Schema
 
@@ -100,6 +111,7 @@ Installed in `~/.claude/commands/` by `flight claude setup`:
 - **`/flight-annotate`** — labels each turn and emits `flight annotate` shell commands to persist labels (runs `flight logs verbose`)
 
 ### Data Locations
+- `~/.flight/experiments/<name>.json` — experiment registry (one JSON file per experiment)
 - `~/.flight/logs/session_*.jsonl` — session recordings
 - `~/.flight/logs/<session>_tools.jsonl` — tool call metadata from hooks
 - `~/.flight/alerts.jsonl` — hallucination hints, loops, errors
@@ -128,6 +140,7 @@ Python SDK tests: `cd sdk/python && python3 -m pytest tests/ -v`
 - **Progressive disclosure** — Phase 1 (observation), Phase 2 (schema compression), Phase 3 (compression + filtering).
 - **SQLite query layer** — `FlightDB` indexes JSONL files into SQLite for cross-session queries, aggregation by tool, and daily trends.
 - **Alert detection** — Error-recovery anomalies (different tool called after error), loop detection (same tool 5x in 60s).
+- **Experiment registry** — `src/experiments.ts` stores one JSON file per experiment at `~/.flight/experiments/<name>.json`. Race-safe creation via O_EXCL (`flag: "wx"`); `createOrUpdateExperiment` merges patches (arrays replace). `flight run --experiment` auto-registers and prints a hint on first use.
 
 ## Testing
 
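The "merges patches (arrays replace)" behaviour noted for `createOrUpdateExperiment` can be illustrated with a small pure function. This is a sketch under assumptions: `mergeExperiment` and `ExperimentPatch` are hypothetical names, and the real implementation also handles file I/O, which is left out here.

```typescript
type ExperimentEntry = {
  name: string;
  created_at: string;
  description?: string;
  tags: string[];
  baseline_run_id?: string;
  model_config?: Record<string, unknown>;
  notes?: string;
};

// Hypothetical patch type: every field except the identity and creation time.
type ExperimentPatch = Partial<Omit<ExperimentEntry, "name" | "created_at">>;

function mergeExperiment(existing: ExperimentEntry, patch: ExperimentPatch): ExperimentEntry {
  // Object spread is shallow, so a patched `tags` array replaces the stored one
  // instead of being appended to it. `name` and `created_at` are pinned last.
  return { ...existing, ...patch, name: existing.name, created_at: existing.created_at };
}

const stored: ExperimentEntry = {
  name: "bench-a",
  created_at: "2026-04-17T12:00:00.000Z",
  description: "Baseline",
  tags: ["fast", "cheap"],
};
const merged = mergeExperiment(stored, { tags: ["slow"], notes: "rerun with streaming" });
console.log(merged.tags, merged.description, merged.created_at);
```

Replacing arrays keeps the operation idempotent: applying the same patch twice yields the same entry, which would not hold if tags were appended.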

FAQ.md

Lines changed: 4 additions & 4 deletions
@@ -20,7 +20,7 @@ All session logs are stored locally at:
 ~/.flight/logs/<session_id>.jsonl
 ```
 
-Each session produces one append-only JSONL file. You can list sessions with `flight log list` and inspect them with `flight log view <session>` or `flight log inspect <call-id>`.
+Each session produces one append-only JSONL file. You can list sessions with `flight logs list` and inspect them with `flight logs view <session>` or `flight logs inspect <call-id>`.
 
 ## How do I set it up with Claude Desktop?
 
@@ -52,8 +52,8 @@ Progressive Disclosure is a token optimization feature. Instead of sending full
 Use the export command to extract session data in CSV or JSONL format:
 
 ```bash
-flight log export --format csv --session <session_id> > output.csv
-flight log export --format jsonl --session <session_id> > output.jsonl
+flight logs export --format csv --session <session_id> > output.csv
+flight logs export --format jsonl --session <session_id> > output.jsonl
 ```
 
 You can also work with the raw JSONL files directly using `jq`, Python, or any tool that reads newline-delimited JSON.
@@ -67,7 +67,7 @@ Yes. All data stays on your local machine. Flight never sends data to any extern
 Flight includes a heuristic hallucination hint detector. It flags cases where the client proceeds after a server error without retrying -- a pattern that often indicates the agent is operating on assumptions rather than real data. View flagged entries with:
 
 ```bash
-flight logs filter --hallucinations
+flight logs filter --hallucinations
 ```
 
 These hints are investigative leads, not definitive verdicts. They tell you where to look, not what happened.
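To make the hallucination-hint heuristic from the FAQ concrete, here is a deliberately simplified sketch. The `ToolCall` shape and the rule "next call uses a different tool than the one that just failed" are assumptions for illustration; Flight's actual detector and log schema are not shown in this diff and may differ.

```typescript
// Hypothetical, reduced view of a logged call; the real log entries carry more fields.
type ToolCall = { id: string; tool: string; isError: boolean };

// Returns the ids of calls that follow a failed call without retrying the same tool.
function findHallucinationHints(calls: ToolCall[]): string[] {
  const hints: string[] = [];
  for (let i = 1; i < calls.length; i++) {
    const previous = calls[i - 1];
    const current = calls[i];
    if (previous.isError && current.tool !== previous.tool) hints.push(current.id);
  }
  return hints;
}

const session: ToolCall[] = [
  { id: "c1", tool: "read_file", isError: true },
  { id: "c2", tool: "read_file", isError: false }, // retry of the failed tool: not flagged
  { id: "c3", tool: "search", isError: true },
  { id: "c4", tool: "write_file", isError: false }, // moved on after an error: flagged
];
console.log(findHallucinationHints(session));
```

As the FAQ says, a flagged id is a lead to inspect (for example with `flight logs inspect <call-id>`), not proof that the agent acted on bad data.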

README.md

Lines changed: 80 additions & 23 deletions
@@ -191,36 +191,48 @@ flight serve [--port 4242] [--log-dir ~/.flight/logs]
 flight proxy --cmd <server> -- <args>
 flight proxy --cmd <server> --pd # With progressive disclosure
 
+# Happy-path commands
+flight run --agent <agent> [--experiment <id>] [--model <name>] # Start a run
+flight show <session-id> # View a recorded session
+flight logs # List all sessions (same as flight logs list)
+
 # Session lifecycle + annotation
 flight session start --agent <agent> [--run <run-id>]
 flight session end [--session <id>] [--status completed|failed]
 flight annotate <target-id> --type run|session|turn|tool_call --label <label>
 
 # Log inspection and analysis
-flight log list # List all sessions
-flight log tail [--session <id>] # Live stream a session
-flight log view <session> # Full timeline with summary
-flight log filter --tool <name> # Filter by tool name
-flight log filter --errors # Show only failed calls
-flight log filter --anomalies # Show error-recovery anomalies
-flight log inspect <call-id> # Full request/response payload
-flight log alerts # Anomaly/loop/error alerts
-flight log summary [--session <id>] # Session summary statistics
-flight log tools # Tool call frequency breakdown
-flight log compare --run-id <id> # Compare sessions/models within a run
-flight log stats [session] # Usage statistics across sessions
-flight log export [session] --format research|raw|csv|jsonl
-flight log replay <call-id> --cmd <server> -- <args>
-flight log gc # Compress old sessions, collect garbage
-flight log prune --before <date> # Delete sessions before a date
-flight log prune --keep <n> # Keep only N most recent sessions
+flight logs list # List all sessions
+flight logs tail [--session <id>] # Live stream a session
+flight logs view <session> # Full timeline with summary
+flight logs filter --tool <name> # Filter by tool name
+flight logs filter --errors # Show only failed calls
+flight logs filter --anomalies # Show error-recovery anomalies
+flight logs inspect <call-id> # Full request/response payload
+flight logs alerts # Anomaly/loop/error alerts
+flight logs summary [--session <id>] # Session summary statistics
+flight logs tools # Tool call frequency breakdown
+flight logs compare --run-id <id> # Compare sessions/models within a run
+flight logs stats [session] # Usage statistics across sessions
+flight logs export [session] --format research|raw|csv|jsonl
+flight logs replay <call-id> --cmd <server> -- <args>
+flight logs gc # Compress old sessions, collect garbage
+flight logs prune --before <date> # Delete sessions before a date
+flight logs prune --keep <n> # Keep only N most recent sessions
 
 # Cross-session queries (SQLite-backed)
-flight log query --aggregate # Error rates + latency percentiles by tool
-flight log query --trend # Daily trend (totals, errors, anomalies)
-flight log query --tool <name> # Filter by tool name
-flight log query --anomalies # Show only error-recovery anomalies
-flight log query --after <date> # Filter by time range
+flight logs query --aggregate # Error rates + latency percentiles by tool
+flight logs query --trend # Daily trend (totals, errors, anomalies)
+flight logs query --tool <name> # Filter by tool name
+flight logs query --anomalies # Show only error-recovery anomalies
+flight logs query --after <date> # Filter by time range
+
+# Experiment registry
+flight experiment new <name> [--description <desc>] [--tags <csv>] [--baseline <run-id>] [--model <name>] [--notes <text>]
+flight experiment list # Table of all experiments + run counts
+flight experiment show <name> # Metadata + recent runs for an experiment
+flight experiment diff <name1> <name2> # Compare runs across two experiments
+flight experiment export <name> # Stream all runs as research JSONL to stdout
 
 # Claude Code integration
 flight claude setup # Interactive setup wizard
@@ -263,6 +275,51 @@ flight hook session-start|session-end|user-prompt-submit|post-tool-use
 
 ---
 
+## Experiment Registry
+
+The experiment registry provides a lightweight, file-per-experiment store at `~/.flight/experiments/<name>.json`. It lets you group and compare runs across multiple sessions.
+
+### Schema
+
+```json
+{
+  "name": "bench-a",
+  "created_at": "2026-04-17T12:00:00.000Z",
+  "description": "Baseline throughput test",
+  "tags": ["fast", "cheap"],
+  "baseline_run_id": "run_1713355200_abcd1234",
+  "model_config": { "model": "claude-sonnet-4-20250514" },
+  "notes": "Compare against bench-b with streaming enabled"
+}
+```
+
+### Workflow
+
+```bash
+# Register an experiment with metadata
+flight experiment new bench-a --description "Baseline" --tags fast,cheap --model claude-sonnet-4
+
+# Start runs that belong to this experiment
+flight run --agent my-agent --experiment bench-a
+flight run --agent my-agent --experiment bench-b
+
+# List all experiments with run counts
+flight experiment list
+
+# Inspect a specific experiment and its runs
+flight experiment show bench-a
+
+# Compare two experiments head-to-head
+flight experiment diff bench-a bench-b
+
+# Export all runs as research JSONL (for offline analysis)
+flight experiment export bench-a | jq .
+```
+
+Unknown experiments are **auto-registered** on first `flight run --experiment <name>`, with a one-line stderr hint pointing to `flight experiment new` for adding metadata. The registry files are plain JSON and fully human-editable.
+
+---
+
 ## Performance
 
 - **<5ms** added latency per tool call (streaming NDJSON, fire-and-forget log writes)
@@ -278,7 +335,7 @@ flight hook session-start|session-end|user-prompt-submit|post-tool-use
 - **One file per session**, append-only
 - **Auto-compression:** sessions older than 24h are gzip-compressed (`.jsonl.gz`)
 - **Garbage collection:** configurable max sessions (100) and max size (2 GB)
-- **Pruning:** `flight log prune --before <date>` or `--keep <n>`
+- **Pruning:** `flight logs prune --before <date>` or `--keep <n>`
 
 ---
 
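The `--keep <n>` pruning rule listed above amounts to "sort sessions newest first, delete everything past the first N". The sketch below shows only that selection step; `SessionFile` and `selectSessionsToPrune` are hypothetical names, and how Flight actually orders sessions (file mtime, session id, or something else) is an assumption.

```typescript
// Hypothetical description of a session file on disk.
type SessionFile = { name: string; modifiedAt: number };

// Returns the names of the sessions that fall outside the `keep` most recent ones.
function selectSessionsToPrune(sessions: SessionFile[], keep: number): string[] {
  if (!Number.isInteger(keep) || keep < 0) {
    throw new RangeError("keep must be a non-negative integer");
  }
  return [...sessions] // copy so the caller's array is not reordered
    .sort((a, b) => b.modifiedAt - a.modifiedAt) // newest first
    .slice(keep)
    .map((s) => s.name);
}

const sessionFiles: SessionFile[] = [
  { name: "session_a.jsonl.gz", modifiedAt: 1000 },
  { name: "session_b.jsonl", modifiedAt: 3000 },
  { name: "session_c.jsonl", modifiedAt: 2000 },
];
console.log(selectSessionsToPrune(sessionFiles, 2));
```

Separating the selection from the deletion makes it easy to print the list for confirmation before removing anything.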
