braintrustdata
diff --git a/‎.github/workflows/ci.yml‎
Lines changed: 41 additions & 0 deletions b/‎.github/workflows/ci.yml‎
Lines changed: 41 additions & 0 deletions
diff --git a/‎CONTRIBUTING.md‎
Lines changed: 177 additions & 0 deletions b/‎CONTRIBUTING.md‎
Lines changed: 177 additions & 0 deletions
diff --git a/‎Makefile‎
Lines changed: 8 additions & 0 deletions b/‎Makefile‎
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,41 @@
+name: CI
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+  workflow_dispatch:
+
+jobs:
+  trace-claude-code:
+    name: trace-claude-code bash tests
+    runs-on: ubuntu-24.04
+    timeout-minutes: 30
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+
+      # jq, curl, uuid-runtime (uuidgen), python3 are all preinstalled on
+      # ubuntu. Confirm they're available so failures are obvious.
+      - name: Verify test dependencies
+        run: |
+          set -e
+          echo "bash:    $(bash --version | head -1)"
+          echo "jq:      $(jq --version)"
+          echo "curl:    $(curl --version | head -1)"
+          echo "uuidgen: $(uuidgen)"
+          echo "python3: $(python3 --version)"
+
+      - name: Run trace-claude-code tests
+        run: make test
+        env:
+          # Force non-color output so logs in GitHub Actions are readable.
+          NO_COLOR: "1"
+
+      - name: Show hook log on failure
+        if: failure()
+        run: |
+          if [ -f "$HOME/.claude/state/braintrust_hook.log" ]; then
+            echo "=== hook log ==="
+            cat "$HOME/.claude/state/braintrust_hook.log"
+          fi
@@ -40,6 +40,183 @@ uv run pre-commit install
 uv run pre-commit run --all-files
 ```
 
+## Testing the `trace-claude-code` plugin
+
+Bash test suite for the hook scripts. Tests run the hooks against a
+stubbed `curl`, capture the resulting HTTP requests, and assert on the
+inferred span tree.
+
+### Running
+
+```sh
+# From the repo root:
+make test
+
+# Or run a specific test file:
+bash plugins/trace-claude-code/test/run_tests.sh test_e2e
+bash plugins/trace-claude-code/test/run_tests.sh test_replay test_queue
+```
+
+### Layout
+
+```
+plugins/trace-claude-code/test/
+├── helpers/
+│   ├── assert.sh        # describe / it / assert_eq / assert_contains, color output
+│   ├── harness.sh       # setup_test_env, teardown_test_env, run_hook
+│   ├── curl_stub.sh     # curl() shell function that captures requests + returns canned responses
+│   ├── fixtures.sh      # builders for hook input JSON (fixture_session_start, etc.)
+│   ├── span_tree.sh     # all_spans, span_count_by_type, span_by_name, children_of, ...
+│   └── replay.sh        # replay_session, describe_fixture
+├── fixtures/
+│   └── sessions/        # captured Claude sessions used by test_replay.sh
+├── test_*.sh            # one file per area
+├── record_session.sh    # CLI to prep a fixture directory for capturing
+└── run_tests.sh         # entry point
+```
+
+### Writing a test
+
+Each `test_*.sh` follows this pattern:
+
+```bash
+#!/bin/bash
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+source "$SCRIPT_DIR/helpers/assert.sh"
+source "$SCRIPT_DIR/helpers/harness.sh"
+
+describe "my feature"
+
+t_my_test_body() {
+    # setup_test_env has already created an isolated $HOME and stubbed curl
+    stub_response_for "*/v1/project_logs/*/insert" 200 '{"row_ids":["row_1"]}'
+
+    run_hook session_start.sh "$(fixture_session_start "s1" "/tmp/x")"
+
+    assert_eq "$(span_count_by_type task)" "1"
+}
+
+it "does the thing" t_my_test_body
+```
+
+Key conventions:
+
+- `describe "..."` is a section header (purely visual).
+- `it "name" function_name` runs `function_name` between `setup_test_env`
+  and `teardown_test_env`, then prints a ✓ or ✗.
+- Assertions (`assert_eq`, `assert_contains`, `assert_failure`, ...) record
+  failures into the current test but do **not** abort. Multiple assertions
+  per test are fine.
+- Hooks are run synchronously in tests via `BRAINTRUST_SYNC_QUEUE=true`
+  set by `setup_test_env`. Span queue tests opt out of this when needed.
+
+### Capturing a real session as a test fixture
+
+The hooks support recording every invocation to disk when the env var
+`BRAINTRUST_RECORD_DIR` is set. The recorded data can then be replayed
+in a test.
+
+#### 1. Prepare a fixture directory
+
+```sh
+plugins/trace-claude-code/test/record_session.sh my-fixture
+```
+
+This prints a `BRAINTRUST_RECORD_DIR` value pointing at
+`test/fixtures/sessions/my-fixture/`.
+
+#### 2. Run Claude Code with recording on
+
+```sh
+export BRAINTRUST_RECORD_DIR=/abs/path/to/test/fixtures/sessions/my-fixture
+claude
+# ... use Claude Code normally ...
+```
+
+While `BRAINTRUST_RECORD_DIR` is set:
+
+- Every hook invocation appends one NDJSON record to
+  `events.ndjson` containing `{ts, hook, payload}`.
+- The `stop_hook` also copies the referenced transcript file into
+  `transcripts/<session_id>.jsonl`.
+
+You do not need to modify hook scripts or set anything else - the recorder
+runs inside the existing hooks.
+
+#### 3. Inspect the fixture
+
+```sh
+plugins/trace-claude-code/test/record_session.sh --describe my-fixture
+```
+
+Output:
+
+```
+Fixture: .../test/fixtures/sessions/my-fixture
+  Events: 14
+  Hook counts:
+    post_tool_use: 8
+    session_end: 1
+    session_start: 1
+    stop_hook: 3
+    user_prompt_submit: 1
+  Transcripts: 1
+```
+
+#### 4. Replay it in a test
+
+```bash
+t_replay_my_fixture() {
+    stub_response_for "*/v1/project_logs/*/insert" 200 '{"row_ids":["row_1"]}'
+
+    local n
+    n=$(replay_session "$SCRIPT_DIR/fixtures/sessions/my-fixture")
+    assert_success "$?"
+    assert_eq "$n" "14"
+
+    # Now assert on the span tree the hooks produced
+    assert_eq "$(span_count_by_type tool)" "8"
+    assert_eq "$(span_count_by_type llm)"  "3"
+}
+
+it "my real-world fixture produces the expected spans" t_replay_my_fixture
+```
+
+The replayer:
+
+- Reads `events.ndjson` line by line in order.
+- For `stop_hook` events, rewrites `payload.transcript_path` to point at
+  the bundled transcript so the replayed hook can read it.
+- Invokes the matching hook script via `run_hook` with the recorded
+  payload.
+
+#### When to use replay vs. synthetic fixtures
+
+- **Synthetic fixtures** (`fixture_session_start`, etc.) - fast to write,
+  test specific scenarios in isolation, no real Claude needed.
+- **Replayed fixtures** - high-fidelity regression tests of real-world
+  interactions. Use when you want to lock in behavior on a specific
+  pattern of hooks you saw in the wild (e.g. a session with parallel
+  tool calls, or a long multi-turn conversation).
+
+### Span-tree queries
+
+The captured HTTP requests are parsed to extract the inserted spans. Available helpers:
+
+| Function | Returns |
+|---|---|
+| `all_spans` | JSON array of every span sent to any `/insert` endpoint |
+| `span_count` | total number of spans |
+| `span_count_by_type "tool"` | count of spans with `span_attributes.type == "tool"` |
+| `spans_named "^Turn "` | array of spans whose name matches the regex |
+| `span_by_name "^Turn 1$"` | first matching span (or `null`) |
+| `span_by_type "llm"` | first span of that type |
+| `span_by_id "..."` | span with the given `span_id` |
+| `children_of "<span_id>"` | array of spans whose first parent is the given id |
+| `is_child_of "<child_id>" "<parent_id>"` | exit 0 if true |
+
+All return JSON on stdout; combine with `jq` for further drilling.
+
 # Updating the plugin
 
 After making changes:
 
@@ -0,0 +1,8 @@
+.PHONY: test test-trace-claude-code
+
+# Run all plugin tests
+test: test-trace-claude-code
+
+# Run trace-claude-code plugin tests
+test-trace-claude-code:
+	@bash plugins/trace-claude-code/test/run_tests.sh