Commit cc8dd55

fix async bugs and add tests (#25)
- redesign plugin to fix bugs introduced by using async hooks (use work queue plus bg span uploader)
- add tests
- add session replay tool w/ tests
- fix misc bugs found from adding tests

this is a similar architecture and test approach to what we're doing in the opencode plugin: https://github.com/braintrustdata/braintrust-opencode-plugin
1 parent a42b147 commit cc8dd55

32 files changed

Lines changed: 4375 additions & 72 deletions

.github/workflows/ci.yml

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
name: CI

on:
  push:
    branches: [main]
  pull_request:
  workflow_dispatch:

jobs:
  trace-claude-code:
    name: trace-claude-code bash tests
    runs-on: ubuntu-24.04
    timeout-minutes: 30
    steps:
      - name: Checkout repository
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

      # jq, curl, uuid-runtime (uuidgen), python3 are all preinstalled on
      # ubuntu. Confirm they're available so failures are obvious.
      - name: Verify test dependencies
        run: |
          set -e
          echo "bash: $(bash --version | head -1)"
          echo "jq: $(jq --version)"
          echo "curl: $(curl --version | head -1)"
          echo "uuidgen: $(uuidgen)"
          echo "python3: $(python3 --version)"

      - name: Run trace-claude-code tests
        run: make test
        env:
          # Force non-color output so logs in GitHub Actions are readable.
          NO_COLOR: "1"

      - name: Show hook log on failure
        if: failure()
        run: |
          if [ -f "$HOME/.claude/state/braintrust_hook.log" ]; then
            echo "=== hook log ==="
            cat "$HOME/.claude/state/braintrust_hook.log"
          fi

CONTRIBUTING.md

Lines changed: 177 additions & 0 deletions
@@ -40,6 +40,183 @@ uv run pre-commit install
uv run pre-commit run --all-files
```

## Testing the `trace-claude-code` plugin

Bash test suite for the hook scripts. Tests run the hooks against a
stubbed `curl`, capture the resulting HTTP requests, and assert on the
inferred span tree.

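As a rough illustration of the stubbing idea, here is a minimal sketch of a `curl` shell function that records each request and answers with a canned response. It is not the contents of `helpers/curl_stub.sh`: the `STUB_*` arrays, `CAPTURE_FILE`, the tab-separated capture format, and the choice to return exit code 22 for unmatched or error responses are all assumptions made for this example.

```bash
#!/bin/bash
# Hypothetical state for the sketch: parallel arrays of URL globs and responses.
STUB_PATTERNS=()
STUB_STATUSES=()
STUB_BODIES=()
CAPTURE_FILE="${CAPTURE_FILE:-$(mktemp)}"

# Register a canned response for every URL matching a glob pattern.
stub_response_for() {
  STUB_PATTERNS+=("$1")
  STUB_STATUSES+=("$2")
  STUB_BODIES+=("$3")
}

# Shadow the real curl: capture the URL and request body, then reply from the stubs.
curl() {
  local url="" data="" i
  while [ $# -gt 0 ]; do
    case "$1" in
      -d|--data|--data-binary) data="${2:-}"; shift 2 || shift ;;
      -X|-H|-o|-w) shift 2 || shift ;;   # options that take a value we ignore
      http://*|https://*) url="$1"; shift ;;
      *) shift ;;                        # any other flag, e.g. -s
    esac
  done
  printf '%s\t%s\n' "$url" "$data" >> "$CAPTURE_FILE"

  for i in "${!STUB_PATTERNS[@]}"; do
    case "$url" in
      ${STUB_PATTERNS[$i]})              # unquoted on purpose: treated as a glob
        printf '%s' "${STUB_BODIES[$i]}"
        if [ "${STUB_STATUSES[$i]}" -lt 400 ]; then return 0; else return 22; fi
        ;;
    esac
  done
  return 22                              # no stub registered for this URL
}
```

Because a shell function takes precedence over a binary on `PATH`, any hook code sourced after this definition calls the stub without modification.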
### Running

```sh
# From the repo root:
make test

# Or run a specific test file:
bash plugins/trace-claude-code/test/run_tests.sh test_e2e
bash plugins/trace-claude-code/test/run_tests.sh test_replay test_queue
```

### Layout

```
plugins/trace-claude-code/test/
├── helpers/
│   ├── assert.sh        # describe / it / assert_eq / assert_contains, color output
│   ├── harness.sh       # setup_test_env, teardown_test_env, run_hook
│   ├── curl_stub.sh     # curl() shell function that captures requests + returns canned responses
│   ├── fixtures.sh      # builders for hook input JSON (fixture_session_start, etc.)
│   ├── span_tree.sh     # all_spans, span_count_by_type, span_by_name, children_of, ...
│   └── replay.sh        # replay_session, describe_fixture
├── fixtures/
│   └── sessions/        # captured Claude sessions used by test_replay.sh
├── test_*.sh            # one file per area
├── record_session.sh    # CLI to prep a fixture directory for capturing
└── run_tests.sh         # entry point
```

### Writing a test

Each `test_*.sh` follows this pattern:

```bash
#!/bin/bash
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/helpers/assert.sh"
source "$SCRIPT_DIR/helpers/harness.sh"

describe "my feature"

t_my_test_body() {
  # setup_test_env has already created an isolated $HOME and stubbed curl
  stub_response_for "*/v1/project_logs/*/insert" 200 '{"row_ids":["row_1"]}'

  run_hook session_start.sh "$(fixture_session_start "s1" "/tmp/x")"

  assert_eq "$(span_count_by_type task)" "1"
}

it "does the thing" t_my_test_body
```

Key conventions:

- `describe "..."` is a section header (purely visual).
- `it "name" function_name` runs `function_name` between `setup_test_env`
  and `teardown_test_env`, then prints a ✓ or ✗.
- Assertions (`assert_eq`, `assert_contains`, `assert_failure`, ...) record
  failures into the current test but do **not** abort. Multiple assertions
  per test are fine.
- Hooks are run synchronously in tests via `BRAINTRUST_SYNC_QUEUE=true`
  set by `setup_test_env`. Span queue tests opt out of this when needed.

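To make the `it` and non-aborting assertion conventions concrete, the following is a minimal sketch of how such a runner could be written. It is not the code in `helpers/assert.sh` or `helpers/harness.sh`: the counters `CURRENT_FAILURES` and `TESTS_FAILED`, the `TEST_HOME` variable, and the trivial setup and teardown bodies are assumptions for illustration only.

```bash
#!/bin/bash
# Hypothetical counters; the real helpers may track results differently.
TESTS_FAILED=0
CURRENT_FAILURES=0

# Stand-ins for the harness: create and remove an isolated scratch directory.
setup_test_env()    { TEST_HOME="$(mktemp -d)"; }
teardown_test_env() { rm -rf "$TEST_HOME"; }

# Record a mismatch against the current test without aborting it.
# Argument order follows the examples in this guide: actual, then expected.
assert_eq() {
  if [ "$1" != "$2" ]; then
    CURRENT_FAILURES=$((CURRENT_FAILURES + 1))
    echo "    expected '$2', got '$1'"
  fi
  return 0
}

# Run one test function between setup and teardown, then print the verdict.
it() {
  local name="$1" fn="$2"
  CURRENT_FAILURES=0
  setup_test_env
  "$fn"
  teardown_test_env
  if [ "$CURRENT_FAILURES" -eq 0 ]; then
    echo "  ✓ $name"
  else
    echo "  ✗ $name"
    TESTS_FAILED=$((TESTS_FAILED + 1))
  fi
}
```

Since `assert_eq` always returns 0, every assertion in a test body runs even after an earlier one has failed, which is what lets a single test report several mismatches at once.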
### Capturing a real session as a test fixture

The hooks support recording every invocation to disk when the env var
`BRAINTRUST_RECORD_DIR` is set. The recorded data can then be replayed
in a test.

#### 1. Prepare a fixture directory

```sh
plugins/trace-claude-code/test/record_session.sh my-fixture
```

This prints a `BRAINTRUST_RECORD_DIR` value pointing at
`test/fixtures/sessions/my-fixture/`.

#### 2. Run Claude Code with recording on

```sh
export BRAINTRUST_RECORD_DIR=/abs/path/to/test/fixtures/sessions/my-fixture
claude
# ... use Claude Code normally ...
```

While `BRAINTRUST_RECORD_DIR` is set:

- Every hook invocation appends one NDJSON record to
  `events.ndjson` containing `{ts, hook, payload}`.
- The `stop_hook` also copies the referenced transcript file into
  `transcripts/<session_id>.jsonl`.

You do not need to modify hook scripts or set anything else - the recorder
runs inside the existing hooks.

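The event-recording step could look roughly like the sketch below. The function name `record_hook_event`, the timestamp format, and the use of `jq` to build the record are assumptions; only the `{ts, hook, payload}` shape, the `events.ndjson` file name, and the `BRAINTRUST_RECORD_DIR` variable come from the description above. Transcript copying is left out.

```bash
#!/bin/bash
# Hypothetical recorder: append one compact JSON line per hook invocation.
# $1 is the hook name, $2 is the hook's input payload as a JSON string.
record_hook_event() {
  local hook="$1" payload="$2"

  # Recording is opt-in: do nothing unless the env var is set.
  [ -n "${BRAINTRUST_RECORD_DIR:-}" ] || return 0

  mkdir -p "$BRAINTRUST_RECORD_DIR"
  jq -c -n \
    --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --arg hook "$hook" \
    --argjson payload "$payload" \
    '{ts: $ts, hook: $hook, payload: $payload}' \
    >> "$BRAINTRUST_RECORD_DIR/events.ndjson"
}
```

Using `--argjson` keeps the payload as nested JSON rather than an escaped string, so a replayer can read `.payload` back without a second decoding step.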
#### 3. Inspect the fixture

```sh
plugins/trace-claude-code/test/record_session.sh --describe my-fixture
```

Output:

```
Fixture: .../test/fixtures/sessions/my-fixture
Events: 14
Hook counts:
  post_tool_use: 8
  session_end: 1
  session_start: 1
  stop_hook: 3
  user_prompt_submit: 1
Transcripts: 1
```

#### 4. Replay it in a test

```bash
t_replay_my_fixture() {
  stub_response_for "*/v1/project_logs/*/insert" 200 '{"row_ids":["row_1"]}'

  local n
  n=$(replay_session "$SCRIPT_DIR/fixtures/sessions/my-fixture")
  assert_success "$?"
  assert_eq "$n" "14"

  # Now assert on the span tree the hooks produced
  assert_eq "$(span_count_by_type tool)" "8"
  assert_eq "$(span_count_by_type llm)" "3"
}

it "my real-world fixture produces the expected spans" t_replay_my_fixture
```

The replayer:

- Reads `events.ndjson` line by line in order.
- For `stop_hook` events, rewrites `payload.transcript_path` to point at
  the bundled transcript so the replayed hook can read it.
- Invokes the matching hook script via `run_hook` with the recorded
  payload.

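The three replayer steps can be sketched as a short loop. This is an illustration under assumptions, not the code in `helpers/replay.sh`: the function name `replay_events`, the `REPLAY_LOG` variable, the stand-in `run_hook`, the `<hook>.sh` naming, and the presence of a `session_id` field in each `stop_hook` payload are all hypothetical.

```bash
#!/bin/bash
# Hypothetical stand-in for the harness's run_hook; it only logs its arguments.
run_hook() {
  printf '%s\t%s\n' "$1" "$2" >> "${REPLAY_LOG:-/dev/null}"
}

# Replay every recorded event in order and print how many were replayed.
replay_events() {
  local dir="$1" count=0 line hook payload session_id

  # Read from fd 3 so a hook that consumes stdin cannot swallow the event stream.
  while IFS= read -r line <&3; do
    [ -n "$line" ] || continue
    hook=$(jq -r '.hook' <<<"$line")
    payload=$(jq -c '.payload' <<<"$line")

    # Point stop_hook at the transcript bundled with the fixture.
    if [ "$hook" = "stop_hook" ]; then
      session_id=$(jq -r '.session_id' <<<"$payload")
      payload=$(jq -c --arg p "$dir/transcripts/$session_id.jsonl" \
        '.transcript_path = $p' <<<"$payload")
    fi

    # Discard hook output so stdout carries only the final count.
    run_hook "$hook.sh" "$payload" >/dev/null
    count=$((count + 1))
  done 3< "$dir/events.ndjson"

  echo "$count"
}
```

Keeping hook output off stdout is what allows a caller to capture the event count with `n=$(...)`, as the test in step 4 does with `replay_session`.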
#### When to use replay vs. synthetic fixtures

- **Synthetic fixtures** (`fixture_session_start`, etc.) - fast to write,
  test specific scenarios in isolation, no real Claude needed.
- **Replayed fixtures** - high-fidelity regression tests of real-world
  interactions. Use when you want to lock in behavior on a specific
  pattern of hooks you saw in the wild (e.g. a session with parallel
  tool calls, or a long multi-turn conversation).

### Span-tree queries

The captured HTTP requests are parsed to extract the inserted spans. Available helpers:

| Function | Returns |
|---|---|
| `all_spans` | JSON array of every span sent to any `/insert` endpoint |
| `span_count` | total number of spans |
| `span_count_by_type "tool"` | count of spans with `span_attributes.type == "tool"` |
| `spans_named "^Turn "` | array of spans whose name matches the regex |
| `span_by_name "^Turn 1$"` | first matching span (or `null`) |
| `span_by_type "llm"` | first span of that type |
| `span_by_id "..."` | span with the given `span_id` |
| `children_of "<span_id>"` | array of spans whose first parent is the given id |
| `is_child_of "<child_id>" "<parent_id>"` | exit 0 if true |

All return JSON on stdout; combine with `jq` for further drilling.

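As an example of drilling further with `jq`, the sketch below pipes `all_spans` into two small filters. The `all_spans` defined here is a stand-in that prints canned data so the snippet runs on its own; the sample spans, the helper names `tool_names` and `child_ids_of`, and the field layout (`span_attributes.name`, `span_parents`) are assumptions for illustration.

```bash
#!/bin/bash
# Stand-in for the real helper: prints a fixed JSON array of spans.
all_spans() {
  cat <<'JSON'
[
  {"span_id":"root","span_parents":[],"span_attributes":{"name":"Session","type":"task"}},
  {"span_id":"t1","span_parents":["root"],"span_attributes":{"name":"Turn 1","type":"task"}},
  {"span_id":"tool1","span_parents":["t1"],"span_attributes":{"name":"Bash","type":"tool"}},
  {"span_id":"llm1","span_parents":["t1"],"span_attributes":{"name":"LLM","type":"llm"}}
]
JSON
}

# Names of every tool span, one per line.
tool_names() {
  all_spans | jq -r '.[] | select(.span_attributes.type == "tool") | .span_attributes.name'
}

# IDs of the spans whose first parent is the given span id, one per line.
child_ids_of() {
  all_spans | jq -r --arg id "$1" '.[] | select(.span_parents[0] == $id) | .span_id'
}
```

In a test, the output of filters like these can be passed straight to `assert_eq` or `assert_contains`.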
# Updating the plugin

After making changes:

Makefile

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
.PHONY: test test-trace-claude-code
2+
3+
# Run all plugin tests
4+
test: test-trace-claude-code
5+
6+
# Run trace-claude-code plugin tests
7+
test-trace-claude-code:
8+
@bash plugins/trace-claude-code/test/run_tests.sh
