Skip to content

Commit 03f4ba2

Browse files
committed
review: aaif-goose/goose#8842 request-changes, aaif-goose/goose#8834 merge-after-nits
1 parent 8a1f4e6 commit 03f4ba2

2 files changed

Lines changed: 269 additions & 0 deletions

File tree

Lines changed: 130 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,130 @@
1+
# block/goose #8834 — Fix Windows dev loop: beforeDevCommand script + Vite IPv4 bind + test_acp_client.py stdin-flush
2+
3+
- **Repo**: block/goose
4+
- **PR**: #8834
5+
- **Author**: seydousakho-star
6+
- **Head SHA**: 7d26c8d048c38afff05299c040c19d0b3fbd7c47
7+
- **Base**: main
8+
- **Size**: +76 / −6 across two files: `test_acp_client.py` (+62/−8)
9+
and `ui/goose2/justfile` (+18/−2 in two recipes).
10+
11+
## What it changes
12+
13+
Three Windows-specific fixes to the inner-loop developer experience,
14+
each documented inline with the *why*:
15+
16+
1. **`test_acp_client.py:11-22`** — UTF-8 console reconfig at startup
17+
so progress glyphs (`✓ ✗ 📝`) don't crash on cp1252 Windows
18+
consoles. Wrapped in try/except for older Pythons.
19+
20+
2. **`test_acp_client.py:25-49, 75-93`**`_resolve_goose_cmd()`
21+
helper picks `$GOOSE_BIN``target/debug/goose[.exe]`
22+
`cargo run -p goose-cli -- acp` in that order. Subprocess
23+
construction switches `stderr=subprocess.PIPE`
24+
`subprocess.DEVNULL` and `bufsize=0``bufsize=1`, with comments
25+
at `:73-83` explaining that an unread `cargo run` stderr fills
26+
Windows' ~4KB pipe buffer and deadlocks the JSON-RPC stdio loop.
27+
28+
3. **`ui/goose2/justfile:99-115` and `:147-166`** — Tauri
29+
`beforeDevCommand` config rewritten:
30+
- Drops `cd ${PROJECT_DIR} && ` prefix (the embedded `&&`
31+
was getting parsed by the bash→npm→tauri argv passthrough
32+
and truncating the JSON value).
33+
- Drops the leading `exec` (cmd.exe doesn't know the bash
34+
builtin).
35+
- Adds `--host 127.0.0.1` (Vite 7 binds IPv6-only on Windows;
36+
WebView2 tries IPv4 first and 404s).
37+
- Changes `cwd` from `.` to `..` (Tauri resolves
38+
`beforeDevCommand` cwd relative to `src-tauri/`, not where
39+
`just` ran).
40+
41+
## Strengths
42+
43+
- **Each fix has a comment explaining the failure mode it addresses.**
44+
This is unusually high-quality for an env-fix PR — the comment at
45+
`justfile:99-114` enumerates all four sub-fixes and *why* each one
46+
matters, so a future maintainer untangling another shell-quoting
47+
issue won't have to re-derive the chain. Same applies to the
48+
stderr/bufsize comment block at `test_acp_client.py:73-83`.
49+
- The `_resolve_goose_cmd()` precedence (`:25-49`) — explicit env
50+
override, then prebuilt binary, then `cargo run` — is the right
51+
ordering for a test that ships in-tree but is sometimes run by
52+
CI and sometimes by an interactive developer. The `RuntimeError`
53+
with actionable advice at `:48` (build with `cargo build -p
54+
goose-cli` or set `GOOSE_BIN`) is good UX.
55+
- `stderr=subprocess.DEVNULL` is the right call for this test:
56+
the test cares only about JSON-RPC framing on stdout, and
57+
routing stderr to DEVNULL avoids the pipe-buffer deadlock without
58+
needing a reader thread. The alternative — spinning up a
59+
background reader — would add complexity for no test value.
60+
- `bufsize=1` (line-buffered) at `:91` is correct for text-mode I/O
61+
on both Windows and POSIX. The previous `bufsize=0` is only
62+
meaningful for binary streams and was actively wrong here.
63+
- `--host 127.0.0.1` is a *less* permissive bind than the default
64+
(which on Linux is `localhost` resolving to both v4/v6, and on
65+
Windows happens to be IPv6-only) — explicit IPv4 binding is both
66+
the cross-platform fix and a small security improvement (no
67+
accidental external exposure if Vite ever changes the default).
68+
69+
## Concerns / asks
70+
71+
- **The `dev` and `dev-debug` recipes have copy-pasted explanatory
72+
comments** but only the `dev` recipe lists all four sub-fixes;
73+
the `dev-debug` block at `:160-163` only covers the `exec` drop
74+
and `--host` add, not the `cwd` change. Either:
75+
1. Verify that `dev-debug` doesn't suffer from the `&&` and
76+
`cwd` issues (and add a one-liner explaining why), or
77+
2. If both recipes have all four issues, list all four in
78+
both comment blocks. Otherwise a future maintainer reading
79+
just `dev-debug` will wonder if the fix is incomplete there.
80+
- The `--host 127.0.0.1` bind change is a *behavior* change for
81+
*all* platforms, not just Windows. On Linux, devs who were
82+
reaching the dev server from another box on the LAN (e.g. for
83+
cross-device testing) will now get connection refused. Worth a
84+
one-liner in the comment block calling that out, or making the
85+
host configurable via env var (e.g. `${GOOSE_DEV_HOST:-127.0.0.1}`)
86+
so the cross-platform default is safe but the LAN-test workflow
87+
still has an escape hatch.
88+
- `sys.stdout.reconfigure(...)` at `test_acp_client.py:18` swallows
89+
`AttributeError, OSError`. The `AttributeError` covers
90+
`reconfigure` not existing on Python ≤ 3.6; the `OSError` covers
91+
detached/closed streams. Worth a one-line warning print to
92+
`sys.stderr` (which is now DEVNULL'd inside the subprocess but
93+
still works at the top of the test script) so a Windows dev who
94+
*does* hit the cp1252 fallback knows why their output looks
95+
garbled.
96+
- `_resolve_goose_cmd()` checks `os.path.isfile(prebuilt)` at
97+
`:43`, but doesn't check that the file is *executable*. On a
98+
fresh checkout where `target/debug/goose.exe` exists but lacks
99+
`+x` (rare on Windows, possible on POSIX after a `git restore`),
100+
the test will fall through to a confusing `Popen` error.
101+
`os.access(prebuilt, os.X_OK)` would catch it cleanly.
102+
103+
## Verdict
104+
105+
**merge-after-nits** — the actual fixes are correct, well-commented,
106+
and address real Windows-only failure modes that block the entire
107+
inner-loop dev experience there. The asks are: parity between the
108+
`dev` and `dev-debug` comment blocks, an env-var escape hatch for
109+
the IPv4 bind, and a couple of minor robustness tweaks in the
110+
Python helper.
111+
112+
## What I learned
113+
114+
The `cargo run` stderr-deadlock pattern is one of the canonical
115+
subprocess pitfalls and it bites *exactly* the test scripts that
116+
spawn cargo because cargo prints compile chatter to stderr by
117+
default. The fix is always the same — read it on a thread, or
118+
DEVNULL it — but it's worth remembering that pipe-buffer sizes
119+
differ across platforms (Windows ~4KB, Linux ~64KB), so a test
120+
that works on Linux can deadlock on Windows just by being slow
121+
to drain stderr.
122+
123+
The Tauri `beforeDevCommand` quoting issue is a different category
124+
of bug: it's a contract leak between four layers (just → bash →
125+
npm → tauri → cmd.exe). Embedding `&&` in a JSON value passed
126+
across all four layers is a recipe for argv splitting at the
127+
wrong point, and the symptom (truncated `--config` JSON) is
128+
nowhere near the cause (`&&` parsed too early). Per-platform
129+
escape rules accumulate and the only durable fix is to not put
130+
shell metacharacters in cross-process argv values at all.
Lines changed: 139 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,139 @@
1+
# block/goose #8842 — feat: lifecycle hooks system
2+
3+
- **Repo**: block/goose
4+
- **PR**: #8842
5+
- **Author**: michaelneale (Michael Neale)
6+
- **Head SHA**: 83c0dcfebf305c8843e4263c716aa0475a5ef401
7+
- **Base**: main
8+
- **Size**: +550 / −10 — bulk in new `crates/goose/src/hooks.rs`
9+
(+~444), with 6 emit sites added across `crates/goose/src/agents/agent.rs`
10+
(+~80) and a sprinkle of glue in the agent constructor and reply loop.
11+
12+
## What it changes
13+
14+
Adds a config-driven lifecycle hooks subsystem with 6 events:
15+
`before_tool_call`, `after_tool_call`, `on_session_start`,
16+
`on_session_end`, `before_reply`, `after_reply`
17+
(`crates/goose/src/hooks.rs:14-25`). Each `HookEntry` pairs an
18+
optional `matcher` regex (matched against tool name) with one or
19+
more `HookHandler`s, each of which is either a shell `command`
20+
that receives the JSON `HookContext` on stdin, or an HTTP `url`
21+
that receives it as a POST body, with a `timeout` (default 10s).
22+
23+
Hooks are loaded from `goose config.yaml` via
24+
`HookManager::from_config()` (`hooks.rs:107-116`). The agent
25+
constructor (`agent.rs:251-255`) instantiates a manager, and three
26+
emit sites are wired:
27+
28+
- `before_tool_call` (`agent.rs:594-624`): synchronously gates the
29+
tool dispatch — a hook returning `block: true` short-circuits
30+
with `INVALID_REQUEST` and the supplied reason.
31+
- `after_tool_call` (`agent.rs:692-708`): fire-and-forget, no
32+
blocking semantics.
33+
- `before_reply` (`agent.rs:1118-1135`): emitted at the top of the
34+
reply loop with the user message text in `HookContext.message`.
35+
36+
## Strengths
37+
38+
- Event enum is exhaustive and `serde(rename_all = "snake_case")`
39+
(`hooks.rs:16`) so the YAML keys are predictable and unsurprising.
40+
- The `HookDecision` short-circuit at `hooks.rs:218-221` returns
41+
on the *first* `block: true` hook — important so that an
42+
early-deny rule can't be silently overridden by a later
43+
permissive rule. Matches the principle-of-least-surprise for an
44+
authorization layer.
45+
- `has_hooks()` gate at `agent.rs:594, 694, 1119` avoids the
46+
serialization cost of building `HookContext` when no hooks are
47+
registered for that event. With the hash-map check at
48+
`hooks.rs:163-167` returning `false` on empty/missing entries,
49+
the no-hooks code path is essentially free.
50+
- Matcher regex compilation failure is logged and the hook is
51+
*skipped*, not panicked (`hooks.rs:194-200`). Right call —
52+
one bad regex shouldn't take down all the other hooks for that
53+
event.
54+
- Timeout default of 10s (`hooks.rs:217`) is sane for a synchronous
55+
`before_tool_call` gate; long enough for a network-bound policy
56+
check, short enough that a stuck script can't hang the agent.
57+
58+
## Concerns / asks
59+
60+
- **`after_tool_call` is documented as "fire-and-forget" but
61+
`agent.rs:707-708` actually `await`s it.** That means a slow
62+
after-hook still blocks the reply loop just like a before-hook.
63+
Either:
64+
1. Spawn it via `tokio::spawn` so the loop returns immediately
65+
(matching the docstring), or
66+
2. Update the docstring and the comment at `:693` to say
67+
"synchronously emitted; respects timeout".
68+
As-is, the contract and the implementation disagree, and the
69+
consequence is that a misconfigured after-hook with a 30s
70+
timeout will pause the chat between every tool call.
71+
- `tool_result: None` at every emit site (`agent.rs:601, 700,
72+
1126`) — the `HookContext` *has* a `tool_result` field but
73+
no caller populates it. For `after_tool_call` this is the most
74+
valuable field a hook would want (e.g. "log the result of every
75+
shell command", "block on PII in tool output"). The result is
76+
available right there at `agent.rs:692` (just after the
77+
`WAITING_TOOL_END` debug log). Wire it through.
78+
- The `before_tool_call` block at `agent.rs:614-622` returns
79+
`INVALID_REQUEST` with the reason in the error message. That
80+
means the *model* sees the block reason in its tool-call result,
81+
which is the right behavior for a guardrail — but the reason is
82+
also the only thing logged. For audit purposes (a hook denied a
83+
tool call), worth emitting a structured tracing event with the
84+
`tool_name`, `session_id`, and `reason` fields rather than just
85+
the existing `tracing::info!` line at `:611`.
86+
- `Config::global()` at `hooks.rs:121` is a process-wide singleton.
87+
For a long-running `goose serve` whose config file gets edited
88+
at runtime, the hooks won't reload — `HookManager::from_config()`
89+
is only called once in the agent constructor. Worth either
90+
documenting "edit config, restart goose serve" or wiring a
91+
config-watcher reload path.
92+
- Two handler types (`command` shell + `url` HTTP) but
93+
`HookHandler` (`hooks.rs:53-60`) makes both `Option<String>`
94+
nothing prevents a config from setting *both* `command` and
95+
`url`, or *neither*. Convert to a tagged enum or at least
96+
validate at load time and surface a config error, otherwise
97+
ambiguous configs fail silently or behave inconsistently.
98+
- The `matcher` regex at `hooks.rs:189-199` is recompiled on
99+
every emit. For a high-frequency tool-call workload, that's
100+
measurable. Compile once at config-load time and store the
101+
`Regex` in `HookEntry`.
102+
- No tests in this PR. For a feature that:
103+
- executes user-supplied shell commands,
104+
- can block tool dispatch,
105+
- has a non-obvious "first block wins" precedence rule,
106+
- has a regex-matching subsystem with skip-on-error semantics,
107+
the absence of any unit tests is a real gap. At minimum:
108+
- block-on-first-deny precedence,
109+
- regex-skip-on-invalid behavior,
110+
- timeout enforcement,
111+
- per-event has_hooks gating.
112+
113+
## Verdict
114+
115+
**request-changes** — the design and the integration points are
116+
sound, but four concrete issues need addressing before merge:
117+
(1) the `after_tool_call` docstring/impl disagreement,
118+
(2) `tool_result` never populated,
119+
(3) `command`/`url` mutual-exclusivity not enforced,
120+
(4) zero tests for a security-sensitive subsystem.
121+
122+
The matcher recompilation, structured audit logging, and
123+
config-watcher reload can be follow-ups, but tests for the
124+
deny-precedence path and the regex-skip behavior should land
125+
with this PR — those are the cases an operator would assume
126+
Just Work, and they're cheap to cover.
127+
128+
## What I learned
129+
130+
A "lifecycle hooks" feature looks like a UX feature but is
131+
actually a security/policy feature: any hook with `block: true`
132+
is implicitly an authorization layer. That changes the bar for
133+
review — config malformation, error handling on hook failure,
134+
audit logging, and ordering semantics all become
135+
correctness-critical, not nice-to-have. The same pattern recurs
136+
in MCP servers, agent permissions, and webhook-style policy
137+
plugins; the temptation to "just call user code at lifecycle
138+
points" almost always evolves into a guardrail/audit subsystem,
139+
and it's cheaper to design for that on day one.

0 commit comments

Comments
 (0)