review: aaif-goose/goose#8842 request-changes, aaif-goose/goose#8834 merge-after-nits

Bojun-Vvibe · Bojun-Vvibe · commit 03f4ba2f3ec3 · 2026-04-26T09:28:01.000+08:00
diff --git a/reviews/2026-W17/block-goose-pr-8834.md b/reviews/2026-W17/block-goose-pr-8834.md
@@ -0,0 +1,130 @@
+# block/goose #8834 — Fix Windows dev loop: beforeDevCommand script + Vite IPv4 bind + test_acp_client.py stdin-flush
+
+- **Repo**: block/goose
+- **PR**: #8834
+- **Author**: seydousakho-star
+- **Head SHA**: 7d26c8d048c38afff05299c040c19d0b3fbd7c47
+- **Base**: main
+- **Size**: +76 / −6 across two files: `test_acp_client.py` (+62/−8)
+  and `ui/goose2/justfile` (+18/−2 in two recipes).
+
+## What it changes
+
+Three Windows-specific fixes to the inner-loop developer experience,
+each documented inline with the *why*:
+
+1. **`test_acp_client.py:11-22`** — UTF-8 console reconfig at startup
+   so printing progress glyphs (`✓ ✗ 📝`) doesn't raise
+   `UnicodeEncodeError` on cp1252 Windows consoles. Wrapped in
+   try/except for older Pythons.
+
+2. **`test_acp_client.py:25-49, 75-93`** — `_resolve_goose_cmd()`
+   helper picks `$GOOSE_BIN` → `target/debug/goose[.exe]` →
+   `cargo run -p goose-cli -- acp` in that order. Subprocess
+   construction switches `stderr=subprocess.PIPE` →
+   `subprocess.DEVNULL` and `bufsize=0` → `bufsize=1`, with comments
+   at `:73-83` explaining that an unread `cargo run` stderr fills
+   Windows' ~4KB pipe buffer and deadlocks the JSON-RPC stdio loop.
+
+3. **`ui/goose2/justfile:99-115` and `:147-166`** — Tauri
+   `beforeDevCommand` config rewritten:
+   - Drops `cd ${PROJECT_DIR} && ` prefix (the embedded `&&`
+     was getting parsed by the bash→npm→tauri argv passthrough
+     and truncating the JSON value).
+   - Drops the leading `exec` (cmd.exe doesn't know the bash
+     builtin).
+   - Adds `--host 127.0.0.1` (Vite 7 binds IPv6-only on Windows;
+     WebView2 tries IPv4 first and 404s).
+   - Changes `cwd` from `.` to `..` (Tauri resolves
+     `beforeDevCommand` cwd relative to `src-tauri/`, not where
+     `just` ran).
+
+## Strengths
+
+- **Each fix has a comment explaining the failure mode it addresses.**
+  This is unusually high-quality for an env-fix PR — the comment at
+  `justfile:99-114` enumerates all four sub-fixes and *why* each one
+  matters, so a future maintainer untangling another shell-quoting
+  issue won't have to re-derive the chain. Same applies to the
+  stderr/bufsize comment block at `test_acp_client.py:73-83`.
+- The `_resolve_goose_cmd()` precedence (`:25-49`) — explicit env
+  override, then prebuilt binary, then `cargo run` — is the right
+  ordering for a test that ships in-tree but is sometimes run by
+  CI and sometimes by an interactive developer. The `RuntimeError`
+  with actionable advice at `:48` (build with `cargo build -p
+  goose-cli` or set `GOOSE_BIN`) is good UX.
+- `stderr=subprocess.DEVNULL` is the right call for this test:
+  the test cares only about JSON-RPC framing on stdout, and
+  routing stderr to DEVNULL avoids the pipe-buffer deadlock without
+  needing a reader thread. The alternative — spinning up a
+  background reader — would add complexity for no test value.
+- `bufsize=1` (line-buffered) at `:91` is correct for text-mode I/O
+  on both Windows and POSIX. The previous `bufsize=0` (unbuffered)
+  is the setting meant for binary pipes and was the wrong fit for
+  a line-oriented text protocol.
+- `--host 127.0.0.1` is a *less* permissive bind than the default
+  (which on Linux is `localhost` resolving to both v4/v6, and on
+  Windows happens to be IPv6-only) — explicit IPv4 binding is both
+  the cross-platform fix and a small security improvement (no
+  accidental external exposure if Vite ever changes the default).
+
+## Concerns / asks
+
+- **The `dev` and `dev-debug` recipes have copy-pasted explanatory
+  comments** but only the `dev` recipe lists all four sub-fixes;
+  the `dev-debug` block at `:160-163` only covers the `exec` drop
+  and `--host` add, not the `&&` drop or the `cwd` change. Either:
+  1. Verify that `dev-debug` doesn't suffer from the `&&` and
+     `cwd` issues (and add a one-liner explaining why), or
+  2. If both recipes have all four issues, list all four in
+     both comment blocks. Otherwise a future maintainer reading
+     just `dev-debug` will wonder if the fix is incomplete there.
+- The `--host 127.0.0.1` bind change is a *behavior* change for
+  *all* platforms, not just Windows. On Linux, devs who had set up
+  the dev server to be reachable from another box on the LAN (e.g.
+  for cross-device testing) will now get connection refused. Worth
+  a one-liner in the comment block calling that out, or making the
+  host configurable via env var (e.g. `${GOOSE_DEV_HOST:-127.0.0.1}`)
+  so the cross-platform default is safe but the LAN-test workflow
+  still has an escape hatch.
+- `sys.stdout.reconfigure(...)` at `test_acp_client.py:18` swallows
+  `AttributeError, OSError`. The `AttributeError` covers
+  `reconfigure` not existing on Python ≤ 3.6; the `OSError` covers
+  detached/closed streams. Worth a one-line warning print to
+  `sys.stderr` (which is now DEVNULL'd inside the subprocess but
+  still works at the top of the test script) so a Windows dev who
+  *does* hit the cp1252 fallback knows why their output looks
+  garbled.
+- `_resolve_goose_cmd()` checks `os.path.isfile(prebuilt)` at
+  `:43`, but doesn't check that the file is *executable*. On a
+  checkout where `target/debug/goose[.exe]` exists but lacks
+  `+x` (rare on Windows, possible on POSIX), the test will fail
+  later with a confusing `Popen` error.
+  `os.access(prebuilt, os.X_OK)` would catch it cleanly.
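
The last two asks can be sketched as follows. This is a minimal sketch, not the PR's code: `configure_utf8_stdout` and `resolve_goose_cmd` are hypothetical names, the `acp` argument after a `GOOSE_BIN` override is an assumption, and so is the `cargo`-on-`PATH` check before falling back to `cargo run`. Only the precedence order and the two error hints come from the review above.

```python
import os
import shutil
import sys


def configure_utf8_stdout(stream=None, warn=None):
    """Try to switch stdout to UTF-8; return True on success.

    On failure, write a one-line warning instead of failing silently.
    """
    stream = sys.stdout if stream is None else stream
    warn = sys.stderr if warn is None else warn
    try:
        stream.reconfigure(encoding="utf-8", errors="replace")
        return True
    except (AttributeError, OSError) as exc:
        print(f"warning: could not switch stdout to UTF-8 ({exc!r}); "
              "progress glyphs may look garbled", file=warn)
        return False


def resolve_goose_cmd(repo_root, env=None):
    """Pick the command to launch: env override, prebuilt binary, then cargo."""
    env = os.environ if env is None else env
    override = env.get("GOOSE_BIN")
    if override:
        return [override, "acp"]  # assumption: the override still needs the subcommand

    exe = "goose.exe" if os.name == "nt" else "goose"
    prebuilt = os.path.join(repo_root, "target", "debug", exe)
    if os.path.isfile(prebuilt):
        # Fail here with a clear message rather than later inside Popen.
        if not os.access(prebuilt, os.X_OK):
            raise RuntimeError(
                f"{prebuilt} exists but is not executable; "
                "fix its mode or set GOOSE_BIN")
        return [prebuilt, "acp"]

    if shutil.which("cargo"):
        return ["cargo", "run", "-p", "goose-cli", "--", "acp"]
    raise RuntimeError(
        "no goose binary found: run `cargo build -p goose-cli` or set GOOSE_BIN")
```

On Windows `os.access(..., os.X_OK)` is close to a no-op, so the extra check mostly pays off on POSIX, which matches where the review expects the problem to appear.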
+
+## Verdict
+
+**merge-after-nits** — the actual fixes are correct, well-commented,
+and address real Windows-only failure modes that block the entire
+inner-loop dev experience there. The asks are: parity between the
+`dev` and `dev-debug` comment blocks, an env-var escape hatch for
+the IPv4 bind, and a couple of minor robustness tweaks in the
+Python helper.
+
+## What I learned
+
+The `cargo run` stderr-deadlock pattern is one of the canonical
+subprocess pitfalls and it bites *exactly* the test scripts that
+spawn cargo because cargo prints compile chatter to stderr by
+default. The fix is always the same — read it on a thread, or
+DEVNULL it — but it's worth remembering that pipe-buffer sizes
+differ across platforms (Windows ~4KB, Linux ~64KB), so a test
+that works on Linux can deadlock on Windows just by being slow
+to drain stderr.
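
The pattern is easy to reproduce without cargo. The sketch below is an illustration, not the PR's test: `CHILD` is a made-up stand-in for `cargo run ... acp` that floods stderr before answering line-delimited JSON on stdout, and `roundtrip` is a hypothetical helper name. With `stderr=subprocess.PIPE` and no reader, the child would block on the stderr write before it ever read a request; with `DEVNULL` there is nothing to drain.

```python
import json
import subprocess
import sys

# Stand-in for the real child: writes a large amount of "compile chatter"
# to stderr, then answers each JSON line on stdin with a JSON line on stdout.
CHILD = r"""
import json, sys
sys.stderr.write("x" * 1_000_000)  # far larger than a 4KB or 64KB pipe buffer
sys.stderr.flush()
for line in sys.stdin:
    req = json.loads(line)
    print(json.dumps({"id": req["id"], "result": "ok"}), flush=True)
"""


def roundtrip(requests):
    """Send each request as one JSON line and collect one JSON reply per request."""
    replies = []
    with subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # nothing to drain, so the child cannot block on stderr
        text=True,
        bufsize=1,  # line-buffered: each newline-terminated write is flushed
    ) as proc:
        for req in requests:
            proc.stdin.write(json.dumps(req) + "\n")
            replies.append(json.loads(proc.stdout.readline()))
    # Leaving the `with` block closes stdin (EOF for the child) and waits for exit.
    return replies
```

Swapping `DEVNULL` for a reader thread would keep the stderr text available for debugging, at the cost of the extra moving part the review argues against for this test.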
+
+The Tauri `beforeDevCommand` quoting issue is a different category
+of bug: it's a contract leak across four process boundaries (just →
+bash → npm → tauri → cmd.exe). Embedding `&&` in a JSON value passed
+across all four boundaries is a recipe for argv splitting at the
+wrong point, and the symptom (truncated `--config` JSON) is
+nowhere near the cause (`&&` parsed too early). Per-platform
+escape rules accumulate and the only durable fix is to not put
+shell metacharacters in cross-process argv values at all.
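
One way to hold that line is to reject metacharacters before the value is ever embedded, and to express the working directory through the structured `cwd` field instead of a `cd ... &&` prefix. The sketch below is a hypothetical guard, not something in the PR: `SHELL_METACHARS`, `shell_metachars_in`, and `before_dev_config` are illustrative names, the character set is a deliberately conservative assumption, and the key layout is meant to mirror the object form of Tauri's `beforeDevCommand` rather than quote the justfile.

```python
import json

# Characters that bash or cmd.exe (or both) treat specially.
# Conservative on purpose; this list is an assumption, not taken from the PR.
SHELL_METACHARS = frozenset("&|;<>()$`\\\"'*?!%^\n")


def shell_metachars_in(value):
    """Return the sorted shell metacharacters found in value."""
    return sorted(set(value) & SHELL_METACHARS)


def before_dev_config(script, cwd):
    """Build the JSON for `--config`, refusing scripts that need a shell to parse."""
    bad = shell_metachars_in(script)
    if bad:
        raise ValueError(
            f"script contains shell metacharacters {bad}; "
            "use the cwd field or a wrapper script instead")
    return json.dumps(
        {"build": {"beforeDevCommand": {"script": script, "cwd": cwd}}})
```

For example, `before_dev_config("vite --host 127.0.0.1", "..")` produces a value with nothing for an intermediate shell to reinterpret, while a script such as `cd ui && vite` is rejected up front instead of being truncated three processes later.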
diff --git a/reviews/2026-W17/block-goose-pr-8842.md b/reviews/2026-W17/block-goose-pr-8842.md
@@ -0,0 +1,139 @@
+# block/goose #8842 — feat: lifecycle hooks system
+
+- **Repo**: block/goose
+- **PR**: #8842
+- **Author**: michaelneale (Michael Neale)
+- **Head SHA**: 83c0dcfebf305c8843e4263c716aa0475a5ef401
+- **Base**: main
+- **Size**: +550 / −10 — bulk in new `crates/goose/src/hooks.rs`
+  (+~444), with 3 emit sites added across `crates/goose/src/agents/agent.rs`
+  (+~80) and a sprinkle of glue in the agent constructor and reply loop.
+
+## What it changes
+
+Adds a config-driven lifecycle hooks subsystem with 6 events:
+`before_tool_call`, `after_tool_call`, `on_session_start`,
+`on_session_end`, `before_reply`, `after_reply`
+(`crates/goose/src/hooks.rs:14-25`). Each `HookEntry` pairs an
+optional `matcher` regex (matched against tool name) with one or
+more `HookHandler`s, each of which is either a shell `command`
+that receives the JSON `HookContext` on stdin, or an HTTP `url`
+that receives it as a POST body, with a `timeout` (default 10s).
+
+Hooks are loaded from goose's `config.yaml` via
+`HookManager::from_config()` (`hooks.rs:107-116`). The agent
+constructor (`agent.rs:251-255`) instantiates a manager, and three
+emit sites are wired:
+
+- `before_tool_call` (`agent.rs:594-624`): synchronously gates the
+  tool dispatch — a hook returning `block: true` short-circuits
+  with `INVALID_REQUEST` and the supplied reason.
+- `after_tool_call` (`agent.rs:692-708`): fire-and-forget, no
+  blocking semantics.
+- `before_reply` (`agent.rs:1118-1135`): emitted at the top of the
+  reply loop with the user message text in `HookContext.message`.
+
+## Strengths
+
+- Event enum is exhaustive and `serde(rename_all = "snake_case")`
+  (`hooks.rs:16`) so the YAML keys are predictable and unsurprising.
+- The `HookDecision` short-circuit at `hooks.rs:218-221` returns
+  on the *first* `block: true` hook — important so that an
+  early-deny rule can't be silently overridden by a later
+  permissive rule. Matches the principle-of-least-surprise for an
+  authorization layer.
+- `has_hooks()` gate at `agent.rs:594, 694, 1119` avoids the
+  serialization cost of building `HookContext` when no hooks are
+  registered for that event. With the hash-map check at
+  `hooks.rs:163-167` returning `false` on empty/missing entries,
+  the no-hooks code path is essentially free.
+- Matcher regex compilation failure is logged and the hook is
+  *skipped*, not panicked (`hooks.rs:194-200`). Right call —
+  one bad regex shouldn't take down all the other hooks for that
+  event.
+- Timeout default of 10s (`hooks.rs:217`) is sane for a synchronous
+  `before_tool_call` gate; long enough for a network-bound policy
+  check, short enough that a stuck script can't hang the agent.
+
+## Concerns / asks
+
+- **`after_tool_call` is documented as "fire-and-forget" but
+  `agent.rs:707-708` actually `await`s it.** That means a slow
+  after-hook still blocks the reply loop just like a before-hook.
+  Either:
+  1. Spawn it via `tokio::spawn` so the loop returns immediately
+     (matching the docstring), or
+  2. Update the docstring and the comment at `:693` to say
+     "synchronously emitted; respects timeout".
+  As-is, the contract and the implementation disagree, and the
+  consequence is that a misconfigured after-hook with a 30s
+  timeout can stall the reply loop for up to 30s after every tool
+  call.
+- `tool_result: None` at every emit site (`agent.rs:601, 700,
+  1126`) — the `HookContext` *has* a `tool_result` field but
+  no caller populates it. For `after_tool_call` this is the most
+  valuable field a hook would want (e.g. "log the result of every
+  shell command", "block on PII in tool output"). The result is
+  available right there at `agent.rs:692` (just after the
+  `WAITING_TOOL_END` debug log). Wire it through.
+- The `before_tool_call` block at `agent.rs:614-622` returns
+  `INVALID_REQUEST` with the reason in the error message. That
+  means the *model* sees the block reason in its tool-call result,
+  which is the right behavior for a guardrail — but the reason is
+  also the only thing logged. For audit purposes (a hook denied a
+  tool call), worth emitting a structured tracing event with the
+  `tool_name`, `session_id`, and `reason` fields rather than just
+  the existing `tracing::info!` line at `:611`.
+- `Config::global()` at `hooks.rs:121` is a process-wide singleton.
+  For a long-running `goose serve` whose config file gets edited
+  at runtime, the hooks won't reload — `HookManager::from_config()`
+  is only called once in the agent constructor. Worth either
+  documenting "edit config, restart goose serve" or wiring a
+  config-watcher reload path.
+- Two handler types (`command` shell + `url` HTTP) but
+  `HookHandler` (`hooks.rs:53-60`) makes both `Option<String>` —
+  nothing prevents a config from setting *both* `command` and
+  `url`, or *neither*. Convert to a tagged enum or at least
+  validate at load time and surface a config error, otherwise
+  ambiguous configs fail silently or behave inconsistently.
+- The `matcher` regex at `hooks.rs:189-199` is recompiled on
+  every emit. For a high-frequency tool-call workload, that's
+  measurable. Compile once at config-load time and store the
+  `Regex` in `HookEntry`.
+- No tests in this PR. For a feature that:
+  - executes user-supplied shell commands,
+  - can block tool dispatch,
+  - has a non-obvious "first block wins" precedence rule,
+  - has a regex-matching subsystem with skip-on-error semantics,
+  the absence of any unit tests is a real gap. At minimum:
+  - block-on-first-deny precedence,
+  - regex-skip-on-invalid behavior,
+  - timeout enforcement,
+  - per-event has_hooks gating.
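
To pin down what those tests would assert, here is a small Python model of the intended semantics. It is a sketch of the behavior described in this review, not a port of `hooks.rs`: `Entry`, `load_entry`, `Hooks`, and `validate_handler` are hypothetical names, handlers are plain callables instead of shell commands or URLs, and unanchored regex matching is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A handler takes the tool name and returns (block, reason).
Handler = Callable[[str], Tuple[bool, Optional[str]]]


@dataclass
class Entry:
    matcher: Optional[re.Pattern]  # compiled once at load time; None matches every tool
    handlers: List[Handler]


def validate_handler(spec):
    """Require exactly one of `command` or `url`; return which one is set."""
    has_cmd, has_url = bool(spec.get("command")), bool(spec.get("url"))
    if has_cmd == has_url:
        raise ValueError("handler must set exactly one of `command` or `url`")
    return "command" if has_cmd else "url"


def load_entry(matcher, handlers):
    """Compile the matcher once; return None (skip the entry) if it is invalid."""
    if matcher is None:
        return Entry(None, handlers)
    try:
        return Entry(re.compile(matcher), handlers)
    except re.error:
        return None


class Hooks:
    def __init__(self, raw):
        # raw maps event name -> list of (matcher_or_None, [handlers])
        self.entries = {}
        for event, specs in raw.items():
            loaded = [load_entry(m, hs) for m, hs in specs]
            self.entries[event] = [e for e in loaded if e is not None]

    def has_hooks(self, event):
        return bool(self.entries.get(event))

    def emit(self, event, tool_name):
        """Return the first blocking decision, or (False, None) if nothing blocks."""
        for entry in self.entries.get(event, []):
            if entry.matcher is not None and not entry.matcher.search(tool_name):
                continue
            for handler in entry.handlers:
                block, reason = handler(tool_name)
                if block:
                    return True, reason  # first deny wins; later handlers never run
        return False, None
```

Each of the four suggested tests maps onto one observable here: the order in which handlers ran, the entry count after loading a bad regex, the return value of `has_hooks` for an unconfigured event, and (not modeled in this sketch) the timeout.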
+
+## Verdict
+
+**request-changes** — the design and the integration points are
+sound, but four concrete issues need addressing before merge:
+(1) the `after_tool_call` docstring/impl disagreement,
+(2) `tool_result` never populated,
+(3) `command`/`url` mutual-exclusivity not enforced,
+(4) zero tests for a security-sensitive subsystem.
+
+The matcher recompilation, structured audit logging, and
+config-watcher reload can be follow-ups, but tests for the
+deny-precedence path and the regex-skip behavior should land
+with this PR — those are the cases an operator would assume
+Just Work, and they're cheap to cover.
+
+## What I learned
+
+A "lifecycle hooks" feature looks like a UX feature but is
+actually a security/policy feature: any hook with `block: true`
+is implicitly an authorization layer. That changes the bar for
+review — config malformation, error handling on hook failure,
+audit logging, and ordering semantics all become
+correctness-critical, not nice-to-have. The same pattern recurs
+in MCP servers, agent permissions, and webhook-style policy
+plugins; the temptation to "just call user code at lifecycle
+points" almost always evolves into a guardrail/audit subsystem,
+and it's cheaper to design for that on day one.