Fix SetAgentState failure: deterministic agentBeadID across spawn and session #3677

Wants to merge 1 commit into gastownhall:main
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.
utafrali left a comment:
Correct fix with clear reasoning. Replacing repeated workspace.Find() calls with a single filepath.Dir(r.Path) at construction time is the right approach given the rig path invariant, and the code is meaningfully simpler as a result. The two gaps are a missing regression test for a non-obvious non-determinism bug, and an undocumented behavior change in the capacity-check error path.
```go
// agentBeadID returns the agent bead ID for a polecat.
// Format: "<prefix>-<rig>-polecat-<name>" (e.g., "gt-gastown-polecat-Toast", "bd-beads-polecat-obsidian")
// The prefix is looked up from routes.jsonl to support rigs with custom prefixes.
// Uses the town root computed at Manager construction for deterministic IDs.
```
No regression test was added for the core bug. The failure mode was non-determinism across two Manager instances computing different IDs from the same rig path, and that class of bug is easy to silently reintroduce during a future refactor of NewManager. A simple test that constructs a Manager from a known path and asserts agentBeadID("somecat") returns the same string on repeated calls, regardless of process working directory, would lock this in. Without it, the fix relies entirely on code review to catch regressions.
Added TestAgentBeadID_Deterministic in commit 0d5634b — it constructs two Managers from the same rig path (including after changing the process working directory) and asserts agentBeadID("somecat") returns identical strings across both. This locks in the fix against future refactors of NewManager.
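A minimal, self-contained sketch of what such a determinism check can look like. The `Manager` struct, `NewManager`, `agentBeadID`, and the hard-coded `gt` prefix here are simplified stand-ins for illustration, not the real implementation (which looks the prefix up from routes.jsonl):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Manager is a simplified stand-in: the real type carries more state,
// but the fix is the same — townRoot is computed once at construction.
type Manager struct {
	townRoot string
	rigName  string
}

// NewManager derives the town root from the rig path invariant:
// rigPath == filepath.Join(townRoot, rigName), so the town root is
// always the rig path's parent directory — no workspace lookup needed.
func NewManager(rigPath string) *Manager {
	return &Manager{
		townRoot: filepath.Dir(rigPath),
		rigName:  filepath.Base(rigPath),
	}
}

// agentBeadID builds the bead ID from construction-time state only,
// so it cannot vary with the process working directory.
func (m *Manager) agentBeadID(polecat string) string {
	// Prefix lookup from routes.jsonl omitted; "gt" assumed for the sketch.
	return fmt.Sprintf("gt-%s-polecat-%s", m.rigName, polecat)
}

func main() {
	rigPath := "/town/gastown"
	a := NewManager(rigPath).agentBeadID("somecat")

	// Changing the working directory must not change the computed ID.
	_ = os.Chdir(os.TempDir())
	b := NewManager(rigPath).agentBeadID("somecat")

	fmt.Println(a == b)
	fmt.Println(a)
}
```

The point the real test locks in is the same: two `Manager` instances built from the same rig path must agree on the ID regardless of call-site context.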
```diff
 }

-	hasCapacity, active, err := doltserver.HasConnectionCapacity(townRoot)
+	hasCapacity, active, err := doltserver.HasConnectionCapacity(m.townRoot)
```
This is a subtle behavior change worth calling out explicitly. The old code had return nil // Can't determine town root, skip check when workspace.Find failed, meaning a broken workspace lookup silently made every capacity check pass. The new code always calls HasConnectionCapacity(m.townRoot), so if HasConnectionCapacity returns an error for a valid-but-under-load server, it now propagates that error where it previously would have been swallowed. This is almost certainly the right behavior (and was probably a latent bug in the old path), but it changes the failure mode from silent-pass to fail-closed. Worth a brief comment here noting that the old skip-on-error behavior was intentionally removed.
Added an explicit comment at CheckDoltServerCapacity documenting this intentional behavior change. The old path silently returned nil when workspace.Find failed, making every capacity check pass. The new code propagates errors as fail-closed, which is the correct behavior — better to block a spawn than to overload Dolt. Comment calls this out directly so future readers understand it was deliberate.
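A self-contained sketch of the fail-closed contract described above. `hasConnectionCapacity` is a stub standing in for `doltserver.HasConnectionCapacity`, and the function shape is an assumption for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// hasConnectionCapacity stands in for doltserver.HasConnectionCapacity;
// it simulates an error for a missing town root and an at-capacity server
// otherwise (9 active connections).
func hasConnectionCapacity(townRoot string) (bool, int, error) {
	if townRoot == "" {
		return false, 0, errors.New("no town root")
	}
	return false, 9, nil
}

// checkDoltServerCapacity illustrates the new behavior: both probe errors
// and at-capacity states block the caller. The old path returned nil on a
// failed workspace lookup, silently passing every capacity check.
func checkDoltServerCapacity(townRoot string) error {
	hasCapacity, active, err := hasConnectionCapacity(townRoot)
	if err != nil {
		// Fail closed instead of `return nil`: better to block a spawn
		// than to overload Dolt.
		return fmt.Errorf("dolt capacity check: %w", err)
	}
	if !hasCapacity {
		return fmt.Errorf("dolt server at capacity (%d active connections)", active)
	}
	return nil
}

func main() {
	fmt.Println(checkDoltServerCapacity("/town"))
	fmt.Println(checkDoltServerCapacity(""))
}
```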
…27k) Address PR gastownhall#3677 review feedback:

1. Add TestAgentBeadID_Deterministic regression test that constructs Managers from the same rig path (including after a cwd change) and asserts agentBeadID returns identical strings — locks in the gt-lph fix against future refactors of NewManager.
2. Add a comment at CheckDoltServerCapacity documenting the intentional behavior change: the old skip-on-error path (return nil when workspace.Find failed) was silently removed, and errors now propagate as fail-closed — the right behavior for gt-lfc0d.
The Integration Tests failure on this PR is a pre-existing issue. Once #3689 merges, rebasing this PR should give you green CI.
…3680)

* fix(scheduler): skip missing-bead polecats in capacity count + log zero-capacity dispatch

  Bug: countWorkingPolecats() counted polecats with missing or unreachable agent beads as "working," inflating the capacity count. With 7 orphan tmux sessions and no agent beads, capacity computed as 5-7=-2 and the scheduler dispatched zero beads for 12+ hours with no log output.

  Root cause: the conservative fallback (count++ on bead lookup failure) assumed missing-bead means "might be working." In practice, a polecat with no agent bead has no hook — it cannot be working. The Dolt-down case (all lookups fail → count=0) is safe because polecat_spawn.go gates on Dolt health before spawning.

  Changes:
  - scheduler.go: treat missing-bead polecats as idle (skip, don't count)
  - capacity_dispatch.go: log when dispatch skips beads at zero capacity instead of returning silently

  Pre-existing test fixes (all failing on main CI):
  - diskspace_unix.go: stat.Bavail is int64 on freebsd but uint64 on linux. Cast to uint64 with a nolint:unconvert directive. Fixes TestCrossPlatformBuild/freebsd_amd64.
  - doltserver_test.go: the TestFindBrokenWorkspaces tests connected to a real Dolt server on the default port (3307) instead of using the test's temp directory. Write config.yaml with an unused port to isolate the tests.
  - scheduler_integration_test.go: TestSchedulerAutoConvoyCreation used bd show to verify the convoy, which hits the "crystallizes" column error on bd v0.57.0 (column dropped in later versions). Replace with a direct SQL query against the Dolt test container. Also simplified the tracking dep verification to be non-fatal since cross-rig deps use the external: prefix format.

  Refs: popandpeek#20 (fork CI validation)
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use deterministic townRoot in Manager for consistent agentBeadID (gt-lph)

  agentBeadID() called workspace.Find(m.rig.Path) on every invocation, which could resolve differently or fail depending on call-site context. During spawn, the Manager was constructed from the main rig path and resolved correctly. During StartSession, a new Manager was constructed and workspace.Find could return a different result, computing a different agent bead ID.

  Fix: compute townRoot once at Manager construction using filepath.Dir(r.Path) (the rig path is always filepath.Join(townRoot, rigName)) and store it on the struct. All methods that previously called workspace.Find(m.rig.Path) now use the stored m.townRoot, ensuring deterministic agent bead IDs regardless of call site.

* fix: add agentBeadID determinism test and capacity check comment (gt-27k)

  Address PR #3677 review feedback:
  1. Add TestAgentBeadID_Deterministic regression test that constructs Managers from the same rig path (including after a cwd change) and asserts agentBeadID returns identical strings — locks in the gt-lph fix against future refactors of NewManager.
  2. Add a comment at CheckDoltServerCapacity documenting the intentional behavior change: the old skip-on-error path (return nil when workspace.Find failed) was silently removed, and errors now propagate as fail-closed — the right behavior for gt-lfc0d.

* fix(env): remove ANTHROPIC_BASE_URL from parent process passthrough

  AgentEnv() forwarded ANTHROPIC_BASE_URL from os.Getenv() into every spawned agent's environment. When a non-Anthropic agent (e.g., a MiniMax deacon with ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic) spawned Claude polecats, they inherited the MiniMax base URL but used an Anthropic API key, causing 401 auth failures on every API call.

  Agents that need a custom base URL (MiniMax, Groq, etc.) already receive it from their agent config's Env block (rc.Env in settings/config.json role_agents), which is the correct mechanism. The os.Getenv passthrough was redundant for configured agents and harmful for cross-provider spawns.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(namepool): read polecat_names from rig config.json as fallback + warn on missing theme

  NewManager only read namepool config from settings/config.json. If that file was missing but the rig's config.json had polecat_names configured, the names were ignored and the ThemeForRig hash was used instead. Now checks rig config.json for polecat_names when settings/config.json has no namepool config. Also logs a warning when a configured theme name can't be resolved (instead of silently falling back to the default).

  Fixes: #3594
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: replace invalid --rig flag with --repo in bd create (ga-movc)

  bd create does not support --rig. All polecats were hitting 'MR bead creation failed: ... --rig not supported' on gt done. Add GetRigDirForName() to routes.go, which resolves a rig name to its directory path via routes.jsonl. In beads.Create(), replace the invalid --rig=<name> arg with --repo=<rig_dir>, which bd create uses to open the target store directly instead of auto-routing.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------
Co-authored-by: emmett <ben@popandpeek.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: chrome <leecostello1@gmail.com>
Force-pushed from 0d5634b to 76a4f98.
Summary

`SetAgentState` failed with "issue not found" on every polecat spawn because `agentBeadID()` computed different IDs from the main rig path vs the polecat worktree path.

Impact

- `gt polecat list` showed working polecats as "stalled"

Test plan

- `gt sling` a bead — verify no `SetAgentState` warnings
- `gt polecat list` shows the polecat as "working" (not "stalled")
- `gt polecat nuke` no longer reports "agent bead not found"