Fix SetAgentState failure: deterministic agentBeadID across spawn and session #3677

Wants to merge 1 commit into gastownhall:main
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.
utafrali left a comment:
Correct fix with clear reasoning. Replacing repeated workspace.Find() calls with a single filepath.Dir(r.Path) at construction time is the right approach given the rig path invariant, and the code is meaningfully simpler as a result. The two gaps are a missing regression test for a non-obvious non-determinism bug, and an undocumented behavior change in the capacity-check error path.
```go
// agentBeadID returns the agent bead ID for a polecat.
// Format: "<prefix>-<rig>-polecat-<name>" (e.g., "gt-gastown-polecat-Toast", "bd-beads-polecat-obsidian")
// The prefix is looked up from routes.jsonl to support rigs with custom prefixes.
// Uses the town root computed at Manager construction for deterministic IDs.
```
No regression test was added for the core bug. The failure mode was non-determinism across two Manager instances computing different IDs from the same rig path, and that class of bug is easy to silently reintroduce during a future refactor of NewManager. A simple test that constructs a Manager from a known path and asserts agentBeadID("somecat") returns the same string on repeated calls, regardless of process working directory, would lock this in. Without it, the fix relies entirely on code review to catch regressions.
Added TestAgentBeadID_Deterministic in commit 0d5634b — it constructs two Managers from the same rig path (including after changing the process working directory) and asserts agentBeadID("somecat") returns identical strings across both. This locks in the fix against future refactors of NewManager.
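A minimal, self-contained sketch of what such a determinism check can look like. The `Manager` struct, `NewManager`, `agentBeadID`, and the hard-coded `gt` prefix here are simplified stand-ins for illustration, not the real implementation (which looks the prefix up from routes.jsonl):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Manager is a simplified stand-in: the real type carries more state,
// but the fix is the same — townRoot is computed once at construction.
type Manager struct {
	townRoot string
	rigName  string
}

// NewManager derives the town root from the rig path invariant:
// rigPath == filepath.Join(townRoot, rigName), so the town root is
// always the rig path's parent directory — no workspace lookup needed.
func NewManager(rigPath string) *Manager {
	return &Manager{
		townRoot: filepath.Dir(rigPath),
		rigName:  filepath.Base(rigPath),
	}
}

// agentBeadID builds the bead ID from construction-time state only,
// so it cannot vary with the process working directory.
func (m *Manager) agentBeadID(polecat string) string {
	// Prefix lookup from routes.jsonl omitted; "gt" assumed for the sketch.
	return fmt.Sprintf("gt-%s-polecat-%s", m.rigName, polecat)
}

func main() {
	rigPath := "/town/gastown"
	a := NewManager(rigPath).agentBeadID("somecat")

	// Changing the working directory must not change the computed ID.
	_ = os.Chdir(os.TempDir())
	b := NewManager(rigPath).agentBeadID("somecat")

	fmt.Println(a == b)
	fmt.Println(a)
}
```

The point the real test locks in is the same: two `Manager` instances built from the same rig path must agree on the ID regardless of call-site context.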
```diff
 }

-	hasCapacity, active, err := doltserver.HasConnectionCapacity(townRoot)
+	hasCapacity, active, err := doltserver.HasConnectionCapacity(m.townRoot)
```
This is a subtle behavior change worth calling out explicitly. The old code had return nil // Can't determine town root, skip check when workspace.Find failed, meaning a broken workspace lookup silently made every capacity check pass. The new code always calls HasConnectionCapacity(m.townRoot), so if HasConnectionCapacity returns an error for a valid-but-under-load server, it now propagates that error where it previously would have been swallowed. This is almost certainly the right behavior (and was probably a latent bug in the old path), but it changes the failure mode from silent-pass to fail-closed. Worth a brief comment here noting that the old skip-on-error behavior was intentionally removed.
Added an explicit comment at CheckDoltServerCapacity documenting this intentional behavior change. The old path silently returned nil when workspace.Find failed, making every capacity check pass. The new code propagates errors as fail-closed, which is the correct behavior — better to block a spawn than to overload Dolt. Comment calls this out directly so future readers understand it was deliberate.
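A self-contained sketch of the fail-closed contract described above. `hasConnectionCapacity` is a stub standing in for `doltserver.HasConnectionCapacity`, and the function shape is an assumption for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// hasConnectionCapacity stands in for doltserver.HasConnectionCapacity;
// it simulates an error for a missing town root and an at-capacity server
// otherwise (9 active connections).
func hasConnectionCapacity(townRoot string) (bool, int, error) {
	if townRoot == "" {
		return false, 0, errors.New("no town root")
	}
	return false, 9, nil
}

// checkDoltServerCapacity illustrates the new behavior: both probe errors
// and at-capacity states block the caller. The old path returned nil on a
// failed workspace lookup, silently passing every capacity check.
func checkDoltServerCapacity(townRoot string) error {
	hasCapacity, active, err := hasConnectionCapacity(townRoot)
	if err != nil {
		// Fail closed instead of `return nil`: better to block a spawn
		// than to overload Dolt.
		return fmt.Errorf("dolt capacity check: %w", err)
	}
	if !hasCapacity {
		return fmt.Errorf("dolt server at capacity (%d active connections)", active)
	}
	return nil
}

func main() {
	fmt.Println(checkDoltServerCapacity("/town"))
	fmt.Println(checkDoltServerCapacity(""))
}
```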
…27k) Address PR gastownhall#3677 review feedback:

1. Add TestAgentBeadID_Deterministic regression test that constructs Managers from the same rig path (including after a cwd change) and asserts agentBeadID returns identical strings — locks in the gt-lph fix against future refactors of NewManager.
2. Add a comment at CheckDoltServerCapacity documenting the intentional behavior change: the old skip-on-error path (return nil when workspace.Find failed) was silently removed, and errors now propagate as fail-closed — the right behavior for gt-lfc0d.
The Integration Tests failure on this PR is a pre-existing issue. Once #3689 merges, rebasing this PR should give you green CI.
…3680)

* fix(scheduler): skip missing-bead polecats in capacity count + log zero-capacity dispatch

  Bug: countWorkingPolecats() counted polecats with missing or unreachable agent beads as "working," inflating the capacity count. With 7 orphan tmux sessions and no agent beads, capacity computed as 5-7=-2 and the scheduler dispatched zero beads for 12+ hours with no log output.

  Root cause: the conservative fallback (count++ on bead lookup failure) assumed missing-bead means "might be working." In practice, a polecat with no agent bead has no hook — it cannot be working. The Dolt-down case (all lookups fail → count=0) is safe because polecat_spawn.go gates on Dolt health before spawning.

  Changes:
  - scheduler.go: treat missing-bead polecats as idle (skip, don't count)
  - capacity_dispatch.go: log when dispatch skips beads at zero capacity instead of returning silently

  Pre-existing test fixes (all failing on main CI):
  - diskspace_unix.go: stat.Bavail is int64 on freebsd but uint64 on linux. Cast to uint64 with a nolint:unconvert directive. Fixes TestCrossPlatformBuild/freebsd_amd64.
  - doltserver_test.go: the TestFindBrokenWorkspaces tests connected to a real Dolt server on the default port (3307) instead of using the test's temp directory. Write config.yaml with an unused port to isolate the tests.
  - scheduler_integration_test.go: TestSchedulerAutoConvoyCreation used bd show to verify the convoy, which hits the "crystallizes" column error on bd v0.57.0 (column dropped in later versions). Replace with a direct SQL query against the Dolt test container. Also simplified the tracking dep verification to be non-fatal since cross-rig deps use the external: prefix format.

  Refs: popandpeek#20 (fork CI validation)
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use deterministic townRoot in Manager for consistent agentBeadID (gt-lph)

  agentBeadID() called workspace.Find(m.rig.Path) on every invocation, which could resolve differently or fail depending on call-site context. During spawn, the Manager was constructed from the main rig path and resolved correctly. During StartSession, a new Manager was constructed and workspace.Find could return a different result, computing a different agent bead ID.

  Fix: compute townRoot once at Manager construction using filepath.Dir(r.Path) (the rig path is always filepath.Join(townRoot, rigName)) and store it on the struct. All methods that previously called workspace.Find(m.rig.Path) now use the stored m.townRoot, ensuring deterministic agent bead IDs regardless of call site.

* fix: add agentBeadID determinism test and capacity check comment (gt-27k)

  Address PR #3677 review feedback:
  1. Add TestAgentBeadID_Deterministic regression test that constructs Managers from the same rig path (including after a cwd change) and asserts agentBeadID returns identical strings — locks in the gt-lph fix against future refactors of NewManager.
  2. Add a comment at CheckDoltServerCapacity documenting the intentional behavior change: the old skip-on-error path (return nil when workspace.Find failed) was silently removed, and errors now propagate as fail-closed — the right behavior for gt-lfc0d.

* fix(env): remove ANTHROPIC_BASE_URL from parent process passthrough

  AgentEnv() forwarded ANTHROPIC_BASE_URL from os.Getenv() into every spawned agent's environment. When a non-Anthropic agent (e.g., a MiniMax deacon with ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic) spawned Claude polecats, they inherited the MiniMax base URL but used an Anthropic API key, causing 401 auth failures on every API call.

  Agents that need a custom base URL (MiniMax, Groq, etc.) already receive it from their agent config's Env block (rc.Env in settings/config.json role_agents), which is the correct mechanism. The os.Getenv passthrough was redundant for configured agents and harmful for cross-provider spawns.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(namepool): read polecat_names from rig config.json as fallback + warn on missing theme

  NewManager only read namepool config from settings/config.json. If that file was missing but the rig's config.json had polecat_names configured, the names were ignored and the ThemeForRig hash was used instead. Now checks rig config.json for polecat_names when settings/config.json has no namepool config. Also logs a warning when a configured theme name can't be resolved (instead of silently falling back to the default).

  Fixes: #3594
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: replace invalid --rig flag with --repo in bd create (ga-movc)

  bd create does not support --rig. All polecats were hitting 'MR bead creation failed: ... --rig not supported' on gt done. Add GetRigDirForName() to routes.go, which resolves a rig name to its directory path via routes.jsonl. In beads.Create(), replace the invalid --rig=<name> arg with --repo=<rig_dir>, which bd create uses to open the target store directly instead of auto-routing.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------
Co-authored-by: emmett <ben@popandpeek.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: chrome <leecostello1@gmail.com>
Force-pushed from 0d5634b to 76a4f98.
Summary

`SetAgentState` failed with "issue not found" on every polecat spawn because `agentBeadID()` computed different IDs from the main rig path vs the polecat worktree path.

Impact

- `gt polecat list` showed working polecats as "stalled"

Test plan

- `gt sling` a bead — verify no `SetAgentState` warnings
- `gt polecat list` shows the polecat as "working" (not "stalled")
- `gt polecat nuke` no longer reports "agent bead not found"