Skip to content

Fix SetAgentState failure: deterministic agentBeadID across spawn and session#3677

Open
8888 wants to merge 1 commit intogastownhall:mainfrom
8888:fix/setagentstate-bead-id
Open

Fix SetAgentState failure: deterministic agentBeadID across spawn and session#3677
8888 wants to merge 1 commit intogastownhall:mainfrom
8888:fix/setagentstate-bead-id

Conversation

@8888
Copy link
Copy Markdown
Contributor

@8888 8888 commented Apr 18, 2026

Summary

  • Fixes SetAgentState fails on every polecat spawn: agentBeadID path mismatch #3676
  • SetAgentState failed with "issue not found" on every polecat spawn because agentBeadID() computed different IDs from the main rig path vs the polecat worktree path
  • The Manager now uses a deterministic town root so the agent bead ID is consistent regardless of whether it's called during spawn or session start

Impact

  • Polecats were working but invisible to the Gas Town Control Center dashboard
  • gt polecat list showed working polecats as "stalled"
  • Witness escalated unnecessarily on healthy polecats

Test plan

  • gt sling a bead — verify no SetAgentState warnings
  • gt polecat list shows polecat as "working" (not "stalled")
  • Dashboard shows polecat count > 0
  • gt polecat nuke no longer reports "agent bead not found"

@github-actions github-actions bot added the status/needs-triage Inbox — we haven't looked at it yet label Apr 18, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@utafrali utafrali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct fix with clear reasoning. Replacing repeated workspace.Find() calls with a single filepath.Dir(r.Path) at construction time is the right approach given the rig path invariant, and the code is meaningfully simpler as a result. The two gaps are a missing regression test for a non-obvious non-determinism bug, and an undocumented behavior change in the capacity-check error path.

// agentBeadID returns the agent bead ID for a polecat.
// Format: "<prefix>-<rig>-polecat-<name>" (e.g., "gt-gastown-polecat-Toast", "bd-beads-polecat-obsidian")
// The prefix is looked up from routes.jsonl to support rigs with custom prefixes.
// Uses the town root computed at Manager construction for deterministic IDs
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No regression test was added for the core bug. The failure mode was non-determinism across two Manager instances computing different IDs from the same rig path, and that class of bug is easy to silently reintroduce during a future refactor of NewManager. A simple test that constructs a Manager from a known path and asserts agentBeadID("somecat") returns the same string on repeated calls, regardless of process working directory, would lock this in. Without it, the fix relies entirely on code review to catch regressions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added TestAgentBeadID_Deterministic in commit 0d5634b — it constructs two Managers from the same rig path (including after changing the process working directory) and asserts agentBeadID("somecat") returns identical strings across both. This locks in the fix against future refactors of NewManager.

}

hasCapacity, active, err := doltserver.HasConnectionCapacity(townRoot)
hasCapacity, active, err := doltserver.HasConnectionCapacity(m.townRoot)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a subtle behavior change worth calling out explicitly. The old code had return nil // Can't determine town root, skip check when workspace.Find failed, meaning a broken workspace lookup silently made every capacity check pass. The new code always calls HasConnectionCapacity(m.townRoot), so if HasConnectionCapacity returns an error for a valid-but-under-load server, it now propagates that error where it previously would have been swallowed. This is almost certainly the right behavior (and was probably a latent bug in the old path), but it changes the failure mode from silent-pass to fail-closed. Worth a brief comment here noting that the old skip-on-error behavior was intentionally removed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an explicit comment at CheckDoltServerCapacity documenting this intentional behavior change. The old path silently returned nil when workspace.Find failed, making every capacity check pass. The new code propagates errors as fail-closed, which is the correct behavior — better to block a spawn than to overload Dolt. Comment calls this out directly so future readers understand it was deliberate.

8888 added a commit to 8888/gastown that referenced this pull request Apr 18, 2026
…27k)

Address PR gastownhall#3677 review feedback:
1. Add TestAgentBeadID_Deterministic regression test that constructs
   Managers from the same rig path (including after cwd change) and
   asserts agentBeadID returns identical strings — locks in the gt-lph
   fix against future refactors of NewManager.
2. Add comment at CheckDoltServerCapacity documenting the intentional
   behavior change: the old skip-on-error path (return nil when
   workspace.Find failed) was silently removed, and errors now
   propagate as fail-closed — the right behavior for gt-lfc0d.
@popandpeek
Copy link
Copy Markdown
Contributor

The Integration Tests failure on this PR is a pre-existing issue with TestSchedulerAutoConvoyCreation — it's been failing on main CI since Apr 15 (run 24477479864).

Root cause: the test uses bd show --json which queries the crystallizes column (dropped in bd v0.63.3, but CI pins bd v0.57.0). The fix is in PR #3689 — replaces bd show with a direct SQL query.

Once #3689 merges, rebasing this PR should give you green CI.

popandpeek pushed a commit to popandpeek/gastown that referenced this pull request Apr 19, 2026
…27k)

Address PR gastownhall#3677 review feedback:
1. Add TestAgentBeadID_Deterministic regression test that constructs
   Managers from the same rig path (including after cwd change) and
   asserts agentBeadID returns identical strings — locks in the gt-lph
   fix against future refactors of NewManager.
2. Add comment at CheckDoltServerCapacity documenting the intentional
   behavior change: the old skip-on-error path (return nil when
   workspace.Find failed) was silently removed, and errors now
   propagate as fail-closed — the right behavior for gt-lfc0d.
popandpeek pushed a commit to popandpeek/gastown that referenced this pull request Apr 19, 2026
…27k)

Address PR gastownhall#3677 review feedback:
1. Add TestAgentBeadID_Deterministic regression test that constructs
   Managers from the same rig path (including after cwd change) and
   asserts agentBeadID returns identical strings — locks in the gt-lph
   fix against future refactors of NewManager.
2. Add comment at CheckDoltServerCapacity documenting the intentional
   behavior change: the old skip-on-error path (return nil when
   workspace.Find failed) was silently removed, and errors now
   propagate as fail-closed — the right behavior for gt-lfc0d.
popandpeek pushed a commit to popandpeek/gastown that referenced this pull request Apr 19, 2026
…27k)

Address PR gastownhall#3677 review feedback:
1. Add TestAgentBeadID_Deterministic regression test that constructs
   Managers from the same rig path (including after cwd change) and
   asserts agentBeadID returns identical strings — locks in the gt-lph
   fix against future refactors of NewManager.
2. Add comment at CheckDoltServerCapacity documenting the intentional
   behavior change: the old skip-on-error path (return nil when
   workspace.Find failed) was silently removed, and errors now
   propagate as fail-closed — the right behavior for gt-lfc0d.
steveyegge pushed a commit that referenced this pull request Apr 20, 2026
…3680)

* fix(scheduler): skip missing-bead polecats in capacity count + log zero-capacity dispatch

Bug: countWorkingPolecats() counted polecats with missing or unreachable
agent beads as "working," inflating the capacity count. With 7 orphan
tmux sessions and no agent beads, capacity computed as 5-7=-2 and the
scheduler dispatched zero beads for 12+ hours with no log output.

Root cause: the conservative fallback (count++ on bead lookup failure)
assumed missing-bead means "might be working." In practice, a polecat
with no agent bead has no hook — it cannot be working. The Dolt-down
case (all lookups fail → count=0) is safe because polecat_spawn.go
gates on Dolt health before spawning.

Changes:
- scheduler.go: treat missing-bead polecats as idle (skip, don't count)
- capacity_dispatch.go: log when dispatch skips beads at zero capacity
  instead of returning silently

Pre-existing test fixes (all failing on main CI):
- diskspace_unix.go: stat.Bavail is int64 on freebsd but uint64 on
  linux. Cast to uint64 with nolint:unconvert directive. Fixes
  TestCrossPlatformBuild/freebsd_amd64.
- doltserver_test.go: TestFindBrokenWorkspaces tests connected to a
  real Dolt server on the default port (3307) instead of using the
  test's temp directory. Write config.yaml with an unused port to
  isolate the tests.
- scheduler_integration_test.go: TestSchedulerAutoConvoyCreation used
  bd show to verify the convoy, which hits the "crystallizes" column
  error on bd v0.57.0 (column dropped in later versions). Replace
  with direct SQL query against the Dolt test container. Also
  simplified the tracking dep verification to be non-fatal since
  cross-rig deps use the external: prefix format.

Refs: popandpeek#20 (fork CI validation)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use deterministic townRoot in Manager for consistent agentBeadID (gt-lph)

agentBeadID() called workspace.Find(m.rig.Path) on every invocation, which
could resolve differently or fail depending on call-site context. During spawn,
the Manager was constructed from the main rig path and resolved correctly.
During StartSession, a new Manager was constructed and workspace.Find could
return a different result, computing a different agent bead ID.

Fix: compute townRoot once at Manager construction using filepath.Dir(r.Path)
(rig path is always filepath.Join(townRoot, rigName)) and store it on the
struct. All methods that previously called workspace.Find(m.rig.Path) now
use the stored m.townRoot, ensuring deterministic agent bead IDs regardless
of call site.

* fix: add agentBeadID determinism test and capacity check comment (gt-27k)

Address PR #3677 review feedback:
1. Add TestAgentBeadID_Deterministic regression test that constructs
   Managers from the same rig path (including after cwd change) and
   asserts agentBeadID returns identical strings — locks in the gt-lph
   fix against future refactors of NewManager.
2. Add comment at CheckDoltServerCapacity documenting the intentional
   behavior change: the old skip-on-error path (return nil when
   workspace.Find failed) was silently removed, and errors now
   propagate as fail-closed — the right behavior for gt-lfc0d.

* fix(env): remove ANTHROPIC_BASE_URL from parent process passthrough

AgentEnv() forwarded ANTHROPIC_BASE_URL from os.Getenv() into every
spawned agent's environment. When a non-Anthropic agent (e.g., MiniMax
deacon with ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic) spawned
Claude polecats, they inherited the MiniMax base URL but used an
Anthropic API key, causing 401 auth failures on every API call.

Agents that need a custom base URL (MiniMax, Groq, etc.) already receive
it from their agent config's Env block (rc.Env in settings/config.json
role_agents), which is the correct mechanism. The os.Getenv passthrough
was redundant for configured agents and harmful for cross-provider spawns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(namepool): read polecat_names from rig config.json as fallback + warn on missing theme

NewManager only read namepool config from settings/config.json. If that
file was missing but the rig's config.json had polecat_names configured,
the names were ignored and ThemeForRig hash was used instead.

Now checks rig config.json for polecat_names when settings/config.json
has no namepool config. Also logs a warning when a configured theme name
can't be resolved (instead of silently falling back to default).

Fixes: #3594

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: replace invalid --rig flag with --repo in bd create (ga-movc)

bd create does not support --rig. All polecats were hitting
'MR bead creation failed: ... --rig not supported' on gt done.

Add GetRigDirForName() to routes.go which resolves a rig name to its
directory path via routes.jsonl. In beads.Create(), replace the invalid
--rig=<name> arg with --repo=<rig_dir>, which bd create uses to open
the target store directly instead of auto-routing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: emmett <ben@popandpeek.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: chrome <leecostello1@gmail.com>
…27k)

Address PR gastownhall#3677 review feedback:
1. Add TestAgentBeadID_Deterministic regression test that constructs
   Managers from the same rig path (including after cwd change) and
   asserts agentBeadID returns identical strings — locks in the gt-lph
   fix against future refactors of NewManager.
2. Add comment at CheckDoltServerCapacity documenting the intentional
   behavior change: the old skip-on-error path (return nil when
   workspace.Find failed) was silently removed, and errors now
   propagate as fail-closed — the right behavior for gt-lfc0d.
@8888 8888 force-pushed the fix/setagentstate-bead-id branch from 0d5634b to 76a4f98 Compare April 20, 2026 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

status/needs-triage Inbox — we haven't looked at it yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SetAgentState fails on every polecat spawn: agentBeadID path mismatch

4 participants