bug: TestReconcileSessionBeads_StartsIdleDrainAfterGrace flaky in CI #497

@rileywhite

Before you continue

  • I searched existing issues and did not find a duplicate.
  • I read the relevant docs and contributor guidance.

Gas City version

0.13.5 (CI runs against PR branch, built from main + PR commits)

Environment

GitHub Actions CI runner (ubuntu-latest). The test is timing-sensitive and depends on goroutine scheduling, making it sensitive to runner load.

Reproduction

The test fails intermittently in CI but passes locally and on main:

  1. Open any PR that touches cmd/gc/ (e.g. fix: route rig-scoped agents to rig store in session new #496)
  2. CI runs make check, which runs go test ./...
  3. TestReconcileSessionBeads_StartsIdleDrainAfterGrace fails with:
    session_sleep_test.go:231: idle probe for gc-1 did not complete
    

Not reliably reproducible locally — appears to require a loaded/slow runner.

Expected behavior

Test passes consistently. The waitForIdleProbeReady helper should complete within its deadline.

Actual behavior

On slower CI runners, the idle probe goroutine doesn't set probe.ready within the 5-second polling window, causing a spurious failure. The 5.00s test duration confirms the deadline was fully exhausted.

Logs, screenshots, or traces

--- FAIL: TestReconcileSessionBeads_StartsIdleDrainAfterGrace (5.00s)
    session_sleep_test.go:231: idle probe for gc-1 did not complete
FAIL
FAIL	github.com/gastownhall/gascity/cmd/gc	44.457s

CI run: https://github.com/gastownhall/gascity/actions/runs/24158913386/job/70505007894

The flaky helper (session_sleep_test.go:987-998):

func waitForIdleProbeReady(t *testing.T, dt *drainTracker, beadID string) {
	t.Helper()
	deadline := time.Now().Add(5 * time.Second)
	for time.Now().Before(deadline) {
		if probe, ok := dt.idleProbe(beadID); ok && probe.ready {
			return
		}
		time.Sleep(time.Millisecond)
	}
	t.Fatalf("idle probe for %s did not complete", beadID)
}

The 5-second wall-clock deadline with 1ms polling is tight for an operation that depends on async reconciliation completing in a goroutine.
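To make the polling pattern concrete, here is a minimal sketch of a generic poll-with-deadline helper where the timeout and interval are parameters rather than constants. Everything in it (`pollUntil`, `simulateProbe`, the millisecond arguments) is hypothetical and not taken from the gascity code; the goroutine that flips a flag only stands in for the real idle probe.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// pollUntil is a hypothetical helper (not from the gascity repo). It calls
// cond every interval until cond returns true or timeout elapses, and
// reports whether cond succeeded. cond is checked once more after the
// deadline passes, so a late wake-up does not miss a completed probe.
func pollUntil(timeout, interval time.Duration, cond func() bool) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if !time.Now().Before(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

// simulateProbe stands in for the asynchronous idle probe: a goroutine sets
// a flag after delayMs, and the caller polls for it for up to timeoutMs.
func simulateProbe(delayMs, timeoutMs int) bool {
	var ready atomic.Bool
	go func() {
		time.Sleep(time.Duration(delayMs) * time.Millisecond)
		ready.Store(true)
	}()
	return pollUntil(time.Duration(timeoutMs)*time.Millisecond, 5*time.Millisecond, ready.Load)
}

func main() {
	// A generous budget absorbs scheduling delay; a tight one does not.
	fmt.Println("generous budget:", simulateProbe(20, 2000))
	fmt.Println("tight budget:   ", simulateProbe(2000, 50))
}
```

With the timeout as a parameter, a CI job could pass a larger budget than a local run without touching the test body. The result still depends on wall-clock time, so this only makes the flake less likely rather than removing it.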

Additional context

All recent main branch CI runs pass (5/5 green as of 2026-04-08), confirming this is a timing flake rather than a regression. Observed while CI-checking PR #496 (unrelated fix for rig-scoped session store routing, #138).

Possible fixes: increase the deadline, use a channel/signal instead of polling, or skip the test when testing.Short() reports true (i.e. when CI runs go test -short on constrained runners).
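The channel/signal option could look roughly like the sketch below. It assumes the probe can expose a channel that is closed on completion; the `probe` type, `markReady`, `waitReady`, and `runProbe` are all hypothetical names for illustration and do not reflect the actual `drainTracker` API.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// probe is a hypothetical stand-in for the tracker's idle probe state.
// done is closed exactly once when the probe finishes.
type probe struct {
	once sync.Once
	done chan struct{}
}

func newProbe() *probe { return &probe{done: make(chan struct{})} }

// markReady signals completion; it is safe to call more than once.
func (p *probe) markReady() { p.once.Do(func() { close(p.done) }) }

// waitReady blocks until the probe completes or timeout elapses, and
// reports whether the probe completed. There is no polling loop: the
// waiter wakes as soon as done is closed.
func (p *probe) waitReady(timeout time.Duration) bool {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-p.done:
		return true
	case <-timer.C:
		return false
	}
}

// runProbe starts a probe that completes after delayMs and waits for it
// for up to timeoutMs.
func runProbe(delayMs, timeoutMs int) bool {
	p := newProbe()
	go func() {
		time.Sleep(time.Duration(delayMs) * time.Millisecond)
		p.markReady()
	}()
	return p.waitReady(time.Duration(timeoutMs) * time.Millisecond)
}

func main() {
	fmt.Println("completed in time:", runProbe(10, 2000))
	fmt.Println("completed in time:", runProbe(2000, 20))
}
```

Because the waiter blocks on the channel instead of sleeping in 1ms steps, the timeout acts only as a safety net and can be set generously without slowing down the passing case.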

Labels: kind/bug (Broken behavior), priority/p1 (High: core workflow broken)
