Skip to content

Commit 391222d

Browse files
authored
fix: exclude pre_start-staged CopyFiles from config fingerprint (#682) (#695)
## Summary Fixes #682 — polecat config-drift false positive from non-deterministic CopyFiles hash. `cmd/gc.stageHookFiles` probes workDir-relative paths (`.claude/skills`, `.gemini/settings.json`, `.codex/hooks.json`, etc.) and captures their content hashes into `CopyEntry.ContentHash`. Those destinations are populated by `pre_start` scripts such as `worktree-setup.sh --sync`, so the hash computed at session startup differs from the hash the reconciler re-computes after `pre_start` completes. `CoreFingerprint` diverges, config-drift fires, and the polecat is drained within 1–2 cycles of starting. Every replacement polecat hits the same race, producing the thrash loop described in #682 (same `config_revision`, three distinct `CopyFiles` breakdown hashes across cycles 1/4/6). ## Fix Add `SkipFingerprint bool` to `runtime.CopyEntry`. All four fingerprint consumers (`runtime.hashCoreFields`, `runtime.CoreFingerprintBreakdown`, `runtime.LogCoreFingerprintDrift`, and the dormant `cmd/gc.canonicalConfigHash`) skip entries where `cf.Probed && cf.SkipFingerprint` are both true. The `Probed` precondition is deliberate: it guarantees that a future caller who mistakenly sets `SkipFingerprint=true` on a config-derived entry cannot silently mute a real config change. `TestSkipFingerprintIgnoredOnConfigDerivedEntries` locks this contract in. `stageHookFiles` sets the flag on every workDir-based probed entry (the 7-provider hook loop + the `.claude/skills` stage). It also drops the `runtime.HashPathContent()` call for those entries since the result is now discarded — eliminating a recursive SHA-256 walk of `.claude/skills` on every reconciler tick per polecat. The cityDir fallback (`.gc/settings.json` / `hooks/claude.json`) is deliberately left unmarked so real user edits still drive drain. `TestStageHookFilesIncludesCanonicalClaudeHook` and `TestStageHookFilesFallsBackToLegacyClaudeHook` both positively assert `SkipFingerprint==false` on that branch. ## Test plan - [x] `TestSkipFingerprintExcludesFromCoreHash` — skipped entries produce the same fingerprint as a baseline Config regardless of `ContentHash` (proves exclusion is total, not a `ContentHash==""` shortcut). - [x] `TestSkipFingerprintIgnoredOnConfigDerivedEntries` — `Probed=false, SkipFingerprint=true` on a config-derived entry still contributes to the fingerprint; changes to `Src` still drive drift. Footgun guard. - [x] `TestSkipFingerprintStableUnderFilesystemChurn` — regression test for the exact #682 bug: probe a tmp skills dir, write a file to it, re-probe. Without `SkipFingerprint` the hash drifts (locked in as a precondition assertion so the test can't vacuously pass). With `SkipFingerprint` the hash stays stable. A naive "skip when ContentHash empty" implementation fails this test. - [x] `TestLogCoreFingerprintDriftSkipsExcludedEntries` — diagnostic output does not leak skipped entries. - [x] `TestStageHookFilesIncludesCodexAndCopilotExecutableHooks` extended: every workDir-based probed entry has `SkipFingerprint==true` AND `ContentHash==""` (verifying the dropped `HashPathContent` call). - [x] `TestStageHookFilesIncludesCanonicalClaudeHook` / `TestStageHookFilesFallsBackToLegacyClaudeHook` extended: cityDir fallback entries must NOT have `SkipFingerprint`. - [x] `go build ./...` clean - [x] `go vet ./...` clean - [x] `make check` fmt-check + lint + vet clean - [x] `go test ./internal/runtime/... ./cmd/gc/` PASS (two baseline flakes verified failing identically on `origin/main`: `TestHandleProviderReadinessReturnsNotInstalledWhenBinaryMissing` in `internal/api`, `TestControllerReloadsNamedSessionModeAndAppliesIdleTimeout` in `cmd/gc` — neither is in this PR's code paths) ## Upgrade note Live sessions persisted a `started_config_hash` computed with the old (unstable) `CopyFiles` hashing. On the first reconciler tick after upgrade, those hashes will not match the post-fix `CoreFingerprint`, triggering one legit drift drain per live session. This is a one-time cost that was already happening continuously pre-fix; the thrash loop stops after the single replacement cycle. --------- Co-authored-by: sjarmak <sjarmak@users.noreply.github.com>
1 parent c792dde commit 391222d

6 files changed

Lines changed: 185 additions & 6 deletions

File tree

cmd/gc/cmd_start.go

Lines changed: 5 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -734,6 +734,9 @@ func stageHookFiles(copyFiles []runtime.CopyEntry, cityPath, workDir string) []r
734734

735735
// workDir-based hooks: gemini, codex, opencode, copilot, pi, omp.
736736
// Use path.Join for RelDst (container-target, always forward slashes).
737+
// These paths live inside the agent worktree and are populated by
738+
// pre_start staging (e.g. worktree-setup.sh --sync), so hashing their
739+
// contents would drift across reconciler cycles. See issue #682.
737740
for _, rel := range []string{
738741
path.Join(".gemini", "settings.json"),
739742
path.Join(".codex", "hooks.json"),
@@ -747,7 +750,7 @@ func stageHookFiles(copyFiles []runtime.CopyEntry, cityPath, workDir string) []r
747750
if _, err := os.Stat(abs); err == nil {
748751
copyFiles = append(copyFiles, runtime.CopyEntry{
749752
Src: abs, RelDst: path.Join(relWorkDir, rel),
750-
Probed: true, ContentHash: runtime.HashPathContent(abs),
753+
Probed: true, SkipFingerprint: true,
751754
})
752755
}
753756
}
@@ -756,7 +759,7 @@ func stageHookFiles(copyFiles []runtime.CopyEntry, cityPath, workDir string) []r
756759
if info, err := os.Stat(skillsDir); err == nil && info.IsDir() {
757760
copyFiles = append(copyFiles, runtime.CopyEntry{
758761
Src: skillsDir, RelDst: path.Join(relWorkDir, ".claude", "skills"),
759-
Probed: true, ContentHash: runtime.HashPathContent(skillsDir),
762+
Probed: true, SkipFingerprint: true,
760763
})
761764
}
762765
// cityDir-based hooks: claude (.gc/settings.json).

cmd/gc/cmd_start_test.go

Lines changed: 13 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -284,13 +284,15 @@ func TestStageHookFilesIncludesCodexAndCopilotExecutableHooks(t *testing.T) {
284284
t.Errorf("stageHookFiles() missing %q (got %v)", want, rels)
285285
}
286286
}
287-
// All filesystem-probed entries must be marked Probed with a ContentHash.
288287
for _, entry := range got {
289288
if !entry.Probed {
290289
t.Errorf("stageHookFiles() entry %q not marked Probed", entry.RelDst)
291290
}
292-
if entry.ContentHash == "" {
293-
t.Errorf("stageHookFiles() entry %q has empty ContentHash", entry.RelDst)
291+
if !entry.SkipFingerprint {
292+
t.Errorf("stageHookFiles() workDir entry %q must have SkipFingerprint=true (#682)", entry.RelDst)
293+
}
294+
if entry.ContentHash != "" {
295+
t.Errorf("stageHookFiles() workDir entry %q should skip HashPathContent, got %q (#682)", entry.RelDst, entry.ContentHash)
294296
}
295297
}
296298
}
@@ -319,6 +321,10 @@ func TestStageHookFilesIncludesCanonicalClaudeHook(t *testing.T) {
319321
if entry.ContentHash == "" {
320322
t.Fatal("stageHookFiles() .gc/settings.json has empty ContentHash")
321323
}
324+
// City-level settings MUST still drive config drift — never skip.
325+
if entry.SkipFingerprint {
326+
t.Fatal("stageHookFiles() .gc/settings.json must not have SkipFingerprint (cityDir fallback must drive drain on real edits)")
327+
}
322328
return
323329
}
324330
}
@@ -348,6 +354,10 @@ func TestStageHookFilesFallsBackToLegacyClaudeHook(t *testing.T) {
348354
if entry.ContentHash == "" {
349355
t.Fatal("stageHookFiles() hooks/claude.json has empty ContentHash")
350356
}
357+
// City-level legacy fallback must still drive drain.
358+
if entry.SkipFingerprint {
359+
t.Fatal("stageHookFiles() hooks/claude.json must not have SkipFingerprint (cityDir fallback must drive drain)")
360+
}
351361
return
352362
}
353363
}

cmd/gc/config_hash.go

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -102,8 +102,15 @@ func canonicalConfigHash(params TemplateParams, overlay map[string]string) strin
102102
h.Write([]byte(params.Hints.OverlayDir)) //nolint:errcheck
103103
h.Write([]byte{0}) //nolint:errcheck
104104

105-
// CopyFiles.
105+
// CopyFiles. Mirrors runtime.hashCoreFields only in excluding probed
106+
// workDir entries marked SkipFingerprint, so this hash stays stable if
107+
// that dormant path is ever wired into drift detection (#682). Unlike
108+
// runtime hashing, this canonical hash still fingerprints CopyFiles by
109+
// Src and RelDst here; it does not use ContentHash or a sentinel.
106110
for _, cf := range params.Hints.CopyFiles {
111+
if cf.Probed && cf.SkipFingerprint {
112+
continue
113+
}
107114
h.Write([]byte(cf.Src)) //nolint:errcheck
108115
h.Write([]byte{0}) //nolint:errcheck
109116
h.Write([]byte(cf.RelDst)) //nolint:errcheck

internal/runtime/fingerprint.go

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -164,7 +164,13 @@ func hashCoreFields(h hash.Hash, cfg Config) {
164164
// use Src/RelDst paths. When a probed entry has an empty ContentHash
165165
// (transient I/O error), a stable sentinel is used instead of falling
166166
// back to path-based hashing, which would flip fingerprint modes.
167+
// Probed entries with SkipFingerprint are excluded entirely — see
168+
// issue #682. SkipFingerprint is ignored on config-derived entries
169+
// so real config changes always drive drain.
167170
for _, cf := range cfg.CopyFiles {
171+
if cf.Probed && cf.SkipFingerprint {
172+
continue
173+
}
168174
if cf.Probed {
169175
h.Write([]byte(cf.RelDst)) //nolint:errcheck // hash.Write never errors
170176
h.Write([]byte{0}) //nolint:errcheck // hash.Write never errors
@@ -271,6 +277,9 @@ func CoreFingerprintBreakdown(cfg Config) map[string]string {
271277
}),
272278
"CopyFiles": fieldHash(func(h hash.Hash) {
273279
for _, cf := range cfg.CopyFiles {
280+
if cf.Probed && cf.SkipFingerprint {
281+
continue
282+
}
274283
if cf.Probed {
275284
h.Write([]byte(cf.RelDst))
276285
h.Write([]byte{0})
@@ -335,6 +344,9 @@ func LogCoreFingerprintDrift(w io.Writer, name string, storedBreakdown map[strin
335344
fmt.Fprintf(w, " SessionSetupScript len: %d\n", len(current.SessionSetupScript)) //nolint:errcheck // best-effort diag
336345
case "CopyFiles":
337346
for i, cf := range current.CopyFiles {
347+
if cf.Probed && cf.SkipFingerprint {
348+
continue
349+
}
338350
fmt.Fprintf(w, " CopyFiles[%d]: RelDst=%q ContentHash=%q\n", i, cf.RelDst, cf.ContentHash) //nolint:errcheck // best-effort diag
339351
}
340352
}

internal/runtime/fingerprint_test.go

Lines changed: 137 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@ import (
44
"bytes"
55
"os"
66
"path/filepath"
7+
"slices"
78
"testing"
89
)
910

@@ -450,3 +451,139 @@ func TestLogCoreFingerprintDriftCopyFiles(t *testing.T) {
450451
t.Errorf("expected RelDst detail in CopyFiles drift output, got: %s", out)
451452
}
452453
}
454+
455+
// TestSkipFingerprintExcludesFromCoreHash locks in the fix for issue #682:
456+
// CopyEntry values marked SkipFingerprint must not influence CoreFingerprint
457+
// regardless of their ContentHash, Src, or RelDst values.
458+
func TestSkipFingerprintExcludesFromCoreHash(t *testing.T) {
459+
base := Config{Command: "claude"}
460+
skipA := Config{Command: "claude", CopyFiles: []CopyEntry{
461+
{
462+
Src: "/tmp/worktree/.claude/skills", RelDst: ".claude/skills",
463+
Probed: true, ContentHash: "aaaa", SkipFingerprint: true,
464+
},
465+
}}
466+
skipB := Config{Command: "claude", CopyFiles: []CopyEntry{
467+
{
468+
Src: "/tmp/worktree/.claude/skills", RelDst: ".claude/skills",
469+
Probed: true, ContentHash: "bbbb", SkipFingerprint: true,
470+
},
471+
}}
472+
skipEmpty := Config{Command: "claude", CopyFiles: []CopyEntry{
473+
{
474+
Src: "/tmp/worktree/.claude/skills", RelDst: ".claude/skills",
475+
Probed: true, ContentHash: "", SkipFingerprint: true,
476+
},
477+
}}
478+
479+
baseH := CoreFingerprint(base)
480+
if got := CoreFingerprint(skipA); got != baseH {
481+
t.Errorf("skipped entry changed fingerprint: base=%s got=%s", baseH, got)
482+
}
483+
if got := CoreFingerprint(skipB); got != baseH {
484+
t.Errorf("skipped entry with different ContentHash changed fingerprint: base=%s got=%s", baseH, got)
485+
}
486+
if got := CoreFingerprint(skipEmpty); got != baseH {
487+
t.Errorf("skipped entry with empty ContentHash changed fingerprint: base=%s got=%s", baseH, got)
488+
}
489+
// Breakdown must also report the same CopyFiles field hash as the base.
490+
baseBD := CoreFingerprintBreakdown(base)
491+
skipABD := CoreFingerprintBreakdown(skipA)
492+
if baseBD["CopyFiles"] != skipABD["CopyFiles"] {
493+
t.Errorf("skipped entry changed breakdown: base=%s got=%s", baseBD["CopyFiles"], skipABD["CopyFiles"])
494+
}
495+
}
496+
497+
// TestSkipFingerprintIgnoredOnConfigDerivedEntries ensures the doc contract
498+
// is enforced: SkipFingerprint is only honored on probed entries. A
499+
// config-derived entry (Probed=false) with SkipFingerprint=true must still
500+
// contribute to CoreFingerprint so real user edits drive drain.
501+
func TestSkipFingerprintIgnoredOnConfigDerivedEntries(t *testing.T) {
502+
base := Config{Command: "claude", CopyFiles: []CopyEntry{
503+
{Src: "/user/config.json", RelDst: "config.json"},
504+
}}
505+
withSkip := Config{Command: "claude", CopyFiles: []CopyEntry{
506+
{Src: "/user/config.json", RelDst: "config.json", SkipFingerprint: true},
507+
}}
508+
if CoreFingerprint(base) != CoreFingerprint(withSkip) {
509+
t.Error("SkipFingerprint must be ignored on config-derived (Probed=false) entries")
510+
}
511+
// Changing the Src on a config-derived entry must still drive drift,
512+
// even if SkipFingerprint is set.
513+
edited := Config{Command: "claude", CopyFiles: []CopyEntry{
514+
{Src: "/user/config-edited.json", RelDst: "config.json", SkipFingerprint: true},
515+
}}
516+
if CoreFingerprint(withSkip) == CoreFingerprint(edited) {
517+
t.Error("config-derived Src change must drive drift even with SkipFingerprint=true")
518+
}
519+
}
520+
521+
// TestSkipFingerprintStableUnderFilesystemChurn is the regression guard for
522+
// issue #682: a probed entry whose underlying directory is populated between
523+
// probes (simulating pre_start staging) must produce a stable CoreFingerprint
524+
// when marked SkipFingerprint, and a drifting one otherwise.
525+
func TestSkipFingerprintStableUnderFilesystemChurn(t *testing.T) {
526+
workDir := t.TempDir()
527+
skillsDir := filepath.Join(workDir, ".claude", "skills")
528+
if err := os.MkdirAll(skillsDir, 0o755); err != nil {
529+
t.Fatalf("MkdirAll: %v", err)
530+
}
531+
// Phase 1: empty skills dir — resembles template-resolve BEFORE pre_start
532+
// has finished populating the worktree.
533+
before := Config{Command: "claude", CopyFiles: []CopyEntry{{
534+
Src: skillsDir, RelDst: ".claude/skills",
535+
Probed: true, ContentHash: HashPathContent(skillsDir),
536+
}}}
537+
// Phase 2: pre_start completes and drops a file in the worktree.
538+
if err := os.WriteFile(filepath.Join(skillsDir, "new.md"), []byte("content"), 0o644); err != nil {
539+
t.Fatalf("WriteFile: %v", err)
540+
}
541+
after := Config{Command: "claude", CopyFiles: []CopyEntry{{
542+
Src: skillsDir, RelDst: ".claude/skills",
543+
Probed: true, ContentHash: HashPathContent(skillsDir),
544+
}}}
545+
546+
// Without SkipFingerprint the hash MUST drift (this is the bug).
547+
beforeH := CoreFingerprint(before)
548+
afterH := CoreFingerprint(after)
549+
if beforeH == afterH {
550+
t.Fatal("test precondition: HashPathContent must observe filesystem change")
551+
}
552+
553+
// With SkipFingerprint the same mutation must produce a stable hash.
554+
beforeSkip := before
555+
beforeSkip.CopyFiles = slices.Clone(before.CopyFiles)
556+
beforeSkip.CopyFiles[0].SkipFingerprint = true
557+
afterSkip := after
558+
afterSkip.CopyFiles = slices.Clone(after.CopyFiles)
559+
afterSkip.CopyFiles[0].SkipFingerprint = true
560+
561+
if got1, got2 := CoreFingerprint(beforeSkip), CoreFingerprint(afterSkip); got1 != got2 {
562+
t.Errorf("SkipFingerprint did not stabilize CopyFiles hash under churn: before=%s after=%s", got1, got2)
563+
}
564+
}
565+
566+
// TestLogCoreFingerprintDriftSkipsExcludedEntries ensures diagnostic output
567+
// does not leak skipped entries, which would otherwise confuse operators
568+
// debugging real drift.
569+
func TestLogCoreFingerprintDriftSkipsExcludedEntries(t *testing.T) {
570+
stored := map[string]string{
571+
"CopyFiles": "oldhash",
572+
}
573+
current := Config{
574+
Command: "claude",
575+
CopyFiles: []CopyEntry{
576+
{RelDst: "real", ContentHash: "h1"},
577+
{RelDst: "skipped-churn", Probed: true, ContentHash: "h2", SkipFingerprint: true},
578+
},
579+
}
580+
var buf bytes.Buffer
581+
LogCoreFingerprintDrift(&buf, "agent", stored, current)
582+
out := buf.String()
583+
if !bytes.Contains([]byte(out), []byte("real")) {
584+
t.Errorf("expected non-skipped RelDst in drift output, got: %s", out)
585+
}
586+
if bytes.Contains([]byte(out), []byte("skipped-churn")) {
587+
t.Errorf("skipped entry leaked into drift output: %s", out)
588+
}
589+
}

internal/runtime/runtime.go

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -230,6 +230,16 @@ type CopyEntry struct {
230230
// (transient I/O error), the fingerprint uses a stable sentinel rather
231231
// than falling back to path-based hashing.
232232
ContentHash string
233+
// SkipFingerprint excludes this entry from CoreFingerprint entirely.
234+
// Set for probed entries whose contents are produced by pre_start
235+
// staging (e.g. files inside the agent worktree populated by
236+
// worktree-setup.sh). Fingerprinting such entries would conflate
237+
// "config changed" with "pre_start not done yet" and cause false
238+
// config-drift drains. See issue #682. The entry is still staged to
239+
// K8s pods and retained in CopyFiles for container providers — the
240+
// entry is excluded from identity hashing. Only meaningful when
241+
// Probed is true; config-derived entries must still drive drain.
242+
SkipFingerprint bool
233243
}
234244

235245
// HashPathContent returns a hex-encoded SHA-256 of the content at path.

0 commit comments

Comments
 (0)