Skip to content

[COR-345] auth-go refresh tier e2e: ctx-cancel + cross-process rotation race#13

Open
khaong wants to merge 3 commits into
alex/cor-343-auth-go-refresh-tier-e2e-harnessfrom
alex/cor-345-auth-go-refresh-tier-e2e-ctx-cancel-cross-process-rotation
Open

[COR-345] auth-go refresh tier e2e: ctx-cancel + cross-process rotation race#13
khaong wants to merge 3 commits into
alex/cor-343-auth-go-refresh-tier-e2e-harnessfrom
alex/cor-345-auth-go-refresh-tier-e2e-ctx-cancel-cross-process-rotation

Conversation

@khaong

@khaong khaong commented May 28, 2026

Copy link
Copy Markdown
Contributor

Summary

Stacked on #12 (COR-343). Adds the two highest-value e2e scenarios still missing after that harness lands. Three new subprocess tests in tokenmanager/e2e_subprocess_test.go, plus one small additive extension to internal/testoauth.Server. No production code changes.

What lands

  1. TestE2ESub_CtxCancelReleasesLocks — A CLI Ctrl+C'd mid-refresh is the most common real-world scenario the harness doesn't touch. Helper A's 150ms context.WithTimeout fires while the server is stalled; helper B then refreshes against the same LockDir and must succeed in < 2s. Proves both refreshMu and the cross-process proclock were released on the ctx-cancel path. New helper mode refresh-with-timeout.

  2. TestE2ESub_NonCooperatingRaceStrictRevokes — Two helpers with distinct LockDirs but a shared fileStore (non-cooperating actors) race the same family. With IdempotencySuccessor: 0, the loser's first grant gets invalid_grant + family revocation; doRefresh's retry with cur.RefreshToken also fails; surfaces ErrReauthRequired. Server sees GrantCount==3, RefreshGrantCount==1, FamilyRevoked==true. Validates the retry plumbing fires end-to-end against real RFC 9700 server semantics, where COR-343's harness could only exercise it with fakes.

  3. TestE2ESub_NonCooperatingRaceLaxAbsorbs — Same race with IdempotencySuccessor: 5s. The loser's grant returns the already-issued successor idempotently — no retry, no revocation, both succeed. Asserts GrantCount==2, RefreshGrantCount==2, family not revoked. Validates Manager + server idempotency window mesh cleanly.

Extension to internal/testoauth.Server (additive, no behaviour change)

StallNextRefresh() now returns (release func(), stalled <-chan struct{}). The stalled channel closes when the next refresh enters the stall wait — replaces the previous polling-plus-30ms-sleep race-ordering pattern with deterministic synchronisation. Existing self-test adapted to the new return.

Not testable end-to-end (and why)

doRefresh's "first grant fails, retry SUCCEEDS with new RT" branch isn't reachable against a well-behaved RFC 9700 server — the family state is shared between an RT and its successor, so either the first grant succeeds idempotently (within the window) or the first grant + retry both fail (outside the window, family revoked). The branch is defensive code for a server that decouples per-RT invalidation from family state, which RFC 9700 explicitly rules out. The existing unit test TestDoRefresh_RotationRaceRetriesWithNewRT exercises it with a fake refresh fn; that remains the only place. The race-mechanics-observability limit is explicitly documented in Test 2a.

Process

Same TDD subagent flow. Multi-agent PR review caught real items pre-push, all closed in b7d768a (squashed into the test commit's intent):

  • Replaced the polling-plus-30ms-sleep race ordering with the new stalled channel — the latent flake vector both reviewers converged on.
  • Narrowed helperRefreshWithTimeout's OK-criterion to specific known substrings instead of a loose strings.Contains("context") that could mask unrelated errors.
  • Switched the new tests to the existing decodeHelperOutput helper (consistent with older tests).
  • Tightened elapsed thresholds 5000→2000ms.
  • Added a comment documenting Test 2a's observability limits on distinguishing "retry-with-cur-RT" vs "retry-with-original-RT" (both produce the same external state under strict mode; the unit test covers the distinction directly).

Verification

  • gofmt -l . clean, golangci-lint run ./... 0 issues, govulncheck clean
  • go test ./... -race -count=10 — all packages green
  • go test ./tokenmanager/ -race -count=50 -run TestE2ESub_(CtxCancel|NonCooperating)50/50 clean (~2.1s per iteration)
  • GOOS=windows go build ./... clean
  • Bugbot review (post-push)

Branch ordering

Stacked on alex/cor-343-.... Base is set to that branch on this PR so the diff GitHub shows is the cor-345-only delta. When #12 merges, this rebases onto main automatically.

🤖 Generated with Claude Code


Note

Low Risk
Changes are limited to test harness, mock OAuth server, and documentation; production auth paths are unchanged.

Overview
Adds three subprocess e2e tests in tokenmanager/e2e_subprocess_test.go for refresh-tier gaps after COR-343: context cancel mid-refresh (locks released so a follow-up refresh on the same LockDir finishes in <2s), and two-process rotation races with separate LockDirs but a shared file store—strict idempotency (ErrReauthRequired, family revoked, GrantCount shows retry plumbing) vs lax window (both succeed, no revocation).

Introduces helper mode refresh-with-timeout (AUTHGO_E2E_CTX_TIMEOUT_MS) and extends testoauth.Server.StallNextRefresh to return a stalled channel for deterministic “helper A is blocked” sync (replacing sleep/polling for race tests). Existing stall call sites use release, _ := …. Adds an approved design spec doc; no production tokenmanager code changes.

Reviewed by Cursor Bugbot for commit b7d768a. Configure here.

khaong and others added 3 commits May 28, 2026 17:52
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three subprocess tests against the testoauth mock server, covering
production paths the existing harness doesn't exercise:

- TestE2ESub_CtxCancelReleasesLocks: Helper A's refresh ctx times out
  mid-grant (server stalled). Helper B then refreshes quickly with the
  same LockDir, proving both refreshMu and proclock were released on
  the cancellation path. New helper mode refresh-with-timeout.
- TestE2ESub_NonCooperatingRaceStrictRevokes: two helpers with
  distinct LockDirs share a fileStore. With IdempotencySuccessor: 0,
  the loser's first grant gets invalid_grant + family revocation; its
  doRefresh retry with cur.RefreshToken also fails (family revoked);
  surfaces ErrReauthRequired. Validates the retry plumbing fires
  against real server semantics, even though the final outcome is
  reauth.
- TestE2ESub_NonCooperatingRaceLaxAbsorbs: same setup with a 5s
  idempotency window. The loser's grant gets the already-issued
  successor idempotently — no retry fires, no revocation.
  Validates Manager + idempotency window mesh cleanly.

Race-ordering driven by srv.StallNextRefresh + a small polling
helper (waitForGrantHandlerEntry) — the parent waits for helper A's
request to enter the handler before spawning B, so family-state
mutations land in the test's expected order. No production code
changes; pure test additions in tokenmanager/e2e_subprocess_test.go.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ening

- StallNextRefresh now returns (release, stalled<-chan struct{}). The
  stalled channel closes when the next refresh request enters the stall
  wait, eliminating the polling-plus-30ms-sleep race-ordering heuristic
  the race tests used. Both NonCooperatingRace* tests switch to <-stalled
  for deterministic synchronisation on 'A is blocked at the server'.
- helperRefreshWithTimeout OK-criterion narrowed: only context
  sentinels via errors.Is plus 'context deadline'/'context canceled'
  substring matches. Drops the loose strings.Contains('context')
  fallback that could mask unrelated errors.
- Race tests + CtxCancel test now use the existing decodeHelperOutput
  helper consistently (instead of inline json.Decode), matching the
  pattern of older tests in the file.
- CtxCancelReleasesLocks elapsed thresholds tightened 5000ms -> 2000ms
  on both A and B; still 10x margin over expected duration, but
  catches a partial-orphan that held the lock for several seconds.
- Test 2a (NonCooperatingRaceStrictRevokes) comment clarifies that
  GrantCount==3 + RefreshGrantCount==1 proves the retry fires but does
  not externally distinguish 'retry with cur RT' from 'retry with
  original RT' — that distinction is covered by the existing unit test
  TestDoRefresh_RotationRaceRetriesWithNewRT.
- waitForGrantHandlerEntry removed (lint: unused after signal-channel
  replaces polling in both call sites).
- Spec doc synced with the implementation thresholds + signal-channel
  pattern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@khaong

khaong commented May 28, 2026

Copy link
Copy Markdown
Contributor Author

bugbot run

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit b7d768a. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant