[COR-345] auth-go refresh tier e2e: ctx-cancel + cross-process rotation race #13
Open
khaong wants to merge 3 commits into
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three subprocess tests against the testoauth mock server, covering production paths the existing harness doesn't exercise:

- TestE2ESub_CtxCancelReleasesLocks: Helper A's refresh ctx times out mid-grant (server stalled). Helper B then refreshes quickly with the same LockDir, proving both refreshMu and proclock were released on the cancellation path. New helper mode refresh-with-timeout.
- TestE2ESub_NonCooperatingRaceStrictRevokes: two helpers with distinct LockDirs share a fileStore. With IdempotencySuccessor: 0, the loser's first grant gets invalid_grant + family revocation; its doRefresh retry with cur.RefreshToken also fails (family revoked); surfaces ErrReauthRequired. Validates that the retry plumbing fires against real server semantics, even though the final outcome is reauth.
- TestE2ESub_NonCooperatingRaceLaxAbsorbs: same setup with a 5s idempotency window. The loser's grant gets the already-issued successor idempotently: no retry fires, no revocation. Validates that Manager and the idempotency window mesh cleanly.

Race ordering is driven by srv.StallNextRefresh plus a small polling helper (waitForGrantHandlerEntry): the parent waits for helper A's request to enter the handler before spawning B, so family-state mutations land in the test's expected order.

No production code changes; pure test additions in tokenmanager/e2e_subprocess_test.go.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ening
- StallNextRefresh now returns (release, stalled<-chan struct{}). The
stalled channel closes when the next refresh request enters the stall
wait, eliminating the polling-plus-30ms-sleep race-ordering heuristic
the race tests used. Both NonCooperatingRace* tests switch to <-stalled
for deterministic synchronisation on 'A is blocked at the server'.
- helperRefreshWithTimeout OK-criterion narrowed: only context
sentinels via errors.Is plus 'context deadline'/'context canceled'
substring matches. Drops the loose strings.Contains('context')
fallback that could mask unrelated errors.
- Race tests + CtxCancel test now use the existing decodeHelperOutput
helper consistently (instead of inline json.Decode), matching the
pattern of older tests in the file.
- CtxCancelReleasesLocks elapsed thresholds tightened 5000ms -> 2000ms
on both A and B; still 10x margin over expected duration, but
catches a partial-orphan that held the lock for several seconds.
- Test 2a (NonCooperatingRaceStrictRevokes) comment clarifies that
GrantCount==3 + RefreshGrantCount==1 proves the retry fires but does
not externally distinguish 'retry with cur RT' from 'retry with
original RT' — that distinction is covered by the existing unit test
TestDoRefresh_RotationRaceRetriesWithNewRT.
- waitForGrantHandlerEntry removed (lint: unused after signal-channel
replaces polling in both call sites).
- Spec doc synced with the implementation thresholds + signal-channel
pattern.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
bugbot run
✅ Bugbot reviewed your changes and found no new issues!
Reviewed by Cursor Bugbot for commit b7d768a.
Summary
Stacked on #12 (COR-343). Adds the two highest-value e2e scenarios still missing after that harness lands. Three new subprocess tests in tokenmanager/e2e_subprocess_test.go, plus one small additive extension to internal/testoauth.Server. No production code changes.

What lands
- TestE2ESub_CtxCancelReleasesLocks: A CLI Ctrl+C'd mid-refresh is the most common real-world scenario the harness doesn't touch. Helper A's 150ms context.WithTimeout fires while the server is stalled; helper B then refreshes against the same LockDir and must succeed in < 2s. Proves both refreshMu and the cross-process proclock were released on the ctx-cancel path. New helper mode refresh-with-timeout.
- TestE2ESub_NonCooperatingRaceStrictRevokes: Two helpers with distinct LockDirs but a shared fileStore (non-cooperating actors) race the same family. With IdempotencySuccessor: 0, the loser's first grant gets invalid_grant + family revocation; doRefresh's retry with cur.RefreshToken also fails; surfaces ErrReauthRequired. Server sees GrantCount==3, RefreshGrantCount==1, FamilyRevoked==true. Validates that the retry plumbing fires end-to-end against real RFC 9700 server semantics, where COR-343's harness could only exercise it with fakes.
- TestE2ESub_NonCooperatingRaceLaxAbsorbs: Same race with IdempotencySuccessor: 5s. The loser's grant returns the already-issued successor idempotently: no retry, no revocation, both succeed. Asserts GrantCount==2, RefreshGrantCount==2, family not revoked. Validates that Manager and the server idempotency window mesh cleanly.

Extension to internal/testoauth.Server (additive, no behaviour change)
StallNextRefresh() now returns (release func(), stalled <-chan struct{}). The stalled channel closes when the next refresh enters the stall wait, replacing the previous polling-plus-30ms-sleep race-ordering pattern with deterministic synchronisation. Existing self-test adapted to the new return.

Not testable end-to-end (and why)
doRefresh's "first grant fails, retry SUCCEEDS with new RT" branch isn't reachable against a well-behaved RFC 9700 server: the family state is shared between an RT and its successor, so either the first grant succeeds idempotently (within the window) or the first grant + retry both fail (outside the window, family revoked). The branch is defensive code for a server that decouples per-RT invalidation from family state, which RFC 9700 explicitly rules out. The existing unit test TestDoRefresh_RotationRaceRetriesWithNewRT exercises it with a fake refresh fn; that remains the only place. The race-mechanics-observability limit is explicitly documented in Test 2a.

Process
Same TDD subagent flow. Multi-agent PR review caught real items pre-push, all closed in b7d768a (squashed into the test commit's intent):
- Replaced the polling-plus-30ms-sleep race ordering (the latent flake vector both reviewers converged on) with the stalled channel.
- Narrowed helperRefreshWithTimeout's OK-criterion to specific known substrings instead of a loose strings.Contains("context") that could mask unrelated errors.
- Switched the new tests to the existing decodeHelperOutput helper (consistent with older tests).

Verification
- gofmt -l . clean, golangci-lint run ./... 0 issues, govulncheck clean
- go test ./... -race -count=10: all packages green
- go test ./tokenmanager/ -race -count=50 -run 'TestE2ESub_(CtxCancel|NonCooperating)': 50/50 clean (~2.1s per iteration)
- GOOS=windows go build ./... clean

Branch ordering
Stacked on alex/cor-343-.... Base is set to that branch on this PR so the diff GitHub shows is the cor-345-only delta. When #12 merges, this rebases onto main automatically.

🤖 Generated with Claude Code
Note
Low Risk
Changes are limited to test harness, mock OAuth server, and documentation; production auth paths are unchanged.
Overview
Adds three subprocess e2e tests in tokenmanager/e2e_subprocess_test.go for refresh-tier gaps after COR-343: context cancel mid-refresh (locks released so a follow-up refresh on the same LockDir finishes in <2s), and two-process rotation races with separate LockDirs but a shared file store: strict idempotency (ErrReauthRequired, family revoked, GrantCount shows retry plumbing) vs lax window (both succeed, no revocation).

Introduces helper mode refresh-with-timeout (AUTHGO_E2E_CTX_TIMEOUT_MS) and extends testoauth.Server.StallNextRefresh to return a stalled channel for deterministic "helper A is blocked" sync (replacing sleep/polling for race tests). Existing stall call sites use release, _ := …. Adds an approved design spec doc; no production tokenmanager code changes.

Reviewed by Cursor Bugbot for commit b7d768a.