fix(toolbar): stop OAuth deauth loop on hedgehog toggle and transient errors#58699
Draft
posthog[bot] wants to merge 1 commit into
Draft
fix(toolbar): stop OAuth deauth loop on hedgehog toggle and transient errors#58699posthog[bot] wants to merge 1 commit into
posthog[bot] wants to merge 1 commit into
Conversation
… errors The toolbar OAuth flow had three converging issues forcing users to re-authenticate dozens of times in production: - TOOLBAR_OAUTH_SCOPES only granted `user:read`, but the hedgehog mode toggle PATCHes `/api/users/@me/hedgehog_config` which requires `user:write`. Every toggle deterministically returned 403. - toolbarFetch treated every 403 as a session-kill via `tokenExpired()`, so a missing scope (or a project-switch) wiped the session and looped the user through the full PKCE OAuth flow. - withTokenRefresh called `tokenExpired()` on any refresh failure, including transient 5xx / 408 / network errors — one bad proxy round-trip cost users their toolbar session. Also surface a typed `InsecureContextError` from `generatePKCE` when `crypto.subtle` is unavailable (HTTP non-secure contexts) so the user sees a clear "HTTPS or localhost required" toast instead of an opaque TypeError reading `.digest` of undefined. Generated-By: PostHog Code Task-Id: 4bb30fa4-e07b-45b2-94d6-6218b57ae099
Contributor
|
🎭 Playwright didn't run on this PR — your changes touch code that could affect E2E behavior, but Playwright is opt-in via label now to keep CI cost down. Add the Most PRs don't need this. Real regressions still get caught on master and fix-forward. |
Contributor
|
Size Change: -40.8 kB (-0.03%) Total Size: 117 MB 📦 View Changed
ℹ️ View Unchanged
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
PostHog Cloud toolbar users were being kicked out of the toolbar repeatedly — telemetry over a 7-day window showed 664,265 'toolbar token expired' events vs only 11,390 successful refreshes (a 58:1 ratio), with at least one paying customer reporting they had to re-authenticate dozens of times before giving up. A separate ticket reported that the toolbar 'flickers back to unauthenticated' the moment the user toggles hedgehog mode.
Three issues converged into one feedback loop:
TOOLBAR_OAUTH_SCOPESwas missinguser:write— the hedgehog mode toggle PATCHes/api/users/@me/hedgehog_config, whichdangerously_get_required_scopesgates onuser:write. Every hedgehog toggle from the toolbar deterministically returned 403.toolbarFetchcalledtokenExpired()on every 403 — so the missing-scope 403 from (1) (or any other permission error, including a benign project switch) nuked the OAuth session and forced the user back through the full PKCE OAuth flow.withTokenRefreshcalledtokenExpired()on every refresh failure — including transient 5xx, 408, and network errors. A single 502 from a proxy hiccup cost the user their toolbar session.A small additional bug surfaced from telemetry:
generatePKCEcallscrypto.subtle.digestunguarded, which throws an opaqueTypeError: Cannot read properties of undefined (reading 'digest')on non-secure HTTP hosts whereSubtleCryptois not exposed.Changes
Backend
user:writetoTOOLBAR_OAUTH_SCOPESso the hedgehog toggle stops 403ing (posthog/settings/web.py).hedgehog_config(posthog/api/test/test_toolbar_oauth_endpoint_auth.py).Frontend
toolbarFetch: stop callingtokenExpired()on 403. A 403 means the token authenticated but lacks scope/permission — re-running OAuth produces the same scopes, so destroying the session can only make things worse. Log and capture telemetry instead (frontend/src/toolbar/toolbarConfigLogic.ts).withTokenRefresh/refreshOAuthTokens: classify refresh failures. Onlyinvalid_grant/invalid_tokenand non-408/non-429 4xx are treated as terminal (clear the session). Transient failures — 5xx, 408, 429, network errors — get one retry with a short backoff and never deauth the user (frontend/src/toolbar/toolbarAuth.ts).generatePKCE: guard against missingcrypto.subtlewith a typedInsecureContextError.confirmAuthenticatenow shows "PostHog toolbar requires HTTPS or localhost" when this specific failure mode hits (frontend/src/toolbar/utils.ts,frontend/src/toolbar/toolbarConfigLogic.ts).How did you test this code?
I am an automated agent — I did not perform manual / browser testing. Automated coverage I added or updated:
frontend/src/toolbar/toolbarConfigLogic.test.ts— added:does not clear session on a 403,keeps session intact when refresh fails with a transient 5xx and retry also fails,keeps session intact when refresh fails with a network error and retry also fails,retries refresh on first 502 and succeeds on second attempt, and renamed the existing terminal-4xx test tocalls tokenExpired when refresh fails with terminal 4xx(usinginvalid_grantpayload).frontend/src/toolbar/utils.test.ts— added ageneratePKCEblock coveringInsecureContextErrorfor missingcrypto.subtleand missingdigest, plus a stubbed-crypto.subtlehappy path.posthog/api/test/test_toolbar_oauth_endpoint_auth.py— addedtest_full_toolbar_token_grants_patch_access(regression for the missing-scope bug) and addeduser:writeto theTestToolbarOAuthScopesConfig.EXPECTED_SCOPESlist.pnpm jest frontend/src/toolbar/utils.test.tsandpnpm jest frontend/src/toolbar/toolbarConfigLogic.test.tsboth pass (19/19 and 142/142). Python ruffcheck/format --checkare clean on the modified files.The pre-existing toolbar test failures (
index.test.ts,actionsTabLogic.test.ts,experimentsTabLogic.test.ts) are unrelated — they fail identically on master because of a missing@posthog/quilldependency.Publish to changelog?
no
🤖 Agent context
Authored end-to-end by an agent acting on a P1 signal report.
crypto.subtle.digestcall). I verified each claim against the current code before changing anything.user:writetoTOOLBAR_OAUTH_SCOPES, vs (b) drop theuser:writerequirement onhedgehog_configPATCH. I chose (a) because it leaves the API contract intact (personal API keys still needuser:writeto update the field) and matches the existing toolbar scope pattern (the toolbar already hasexperiment:write,action:write,survey:write,uploaded_media:write). The narrower option is mentioned in the signal report but trades off security semantics elsewhere — happy to switch on review if preferred.tokenExpired()but skipping it for specific paths (e.g.hedgehog_config). I rejected that as fragile — once the scope set is correct, the remaining 403s are project-switch / permission edge cases where re-auth produces the same scopes and so cannot help. Telemetry preserved via a newtoolbar api forbiddenevent.RefreshErrorwithhttpStatus+oauthErrorand a singleisTerminalRefreshErrorhelper, rather than duck-typing onError.message. Retry is a single attempt with a 400ms backoff — enough to ride out a transient proxy hiccup without compounding latency on a genuinely down auth server.Created with PostHog Code