[wrangler] Add WebSocket auth relay for OAuth login in remote environments#13130
Closed
petebacondarwin wants to merge 27 commits into
Closed
[wrangler] Add WebSocket auth relay for OAuth login in remote environments#13130petebacondarwin wants to merge 27 commits into
petebacondarwin wants to merge 27 commits into
Conversation
🦋 Changeset detectedLatest commit: bbe5a23 The changes in this PR will be included in the next version bump. This PR includes changesets to release 3 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
This comment was marked as resolved.
This comment was marked as resolved.
c4c371b to
48f43e3
Compare
f938617 to
9591c70
Compare
24443f2 to
ab4de06
Compare
Contributor
|
✅ All changesets look good |
create-cloudflare
@cloudflare/kv-asset-handler
miniflare
@cloudflare/pages-shared
@cloudflare/unenv-preset
@cloudflare/vite-plugin
@cloudflare/vitest-pool-workers
@cloudflare/workers-editor-shared
@cloudflare/workers-utils
wrangler
commit: |
460dc74 to
171202b
Compare
Contributor
|
Codeowners approval required for this PR:
Show detailed file reviewers |
Contributor
Author
|
Marking as blocked until we get security review approval. |
NuroDev
reviewed
Apr 28, 2026
This was referenced Apr 28, 2026
b6c6fbb to
3ccdfcf
Compare
- Add changeset for the new `wrangler login --x-websocket-callback` flag. - Switch DO tests to standard `SELF.fetch` from `cloudflare:test` and replace ad-hoc `setTimeout` waits with deterministic message Promises. - Make the auth worker honour the DO's response: callbacks for sessions with no connected WebSocket now redirect the browser to the consent-denied page instead of consent-granted, and the DO clears its cleanup alarm once the callback is delivered. - Harden `getOauthTokenViaWebSocket` in Wrangler: wrap the WebSocket in try/finally to guarantee cleanup, clear the 120s timeout on success, surface the underlying `ErrorEvent.message` instead of "[object Event]", and attach the message listener before opening the browser to avoid a theoretical race. - Add Wrangler-side tests for the new flow via a `MockWebSocket` stub installed through the global `undici` mock and a `mockOAuthRelayCallback` helper alongside the existing `mockOAuthServerCallback`.
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Fix unhandled promise rejection in `getOauthTokenViaWebSocket` (Devin BUG-0001): hoist `cleanup` and `loginTimeoutHandle` so the outer `finally` removes WebSocket listeners before `ws.close()` on every exit path — successful return, user-denied, parse error, `openInBrowser` throw, and the 120s timeout. Without this, any of those paths could leave `onClose` attached, which then rejects an orphan `messagePromise` when `ws.close()` fires the close event. - Complete the `--experimental-websocket-callback` rename (Devin BUG-0002): update `LoginProps` and the conditional in `login()` to use the renamed property, plus a regression test that the `--x-websocket-callback` alias still works. - Make `MockWebSocket.close()` dispatch the `close` event in a microtask (Devin ANALYSIS-0003) so the listener-cleanup bug is observable in tests. Add a regression test that asserts no orphan rejection escapes when `openInBrowser` throws after listeners have been attached — verified to fail when the cleanup is removed. - Add a README for the `wrangler-auth-worker` package.
Renames `@cloudflare/wrangler-auth-worker` to `@cloudflare/cf-auth-worker` and updates the worker name (`wrangler-auth-worker` -> `cf-auth-worker`), along with all references in the README, test names, and deployment validation snapshots.
When `wrangler login --experimental-websocket-callback` (or its `--x-websocket-callback` alias) can't reach the relay before the browser is opened — connect timeout, connect error, or premature close — Wrangler now logs a warning and transparently falls back to the existing local callback server flow. This handles transient relay outages and gracefully degrades for users on a regular laptop where localhost already works. The behaviour is controlled by a single env var, `WRANGLER_AUTH_WORKER_TIMEOUT` (milliseconds, default 5000): - Positive value: connect timeout of N ms; fall back on pre-open failure - 0: no connect timeout AND no fallback — useful in container/remote environments where you want relay-only behaviour with a clear failure if the relay is unreachable. Pre-open failures throw a new `RelayUnavailableError` (subclass of `UserError`) so `login()` can distinguish them from post-open errors. Failures *after* the browser has been opened keep the existing `UserError` path and never fall back, since the user has already committed to the relay's redirect_uri. Also adds an `autoOpen` flag to `MockWebSocket` so tests can simulate hung or failed connections, and lifts `mockOAuthRelayCallback` out of the inner describe so the fallback tests can drive both the relay and the localhost flow from the same `mockOAuthFlow()` instance.
- Use the newer `env.AUTH_SESSION.getByName(state)` instead of the two-step `idFromName(state)` + `get(id)` in both worker routes. - `handleCallback` now returns 404 (not 409) when no WebSocket is connected: there's no active session for that state, so 'Not Found' fits the semantics better. The other 409 (a second WebSocket connecting to a session that already has one) stays — that case is a genuine resource-state conflict. Adds a direct DO-level test that asserts the new 404. - Make alarm operations consistent: `setAlarm` is now awaited (matching `deleteAlarm` in `handleCallback`) so the alarm is durably written before the upgrade response is returned. `handleWebSocketUpgrade` is consequently now `async`. - Bump cf-auth-worker compat date 2025-03-01 -> 2026-04-01. - Add a test for /session/:state requests with a non-websocket Upgrade header (e.g. `Upgrade: h2c`) — also returns 426. - Align cf-auth-worker tests tsconfig on `@cloudflare/workers-types/experimental` to match the main tsconfig (verified `type:tests` still passes). - Drop unused `as const` from the experimental-websocket-callback flag's `type` field.
The previous undici.WebSocket-based flow called ws.close() after receiving the auth code, which only sends a CLOSE frame and waits for the peer to FIN the underlying TCP connection. The Cloudflare edge keeps that connection open for keep-alive reuse, so the TLS socket lingered for ~9 seconds and held the Node event loop open, making wrangler appear to hang after 'Successfully logged in.'. Switch to the ws package (already a wrangler dep, used in ProxyController) and call ws.terminate() instead of ws.close() in the finally block of getOauthTokenViaWebSocket. terminate() force-destroys the underlying socket, which is fine here because the auth relay has already done its job. Process now exits within a few milliseconds of the success log instead of ~9.3 seconds. Verified with process.getActiveResourcesInfo() before/after.
Picks up the web_socket_auto_reply_to_close default (active from 2026-04-07), so the runtime auto-acks client Close frames on the AuthSession DO.
Adds two Sparrow metrics events and a Sentry capture so we can keep an eye on how the experimental WebSocket auth relay is performing in the wild while it is still hidden behind --experimental-websocket-callback: - 'login user (relay attempt)' is emitted right after the WebSocket is constructed, with the auth worker URL. - 'login user (relay fallback)' is emitted from login()'s catch branch before falling back to the local callback server, with the auth worker URL and the relay failure detail. - Sentry.captureException is called from the same catch branch, tagged feature=wrangler-login-relay at level=warning so the team gets notified when the relay is unavailable. Reports queue in memory and only send once the user opts in, matching the existing wrangler error-capture behaviour. Also makes the runAndFailWebSocket test helper poll MockWebSocket.last instead of waiting a fixed number of setImmediate ticks. The sync work added by sendMetricsEvent (reading the metrics config file) was occasionally pushing the WebSocket constructor past the previous 2-tick window, causing intermittent 'no MockWebSocket was constructed' failures. Polling is more resilient to future timing changes.
The connect promise in getOauthTokenViaWebSocket can reject via timeout, error event, or close event. Previously those rejections returned from the function before reaching the outer try/finally, leaking the WebSocket in CONNECTING state. Hoist the cleanup/loginTimeoutHandle declarations and try block above the connect promise so a single finally covers every exit path: connect failure, post-open errors, openInBrowser throws, parse errors, the 120s post-open timeout, and successful return.
The global vi.mock("ws", ...) added in vitest.setup.ts for the auth
relay tests was intercepting WebSocket from "ws" in every test file,
including LocalRuntimeController.test.ts which opens a real WebSocket
to the local inspector. With the mock in place, events.once(ws, "open")
hangs forever because MockWebSocket only implements addEventListener,
not the Node EventEmitter API.
Move the mock and the per-test MockWebSocket.reset() into user.test.ts
so it only applies where it's needed. The dynamic import inside the
factory avoids the TDZ ReferenceError caused by vitest hoisting vi.mock
above the static MockWebSocket import.
Number("") and Number(" ") both coerce to 0, which would silently
activate the special "no timeout, no fallback" semantics rather than
the documented 5000ms default. A user who set
WRANGLER_AUTH_WORKER_TIMEOUT= to clear the variable, or had it set to
empty via a misconfigured .env, would wait indefinitely for an
unreachable relay with no fallback to the local callback server.
Add an explicit `raw.trim() === ""` guard before `Number(raw)` so
empty/whitespace values fall through to the documented invalid-value
default of 5000ms (with fallback enabled).
Drop the explicit `durable_objects.bindings` config from cf-auth-worker's wrangler.jsonc and access the DO via `ctx.exports.AuthSession` instead of `env.AUTH_SESSION`. The `enable_ctx_exports` compatibility flag is on by default from 2025-11-17 and our compat date is 2026-04-28, so no flag entry is needed. The migration is still required to configure SQLite storage for the class. In tests, switch from `env.AUTH_SESSION` to `exports.AuthSession` imported from `cloudflare:workers` (supported by vitest-pool-workers since Dec 2025). Regenerated worker-configuration.d.ts via `wrangler types --no-include-runtime` — `Env` is now empty and `durableNamespaces` still lists "AuthSession", which is what types `ctx.exports.AuthSession`.
Replace string-replace + concat patterns with parsed URL constructions
so trailing slashes on `WRANGLER_AUTH_WORKER_URL` (and uppercase
schemes) don't produce malformed URLs.
- wsUrl: parse authWorkerUrl, mutate `protocol` (https:->wss: or
http:->ws:) and `pathname`, hand the URL to `new WebSocket()`.
- remoteCallbackUrl: `new URL("/callback", authWorkerUrl).toString()`.
The variable holds an origin (scheme + host, no path) — `origin` is the WHATWG URL term for that, and the rename makes it explicit that a path is not supported. The rename flows through: - env var, factory `VariableNames` union, JSDoc - `getAuthWorkerUrlFromEnv` -> `getAuthWorkerOriginFromEnv` - local `authWorkerUrl` -> `authWorkerOrigin` in `user.ts` - log/Sentry/Sparrow event property names (feature is experimental, no downstream consumer locked in yet) - test assertions, README example Default value is unchanged: `https://auth.devprod.cloudflare.dev`.
The integration test added in 292fbf4 ("treats empty WRANGLER_AUTH_WORKER_TIMEOUT as the default") passed locally but failed intermittently in CI on Linux with "no MockWebSocket was constructed". The 100-iteration setImmediate poll in waitForMockWebSocket appears to race with the inner async chain (PKCE codes + login plumbing) under CI load. Replace it with a focused unit test of getAuthWorkerTimeoutMs() that exercises empty/whitespace/non-numeric/negative values directly. The unit test is faster, doesn't depend on the full login flow, and still catches the original bug — verified by reverting the fix locally: the empty and whitespace cases fail without the trim() guard.
The previous implementation polled MockWebSocket.last via 50 fast setImmediate iterations. On a fast machine that's well under 50ms total, but on a loaded CI box the inner async chain (most notably `crypto.subtle.digest` inside generatePKCECodes, which crosses a libuv worker thread) can take longer than that to reach the `new WebSocket(...)` site, causing the helper to throw before the mock is constructed and the dependent test to fail with mismatched assertions. Switch to a wall-clock-based poll: 5s budget, setTimeout(10) per iteration. This decouples the polling rate from event-loop speed and gives PKCE plenty of room without slowing down the local fast path.
Removes runtime overrides for the relay origin and connect timeout
(`WRANGLER_AUTH_WORKER_ORIGIN`, `WRANGLER_AUTH_WORKER_TIMEOUT`) — both
are now build-time constants so production deployments cannot redirect
the relay to a hostile URL or disable the connect timer.
Worker / DO:
- 405 on non-GET, strict 32–128 char state regex on both routes,
reject upgrades carrying `Origin`, require `Sec-Wrangler-Client`
header (browsers can't set it), allowlist headers forwarded to the
DO, security headers (`Referrer-Policy`, `Cache-Control`,
`X-Content-Type-Options`) on the 307, generic 404 on all errors.
- Session TTL 5 min → 90 s; `delivered` latch rejects replayed
callbacks; DO sends `{code, state}` (or `{error, state}`) so the
client can revalidate; Analytics Engine event on alarm.
- `observability.enabled: false` and `nodejs_compat` removed.
Wrangler client:
- Refuses login when `NODE_TLS_REJECT_UNAUTHORIZED=0`.
- Generates per-session `wsToken`, sends it via `Sec-Wrangler-Client`.
- Caps the WebSocket payload at 4 KiB; strict schema validator on
every relay message (single `code` xor `error`, required `state`,
no unknown keys, prototype-pollution-safe).
- Timing-safe revalidates the echoed `state` before redeeming the code.
- No longer prints the full OAuth URL (which carries `state` and
`code_challenge`) to stdout — only the dash host. Single advisory
line recommends `CLOUDFLARE_API_TOKEN` for headless/CI contexts.
- `writeAuthConfigFile` now writes mode `0o600` and re-`chmod`s on
every save.
- Sentry `beforeSend` scrubs `code=`, `state=`, `code_verifier=`,
`code_challenge=`, `wsToken=`, and JWT-shaped tokens.
Items 33-39 from the security review (the previous commit only covered 1-32). Each item is small enough to fix in-place rather than defer. #33 CORS headers absent - Added `Vary: Origin` to the 307 redirect for cache-key safety. - Inline `// SECURITY: do not set Access-Control-Allow-Origin` guardrail comment over the `SECURITY_HEADERS` block. - Tests assert no `Access-Control-*` headers and that `Vary: Origin` is present on the redirect. #34 DO storage failures unhandled - Wrapped every `storage.put` / `setAlarm` / `deleteAlarm` / `deleteAll` call in the DO with try/catch. Failures are logged via `console.error` and treated as non-fatal: an in-flight session can still dispatch its code even if the alarm/state bookkeeping fails. #35 Case-sensitive `Upgrade` header check - Now `?.toLowerCase() === "websocket"` on both the Worker fetch handler and the DO `handleWebSocketUpgrade`. RFC 6455 §1.3 compliant. - Test exercises `Upgrade: WEBSOCKET` (uppercase). #36 `permessage-deflate` not explicitly disabled - Deferred. Cloudflare's `acceptWebSocket` API doesn't expose a knob for this today; documented as a follow-up. #37 No `Sec-WebSocket-Protocol` requirement - Both sides now negotiate `wrangler-auth-relay-v1`. Worker rejects upgrades whose `Sec-WebSocket-Protocol` doesn't include it (404 to match the existing generic-error pattern). DO echoes the protocol on the 101 response. Wrangler asserts `ws.protocol` matches after open and aborts the login otherwise — defense in depth on top of the existing `Sec-Wrangler-Client` header. - New constant `WRANGLER_RELAY_SUBPROTOCOL` defined in both `protocol.ts` and `auth-relay-constants.ts` (with cross-references) since wrangler can't import from cf-auth-worker. - `MockWebSocket` now records the `protocols` constructor arg and defaults `protocol` to it, so tests can assert wrangler offered the right subprotocol AND simulate a misbehaving relay. #38 `new URL(authWorkerOrigin)` crash - Already obviated — `WRANGLER_AUTH_WORKER_ORIGIN` is a build-time constant, no longer attacker-influenced. - The new `getAuthOriginFromEnv` helper (#39) closes the residual vector for `WRANGLER_AUTH_URL` by wrapping `new URL()` in a graceful `UserError`. #39 String-concat in `generateAuthUrl` - Renamed the `authUrl` parameter to `authOrigin`, moved the `/oauth2/auth` path inside the function body. Makes the leading `?` safe by construction — no caller can smuggle a pre-existing query string. Same emitted URL; no change in behaviour. - New `getAuthOriginFromEnv()` helper in `auth-variables.ts` extracts the origin from `WRANGLER_AUTH_URL` (env var preserved for back-compat) via `new URL().origin`. Throws a graceful `UserError` if the env var is malformed. - Tests for the helper cover default, with-path override, and malformed input.
The cf-auth-worker entry was pinned to an older @cloudflare/workers-types catalog version (4.20260426.1) than the workspace catalog now points at (^4.20260506.1). CI's --frozen-lockfile install correctly flags this as inconsistent.
…ting Add `wrangler login --user=<email>` (alias `-u`) to verify the resulting OAuth identity matches the expected account, and revert the URL suppression mitigation that REVIEW-17452 #11 originally introduced. The original mitigation (suppress the auth URL from stdout) closed the URL-leak vector but did not address the underlying OAuth code-injection class: even with a leaked URL, an attacker authenticating as themselves can race their auth code into the legitimate WebSocket session, causing the victim's wrangler to silently log in as the attacker. URL suppression also broke copy-paste workflows in environments where `openInBrowser` cannot reach a usable browser. After token exchange but before persisting anything to disk, wrangler now calls `GET /user` directly with the freshly issued access token to resolve the account email. Behaviour: * `--user` provided: mechanical email match in any flow; mismatch throws and no token is written. * Localhost without `--user`: print `Logged in as <email>` and proceed (not vulnerable to URL-leak code injection). * Relay without `--user`, interactive: confirm prompt. * Relay without `--user`, non-interactive: hard-fail with an actionable error. * Relay → localhost fallback: localhost rules apply. There is deliberately no `--no-verify-user` opt-out so an LLM-driven prompt-injection attack cannot trick the user into bypassing the check. With the identity check in place the auth URL is printed unconditionally again on both flows so users in environments where the browser cannot auto-open can copy the link manually.
…estart The remote-runtime tail-logs WebSocket (`#activeTail` in `RemoteRuntimeController`) is constructed with `signal: this.#abortController.signal`. The `ws` package wires that signal to the underlying HTTP upgrade request via Node's `addAbortSignal()` helper, so when `onBundleStart` aborts the controller to cancel in-flight preview-session operations, the WebSocket's request is destroyed with `AbortError` and emits an `error` event. Only a `message` listener was attached at construction, so the `error` had no handler and propagated as an unhandled exception — visible in e2e runs as a trailing "Unhandled Errors / Uncaught Exception" in the test report. Attach an `error` listener at WebSocket construction that ignores errors (debug-logged), matching the safeguards already present on the `terminate` paths in `#previewToken` and `teardown()`. This covers the window between WebSocket construction and the next `terminate`, plus any transient network errors during normal operation.
9a44acc to
bbe5a23
Compare
2 tasks
Contributor
Author
|
Closing in favour of implementing the official OAuth Device flow: #14064 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #13117
Fixes #10935
Implements the RFC for a WebSocket-based OAuth callback relay that enables
wrangler loginin environments where localhost isn't accessible from the user's browser (containers, remote VMs, Codespaces, etc.).What this adds
New package:
@cloudflare/cf-auth-workerauth.devprod.cloudflare.devAuthSessionDurable Object using the Hibernation API as a WebSocket relayWrangler changes
--experimental-websocket-callback(alias--x-websocket-callback) onwrangler logingetOauthTokenViaWebSocket()— alternative togetOauthToken()using thewsWebSocket clientcallbackUrlingenerateAuthUrl()(backward-compatible optional param)exchangeAuthCodeForAccessToken()to accept explicit params (backward-compatible, existing callers unchanged)Security hardening (REVIEW-17452)
Following internal security review, the experimental flow has been hardened end-to-end:
https://auth.devprod.cloudflare.dev) and connect timeout (5 s) are build-time constants. Removed theWRANGLER_AUTH_WORKER_ORIGINandWRANGLER_AUTH_WORKER_TIMEOUTenv vars — production deployments cannot redirect the relay to a hostile URL or disable the connect timer.wsTokenbinding (wrangler dev#1,npx wrangler@beta dev index.jsopens an OAuth login screen, errors on auth callback #20). Wrangler generates a per-sessionwsTokenand sends it via a customSec-Wrangler-Clientrequest header on the WebSocket upgrade. Browsers cannot set custom headers onnew WebSocket(...), which doubles as cross-site WebSocket hijacking defence. The DO commitssha256(wsToken)so a stolenstatealone cannot pre-empt the legitimate Wrangler.statein every WebSocket message; Wrangler timing-safe-compares it against the locally generated value before exchanging the code./session/:stateand/callback, reject upgrades carryingOrigin, allowlist headers forwarded to the DO, security headers on the 307 redirect (Referrer-Policy: no-referrer,Cache-Control: no-store, private,X-Content-Type-Options: nosniff), generic 404 on all errors to remove the existence oracle.nodejs_compatremoved andobservability.enabled: falseto suppress URL capture in Workers Logs.deliveredflag rejects replayed/callback. Analytics Engine event on alarm so we can monitor abandonment vs. hijack-attempt timeouts.NODE_TLS_REJECT_UNAUTHORIZED=0is set. Caps WebSocket payloads at 4 KiB. Strict schema validator on relay messages (singlecodexorerror, requiredstate, no unknown keys, prototype-pollution-safe). No longer prints the full OAuth URL (withstate/code_challenge) to stdout — only thedash.cloudflare.comhost. Always emits a single advisory line recommendingCLOUDFLARE_API_TOKENfor headless/CI contexts.~/.wrangler/config/*.tomlis now written with mode0o600and re-chmod-ed on every save.beforeSendredactscode=,state=,code_verifier=,code_challenge=,wsToken=, and JWT-shaped tokens from every string field of an event before transmission.Items deferred to follow-up tickets (per review): WAF rate limiting, RFC 9207
issvalidation, OS keychain integration, R2 audit logging, SPKI cert pinning. None block the experimental rollout.External prerequisites (not in this PR)
https://auth.devprod.cloudflare.dev/callbackmust be added toredirect_urisin the Ory Hydra Terraform config atcloudflare/iam/terraform-internal-oauth-clientson GitLab (project 5889, filetf/production/wrangler.tf)auth.devprod.cloudflare.devis routable on thedevprod.cloudflare.devzone