
vMCP: graceful degradation when one upstream IdP fails in multi-upstream auth #5162

@tgrunnagle

Summary

When a vMCP is fronted by the embedded authorization server with multiple upstream
identity providers configured, the authorization flow is "all or nothing": every
upstream in the chain must succeed before the user receives an authorization code.
If any single upstream is unavailable, returns an error, or the user declines to
consent, the entire authorization fails and no backends are reachable — even
backends that depend solely on upstreams the user already authorized successfully.

There is also no mechanism to retry just the failed upstream; the user has to
restart the whole authorization chain from scratch.

Background

The embedded authorization server walks the configured upstream provider list
sequentially. After each successful callback, continueChainOrComplete calls
nextMissingUpstream to find the next provider without stored tokens for the
session, and redirects the user there. Only when every configured upstream has a
stored token is the authorization code issued to the client.
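
As a rough illustration of the walk described above, the following sketch models the chain with plain maps. The function nextMissing and its signature are assumptions made for this example; they stand in for nextMissingUpstream, whose real signature is not shown in this issue.

```go
package main

import "fmt"

// nextMissing returns the first configured upstream that has no stored token,
// and false when every upstream is satisfied. Hypothetical stand-in for
// nextMissingUpstream; the real function may look quite different.
func nextMissing(configured []string, stored map[string]bool) (string, bool) {
	for _, name := range configured {
		if !stored[name] {
			return name, true
		}
	}
	return "", false
}

func main() {
	configured := []string{"github", "slack", "google"}
	stored := map[string]bool{}

	// Simulate a chain in which every upstream callback succeeds.
	for {
		next, ok := nextMissing(configured, stored)
		if !ok {
			fmt.Println("all upstreams satisfied: issue authorization code")
			return
		}
		fmt.Println("redirect to", next)
		stored[next] = true // a successful callback stores the token
	}
}
```

Because the code is only issued once the loop finds nothing missing, a single leg that never succeeds blocks the whole chain.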

When any leg fails, the failure path deletes every upstream token already
collected for that session (DeleteUpstreamTokens) and the client receives a
generic access_denied error with the hint "upstream authentication failed".

Backend auth strategies that rely on the embedded auth server tokens select a
specific upstream by name:

  • UpstreamInjectConfig.ProviderName
  • TokenExchangeConfig.SubjectProviderName
  • AwsStsConfig.SubjectProviderName

(see pkg/vmcp/auth/types/types.go)

So in principle the system already knows which backends depend on which
upstream — that information just isn't used to make partial-availability
decisions.
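
To make the dependency lookup concrete, here is a minimal sketch of how a backend's required upstream could be derived from its strategy config. The three structs are simplified stand-ins that model only the fields named above, and BackendAuth and requiredUpstream are hypothetical helpers invented for this example; none of this is the actual content of types.go.

```go
package main

import "fmt"

// Simplified stand-ins for the strategy configs; only the provider-name
// fields mentioned in this issue are modeled.
type UpstreamInjectConfig struct{ ProviderName string }
type TokenExchangeConfig struct{ SubjectProviderName string }
type AwsStsConfig struct{ SubjectProviderName string }

// BackendAuth is a hypothetical wrapper holding at most one strategy.
type BackendAuth struct {
	UpstreamInject *UpstreamInjectConfig
	TokenExchange  *TokenExchangeConfig
	AwsSts         *AwsStsConfig
}

// requiredUpstream reports which upstream provider a backend depends on.
// It returns false when the strategy names no upstream.
func requiredUpstream(a BackendAuth) (string, bool) {
	var name string
	switch {
	case a.UpstreamInject != nil:
		name = a.UpstreamInject.ProviderName
	case a.TokenExchange != nil:
		name = a.TokenExchange.SubjectProviderName
	case a.AwsSts != nil:
		name = a.AwsSts.SubjectProviderName
	}
	return name, name != ""
}

func main() {
	// The strategy assigned to each backend here is arbitrary, for illustration.
	backends := map[string]BackendAuth{
		"github-mcp": {UpstreamInject: &UpstreamInjectConfig{ProviderName: "github"}},
		"slack-mcp":  {TokenExchange: &TokenExchangeConfig{SubjectProviderName: "slack"}},
		"gdrive-mcp": {AwsSts: &AwsStsConfig{SubjectProviderName: "google"}},
	}
	for _, name := range []string{"github-mcp", "slack-mcp", "gdrive-mcp"} {
		upstream, _ := requiredUpstream(backends[name])
		fmt.Printf("%s requires %s\n", name, upstream)
	}
}
```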

Problem

Consider a vMCP aggregating three backends:

Backend      Required upstream provider
github-mcp   github
slack-mcp    slack
gdrive-mcp   google

If the Slack IdP is having an outage (or the user simply declines the Slack
consent screen), the user cannot use any backend — including github-mcp and
gdrive-mcp, which have nothing to do with Slack. From the user's
perspective the whole vMCP appears down.

Proposed behavior

Two complementary changes:

  1. Graceful degradation on partial upstream failure. Allow the authorization
    flow to complete with a subset of upstreams satisfied. Backends whose
    required upstream is missing should be filtered out of the vMCP's exposed
    tool/resource set (or fail their individual requests with a clear,
    actionable error pointing the user at which upstream is missing) while
    backends with satisfied upstreams remain fully usable.

  2. Retry mechanism for failed/missing upstreams. Expose a way for the user
    to re-attempt authorization for upstreams that previously failed, without
    forcing them to restart the entire chain. Two reasonable variants worth
    considering:

    • Retry-failed-only: re-run only the upstreams missing tokens, keep
      existing successful tokens.
    • Restart-all: discard all stored upstream tokens and re-run the full
      chain (today's behavior, but as an explicit user choice rather than the
      only option).

    This could surface as an endpoint on the embedded auth server (e.g.
    /auth/retry?provider=<name>), an MCP-side tool, and/or an entry in the
    vMCP UI/CLI.
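
The two changes above could be sketched roughly as follows. Everything here is hypothetical: usableBackends, planRetry, and the RetryMode constants are names invented for this example, and tokens are modeled as a simple presence map rather than the real token store.

```go
package main

import (
	"fmt"
	"sort"
)

// RetryMode selects between the two retry variants (hypothetical type).
type RetryMode int

const (
	RetryFailedOnly RetryMode = iota // keep existing tokens, re-run missing legs
	RestartAll                       // discard all tokens, re-run the full chain
)

// usableBackends returns, sorted by name, the backends whose required
// upstream currently has a token. required maps backend -> upstream.
func usableBackends(required map[string]string, tokens map[string]bool) []string {
	var out []string
	for backend, upstream := range required {
		if tokens[upstream] {
			out = append(out, backend)
		}
	}
	sort.Strings(out)
	return out
}

// planRetry returns the upstreams to (re)authorize, in configured order.
// Under RestartAll it first clears every stored token.
func planRetry(mode RetryMode, configured []string, tokens map[string]bool) []string {
	if mode == RestartAll {
		for name := range tokens {
			delete(tokens, name)
		}
	}
	var out []string
	for _, name := range configured {
		if !tokens[name] {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	configured := []string{"github", "slack", "google"}
	required := map[string]string{
		"github-mcp": "github", "slack-mcp": "slack", "gdrive-mcp": "google",
	}
	tokens := map[string]bool{"github": true, "google": true} // Slack leg failed

	fmt.Println(usableBackends(required, tokens))               // [gdrive-mcp github-mcp]
	fmt.Println(planRetry(RetryFailedOnly, configured, tokens)) // [slack]
	fmt.Println(planRetry(RestartAll, configured, tokens))      // [github slack google]
}
```

Keeping both variants behind one planning function means the chain walker itself would not need to know which variant the user picked, only which upstreams are left to visit.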

Open questions

  • Identity consistency. Today the chain enforces that the resolved subject
    on later legs matches the first leg (callback.go#L391). With partial
    completion, the same constraint must still apply when missing legs are
    eventually completed: a retry must not silently let a different identity
    bind tokens into the same session.
  • Configuration knob. Some operators may want the current all-or-nothing
    semantics (e.g. compliance requires every IdP attestation before any backend
    is reachable). Suggest a per-vMCP / per-auth-server config option, e.g.
    partialUpstreamAuth: allow|require-all (default to require-all to
    preserve current behavior).
  • Required vs. optional upstreams. Should each upstream provider declare
    whether it is required or optional, so the auth server knows when it can
    proceed without a token from a given upstream?
  • Discovery / capability surfacing. When backends are filtered out, the
    vMCP needs to communicate which tools/resources are unavailable and why
    (which upstream is missing) so clients can prompt re-auth.
  • Token refresh failures. Same shape of problem applies when an upstream
    refresh token expires for one provider but not others — should that also
    degrade gracefully via the same mechanism?
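
Two of these questions lend themselves to a small sketch: parsing the suggested partialUpstreamAuth value with require-all as the default, and rejecting a retried leg whose resolved subject differs from the one already bound to the session. The Policy, Session, and Bind names are assumptions made for this example, and the comparison is a plain string equality, which may be simpler than the check performed in callback.go.

```go
package main

import (
	"errors"
	"fmt"
)

// Policy models the suggested partialUpstreamAuth option (hypothetical type).
type Policy string

const (
	RequireAll Policy = "require-all"
	Allow      Policy = "allow"
)

// parsePolicy maps a config string to a Policy. An empty value falls back
// to require-all so that existing deployments keep today's behavior.
func parsePolicy(s string) (Policy, error) {
	switch s {
	case "", string(RequireAll):
		return RequireAll, nil
	case string(Allow):
		return Allow, nil
	}
	return "", fmt.Errorf("unknown partialUpstreamAuth value %q", s)
}

// ErrSubjectMismatch signals that a leg resolved to a different identity.
var ErrSubjectMismatch = errors.New("resolved subject does not match session subject")

// Session is a hypothetical in-memory session: one subject, many tokens.
type Session struct {
	Subject string
	Tokens  map[string]string
}

// Bind stores a token for provider only if resolvedSubject matches the
// subject established by the first successful leg.
func (s *Session) Bind(provider, resolvedSubject, token string) error {
	if s.Subject == "" {
		s.Subject = resolvedSubject
	} else if s.Subject != resolvedSubject {
		return ErrSubjectMismatch
	}
	if s.Tokens == nil {
		s.Tokens = make(map[string]string)
	}
	s.Tokens[provider] = token
	return nil
}

func main() {
	policy, _ := parsePolicy("")
	fmt.Println("policy:", policy)

	s := &Session{}
	fmt.Println(s.Bind("github", "user-1", "tok-a")) // first leg sets the subject
	fmt.Println(s.Bind("slack", "user-2", "tok-b"))  // retried leg, wrong identity
}
```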

Acceptance criteria (proposed)

  • vMCP can be configured to allow partial upstream completion.
  • When configured for partial completion, a failed/declined upstream does
    not invalidate tokens already collected for other upstreams.
  • Backends whose required upstream lacks a token are excluded from the
    vMCP's exposed surface, or return a clear, structured error identifying
    the missing upstream.
  • The user can re-attempt authorization for a specific failed upstream
    without restarting the full chain.
  • The user can also force a full restart of the chain (today's behavior,
    now opt-in rather than implicit).
  • Identity-consistency checks across chain legs still hold under retry.
  • Documentation explains the partial-availability model and how operators
    mark upstreams required vs. optional.
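
One possible shape for the "clear, structured error" criterion is a typed error that carries the backend and the missing upstream, so callers can prompt re-authorization for exactly that provider. MissingUpstreamError and authorizeBackend are hypothetical names for this sketch, not existing vMCP types.

```go
package main

import (
	"errors"
	"fmt"
)

// MissingUpstreamError identifies which upstream a backend is waiting on
// (hypothetical type for illustration).
type MissingUpstreamError struct {
	Backend  string
	Upstream string
}

func (e *MissingUpstreamError) Error() string {
	return fmt.Sprintf("backend %q unavailable: no token for upstream %q", e.Backend, e.Upstream)
}

// authorizeBackend returns a MissingUpstreamError when the backend's
// required upstream has no token, and nil otherwise.
func authorizeBackend(backend, upstream string, tokens map[string]bool) error {
	if !tokens[upstream] {
		return &MissingUpstreamError{Backend: backend, Upstream: upstream}
	}
	return nil
}

func main() {
	tokens := map[string]bool{"github": true, "google": true}

	err := authorizeBackend("slack-mcp", "slack", tokens)
	var missing *MissingUpstreamError
	if errors.As(err, &missing) {
		// A client could use this to trigger a retry for one provider.
		fmt.Println("re-authorize upstream:", missing.Upstream)
	}
}
```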

Related code
