Summary
When a vMCP is fronted by the embedded authorization server with multiple upstream
identity providers configured, the authorization flow is "all or nothing": every
upstream in the chain must succeed before the user receives an authorization code.
If any single upstream is unavailable or returns an error, or the user declines
consent, the entire authorization fails and no backends are reachable — even
backends that depend solely on upstreams the user already authorized successfully.
There is also no mechanism to retry just the failed upstream; the user has to
restart the whole authorization chain from scratch.
Background
The embedded authorization server walks the configured upstream provider list
sequentially. After each successful callback,
`continueChainOrComplete` calls `nextMissingUpstream` to find
the next provider without stored tokens for the session, and redirects the user
there. Only when every configured upstream has a stored token is the
authorization code issued to the client.
When any leg fails, the failure path deletes every upstream token already
collected for that session
(`DeleteUpstreamTokens`) and the client receives a generic `access_denied`
error with the hint "upstream authentication failed".
Backend auth strategies that rely on the embedded auth server tokens select a
specific upstream by name:
- `UpstreamInjectConfig.ProviderName`
- `TokenExchangeConfig.SubjectProviderName`
- `AwsStsConfig.SubjectProviderName`

(see `pkg/vmcp/auth/types/types.go`)
So in principle the system already knows which backends depend on which
upstream — that information just isn't used to make partial-availability
decisions.
Problem
Consider a vMCP aggregating three backends:
| Backend      | Required upstream provider |
|--------------|----------------------------|
| `github-mcp` | github                     |
| `slack-mcp`  | slack                      |
| `gdrive-mcp` | google                     |
If the Slack IdP is having an outage (or the user simply declines the Slack
consent screen), the user cannot use any backend — including github-mcp and
gdrive-mcp, which have nothing to do with Slack. From the user's
perspective the whole vMCP appears down.
Proposed behavior
Two complementary changes:
- Graceful degradation on partial upstream failure. Allow the authorization
flow to complete with a subset of upstreams satisfied. Backends whose
required upstream is missing should be filtered out of the vMCP's exposed
tool/resource set (or fail their individual requests with a clear,
actionable error pointing the user at which upstream is missing) while
backends with satisfied upstreams remain fully usable.
- Retry mechanism for failed/missing upstreams. Expose a way for the user
to re-attempt authorization for upstreams that previously failed, without
forcing them to restart the entire chain. Two reasonable variants worth
considering:
- Retry-failed-only: re-run only the upstreams missing tokens, keep
existing successful tokens.
- Restart-all: discard all stored upstream tokens and re-run the full
chain (today's behavior, but as an explicit user choice rather than the
only option).
This could surface as an endpoint on the embedded auth server (e.g.
`/auth/retry?provider=<name>`), an MCP-side tool, and/or an entry in the
vMCP UI/CLI.
Open questions
- Identity consistency. Today the chain enforces that the resolved subject
on later legs matches the first leg
(`callback.go#L391`). With
partial completion, the same constraint must still apply when missing legs
are eventually completed — a retry must not silently let a different
identity bind tokens into the same session.
- Configuration knob. Some operators may want the current all-or-nothing
semantics (e.g. compliance requires every IdP attestation before any backend
is reachable). Suggest a per-vMCP / per-auth-server config option, e.g.
`partialUpstreamAuth: allow|require-all` (default to `require-all` to
preserve current behavior).
- Required vs. optional upstreams. Should each upstream provider declare
whether it is required or optional, so the auth server knows when it can
proceed without a token from a given upstream?
- Discovery / capability surfacing. When backends are filtered out, the
vMCP needs to communicate which tools/resources are unavailable and why
(which upstream is missing) so clients can prompt re-auth.
- Token refresh failures. Same shape of problem applies when an upstream
refresh token expires for one provider but not others — should that also
degrade gracefully via the same mechanism?
Acceptance criteria (proposed)
- A failed or declined upstream leg does not invalidate tokens already
collected for other upstreams.
- Backends whose required upstream token is missing are either filtered from
the vMCP's exposed surface, or return a clear, structured error identifying
the missing upstream.
- The user can retry authorization for a failed or missing upstream
without restarting the full chain.
- All-or-nothing semantics remain available via configuration (today's
behavior becomes an explicit choice, now opt-in rather than implicit).
- Upstream provider configuration can optionally
mark upstreams required vs. optional.
Related code
- `continueChainOrComplete`
- `nextMissingUpstream`
- `DeleteUpstreamTokens`
- `pkg/vmcp/auth/types/types.go`
- `callback.go#L391`