vMCP: graceful degradation when one upstream IdP fails in multi-upstream auth #5162

## Summary

When a vMCP is fronted by the embedded authorization server with **multiple upstream
identity providers** configured, the authorization flow is "all or nothing": every
upstream in the chain must succeed before the user receives an authorization code.
If any single upstream is unavailable or returns an error, or if the user declines to
consent, the entire authorization fails and *no* backends are reachable, including
backends that depend solely on upstreams the user already authorized successfully.

There is also no mechanism to retry just the failed upstream; the user has to
restart the whole authorization chain from scratch.

## Background

The embedded authorization server walks the configured upstream provider list
sequentially. After each successful callback,
[`continueChainOrComplete`](https://github.com/stacklok/toolhive/blob/main/pkg/authserver/server/handlers/callback.go#L368) calls
[`nextMissingUpstream`](https://github.com/stacklok/toolhive/blob/main/pkg/authserver/server/handlers/handler.go#L118) to find
the next provider without stored tokens for the session, and redirects the user
there. Only when every configured upstream has a stored token is the
authorization code issued to the client.
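
The sequential walk described above can be sketched roughly as follows. This is a simplified illustration, not the actual handler code: `nextMissing` is a hypothetical stand-in for `nextMissingUpstream`, and the token store is reduced to a map of provider names that already have a token.

```go
package main

import "fmt"

// nextMissing returns the first configured upstream without a stored token
// for the session, or "" when every upstream is satisfied.
// Hypothetical stand-in: the real nextMissingUpstream signature and storage
// types are not reproduced here.
func nextMissing(configured []string, stored map[string]bool) string {
	for _, name := range configured {
		if !stored[name] {
			return name
		}
	}
	return ""
}

func main() {
	configured := []string{"github", "slack", "google"}
	stored := map[string]bool{"github": true} // first leg already completed

	// After each successful callback, look for the next gap in the chain.
	for next := nextMissing(configured, stored); next != ""; next = nextMissing(configured, stored) {
		fmt.Println("redirect to upstream:", next)
		stored[next] = true // pretend the callback for this leg succeeded
	}
	fmt.Println("all upstreams satisfied: issue authorization code")
}
```

The point of the sketch is that the code is only issued once the loop finds no remaining gap, which is what makes the flow all-or-nothing.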

When any leg fails, the failure path deletes every upstream token already
collected for that session
([`DeleteUpstreamTokens`](https://github.com/stacklok/toolhive/blob/main/pkg/authserver/server/handlers/callback.go#L346)) and
the client receives a generic `access_denied` error with the hint
`"upstream authentication failed"`.
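
As a rough model of that failure path, assuming an in-memory store keyed by session ID (the `tokenStore` type, `failChain`, and `oauthError` are illustrative names, not the real types behind `DeleteUpstreamTokens`):

```go
package main

import "fmt"

// tokenStore maps sessionID -> provider name -> token.
// Illustrative only; the real storage interface is not shown in this issue.
type tokenStore map[string]map[string]string

// deleteUpstreamTokens drops every upstream token held for the session.
func (s tokenStore) deleteUpstreamTokens(sessionID string) {
	delete(s, sessionID)
}

// oauthError is a minimal stand-in for the error returned to the client.
type oauthError struct {
	Code string
	Hint string
}

// failChain models today's behavior: any failed leg wipes all collected
// tokens and the client only sees a generic access_denied.
func failChain(store tokenStore, sessionID string) oauthError {
	store.deleteUpstreamTokens(sessionID)
	return oauthError{Code: "access_denied", Hint: "upstream authentication failed"}
}

func main() {
	store := tokenStore{"sess-1": {"github": "gh-token", "google": "g-token"}}

	// The slack leg fails, so the github and google tokens are discarded too.
	err := failChain(store, "sess-1")
	fmt.Printf("%s: %s\n", err.Code, err.Hint)
	fmt.Println("tokens left for session:", len(store["sess-1"]))
}
```

Note that the error carries no indication of which upstream failed, so the client cannot tell the user what to retry.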

Backend auth strategies that rely on the embedded auth server tokens select a
specific upstream by name:

- `UpstreamInjectConfig.ProviderName`
- `TokenExchangeConfig.SubjectProviderName`
- `AwsStsConfig.SubjectProviderName`

(see [`pkg/vmcp/auth/types/types.go`](https://github.com/stacklok/toolhive/blob/main/pkg/vmcp/auth/types/types.go))

So in principle the system already knows which backends depend on which
upstream — that information just isn't used to make partial-availability
decisions.
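
To illustrate how that dependency could be read off the existing strategy configs, here is a minimal sketch. The three struct types are stripped-down mirrors that keep only the provider-name fields listed above; the real definitions in `types.go` have more fields and may be shaped differently. `requiredUpstream` is a hypothetical helper, and treating an empty name as "no dependency" is an assumption.

```go
package main

import "fmt"

// Simplified mirrors of the strategy configs; only the provider-name
// fields are taken from the issue text.
type UpstreamInjectConfig struct{ ProviderName string }
type TokenExchangeConfig struct{ SubjectProviderName string }
type AwsStsConfig struct{ SubjectProviderName string }

// requiredUpstream (hypothetical) returns the upstream provider a backend
// strategy binds to. ok is false when the strategy names no upstream.
func requiredUpstream(strategy any) (name string, ok bool) {
	switch c := strategy.(type) {
	case UpstreamInjectConfig:
		return c.ProviderName, c.ProviderName != ""
	case TokenExchangeConfig:
		return c.SubjectProviderName, c.SubjectProviderName != ""
	case AwsStsConfig:
		return c.SubjectProviderName, c.SubjectProviderName != ""
	default:
		return "", false
	}
}

func main() {
	strategies := []any{
		UpstreamInjectConfig{ProviderName: "github"},
		TokenExchangeConfig{SubjectProviderName: "google"},
		AwsStsConfig{}, // no upstream named
	}
	for _, s := range strategies {
		name, ok := requiredUpstream(s)
		fmt.Printf("%T -> upstream=%q required=%v\n", s, name, ok)
	}
}
```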

## Problem

Consider a vMCP aggregating three backends:

| Backend       | Required upstream provider |
|---------------|----------------------------|
| `github-mcp`  | `github`                   |
| `slack-mcp`   | `slack`                    |
| `gdrive-mcp`  | `google`                   |

If the Slack IdP is having an outage (or the user simply declines the Slack
consent screen), the user cannot use *any* backend — including `github-mcp` and
`gdrive-mcp`, which have nothing to do with Slack. From the user's
perspective the whole vMCP appears down.

## Proposed behavior

Two complementary changes:

1. **Graceful degradation on partial upstream failure.** Allow the authorization
   flow to complete with a subset of upstreams satisfied. Backends whose
   required upstream is missing should be filtered out of the vMCP's exposed
   tool/resource set (or fail their individual requests with a clear,
   actionable error pointing the user at which upstream is missing) while
   backends with satisfied upstreams remain fully usable.

2. **Retry mechanism for failed/missing upstreams.** Expose a way for the user
   to re-attempt authorization for upstreams that previously failed, without
   forcing them to restart the entire chain. Two reasonable variants worth
   considering:
   - **Retry-failed-only**: re-run only the upstreams missing tokens, keep
     existing successful tokens.
   - **Restart-all**: discard all stored upstream tokens and re-run the full
     chain (today's behavior, but as an explicit user choice rather than the
     only option).

   This could surface as an endpoint on the embedded auth server (e.g.
   `/auth/retry?provider=<name>`), an MCP-side tool, and/or an entry in the
   vMCP UI/CLI.
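
The two retry variants differ only in whether existing tokens survive. A minimal sketch, assuming tokens are held in a per-session map (`RetryMode`, `planRetry`, and the constants are hypothetical names, not an existing API):

```go
package main

import "fmt"

// RetryMode selects between the two variants proposed above.
type RetryMode int

const (
	RetryFailedOnly RetryMode = iota // keep existing tokens, re-run gaps only
	RestartAll                       // discard everything, re-run the full chain
)

// planRetry (hypothetical) returns the upstreams to re-run, in configured
// order. In RestartAll mode it first clears the session's stored tokens.
func planRetry(mode RetryMode, configured []string, tokens map[string]string) []string {
	if mode == RestartAll {
		for name := range tokens {
			delete(tokens, name)
		}
	}
	var pending []string
	for _, name := range configured {
		if _, ok := tokens[name]; !ok {
			pending = append(pending, name)
		}
	}
	return pending
}

func main() {
	configured := []string{"github", "slack", "google"}

	tokens := map[string]string{"github": "gh-token", "google": "g-token"}
	fmt.Println("retry-failed-only:", planRetry(RetryFailedOnly, configured, tokens))

	tokens = map[string]string{"github": "gh-token", "google": "g-token"}
	fmt.Println("restart-all:", planRetry(RestartAll, configured, tokens))
}
```

A `provider=<name>` style retry would be a narrower form of retry-failed-only, restricted to one named upstream.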

## Open questions

- **Identity consistency.** Today the chain enforces that the resolved subject
  on later legs matches the first leg
  ([callback.go#L391](https://github.com/stacklok/toolhive/blob/main/pkg/authserver/server/handlers/callback.go#L391)). With
  partial completion, the same constraint must still apply when missing legs
  are eventually completed — a retry must not silently let a different
  identity bind tokens into the same session.
- **Configuration knob.** Some operators may *want* the current all-or-nothing
  semantics (e.g. compliance requires every IdP attestation before any backend
  is reachable). Suggest a per-vMCP / per-auth-server config option, e.g.
  `partialUpstreamAuth: allow|require-all` (default to `require-all` to
  preserve current behavior).
- **Required vs. optional upstreams.** Should each upstream provider declare
  whether it is required or optional, so the auth server knows when it can
  proceed without a token from a given upstream?
- **Discovery / capability surfacing.** When backends are filtered out, the
  vMCP needs to communicate which tools/resources are unavailable and why
  (which upstream is missing) so clients can prompt re-auth.
- **Token refresh failures.** The same shape of problem applies when an upstream
  refresh token expires for one provider but not others. Should that also
  degrade gracefully via the same mechanism?

## Acceptance criteria (proposed)

- [ ] vMCP can be configured to allow partial upstream completion.
- [ ] When configured for partial completion, a failed/declined upstream does
      not invalidate tokens already collected for other upstreams.
- [ ] Backends whose required upstream lacks a token are excluded from the
      vMCP's exposed surface, or return a clear, structured error identifying
      the missing upstream.
- [ ] The user can re-attempt authorization for a specific failed upstream
      without restarting the full chain.
- [ ] The user can also force a full restart of the chain (today's behavior,
      now opt-in rather than implicit).
- [ ] Identity-consistency checks across chain legs still hold under retry.
- [ ] Documentation explains the partial-availability model and how operators
      mark upstreams required vs. optional.

## Related code

- [pkg/authserver/server/handlers/callback.go](https://github.com/stacklok/toolhive/blob/main/pkg/authserver/server/handlers/callback.go) — chain continuation and failure paths
- [pkg/authserver/server/handlers/handler.go](https://github.com/stacklok/toolhive/blob/main/pkg/authserver/server/handlers/handler.go) — `nextMissingUpstream`
- [pkg/vmcp/auth/types/types.go](https://github.com/stacklok/toolhive/blob/main/pkg/vmcp/auth/types/types.go) — backend strategies that bind to a specific upstream provider
- [pkg/authserver/config.go](https://github.com/stacklok/toolhive/blob/main/pkg/authserver/config.go) — upstream provider configuration
