Fix Cedar upstream-claim evaluation on VirtualMCPServer #5002

Merged
jhrozek merged 5 commits into main from fix-vmcp-cedar-upstream-provider
Apr 22, 2026

Conversation

@jhrozek
Contributor

@jhrozek jhrozek commented Apr 22, 2026

Summary

  • Why: When a VirtualMCPServer uses an embedded auth server with upstream OIDC providers (e.g. Okta), Cedar policies referencing upstream claims like principal.claim_department fail at runtime with does not have the attribute claim_department. The operator's converter never populated AuthzConfig.PrimaryUpstreamProvider, so Cedar evaluated against the ToolHive-issued AS token instead of the upstream IDP token. The same setup works on the thv run path, which already has the wiring.
  • What: Three commits close the gap — one fixes the converter, one adds a safety check so misconfigurations don't silently authorize against the wrong identity, and one pins the Cedar runtime contract below the fix with a behavioral regression test.
    • Commit 1 derives PrimaryUpstreamProvider from Spec.AuthServerConfig.UpstreamProviders[0].Name in convertIncomingAuth, mirroring injectSubjectProviderIfNeeded (outgoing auth) and injectUpstreamProviderIfNeeded (thv run path). Leaves the field empty when no embedded AS or no upstreams are configured so Cedar correctly falls back to ToolHive-issued claims in those modes.
    • Commit 2 adds validateAuthzUpstreamAvailable to the reconciler chain. When AuthzConfig is set but no upstream IDP is configured, the VirtualMCPServer goes to Failed with AuthServerConfigValidated=False, Reason=AuthzRequiresUpstream. The user-facing message points at spec.authServerConfig.upstreamProviders.
    • Commit 3 adds an integration test that flips PrimaryUpstreamProvider between "okta" and "" with the same identity and policy, asserting the expected permit/deny outcome through the middleware stack. Pins the contract the operator fix relies on.
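
The converter behaviour described for Commit 1 can be sketched roughly as follows. This is a minimal illustration, not the operator's code: the struct definitions and the helper names `derivePrimaryUpstream` and `applyPrimaryUpstream` are hypothetical stand-ins, and the empty-name normalization case covered by the real tests is not modelled.

```go
package main

import "fmt"

// Hypothetical, trimmed-down stand-ins for the operator's CRD types.
type UpstreamProvider struct{ Name string }

type AuthServerConfig struct{ UpstreamProviders []UpstreamProvider }

type AuthzConfig struct{ PrimaryUpstreamProvider string }

// derivePrimaryUpstream returns the name of the first upstream provider, or ""
// when there is no embedded auth server or no upstreams are configured.
func derivePrimaryUpstream(as *AuthServerConfig) string {
	if as == nil || len(as.UpstreamProviders) == 0 {
		return ""
	}
	return as.UpstreamProviders[0].Name
}

// applyPrimaryUpstream sets the field only when an authz config exists.
func applyPrimaryUpstream(authz *AuthzConfig, as *AuthServerConfig) {
	if authz == nil {
		return
	}
	authz.PrimaryUpstreamProvider = derivePrimaryUpstream(as)
}

func main() {
	authz := &AuthzConfig{}
	as := &AuthServerConfig{UpstreamProviders: []UpstreamProvider{{Name: "okta"}, {Name: "github"}}}
	applyPrimaryUpstream(authz, as)
	fmt.Printf("primary upstream: %q\n", authz.PrimaryUpstreamProvider)
}
```

With two upstreams declared, the first one ("okta") is selected, which is the first-upstream-wins rule the description refers to.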

Fixes #4997

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)
  • Integration tests — new TestIntegrationPrimaryUpstreamProviderClaimAttributeAccess in pkg/authz/integration_test.go
  • Linting (task lint-fix)
  • Manual end-to-end verification on a kind cluster (see Verified section below)

Changes

| File | Change |
| --- | --- |
| cmd/thv-operator/pkg/vmcpconfig/converter.go | Populate AuthzConfig.PrimaryUpstreamProvider from the first upstream provider when an embedded auth server is configured |
| cmd/thv-operator/pkg/vmcpconfig/converter_test.go | Table-driven coverage: no auth server, empty upstream list, single named upstream, empty-name normalization, multi-upstream first-wins, no authz config |
| cmd/thv-operator/api/v1beta1/virtualmcpserver_types.go | New condition reason ConditionReasonAuthzRequiresUpstream |
| cmd/thv-operator/controllers/virtualmcpserver_controller.go | Extract runAuthValidations and add validateAuthzUpstreamAvailable |
| cmd/thv-operator/controllers/virtualmcpserver_controller_test.go | Seven subtests covering the validator, including the anonymous + authzConfig + no upstream scenario that motivated the safety check |
| pkg/authz/integration_test.go | Behavioral regression test for the Cedar runtime contract: permit/deny flips on PrimaryUpstreamProvider with identical identity and policy |

Does this introduce a user-facing change?

Yes. VirtualMCPServer deployments that combine an embedded auth server, upstream OIDC providers, and Cedar authzConfig policies referencing upstream claims now work correctly. Previously, those policies failed with does not have the attribute claim_*. Separately, VirtualMCPServers that declare authzConfig without any upstream provider are now rejected with a Failed phase rather than silently authorizing against the ToolHive-issued token.

Verified

Manual end-to-end verification against a real Okta tenant on a kind cluster.

VirtualMCPServer shape used:

apiVersion: toolhive.stacklok.dev/v1beta1
kind: VirtualMCPServer
spec:
  authServerConfig:
    issuer: https://<as-host>/cedar-claims-test
    upstreamProviders:
      - name: okta
        type: oidc
        oidcConfig:
          issuerUrl: https://<okta-tenant>/oauth2/<auth-server-id>
          clientId: <client-id>
          clientSecretRef:
            name: okta-client-secret
            key: client-secret
          redirectUri: https://<as-host>/cedar-claims-test/oauth/callback
          scopes: [openid, profile, email, offline_access]
  incomingAuth:
    type: oidc
    oidcConfigRef:
      name: embedded-as
      audience: https://<mcp-host>/cedar-claims-test/mcp
      resourceUrl: https://<mcp-host>/cedar-claims-test/mcp
    authzConfig:
      type: inline
      inline:
        policies:
          - |
            permit(principal, action == Action::"call_tool", resource)
            when { principal.claim_groups.contains("engineering") };
          - |
            permit(principal, action == Action::"list_tools", resource);
          - |
            permit(principal, action == Action::"initialize", resource);

The Cedar policy deliberately references principal.claim_groups — a claim that lives on the Okta access token but not on the ToolHive-issued AS token. Without the fix, Cedar would read from the AS token (no groups) and deny. With the fix, Cedar reads from the upstream Okta token (groups: [Everyone, engineering]) and permits.

Evidence from the pod logs after completing an OAuth flow through Okta and issuing tools/list:

Resolved JWT claim keys for Cedar evaluation
  source=upstream
  keys=[auth_time, sub, iss, aud, scp, groups, ...aws/Groups, ver, jti, iat, exp, cid, uid]
cedar authorization check
  principal=Client::"jakub@stacklok.com"
  action=Action::"call_tool"
  resource=Tool::"backend-yardstick_echo"
cedar decision
  decision=allow
  diagnostic.reasons=[policy0]

source=upstream (from pkg/authz/authorizers/cedar/core.go:452) proves the runtime took the upstream branch at core.go:421 — the exact code path enabled by Commit 1. The claim key set matches Okta's access token shape (uid, cid, aws/.../Groups), distinct from the AS-issued token shape which would include tsid and client_id. Pre-fix this log would have read source=token and Cedar would have evaluated against the AS-issued token.
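
The branch that the log line points at can be pictured with a small sketch. It is an assumption-laden model, not the code in core.go: the `Identity` type, `resolveClaims`, and `hasGroup` are hypothetical names, and returning an empty claim set for an unknown provider is a choice made for this sketch only.

```go
package main

import "fmt"

// Identity is a hypothetical container for the two claim sets discussed above.
type Identity struct {
	TokenClaims    map[string]any            // claims on the ToolHive-issued AS token
	UpstreamClaims map[string]map[string]any // upstream IDP claims keyed by provider name
}

// resolveClaims picks the claim source. A non-empty provider name selects the
// upstream branch; an empty name falls back to the AS-issued token.
// Sketch assumption: an unknown provider yields a nil (empty) claim set.
func resolveClaims(id Identity, primaryUpstream string) (string, map[string]any) {
	if primaryUpstream != "" {
		return "upstream", id.UpstreamClaims[primaryUpstream]
	}
	return "token", id.TokenClaims
}

// hasGroup mirrors the policy guard principal.claim_groups.contains(group).
func hasGroup(claims map[string]any, group string) bool {
	groups, ok := claims["groups"].([]string)
	if !ok {
		return false
	}
	for _, g := range groups {
		if g == group {
			return true
		}
	}
	return false
}

func main() {
	id := Identity{
		TokenClaims: map[string]any{"sub": "user-1", "tsid": "session-1"},
		UpstreamClaims: map[string]map[string]any{
			"okta": {"sub": "user-1", "groups": []string{"Everyone", "engineering"}},
		},
	}
	for _, provider := range []string{"okta", ""} {
		source, claims := resolveClaims(id, provider)
		fmt.Printf("source=%s permit=%v\n", source, hasGroup(claims, "engineering"))
	}
}
```

Running the sketch prints `source=upstream permit=true` followed by `source=token permit=false`, the same flip between post-fix and pre-fix behaviour that the logs above describe.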

The Commit 2 validator (PR 2 in the implementation plan) was observed by applying a VirtualMCPServer with authzConfig set but no upstream providers:

status:
  phase: Failed
  conditions:
    - type: AuthServerConfigValidated
      status: "False"
      reason: AuthzRequiresUpstream
      message: >-
        spec.incomingAuth.authzConfig is set but no upstream IDP is
        configured: Cedar policies referencing principal.claim_* would
        evaluate against the ToolHive-issued token instead of the upstream
        IDP token. Configure spec.authServerConfig.upstreamProviders with
        at least one upstream IDP.
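
A rough model of the rule that produced this status, as Commit 2 first described it (the rule was narrowed later in review), might look like the sketch below. `VMCPSpec`, `Condition`, and `validateAuthzUpstream` are hypothetical simplifications; only the condition type, status, and reason strings are taken from the output above, and the message is abbreviated.

```go
package main

import "fmt"

// VMCPSpec is a hypothetical flattening of the fields the validator inspects.
type VMCPSpec struct {
	AuthzConfigSet bool     // spec.incomingAuth.authzConfig is present
	AuthServerSet  bool     // spec.authServerConfig is present
	UpstreamNames  []string // names under spec.authServerConfig.upstreamProviders
}

// Condition is a simplified status condition.
type Condition struct {
	Type, Status, Reason, Message string
}

// validateAuthzUpstream returns a failing condition when authz is configured
// but no upstream IDP is available, and nil when the spec is acceptable.
func validateAuthzUpstream(spec VMCPSpec) *Condition {
	if !spec.AuthzConfigSet {
		return nil
	}
	if spec.AuthServerSet && len(spec.UpstreamNames) > 0 {
		return nil
	}
	return &Condition{
		Type:    "AuthServerConfigValidated",
		Status:  "False",
		Reason:  "AuthzRequiresUpstream",
		Message: "spec.incomingAuth.authzConfig is set but no upstream IDP is configured",
	}
}

func main() {
	cond := validateAuthzUpstream(VMCPSpec{AuthzConfigSet: true, AuthServerSet: true})
	if cond != nil {
		fmt.Printf("phase=Failed type=%s status=%s reason=%s\n", cond.Type, cond.Status, cond.Reason)
	}
}
```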

Implementation plan

Approved implementation plan

Scope B + D from the planning discussion:

  • PR 1 (B1) — Operator converter auto-sets AuthzConfig.PrimaryUpstreamProvider from Spec.AuthServerConfig.UpstreamProviders[0].Name. Mirrors the existing injectUpstreamProviderIfNeeded helper on the thv run path. Leaves the field empty when no embedded AS or no upstreams, so Cedar falls back to ToolHive claims in those modes.
  • PR 2 (B2) — Reconciler validator rejects the degenerate case where AuthzConfig is set but no upstream is available, surfacing ConfigurationValid=False, Reason=AuthzRequiresUpstream. Protects against silent authorization-on-wrong-identity when the AS and upstream claim namespaces overlap (sub, aud, tsid).
  • Option X — Integration test pinning the runtime contract below the operator fix. Lands as Commit 3 on this PR.
  • PR 4 (D, follow-up) — Separate RFC in toolhive-rfcs proposing explicit authn: true discriminator or per-upstream claim mappers to resolve the modeling gap the oauth-expert flagged (UpstreamProviders[] conflates authN IdP with resource-access OAuth upstreams). Not part of this fix.

Selection for this PR is first-upstream-wins, consistent with all four existing call sites (pkg/runner/middleware.go:391, pkg/runner/middleware.go:355, pkg/vmcp/config/defaults.go:110, cmd/thv-operator/controllers/virtualmcpserver_controller.go:2099). The broader modeling question is tracked for PR 4.

Special notes for reviewers

  • The converter fix picks the first upstream provider. This matches every other call site in the codebase today, but as the oauth-expert flagged in the planning discussion, it's not semantically precise for a multi-upstream deployment where the first declared upstream isn't the authN IdP. A follow-up RFC in toolhive-rfcs will propose an explicit discriminator; this PR is not the right place to introduce that CRD surface.
  • The validator reuses the existing AuthServerConfigValidated condition type rather than introducing a new AuthzConfigValidated. Rationale: the fix the user must make is on spec.authServerConfig.upstreamProviders, so surfacing under that condition points at the right field. A new ConditionReasonAuthzRequiresUpstream disambiguates.
  • runAuthValidations is extracted from runValidations because gocyclo tripped on the added branch. The moved block is byte-for-byte identical to what was there before; no behavior change.
  • During kind-cluster verification we also hit an unrelated streamable-http bug in the vMCP transport: when the MCP client POSTs a JSON-RPC response to a server-initiated request (e.g., a ping reply), the server returns 400 instead of the spec-mandated 202. This is being filed as a separate issue and is not in scope for #4997.

Generated with Claude Code

jhrozek added 2 commits April 22, 2026 12:30
VirtualMCPServer's operator converter never set
AuthzConfig.PrimaryUpstreamProvider, so Cedar policies that referenced
upstream claims (e.g. principal.claim_department) failed at runtime.
Cedar evaluated against the ToolHive-issued AS token rather than the
upstream IDP token, and the claim was missing.

Derive the field from Spec.AuthServerConfig.UpstreamProviders[0].Name in
convertIncomingAuth when an embedded auth server with upstream providers
is configured. Mirrors injectSubjectProviderIfNeeded in
virtualmcpserver_controller.go (outgoing auth) and
injectUpstreamProviderIfNeeded in pkg/runner/middleware.go (thv run
path). Leaves the field empty when no embedded AS or no upstreams so
Cedar correctly falls back to ToolHive-issued claims in those modes.

Fixes #4997

When IncomingAuth.AuthzConfig is set but no upstream IDP is configured,
Cedar silently evaluates policies against the ToolHive-issued AS token.
That token's claim namespace (sub, aud, tsid) can overlap upstream
claims and authorize against the wrong identity, so the misconfig must
be surfaced rather than deployed.

Add validateAuthzUpstreamAvailable to the VirtualMCPServer reconciler
chain. When AuthzConfig is set but AuthServerConfig is nil or
UpstreamProviders is empty, mark the server Failed and set
AuthServerConfigValidated=False with reason AuthzRequiresUpstream. The
user-facing message points at spec.authServerConfig.upstreamProviders,
which is where the fix belongs.

Extract runAuthValidations from runValidations so the auth-related
checks live together and gocyclo stays happy. No behavior change in the
moved block.

Belt-and-suspenders companion to the converter fix in the previous
commit: the converter wires the provider name when upstreams exist; this
validator makes the absence of upstreams an explicit failure.

Refs #4997
@github-actions github-actions Bot added the size/M Medium PR: 300-599 lines changed label Apr 22, 2026
@codecov

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 88.75000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.05%. Comparing base (93e1478) to head (0a27a72).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...perator/controllers/virtualmcpserver_controller.go | 88.00% | 7 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5002      +/-   ##
==========================================
+ Coverage   69.02%   69.05%   +0.02%     
==========================================
  Files         554      554              
  Lines       73083    73160      +77     
==========================================
+ Hits        50445    50520      +75     
+ Misses      19631    19628       -3     
- Partials     3007     3012       +5     

Behavioral regression guard for the converter fix in the preceding
commit. With the same identity, upstream token, and Cedar policy, flip
PrimaryUpstreamProvider between "okta" and "" and observe the outcome
change from permit to deny through the real middleware stack.

This pins the runtime contract below the operator-layer fix: when the
converter populates PrimaryUpstreamProvider, Cedar resolves
principal.claim_* from the upstream IDP token; when it is empty, Cedar
falls back to the AS-issued token's claims (which do not carry upstream
profile attributes). Uses has-attribute rather than equality so a
failure uniquely signals "the claim source Cedar read from lacked this
attribute" — the exact #4997 regression shape.

Refs #4997
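
The reasoning for preferring has-attribute over equality in the regression test can be illustrated with a toy comparison. This is a sketch only: it models Cedar guards as plain Go functions (`permitIfHas` and `permitIfEquals` are hypothetical names), treats a missing attribute as a deny, and uses made-up claim values.

```go
package main

import "fmt"

// permitIfHas models a guard like `principal has claim_department`.
func permitIfHas(claims map[string]string, attr string) bool {
	_, ok := claims[attr]
	return ok
}

// permitIfEquals models a guard like `principal.claim_department == "engineering"`.
// A missing attribute is treated as a deny in this sketch.
func permitIfEquals(claims map[string]string, attr, want string) bool {
	got, ok := claims[attr]
	return ok && got == want
}

func main() {
	// Hypothetical claim sets: the upstream token carries the attribute,
	// the AS-issued token does not.
	upstream := map[string]string{"claim_sub": "user-1", "claim_department": "sales"}
	asIssued := map[string]string{"claim_sub": "user-1"}

	fmt.Println("has, upstream:   ", permitIfHas(upstream, "claim_department"))
	fmt.Println("has, AS-issued:  ", permitIfHas(asIssued, "claim_department"))
	fmt.Println("equals, upstream:", permitIfEquals(upstream, "claim_department", "engineering"))
	fmt.Println("equals, AS:      ", permitIfEquals(asIssued, "claim_department", "engineering"))
}
```

In this model the equality guard denies for both claim sets, so a deny cannot distinguish a wrong value from a wrong claim source. The has-attribute guard denies only when the attribute is absent from the source that was read.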
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 22, 2026
@jhrozek jhrozek mentioned this pull request Apr 22, 2026
3 tasks
@jhrozek jhrozek marked this pull request as ready for review April 22, 2026 14:21
Contributor

@tgrunnagle tgrunnagle left a comment

Also trying the /dev:pr-review skill on this, so there will be some AI-generated comments

Comment thread cmd/thv-operator/controllers/virtualmcpserver_controller.go Outdated

@tgrunnagle tgrunnagle left a comment

Multi-Agent Consensus Review

Agents consulted: security-reviewer, k8s-operator-reviewer, test-coverage-reviewer, general-quality-reviewer

Consensus Summary

| # | Finding | Consensus | Severity | Action |
| --- | --- | --- | --- | --- |
| F1 | ConditionTypeAuthServerConfigValidated reused for AuthzConfig misconfiguration | 10/10 | MEDIUM | Fix |
| F2 | First-upstream-wins: no signal in multi-upstream deployments | 10/10 | MEDIUM | Discuss |
| F3 | fmt.Errorf("%s", msg) → errors.New | 10/10 | LOW | Fix |
| F4 | Unreachable guard branch: ordering dependency undocumented | 7/10 | LOW | Document |
| F5 | No log emitted on AuthzRequiresUpstream failure | 7/10 | LOW | Fix |
| F6 | No positive condition after successful validation (undocumented) | 7/10 | INFO | Document |
| F7 | runValidations doc comment stale after refactor | 10/10 | LOW | Fix |
| F9 | UpdateStatus return value discarded in test | 7/10 | MEDIUM | Fix |
| F10 | Integration test 403 vs 500 Cedar diagnostic ambiguity | 7/10 | LOW | Clarify |
| F11 | No test exercises both validators in sequence via runAuthValidations | 10/10 | LOW | Fix |
| F12 | No double-nil (authServerConfig=nil + authzConfig=nil) converter test | 7/10 | LOW | Fix |
| F13 | No end-to-end test for runAuthValidations through runValidations | 7/10 | LOW | Fix |
| F14 | context.Background() → t.Context() in new tests | 5/10 | LOW | Fix |

Overall

The core fix is correct and well-motivated: populating PrimaryUpstreamProvider in convertIncomingAuth mirrors the existing injectUpstreamProviderIfNeeded pattern on the thv run path precisely. The converter change is minimal and the validator adds genuine security value by surfacing a previously silent misconfiguration. Testing is thorough at the unit level with comprehensive table-driven cases.

The two MEDIUM findings are not blocking but worth addressing. F1 (condition type semantic mismatch) is the most actionable — ConditionTypeAuthServerConfigValidated=False fires when AuthServerConfig is absent, which points users at the wrong field. Since ConditionReasonAuthzRequiresUpstream is already a new constant in this PR, introducing a corresponding ConditionTypeAuthzConfigValidated would be a clean addition. F2 (first-upstream-wins with no signal) is acknowledged in the special notes as accepted debt with a follow-up RFC planned. The test suite has a structural gap: all new tests call private methods directly, so if runAuthValidations were accidentally dropped from runValidations, no test would catch it (F11/F13). Inline comments are posted only for MEDIUM findings; see the table above for LOW-level items.

Documentation

No documentation files are changed by this PR. The new ConditionReasonAuthzRequiresUpstream failure mode should be noted in operator troubleshooting docs and release notes — existing VirtualMCPServers with authzConfig set but no upstream IDP will transition to Failed phase on their first reconcile after the operator upgrade.


Generated with Claude Code

Comment thread cmd/thv-operator/controllers/virtualmcpserver_controller.go Outdated
Comment thread cmd/thv-operator/pkg/vmcpconfig/converter.go
Comment thread cmd/thv-operator/controllers/virtualmcpserver_controller_test.go Outdated
Key change: narrow validateAuthzUpstreamAvailable to only reject when
an embedded auth server is configured but has no upstream providers.
Previously the validator rejected any authz configuration without an
embedded auth server, which false-positived direct-IdP deployments
where the client presents an Okta/Entra/etc token directly and Cedar
evaluates against identity.Claims via the default branch. Thanks to
Trey for catching this regression during review.

Additional review feedback addressed:
- Advisory status condition AuthzUpstreamSelectionWarning surfaces
  which upstream was auto-selected when multiple are configured (F2).
  Condition is set only on the applicable path; stale conditions are
  removed via RemoveConditionsWithPrefix on non-applicable paths so
  normal VMCPs do not carry a False/NotApplicable advisory row.
- Use errors.New instead of fmt.Errorf("%s", msg) (F3).
- Emit a WARN-equivalent log (logr.Info at V=0, matching the file
  idiom) when rejecting so operators have a grep-able signal (F5).
- Update runValidations doc comment after the earlier runAuthValidations
  extraction (F7).
- Assert UpdateStatus bool return in error-path test subtests (F9).
- Add double-nil (AuthServerConfig nil and AuthzConfig nil) converter
  subtest (F12).
- Switch new tests to t.Context() (F14).

Refs #4997
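
The narrowed rule and the multi-upstream advisory from this commit message can be sketched as a single decision function. Everything here is illustrative: `Outcome` and `evaluateAuthzUpstream` are hypothetical names, and only the reason `AuthzRequiresUpstream` and the condition name `AuthzUpstreamSelectionWarning` come from the text above.

```go
package main

import "fmt"

// Outcome is a hypothetical summary of what the validator would report.
type Outcome struct {
	Reject           bool   // true when the server should go to Failed
	Reason           string // rejection reason, empty when not rejected
	Advisory         string // advisory condition name, empty when not applicable
	SelectedUpstream string // upstream chosen under first-upstream-wins
}

// evaluateAuthzUpstream rejects only when an embedded auth server is
// configured without upstream providers. Direct-IdP deployments (authz
// without an embedded auth server) pass untouched.
func evaluateAuthzUpstream(authzSet, authServerSet bool, upstreams []string) Outcome {
	if !authzSet || !authServerSet {
		return Outcome{}
	}
	if len(upstreams) == 0 {
		return Outcome{Reject: true, Reason: "AuthzRequiresUpstream"}
	}
	out := Outcome{SelectedUpstream: upstreams[0]}
	if len(upstreams) > 1 {
		// Surface which upstream was auto-selected when several are declared.
		out.Advisory = "AuthzUpstreamSelectionWarning"
	}
	return out
}

func main() {
	fmt.Printf("%+v\n", evaluateAuthzUpstream(true, false, nil))
	fmt.Printf("%+v\n", evaluateAuthzUpstream(true, true, nil))
	fmt.Printf("%+v\n", evaluateAuthzUpstream(true, true, []string{"okta", "github"}))
}
```

The first call models the direct-IdP case that the earlier validator rejected by mistake. In this sketch it returns an empty outcome, so no failure and no advisory are reported.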
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 22, 2026
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Apr 22, 2026
@jhrozek jhrozek merged commit a75ea2c into main Apr 22, 2026
43 checks passed
@jhrozek jhrozek deleted the fix-vmcp-cedar-upstream-provider branch April 22, 2026 21:46

Labels

size/L Large PR: 600-999 lines changed

Development

Successfully merging this pull request may close these issues.

Cedar policies cannot access upstream IDP claims on VirtualMCPServer with embedded auth server

2 participants