
RFC-0057: MCPRemoteEndpoint CRD — Unified Remote MCP Server Connectivity#55

Open
JAORMX wants to merge 15 commits into main from jaosorior/mcpserverentry-direct-remote-backends

Conversation

@JAORMX

@JAORMX JAORMX commented Mar 12, 2026

Summary

Introduces MCPRemoteEndpoint, a new CRD that unifies remote MCP server connectivity under a single resource with two explicit modes:

  • type: proxy — Deploys a proxy pod with full auth middleware, authz policy, and audit logging. Functionally replaces MCPRemoteProxy.
  • type: direct — No pod deployed; VirtualMCPServer connects directly to the remote URL. Resolves forced-auth on public remotes (#3104) and eliminates unnecessary infrastructure for simple remote backends.
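
As a sketch of the two modes (the field names `type`, `remoteURL`, `groupRef`, and `proxyConfig` are from this RFC; the `apiVersion` and the concrete values below are illustrative placeholders, and the authoritative schema lives in the RFC itself):

```yaml
# direct mode: no pod or Service; vMCP connects straight to the remote URL
apiVersion: toolhive.stacklok.dev/v1alpha1   # placeholder version
kind: MCPRemoteEndpoint
metadata:
  name: public-remote
spec:
  type: direct
  remoteURL: https://mcp.example.com/mcp
  transport: streamable-http
  groupRef: engineering-team
---
# proxy mode: deploys a proxy pod with auth middleware, authz policy, and audit logging
apiVersion: toolhive.stacklok.dev/v1alpha1   # placeholder version
kind: MCPRemoteEndpoint
metadata:
  name: internal-remote
spec:
  type: proxy
  remoteURL: https://mcp.internal.example.com/mcp
  proxyConfig:
    oidcConfig:          # required for type: proxy per the review discussion
      issuer: https://auth.example.com
```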

This supersedes THV-0055 (MCPServerEntry CRD).

Motivation

vMCP currently requires MCPRemoteProxy (which spawns thv-proxyrunner pods) to reach remote MCP servers. This creates three problems:

  1. Forced auth on public remotes: MCPRemoteProxy requires OIDC even when vMCP already handles client auth
  2. Dual boundary confusion: externalAuthConfigRef serves two conflicting purposes (vMCP-to-proxy AND proxy-to-remote)
  3. Resource waste: Every remote backend needs a Deployment + Service + Pod just to make an HTTP call

Design Highlights

  • Two explicit modes: type: proxy (full auth middleware, replaces MCPRemoteProxy) and type: direct (zero pods, zero services)
  • CEL validation: Mode-specific field constraints enforced at admission time (e.g., OIDC/telemetry config rejected for type: direct)
  • Validation-only controller for direct mode: No infrastructure created, no remote URL probing
  • Both discovery modes: Works with static (operator-generated ConfigMap) and dynamic (vMCP runtime discovery) from THV-0014
  • MCPRemoteProxy deprecated: Migrated to MCPRemoteEndpoint with type: proxy; MCPRemoteProxy receives no new features
  • Token exchange support: type: direct supports tokenExchange via externalAuthConfigRef for backends requiring OAuth credentials
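
The mode-specific constraints could be enforced with struct-level CEL rules along these lines (illustrative expressions in the generated CRD schema, not the RFC's exact markers):

```yaml
x-kubernetes-validations:
  # proxy-only configuration is rejected for direct endpoints
  - rule: "self.type != 'direct' || !has(self.proxyConfig)"
    message: "spec.proxyConfig is not allowed when spec.type is 'direct'"
  # proxy endpoints must carry their proxy configuration
  - rule: "self.type != 'proxy' || has(self.proxyConfig)"
    message: "spec.proxyConfig is required when spec.type is 'proxy'"
```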

Related RFCs

  • THV-0008 (Virtual MCP Server)
  • THV-0009 (Remote MCP Proxy)
  • THV-0010 (MCPGroup CRD)
  • THV-0014 (K8s-Aware vMCP)
  • THV-0023 (CRD v1alpha2 Optimization — config refs target MCPRemoteEndpoint)
  • THV-0026 (Header Passthrough)

Test plan

  • Review RFC structure and completeness
  • Validate CRD design against existing patterns
  • Verify mode-specific CEL validation rules
  • Verify security considerations (auth boundaries, token exchange restrictions)
  • Confirm MCPRemoteProxy migration path is documented

JAORMX and others added 4 commits March 12, 2026 09:03
Introduces a new MCPServerEntry CRD that lets VirtualMCPServer connect
directly to remote MCP servers without MCPRemoteProxy infrastructure,
resolving the forced-auth (#3104) and dual-boundary confusion (#4109)
issues.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RFC should focus on design intent, not implementation code.
Keep YAML/Mermaid examples, replace Go blocks with prose
describing controller behavior, discovery logic, and TLS
handling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implementation details like specific file paths belong in
the implementation, not the RFC design document.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yrobla yrobla left a comment

This is a well-structured RFC with clear motivation, thorough security considerations, and good alternatives analysis. The naming convention (following Istio ServiceEntry) is a nice touch. A few issues worth addressing before implementation:

  • groupRef type inconsistency (string vs object pattern) needs resolution
  • caBundleRef resource type (Secret vs ConfigMap) is ambiguous across sections
  • SSRF mitigation has a gap: HTTPS-only validation doesn't block private IP ranges
  • Auth resolution timing in ListWorkloadsInGroup() description is unclear
  • Static mode CA bundle propagation is unspecified
  • toolConfigRef deferral creates an unacknowledged migration regression for MCPRemoteProxy users

- Clarify groupRef is plain string for consistency with MCPServer/MCPRemoteProxy
- Fix Alt 1 YAML example to use string form for groupRef
- Change caBundleRef to reference ConfigMap (CA certs are public data)
- Add SSRF rationale: CEL IP blocking omitted since internal servers are legitimate
- Clarify auth resolution loads config only, token exchange deferred to request time
- Specify CA bundle volume mount for static mode (PEM files, not env vars)
- Document toolConfigRef migration path via aggregation.tools[].workload

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
yrobla
yrobla previously approved these changes Mar 12, 2026
@yrobla yrobla left a comment

Looks like a very good improvement to me.

amirejaz
amirejaz previously approved these changes Mar 12, 2026
@amirejaz amirejaz left a comment

This looks good to me. Added one comment.

metadata:
  name: context7
spec:
  remoteURL: https://mcp.context7.com/mcp

Nit: could we use a real public MCP server here, such as mcp-spec? context7 is currently being used as the example for both private and public MCP servers, which feels a bit confusing.

@amirejaz

Question: Would it make sense to rename MCPServerEntry to MCPRemoteServerEntry to improve clarity?

jerm-dro
jerm-dro previously approved these changes Mar 13, 2026
@jerm-dro jerm-dro left a comment

Looks great! I appreciate the alternatives section.

@lorr1

lorr1 commented Mar 17, 2026

We need this!! I love it.

# RFC-0055: MCPServerEntry CRD for Direct Remote MCP Server Backends


(naming): if it's for remote MCPs only, why not "MCPRemoteServer"? From the name "MCPServerEntry" alone it wasn't clear to me that it was remote-only.

@ChrisJBurns ChrisJBurns left a comment

Thanks for the detailed RFC — the three problems are real and worth solving, and the security thinking (no operator-side probing, HTTPS enforcement, credential separation) is solid throughout.

The core question this review raises is: does solving these three problems require a new CRD, or can they be addressed by extending what already exists?

After reading the RFC against the actual codebase, the case for a new CRD is weaker than it appears:

  • Problem #1 (forced OIDC on public remotes) can be fixed by making oidcConfig optional on MCPRemoteProxy — a one-field change (see comment 2).
  • Problem #2 (dual auth boundary confusion) turns out not to exist in the code — externalAuthConfigRef and oidcConfig already serve separate purposes in pkg/vmcp/workloads/k8s.go (see comment 1).
  • Problem #3 (pod waste) can be solved by adding a direct: true flag to MCPRemoteProxy that skips ensureDeployment() and ensureService() and uses remoteURL as the backend URL directly — touching ~4 files vs. the RFC's ~20 (see comments 3, 4).

The RFC does consider alternatives but doesn't seriously evaluate the most natural one: extending MCPRemoteProxy itself. Its alternatives section argues against extending MCPServer (fair) and against config-only approaches (fair), but never evaluates a lightweight mode on the resource that already exists for this exact purpose (comment 3).

Beyond the alternatives analysis, there are several concrete concerns: the operational cost of a new CRD is significantly underestimated (comment 4); the new CRD creates naming confusion alongside MCPServer and MCPRemoteProxy (comment 9); the MCPGroup status model would need restructuring for a third backend type (comment 5); the CA bundle volume-mounting introduces new operator complexity (comment 8); and the SSRF trade-off of moving outbound calls into the vMCP pod deserves explicit acknowledgement (comment 7).

Suggested path forward: Before investing in a new CRD, implement the simpler approach — make oidcConfig optional and add direct: true to MCPRemoteProxy. If real-world usage reveals that the RBAC separation story (platform teams vs. product teams managing backends independently) is a genuine need that can't be satisfied by MCPRemoteProxy's existing RBAC surface, revisit the CRD proposal with that as the primary motivation rather than the pod-waste argument.
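
For concreteness, that lighter-weight alternative could look something like this (the `direct` field is hypothetical and does not exist on MCPRemoteProxy today; `apiVersion` is a placeholder):

```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1   # placeholder version
kind: MCPRemoteProxy
metadata:
  name: public-remote
spec:
  remoteURL: https://mcp.example.com/mcp
  direct: true    # hypothetical flag: skip ensureDeployment()/ensureService()
                  # and use remoteURL as the backend URL directly
  # oidcConfig omitted: optional under this proposal
```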

If the team decides a new CRD is still the right call after considering the above, comments 5–12 cover the design-level issues that should be resolved before implementation begins.

🤖 Review assisted by Claude Code

@jhrozek

jhrozek commented Mar 17, 2026

+1 to what @ChrisJBurns said, I think we should consider reusing the existing CRDs and be more explicit in why it's not an option. There's also the cognitive load on the users, the documentation load etc..

@JAORMX

JAORMX commented Mar 18, 2026

> +1 to what @ChrisJBurns said, I think we should consider reusing the existing CRDs and be more explicit in why it's not an option. There's also the cognitive load on the users, the documentation load etc..

I actually disagree with this. I think re-using existing CRDs would do us a disservice and further increase technical debt by using APIs that are not fit for purpose, because they had very different intentions.

Signed-off-by: Chris Burns <29541485+ChrisJBurns@users.noreply.github.com>
Signed-off-by: Chris Burns <29541485+ChrisJBurns@users.noreply.github.com>
@ChrisJBurns ChrisJBurns dismissed stale reviews from jerm-dro, amirejaz, and yrobla via c2965b7 March 18, 2026 17:09
ChrisJBurns and others added 4 commits March 18, 2026 18:03
Address all review feedback from specialized agent review:

Critical fixes:
- Add 3 CEL validation rules (proxy requires proxyConfig, proxyConfig
  requires oidcConfig, direct rejects proxyConfig)
- Add spec.type immutability guard to prevent orphaned resources
- Add Phase 0.5 for MCPRemoteProxy controller refactoring prerequisite
- Document StaticBackendConfig Type field gap and KnownFields(true) risk
- Document full RFC 8693 token exchange flow with subject token provenance
- Document vMCP-to-proxy pod token forwarding with MCP auth spec rationale
- Add session constraints for direct mode (single-replica requirement)

Significant fixes:
- Fix Phase 0 deprecation mechanism (Warning events, not a deprecated API version marker)
- Make MCPGroup status field changes additive (not breaking rename)
- Document name collision handling and WorkloadType enum extension
- Document field indexer registration requirement
- Strengthen SSRF mitigation (NetworkPolicy REQUIRED for IMDS/cluster API)
- Add credential blast radius to threat model
- Mark SSE as deprecated per MCP spec 2025-11-25
- Add MCP-Protocol-Version header injection requirement
- Add reconnection handling section for direct mode
- Expand MCPGroup controller changes to 9 explicit code changes
- Fix Open Question 4 contradiction with groupRef: Required

Documentation and polish:
- Add mode selection guide, CRD short names, printer columns
- Add allow-insecure annotation specification
- Fix THV-0055 broken link, add spec.port to migration table
- Add actionable deprecation timeline, standalone use explanation
- Add RFC naming convention note

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix CEL validation rule placement (struct-level, not field-level)
- Add oldSelf null guard to type immutability rule
- Correct auth flow: accurately describe dual-consumer externalAuthConfigRef
  in proxy mode and single-boundary direct mode
- Remove incorrect Redis session store claim; single-replica is the only
  supported constraint for type:direct
- Fix reconnection to require full MCP initialization handshake on HTTP 404,
  with Last-Event-ID resumption attempt before re-init
- Add server-initiated notifications section (persistent GET stream)
- Restrict embeddedAuthServer and awsSts for type:direct
- Enumerate all MCPGroup controller changes including field indexer and RBAC markers
- Remove false audience validation claim; add as Phase 2 requirement
- Fix broken anchor cross-references
- Document HTTPS enforcement as controller-side only
- Specify allow-insecure annotation value
- Add inline warning to addPlaintextHeaders YAML example
- Add CA bundle MITM trust store warning
- Add emergency credential rotation guidance
- Remove remoteendpoint short name; keep mcpre only

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Remove the HTTPS enforcement requirement and the
toolhive.stacklok.dev/allow-insecure annotation section.
remoteURL is now simply "URL of the remote MCP server".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Chris Burns <29541485+ChrisJBurns@users.noreply.github.com>
@ChrisJBurns ChrisJBurns changed the title RFC: MCPServerEntry CRD for direct remote MCP server backends RFC-0057: MCPRemoteEndpoint CRD — Unified Remote MCP Server Connectivity Mar 19, 2026
ChrisJBurns and others added 2 commits March 19, 2026 16:06
Signed-off-by: Chris Burns <29541485+ChrisJBurns@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- **`type: proxy`** — deploys a proxy pod with full auth middleware, authz
policy, and audit logging. Functionally equivalent to `MCPRemoteProxy` and
replaces it.
- **`type: direct`** — no pod deployed; VirtualMCPServer connects directly to

this is not just a vMCP concern right? Any client would be able to use these?


Yeah, technically anything inside the cluster will be able to access that remote endpoint, provided it doesn't have any auth on it. Although that would be true regardless of whether we had the direct MCPRemoteEndpoint or not.

| Scenario | Recommended Mode | Why |
|---|---|---|
| Public, unauthenticated remote (e.g., context7) | `direct` | No auth middleware needed; no pod required |
| Remote requiring only token exchange auth | `direct` | vMCP handles token exchange; one fewer hop |

I think this is misleading. We shouldn't single out vMCP auth using token exchange for direct, but rather cover any vMCP deployment. It would be the same if vMCP provides a static token or drives an OAuth consent.


Good call - generalized this row to "Remote with outgoing auth handled by vMCP (token exchange, header injection, etc.)" so it covers all supported auth types, not just token exchange.


**Rule of thumb:** Use `direct` for simple, public, or token-exchange-only
remotes. Use `proxy` when you need an independent auth/authz/audit boundary
per remote, or when the backend needs to be accessible standalone.

Should we rather say: use direct for simple, public remotes or remotes fronted by vMCP; use proxy when you need authentication or authorization in standalone mode?


Agreed, reworded to: "Use direct for simple, public remotes or any remote fronted by vMCP where vMCP handles outgoing auth. Use proxy when you need an independent auth/authz/audit boundary per remote, or when the backend needs to be accessible standalone."

| Standalone use (without vMCP) | Yes | No |
| Outgoing auth to remote | Yes (`externalAuthConfigRef`) | Yes (`externalAuthConfigRef`) |
| Header forwarding | Yes (`headerForward`) | Yes (`headerForward`) |
| Custom CA bundle | Yes (`caBundleRef`) | Yes (`caBundleRef`) |

Curious - what do we need the custom bundle for in this mode?


Ah, private remote servers.

--[externalAuthConfigRef credential]--> Proxy Pod
[proxy pod oidcConfig validates the incoming request]
[proxy pod applies externalAuthConfigRef as outgoing middleware]
--> Remote Server

This renders oddly - mind asking Claude to make it into a mermaid diagram? (not urgent)


Replaced both ASCII auth flow diagrams with mermaid sequence diagrams.

server's expected audience claim.
- Client token lifetime should exceed the expected duration of the exchange
request. Exchanged tokens are managed by the `golang.org/x/oauth2` token
source and refreshed automatically on expiry per connection.

this is irrelevant to the RFC isn't it?


Agreed, removed the implementation detail about golang.org/x/oauth2 token source and client token lifetime.

transport: streamable-http

# REQUIRED: Group membership.
groupRef: engineering-team

required in vmcp only?

@ChrisJBurns ChrisJBurns Mar 20, 2026

Make groupRef required only for type: direct, optional for type: proxy - matches current MCPRemoteProxy behavior and the RFC's own Mode Selection Guide which lists "Standalone use without VirtualMCPServer → proxy"
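
A conditional requirement like that could be expressed with a single struct-level CEL rule (illustrative syntax):

```yaml
x-kubernetes-validations:
  - rule: "self.type != 'direct' || has(self.groupRef)"
    message: "spec.groupRef is required when spec.type is 'direct'"
```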

Comment on lines +573 to +574
ConfigMap generation time (static mode has no K8s API access at runtime)
- `vmcp.Backend` needs a `Headers` field for resolved header values

I'm not sure I understand this and want to make sure we are not leaking key material into ConfigMaps.

Right now we have headerForward.addHeadersFromSecret in the remote proxy, which always has a pod; the secrets are mounted as secretKeyRefs and runConfig only has env variable names.

If we had headerForward on the remote endpoint and a Headers field, wouldn't we be inlining the secrets?


Valid concern - we need to think through static-mode secret handling more carefully to make sure we're not inlining key material into ConfigMaps.


Alright - the original wording implied resolving secrets at ConfigMap generation time, which would inline key material into the ConfigMap. Updated the RFC to use the same SecretKeyRef pattern MCPRemoteProxy already uses: the operator adds SecretKeyRef env vars to the vMCP Deployment (e.g. TOOLHIVE_SECRET_HEADER_FORWARD_X_API_KEY_<ENDPOINT_NAME>), and the backend ConfigMap stores only the env var names, never the values. vMCP resolves them at runtime via the existing secrets.EnvironmentProvider. No secret material touches a ConfigMap.
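
Sketching that pattern (names are illustrative; the `fromEnv` key in particular is a placeholder for whatever field the RFC settles on): the operator injects a SecretKeyRef env var into the vMCP Deployment, and the backend ConfigMap records only the env var name:

```yaml
# On the vMCP Deployment: value resolved from the Secret at runtime
env:
  - name: TOOLHIVE_SECRET_HEADER_FORWARD_X_API_KEY_MYENDPOINT
    valueFrom:
      secretKeyRef:
        name: myendpoint-api-key
        key: token
---
# In the backend ConfigMap: only the env var name, never the value
headerForward:
  addHeaders:
    - name: X-API-Key
      fromEnv: TOOLHIVE_SECRET_HEADER_FORWARD_X_API_KEY_MYENDPOINT
```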

`pkg/vmcp/k8s/backend_reconciler.go` tries resources in order: MCPServer →
MCPRemoteProxy → MCPRemoteEndpoint. Same-name resources across types in the
same namespace always resolve to the first match. Log a warning when a
collision is detected.

Does this fallback mean that if I have an MCPServer named "blah" and I delete it, the next reconcile might accidentally resolve against MCPRemoteProxy "blah"?

The reason will be displayed to describe this comment to others. Learn more.

Good catch - the fallback behaviour is surprising. Changed the design to reject collisions at creation time: the MCPRemoteEndpoint controller will reject creation if an MCPServer or MCPRemoteProxy with the same name exists in the namespace, setting ConfigurationValid=False with reason NameCollision. The existing controllers will also be updated to reject collisions with MCPRemoteEndpoint.
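
On rejection, the endpoint's status would carry a condition along these lines (a sketch; the condition type and reason match the comment above, the message is illustrative):

```yaml
status:
  conditions:
    - type: ConfigurationValid
      status: "False"
      reason: NameCollision
      message: "an MCPServer named 'public-remote' already exists in this namespace"
```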


// MCPRemoteEndpointProxyConfig struct-level rule:
//
// +kubebuilder:validation:XValidation:rule="has(self.oidcConfig)",message="spec.proxyConfig.oidcConfig is required"

Why not just a standard Required validation marker?


Yep, good catch - no reason to use CEL here. Replaced with a standard +kubebuilder:validation:Required marker on the field.

ChrisJBurns and others added 2 commits March 20, 2026 15:31
- Generalize mode selection guide to cover all auth types, not just token exchange
- Reword rule of thumb to use "fronted by vMCP" vs "standalone" framing
- Replace ASCII auth flow diagrams with mermaid sequence diagrams
- Clarify dual-consumer auth behavior per mode
- Remove implementation-level token lifetime detail
- Add groupRef rationale inline comment
- Replace name collision fallback with admission-time rejection
- Replace CEL oidcConfig rule with standard kubebuilder Required marker

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use SecretKeyRef env vars on the vMCP Deployment instead of inlining
secret values into the backend ConfigMap. Mirrors the existing pattern
from MCPRemoteProxy. ConfigMap stores only env var names, vMCP resolves
values at runtime via secrets.EnvironmentProvider.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>