Wire SessionStorage into MCPRemoteProxy for HA support #5237

Open
aron-muon wants to merge 2 commits into stacklok:main from aron-muon:add-mcpremoteproxy-session-storage

Conversation

@aron-muon
Contributor

@aron-muon aron-muon commented May 9, 2026

Summary

When MCPRemoteProxy runs with multiple replicas behind a load balancer that doesn't preserve client-IP affinity (e.g. AWS ALB across multiple AZs), any non-initialize request routed to a replica other than the one that handled initialize fails with Session not found. The transparent proxy validates Mcp-Session-Id against pod-local in-memory state on every hop, and a comment in transparent_proxy.go describes the behavior a shared store is meant to provide:

// Guard: reject non-initialize requests with unknown session IDs.
// When multiple proxyrunner replicas share a Redis session store,
// a valid session will always be found.

The transport layer already supports a Redis-backed session store via runner.ScalingConfig.SessionRedis. MCPServer (#4368) and VirtualMCPServer (#4367) both wire it through. MCPRemoteProxy was the only one missing the symmetric work, leaving HA deployments architecturally broken.
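The failure mode described above can be reproduced with a minimal sketch. Everything here is hypothetical and simplified: `SessionStore`, `memoryStore`, `replica`, and `handle` are illustrative names, not ToolHive's actual types, and the session ID is passed in by the caller rather than issued by the server. The shared store is modeled with a single in-process map standing in for Redis.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// SessionStore is a hypothetical stand-in for the proxy's session storage.
type SessionStore interface {
	Put(id string)
	Has(id string) bool
}

// memoryStore keeps session IDs in process memory only.
type memoryStore struct {
	mu       sync.RWMutex
	sessions map[string]struct{}
}

func newMemoryStore() *memoryStore {
	return &memoryStore{sessions: make(map[string]struct{})}
}

func (s *memoryStore) Put(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessions[id] = struct{}{}
}

func (s *memoryStore) Has(id string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, ok := s.sessions[id]
	return ok
}

var errSessionNotFound = errors.New("Session not found")

// replica models one proxy pod with whatever store it was given.
type replica struct {
	store SessionStore
}

// handle models the guard: initialize registers the session,
// every other method must find an existing one.
func (r *replica) handle(method, sessionID string) error {
	if method == "initialize" {
		r.store.Put(sessionID)
		return nil
	}
	if !r.store.Has(sessionID) {
		return errSessionNotFound
	}
	return nil
}

func main() {
	// Pod-local stores: the second replica has never seen the session.
	a := &replica{store: newMemoryStore()}
	b := &replica{store: newMemoryStore()}
	_ = a.handle("initialize", "s1")
	fmt.Println("pod-local:", b.handle("tools/list", "s1"))

	// Shared store (a stand-in for Redis): either replica can validate.
	shared := newMemoryStore()
	c := &replica{store: shared}
	d := &replica{store: shared}
	_ = c.handle("initialize", "s1")
	fmt.Println("shared:", d.handle("tools/list", "s1"))
}
```

With pod-local stores the second replica rejects the request; with the shared store it accepts it, which is the behavior the quoted comment assumes.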

What changed:

  • Add MCPRemoteProxySpec.SessionStorage field — same SessionStorageConfig shape used by MCPServer and VirtualMCPServer.
  • populateScalingConfigForRemoteProxy — write the non-sensitive Redis parameters (address/db/keyPrefix) into runner.ScalingConfig.SessionRedis.
  • buildRedisPasswordEnvVarForRemoteProxy — inject THV_SESSION_REDIS_PASSWORD on the proxy Deployment via SecretKeyRef when sessionStorage.passwordRef is set, so the password never lands in the RunConfig ConfigMap. Mirrors VirtualMCPServer's buildRedisPasswordEnvVar exactly.
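The split between the two helpers could look roughly like the sketch below. It is not the PR's code: the types are simplified local stand-ins for the operator and Kubernetes types, the `Provider` field and its `"memory"` / `"redis"` values are assumptions inferred from the test matrix, and the helper names are shortened. Only the `THV_SESSION_REDIS_PASSWORD` variable name and the address/db/keyPrefix parameters come from the description above.

```go
package main

import "fmt"

// SecretKeyRef is a simplified stand-in for a Kubernetes secret key selector.
type SecretKeyRef struct {
	Name string
	Key  string
}

// SessionStorageConfig is a hypothetical approximation of the CRD field.
type SessionStorageConfig struct {
	Provider    string // assumed values: "memory" or "redis"
	Address     string
	DB          int
	KeyPrefix   string
	PasswordRef *SecretKeyRef
}

// SessionRedisConfig carries only non-sensitive connection parameters.
type SessionRedisConfig struct {
	Address   string
	DB        int
	KeyPrefix string
}

// ScalingConfig is a stand-in for the part of the run config that is
// serialized into the ConfigMap.
type ScalingConfig struct {
	SessionRedis *SessionRedisConfig
}

// EnvVar is a stand-in for a container env var sourced from a Secret.
type EnvVar struct {
	Name      string
	SecretRef *SecretKeyRef
}

const redisPasswordEnv = "THV_SESSION_REDIS_PASSWORD"

// populateScalingConfig copies address/db/keyPrefix and nothing else.
// It returns nil when Redis storage is not configured.
func populateScalingConfig(s *SessionStorageConfig) *ScalingConfig {
	if s == nil || s.Provider != "redis" {
		return nil
	}
	return &ScalingConfig{SessionRedis: &SessionRedisConfig{
		Address:   s.Address,
		DB:        s.DB,
		KeyPrefix: s.KeyPrefix,
	}}
}

// buildRedisPasswordEnvVar returns an env var that references the Secret,
// or nil when no password reference is set.
func buildRedisPasswordEnvVar(s *SessionStorageConfig) *EnvVar {
	if s == nil || s.Provider != "redis" || s.PasswordRef == nil {
		return nil
	}
	return &EnvVar{Name: redisPasswordEnv, SecretRef: s.PasswordRef}
}

func main() {
	storage := &SessionStorageConfig{
		Provider:    "redis",
		Address:     "redis.default.svc:6379", // illustrative address
		KeyPrefix:   "thv:",
		PasswordRef: &SecretKeyRef{Name: "redis-auth", Key: "password"},
	}
	scaling := populateScalingConfig(storage)
	env := buildRedisPasswordEnvVar(storage)
	fmt.Printf("config: %+v\n", *scaling.SessionRedis)
	fmt.Printf("env: %s from secret %s/%s\n", env.Name, env.SecretRef.Name, env.SecretRef.Key)
}
```

Keeping the password out of `SessionRedisConfig` entirely, rather than filtering it at serialization time, means there is no field that could be marshaled into the ConfigMap by mistake.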

The change is intentionally a near-verbatim mirror of the MCPServer / VirtualMCPServer implementations to keep review easy — it's the same pattern reviewers have already accepted twice.

Type of change

  • Bug fix
  • New feature

Test plan

  • Unit tests — go test ./cmd/thv-operator/controllers/... (4 + 4 = 8 new test cases pass; no regressions in existing MCPRemoteProxy suite)
  • Build verification — go build ./...
  • CRD regenerated with controller-gen v0.17.3

New tests:

  • TestPopulateScalingConfigForRemoteProxy mirrors TestPopulateScalingConfig from mcpserver_runconfig_test.go. 4 cases including a serialization check that verifies the password never leaks into the RunConfig.
  • TestBuildRedisPasswordEnvVarForRemoteProxy mirrors TestBuildRedisPasswordEnvVar from virtualmcpserver_deployment_test.go. 4 cases covering nil / memory / redis-no-pwd / redis-with-pwd.
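A serialization check of the kind mentioned in the first test might be sketched as follows. This is an assumption about the approach, not the PR's test code: `runConfig`, `sessionRedis`, the JSON field names, and `leaksPassword` are all hypothetical, and the check simply searches the marshaled JSON for the word "password".

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// sessionRedis is a hypothetical stand-in holding only non-sensitive fields.
type sessionRedis struct {
	Address   string `json:"address"`
	DB        int    `json:"db"`
	KeyPrefix string `json:"keyPrefix,omitempty"`
}

// runConfig is a hypothetical stand-in for what lands in the ConfigMap.
type runConfig struct {
	Name         string        `json:"name"`
	SessionRedis *sessionRedis `json:"sessionRedis,omitempty"`
}

// leaksPassword marshals v and reports whether the serialized form
// mentions "password" anywhere (case-insensitive).
func leaksPassword(v interface{}) (bool, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return false, err
	}
	return strings.Contains(strings.ToLower(string(raw)), "password"), nil
}

func main() {
	cfg := runConfig{
		Name: "remote-proxy",
		SessionRedis: &sessionRedis{
			Address:   "redis.default.svc:6379", // illustrative address
			KeyPrefix: "thv:",
		},
	}
	leaked, err := leaksPassword(cfg)
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println("password leaked into run config:", leaked)
}
```

A substring search is deliberately blunt: it also catches a password field added later under a different struct, at the cost of false positives if an unrelated value happens to contain the word.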

API Compatibility

  • This PR does not break the v1beta1 API. The added field (spec.sessionStorage) is optional and behaves identically to the existing nil case when omitted.

Changes

| File | Lines | Change |
|---|---|---|
| cmd/thv-operator/api/v1beta1/mcpremoteproxy_types.go | +16 | Add SessionStorage *SessionStorageConfig with kubebuilder docs explaining the transparent_proxy session-validation requirement |
| cmd/thv-operator/api/v1beta1/zz_generated.deepcopy.go | +5 | Generated DeepCopyInto for the new field (controller-gen v0.17.3) |
| cmd/thv-operator/controllers/mcpremoteproxy_runconfig.go | +28 | New populateScalingConfigForRemoteProxy helper, called before runConfig is returned |
| cmd/thv-operator/controllers/mcpremoteproxy_deployment.go | +30 | New buildRedisPasswordEnvVarForRemoteProxy + call site appending it to the proxy env |
| cmd/thv-operator/controllers/mcpremoteproxy_runconfig_test.go | +88 | TestPopulateScalingConfigForRemoteProxy (4 cases) |
| cmd/thv-operator/controllers/mcpremoteproxy_deployment_test.go | +63 | TestBuildRedisPasswordEnvVarForRemoteProxy (4 cases) |
| deploy/charts/operator-crds/files/crds/toolhive.stacklok.dev_mcpremoteproxies.yaml | +114 | Generated sessionStorage subschema in CRD |

Does this introduce a user-facing change?

Yes. New optional field MCPRemoteProxy.spec.sessionStorage enabling Redis-backed shared session state across replicas — required for HA when the upstream load balancer doesn't preserve client-IP affinity. Same shape as MCPServer.spec.sessionStorage so existing operators already know the pattern.

🤖 Generated with Claude Code

When MCPRemoteProxy runs with multiple replicas behind a load balancer
that doesn't preserve client-IP affinity (e.g. AWS ALB across multiple
AZs), every non-initialize request fails with `Session not found` because
the transparent proxy validates `Mcp-Session-Id` against pod-local
in-memory state on every hop. From transparent_proxy.go:

    // Guard: reject non-initialize requests with unknown session IDs.
    // When multiple proxyrunner replicas share a Redis session store,
    // a valid session will always be found.

The transport layer already supports a Redis-backed session store via
runner.ScalingConfig.SessionRedis — MCPServer and VirtualMCPServer wire
it through. MCPRemoteProxy simply never populated it.

This change ports the symmetric work from MCPServer (PR stacklok#4368) and
VirtualMCPServer (PR stacklok#4367) to MCPRemoteProxy:

- Add MCPRemoteProxySpec.SessionStorage field (same SessionStorageConfig
  shape used by MCPServer / VirtualMCPServer)
- populateScalingConfigForRemoteProxy: write the non-sensitive Redis
  parameters (address/db/keyPrefix) into runner.ScalingConfig.SessionRedis
- buildRedisPasswordEnvVarForRemoteProxy: inject THV_SESSION_REDIS_PASSWORD
  on the proxy Deployment via SecretKeyRef when sessionStorage.passwordRef
  is set, so the password never lands in the RunConfig ConfigMap

Tests:
- TestPopulateScalingConfigForRemoteProxy mirrors TestPopulateScalingConfig
  from mcpserver_runconfig_test.go (4 cases including a check that the
  password never leaks into the serialized SessionRedis)
- TestBuildRedisPasswordEnvVarForRemoteProxy mirrors TestBuildRedisPasswordEnvVar
  from virtualmcpserver_deployment_test.go (4 cases covering the matrix
  of nil/memory/redis-no-pwd/redis-with-pwd)

Generated:
- zz_generated.deepcopy.go (controller-gen v0.17.3)
- toolhive.stacklok.dev_mcpremoteproxies.yaml CRD schema (controller-gen v0.17.3)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the size/M Medium PR: 300-599 lines changed label May 9, 2026
@codecov

codecov Bot commented May 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.88%. Comparing base (9211a36) to head (c52a373).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5237      +/-   ##
==========================================
+ Coverage   67.86%   67.88%   +0.02%     
==========================================
  Files         610      610              
  Lines       62522    62550      +28     
==========================================
+ Hits        42431    42465      +34     
+ Misses      16910    16902       -8     
- Partials     3181     3183       +2     

@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 9, 2026
@aron-muon aron-muon marked this pull request as ready for review May 9, 2026 06:27