Skip to content

Commit 0cc5777

Browse files
committed
Patch MCPServer spec instead of Update
Fixes #4767. The controller writes finalizers, finalizer removal, and the restart-processed annotation via r.Update. Update is a full PUT, so any spec field the operator does not track — most importantly spec.authzConfig, which a separate authorization controller will soon own — is zeroed on every reconcile. Replace the three Update call sites with an optimistic-lock merge patch. The merge-patch body carries only fields the caller changed, so untouched fields never hit the wire and cannot be clobbered. MergeFromWithOptimisticLock sends resourceVersion as a precondition, giving 409-on-collision semantics for concurrent writers and defending metadata.finalizers (which has no array-merge semantics under merge-patch) against wholesale replacement when another controller is mid-flight adding its own entry. Tests: - Envtest suite writes spec.authzConfig out-of-band and asserts it survives both the finalizer-add reconcile and the restart-annotation reconcile. - Unit suite uses a patch-recording client to assert each migrated call site emits a body carrying the resourceVersion precondition — a deterministic wire-level signal that MergeFromWithOptimisticLock is in effect. A regression to plain MergeFrom would drop the precondition and fail the assertion independent of the higher-level survival test. Also: - .claude/rules/operator.md: new "Spec / metadata patching" section documenting the pattern for future CR writes. Status patching is a separate follow-up (#4633). - Rename the mock-client flag failOnMCPServerUpdate → failOnMCPServerWrite; it now intercepts both Update and Patch on MCPServer, so the name matches reality.
1 parent c3a640b commit 0cc5777

7 files changed

Lines changed: 464 additions & 25 deletions

File tree

.claude/rules/operator.md

Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -67,3 +67,43 @@ task operator-test # Run unit tests
6767
task operator-e2e-test # Run e2e tests
6868
task crdref-gen # Generate CRD API docs (run from cmd/thv-operator/)
6969
```
70+
71+
## Spec / metadata patching
72+
73+
Never use `r.Update` on a CR spec or metadata: `Update` is a full PUT,
74+
so any field our local copy does not track (e.g. `spec.authzConfig`
75+
written by a separate authorization controller) gets zeroed on every
76+
reconcile.
77+
78+
Use an **optimistic-lock merge patch** instead. The merge-patch body
79+
only contains fields the caller changed, and `MergeFromWithOptimisticLock`
80+
sends `resourceVersion` as a precondition: if the server moved between
81+
our Get and Patch, the apiserver returns 409 and controller-runtime
82+
requeues with a fresh Get.
83+
84+
This is what protects `metadata.finalizers`. Merge-patch has no
85+
array-append semantics — arrays are replaced wholesale — so when our
86+
diff includes `finalizers` (e.g. an `AddFinalizer` call) it must have
87+
been computed from an up-to-date snapshot. The 409 + requeue is what
88+
guarantees that: any concurrent finalizer added by another controller
89+
fails our precondition, and the next reconcile observes it via a fresh
90+
Get before recomputing the diff.
91+
92+
```go
93+
original := mcpServer.DeepCopy()
94+
controllerutil.AddFinalizer(mcpServer, MCPServerFinalizerName)
95+
if err := r.Patch(ctx, mcpServer,
96+
client.MergeFromWithOptions(original, client.MergeFromWithOptimisticLock{})); err != nil {
97+
return ctrl.Result{}, err
98+
}
99+
```
100+
101+
Do not put unrelated work between the `DeepCopy` and the `Patch`: the
102+
wire diff is computed from that snapshot, so anything you mutate in
103+
between leaks into the patch body.
104+
105+
Expect 409s as routine log noise once the external controller lands —
106+
the guard doing its job, not a bug.
107+
108+
Status-subresource patching follows the same "never `Update`" rule and
109+
is covered separately once the shared helper lands (#4633).

cmd/thv-operator/controllers/mcpexternalauthconfig_controller.go

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -177,14 +177,20 @@ func (r *MCPExternalAuthConfigReconciler) handleConfigHashChange(
177177
logger.Info("Triggering reconciliation of MCPServer due to MCPExternalAuthConfig change",
178178
"mcpserver", server.Name, "externalAuthConfig", externalAuthConfig.Name)
179179

180+
// Use an optimistic-lock merge patch rather than Update so we do not
181+
// silently overwrite spec fields (e.g. spec.authzConfig) that are owned
182+
// by another controller. The resourceVersion is sent, so concurrent
183+
// writers cause a 409 Conflict and a clean requeue.
184+
original := server.DeepCopy()
180185
// Add an annotation to the MCPServer to trigger reconciliation
181186
if server.Annotations == nil {
182187
server.Annotations = make(map[string]string)
183188
}
184189
server.Annotations["toolhive.stacklok.dev/externalauthconfig-hash"] = configHash
185190

186-
if err := r.Update(ctx, &server); err != nil {
187-
logger.Error(err, "Failed to update MCPServer annotation", "mcpserver", server.Name)
191+
if err := r.Patch(ctx, &server,
192+
client.MergeFromWithOptions(original, client.MergeFromWithOptimisticLock{})); err != nil {
193+
logger.Error(err, "Failed to patch MCPServer annotation", "mcpserver", server.Name)
188194
// Continue with other servers even if one fails
189195
}
190196
}

cmd/thv-operator/controllers/mcpserver_controller.go

Lines changed: 29 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -105,6 +105,9 @@ var remoteProxyRBACRules = []rbacv1.PolicyRule{
105105
// mcpContainerName is the name of the mcp container used in pod templates
106106
const mcpContainerName = "mcp"
107107

108+
// MCPServerFinalizerName is the name of the finalizer for MCPServer
109+
const MCPServerFinalizerName = "mcpserver.toolhive.stacklok.dev/finalizer"
110+
108111
// Restart annotation keys for triggering pod restart
109112
const (
110113
RestartedAtAnnotationKey = "mcpserver.toolhive.stacklok.dev/restarted-at"
@@ -182,25 +185,35 @@ func (r *MCPServerReconciler) Reconcile(ctx context.Context, req ctrl.Request) (
182185
// Check if the MCPServer instance is marked to be deleted — do this before
183186
// any validation or external API calls to avoid unnecessary work during deletion
184187
if mcpServer.GetDeletionTimestamp() != nil {
185-
if controllerutil.ContainsFinalizer(mcpServer, "mcpserver.toolhive.stacklok.dev/finalizer") {
188+
if controllerutil.ContainsFinalizer(mcpServer, MCPServerFinalizerName) {
186189
if err := r.finalizeMCPServer(ctx, mcpServer); err != nil {
187190
return ctrl.Result{}, err
188191
}
189192

190-
controllerutil.RemoveFinalizer(mcpServer, "mcpserver.toolhive.stacklok.dev/finalizer")
191-
err := r.Update(ctx, mcpServer)
192-
if err != nil {
193+
// Use an optimistic-lock merge patch rather than Update so we do not
194+
// silently overwrite spec fields (e.g. spec.authzConfig) that are owned
195+
// by another controller. The resourceVersion is sent, so concurrent
196+
// writers cause a 409 Conflict and a clean requeue.
197+
original := mcpServer.DeepCopy()
198+
controllerutil.RemoveFinalizer(mcpServer, MCPServerFinalizerName)
199+
if err := r.Patch(ctx, mcpServer,
200+
client.MergeFromWithOptions(original, client.MergeFromWithOptimisticLock{})); err != nil {
193201
return ctrl.Result{}, err
194202
}
195203
}
196204
return ctrl.Result{}, nil
197205
}
198206

199207
// Add finalizer for this CR
200-
if !controllerutil.ContainsFinalizer(mcpServer, "mcpserver.toolhive.stacklok.dev/finalizer") {
201-
controllerutil.AddFinalizer(mcpServer, "mcpserver.toolhive.stacklok.dev/finalizer")
202-
err = r.Update(ctx, mcpServer)
203-
if err != nil {
208+
if !controllerutil.ContainsFinalizer(mcpServer, MCPServerFinalizerName) {
209+
// Use an optimistic-lock merge patch rather than Update so we do not
210+
// silently overwrite spec fields (e.g. spec.authzConfig) that are owned
211+
// by another controller. The resourceVersion is sent, so concurrent
212+
// writers cause a 409 Conflict and a clean requeue.
213+
original := mcpServer.DeepCopy()
214+
controllerutil.AddFinalizer(mcpServer, MCPServerFinalizerName)
215+
if err := r.Patch(ctx, mcpServer,
216+
client.MergeFromWithOptions(original, client.MergeFromWithOptimisticLock{})); err != nil {
204217
return ctrl.Result{}, err
205218
}
206219
}
@@ -745,13 +758,18 @@ func (r *MCPServerReconciler) handleRestartAnnotation(ctx context.Context, mcpSe
745758
return false, fmt.Errorf("failed to perform restart: %w", err)
746759
}
747760

748-
// Update the last processed restart timestamp in annotations
761+
// Update the last processed restart timestamp in annotations.
762+
// Use an optimistic-lock merge patch rather than Update so we do not
763+
// silently overwrite spec fields (e.g. spec.authzConfig) that are owned
764+
// by another controller. The resourceVersion is sent, so concurrent
765+
// writers cause a 409 Conflict and a clean requeue.
766+
original := mcpServer.DeepCopy()
749767
if mcpServer.Annotations == nil {
750768
mcpServer.Annotations = make(map[string]string)
751769
}
752770
mcpServer.Annotations[LastProcessedRestartAnnotationKey] = currentRestartedAt
753-
err = r.Update(ctx, mcpServer)
754-
if err != nil {
771+
if err := r.Patch(ctx, mcpServer,
772+
client.MergeFromWithOptions(original, client.MergeFromWithOptimisticLock{})); err != nil {
755773
return false, fmt.Errorf("failed to update MCPServer with last processed restart annotation: %w", err)
756774
}
757775

cmd/thv-operator/controllers/mcpserver_restart_test.go

Lines changed: 27 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -564,10 +564,10 @@ func TestHandleRestartAnnotation_ErrorPaths(t *testing.T) {
564564
testCtx.createDeployment()
565565
testCtx.setRestartAnnotation("2023-01-01T12:00:00Z", "rolling")
566566

567-
// Mock a client that fails only on MCPServer Update operations
567+
// Mock a client that fails only on MCPServer write operations
568568
mockClient := &mockFailingClient{
569-
Client: testCtx.client,
570-
failOnMCPServerUpdate: true,
569+
Client: testCtx.client,
570+
failOnMCPServerWrite: true,
571571
}
572572
testCtx.reconciler.Client = mockClient
573573

@@ -698,14 +698,20 @@ func TestReconcile_HandleRestartAnnotation_ErrorPaths(t *testing.T) {
698698
})
699699
}
700700

701-
// mockFailingClient is a test helper that wraps a real client and can be configured to fail on specific operations
701+
// mockFailingClient is a test helper that wraps a real client and can be configured to fail on specific operations.
702+
//
703+
// failOnMCPServerWrite triggers a mock error on any write (Update or Patch)
704+
// whose target is a *mcpv1beta1.MCPServer. "Write" is used because the
705+
// #4767 migration replaced MCPServer spec Updates with optimistic-lock
706+
// Patches, so a single flag covers both code paths that can mutate the
707+
// resource.
702708
type mockFailingClient struct {
703709
client.Client
704-
failOnGet bool
705-
failOnList bool
706-
failOnUpdate bool
707-
failOnDelete bool
708-
failOnMCPServerUpdate bool
710+
failOnGet bool
711+
failOnList bool
712+
failOnUpdate bool
713+
failOnDelete bool
714+
failOnMCPServerWrite bool
709715
}
710716

711717
func (m *mockFailingClient) Get(ctx context.Context, key client.ObjectKey, obj client.Object, opts ...client.GetOption) error {
@@ -726,7 +732,7 @@ func (m *mockFailingClient) Update(ctx context.Context, obj client.Object, opts
726732
if m.failOnUpdate {
727733
return fmt.Errorf("mock error: Update operation failed")
728734
}
729-
if m.failOnMCPServerUpdate {
735+
if m.failOnMCPServerWrite {
730736
// Check if the object being updated is an MCPServer
731737
if _, isMCPServer := obj.(*mcpv1beta1.MCPServer); isMCPServer {
732738
return fmt.Errorf("mock error: MCPServer Update operation failed")
@@ -735,6 +741,17 @@ func (m *mockFailingClient) Update(ctx context.Context, obj client.Object, opts
735741
return m.Client.Update(ctx, obj, opts...)
736742
}
737743

744+
func (m *mockFailingClient) Patch(
745+
ctx context.Context, obj client.Object, patch client.Patch, opts ...client.PatchOption,
746+
) error {
747+
if m.failOnMCPServerWrite {
748+
if _, isMCPServer := obj.(*mcpv1beta1.MCPServer); isMCPServer {
749+
return fmt.Errorf("mock error: MCPServer Patch operation failed")
750+
}
751+
}
752+
return m.Client.Patch(ctx, obj, patch, opts...)
753+
}
754+
738755
func (m *mockFailingClient) Delete(ctx context.Context, obj client.Object, opts ...client.DeleteOption) error {
739756
if m.failOnDelete {
740757
return fmt.Errorf("mock error: Delete operation failed")
Lines changed: 184 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,184 @@
1+
// SPDX-FileCopyrightText: Copyright 2025 Stacklok, Inc.
2+
// SPDX-License-Identifier: Apache-2.0
3+
4+
package controllers
5+
6+
import (
7+
"context"
8+
"strings"
9+
"sync"
10+
"testing"
11+
12+
"github.com/stretchr/testify/assert"
13+
"github.com/stretchr/testify/require"
14+
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
15+
"k8s.io/apimachinery/pkg/types"
16+
ctrl "sigs.k8s.io/controller-runtime"
17+
"sigs.k8s.io/controller-runtime/pkg/client"
18+
"sigs.k8s.io/controller-runtime/pkg/client/fake"
19+
20+
mcpv1beta1 "github.com/stacklok/toolhive/cmd/thv-operator/api/v1beta1"
21+
"github.com/stacklok/toolhive/pkg/container/kubernetes"
22+
)
23+
24+
// patchRecordingClient wraps a client.Client and records the marshaled body
25+
// of every Patch call. Tests use it to assert the wire-level flavor of a
26+
// patch — in particular, an optimistic-lock merge patch stamps the
27+
// resourceVersion into the body, so its presence in the recorded body is a
28+
// deterministic signal that MergeFromWithOptimisticLock was in effect.
29+
//
30+
// Patches issued via .Status().Patch do not pass through this wrapper:
31+
// controller-runtime's subresource client is obtained from the embedded
32+
// client.Client and has its own Patch implementation, so the recorder only
33+
// observes spec/metadata patches on the root client.
34+
type patchRecordingClient struct {
35+
client.Client
36+
mu sync.Mutex
37+
patches []recordedPatch
38+
}
39+
40+
type recordedPatch struct {
41+
obj client.Object
42+
body string
43+
}
44+
45+
func (c *patchRecordingClient) Patch(
46+
ctx context.Context, obj client.Object, patch client.Patch, opts ...client.PatchOption,
47+
) error {
48+
// err ignored: patch.Data is json.Marshal of a typed MCPServer, which
49+
// has no channels/funcs/cyclic pointers and cannot fail in practice.
50+
// A failure here would also break the production controller's own
51+
// Patch call and fire other assertions before this one.
52+
if data, err := patch.Data(obj); err == nil {
53+
c.mu.Lock()
54+
c.patches = append(c.patches, recordedPatch{
55+
obj: obj.DeepCopyObject().(client.Object),
56+
body: string(data),
57+
})
58+
c.mu.Unlock()
59+
}
60+
return c.Client.Patch(ctx, obj, patch, opts...)
61+
}
62+
63+
// lastMCPServerPatchBody returns the body of the most recent recorded
64+
// Patch call whose target was an *mcpv1beta1.MCPServer. Returns empty
65+
// string if none was recorded.
66+
func (c *patchRecordingClient) lastMCPServerPatchBody() string {
67+
c.mu.Lock()
68+
defer c.mu.Unlock()
69+
for i := len(c.patches) - 1; i >= 0; i-- {
70+
if _, ok := c.patches[i].obj.(*mcpv1beta1.MCPServer); ok {
71+
return c.patches[i].body
72+
}
73+
}
74+
return ""
75+
}
76+
77+
// TestMCPServerSpecPatchesAreOptimisticLock asserts that each of the three
78+
// MCPServer spec Patch call sites introduced in #4767 emits a merge-patch
79+
// whose body carries the resourceVersion precondition. A regression from
80+
// client.MergeFromWithOptions(orig, client.MergeFromWithOptimisticLock{})
81+
// to plain client.MergeFrom(orig) would drop the precondition and fail
82+
// these assertions, independent of whether the higher-level field-
83+
// clobber survival test still passes.
84+
func TestMCPServerSpecPatchesAreOptimisticLock(t *testing.T) {
85+
t.Parallel()
86+
87+
const namespace = "default"
88+
89+
tests := []struct {
90+
name string
91+
// seed returns the MCPServer fixture placed in the fake client
92+
// before the action runs. Returning a distinct name per case
93+
// keeps parallel subtests from colliding on the shared fake.
94+
seed func() *mcpv1beta1.MCPServer
95+
// action triggers the reconcile path that should emit the
96+
// optimistic-lock Patch under test. It is invoked with a
97+
// recorder-backed reconciler.
98+
action func(t *testing.T, r *MCPServerReconciler, key types.NamespacedName)
99+
}{
100+
{
101+
name: "AddFinalizer",
102+
seed: func() *mcpv1beta1.MCPServer {
103+
s := createTestMCPServer("optlock-add", namespace)
104+
// No finalizer yet — Reconcile should add it.
105+
return s
106+
},
107+
action: func(t *testing.T, r *MCPServerReconciler, key types.NamespacedName) {
108+
t.Helper()
109+
_, _ = r.Reconcile(context.TODO(), ctrl.Request{NamespacedName: key})
110+
},
111+
},
112+
{
113+
name: "RemoveFinalizer",
114+
seed: func() *mcpv1beta1.MCPServer {
115+
s := createTestMCPServer("optlock-remove", namespace)
116+
s.Finalizers = []string{MCPServerFinalizerName}
117+
// DeletionTimestamp forces Reconcile into the
118+
// finalize branch. The fake client accepts an
119+
// already-set timestamp on created objects.
120+
now := metav1.Now()
121+
s.DeletionTimestamp = &now
122+
return s
123+
},
124+
action: func(t *testing.T, r *MCPServerReconciler, key types.NamespacedName) {
125+
t.Helper()
126+
_, _ = r.Reconcile(context.TODO(), ctrl.Request{NamespacedName: key})
127+
},
128+
},
129+
{
130+
name: "RestartAnnotation",
131+
seed: func() *mcpv1beta1.MCPServer {
132+
s := createTestMCPServer("optlock-restart", namespace)
133+
s.Finalizers = []string{MCPServerFinalizerName}
134+
if s.Annotations == nil {
135+
s.Annotations = map[string]string{}
136+
}
137+
s.Annotations[RestartedAtAnnotationKey] = "2026-01-01T00:00:00Z"
138+
s.Annotations[RestartStrategyAnnotationKey] = "immediate"
139+
return s
140+
},
141+
action: func(t *testing.T, r *MCPServerReconciler, key types.NamespacedName) {
142+
t.Helper()
143+
got := &mcpv1beta1.MCPServer{}
144+
require.NoError(t, r.Get(context.TODO(), key, got))
145+
// handleRestartAnnotation is the innermost
146+
// function that issues the Patch under test;
147+
// calling it directly avoids exercising the
148+
// rest of Reconcile, which would issue many
149+
// unrelated writes.
150+
_, _ = r.handleRestartAnnotation(context.TODO(), got)
151+
},
152+
},
153+
}
154+
155+
for _, tc := range tests {
156+
t.Run(tc.name, func(t *testing.T) {
157+
t.Parallel()
158+
159+
seeded := tc.seed()
160+
testScheme := createTestScheme()
161+
fakeClient := fake.NewClientBuilder().
162+
WithScheme(testScheme).
163+
WithObjects(seeded).
164+
WithStatusSubresource(&mcpv1beta1.MCPServer{}).
165+
Build()
166+
recorder := &patchRecordingClient{Client: fakeClient}
167+
reconciler := newTestMCPServerReconciler(
168+
recorder, testScheme, kubernetes.PlatformKubernetes)
169+
170+
tc.action(t, reconciler, types.NamespacedName{
171+
Name: seeded.Name,
172+
Namespace: namespace,
173+
})
174+
175+
body := recorder.lastMCPServerPatchBody()
176+
require.NotEmpty(t, body,
177+
"no MCPServer Patch was recorded; the reconcile path did not emit the expected write")
178+
assert.True(t,
179+
strings.Contains(body, `"resourceVersion"`),
180+
"MCPServer spec patch body did not include a resourceVersion precondition; "+
181+
"MergeFromWithOptimisticLock regression? body=%s", body)
182+
})
183+
}
184+
}

0 commit comments

Comments
 (0)