Skip to content

Commit fced0ad

Browse files
carlydfclaude
andauthored
Use ManagerIdentity API instead of LastModifierIdentity + ignore-last-modifier metadata hack (#220)
<!--- Note to EXTERNAL Contributors --> <!-- Thanks for opening a PR! If it is a significant code change, please **make sure there is an open issue** for this. We work best with you when we have accepted the idea first before you code. --> <!--- For ALL Contributors 👇 --> ## What was changed Replace the `LastModifierIdentity` + `temporal.io/ignore-last-modifier` metadata mechanism with the Temporal server's first-class `ManagerIdentity` API for Worker Deployment ownership. **Controller behavior:** - When the controller is about to make its first routing change to a Worker Deployment (i.e. `ManagerIdentity` is empty), it calls `SetManagerIdentity(Self=true)` to claim ownership before applying the change. The conflict token returned is threaded into the subsequent SetCurrentVersion / SetRampingVersion call. - If `ManagerIdentity` is already set to a different identity, the routing change is submitted as-is and the Temporal server rejects it. The controller returns an error and retries on backoff -> no special manual-mode logic needed. - The current `ManagerIdentity` value is surfaced in `TemporalWorkerDeploymentStatus.managerIdentity`. **Planner:** `ClaimManagerIdentity` bool is now a field on `planner.VersionConfig`, keeping the decision unit-testable alongside other plan decisions. **Removed:** the `LastModifierIdentity`-based forced-manual-strategy logic in `genplan.go` and the `IgnoreLastModifier/RemoveIgnoreLastModifierBuilds` plumbing. ## Why? The old mechanism was a controller-side workaround: it watched `LastModifierIdentity`, forced manual strategy when it saw a non-controller modifier, and required users to set a special metadata key to pass ownership back. This was fragile (race-prone, required reading the metadata contract) and blocked on version-level metadata support. `ManagerIdentity` is a server-enforced, deployment-level field designed exactly for this use case. It gives a clear, atomic protocol for ownership transfer without any special metadata conventions. ## Checklist <!--- add/delete as needed ---> 1. Closes #162 2. How was this tested: - New planner unit tests cover the `ClaimManagerIdentity` decision (claims when `ManagerIdentity` is empty, does not claim when already set by someone else). - Six integration tests updated: three `*-blocked-by-manager-identity` tests verify the controller cannot make routing changes when `ManagerIdentity` is held by an external identity; three `*-unblocked-after-manager-identity-cleared` tests verify the controller proceeds normally once `ManagerIdentity` is cleared. 3. Any docs updates needed? - Yes: `docs/ownership.md` has been rewritten to document the `ManagerIdentity` mechanism and the `temporal worker deployment manager-identity set/unset` CLI commands for taking and returning control. --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent e9773f5 commit fced0ad

15 files changed

Lines changed: 292 additions & 145 deletions

api/v1alpha1/worker_types.go

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -196,6 +196,12 @@ type TemporalWorkerDeploymentStatus struct {
196196
// +optional
197197
LastModifierIdentity string `json:"lastModifierIdentity,omitempty"`
198198

199+
// ManagerIdentity is the identity that has exclusive rights to modify this Worker Deployment's routing config.
200+
// When set, clients whose identity does not match will be blocked from making routing changes.
201+
// Empty by default. Use `temporal worker deployment manager-identity set/unset` to change.
202+
// +optional
203+
ManagerIdentity string `json:"managerIdentity,omitempty"`
204+
199205
// VersionCount is the total number of versions currently known by the worker deployment.
200206
// This includes current, target, ramping, and deprecated versions.
201207
// +optional

docs/ownership.md

Lines changed: 34 additions & 39 deletions
Original file line numberDiff line numberDiff line change
@@ -2,57 +2,52 @@
22

33
## Problem
44

5-
If a worker controller is managing a Worker Deployment (ie. the controller is updating the RoutingConfig of the Worker
6-
Deployment), but the user changes something via the CLI (ie. rolls back to the previous current version, or stops the
7-
new target version from ramping because an issue was detected), the controller should not clobber what the human did.
5+
If the worker controller is managing a Worker Deployment (i.e. updating its routing config), but a user makes a manual
6+
change via the CLI, SDK, or gRPC API instead of via the `TemporalWorkerDeployment` CRD interface, the controller should
7+
not clobber the user's change.
88

9-
At some point, after this human has handled their urgent rollback, they will want to let the controller know that it is
10-
authorized to resume making changes to the Routing Config of the Worker Deployment.
9+
Once the user has finished their manual intervention, they need a way to hand ownership back to the controller.
1110

1211
## Solution
1312

14-
_Once it is available in OSS v1.29, the controller will be able to coordinate with other users via the `ManagerIdentity`
15-
field of a Worker Deployment. This runbook will be updated when that is available and implemented by the controller._
13+
The controller uses the Temporal server's `ManagerIdentity` field on Worker Deployments to coordinate exclusive
14+
ownership of routing changes.
1615

17-
In the meantime, the controller will watch the `LastModifierIdentity` field of a Worker Deployment to detect whether
18-
another user has made a change. If another user made a change to the Worker Deployment, the controller will not make
19-
any more changes to ensure a human's change is not clobbered.
16+
When `ManagerIdentity` is set on a Worker Deployment, only clients whose identity matches `ManagerIdentity` can make
17+
routing changes (set current version, set ramping version). The controller's identity is visible in the
18+
`managerIdentity` field of the `TemporalWorkerDeployment` status.
2019

21-
Once you are done making your own changes to the Worker Deployment's current and ramping versions, and you are ready
22-
for the Worker Controller to take over, you can update the metadata to indicate that.
20+
### How the controller claims ownership
2321

24-
There is no Temporal server support for Worker Deployment Version-level metadata, so you'll have to set this value on
25-
the Current Version of your Worker Deployment.
22+
The first time the controller plans a routing change for a Worker Deployment (i.e. when `ManagerIdentity` is empty),
23+
it calls `SetManagerIdentity` to claim ownership before applying the change. Subsequent routing changes succeed because
24+
the controller's identity already matches `ManagerIdentity`.
2625

27-
Note: The controller decodes this metadata value as a string. Be sure to set the value to the string "true" (not the boolean true).
26+
### Taking manual control
27+
28+
To take manual control away from the controller, set `ManagerIdentity` to your own identity:
2829

2930
```bash
30-
temporal worker deployment update-metadata-version \
31-
--deployment-name $MY_DEPLOYMENT \
32-
--build-id $CURRENT_VERSION_BUILD_ID \
33-
--metadata 'temporal.io/ignore-last-modifier="true"'
34-
```
35-
Alternatively, if your CLI supports JSON input:
36-
```bash
37-
temporal worker deployment update-metadata-version \
38-
--deployment-name $MY_DEPLOYMENT \
39-
--build-id $CURRENT_VERSION_BUILD_ID \
40-
--metadata-json '{"temporal.io/ignore-last-modifier":"true"}'
41-
```
42-
In the rare case that you have a nil Current Version when you are passing back ownership, you should set it on your Ramping Version
43-
```bash
44-
temporal worker deployment update-metadata-version \
31+
temporal worker deployment manager-identity set \
4532
--deployment-name $MY_DEPLOYMENT \
46-
--build-id $RAMPING_VERSION_BUILD_ID \
47-
--metadata 'temporal.io/ignore-last-modifier="true"'
33+
--self
4834
```
49-
Or with JSON:
35+
36+
The `--self` flag sets `ManagerIdentity` to the identity of the caller (auto-generated by the CLI if not explicitly
37+
provided via `--identity`; similarly, the SDK uses its own auto-generated or configured identity). After this, the
38+
controller's routing change attempts will fail and it will retry on a backoff until ownership is returned.
39+
40+
You can then make routing changes freely (the server enforces `ManagerIdentity` for all clients, not just the
41+
controller).
42+
43+
### Returning ownership to the controller
44+
45+
When you are done with your manual changes and want the controller to resume, clear `ManagerIdentity`:
46+
5047
```bash
51-
temporal worker deployment update-metadata-version \
52-
--deployment-name $MY_DEPLOYMENT \
53-
--build-id $RAMPING_VERSION_BUILD_ID \
54-
--metadata-json '{"temporal.io/ignore-last-modifier":"true"}'
48+
temporal worker deployment manager-identity unset \
49+
--deployment-name $MY_DEPLOYMENT
5550
```
5651

57-
In the even rarer case that you have nil Current Version and nil Ramping Version, you'll need to use the CLI or SDK to
58-
set a Current or Ramping Version and then do as instructed above.
52+
On the next reconcile, the controller will detect that `ManagerIdentity` is empty, claim it for itself, and resume
53+
managing routing changes.

internal/controller/clientpool/clientpool.go

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -97,13 +97,15 @@ type NewClientOptions struct {
9797
TemporalNamespace string
9898
K8sNamespace string
9999
Spec v1alpha1.TemporalConnectionSpec
100+
Identity string
100101
}
101102

102103
func (cp *ClientPool) fetchClientUsingMTLSSecret(secret corev1.Secret, opts NewClientOptions) (*sdkclient.Options, *ClientPoolKey, *ClientAuth, error) {
103104
clientOpts := sdkclient.Options{
104105
Logger: cp.logger,
105106
HostPort: opts.Spec.HostPort,
106107
Namespace: opts.TemporalNamespace,
108+
Identity: opts.Identity,
107109
}
108110

109111
var pemCert []byte
@@ -170,6 +172,7 @@ func (cp *ClientPool) fetchClientUsingAPIKeySecret(secret corev1.Secret, opts Ne
170172
Logger: cp.logger,
171173
HostPort: opts.Spec.HostPort,
172174
Namespace: opts.TemporalNamespace,
175+
Identity: opts.Identity,
173176
ConnectionOptions: sdkclient.ConnectionOptions{
174177
TLS: &tls.Config{},
175178
},
@@ -198,6 +201,7 @@ func (cp *ClientPool) fetchClientUsingNoCredentials(opts NewClientOptions) (*sdk
198201
Logger: cp.logger,
199202
HostPort: opts.Spec.HostPort,
200203
Namespace: opts.TemporalNamespace,
204+
Identity: opts.Identity,
201205
}
202206

203207
key := ClientPoolKey{

internal/controller/execplan.go

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@ import (
1212

1313
"github.com/go-logr/logr"
1414
temporaliov1alpha1 "github.com/temporalio/temporal-worker-controller/api/v1alpha1"
15+
"github.com/temporalio/temporal-worker-controller/internal/planner"
1516
"github.com/temporalio/temporal-worker-controller/internal/temporal"
1617
enumspb "go.temporal.io/api/enums/v1"
1718
sdkclient "go.temporal.io/sdk/client"
@@ -150,12 +151,46 @@ func (r *TemporalWorkerDeploymentReconciler) startTestWorkflows(ctx context.Cont
150151
return nil
151152
}
152153

154+
func (r *TemporalWorkerDeploymentReconciler) shouldClaimManagerIdentity(vcfg *planner.VersionConfig) bool {
155+
return vcfg.ManagerIdentity == ""
156+
}
157+
158+
func (r *TemporalWorkerDeploymentReconciler) claimManagerIdentity(
159+
ctx context.Context,
160+
l logr.Logger,
161+
workerDeploy *temporaliov1alpha1.TemporalWorkerDeployment,
162+
deploymentHandler sdkclient.WorkerDeploymentHandle,
163+
vcfg *planner.VersionConfig,
164+
) error {
165+
resp, err := deploymentHandler.SetManagerIdentity(ctx, sdkclient.WorkerDeploymentSetManagerIdentityOptions{
166+
Self: true,
167+
ConflictToken: vcfg.ConflictToken,
168+
Identity: getControllerIdentity(),
169+
})
170+
if err != nil {
171+
l.Error(err, "unable to claim manager identity")
172+
r.Recorder.Eventf(workerDeploy, corev1.EventTypeWarning, ReasonManagerIdentityClaimFailed,
173+
"Failed to claim manager identity: %v", err)
174+
return err
175+
}
176+
l.Info("claimed manager identity", "identity", getControllerIdentity())
177+
// Use the updated conflict token for the subsequent routing config change.
178+
vcfg.ConflictToken = resp.ConflictToken
179+
return nil
180+
}
181+
153182
func (r *TemporalWorkerDeploymentReconciler) updateVersionConfig(ctx context.Context, l logr.Logger, workerDeploy *temporaliov1alpha1.TemporalWorkerDeployment, deploymentHandler sdkclient.WorkerDeploymentHandle, p *plan) error {
154183
vcfg := p.UpdateVersionConfig
155184
if vcfg == nil {
156185
return nil
157186
}
158187

188+
if r.shouldClaimManagerIdentity(vcfg) {
189+
if err := r.claimManagerIdentity(ctx, l, workerDeploy, deploymentHandler, vcfg); err != nil {
190+
return fmt.Errorf("unable to claim manager identity: %w", err)
191+
}
192+
}
193+
159194
if vcfg.SetCurrent {
160195
l.Info("registering new current version", "buildID", vcfg.BuildID)
161196
if _, err := deploymentHandler.SetCurrentVersion(ctx, sdkclient.WorkerDeploymentSetCurrentVersionOptions{

internal/controller/genplan.go

Lines changed: 0 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -80,15 +80,7 @@ func (r *TemporalWorkerDeploymentReconciler) generatePlan(
8080
ScaleDeployments: make(map[*corev1.ObjectReference]uint32),
8181
}
8282

83-
// Check if we need to force manual strategy due to external modification
8483
rolloutStrategy := w.Spec.RolloutStrategy
85-
if w.Status.LastModifierIdentity != getControllerIdentity() &&
86-
w.Status.LastModifierIdentity != serverDeleteVersionIdentity &&
87-
w.Status.LastModifierIdentity != "" &&
88-
!temporalState.IgnoreLastModifier {
89-
l.Info(fmt.Sprintf("Forcing Manual rollout strategy since Worker Deployment was modified by a user with a different identity '%s'; to allow controller to make changes again, set 'temporal.io/ignore-last-modifier=true' in the metadata of your Current or Ramping Version; see ownership runbook at docs/ownership.md for more details.", w.Status.LastModifierIdentity))
90-
rolloutStrategy.Strategy = temporaliov1alpha1.UpdateManual
91-
}
9284

9385
// Resolve gate input if gate is configured
9486
var gateInput []byte

internal/controller/reconciler_events_test.go

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -627,20 +627,20 @@ func TestUpdateVersionConfig_EmitsEventOnFailure(t *testing.T) {
627627
{
628628
name: "SetCurrentFailed",
629629
handle: &stubWDHandle{setCurrentErr: errors.New("simulated SetCurrentVersion failure")},
630-
config: &planner.VersionConfig{BuildID: "build-abc", SetCurrent: true},
630+
config: &planner.VersionConfig{BuildID: "build-abc", SetCurrent: true, ManagerIdentity: "some-manager"},
631631
expectedReason: ReasonVersionPromotionFailed,
632632
},
633633
{
634634
name: "SetRampingFailed",
635635
handle: &stubWDHandle{setRampingErr: errors.New("simulated SetRampingVersion failure")},
636-
config: &planner.VersionConfig{BuildID: "build-abc", RampPercentage: 25},
636+
config: &planner.VersionConfig{BuildID: "build-abc", RampPercentage: 25, ManagerIdentity: "some-manager"},
637637
expectedReason: ReasonVersionPromotionFailed,
638638
},
639639
{
640640
// SetCurrentVersion succeeds; UpdateVersionMetadata fails.
641641
name: "MetadataUpdateFailed",
642642
handle: &stubWDHandle{updateMetaErr: errors.New("simulated UpdateVersionMetadata failure")},
643-
config: &planner.VersionConfig{BuildID: "build-abc", SetCurrent: true},
643+
config: &planner.VersionConfig{BuildID: "build-abc", SetCurrent: true, ManagerIdentity: "some-manager"},
644644
expectedReason: ReasonMetadataUpdateFailed,
645645
},
646646
}

internal/controller/state_mapper.go

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,7 @@ func (m *stateMapper) mapToStatus(targetBuildID string) *v1alpha1.TemporalWorker
3434
}
3535

3636
status.LastModifierIdentity = m.temporalState.LastModifierIdentity
37+
status.ManagerIdentity = m.temporalState.ManagerIdentity
3738

3839
// Get build IDs directly from temporal state
3940
currentBuildID := m.temporalState.CurrentBuildID

internal/controller/util.go

Lines changed: 10 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -18,15 +18,16 @@ import (
1818
// status API and may change between releases. Do not write alerting or automation
1919
// that depends on these strings.
2020
const (
21-
ReasonPlanGenerationFailed = "PlanGenerationFailed"
22-
ReasonPlanExecutionFailed = "PlanExecutionFailed"
23-
ReasonDeploymentCreateFailed = "DeploymentCreateFailed"
24-
ReasonDeploymentDeleteFailed = "DeploymentDeleteFailed"
25-
ReasonDeploymentScaleFailed = "DeploymentScaleFailed"
26-
ReasonDeploymentUpdateFailed = "DeploymentUpdateFailed"
27-
ReasonTestWorkflowStartFailed = "TestWorkflowStartFailed"
28-
ReasonVersionPromotionFailed = "VersionPromotionFailed"
29-
ReasonMetadataUpdateFailed = "MetadataUpdateFailed"
21+
ReasonPlanGenerationFailed = "PlanGenerationFailed"
22+
ReasonPlanExecutionFailed = "PlanExecutionFailed"
23+
ReasonDeploymentCreateFailed = "DeploymentCreateFailed"
24+
ReasonDeploymentDeleteFailed = "DeploymentDeleteFailed"
25+
ReasonDeploymentScaleFailed = "DeploymentScaleFailed"
26+
ReasonDeploymentUpdateFailed = "DeploymentUpdateFailed"
27+
ReasonTestWorkflowStartFailed = "TestWorkflowStartFailed"
28+
ReasonVersionPromotionFailed = "VersionPromotionFailed"
29+
ReasonMetadataUpdateFailed = "MetadataUpdateFailed"
30+
ReasonManagerIdentityClaimFailed = "ManagerIdentityClaimFailed"
3031
)
3132

3233
const (

internal/controller/worker_controller.go

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -184,6 +184,7 @@ func (r *TemporalWorkerDeploymentReconciler) Reconcile(ctx context.Context, req
184184
K8sNamespace: workerDeploy.Namespace,
185185
TemporalNamespace: workerDeploy.Spec.WorkerOptions.TemporalNamespace,
186186
Spec: temporalConnection.Spec,
187+
Identity: getControllerIdentity(),
187188
})
188189
if err != nil {
189190
l.Error(err, "invalid Temporal auth secret")

internal/planner/planner.go

Lines changed: 13 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -27,6 +27,7 @@ type Plan struct {
2727
ShouldCreateDeployment bool
2828
VersionConfig *VersionConfig
2929
TestWorkflows []WorkflowConfig
30+
3031
// Build IDs of versions from which the controller should
3132
// remove IgnoreLastModifierKey from the version metadata
3233
RemoveIgnoreLastModifierBuilds []string
@@ -45,6 +46,11 @@ type VersionConfig struct {
4546
SetCurrent bool
4647
// Acceptable values [0,100]
4748
RampPercentage int32
49+
50+
// ManagerIdentity is the current manager identity of the worker deployment in Temporal.
51+
// An empty string indicates the controller should claim the identity before applying
52+
// any routing config changes.
53+
ManagerIdentity string
4854
}
4955

5056
// WorkflowConfig defines a workflow to be started
@@ -546,9 +552,14 @@ func getVersionConfigDiff(
546552
}
547553
}
548554

555+
managerIdentity := ""
556+
if temporalState != nil {
557+
managerIdentity = temporalState.ManagerIdentity
558+
}
549559
vcfg := &VersionConfig{
550-
ConflictToken: conflictToken,
551-
BuildID: status.TargetVersion.BuildID,
560+
ConflictToken: conflictToken,
561+
BuildID: status.TargetVersion.BuildID,
562+
ManagerIdentity: managerIdentity,
552563
}
553564

554565
// If there is no current version and presence of unversioned pollers is not confirmed for all

0 commit comments

Comments
 (0)