fix: clean up Temporal server-side versioning data on TWD deletion#240
Conversation
|
PTAL @carlydf |
jaypipes
left a comment
There was a problem hiding this comment.
@anujagrawal380 awesome contribution, thank you so much for this PR! I couple really minor comments below, but overall excellent work.
Thanks, resolved both the comments! |
jaypipes
left a comment
There was a problem hiding this comment.
rock on :) nice work on this @anujagrawal380!
carlydf
left a comment
There was a problem hiding this comment.
we need integration tests for this before merging to main / including it in a release
@carlydf Added the integration tests. PTAL |
|
Hi @anujagrawal380 , could you fix the linters! Would love to include this in our next release |
d6a305c to
9fd0c74
Compare
@carlydf @jaypipes Added few more minor improvements here: 9fd0c74 . PTAL! |
|
@anujagrawal380 we hope to include this change in our release early next week, without the timeout addition. Hopefully you can respond to the feedback when you get back on Monday! To allow the tests to pass in a reasonable amount of time, you can override the pollerTTL to make pollers "die" faster on the server side. If we don't get a response by Tuesday, we will pick your commits into a separate PR (so you will still be attributed), remove the timeout, make the tests pass and merge it. Hopefully we can come to an agreement about the timeout behavior though! |
…blocking unversioned workers Signed-off-by: Anuj Agrawal <anujagrawal380@gmail.com>
Signed-off-by: Anuj Agrawal <anujagrawal380@gmail.com>
Signed-off-by: Anuj Agrawal <anujagrawal380@gmail.com>
Signed-off-by: Anuj Agrawal <anujagrawal380@gmail.com>
Signed-off-by: Anuj Agrawal <anujagrawal380@gmail.com>
…ion finalizer - Add 5-minute deletionCleanupTimeout to prevent TWD stuck in Terminating state indefinitely if Temporal server is unavailable - Return errors from version/deployment deletion to trigger requeue until versions actually clear (pollers disappear as pods terminate) - Add update/patch verbs and finalizers RBAC marker for TemporalConnections - Fix comment-spacing lint on new kubebuilder:rbac markers
9fd0c74 to
04b61a2
Compare
Thanks @carlydf @jaypipes for the thorough review. Made the changes as suggested. Please let me know if any more changes are required! |
- Drop l.Error before returning err from removeConnectionFinalizerIfUnused; controller-runtime logs returned errors automatically - Before deleting the TWD in the deletion test, update to v2.0 image, start v2.0 workers, and set v2.0 as the ramping version at 50% — exercises the clear-ramping-version path in handleDeletion and verifies the ramping version is nil after cleanup
…eletion Per Temporal server behavior, the ramping version must be cleared before setting current to unversioned to avoid a window where traffic is split between unversioned workers and the still-active ramping version.
…Token SetRampingVersion mutates the deployment, making the ConflictToken from the initial Describe stale. Re-describe after Step 1 so SetCurrentVersion in Step 2 uses a valid token instead of failing on the first reconcile.
jaypipes
left a comment
There was a problem hiding this comment.
excellent work on this @anujagrawal380 thank you so much! :)
The RBAC yaml must be generated from kubebuilder annotations, not edited manually. Running make manifests consolidates the new temporalconnections and temporalconnections/finalizers rules with existing entries.
When AllAtOnce strategy promotes v2.0 to current before the test sets it as ramping, revert to v1.0 as current first so SetRampingVersion succeeds. Also regenerate CRD YAML with new k8s API fields picked up by make manifests.
|
@carlydf Can we please rerun tests again here? |
|
@carlydf Sorry, lint failed. Please again! |
@anujagrawal380, this test failed in https://github.com/temporalio/temporal-worker-controller/actions/runs/25011104275/job/73248144790. I'm re-running to see if it was just flaky and maybe will pass the second time. At the same time, ideally we would not have a flaky test since that will cause issues for future PRs. Could you take a look at the failure please? |
…tion test The AllAtOnce reconciler sets ManagerIdentity on the Temporal deployment, blocking other identities from calling SetCurrentVersion. Use a dedicated SDK client with Identity: "temporal-worker-controller" for the revert call, and wrap the entire setup in an eventually loop so the race is handled robustly rather than failing on the first conflict.
@carlydf Added a commit, please rerun once more! |
With AllAtOnce the controller continuously re-promotes v2.0 to current, fighting any attempt by the test to set v2.0 as ramping. Switch to Manual strategy and use the existing setCurrentVersion / setRampingVersion helpers (which use defaults.ControllerIdentity) so the test drives versioning state explicitly and handleDeletion's later cleanup calls match the ManagerIdentity.
@anujagrawal380 it failed again. Have not looked at why. Is this passing locally? To run locally you can do: but replace the |
TemporalWorkerDeploymentto run Temporal server-side cleanup before K8s deletionTemporalConnectionto prevent it from being deleted while any TWD still references itProblem
When a
TemporalWorkerDeploymentCRD is deleted (e.g., switching back to plain Deployments), the Temporal server retains the build ID routing configuration. The matching service continues routing new tasks to the deleted build ID's physical queue, while unversioned workers poll a different physical queue. Tasks sit inScheduledstate indefinitely with no errors.A secondary race condition exists: Helm deletes both the
TemporalConnectionandTWDin the same upgrade. Without the connection, the controller cannot talk to Temporal to clean up. This is solved by adding a finalizer to theTemporalConnectionthat blocks its deletion until all referencing TWDs are gone.Changes
internal/controller/worker_controller.go:TWD finalizer (
temporal.io/worker-deployment-cleanup):handleDeletion()which:BuildID: "") -- the critical step that unblocks task dispatchSkipDrainage: trueTemporalConnection finalizer (
temporal.io/connection-in-use):TemporalConnectionduring normal TWD reconciliation viaensureConnectionFinalizer()removeConnectionFinalizerIfUnused()during TWD deletion, after checking no other TWDs in the same namespace reference the connectionRBAC updates:
update;patchverbs fortemporalconnections(wasget;list;watch)updateverb fortemporalconnections/finalizersDeletion flow
Issue #55
Closes #166