
fix: clean up Temporal server-side versioning data on TWD deletion #240

Merged
carlydf merged 16 commits into temporalio:main from anujagrawal380:fix/twd-leaves-stale-versioning-data
Apr 27, 2026

@anujagrawal380
Contributor

@anujagrawal380 anujagrawal380 commented Mar 24, 2026

  • Add a finalizer to TemporalWorkerDeployment to run Temporal server-side cleanup before K8s deletion
  • Add a finalizer to TemporalConnection to prevent it from being deleted while any TWD still references it
  • On TWD deletion, set current version to unversioned, clear ramping version, and delete registered versions

Problem

When a TemporalWorkerDeployment (TWD) custom resource is deleted (e.g., when switching back to plain Deployments), the Temporal server retains the build ID routing configuration. The matching service continues routing new tasks to the deleted build ID's physical queue, while unversioned workers poll a different physical queue. Tasks sit in the Scheduled state indefinitely, and no errors are reported.
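
To make the failure mode concrete, here is a toy model of the routing described above. It is only a sketch: the `matching` type, its fields, and the `orders` task queue are invented for illustration and do not reflect how the Temporal matching service is implemented. The one idea it borrows from the description is that each (task queue, build ID) pair has its own physical queue, with the empty build ID meaning unversioned.

```go
package main

import "fmt"

// queueKey identifies a physical queue in this toy model: one per
// (task queue, build ID) pair. An empty build ID means "unversioned".
type queueKey struct {
	taskQueue string
	buildID   string
}

// matching is a deliberately simplified, hypothetical stand-in for the
// matching service.
type matching struct {
	currentBuildID map[string]string     // task queue -> current build ID
	backlog        map[queueKey][]string // physical queue -> pending task IDs
}

func newMatching() *matching {
	return &matching{
		currentBuildID: map[string]string{},
		backlog:        map[queueKey][]string{},
	}
}

// addTask routes a new task to the physical queue of the current build ID.
func (m *matching) addTask(taskQueue, taskID string) {
	key := queueKey{taskQueue: taskQueue, buildID: m.currentBuildID[taskQueue]}
	m.backlog[key] = append(m.backlog[key], taskID)
}

// poll removes and returns the next task visible to a worker that polls
// with the given build ID.
func (m *matching) poll(taskQueue, buildID string) (string, bool) {
	key := queueKey{taskQueue: taskQueue, buildID: buildID}
	tasks := m.backlog[key]
	if len(tasks) == 0 {
		return "", false
	}
	m.backlog[key] = tasks[1:]
	return tasks[0], true
}

func main() {
	m := newMatching()
	m.currentBuildID["orders"] = "v1.0" // stale routing left behind after TWD deletion

	m.addTask("orders", "task-1")
	_, ok := m.poll("orders", "") // unversioned worker polls a different physical queue
	fmt.Println("unversioned worker got a task:", ok)

	m.currentBuildID["orders"] = "" // cleanup: current version set to unversioned
	m.addTask("orders", "task-2")
	id, ok := m.poll("orders", "")
	fmt.Println("after cleanup:", id, ok)
}
```

In this model the first poll returns nothing even though a task is pending, which mirrors the "Scheduled with no errors" symptom; new tasks become visible to unversioned workers only after the current version is reset.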

A secondary race condition exists: Helm deletes both the TemporalConnection and TWD in the same upgrade. Without the connection, the controller cannot talk to Temporal to clean up. This is solved by adding a finalizer to the TemporalConnection that blocks its deletion until all referencing TWDs are gone.

Changes

internal/controller/worker_controller.go:

TWD finalizer (temporal.io/worker-deployment-cleanup):

  • Added to all TWD resources during normal reconciliation
  • On deletion, triggers handleDeletion() which:
    1. Sets the current version to unversioned (BuildID: "") -- the critical step that unblocks task dispatch
    2. Clears any ramping version
    3. Deletes all registered versions with SkipDrainage: true
    4. Attempts to delete the deployment record itself
    5. Removes the connection finalizer if no other TWDs reference it
    6. Removes its own finalizer, allowing K8s to complete deletion
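
The server-side part of these steps can be sketched as below. This is an illustration under assumptions, not the PR's code: `deploymentClient` is a hypothetical interface that narrows the Temporal calls down to what the list mentions, and `fakeClient` exists only so the sketch runs on its own. The sketch clears the ramping version before setting current to unversioned, which is the order a later commit in this conversation settles on, and it returns errors so a caller could requeue.

```go
package main

import (
	"errors"
	"fmt"
)

// deploymentClient is a hypothetical, narrowed-down view of the Temporal
// calls needed for cleanup. It is not the real SDK interface.
type deploymentClient interface {
	SetRampingVersion(buildID string) error
	SetCurrentVersion(buildID string) error
	ListVersions() ([]string, error)
	DeleteVersion(buildID string, skipDrainage bool) error
	DeleteDeployment() error
}

// cleanupDeployment runs the server-side steps in order and returns the
// first error, so a caller can requeue and retry the whole sequence.
func cleanupDeployment(c deploymentClient) error {
	// An empty build ID stands for "no ramping version" / "unversioned".
	if err := c.SetRampingVersion(""); err != nil {
		return fmt.Errorf("clear ramping version: %w", err)
	}
	if err := c.SetCurrentVersion(""); err != nil {
		return fmt.Errorf("set current version to unversioned: %w", err)
	}
	versions, err := c.ListVersions()
	if err != nil {
		return fmt.Errorf("list versions: %w", err)
	}
	for _, v := range versions {
		if err := c.DeleteVersion(v, true); err != nil {
			return fmt.Errorf("delete version %q: %w", v, err)
		}
	}
	if err := c.DeleteDeployment(); err != nil {
		return fmt.Errorf("delete deployment: %w", err)
	}
	return nil
}

// errPollersPresent is what the fake returns while pollers remain.
var errPollersPresent = errors.New("version still has active pollers")

// fakeClient records calls in order and can simulate a failing delete.
type fakeClient struct {
	versions      []string
	pollersActive bool
	calls         []string
}

func (f *fakeClient) SetRampingVersion(id string) error {
	f.calls = append(f.calls, fmt.Sprintf("SetRampingVersion(%q)", id))
	return nil
}

func (f *fakeClient) SetCurrentVersion(id string) error {
	f.calls = append(f.calls, fmt.Sprintf("SetCurrentVersion(%q)", id))
	return nil
}

func (f *fakeClient) ListVersions() ([]string, error) { return f.versions, nil }

func (f *fakeClient) DeleteVersion(id string, skipDrainage bool) error {
	if f.pollersActive {
		return errPollersPresent
	}
	f.calls = append(f.calls, fmt.Sprintf("DeleteVersion(%q, skipDrainage=%t)", id, skipDrainage))
	return nil
}

func (f *fakeClient) DeleteDeployment() error {
	f.calls = append(f.calls, "DeleteDeployment()")
	return nil
}

func main() {
	c := &fakeClient{versions: []string{"v1.0", "v2.0"}}
	if err := cleanupDeployment(c); err != nil {
		fmt.Println("cleanup failed:", err)
		return
	}
	for _, call := range c.calls {
		fmt.Println(call)
	}
}
```

Keeping the Temporal calls behind a small interface is a design choice made for the sketch; it makes the ordering easy to assert in a unit test without a running server.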

TemporalConnection finalizer (temporal.io/connection-in-use):

  • Added to the TemporalConnection during normal TWD reconciliation via ensureConnectionFinalizer()
  • Prevents the connection from being deleted while any TWD still references it
  • Removed by removeConnectionFinalizerIfUnused() during TWD deletion, after checking no other TWDs in the same namespace reference the connection
  • Guarantees the connection is always available during TWD cleanup -- no race condition with Helm deleting both resources simultaneously
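
The "unused" check behind the connection finalizer can be sketched roughly as follows. The `workerDeployment` struct and both helper functions are hypothetical and much simpler than the real CRD types; only the finalizer name is taken from the description above. The sketch also assumes that a TWD counts as a reference regardless of its own deletion state, which the description does not specify.

```go
package main

import "fmt"

// connectionFinalizer is the finalizer name given in the PR description.
const connectionFinalizer = "temporal.io/connection-in-use"

// workerDeployment is a pared-down, hypothetical stand-in for a TWD.
type workerDeployment struct {
	Namespace      string
	Name           string
	ConnectionName string
}

// connectionStillReferenced reports whether any TWD other than the one
// being deleted, in the same namespace, points at the same connection.
func connectionStillReferenced(all []workerDeployment, deleting workerDeployment) bool {
	for _, twd := range all {
		if twd.Namespace != deleting.Namespace || twd.Name == deleting.Name {
			continue
		}
		if twd.ConnectionName == deleting.ConnectionName {
			return true
		}
	}
	return false
}

// withoutFinalizer returns a copy of finalizers with name removed.
func withoutFinalizer(finalizers []string, name string) []string {
	out := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != name {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	all := []workerDeployment{
		{Namespace: "prod", Name: "orders", ConnectionName: "temporal"},
		{Namespace: "prod", Name: "billing", ConnectionName: "temporal"},
	}
	connFinalizers := []string{connectionFinalizer}

	// Deleting "orders" while "billing" still uses the connection keeps the finalizer.
	if !connectionStillReferenced(all, all[0]) {
		connFinalizers = withoutFinalizer(connFinalizers, connectionFinalizer)
	}
	fmt.Println(connFinalizers)

	// Once "billing" is the only TWD left and is being deleted, the finalizer goes.
	remaining := all[1:]
	if !connectionStillReferenced(remaining, remaining[0]) {
		connFinalizers = withoutFinalizer(connFinalizers, connectionFinalizer)
	}
	fmt.Println(connFinalizers)
}
```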

RBAC updates:

  • Added update;patch verbs for temporalconnections (was get;list;watch)
  • Added update verb for temporalconnections/finalizers

Deletion flow

Helm upgrade (TWD disabled)
  |
  v
Helm deletes TWD CRD + TemporalConnection CRD simultaneously
  |
  +--> TemporalConnection: has finalizer, K8s sets DeletionTimestamp but blocks deletion
  |
  +--> TWD: has finalizer, K8s sets DeletionTimestamp, triggers Reconcile
         |
         v
       handleDeletion() runs:
         1. Fetches TemporalConnection (guaranteed to exist via finalizer)
         2. Connects to Temporal server
         3. Sets current version to unversioned
         4. Deletes versions
         5. Removes connection finalizer (no other TWDs reference it)
         6. Removes TWD finalizer
         |
         v
       TWD deleted by K8s
         |
         v
       TemporalConnection: no more finalizers, deleted by K8s
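
The flow above relies on the standard Kubernetes finalizer rule: a delete request only sets DeletionTimestamp, and the object goes away once its finalizer list is empty. A minimal model of the resulting reconcile branching is sketched below. The `object` type and `nextStep` function are invented for this sketch and stand in for the real resource metadata and Reconcile logic; only the finalizer name comes from the description.

```go
package main

import (
	"fmt"
	"time"
)

// twdFinalizer is the finalizer name given in the PR description.
const twdFinalizer = "temporal.io/worker-deployment-cleanup"

// object models only the two metadata fields that matter for finalizers.
type object struct {
	DeletionTimestamp *time.Time
	Finalizers        []string
}

func (o *object) hasFinalizer(name string) bool {
	for _, f := range o.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

// markDeleted models a delete request on an object that has finalizers:
// the timestamp is set, but the object is kept.
func (o *object) markDeleted() {
	now := time.Now()
	o.DeletionTimestamp = &now
}

// removable mirrors the Kubernetes rule: an object marked for deletion is
// only removed from the API once its finalizer list is empty.
func (o *object) removable() bool {
	return o.DeletionTimestamp != nil && len(o.Finalizers) == 0
}

// nextStep decides what a reconcile pass should do for this object.
func nextStep(o *object) string {
	switch {
	case o.DeletionTimestamp != nil && o.hasFinalizer(twdFinalizer):
		return "run cleanup, then remove finalizer"
	case o.DeletionTimestamp != nil:
		return "nothing to do"
	case !o.hasFinalizer(twdFinalizer):
		return "add finalizer"
	default:
		return "normal reconcile"
	}
}

func main() {
	twd := &object{}
	fmt.Println(nextStep(twd))

	twd.Finalizers = append(twd.Finalizers, twdFinalizer)
	fmt.Println(nextStep(twd))

	twd.markDeleted()
	fmt.Println(nextStep(twd), "| removable:", twd.removable())

	twd.Finalizers = nil // cleanup finished and finalizer removed
	fmt.Println(nextStep(twd), "| removable:", twd.removable())
}
```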

Issue #55
Closes #166

@anujagrawal380 anujagrawal380 requested review from a team and jlegrone as code owners March 24, 2026 18:18
@CLAassistant

CLAassistant commented Mar 24, 2026

CLA assistant check
All committers have signed the CLA.

@anujagrawal380
Contributor Author

PTAL @carlydf

Contributor

@jaypipes jaypipes left a comment

@anujagrawal380 awesome contribution, thank you so much for this PR! A couple of really minor comments below, but overall excellent work.

Comment thread internal/controller/worker_controller.go Outdated
Comment thread internal/controller/worker_controller.go Outdated
@anujagrawal380
Contributor Author

> @anujagrawal380 awesome contribution, thank you so much for this PR! A couple of really minor comments below, but overall excellent work.

Thanks, resolved both the comments!

Contributor

@jaypipes jaypipes left a comment

rock on :) nice work on this @anujagrawal380!

Collaborator

@carlydf carlydf left a comment

we need integration tests for this before merging to main / including it in a release

@anujagrawal380
Contributor Author

> we need integration tests for this before merging to main / including it in a release

@carlydf Added the integration tests. PTAL

@carlydf
Collaborator

carlydf commented Apr 22, 2026

Hi @anujagrawal380, could you fix the linters? Would love to include this in our next release.

@anujagrawal380 anujagrawal380 force-pushed the fix/twd-leaves-stale-versioning-data branch 3 times, most recently from d6a305c to 9fd0c74 on April 22, 2026 18:07
@anujagrawal380
Contributor Author

anujagrawal380 commented Apr 22, 2026

> Hi @anujagrawal380, could you fix the linters? Would love to include this in our next release.

@carlydf @jaypipes Added a few more minor improvements here: 9fd0c74. PTAL!

Comment thread internal/controller/worker_controller.go Outdated
Contributor

@jaypipes jaypipes left a comment

yeah, I agree with @carlydf's concern about the deleteCleanupTimeout. I think that should be removed.

@carlydf
Collaborator

carlydf commented Apr 24, 2026

@anujagrawal380 we hope to include this change in our release early next week, without the timeout addition. Hopefully you can respond to the feedback when you get back on Monday! To allow the tests to pass in a reasonable amount of time, you can override the pollerTTL to make pollers "die" faster on the server side.
Example

If we don't get a response by Tuesday, we will pick your commits into a separate PR (so you will still be attributed), remove the timeout, make the tests pass and merge it. Hopefully we can come to an agreement about the timeout behavior though!

…blocking unversioned workers

Signed-off-by: Anuj Agrawal <anujagrawal380@gmail.com>
…ion finalizer

- Add 5-minute deletionCleanupTimeout to prevent TWD stuck in Terminating
  state indefinitely if Temporal server is unavailable
- Return errors from version/deployment deletion to trigger requeue until
  versions actually clear (pollers disappear as pods terminate)
- Add update/patch verbs and finalizers RBAC marker for TemporalConnections
- Fix comment-spacing lint on new kubebuilder:rbac markers
@anujagrawal380 anujagrawal380 force-pushed the fix/twd-leaves-stale-versioning-data branch from 9fd0c74 to 04b61a2 on April 26, 2026 14:18
@anujagrawal380
Contributor Author

> @anujagrawal380 we hope to include this change in our release early next week, without the timeout addition. Hopefully you can respond to the feedback when you get back on Monday! To allow the tests to pass in a reasonable amount of time, you can override the pollerTTL to make pollers "die" faster on the server side. Example
>
> If we don't get a response by Tuesday, we will pick your commits into a separate PR (so you will still be attributed), remove the timeout, make the tests pass and merge it. Hopefully we can come to an agreement about the timeout behavior though!

Thanks @carlydf @jaypipes for the thorough review. Made the changes as suggested. Please let me know if any more changes are required!

Comment thread internal/controller/worker_controller.go Outdated
Comment thread internal/controller/worker_controller.go Outdated
Comment thread internal/tests/internal/deletion_integration_test.go
- Drop l.Error before returning err from removeConnectionFinalizerIfUnused;
  controller-runtime logs returned errors automatically
- Before deleting the TWD in the deletion test, update to v2.0 image, start
  v2.0 workers, and set v2.0 as the ramping version at 50% — exercises the
  clear-ramping-version path in handleDeletion and verifies the ramping
  version is nil after cleanup
Comment thread internal/tests/internal/deletion_integration_test.go Outdated
…eletion

Per Temporal server behavior, the ramping version must be cleared before
setting current to unversioned to avoid a window where traffic is split
between unversioned workers and the still-active ramping version.
…Token

SetRampingVersion mutates the deployment, making the ConflictToken from
the initial Describe stale. Re-describe after Step 1 so SetCurrentVersion
in Step 2 uses a valid token instead of failing on the first reconcile.
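
The stale-token problem in this commit message can be reproduced with a small model of optimistic concurrency. Everything here is a simplification for illustration: the `deployment` type is invented, and an integer counter stands in for the opaque ConflictToken, assuming only what the commit message states, namely that a successful mutation invalidates the previously described token.

```go
package main

import (
	"errors"
	"fmt"
)

var errStaleToken = errors.New("conflict token is stale")

// deployment is a toy server-side record. The integer token stands in for
// the opaque ConflictToken and changes on every successful mutation.
type deployment struct {
	token   int
	current string
	ramping string
}

// describe returns the token that the next mutation must present.
func (d *deployment) describe() int { return d.token }

func (d *deployment) setRampingVersion(buildID string, token int) error {
	if token != d.token {
		return errStaleToken
	}
	d.ramping = buildID
	d.token++
	return nil
}

func (d *deployment) setCurrentVersion(buildID string, token int) error {
	if token != d.token {
		return errStaleToken
	}
	d.current = buildID
	d.token++
	return nil
}

// cleanup clears ramping, re-describes, then sets current to unversioned.
func cleanup(d *deployment) error {
	token := d.describe()
	if err := d.setRampingVersion("", token); err != nil {
		return err
	}
	token = d.describe() // refresh: the previous call invalidated the old token
	return d.setCurrentVersion("", token)
}

func main() {
	d := &deployment{current: "v1.0", ramping: "v2.0"}

	// Reusing the token from the initial describe fails on the second call.
	stale := d.describe()
	_ = d.setRampingVersion("", stale)
	fmt.Println("reusing token:", d.setCurrentVersion("", stale))

	// Re-describing between the two calls succeeds.
	d = &deployment{current: "v1.0", ramping: "v2.0"}
	fmt.Println("re-describing:", cleanup(d))
}
```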
Comment thread internal/controller/worker_controller.go Outdated
Contributor

@jaypipes jaypipes left a comment

excellent work on this @anujagrawal380 thank you so much! :)

Collaborator

@carlydf carlydf left a comment

thank you!

Comment thread helm/temporal-worker-controller/templates/rbac.yaml
The RBAC yaml must be generated from kubebuilder annotations, not edited
manually. Running make manifests consolidates the new temporalconnections
and temporalconnections/finalizers rules with existing entries.
Member

@Shivs11 Shivs11 left a comment

ty!

When AllAtOnce strategy promotes v2.0 to current before the test sets it
as ramping, revert to v1.0 as current first so SetRampingVersion succeeds.
Also regenerate CRD YAML with new k8s API fields picked up by make manifests.
@anujagrawal380
Contributor Author

@carlydf Can we please rerun tests again here?

@anujagrawal380
Contributor Author

@carlydf Sorry, lint failed. Please run it again!

@carlydf
Collaborator

carlydf commented Apr 27, 2026

> @carlydf Sorry, lint failed. Please run it again!

@anujagrawal380, this test failed in https://github.com/temporalio/temporal-worker-controller/actions/runs/25011104275/job/73248144790. I'm re-running to see if it was just flaky and maybe will pass the second time. At the same time, ideally we would not have a flaky test since that will cause issues for future PRs. Could you take a look at the failure please?

    --- FAIL: TestIntegration/deletion-sets-current-to-unversioned (5.61s)

…tion test

The AllAtOnce reconciler sets ManagerIdentity on the Temporal deployment,
blocking other identities from calling SetCurrentVersion. Use a dedicated
SDK client with Identity: "temporal-worker-controller" for the revert call,
and wrap the entire setup in an eventually loop so the race is handled
robustly rather than failing on the first conflict.
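
The identity check described in this commit message can be pictured with the toy model below. The `deployment` type and its method are hypothetical; the sketch assumes only what the message says, that once a ManagerIdentity is set, calls from other identities are rejected. The `temporal-worker-controller` identity string is taken from the message, while `integration-test` is made up.

```go
package main

import (
	"errors"
	"fmt"
)

var errNotManager = errors.New("caller is not the manager identity")

// deployment is a toy record with an optional manager identity.
type deployment struct {
	managerIdentity string
	current         string
}

// setCurrentVersion rejects callers other than the manager, if one is set.
func (d *deployment) setCurrentVersion(caller, buildID string) error {
	if d.managerIdentity != "" && caller != d.managerIdentity {
		return errNotManager
	}
	d.current = buildID
	return nil
}

func main() {
	d := &deployment{managerIdentity: "temporal-worker-controller", current: "v2.0"}

	// A client with a different identity is blocked.
	fmt.Println(d.setCurrentVersion("integration-test", "v1.0"))

	// A client using the controller's identity is allowed.
	fmt.Println(d.setCurrentVersion("temporal-worker-controller", "v1.0"), d.current)
}
```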
@anujagrawal380
Contributor Author

> > @carlydf Sorry, lint failed. Please run it again!
>
> @anujagrawal380, this test failed in https://github.com/temporalio/temporal-worker-controller/actions/runs/25011104275/job/73248144790. I'm re-running to see if it was just flaky and maybe will pass the second time. At the same time, ideally we would not have a flaky test since that will cause issues for future PRs. Could you take a look at the failure please?
>
>     --- FAIL: TestIntegration/deletion-sets-current-to-unversioned (5.61s)

@carlydf Added a commit, please rerun once more!

With AllAtOnce the controller continuously re-promotes v2.0 to current,
fighting any attempt by the test to set v2.0 as ramping. Switch to Manual
strategy and use the existing setCurrentVersion / setRampingVersion helpers
(which use defaults.ControllerIdentity) so the test drives versioning state
explicitly and handleDeletion's later cleanup calls match the ManagerIdentity.
@carlydf
Collaborator

carlydf commented Apr 27, 2026

> > > @carlydf Sorry, lint failed. Please run it again!
> >
> > @anujagrawal380, this test failed in https://github.com/temporalio/temporal-worker-controller/actions/runs/25011104275/job/73248144790. I'm re-running to see if it was just flaky and maybe will pass the second time. At the same time, ideally we would not have a flaky test since that will cause issues for future PRs. Could you take a look at the failure please?
> >
> >     --- FAIL: TestIntegration/deletion-sets-current-to-unversioned (5.61s)
>
> @carlydf Added a commit, please rerun once more!

@anujagrawal380 it failed again. Have not looked at why. Is this passing locally? To run locally you can do:

KUBEBUILDER_ASSETS=/Users/cdf/Desktop/temporal-worker-controller/bin/k8s/1.27.1-darwin-arm64 go test -tags test_dep ./internal/tests/internal \
   -run "TestIntegration/deletion-sets-current-to-unversioned" -timeout 600s

but replace the KUBEBUILDER_ASSETS env var with the correct path for your system. To generate those assets and find out the right path, you can run make test-integration and then Ctrl+C the test after the KUBEBUILDER_ASSETS line is printed.

 % make test-integration
test -s /Users/cdf/Desktop/temporal-worker-controller/bin/controller-gen && /Users/cdf/Desktop/temporal-worker-controller/bin/controller-gen --version | grep -q v0.19.0 || \
        GOBIN=/Users/cdf/Desktop/temporal-worker-controller/bin go install sigs.k8s.io/controller-tools/cmd/controller-gen@v0.19.0
GOWORK=off GO111MODULE=on /Users/cdf/Desktop/temporal-worker-controller/bin/controller-gen rbac:roleName=manager-role crd:allowDangerousTypes=true,maxDescLen=0,generateEmbeddedObjectMeta=true paths=./api/... paths=./internal/... paths=./cmd/... \
    output:crd:artifacts:config=helm/temporal-worker-controller-crds/templates
python3 hack/sync-rbac-rules.py
Synced RBAC rules from config/rbac/role.yaml → helm/temporal-worker-controller/templates/rbac.yaml
GOWORK=off GO111MODULE=on /Users/cdf/Desktop/temporal-worker-controller/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths=./api/... paths=./internal/... paths=./cmd/...
test -s /Users/cdf/Desktop/temporal-worker-controller/bin/setup-envtest || GOBIN=/Users/cdf/Desktop/temporal-worker-controller/bin go install sigs.k8s.io/controller-runtime/tools/setup-envtest@latest
Running integration tests...
KUBEBUILDER_ASSETS="/Users/cdf/Desktop/temporal-worker-controller/bin/k8s/1.27.1-darwin-arm64" go test -v -tags test_dep ./internal/tests/internal -run TestIntegration
^Cmake: *** [test-integration] Interrupt: 2

@carlydf carlydf merged commit b8d9428 into temporalio:main Apr 27, 2026
18 checks passed
Development

Successfully merging this pull request may close these issues.

Cleanup of Temporal deployments when TemporalWorker CRD is deleted

5 participants