| 1 | +# CD Rollouts with the Temporal Worker Controller |
| 2 | + |
| 3 | +This guide describes patterns for integrating the Temporal Worker Controller into a CD pipeline. It assumes you are already using Worker Versioning in steady state. |
| 4 | + |
| 5 | +> **Note:** The examples below illustrate common integration patterns but are not guaranteed to work verbatim with every version of each tool. API fields, configuration keys, and default behaviors change between releases. Always verify against the documentation for the specific tool you are using. |
| 6 | + |
| 7 | +For migration help, see [migration-to-versioned.md](migration-to-versioned.md). |
| 8 | + |
| 9 | +## Understanding the conditions |
| 10 | + |
| 11 | +The `TemporalWorkerDeployment` resource exposes two standard conditions on `status.conditions` that CD tools and scripts can consume. |
| 12 | + |
| 13 | +### `Ready` |
| 14 | + |
| 15 | +`Ready=True` means the controller successfully reached Temporal **and** the target version is the current version in Temporal. This is the primary signal that a rollout has finished and the worker is fully operational. |
| 16 | + |
| 17 | +When the rollout has finished, `Ready=True` is reported with reason `RolloutComplete`. |
| 18 | + |
| 19 | +`Ready=False` is reported while either of those two conditions is not met. The `reason` field tells you why: |
| 20 | + |
| 21 | +| Reason | Meaning | |
| 22 | +|---|---| |
| 23 | +| `WaitingForPollers` | Target version's Deployment exists but workers haven't registered with Temporal yet | |
| 24 | +| `WaitingForPromotion` | Workers are registered (Inactive) but not yet promoted to Current | |
| 25 | +| `Ramping` | Progressive strategy is ramping traffic to the new version | |
| 26 | +| Error reasons (see Progressing below) | A blocking error is preventing progress | |
| 27 | + |
| 28 | +### `Progressing` |
| 29 | + |
| 30 | +`Progressing=True` means a rollout is actively in-flight and the controller is making forward progress. `Progressing=False` means either the rollout is done (`Ready=True`) or a blocking error is preventing progress. |
| 31 | + |
| 32 | +When `Progressing=False`, the `reason` field tells you whether the rollout finished or which error is blocking it: |
| 33 | + |
| 34 | +| Reason | Meaning | |
| 35 | +|---|---| |
| 36 | +| `RolloutComplete` | Not an error — the rollout finished successfully | |
| 37 | +| `TemporalConnectionNotFound` | The referenced `TemporalConnection` resource doesn't exist | |
| 38 | +| `AuthSecretInvalid` | The credential secret is missing, malformed, or has an expired certificate | |
| 39 | +| `TemporalClientCreationFailed` | The controller can't reach the Temporal server (dial/health-check failure) | |
| 40 | +| `TemporalStateFetchFailed` | The controller reached the server but can't read the worker deployment state | |
| 41 | +| `PlanGenerationFailed` | Internal error generating the reconciliation plan | |
| 42 | +| `PlanExecutionFailed` | Internal error executing the plan (e.g., a Kubernetes API call failed) | |
| 43 | + |
| 44 | +Once the underlying problem is fixed, the next successful reconcile will restore `Progressing` and `Ready` to the correct state. |
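The condition semantics above can be collapsed into a small decision function for pipeline scripts. The following is a minimal sketch, not part of the controller: the function names `twd_state` and `is_blocking_reason` and the state labels are hypothetical, and the list of blocking reasons is copied from the table above.

```bash
# Hypothetical helpers; names and state labels are illustrative only.

# Classify a TemporalWorkerDeployment from its two condition statuses.
# $1 = status of Ready, $2 = status of Progressing ("True", "False", or empty).
twd_state() {
  if [ "${1:-}" = "True" ]; then
    echo "healthy"       # Ready=True: rollout finished
  elif [ "${2:-}" = "True" ]; then
    echo "progressing"   # rollout in flight
  elif [ "${2:-}" = "False" ]; then
    echo "blocked"       # Progressing=False without Ready=True
  else
    echo "unknown"       # conditions not reported yet
  fi
}

# Succeed (exit 0) if the given Progressing reason is one of the error reasons.
is_blocking_reason() {
  case "${1:-}" in
    TemporalConnectionNotFound|AuthSecretInvalid|TemporalClientCreationFailed|\
    TemporalStateFetchFailed|PlanGenerationFailed|PlanExecutionFailed)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```

The statuses could be fetched with `kubectl get temporalworkerdeployment my-worker -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'` and the equivalent expression for `Progressing`. A polling loop that stops as soon as `is_blocking_reason` succeeds avoids waiting out the full timeout on a configuration error.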
| 45 | + |
| 46 | +## Triggering a rollout |
| 47 | + |
| 48 | +A rollout starts when you change the pod template in your `TemporalWorkerDeployment` spec — a changed pod spec produces a new Build ID, which the controller treats as a new version to roll out. |
| 49 | + |
| 50 | +With Helm (image tag update): |
| 51 | + |
| 52 | +```yaml |
| 53 | +# values.yaml |
| 54 | +image: |
| 55 | +  repository: my-registry/my-worker |
| 56 | +  tag: "v2.3.0" |
| 57 | +``` |
| 58 | + |
| 59 | +```bash |
| 60 | +helm upgrade my-worker ./chart --values values.yaml |
| 61 | +``` |
| 62 | + |
| 63 | +With a plain manifest: |
| 64 | + |
| 65 | +```yaml |
| 66 | +# twd.yaml |
| 67 | +spec: |
| 68 | +  template: |
| 69 | +    spec: |
| 70 | +      containers: |
| 71 | +        - name: worker |
| 72 | +          image: my-registry/my-worker:v2.3.0 |
| 73 | +``` |
| 74 | + |
| 75 | +```bash |
| 76 | +kubectl apply -f twd.yaml |
| 77 | +``` |
| 78 | + |
| 79 | +The controller picks up the change on the next reconcile loop (within seconds) and begins the rollout. |
| 80 | + |
| 81 | +## kubectl |
| 82 | + |
| 83 | +`kubectl wait` can block a pipeline script until `Ready=True`: |
| 84 | + |
| 85 | +```bash |
| 86 | +kubectl apply -f twd.yaml |
| 87 | +kubectl wait temporalworkerdeployment/my-worker \ |
| 88 | +  --for=condition=Ready \ |
| 89 | +  --timeout=10m |
| 90 | +``` |
| 91 | + |
| 92 | +Set `--timeout` to exceed the longest expected rollout time — for progressive strategies this is the sum of all `pauseDuration` values plus the time for workers to start and register. `kubectl wait` exits non-zero on timeout, which you can use to fail the pipeline. |
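As a worked example of the timeout arithmetic described above, the sketch below sums a set of pause durations and adds a buffer for worker startup and registration. The function name `rollout_timeout_seconds`, the pause values (30s, 60s, 300s), and the 120-second buffer are assumptions for illustration; substitute the `pauseDuration` values from your own rollout steps, converted to seconds.

```bash
# Hypothetical helper: $1 = startup/registration buffer in seconds,
# remaining arguments = each pauseDuration in seconds.
rollout_timeout_seconds() {
  buffer="$1"
  shift
  total=0
  for pause in "$@"; do
    total=$((total + pause))
  done
  echo $((total + buffer))
}

# 30 + 60 + 300 seconds of pauses plus a 120-second buffer.
timeout_s="$(rollout_timeout_seconds 120 30 60 300)"
echo "--timeout=${timeout_s}s"   # prints --timeout=510s
```

The result can be passed to `kubectl wait --timeout`. Treat it as a lower bound and round up if worker startup time varies.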
| 93 | + |
| 94 | +## Helm |
| 95 | + |
| 96 | +### Helm 4 |
| 97 | + |
| 98 | +Helm 4 uses [kstatus](https://github.com/kubernetes-sigs/cli-utils/tree/master/pkg/kstatus) for its `--wait` implementation ([HIP-0022](https://helm.sh/community/hips/hip-0022/)). kstatus understands the standard Kubernetes conditions contract and should block until `Ready=True` on your `TemporalWorkerDeployment`: |
| 99 | + |
| 100 | +```bash |
| 101 | +helm upgrade my-worker ./chart --values values.yaml --wait --timeout 10m |
| 102 | +``` |
| 103 | + |
| 104 | +> **Verify:** Check your Helm 4 release notes — kstatus behavior and the `--wait` flag semantics have evolved across point releases. |
| 105 | + |
| 106 | +### Helm 3 |
| 107 | + |
| 108 | +Helm 3's `--wait` only covers a hardcoded set of native resource types (Deployments, StatefulSets, DaemonSets, Jobs, Pods) and does not inspect conditions on custom resources. A separate `kubectl wait` step is one approach: |
| 109 | + |
| 110 | +```bash |
| 111 | +helm upgrade my-worker ./chart --values values.yaml |
| 112 | +kubectl wait temporalworkerdeployment/my-worker \ |
| 113 | +  --for=condition=Ready \ |
| 114 | +  --timeout=10m \ |
| 115 | +  --namespace my-namespace |
| 116 | +``` |
| 117 | + |
| 118 | +## ArgoCD |
| 119 | + |
| 120 | +ArgoCD has no generic fallback that checks `status.conditions` on unknown CRD types. If a resource's API group (here `temporal.io`) is not in ArgoCD's built-in health check registry, ArgoCD silently skips that resource when computing application health. A [custom Lua health check](https://argo-cd.readthedocs.io/en/stable/operator-manual/health/) is the standard mechanism for teaching ArgoCD how to assess a CRD's health. |
| 121 | + |
| 122 | +The two standard conditions (`Ready`, `Progressing`) keep the Lua simple — it only needs to read the condition type and status, not any controller-specific status fields. The following script is a starting point; adapt it to your ArgoCD version and any site-specific requirements: |
| 123 | + |
| 124 | +```yaml |
| 125 | +# In your argocd-cm ConfigMap |
| 126 | +data: |
| 127 | +  resource.customizations.health.temporal.io_TemporalWorkerDeployment: | |
| 128 | +    local ready = nil |
| 129 | +    local progressing = nil |
| 130 | +    if obj.status ~= nil and obj.status.conditions ~= nil then |
| 131 | +      for _, c in ipairs(obj.status.conditions) do |
| 132 | +        if c.type == "Ready" then ready = c end |
| 133 | +        if c.type == "Progressing" then progressing = c end |
| 134 | +      end |
| 135 | +    end |
| 136 | +    if ready ~= nil and ready.status == "True" then |
| 137 | +      return {status = "Healthy", message = ready.message} |
| 138 | +    end |
| 139 | +    if progressing ~= nil then |
| 140 | +      if progressing.status == "True" then |
| 141 | +        return {status = "Progressing", message = progressing.message} |
| 142 | +      else |
| 143 | +        return {status = "Degraded", message = progressing.message} |
| 144 | +      end |
| 145 | +    end |
| 146 | +    return {status = "Progressing", message = "Waiting for conditions"} |
| 147 | +``` |
| 148 | + |
| 149 | +With a health check like this in place: |
| 150 | + |
| 151 | +- ArgoCD shows **Healthy** once `Ready=True`. |
| 152 | +- ArgoCD shows **Progressing** while a rollout is in-flight (`Progressing=True`). |
| 153 | +- ArgoCD shows **Degraded** when progress is blocked (`Progressing=False` with an error reason). |
| 154 | + |
| 155 | +If you use [sync waves](https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/) and workers must be fully rolled out before a dependent service is updated, place the `TemporalWorkerDeployment` in an earlier wave. |
| 156 | + |
| 157 | +> **Verify:** ArgoCD's health customization API and Lua runtime have changed across versions. Test your health check script in a non-production environment before relying on it to gate sync waves. |
| 158 | + |
| 159 | +## Flux |
| 160 | + |
| 161 | +### Kustomization |
| 162 | + |
| 163 | +Flux's `Kustomization` controller uses kstatus to assess resource health. Because `TemporalWorkerDeployment` emits a standard `Ready` condition, Flux should treat it as healthy when `Ready=True`. Adding an explicit `healthChecks` entry makes the dependency visible and ensures Flux waits on the `TemporalWorkerDeployment` before marking the Kustomization as ready: |
| 164 | + |
| 165 | +```yaml |
| 166 | +apiVersion: kustomize.toolkit.fluxcd.io/v1 |
| 167 | +kind: Kustomization |
| 168 | +metadata: |
| 169 | +  name: my-workers |
| 170 | +  namespace: flux-system |
| 171 | +spec: |
| 172 | +  interval: 5m |
| 173 | +  path: ./workers |
| 174 | +  prune: true |
| 175 | +  sourceRef: |
| 176 | +    kind: GitRepository |
| 177 | +    name: my-repo |
| 178 | +  healthChecks: |
| 179 | +    - apiVersion: temporal.io/v1alpha1 |
| 180 | +      kind: TemporalWorkerDeployment |
| 181 | +      name: my-worker |
| 182 | +      namespace: my-namespace |
| 183 | +  timeout: 10m |
| 184 | +``` |
| 185 | + |
| 186 | +Set `timeout` to exceed the longest expected rollout duration. |
| 187 | + |
| 188 | +### HelmRelease |
| 189 | + |
| 190 | +Flux's `helm-controller` uses kstatus by default for post-install/post-upgrade health assessment, so a `HelmRelease` deploying your worker chart should automatically wait for `Ready=True` on any `TemporalWorkerDeployment` resources in the release: |
| 191 | + |
| 192 | +```yaml |
| 193 | +apiVersion: helm.toolkit.fluxcd.io/v2 |
| 194 | +kind: HelmRelease |
| 195 | +metadata: |
| 196 | +  name: my-worker |
| 197 | +  namespace: flux-system |
| 198 | +spec: |
| 199 | +  interval: 5m |
| 200 | +  timeout: 10m # should exceed the longest expected rollout |
| 201 | +  chart: |
| 202 | +    spec: |
| 203 | +      chart: ./chart |
| 204 | +      sourceRef: |
| 205 | +        kind: GitRepository |
| 206 | +        name: my-repo |
| 207 | +``` |
| 208 | + |
| 209 | +> **Verify:** kstatus integration details and the `healthChecks` API have evolved across Flux releases. Check the Flux documentation for your version. |