Add event recording and status conditions for worker deployments #203

Merged
carlydf merged 20 commits into main from add_events
Mar 7, 2026

Conversation

@thearcticwatch
Contributor

@thearcticwatch thearcticwatch commented Feb 21, 2026

## What changed

Added Kubernetes events and status conditions (TemporalConnectionHealthy, RolloutReady) to the worker controller reconciliation loop.

## Why

Reconciliation failures were only visible in controller logs; events and conditions let users diagnose issues directly via kubectl.

  1. Closes Add events to the TemporalWorkerDeployment CRD when there is a problem #28

  2. How was this tested:
    added unit tests and functional tests

  3. Any docs updates needed?
    N/A

  4. Is this risky? Explain

Changing the CRD (adding conditions) introduces the risk that users upgrade the controller but not their CRD. In that case it is OK if new features are silently ignored, but the controller must not panic or fail to perform the actions that were available in the previous CRD version. I believe this change is safe even if someone forgets to upgrade their CRD, because when the new controller runs against a v1.2.0 CRD:

  • No panic. The controller calls `r.Status().Update(ctx, twd)` with conditions populated in memory. The API server validates against the CRD schema and prunes unknown fields (standard behavior for structural schemas without `x-kubernetes-preserve-unknown-fields`). The status write succeeds with a 200 and the conditions are silently dropped before storage.
  • Kubernetes Events work fine. Events are written as separate `events.k8s.io/v1` resources, completely independent of the TWD CRD schema. All `r.Recorder.Eventf(...)` calls will succeed normally.
  • Conditions simply don't persist. `kubectl get twd foo -o yaml` will show no `conditions` field. The controller sets them in memory on every reconcile and tries to write them, and the API server drops them. Functionally the controller does the right thing; it just can't communicate health status via conditions until the CRD is upgraded.
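
The pruning behavior described in the bullets above can be pictured with a toy model. The sketch below is only an illustration in plain Go, not the API server's implementation: `pruneUnknown` and the `versionCount` field are hypothetical names, and the schema is reduced to a nested map of allowed keys.

```go
package main

import "fmt"

// pruneUnknown returns a copy of obj that keeps only the keys listed in schema.
// A nil sub-schema marks a leaf field; a map sub-schema is applied recursively.
// NOTE: toy model for illustration only, not how the API server is implemented.
func pruneUnknown(obj map[string]any, schema map[string]any) map[string]any {
	out := make(map[string]any, len(obj))
	for key, val := range obj {
		sub, known := schema[key]
		if !known {
			continue // unknown field: dropped silently, no error is returned
		}
		subSchema, isObjSchema := sub.(map[string]any)
		nested, isObj := val.(map[string]any)
		if isObjSchema && isObj {
			out[key] = pruneUnknown(nested, subSchema)
			continue
		}
		out[key] = val
	}
	return out
}

func main() {
	// Hypothetical older status schema that has no "conditions" field.
	oldSchema := map[string]any{
		"status": map[string]any{"versionCount": nil},
	}
	// What a newer controller would send in its status update.
	update := map[string]any{
		"status": map[string]any{
			"versionCount": 2,
			"conditions": []any{
				map[string]any{"type": "RolloutReady", "status": "True"},
			},
		},
	}
	// The write "succeeds", but the stored object has no conditions.
	fmt.Println(pruneUnknown(update, oldSchema))
}
```

In this model the caller never sees an error, which mirrors the claim that the controller keeps working and only the conditions are lost.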

@thearcticwatch thearcticwatch requested review from a team and jlegrone as code owners February 21, 2026 00:39
@CLAassistant

CLAassistant commented Feb 21, 2026

CLA assistant check
All committers have signed the CLA.

Comment thread internal/controller/worker_controller.go Outdated
Collaborator

@carlydf carlydf left a comment

also, running `make fmt-imports` will solve some of your lint errors

@thearcticwatch thearcticwatch enabled auto-merge (squash) February 21, 2026 14:14
Collaborator

@carlydf carlydf left a comment

looking good! just did an initial review; we should still add a functional test once these comments are addressed.

I found https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#events and https://book.kubebuilder.io/reference/raising-events#creating-events helpful while reviewing.

Comment thread internal/controller/execplan.go Outdated
Comment thread internal/controller/execplan.go
Comment thread internal/controller/worker_controller.go Outdated
Comment thread internal/controller/worker_controller.go Outdated
Comment thread internal/controller/worker_controller.go Outdated
Comment thread internal/controller/execplan.go
Comment thread internal/controller/execplan.go Outdated
Comment thread internal/controller/worker_controller.go Outdated
Collaborator

@carlydf carlydf left a comment

really close from my perspective. will push a commit showing what I mean about the stricter string types for EventType and ConditionType.
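
For readers who have not seen the pattern, here is a minimal sketch of what stricter string types could look like. It is not the commit referred to above: the condition names come from the PR description, while the `EventType` values and the `describe` helper are assumptions made for illustration.

```go
package main

import "fmt"

// ConditionType and EventType are distinct named string types, so the
// compiler rejects passing one where the other is expected.
type ConditionType string
type EventType string

const (
	// Condition names taken from the PR description.
	ConditionTemporalConnectionHealthy ConditionType = "TemporalConnectionHealthy"
	ConditionRolloutReady              ConditionType = "RolloutReady"
)

const (
	// Assumed values: the two standard Kubernetes event types.
	EventTypeNormal  EventType = "Normal"
	EventTypeWarning EventType = "Warning"
)

// describe is a hypothetical helper that only accepts the typed values.
func describe(et EventType, ct ConditionType, healthy bool) string {
	return fmt.Sprintf("%s: %s=%t", et, ct, healthy)
}

func main() {
	fmt.Println(describe(EventTypeWarning, ConditionRolloutReady, false))

	// These would not compile, which is the point of the stricter types:
	//   describe(ConditionRolloutReady, EventTypeWarning, false) // arguments swapped
	//   var s string = "Warning"
	//   describe(s, ConditionRolloutReady, false) // plain string variable
}
```

Untyped string literals still convert implicitly, so the types catch swapped or mistyped variables rather than every possible typo.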

Comment thread internal/controller/execplan.go Outdated
Comment thread internal/controller/worker_controller.go Outdated
carlydf and others added 3 commits March 3, 2026 15:33
"Registration" already has a meaning in Temporal versioning (a worker
polling for the first time creates a version record). "Promotion" better
describes setting a version as current or ramping, which moves it forward
in the rollout lifecycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread internal/controller/worker_controller.go Outdated
Comment thread internal/controller/worker_controller.go Outdated
Collaborator

@carlydf carlydf left a comment

Approved, but @Shivs11 if you could take a peek at my refactor of ClientPool which I did so we could emit a separate event type for invalid secret vs failed dial to Temporal server, that would be great!
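
One common way to tell an invalid secret apart from a failed dial is to have the connection code wrap sentinel errors and let the caller pick an event reason with `errors.Is`. The sketch below assumes that approach; it is not the actual ClientPool refactor, and every name in it (`ErrInvalidSecret`, `ErrDialFailed`, `loadSecret`, `eventReasonFor`, the reason strings) is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors that a client pool could wrap and return.
var (
	ErrInvalidSecret = errors.New("invalid connection secret")
	ErrDialFailed    = errors.New("failed to dial Temporal server")
)

// loadSecret is a stand-in that always fails, wrapping ErrInvalidSecret
// with extra context the way real code might.
func loadSecret(name string) error {
	return fmt.Errorf("secret %q is missing key tls.crt: %w", name, ErrInvalidSecret)
}

// eventReasonFor maps an error to an event reason string.
// The reason names are assumptions, not the controller's real constants.
func eventReasonFor(err error) string {
	switch {
	case err == nil:
		return ""
	case errors.Is(err, ErrInvalidSecret):
		return "InvalidSecret"
	case errors.Is(err, ErrDialFailed):
		return "DialFailed"
	default:
		return "ConnectionFailed"
	}
}

func main() {
	fmt.Println(eventReasonFor(loadSecret("my-secret")))
	fmt.Println(eventReasonFor(ErrDialFailed))
}
```

Because `errors.Is` walks the wrap chain, the pool can add as much context as it likes without breaking the mapping.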

@carlydf carlydf disabled auto-merge March 6, 2026 02:43
@carlydf carlydf enabled auto-merge (squash) March 6, 2026 02:43
@carlydf carlydf closed this Mar 6, 2026
auto-merge was automatically disabled March 6, 2026 02:44

Pull request was closed

@carlydf carlydf reopened this Mar 6, 2026
@carlydf carlydf merged commit 872bc38 into main Mar 7, 2026
14 checks passed
@carlydf carlydf deleted the add_events branch March 7, 2026 00:08
Shivs11 added a commit that referenced this pull request Mar 20, 2026
)

## What was changed
- WISOTT
- Note: This fixes an unfortunate regression that was introduced with [this](#203).

## Why?
- Bug fix!

## Checklist

1. Closes

2. How was this tested:

3. Any docs updates needed?
carlydf added a commit that referenced this pull request Mar 23, 2026
## Summary

- Adds `clientpool_test.go` with 8 unit tests covering the auth code
paths that had no test coverage
- Two tests are explicit regression guards for the bugs fixed in #227
and #232
- Makes `dialFn` and `systemCertPoolFn` injectable on `ClientPool` (no
behavior change in production) to enable testing without network I/O or
OS trust store dependencies

## Regression tests

**`TestFetchMTLS_CACertAppendsToSystemPool`** — guards against the PR
#212 bug (fixed in #227): `fetchClientUsingMTLSSecret` used
`x509.NewCertPool()` (empty) instead of `x509.SystemCertPool()`,
silently dropping system root CAs and breaking Temporal Cloud
connections. The test injects a fake system pool and verifies both the
injected system CAs and the custom `ca.crt` are present in the returned
pool. This test fails if the fix is reverted.
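
A sketch of the pattern this paragraph describes, using only the standard library. It assumes an injectable `systemCertPoolFn` as named in the summary; `buildRootCAs`, `newSelfSignedCA`, and `demo` are hypothetical helpers and do not reproduce the real `fetchClientUsingMTLSSecret`.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// buildRootCAs starts from the pool returned by systemCertPoolFn and appends
// the custom CA bundle, so system roots are kept alongside the custom CA.
func buildRootCAs(systemCertPoolFn func() (*x509.CertPool, error), caPEM []byte) (*x509.CertPool, error) {
	pool, err := systemCertPoolFn()
	if err != nil {
		return nil, fmt.Errorf("load system cert pool: %w", err)
	}
	if pool == nil {
		pool = x509.NewCertPool()
	}
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no valid certificates found in ca.crt")
	}
	return pool, nil
}

// newSelfSignedCA creates a throwaway self-signed CA certificate for the demo.
func newSelfSignedCA(commonName string) (*x509.Certificate, []byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: commonName},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return nil, nil, err
	}
	return cert, pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

// demo injects a fake system pool and reports whether the resulting pool
// trusts both the fake system root and the custom CA.
func demo() (bool, bool, error) {
	sysCert, sysPEM, err := newSelfSignedCA("fake-system-root")
	if err != nil {
		return false, false, err
	}
	customCert, customPEM, err := newSelfSignedCA("custom-ca")
	if err != nil {
		return false, false, err
	}
	fakeSystemPool := func() (*x509.CertPool, error) {
		p := x509.NewCertPool()
		p.AppendCertsFromPEM(sysPEM)
		return p, nil
	}
	pool, err := buildRootCAs(fakeSystemPool, customPEM)
	if err != nil {
		return false, false, err
	}
	_, sysErr := sysCert.Verify(x509.VerifyOptions{Roots: pool})
	_, customErr := customCert.Verify(x509.VerifyOptions{Roots: pool})
	return sysErr == nil, customErr == nil, nil
}

func main() {
	systemKept, customAdded, err := demo()
	fmt.Println(systemKept, customAdded, err)
}
```

Swapping the injected function for one that returns `x509.NewCertPool()` would reproduce the described bug: the custom CA would still verify, but the system root would not.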

**`TestDialAndUpsert_APIKeySkipsCheckHealth`** — guards against the PR
#203 bug (fixed in #232): `DialAndUpsertClient` called `CheckHealth`
unconditionally, which fails on Temporal Cloud with namespace-scoped API
keys. The test uses an injected mock client and asserts `CheckHealth` is
never called for `AuthModeAPIKey`. This test fails if the fix is
reverted.
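
The shape of this guard can be sketched as follows. This is an illustration under assumptions, not the real `DialAndUpsertClient`: the `healthChecker` interface, `dialFunc`, `dialAndCheck`, `countingClient`, and the `AuthMode` string values are all hypothetical stand-ins.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// AuthMode names how the client authenticates. The string values are assumptions.
type AuthMode string

const (
	AuthModeAPIKey AuthMode = "api-key"
	AuthModeMTLS   AuthMode = "mtls"
)

// healthChecker is a minimal, hypothetical stand-in for the Temporal client.
type healthChecker interface {
	CheckHealth(ctx context.Context) error
}

// dialFunc is the injectable dial step; production code would open a real connection.
type dialFunc func(ctx context.Context) (healthChecker, error)

// dialAndCheck dials, then runs CheckHealth unless the auth mode is API key.
func dialAndCheck(ctx context.Context, mode AuthMode, dial dialFunc) (healthChecker, error) {
	c, err := dial(ctx)
	if err != nil {
		return nil, fmt.Errorf("dial: %w", err)
	}
	if mode == AuthModeAPIKey {
		return c, nil // health check skipped for API key auth
	}
	if err := c.CheckHealth(ctx); err != nil {
		return nil, fmt.Errorf("check health: %w", err)
	}
	return c, nil
}

// countingClient records how many times CheckHealth was called.
type countingClient struct {
	calls int
	err   error
}

func (c *countingClient) CheckHealth(context.Context) error {
	c.calls++
	return c.err
}

// healthCallsFor dials with a fake client whose CheckHealth always fails
// and reports how often it was called for the given auth mode.
func healthCallsFor(mode AuthMode) (int, error) {
	fake := &countingClient{err: errors.New("permission denied")}
	_, err := dialAndCheck(context.Background(), mode, func(context.Context) (healthChecker, error) {
		return fake, nil
	})
	return fake.calls, err
}

func main() {
	for _, mode := range []AuthMode{AuthModeAPIKey, AuthModeMTLS} {
		calls, err := healthCallsFor(mode)
		fmt.Printf("mode=%s checkHealthCalls=%d err=%v\n", mode, calls, err)
	}
}
```

Injecting the dial step is what lets a test like this run without network I/O, which matches the motivation given in the summary.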

## Test plan

- [x] `go test ./internal/controller/clientpool/... -v` — all 8 tests
pass
- [x] `go build ./...` — no compilation errors
- [x] Manually revert the PR #227 fix →
`TestFetchMTLS_CACertAppendsToSystemPool` fails
- [x] Manually revert the PR #232 fix →
`TestDialAndUpsert_APIKeySkipsCheckHealth` fails

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
shashwatsuri pushed a commit to shashwatsuri/temporal-worker-controller that referenced this pull request Apr 28, 2026
…poralio#203)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Carly de Frondeville <cdefrondeville@berkeley.edu>
shashwatsuri pushed a commit to shashwatsuri/temporal-worker-controller that referenced this pull request Apr 28, 2026
…emporalio#232)

shashwatsuri pushed a commit to shashwatsuri/temporal-worker-controller that referenced this pull request Apr 28, 2026
Development

Successfully merging this pull request may close these issues.

Add events to the TemporalWorkerDeployment CRD when there is a problem

4 participants