Validate TWD spec via CRD CEL rules at apply time (fixes #62)#293
Merged
Validate TWD spec via CRD CEL rules at apply time (fixes #62)#293
Conversation
Move TemporalWorkerDeployment spec validation from reconcile time to apply time by embedding CEL validation rules in the CRD schema. Previously validation only fired when the optional TWD webhook was enabled; now the API server enforces it universally on every create/update. Rules added to worker_types.go (regenerated in the CRD manifest): - metadata.name length <= 63 (root object) - Progressive strategy requires non-empty steps (RolloutStrategy) - rampPercentage strictly increasing across steps (RolloutStrategy) - gate: input and inputFrom are mutually exclusive (RolloutStrategy) - gate.inputFrom: exactly one of configMapKeyRef/secretKeyRef (RolloutStrategy) - pauseDuration >= 30s per step (RolloutStep) The ValidateCreate() call and 5-minute requeue in the reconciler are removed — the CRD schema now guarantees the spec is valid before the reconciler sees it. The webhook Go code is kept for defence-in-depth when webhook.enabled=true. Closes #62 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two issues with the initial CEL rules surfaced in k8s 1.27 envtest: 1. gate.input is x-kubernetes-preserve-unknown-fields: true — CEL cannot reference it structurally, so has(self.gate.input) failed to compile. Removed the gate input/inputFrom mutual-exclusivity rule; the webhook Go code still enforces it when enabled. 2. Any list traversal on self.steps at the RolloutStrategy level (map, isSorted) hits the schema-level CEL cost budget in k8s 1.27 because the cost estimator does not propagate maxItems from the steps array schema up to the parent object's rule. Removed the rampPercentage ordering rule from the CRD; the webhook Go code still enforces it when enabled. Remaining CRD rules (all O(1)): name length, Progressive requires steps, pauseDuration >= 30s per step, gate.inputFrom exclusivity. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jaypipes
reviewed
Apr 23, 2026
The CRD CEL rules cannot express rampPercentage ordering or gate input/inputFrom exclusivity (list traversal hits the schema cost budget; gate.input is x-kubernetes-preserve-unknown-fields). These checks live only in the webhook Go code, which is optional. Without a fallback, users who deploy without the webhook and submit an invalid spec (e.g. decreasing rampPercentage) would silently get no rollout progress and no signal — a dangerous failure mode since a decreasing ramp could cause auto-upgrade workflows to roll back. Re-add ValidateCreate() in the reconciler as a fallback: - On failure, emit a Warning event and set ConditionProgressing=False / ConditionReady=False with reason InvalidSpec. - Return early with a 5-minute requeue so the invalid spec is never acted on (no plan generated, no Temporal API calls). - Add ReasonInvalidSpec constant to worker_types.go. - Add TestReconcile_InvalidSpec_EmitsEventAndSetsCondition to cover the new path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Returning ctrl.Result{Requeue: true, RequeueAfter: 5m} is unnecessary:
the controller watches TWD objects, so any spec correction the user
applies triggers an immediate reconciliation automatically. Remove the
timer and return ctrl.Result{}, nil instead.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The CRD schema enforces name length, Progressive-requires-steps, pauseDuration >= 30s, and gate.inputFrom exclusivity at admission time. Keeping them in the webhook Go code was dead redundancy. validateRolloutStrategy now only checks the two constraints the CRD cannot enforce (gate.input is x-kubernetes-preserve-unknown-fields, rampPercentage ordering requires list traversal that exceeds the k8s 1.27 CEL cost budget): - rampPercentage strictly increasing across steps - gate.input and gate.inputFrom mutually exclusive Removed the corresponding test cases from the webhook unit tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests hit a real kube-apiserver via envtest (the existing webhook suite infrastructure) so they verify that the x-kubernetes-validations blocks in the generated CRD are syntactically correct and semantically enforced by the API server — independently of the webhook Go code. Covers all four CRD-level rules: - name length > 63 → rejected - Progressive strategy with no steps → rejected - Progressive step with pauseDuration < 30s → rejected - gate.inputFrom with both configMapKeyRef and secretKeyRef → rejected - valid TWD → accepted Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
carlydf
commented
Apr 23, 2026
carlydf
commented
Apr 23, 2026
…traint Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…er into cdf/crd-cel-validation-issue-62
shashwatsuri
pushed a commit
to shashwatsuri/temporal-worker-controller
that referenced
this pull request
Apr 28, 2026
… (temporalio#293) ## Summary Fixes temporalio#62 — validates `TemporalWorkerDeployment` spec at apply time instead of reconcile time. ### Approach Embed CEL validation rules directly in the CRD schema (`x-kubernetes-validations`) so the API server enforces them universally at admission time — no webhook infrastructure required. The optional TWD webhook remains to provide apply-time feedback to those who enable it. ### CRD CEL rules added (`worker_types.go` → regenerated CRD manifest) | Rule | Placed on | Enforced by | |---|---|---| | `metadata.name` ≤ 63 chars | `TemporalWorkerDeployment` root | CRD | | Progressive strategy requires steps | `RolloutStrategy` | CRD | | `rampPercentage` steps capped at 20 | `steps` field (`maxItems`) | CRD | | `pauseDuration` ≥ 30s per step | `RolloutStep` | CRD | | `gate.inputFrom`: exactly one of `configMapKeyRef`/`secretKeyRef` | `RolloutStrategy` | CRD | Two constraints from the original webhook could **not** be expressed as CRD CEL rules in k8s 1.27 (cost budget limit for list traversal; `gate.input` is `x-kubernetes-preserve-unknown-fields` and invisible to CEL): - `rampPercentage` strictly increasing across steps - `gate.input` and `gate.inputFrom` mutually exclusive ### Fallback for webhook-only constraints The reconciler calls `ValidateCreate()` as a fallback for the two constraints the CRD cannot enforce. On failure it emits a `Warning` event and sets `ConditionProgressing=False` / `ConditionReady=False` with reason `InvalidSpec`, then returns without requeueing (the watch re-triggers when the user corrects the spec). Adds `ReasonInvalidSpec` constant. ### Webhook cleanup `validateRolloutStrategy` in the webhook now only checks the two constraints not in the CRD. All checks that duplicated CRD rules were removed. ### Testing - **Unit tests**: updated to match narrowed webhook scope (removed cases now enforced by CRD) - **Reconciler unit test**: `TestReconcile_InvalidSpec_EmitsEventAndSetsCondition` verifies the event/condition path - **envtest integration tests** (`temporalworkerdeployment_cel_validation_test.go`): hit a real kube-apiserver to verify each CRD CEL rule is syntactically valid and semantically correct, independently of the webhook ## Test plan - [ ] `KUBEBUILDER_ASSETS=... go test ./...` passes locally - [ ] Apply a TWD with name > 63 chars — API server rejects immediately - [ ] Apply a TWD with `strategy: Progressive` and no steps — API server rejects immediately - [ ] Apply a TWD with `strategy: Progressive` and > 20 steps — API server rejects immediately - [ ] Apply a TWD with a step `pauseDuration < 30s` — API server rejects immediately - [ ] Apply a TWD with decreasing `rampPercentage` (webhook disabled) — reconciler sets `InvalidSpec` condition and emits Warning event, does not act on the spec 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #62 — validates
TemporalWorkerDeploymentspec at apply time instead of reconcile time.Approach
Embed CEL validation rules directly in the CRD schema (
x-kubernetes-validations) so the API server enforces them universally at admission time — no webhook infrastructure required. The optional TWD webhook remains to provide apply-time feedback to those who enable it.CRD CEL rules added (
worker_types.go→ regenerated CRD manifest)metadata.name≤ 63 charsTemporalWorkerDeploymentrootRolloutStrategyrampPercentagesteps capped at 20stepsfield (maxItems)pauseDuration≥ 30s per stepRolloutStepgate.inputFrom: exactly one ofconfigMapKeyRef/secretKeyRefRolloutStrategyTwo constraints from the original webhook could not be expressed as CRD CEL rules in k8s 1.27 (cost budget limit for list traversal;
gate.inputisx-kubernetes-preserve-unknown-fieldsand invisible to CEL):rampPercentagestrictly increasing across stepsgate.inputandgate.inputFrommutually exclusiveFallback for webhook-only constraints
The reconciler calls
ValidateCreate()as a fallback for the two constraints the CRD cannot enforce. On failure it emits aWarningevent and setsConditionProgressing=False/ConditionReady=Falsewith reasonInvalidSpec, then returns without requeueing (the watch re-triggers when the user corrects the spec). AddsReasonInvalidSpecconstant.Webhook cleanup
validateRolloutStrategyin the webhook now only checks the two constraints not in the CRD. All checks that duplicated CRD rules were removed.Testing
TestReconcile_InvalidSpec_EmitsEventAndSetsConditionverifies the event/condition pathtemporalworkerdeployment_cel_validation_test.go): hit a real kube-apiserver to verify each CRD CEL rule is syntactically valid and semantically correct, independently of the webhookTest plan
KUBEBUILDER_ASSETS=... go test ./...passes locallystrategy: Progressiveand no steps — API server rejects immediatelystrategy: Progressiveand > 20 steps — API server rejects immediatelypauseDuration < 30s— API server rejects immediatelyrampPercentage(webhook disabled) — reconciler setsInvalidSpeccondition and emits Warning event, does not act on the spec🤖 Generated with Claude Code