Add ComputeDomain v1beta2 as a hub, CRD conversion webhook, and Helm-templated CRD #1005

Draft
shivamerla wants to merge 1 commit into kubernetes-sigs:main from shivamerla:cd_conversion_webhook

Conversation

@shivamerla
Contributor

@shivamerla shivamerla commented Apr 10, 2026

This change introduces:

  • A new resource.nvidia.com/v1beta2 version as the storage version and conversion hub for ComputeDomain. v1beta1 remains a deprecated served version, and the CRD is wired to a conversion webhook so the apiserver can round-trip between the two API versions.
  • Unlike validating/mutating webhooks, the configuration for a conversion webhook has to be embedded in the CRD itself. The ComputeDomain CRD is therefore shipped as a Helm chart template so that the webhook clientConfig (service name/namespace/port, optional caBundle, cert-manager CA injection) resolves at install time instead of baking static cluster settings into the manifest.
  • Helm validation now rejects resources.computeDomains.enabled=true when webhook.enabled=false, matching the CRD's conversion.webhook requirement.
  • The deprecated numNodes field is stored as an annotation to facilitate conversion between the stored v1beta2 version and the served v1beta1 version.
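
For readers less familiar with CRD conversion webhooks, the relevant parts of a rendered CRD might look roughly like the sketch below. It uses standard apiextensions.k8s.io/v1 fields only. The Service name, namespace, path, port, and Certificate name are hypothetical placeholders rather than values from this PR's chart, the deprecationWarning field is an assumption (its text is copied from the warning kubectl prints later in this thread), and the per-version schemas are omitted for brevity.

```yaml
# Sketch only: not the chart's actual template output.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: computedomains.resource.nvidia.com
  annotations:
    # Optional: lets cert-manager's CA injector fill in caBundle.
    # Value format is <namespace>/<Certificate name>; both are placeholders.
    cert-manager.io/inject-ca-from: example-namespace/example-webhook-cert
spec:
  group: resource.nvidia.com
  scope: Namespaced
  names:
    kind: ComputeDomain
    plural: computedomains
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        # Hypothetical Service coordinates, resolved by Helm at install time.
        service:
          name: example-webhook-service
          namespace: example-namespace
          path: /convert
          port: 443
  versions:
  - name: v1beta1
    served: true
    storage: false
    deprecated: true
    # Assumed field; text matches the warning shown by kubectl in this thread.
    deprecationWarning: "resource.nvidia.com/v1beta1 ComputeDomain is deprecated; use resource.nvidia.com/v1beta2 ComputeDomain"
    # schema omitted
  - name: v1beta2
    served: true
    storage: true
    # schema omitted
```

Because clientConfig lives inside the CRD object, any per-cluster value (namespace, Service name, CA) has to be known when the CRD is rendered, which is what motivates templating it.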

Some caveats with the conversion webhooks:

  • The conversion webhook must be enabled when ComputeDomains are enabled: spec.conversion.strategy=Webhook on the CRD requires a reachable webhook, so with webhook.enabled=false the cluster cannot honor v1beta1/v1beta2 conversion.
  • Users must install with webhook.enabled=true (and valid TLS) or disable ComputeDomains (resources.computeDomains.enabled=false).
  • For this reason, cert-manager or an equivalent will be a required prerequisite for production environments.
  • The ComputeDomain CRD is intentionally templated (Helm templates/, not only chart crds/). As a result, the CRD cannot be applied directly with kubectl; install or upgrade via Helm (or reproduce the same templating) so that clientConfig and the optional cert-manager annotations align with the deployed webhook.
  • For the same reason, deleting the Helm chart also removes the CRD and any CR instances associated with it. Ideally the CRDs would live in a separate Helm chart, so a TODO has been added for that.
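
As a rough illustration of the constraint above, a values override that passes the new Helm validation might look like the following. The two key paths come from this description; the surrounding nesting is assumed to follow them literally, and any TLS or cert-manager settings the chart may need are left out because they are not spelled out here.

```yaml
# Sketch of a values override; only these two keys are taken from the PR text.
resources:
  computeDomains:
    enabled: true    # ComputeDomain CRD is installed with conversion.strategy=Webhook
webhook:
  enabled: true      # must be true whenever computeDomains.enabled is true

# The other accepted combination would be to disable ComputeDomains entirely:
#   resources.computeDomains.enabled: false
#   webhook.enabled: false
```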

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 10, 2026
@shivamerla shivamerla marked this pull request as draft April 10, 2026 00:59
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 10, 2026
@shivamerla shivamerla force-pushed the cd_conversion_webhook branch from 3ec6cec to 4ca7248 Compare April 10, 2026 01:14
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 10, 2026
@shivamerla shivamerla force-pushed the cd_conversion_webhook branch from 4ca7248 to 85b0a0c Compare April 10, 2026 02:04
@shivamerla
Contributor Author

shivamerla commented Apr 10, 2026

Fixes #998, #596, and #364

@shivamerla
Contributor Author

The linter failures look like false positives. It flags v1beta1 as deprecated even where it’s required in the conversion code :)

@shivamerla
Contributor Author

Create a ComputeDomain (CD) using v1beta1:

---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-injection
spec:
  numNodes: 1
  channel:
    resourceClaimTemplate:
      name: imex-channel-0
---
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0

kubectl create -f imex-injection.yaml:

Warning: resource.nvidia.com/v1beta1 ComputeDomain is deprecated; use resource.nvidia.com/v1beta2 ComputeDomain
computedomain.resource.nvidia.com/imex-channel-injection created
pod/imex-channel-injection created

A plain get is served as v1beta2 by default (note the annotation carrying numNodes):

$ kubectl get computedomain -o yaml
apiVersion: v1
items:
- apiVersion: resource.nvidia.com/v1beta2
  kind: ComputeDomain
  metadata:
    annotations:
      resource.nvidia.com/computedomain-num-nodes: "1"
    creationTimestamp: "2026-04-09T16:23:23Z"
    finalizers:
    - resource.nvidia.com/computeDomain
    generation: 1
    name: imex-channel-injection
    namespace: default
    resourceVersion: "49547125"
    uid: dac300df-2282-44bd-9b1e-fccdf67000e4
  spec:
    channel:
      allocationMode: Single
      resourceClaimTemplate:
        name: imex-channel-0
  status:
    nodes:
    - cliqueID: ""
      index: -1
      ipAddress: 192.168.114.73
      name: nim-operator-1vg0z43
      status: Ready
    status: Ready
kind: List
metadata:
  resourceVersion: ""

Get v1beta1 version explicitly:

nvidia@nim-operator-8vg0z43:~/shiva/webhook/helm$ kubectl get computedomains.v1beta1.resource.nvidia.com -o yaml
Warning: resource.nvidia.com/v1beta1 ComputeDomain is deprecated; use resource.nvidia.com/v1beta2 ComputeDomain
apiVersion: v1
items:
- apiVersion: resource.nvidia.com/v1beta1
  kind: ComputeDomain
  metadata:
    creationTimestamp: "2026-04-09T16:23:23Z"
    finalizers:
    - resource.nvidia.com/computeDomain
    generation: 1
    name: imex-channel-injection
    namespace: default
    resourceVersion: "49547125"
    uid: dac300df-2282-44bd-9b1e-fccdf67000e4
  spec:
    channel:
      allocationMode: Single
      resourceClaimTemplate:
        name: imex-channel-0
    numNodes: 1
  status:
    nodes:
    - cliqueID: ""
      index: -1
      ipAddress: 192.168.114.73
      name: nim-operator-1vg0z43
      status: Ready
    status: Ready
kind: List
metadata:
  resourceVersion: ""

Get v1beta2 version explicitly:

$ kubectl get computedomains.v1beta2.resource.nvidia.com -o yaml
apiVersion: v1
items:
- apiVersion: resource.nvidia.com/v1beta2
  kind: ComputeDomain
  metadata:
    annotations:
      resource.nvidia.com/computedomain-num-nodes: "1"
    creationTimestamp: "2026-04-09T16:23:23Z"
    finalizers:
    - resource.nvidia.com/computeDomain
    generation: 1
    name: imex-channel-injection
    namespace: default
    resourceVersion: "49547125"
    uid: dac300df-2282-44bd-9b1e-fccdf67000e4
  spec:
    channel:
      allocationMode: Single
      resourceClaimTemplate:
        name: imex-channel-0
  status:
    nodes:
    - cliqueID: ""
      index: -1
      ipAddress: 192.168.114.73
      name: nim-operator-1vg0z43
      status: Ready
    status: Ready
kind: List
metadata:
  resourceVersion: ""
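
For comparison, a ComputeDomain written directly against the new version might look like the sketch below. This is inferred from the v1beta2 output above rather than from the v1beta2 schema itself, so treat the field set as an assumption: numNodes is left out of the spec because the served v1beta2 objects carry it only as an annotation.

```yaml
# Hypothetical v1beta2 manifest; fields mirror the served v1beta2 output in this thread.
apiVersion: resource.nvidia.com/v1beta2
kind: ComputeDomain
metadata:
  name: imex-channel-injection
spec:
  channel:
    allocationMode: Single
    resourceClaimTemplate:
      name: imex-channel-0
```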

@shivamerla shivamerla requested review from dims and klueska April 10, 2026 02:22
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shivamerla

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shivamerla shivamerla force-pushed the cd_conversion_webhook branch from 49c8fd5 to 012b531 Compare April 10, 2026 14:56
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 10, 2026
…templated CRD

This change introduces:

- A new 'resource.nvidia.com/v1beta2' version as the storage version and conversion hub for ComputeDomain. 'v1beta1' remains a deprecated served version, and the CRD is wired to a conversion webhook so the apiserver can round-trip between the two API versions.
- Unlike validating/mutating webhooks, the configuration for a conversion webhook has to be embedded in the CRD itself. The computedomains CRD is therefore shipped as a Helm chart template so that the webhook clientConfig (service name/namespace/port, optional caBundle, cert-manager CA injection) resolves at install time instead of baking static cluster settings into the manifest.
- Helm validation now rejects resources.computeDomains.enabled=true when webhook.enabled=false, matching the CRD's conversion.webhook requirement.
- The deprecated 'numNodes' field is stored as an annotation to facilitate conversion between the stored v1beta2 version and the served v1beta1 version.

Some caveats with the conversion webhooks:

- The conversion webhook must be enabled when ComputeDomains are enabled: spec.conversion.strategy=Webhook on the CRD requires a reachable webhook, so with webhook.enabled=false the cluster cannot honor v1beta1/v1beta2 conversion.
- Users must install with webhook.enabled=true (and valid TLS) or disable ComputeDomains (resources.computeDomains.enabled=false).
- For this reason, 'cert-manager' will be a required prerequisite for production environments.
- The ComputeDomain CRD is intentionally templated (Helm templates/, not only chart crds/). As a result, the CRD cannot be applied directly with kubectl; install or upgrade via Helm (or reproduce the same templating) so that clientConfig and the optional cert-manager annotations align with the deployed webhook.
- For the same reason, deleting the Helm chart also removes the CRD and any CR instances associated with it. Ideally the CRDs would live in a separate Helm chart, so a TODO has been added for that.

Signed-off-by: Shiva Krishna, Merla <smerla@nvidia.com>
@shivamerla shivamerla force-pushed the cd_conversion_webhook branch from bdba9e3 to a0b6153 Compare April 11, 2026 00:12
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 11, 2026