NVIDIA-472: Add CSR approval for DPU worker nodes #42

Merged

tsorya merged 1 commit into rh-ecosystem-edge:main from linoyaslan:feature-csr-approval on Feb 24, 2026

Conversation


@linoyaslan linoyaslan commented Feb 18, 2026

Implements CSR approval for DPU worker nodes joining hosted clusters.

Implementation

  • Dedicated controller: CSRApprovalReconciler polls hosted clusters every 30s
  • CSR validation: Hostname matching, DPU ownership verification, certificate content checks
  • Client caching: ClientManager caches kubeconfig clients per hosted cluster
  • Independent operation: Runs alongside main controller

Architecture

Two controllers run in the same manager pod:

  • dpfhcpprovisioner - Main provisioning logic
  • csrapproval - CSR auto-approval polling

Example log output

2026-02-18T09:52:50Z    INFO    CSR Approval Controller reconciling     {"controller": "csrapproval", "controllerGroup": "provisioning.dpu.hcp.io", "controllerKind": "DPFHCPProvisioner", "DPFHCPProvisioner": {"name":"doca","namespace":"clusters"}, "namespace": "clusters", "name": "doca", "reconcileID": "72c69952-cfb8-4a93-8fa0-6427860d80ea", "namespace": "clusters", "name": "doca"}
2026-02-18T09:52:50Z    INFO    Found pending CSRs      {"controller": "csrapproval", "controllerGroup": "provisioning.dpu.hcp.io", "controllerKind": "DPFHCPProvisioner", "DPFHCPProvisioner": {"name":"doca","namespace":"clusters"}, "namespace": "clusters", "name": "doca", "reconcileID": "72c69952-cfb8-4a93-8fa0-6427860d80ea", "count": 1}
2026-02-18T09:52:50Z    INFO    Filtered CSRs by signer {"controller": "csrapproval", "controllerGroup": "provisioning.dpu.hcp.io", "controllerKind": "DPFHCPProvisioner", "DPFHCPProvisioner": {"name":"doca","namespace":"clusters"}, "namespace": "clusters", "name": "doca", "reconcileID": "72c69952-cfb8-4a93-8fa0-6427860d80ea", "bootstrapCount": 1, "servingCount": 0, "otherCount": 0}
2026-02-18T09:52:50Z    INFO    Processing CSR  {"controller": "csrapproval", "controllerGroup": "provisioning.dpu.hcp.io", "controllerKind": "DPFHCPProvisioner", "DPFHCPProvisioner": {"name":"doca","namespace":"clusters"}, "namespace": "clusters", "name": "doca", "reconcileID": "72c69952-cfb8-4a93-8fa0-6427860d80ea", "csrName": "csr-lxhcj", "hostname": "nvd-srv-22...", "type": "bootstrap"}
2026-02-18T09:52:50Z    INFO    CSR approved    {"controller": "csrapproval", "controllerGroup": "provisioning.dpu.hcp.io", "controllerKind": "DPFHCPProvisioner", "DPFHCPProvisioner": {"name":"doca","namespace":"clusters"}, "namespace": "clusters", "name": "doca", "reconcileID": "72c69952-cfb8-4a93-8fa0-6427860d80ea", "csrName": "csr-lxhcj", "hostname": "nvd-srv-22...", "type": "bootstrap"}

Summary by CodeRabbit

  • New Features
    • CSR auto-approval for hosted clusters with continuous monitoring and a status condition reflecting activity.
  • Configuration
    • RBAC/ClusterRole expanded to grant read access to DPU resources for CSR validation.
  • Tests
    • Extensive tests added for CSR parsing, content validation, ownership checks, hostname extraction, and controller behavior.


coderabbitai Bot commented Feb 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a CSR auto-approval subsystem: new CSRApprovalReconciler and CSRApprover with hosted-cluster client caching, CSR content and owner validation, automated approval flow, tests, RBAC/helm rule additions, and per-controller event recorders and constants.

Changes

Cohort / File(s) — Summary

  • API Type Constants (api/v1alpha1/dpfhcpprovisioner_types.go): Added the CSRAutoApprovalActive condition type and three reason constants (ReasonCSRApprovalActive, ReasonKubeconfigNotAvailable, ReasonHostedClusterNotReachable).
  • Controller Setup & Initialization (cmd/main.go): Registered the CSR approval controller and recorder, instantiated NewCSRApprover, switched to per-component event recorders, and wired CSRApprovalReconciler into the manager.
  • RBAC & Helm (config/rbac/role.yaml, helm/dpf-hcp-provisioner-operator/templates/clusterrole.yaml): Added ClusterRole rules granting get, list, watch on dpus in provisioning.dpu.nvidia.com.
  • Common Constants (internal/common/constants.go): Removed ControllerName; added ProvisionerControllerName and CSRApprovalControllerName.
  • CSR Client Management (internal/controller/csrapproval/client.go): Added ClientManager to cache per-hosted-cluster clientsets, build REST configs from kubeconfig secrets (replacing the server with the internal endpoint), test connectivity, and support invalidation.
  • CSR Hostname & Filtering (internal/controller/csrapproval/hostname.go, hostname_test.go): Added extraction of hostnames from bootstrap/serving CSRs, a pending-state check, signer filtering, and unit tests.
  • CSR Content Validation (internal/controller/csrapproval/csr_content_validation.go, csr_content_validation_test.go): Introduced ValidateBootstrapCSR and ValidateServingCSR with detailed identity/usages/org/CN/DNS/IP checks and comprehensive tests.
  • CSR Ownership Validation (internal/controller/csrapproval/csr_owner_validation.go, csr_owner_validation_test.go): Added Validator and ValidationResult to verify DPU existence in the management cluster and Node existence in the hosted cluster; added unit tests for DPU lookup behavior.
  • CSR Approver Orchestration (internal/controller/csrapproval/csrapproval.go): Added CSRApprover with ProcessCSRs and StopCSRWatch to manage kubeconfig checks and the hosted-client lifecycle, list and process CSRs, approve via UpdateApproval, set the CSRAutoApprovalActive condition, and emit events; includes polling/requeue behavior.
  • CSR Approval Controller (internal/controller/csrapproval/controller.go, suite_test.go): Added CSRApprovalReconciler that watches DPFHCPProvisioner, delegates to CSRApprover, stops the watch on deletion, and includes RBAC scaffolding and test-suite bootstrap.
  • Test Suite / Wiring (internal/controller/suite_test.go): Updated wiring to use ProvisionerControllerName for recorders and cleanup handlers.

Sequence Diagram

sequenceDiagram
    participant Op as Operator
    participant Mgmt as Management Cluster
    participant Hosted as Hosted Cluster
    participant DPU as DPU/Node Store

    Op->>Mgmt: Reconcile DPFHCPProvisioner
    Mgmt-->>Op: DPFHCPProvisioner + KubeConfigSecretRef

    Op->>Mgmt: Fetch kubeconfig secret
    Mgmt-->>Op: kubeconfig data

    Op->>Hosted: Build/reuse client (ClientManager cache)
    Op->>Hosted: Test connectivity (ServerVersion)
    Hosted-->>Op: reachable / error

    Op->>Hosted: List pending CSRs
    Hosted-->>Op: CSR list

    loop for each pending CSR
        Op->>Op: Extract hostname from CSR
        Op->>Op: Validate CSR content (usages, org, CN/DNS/IP)
        Op->>Mgmt: Check DPU exists
        Mgmt-->>Op: DPU found / not found
        alt DPU found
            Op->>Hosted: Check Node exists
            Hosted-->>Op: Node found / not found
        end
        alt all validations pass
            Op->>Hosted: Approve CSR (UpdateApproval)
            Op->>Mgmt: Record approval event
        else validation fails
            Op->>Mgmt: Record denial/error event
        end
    end

    Op->>Mgmt: Update CSRAutoApprovalActive condition
    Op->>Op: Requeue after 30s

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 passed

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title clearly and specifically describes the main change, adding CSR approval for DPU worker nodes.
  • Docstring Coverage — ✅ Passed: coverage is 96.15%, above the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (3)
internal/controller/csrapproval/csr_owner_validation.go (1)

89-103: Use Get by name instead of List + linear scan.

dpuExists lists all DPUs in the namespace and iterates to find a matching name. A direct Get call is more efficient (O(1) API call vs. listing potentially many objects). Same applies to nodeExists below.

♻️ Proposed refactor for dpuExists
 func (v *Validator) dpuExists(ctx context.Context, hostname string) (bool, error) {
-	dpuList := &dpuprovisioningv1alpha1.DPUList{}
-	if err := v.mgmtClient.List(ctx, dpuList, client.InNamespace(v.dpuNamespace)); err != nil {
-		return false, err
-	}
-
-	for _, dpu := range dpuList.Items {
-		if dpu.Name == hostname {
-			return true, nil
-		}
-	}
-
-	return false, nil
+	dpu := &dpuprovisioningv1alpha1.DPU{}
+	err := v.mgmtClient.Get(ctx, client.ObjectKey{Namespace: v.dpuNamespace, Name: hostname}, dpu)
+	if err != nil {
+		if client.IgnoreNotFound(err) == nil {
+			return false, nil
+		}
+		return false, err
+	}
+	return true, nil
 }
♻️ Proposed refactor for nodeExists
 func (v *Validator) nodeExists(ctx context.Context, hostname string) (bool, error) {
-	nodeList, err := v.hostedClient.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
-	if err != nil {
-		return false, err
-	}
-
-	for _, node := range nodeList.Items {
-		if node.Name == hostname {
-			return true, nil
-		}
-	}
-
-	return false, nil
+	_, err := v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{})
+	if err != nil {
+		if errors.IsNotFound(err) {
+			return false, nil
+		}
+		return false, err
+	}
+	return true, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csr_owner_validation.go` around lines 89 -
103, Replace the List+scan approach in dpuExists and nodeExists with direct Get
calls: for dpuExists use v.mgmtClient.Get(ctx, types.NamespacedName{Name:
hostname, Namespace: v.dpuNamespace}, &dpuprovisioningv1alpha1.DPU{}) and return
true,nil if found, return false,nil on apierrors.IsNotFound(err) and return
false,err for other errors; for nodeExists (cluster-scoped) use
v.mgmtClient.Get(ctx, types.NamespacedName{Name: hostname}, &corev1.Node{}) with
the same NotFound/other-error handling and return true,nil when present. Ensure
you import types and apierrors if not already present.
internal/controller/csrapproval/hostname_test.go (1)

297-391: Consider consolidating duplicated CSR helper functions.

createTestCSR, createTestCSRWithOrganization, and createServingCSRWithDNS share the same key generation, CSR creation, PEM encoding, and base64 encoding boilerplate. A single parameterized helper would reduce duplication.

♻️ Proposed consolidated helper
-func createTestCSR(cn string) []byte {
-	privateKey, err := rsa.GenerateKey(rand.Reader, 2048)
-	if err != nil {
-		panic(err)
-	}
-	template := x509.CertificateRequest{
-		Subject: pkix.Name{
-			CommonName: cn,
-		},
-	}
-	csrBytes, err := x509.CreateCertificateRequest(rand.Reader, &template, privateKey)
-	if err != nil {
-		panic(err)
-	}
-	pemBlock := pem.EncodeToMemory(&pem.Block{
-		Type:  "CERTIFICATE REQUEST",
-		Bytes: csrBytes,
-	})
-	return []byte(base64.StdEncoding.EncodeToString(pemBlock))
-}
-
-func createTestCSRWithOrganization(cn, org string) []byte { ... }
-func createServingCSRWithDNS(cn, org, dnsName string) []byte { ... }
+// buildTestCSR creates a test CSR with optional organization and DNS names.
+func buildTestCSR(cn string, org string, dnsNames ...string) []byte {
+	privateKey, err := rsa.GenerateKey(rand.Reader, 2048)
+	if err != nil {
+		panic(err)
+	}
+	template := x509.CertificateRequest{
+		Subject: pkix.Name{
+			CommonName: cn,
+		},
+		DNSNames: dnsNames,
+	}
+	if org != "" {
+		template.Subject.Organization = []string{org}
+	}
+	csrBytes, err := x509.CreateCertificateRequest(rand.Reader, &template, privateKey)
+	if err != nil {
+		panic(err)
+	}
+	pemBlock := pem.EncodeToMemory(&pem.Block{
+		Type:  "CERTIFICATE REQUEST",
+		Bytes: csrBytes,
+	})
+	return []byte(base64.StdEncoding.EncodeToString(pemBlock))
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/hostname_test.go` around lines 297 - 391, The
three helpers createTestCSR, createTestCSRWithOrganization, and
createServingCSRWithDNS duplicate key generation, CSR creation, PEM encoding,
and base64 encoding; replace them with a single parameterized helper (e.g.,
buildCSR(cn string, orgs []string, dnsNames []string) []byte) that generates the
RSA key, constructs an x509.CertificateRequest using Subject.CommonName and
Subject.Organization from orgs, sets DNSNames from dnsNames, creates the CSR,
PEM-encodes it and returns base64 bytes; update or remove the three original
functions to call or forward to buildCSR to eliminate duplication.
internal/controller/csrapproval/client.go (1)

50-68: Cached client is never invalidated on error — stale connections will persist.

When a cached clientset becomes stale (e.g., kubeconfig rotated, API server restarted), GetHostedClusterClient keeps returning it without validation. InvalidateClient exists but nothing calls it on connection failures. TestConnection is defined but never invoked from within GetHostedClusterClient.

Consider invalidating the cache entry when the caller encounters connection errors, or proactively test the cached client before returning it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/client.go` around lines 50 - 68,
GetHostedClusterClient currently returns cached entries from cm.hcClients
without validating them; call the existing TestConnection method on the cached
kubernetes.Clientset (keyed by namespace+"/"+name) before returning and if
TestConnection returns an error, call InvalidateClient for that key and fall
through to createHostedClusterClient to rebuild the client; ensure
createHostedClusterClient results are only cached on successful creation and
that errors from TestConnection lead to recreating the client instead of
returning a stale one.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/csrapproval/client.go`:
- Around line 31-39: ClientManager.hcClients is a plain map accessed
concurrently and can panic; add a sync.RWMutex field (e.g., hcMu sync.RWMutex)
to the ClientManager struct and use it to protect all accesses: use
hcMu.RLock()/RUnlock() around read paths in GetHostedClusterClient and any
callers that only read, and use hcMu.Lock()/Unlock() around write paths where
you create or delete entries (e.g., when inserting a new *kubernetes.Clientset
in GetHostedClusterClient or removing entries in InvalidateClient). Ensure every
access to hcClients (reads and writes) in the methods referenced
(GetHostedClusterClient, InvalidateClient, and any other helpers) is guarded by
the appropriate lock.
- Around line 90-93: The client currently sets config.Timeout = 0 which allows
API calls (used by the CSR list/approve logic) to block indefinitely; change the
timeout to a reasonable value shorter than the 30s polling interval (e.g., set
config.Timeout to 10s or 15s) so list/approve operations in
internal/controller/csrapproval/client.go will fail fast instead of hanging;
update the assignment of config.Timeout (near where config.QPS and config.Burst
are set) to a non-zero duration and ensure any callers expect a timeout error
instead of a hang.

In `@internal/controller/csrapproval/csr_content_validation.go`:
- Around line 257-293: The CSR IP validation is inconsistent:
validateServingIdentities currently only checks nil and IsUnspecified while
hasValidIPAddress (unused) also rejects loopback and multicast; update
validateServingIdentities to call hasValidIPAddress(ip) for each
certReq.IPAddresses and return a clear error (e.g., "CSR contains invalid IP
address: <ip>") when it returns false, or alternatively remove the unused
hasValidIPAddress if you prefer the simpler checks—prefer the former to
centralize IP logic in hasValidIPAddress.

In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 262-295: In setCondition, avoid appending a duplicate when the
condition already exists by introducing a found flag to distinguish "condition
found but unchanged" from "condition not found"; iterate over
dpfhcp.Status.Conditions, set found=true when existing.Type ==
provisioningv1alpha1.CSRAutoApprovalActive, and only replace the element and set
changed=true if existing.Status != status || existing.Reason != reason, then
break; after the loop, if !found append the new condition (and set
changed=true), and finally call a.mgmtClient.Status().Update(ctx, dpfhcp) only
if changed.
- Around line 170-193: The approvedCount is incremented for any nil error from
processCSR, but processCSR currently returns nil for both skipped
(validation-failed) and actually-approved CSRs; change processCSR signature to
return (bool, error) where the bool indicates whether the CSR was approved
(true) or skipped (false), update its return sites to return (false, nil) for
validation skips and (true, nil) for real approvals, then update the callers in
the loops that currently call processCSR (the bootstrap and serving CSR loops)
to unpack (approved, err) and only increment approvedCount when approved ==
true; ensure all usages of processCSR (including any tests) are updated
accordingly.

In `@internal/controller/csrapproval/hostname_test.go`:
- Around line 297-326: The helper functions createTestCSR,
createTestCSRWithOrganization, and createServingCSRWithDNS are double-encoding
the CSR (PEM then base64) but parseCSRRequest/ExtractHostname expect raw PEM
bytes; remove the base64.StdEncoding.EncodeToString(...) call in each helper and
return the PEM bytes directly (e.g., return pemBlock or []byte(pemBlock)) so
tests supply plain PEM instead of base64-encoded PEM.

---

Duplicate comments:
In `@internal/controller/csrapproval/csr_content_validation.go`:
- Around line 211-232: The parseCSRRequest function is incorrectly
base64-decoding csr.Spec.Request (which is already raw []byte); remove the call
to base64.StdEncoding.DecodeString and instead pass csr.Spec.Request directly to
pem.Decode, adjust variable names accordingly (e.g., remove csrBlock), and
delete the now-unused encoding/base64 import; keep the existing error checks and
the call to x509.ParseCertificateRequest on pemBlock.Bytes.

---

Nitpick comments:
In `@internal/controller/csrapproval/client.go`:
- Around line 50-68: GetHostedClusterClient currently returns cached entries
from cm.hcClients without validating them; call the existing TestConnection
method on the cached kubernetes.Clientset (keyed by namespace+"/"+name) before
returning and if TestConnection returns an error, call InvalidateClient for that
key and fall through to createHostedClusterClient to rebuild the client; ensure
createHostedClusterClient results are only cached on successful creation and
that errors from TestConnection lead to recreating the client instead of
returning a stale one.

In `@internal/controller/csrapproval/csr_owner_validation.go`:
- Around line 89-103: Replace the List+scan approach in dpuExists and nodeExists
with direct Get calls: for dpuExists use v.mgmtClient.Get(ctx,
types.NamespacedName{Name: hostname, Namespace: v.dpuNamespace},
&dpuprovisioningv1alpha1.DPU{}) and return true,nil if found, return false,nil
on apierrors.IsNotFound(err) and return false,err for other errors; for
nodeExists (cluster-scoped) use v.mgmtClient.Get(ctx, types.NamespacedName{Name:
hostname}, &corev1.Node{}) with the same NotFound/other-error handling and
return true,nil when present. Ensure you import types and apierrors if not
already present.

In `@internal/controller/csrapproval/hostname_test.go`:
- Around line 297-391: The three helpers createTestCSR,
createTestCSRWithOrganization, and createServingCSRWithDNS duplicate key
generation, CSR creation, PEM encoding, and base64 encoding; replace them with a
single parameterized helper (e.g., buildCSR(cn string, orgs []string, dnsNames
[]string) []byte) that generates the RSA key, constructs an
x509.CertificateRequest using Subject.CommonName and Subject.Organization from
orgs, sets DNSNames from dnsNames, creates the CSR, PEM-encodes it and returns
base64 bytes; update or remove the three original functions to call or forward
to buildCSR to eliminate duplication.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (5)
internal/controller/csrapproval/csr_owner_validation.go (2)

119-133: Same improvement applies to nodeExists — use a direct Get instead of listing all nodes.

Listing all nodes in the hosted cluster on every CSR is expensive, especially as clusters grow. A direct Get by name is more efficient:

♻️ Proposed refactor
 func (v *Validator) nodeExists(ctx context.Context, hostname string) (bool, error) {
-	nodeList, err := v.hostedClient.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
-	if err != nil {
-		return false, err
-	}
-
-	for _, node := range nodeList.Items {
-		if node.Name == hostname {
-			return true, nil
-		}
-	}
-
-	return false, nil
+	_, err := v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{})
+	if err != nil {
+		if apierrors.IsNotFound(err) {
+			return false, nil
+		}
+		return false, err
+	}
+	return true, nil
 }

You'll need to import apierrors "k8s.io/apimachinery/pkg/api/errors".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csr_owner_validation.go` around lines 119 -
133, Replace the expensive list in nodeExists with a direct node Get: call
v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{}) (in
function nodeExists) and handle errors using apierrors.IsNotFound(err) to return
(false, nil) when the node doesn't exist, otherwise return (false, err) for
other errors; add the import apierrors "k8s.io/apimachinery/pkg/api/errors".

104-117: Consider using a direct Get instead of listing all DPUs.

dpuExists lists every DPU in the namespace and iterates to find a name match. Since you're matching by dpu.Name == hostname, a direct Get by name would be O(1) vs O(n) and avoids transferring the full list:

♻️ Proposed refactor
 func (v *Validator) dpuExists(ctx context.Context, hostname string) (bool, error) {
-	dpuList := &dpuprovisioningv1alpha1.DPUList{}
-	if err := v.mgmtClient.List(ctx, dpuList, client.InNamespace(v.dpuNamespace)); err != nil {
-		return false, err
-	}
-
-	for _, dpu := range dpuList.Items {
-		if dpu.Name == hostname {
-			return true, nil
-		}
-	}
-
-	return false, nil
+	dpu := &dpuprovisioningv1alpha1.DPU{}
+	key := client.ObjectKey{Namespace: v.dpuNamespace, Name: hostname}
+	if err := v.mgmtClient.Get(ctx, key, dpu); err != nil {
+		if client.IgnoreNotFound(err) == nil {
+			return false, nil
+		}
+		return false, err
+	}
+	return true, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csr_owner_validation.go` around lines 104 -
117, Replace the List+loop in Validator.dpuExists with a direct Get by
namespaced name: construct a dpuprovisioningv1alpha1.DPU object and call
v.mgmtClient.Get(ctx, types.NamespacedName{Name: hostname, Namespace:
v.dpuNamespace}, &dpu); if Get returns nil return (true, nil), if it returns a
NotFound error (use k8s.io/apimachinery/pkg/api/errors.IsNotFound) return
(false, nil), otherwise return (false, err). This avoids iterating over dpuList
and keeps the function signature and return semantics the same.
internal/controller/csrapproval/csrapproval.go (1)

260-261: A new Validator is instantiated per CSR — consider reusing it across the batch.

NewValidator is created inside processCSR, which is called for every pending CSR. The Validator struct is lightweight, but it triggers list operations (dpuExists, nodeExists) that query the API for each CSR. Creating the Validator once in processPendingCSRs and passing it to processCSR would allow potential future optimizations (e.g., caching the DPU list once per polling cycle).

♻️ Proposed refactor sketch
 // In processPendingCSRs, create the validator once:
+	validator := NewValidator(a.mgmtClient, hcClient, dpfhcp.Spec.DPUClusterRef.Namespace)
 
 // Then pass it to processCSR:
-func (a *CSRApprover) processCSR(ctx context.Context, dpfhcp *provisioningv1alpha1.DPFHCPProvisioner, hcClient *kubernetes.Clientset, csr *certv1.CertificateSigningRequest, csrType string) error {
+func (a *CSRApprover) processCSR(ctx context.Context, dpfhcp *provisioningv1alpha1.DPFHCPProvisioner, hcClient *kubernetes.Clientset, csr *certv1.CertificateSigningRequest, csrType string, validator *Validator) error {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csrapproval.go` around lines 260 - 261, The
Validator is being created per CSR inside processCSR (via NewValidator), causing
repeated list queries; instead, create a single Validator in processPendingCSRs
and pass it into processCSR as a parameter (e.g., add a *Validator arg to
processCSR and remove the NewValidator call inside it). Update all callers of
processCSR accordingly, and ensure the existing methods on Validator (dpuExists,
nodeExists) are used from the shared instance so future caching or batched list
optimizations can be added centrally.
internal/controller/csrapproval/csr_content_validation.go (1)

162-208: Consider extracting a generic usage validation helper to reduce duplication.

validateBootstrapUsages and validateServingUsages are nearly identical, differing only in the required usage set. A single validateRequiredUsages(csr, required) helper would eliminate the duplication.

♻️ Proposed refactor
+func validateRequiredUsages(csr *certv1.CertificateSigningRequest, required []certv1.KeyUsage) error {
+	usageSet := make(map[certv1.KeyUsage]bool, len(required))
+	for _, u := range required {
+		usageSet[u] = false
+	}
+	for _, usage := range csr.Spec.Usages {
+		if _, ok := usageSet[usage]; ok {
+			usageSet[usage] = true
+		}
+	}
+	for usage, found := range usageSet {
+		if !found {
+			return fmt.Errorf("missing required usage: %s", usage)
+		}
+	}
+	return nil
+}
+
 func validateBootstrapUsages(csr *certv1.CertificateSigningRequest) error {
-	requiredUsages := map[certv1.KeyUsage]bool{...}
-	...
+	return validateRequiredUsages(csr, []certv1.KeyUsage{
+		certv1.UsageDigitalSignature,
+		certv1.UsageClientAuth,
+	})
 }
 
 func validateServingUsages(csr *certv1.CertificateSigningRequest) error {
-	requiredUsages := map[certv1.KeyUsage]bool{...}
-	...
+	return validateRequiredUsages(csr, []certv1.KeyUsage{
+		certv1.UsageDigitalSignature,
+		certv1.UsageServerAuth,
+	})
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csr_content_validation.go` around lines 162 -
208, Extract the duplicated logic in validateBootstrapUsages and
validateServingUsages into a single helper validateRequiredUsages that takes the
CSR and the set/list of required usages (e.g. func validateRequiredUsages(csr
*certv1.CertificateSigningRequest, required []certv1.KeyUsage) error or a map),
move the loop that marks found usages and the final missing-check into that
helper, and then have validateBootstrapUsages and validateServingUsages call
validateRequiredUsages with their respective required usage sets
(UsageDigitalSignature+UsageClientAuth and
UsageDigitalSignature+UsageServerAuth) so behavior and error messages
(fmt.Errorf("missing required usage: %s", usage)) remain the same.
internal/controller/csrapproval/controller.go (1)

34-39: Scheme and Recorder fields are unused in this controller.

Scheme and Recorder are set during wiring in main.go but never referenced inside Reconcile or SetupWithManager. If they exist only for structural consistency with the main reconciler, consider documenting that intent; otherwise, removing them avoids confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/controller.go` around lines 34 - 39, The
CSRApprovalReconciler struct contains unused fields Scheme and Recorder; either
remove these fields from CSRApprovalReconciler or explicitly document why
they’re retained (e.g., wiring consistency with main.go) and reference their
intended use. Locate the CSRApprovalReconciler type definition and delete Scheme
*runtime.Scheme and Recorder record.EventRecorder if they are not referenced by
Reconcile or SetupWithManager, or add a comment above the struct explaining they
are intentionally kept for wiring in main.go and will be used in future logic;
update any constructor/wiring code in main.go accordingly to avoid linter
confusion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/main.go`:
- Around line 248-249: The CSR approver is being initialized with
mgr.GetEventRecorderFor("dpfhcpprovisioner-controller") and mgr.GetClient(),
which misattributes events to the main controller and is inconsistent with the
reconciler; change the NewCSRApprover call that creates csrApprover to use the
same local client variable (client) and use
mgr.GetEventRecorderFor("csrapproval-controller") so events are attributed
correctly and the client reference matches the reconciler (update the call to
csrapproval.NewCSRApprover(...)).

In `@internal/controller/csrapproval/client.go`:
- Around line 203-210: TestConnection currently calls
clientset.Discovery().ServerVersion() which blocks and ignores ctx (and relies
on config.Timeout being non-zero); update TestConnection to respect the provided
context by using a context-aware discovery REST call (e.g. use
clientset.Discovery().RESTClient().Get().AbsPath("/version").Do(ctx).Into(&versionInfo)
or the equivalent context-taking client-go call) or, if you cannot change the
callsite, ensure the REST config used to build clientset has a non-zero Timeout
(config.Timeout > 0) so the call is bounded; modify the TestConnection function
(and related client creation) to use the context-aware request path or guarantee
config.Timeout is set to a sensible non-zero duration when constructing the
kubernetes.Clientset.

In `@internal/controller/csrapproval/controller.go`:
- Around line 65-71: Add a nil-check for r.Approver before calling its methods
to avoid nil-pointer panics: in the deletion handling and in the code that calls
r.Approver.ProcessCSRs (and the block covering the current ProcessCSRs usage
around lines 87-93), wrap the calls with if r.Approver != nil { ... } else { log
a warning/info } so StopCSRWatch and ProcessCSRs are only invoked when Approver
is non-nil; reference the Approver field and its methods StopCSRWatch and
ProcessCSRs to locate and update the calls.

In `@internal/controller/csrapproval/csr_owner_validation.go`:
- Around line 56-101: The current ValidateCSR implementation rejects bootstrap
CSRs when nodeExists is true; update logic to permit legitimate re-bootstrap
scenarios by inspecting the existing Node's status or age instead of outright
rejecting: change or augment nodeExists to return the Node object or its Ready
condition/timestamps (e.g., modify nodeExists to return (*v1.Node, bool, error)
or add a new getNode helper), then in ValidateCSR when isBootstrapCSR &&
nodeExists check the Node's Ready condition (allow if NotReady) or compare
Node.CreationTimestamp/LastTransitionTime against a configurable gracePeriod to
allow re-bootstrap for older/stale Nodes; keep existing rejection for cases that
indicate a true duplicate join (Node Ready and recent within grace period).
Ensure you reference/adjust ValidateCSR, nodeExists (or new getNode), and any
configurable gracePeriod/dpuNamespace fields accordingly.

In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 74-84: The cached hosted-cluster client can be stale when
TestConnection fails, so on TestConnection error you must invalidate the cached
client and trigger a retry path: call
a.clientManager.InvalidateHostedClusterClient(ctx, dpfhcp.Namespace,
dpfhcp.Name) (or add such a method if missing) when TestConnection returns an
error for the client returned by a.clientManager.GetHostedClusterClient, then
record the condition/event and requeue (or return so the next reconciliation
will create a fresh client); ensure invalidation is performed before returning
from the handler that processes TestConnection failures.
- Around line 86-95: TestConnection failing currently logs and returns without
requeue but leaves the possibly-stale cached hosted-cluster client (hcClient)
intact; update the reconciler so when TestConnection(ctx, hcClient) returns an
error you explicitly invalidate the cached client before returning (e.g., clear
the controller's cached client field or call a new invalidateHostedClusterClient
method), so the next reconciliation will recreate a fresh client; keep the
existing setCondition and Event calls and then return as before.

---

Duplicate comments:
In `@internal/controller/csrapproval/client.go`:
- Around line 31-39: ClientManager's hcClients map is unsynchronized and can
cause concurrent map read/write panics; add a sync.RWMutex field (e.g.,
hcClientsMu) to ClientManager and use it to protect all accesses to hcClients:
acquire hcClientsMu.RLock() for readers and RUnlock(), and
hcClientsMu.Lock()/Unlock() for writers (creation, assignment, deletion). Update
any methods that reference hcClients (including constructors, getters, setters,
and cleanup code) to use the mutex, and ensure the mutex is initialized as part
of the ClientManager struct so the map accesses are always thread-safe.
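The guarded-cache pattern above can be sketched with the standard library alone. `ClientManager` and `hcClients` mirror the review's names, but the cached value here is a placeholder type rather than a real `*kubernetes.Clientset`:

```go
package main

import (
	"fmt"
	"sync"
)

// hostedClient stands in for the real hosted-cluster clientset.
type hostedClient struct{ key string }

// ClientManager caches hosted-cluster clients keyed by "namespace/name".
type ClientManager struct {
	mu        sync.RWMutex
	hcClients map[string]*hostedClient
}

// Get returns a cached client, creating one under the write lock on a miss.
func (cm *ClientManager) Get(namespace, name string) *hostedClient {
	key := namespace + "/" + name

	cm.mu.RLock()
	c, ok := cm.hcClients[key]
	cm.mu.RUnlock()
	if ok {
		return c
	}

	cm.mu.Lock()
	defer cm.mu.Unlock()
	// Re-check: another goroutine may have created the entry meanwhile.
	if c, ok := cm.hcClients[key]; ok {
		return c
	}
	if cm.hcClients == nil { // lazy init under the write lock
		cm.hcClients = map[string]*hostedClient{}
	}
	c = &hostedClient{key: key}
	cm.hcClients[key] = c
	return c
}

// Invalidate drops a cached client so the next Get rebuilds it.
func (cm *ClientManager) Invalidate(namespace, name string) {
	cm.mu.Lock()
	defer cm.mu.Unlock()
	delete(cm.hcClients, namespace+"/"+name)
}

func main() {
	cm := &ClientManager{}
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ { // concurrent readers/writers: exercise with -race
		wg.Add(1)
		go func() {
			defer wg.Done()
			cm.Get("clusters", "doca")
			cm.Invalidate("clusters", "doca")
		}()
	}
	wg.Wait()
	fmt.Println(cm.Get("clusters", "doca").key)
}
```

An `Invalidate` method like this also supports the TestConnection-failure fixes above: on a failed probe, invalidate the entry so the next reconciliation rebuilds a fresh client.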
- Around line 104-107: The client config currently sets config.Timeout = 0 which
allows API calls to hang indefinitely; change it to a finite duration (e.g., 15
seconds) by assigning config.Timeout to a time.Duration value (15 * time.Second)
so reconciler calls won’t stall, and ensure the time package is imported; leave
config.QPS and config.Burst as-is.

In `@internal/controller/csrapproval/csr_content_validation.go`:
- Around line 269-290: The inline IP checks inside validateServingIdentities
should reuse hasValidIPAddress to avoid duplicate logic: replace the current
loop that only checks ip == nil and ip.IsUnspecified() with a call to
hasValidIPAddress(ip) and return an error that reflects the full validation
(e.g., "CSR contains invalid IP address: %s") when it returns false;
alternatively if you prefer the simpler checks, remove the unused
hasValidIPAddress function to eliminate dead code—make sure only one of these
two approaches remains.

In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 205-231: The loop increments approvedCount whenever processCSR
returns nil, but processCSR currently returns nil for both approved and skipped
CSRs, causing over-counting; change processCSR signature to return (bool
approved, error) (or an equivalent enum/status) and update callers in the
bootstrapCSRs and servingCSRs loops to only increment approvedCount when
approved == true and err == nil; also update any other callers of processCSR to
handle the new return value and preserve existing error logging behavior (use
symbols: processCSR, approvedCount, bootstrapCSRs, servingCSRs).
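The counting fix can be sketched with stub types (the `csr` struct here is a stand-in for `certv1.CertificateSigningRequest`): `processCSR` returns `(approved, err)` so the caller distinguishes skipped from approved instead of incrementing on every nil error:

```go
package main

import "fmt"

// csr is a minimal stand-in for certv1.CertificateSigningRequest.
type csr struct {
	name  string
	valid bool
}

// processCSR returns (approved, err). (false, nil) means the CSR was
// skipped after failing validation — a skip, not an error.
func processCSR(c csr) (bool, error) {
	if !c.valid {
		return false, nil
	}
	// ... approve the CSR against the hosted cluster here ...
	return true, nil
}

// approveAll counts only genuine approvals, preserving per-CSR error logging.
func approveAll(csrs []csr) (approvedCount int) {
	for _, c := range csrs {
		approved, err := processCSR(c)
		if err != nil {
			fmt.Println("error processing", c.name, ":", err)
			continue
		}
		if approved {
			approvedCount++
		}
	}
	return approvedCount
}

func main() {
	csrs := []csr{{"csr-a", true}, {"csr-b", false}, {"csr-c", true}}
	fmt.Println(approveAll(csrs)) // 2: csr-b was skipped, not approved
}
```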

---

Nitpick comments:
In `@internal/controller/csrapproval/controller.go`:
- Around line 34-39: The CSRApprovalReconciler struct contains unused fields
Scheme and Recorder; either remove these fields from CSRApprovalReconciler or
explicitly document why they’re retained (e.g., wiring consistency with main.go)
and reference their intended use. Locate the CSRApprovalReconciler type
definition and delete Scheme *runtime.Scheme and Recorder record.EventRecorder
if they are not referenced by Reconcile or SetupWithManager, or add a comment
above the struct explaining they are intentionally kept for wiring in main.go
and will be used in future logic; update any constructor/wiring code in main.go
accordingly to avoid linter confusion.

In `@internal/controller/csrapproval/csr_content_validation.go`:
- Around line 162-208: Extract the duplicated logic in validateBootstrapUsages
and validateServingUsages into a single helper validateRequiredUsages that takes
the CSR and the set/list of required usages (e.g. func
validateRequiredUsages(csr *certv1.CertificateSigningRequest, required
[]certv1.KeyUsage) error or a map), move the loop that marks found usages and
the final missing-check into that helper, and then have validateBootstrapUsages
and validateServingUsages call validateRequiredUsages with their respective
required usage sets (UsageDigitalSignature+UsageClientAuth and
UsageDigitalSignature+UsageServerAuth) so behavior and error messages
(fmt.Errorf("missing required usage: %s", usage)) remain the same.
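A sketch of the proposed `validateRequiredUsages` helper, using a local `KeyUsage` type in place of `certv1.KeyUsage` so the example is self-contained; the error shape matches the review's `fmt.Errorf("missing required usage: %s", usage)`:

```go
package main

import "fmt"

// KeyUsage stands in for certv1.KeyUsage.
type KeyUsage string

const (
	UsageDigitalSignature KeyUsage = "digital signature"
	UsageClientAuth       KeyUsage = "client auth"
	UsageServerAuth       KeyUsage = "server auth"
)

// validateRequiredUsages reports the first required usage missing from got.
func validateRequiredUsages(got, required []KeyUsage) error {
	have := make(map[KeyUsage]bool, len(got))
	for _, u := range got {
		have[u] = true
	}
	for _, u := range required {
		if !have[u] {
			return fmt.Errorf("missing required usage: %s", u)
		}
	}
	return nil
}

// The two thin wrappers keep existing call sites unchanged.
func validateBootstrapUsages(got []KeyUsage) error {
	return validateRequiredUsages(got, []KeyUsage{UsageDigitalSignature, UsageClientAuth})
}

func validateServingUsages(got []KeyUsage) error {
	return validateRequiredUsages(got, []KeyUsage{UsageDigitalSignature, UsageServerAuth})
}

func main() {
	fmt.Println(validateBootstrapUsages([]KeyUsage{UsageDigitalSignature, UsageClientAuth})) // <nil>
	fmt.Println(validateServingUsages([]KeyUsage{UsageDigitalSignature}))
}
```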

In `@internal/controller/csrapproval/csr_owner_validation.go`:
- Around line 119-133: Replace the expensive list in nodeExists with a direct
node Get: call v.hostedClient.CoreV1().Nodes().Get(ctx, hostname,
metav1.GetOptions{}) (in function nodeExists) and handle errors using
apierrors.IsNotFound(err) to return (false, nil) when the node doesn't exist,
otherwise return (false, err) for other errors; add the import apierrors
"k8s.io/apimachinery/pkg/api/errors".
- Around line 104-117: Replace the List+loop in Validator.dpuExists with a
direct Get by namespaced name: construct a dpuprovisioningv1alpha1.DPU object
and call v.mgmtClient.Get(ctx, types.NamespacedName{Name: hostname, Namespace:
v.dpuNamespace}, &dpu); if Get returns nil return (true, nil), if it returns a
NotFound error (use k8s.io/apimachinery/pkg/api/errors.IsNotFound) return
(false, nil), otherwise return (false, err). This avoids iterating over dpuList
and keeps the function signature and return semantics the same.

In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 260-261: The Validator is being created per CSR inside processCSR
(via NewValidator), causing repeated list queries; instead, create a single
Validator in processPendingCSRs and pass it into processCSR as a parameter
(e.g., add a *Validator arg to processCSR and remove the NewValidator call
inside it). Update all callers of processCSR accordingly, and ensure the
existing methods on Validator (dpuExists, nodeExists) are used from the shared
instance so future caching or batched list optimizations can be added centrally.


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (6)
internal/controller/csrapproval/csr_owner_validation.go (3)

88-100: Drop the unnecessary else after early return — idiomatic Go.

The prior if on line 88 already returns, so the else block is never needed for control flow; it only adds indentation noise and is non-idiomatic Go.

♻️ Proposed refactor
-	} else {
-		// Serving CSR: node SHOULD exist (already joined via bootstrap CSR)
-		if !nodeExists {
-			return &ValidationResult{
-				Valid:  false,
-				Reason: fmt.Sprintf("node %s does not exist yet in hosted cluster", hostname),
-			}, nil
-		}
-		return &ValidationResult{
-			Valid:  true,
-			Reason: "DPU exists and node already joined",
-		}, nil
-	}
+	// Serving CSR: node SHOULD exist (already joined via bootstrap CSR)
+	if !nodeExists {
+		return &ValidationResult{
+			Valid:  false,
+			Reason: fmt.Sprintf("node %s does not exist yet in hosted cluster", hostname),
+		}, nil
+	}
+	return &ValidationResult{
+		Valid:  true,
+		Reason: "DPU exists and node already joined",
+	}, nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csr_owner_validation.go` around lines 88 -
100, Remove the unnecessary else block following the early return: in the
function that constructs a ValidationResult using variables nodeExists and
hostname (the block that currently returns fmt.Sprintf("node %s does not exist
yet in hosted cluster", hostname) inside an else), delete the else and outdent
its contents so the nodeExists check and the two ValidationResult returns (the
"node ... does not exist yet in hosted cluster" and "DPU exists and node already
joined") become the natural fall-through after the prior return; keep the same
return values and use the same ValidationResult struct.

56-101: No unit tests for ValidateCSROwner — security-critical path lacks coverage.

The related files include csr_content_validation_test.go and hostname_test.go, but no csr_owner_validation_test.go appears in this PR. Given that ValidateCSROwner is the gatekeeper deciding whether a CSR gets auto-approved, the following cases should be covered at minimum:

  • DPU missing → invalid result
  • Bootstrap CSR, node absent → valid
  • Bootstrap CSR, node present → invalid
  • Serving CSR, node present → valid
  • Serving CSR, node absent → invalid
  • dpuExists/nodeExists errors propagate correctly

Would you like me to generate a csr_owner_validation_test.go covering these cases using the controller-runtime fake client and a fake kubernetes.Interface? I can open a new issue to track this as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csr_owner_validation.go` around lines 56 -
101, Add unit tests for ValidateCSROwner covering the six scenarios listed: DPU
missing returns Valid=false with the expected Reason; bootstrap CSR with node
absent returns Valid=true; bootstrap CSR with node present returns Valid=false;
serving CSR with node present returns Valid=true; serving CSR with node absent
returns Valid=false; and both dpuExists/nodeExists returning errors propagate
the error. Create csr_owner_validation_test.go and use controller-runtime's fake
client to control DPU existence and a fake kubernetes.Interface (or a client
stub) to control nodeExists behavior; call ValidateCSROwner and assert the
ValidationResult.Valid and Reason values for each case and that errors from
dpuExists/nodeExists are returned (not swallowed). Ensure tests reference the
Validator type and its methods dpuExists and nodeExists (mock/stub them or
inject a test-friendly interface) so behavior is deterministic.

104-133: Both helpers do an O(n) full-list scan; prefer a direct point-lookup instead.

dpuExists lists every DPU in the namespace and nodeExists lists every Node in the hosted cluster, both only to match a single name. A direct Get is O(1), avoids transferring the full list over the wire, and is simpler to read. Since DPU names map 1:1 to hostnames, there is no ambiguity in the lookup key.

♻️ Proposed refactor

Add apierrors "k8s.io/apimachinery/pkg/api/errors" to the import block, then replace both helpers:

+import (
+	"context"
+	"fmt"
+
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/types"
+	"k8s.io/client-go/kubernetes"
+	"sigs.k8s.io/controller-runtime/pkg/client"
+
+	dpuprovisioningv1alpha1 "github.com/nvidia/doca-platform/api/provisioning/v1alpha1"
+)
 func (v *Validator) dpuExists(ctx context.Context, hostname string) (bool, error) {
-	dpuList := &dpuprovisioningv1alpha1.DPUList{}
-	if err := v.mgmtClient.List(ctx, dpuList, client.InNamespace(v.dpuNamespace)); err != nil {
-		return false, err
-	}
-
-	for _, dpu := range dpuList.Items {
-		if dpu.Name == hostname {
-			return true, nil
-		}
-	}
-
-	return false, nil
+	dpu := &dpuprovisioningv1alpha1.DPU{}
+	err := v.mgmtClient.Get(ctx, types.NamespacedName{Name: hostname, Namespace: v.dpuNamespace}, dpu)
+	if err != nil {
+		if apierrors.IsNotFound(err) {
+			return false, nil
+		}
+		return false, err
+	}
+	return true, nil
 }
 func (v *Validator) nodeExists(ctx context.Context, hostname string) (bool, error) {
-	nodeList, err := v.hostedClient.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
+	_, err := v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{})
 	if err != nil {
-		return false, err
-	}
-
-	for _, node := range nodeList.Items {
-		if node.Name == hostname {
-			return true, nil
+		if apierrors.IsNotFound(err) {
+			return false, nil
 		}
+		return false, err
 	}
-
-	return false, nil
+	return true, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csr_owner_validation.go` around lines 104 -
133, Replace the full-list scans in dpuExists and nodeExists with direct Get
calls: import k8s apierrors and call v.mgmtClient.Get(ctx,
client.ObjectKey{Namespace: v.dpuNamespace, Name: hostname},
&dpuprovisioningv1alpha1.DPU{}) in dpuExists and use
v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{}) in
nodeExists; treat apierrors.IsNotFound as a false/no-error result and return
other errors upward so you avoid O(n) list operations and return correct
boolean+error behavior for both helpers.
internal/controller/csrapproval/client.go (1)

159-191: replaceServerWithInternalEndpoint has two unused parameters and a package-name shadow.

  1. ctx context.Context and mgmtClient client.Client are accepted as parameters but never referenced in the function body — they are dead parameters.
  2. context, ok := kubeconfig.Contexts[currentContext] (Line 170) declares a local variable named context that shadows the imported "context" standard library package within this function's scope. Rename it to kubeconfigCtx or ctxEntry to avoid future confusion.
♻️ Proposed cleanup
-func replaceServerWithInternalEndpoint(ctx context.Context, mgmtClient client.Client, kubeconfig *clientcmdapi.Config, hostedClusterNamespace, hostedClusterName string) error {
+func replaceServerWithInternalEndpoint(kubeconfig *clientcmdapi.Config, hostedClusterNamespace, hostedClusterName string) error {
 	if kubeconfig == nil {
 		return fmt.Errorf("kubeconfig is nil")
 	}

 	currentContext := kubeconfig.CurrentContext
 	if currentContext == "" {
 		return fmt.Errorf("kubeconfig has no current context")
 	}

-	context, ok := kubeconfig.Contexts[currentContext]
+	kubeconfigCtx, ok := kubeconfig.Contexts[currentContext]
 	if !ok {
 		return fmt.Errorf("context %s not found in kubeconfig", currentContext)
 	}

-	clusterName := context.Cluster
+	clusterName := kubeconfigCtx.Cluster
 	...
 }

And update the call site in createHostedClusterClient:

-	if err := replaceServerWithInternalEndpoint(ctx, cm.mgmtClient, kubeconfig, namespace, name); err != nil {
+	if err := replaceServerWithInternalEndpoint(kubeconfig, namespace, name); err != nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/client.go` around lines 159 - 191, The
function replaceServerWithInternalEndpoint currently accepts unused parameters
ctx and mgmtClient and declares a local variable named context that shadows the
imported context package; remove the unused parameters from
replaceServerWithInternalEndpoint's signature and update its callers (e.g.,
createHostedClusterClient) to match the new signature, and rename the local
variable context (e.g., to kubeconfigCtx or ctxEntry) where
kubeconfig.Contexts[currentContext] is read to avoid shadowing the standard
library package.
api/v1alpha1/dpfhcpprovisioner_types.go (1)

238-239: Consider a more descriptive value for ReasonCSRApprovalActive.

The value "Active" is ambiguous in isolation — kubectl get dpfhcp -o yaml will show only the Reason field without the condition type context. Every other reason constant in this file uses a self-describing string (e.g., "AllComponentsOperational", "HostedClusterNotReady"). A value like "CSRApprovalActive" or "ApprovalControllerRunning" would be consistent with that convention.

✏️ Proposed rename
-	ReasonCSRApprovalActive string = "Active"
+	ReasonCSRApprovalActive string = "CSRApprovalActive"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/v1alpha1/dpfhcpprovisioner_types.go` around lines 238 - 239, The constant
ReasonCSRApprovalActive currently uses the ambiguous value "Active"; change its
string value to a more descriptive identifier such as "CSRApprovalActive" (i.e.,
keep the constant name ReasonCSRApprovalActive but set its value to
"CSRApprovalActive"), and update any usages/tests/fixtures that assert on the
prior "Active" literal (search for ReasonCSRApprovalActive and any string
comparisons to "Active") so output in kubectl shows the self-describing reason.
internal/controller/csrapproval/csrapproval.go (1)

264-266: NewValidator allocated on every CSR — consider hoisting to processPendingCSRs.

All three constructor arguments (a.mgmtClient, hcClient, dpfhcp.Spec.DPUClusterRef.Namespace) are invariant across the entire CSR batch. Creating one Validator per CSR is functionally harmless but wasteful. Moving creation to processPendingCSRs and threading it into processCSR (or making it a field) would be cleaner.

♻️ Proposed refactor
-// processCSR processes a single CSR (validate and approve if valid)
-func (a *CSRApprover) processCSR(ctx context.Context, dpfhcp *provisioningv1alpha1.DPFHCPProvisioner, hcClient *kubernetes.Clientset, csr *certv1.CertificateSigningRequest, csrType string) error {
+// processCSR processes a single CSR (validate and approve if valid)
+func (a *CSRApprover) processCSR(ctx context.Context, validator *Validator, hcClient *kubernetes.Clientset, csr *certv1.CertificateSigningRequest, csrType string, dpfhcp *provisioningv1alpha1.DPFHCPProvisioner) error {

And in processPendingCSRs, create the validator once before the loops:

+	validator := NewValidator(a.mgmtClient, hcClient, dpfhcp.Spec.DPUClusterRef.Namespace)
+
 	// Process bootstrap CSRs
 	for _, csr := range bootstrapCSRs {
-		if err := a.processCSR(ctx, dpfhcp, hcClient, &csr, "bootstrap"); err != nil {
+		if err := a.processCSR(ctx, validator, hcClient, &csr, "bootstrap", dpfhcp); err != nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csrapproval.go` around lines 264 - 266,
NewValidator is being allocated per CSR; instead, create the Validator once in
processPendingCSRs using the invariant arguments (a.mgmtClient, hcClient,
dpfhcp.Spec.DPUClusterRef.Namespace) and pass that single instance into
processCSR (or store it as a field on the approver struct) so processCSR calls
validator.ValidateCSROwner(ctx, hostname, isBootstrapCSR) without re-allocating;
update function signatures (processCSR) or the approver struct accordingly and
remove the per-CSR NewValidator call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@internal/controller/csrapproval/client.go`:
- Around line 31-39: ClientManager's hcClients map is accessed concurrently and
must be protected: add a sync.RWMutex field (e.g., mu sync.RWMutex) to the
ClientManager struct and update all places that read or write hcClients — use
mu.RLock()/RUnlock() around reads in GetHostedClusterClient and any cache lookup
paths, and use mu.Lock()/Unlock() around writes such as creating or deleting
entries in InvalidateClient and map initialization (ensure hcClients is
lazy-initialized under the write lock). Also scan for any other methods
referencing hcClients and wrap their map accesses with the appropriate lock to
eliminate the data race.
- Around line 194-200: TestConnection currently ignores the passed ctx because
Discovery().ServerVersion() is not context-aware; change TestConnection to
perform a context-bound REST GET against the API /version using the discovery
REST client (e.g.
clientset.Discovery().RESTClient().Get().AbsPath("/version").Do(ctx).Into(&version.Info{}))
so the request respects ctx (or alternatively ensure the client REST config has
a non-zero Timeout). Update the function TestConnection to decode the response
into a version.Info and return any error from the Do(ctx) call instead of
calling ServerVersion().
- Around line 105-108: The client config currently sets config.Timeout = 0 which
allows List/Update calls to block forever; change this to a finite timeout (e.g.
10–30s) so CSR polling and approve operations fail fast when the hosted API is
unresponsive. Update the code that builds the rest.Config (the config variable
in internal/controller/csrapproval/client.go) to set a reasonable non-zero
timeout and ensure the clientset creation (where this config is used) inherits
that timeout so List/Update calls in the reconcile loop (30-second loop) won't
hang indefinitely.

In `@internal/controller/csrapproval/csr_content_validation.go`:
- Around line 275-295: The IP validation helper hasValidIPAddress is unused
while validateServingIdentities currently only checks for nil and
IsUnspecified(); update the certificate-serving identity loop that iterates
certReq.IPAddresses in validateServingIdentities to call hasValidIPAddress(ip)
and return an error when it returns false (including the offending IP in the
message), or if you prefer remove hasValidIPAddress and consolidate its stricter
checks (IsUnspecified, IsLoopback, IsMulticast) directly into
validateServingIdentities so the IP validation logic is consistent and not
duplicated.

In `@internal/controller/csrapproval/csr_owner_validation.go`:
- Around line 76-83: The current isBootstrapCSR path rejects all bootstrap CSRs
when nodeExists is true; instead, change the logic in csr_owner_validation.go
(the isBootstrapCSR branch that returns ValidationResult) to only reject when
the existing Node already has an active/valid kubelet client certificate, and
allow the bootstrap CSR when the Node exists but does not have such a valid
client cert (i.e., implement a helper like isNodeHasValidKubeletCert(node) and
call it before returning the ValidationResult). Keep the same ValidationResult
shape but replace the plain nodeExists check with a call to the helper so
legitimate re-bootstrap/cert-recovery CSRs are accepted while nodes that already
have active kubelet certs are still blocked.

In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 89-98: When TestConnection(ctx, hcClient) returns an error inside
the reconcile block in csrapproval.go, call
a.clientManager.InvalidateClient(...) for the same HostedCluster reference
before setting condition/recording event/returning so the stale cached client is
cleared; update the error branch that currently logs the error and calls
a.setCondition / a.recorder.Event to first invalidate the cached client obtained
via GetHostedClusterClient (use the same cluster identifier used to retrieve the
client) and then proceed with setCondition and Event and return.
- Around line 208-228: approvedCount is over-counting because processCSR
currently returns nil for both skipped and approved CSRs; change processCSR
signature to return (bool, error) where the bool indicates whether the CSR was
actually approved, update all call sites in csrapproval.go (the two loops
processing bootstrapCSRs and servingCSRs) to capture (approved, err) and only
increment approvedCount when approved==true, propagate/log err when non-nil, and
ensure all internal returns in processCSR return (false, nil) for
validation-skipped CSRs and (true, nil) for successfully approved CSRs so
pendingRelevantCSRs and the status message reflect true approvals.

---

Nitpick comments:
In `@api/v1alpha1/dpfhcpprovisioner_types.go`:
- Around line 238-239: The constant ReasonCSRApprovalActive currently uses the
ambiguous value "Active"; change its string value to a more descriptive
identifier such as "CSRApprovalActive" (i.e., keep the constant name
ReasonCSRApprovalActive but set its value to "CSRApprovalActive"), and update
any usages/tests/fixtures that assert on the prior "Active" literal (search for
ReasonCSRApprovalActive and any string comparisons to "Active") so output in
kubectl shows the self-describing reason.

In `@internal/controller/csrapproval/client.go`:
- Around line 159-191: The function replaceServerWithInternalEndpoint currently
accepts unused parameters ctx and mgmtClient and declares a local variable named
context that shadows the imported context package; remove the unused parameters
from replaceServerWithInternalEndpoint's signature and update its callers (e.g.,
createHostedClusterClient) to match the new signature, and rename the local
variable context (e.g., to kubeconfigCtx or ctxEntry) where
kubeconfig.Contexts[currentContext] is read to avoid shadowing the standard
library package.

In `@internal/controller/csrapproval/csr_owner_validation.go`:
- Around line 88-100: Remove the unnecessary else block following the early
return: in the function that constructs a ValidationResult using variables
nodeExists and hostname (the block that currently returns fmt.Sprintf("node %s
does not exist yet in hosted cluster", hostname) inside an else), delete the
else and outdent its contents so the nodeExists check and the two
ValidationResult returns (the "node ... does not exist yet in hosted cluster"
and "DPU exists and node already joined") become the natural fall-through after
the prior return; keep the same return values and use the same ValidationResult
struct.
- Around line 56-101: Add unit tests for ValidateCSROwner covering the six
scenarios listed: DPU missing returns Valid=false with the expected Reason;
bootstrap CSR with node absent returns Valid=true; bootstrap CSR with node
present returns Valid=false; serving CSR with node present returns Valid=true;
serving CSR with node absent returns Valid=false; and both dpuExists/nodeExists
returning errors propagate the error. Create csr_owner_validation_test.go and
use controller-runtime's fake client to control DPU existence and a fake
kubernetes.Interface (or a client stub) to control nodeExists behavior; call
ValidateCSROwner and assert the ValidationResult.Valid and Reason values for
each case and that errors from dpuExists/nodeExists are returned (not
swallowed). Ensure tests reference the Validator type and its methods dpuExists
and nodeExists (mock/stub them or inject a test-friendly interface) so behavior
is deterministic.
- Around line 104-133: Replace the full-list scans in dpuExists and nodeExists
with direct Get calls: import k8s apierrors and call v.mgmtClient.Get(ctx,
client.ObjectKey{Namespace: v.dpuNamespace, Name: hostname},
&dpuprovisioningv1alpha1.DPU{}) in dpuExists and use
v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{}) in
nodeExists; treat apierrors.IsNotFound as a false/no-error result and return
other errors upward so you avoid O(n) list operations and return correct
boolean+error behavior for both helpers.

In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 264-266: NewValidator is being allocated per CSR; instead, create
the Validator once in processPendingCSRs using the invariant arguments
(a.mgmtClient, hcClient, dpfhcp.Spec.DPUClusterRef.Namespace) and pass that
single instance into processCSR (or store it as a field on the approver struct)
so processCSR calls validator.ValidateCSROwner(ctx, hostname, isBootstrapCSR)
without re-allocating; update function signatures (processCSR) or the approver
struct accordingly and remove the per-CSR NewValidator call.


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (6)
internal/controller/csrapproval/csrapproval.go (2)

148-220: processPendingCSRs swallows individual CSR processing errors — consider aggregating.

Lines 202-206 and 212-216 log errors from processCSR but continue and return nil from processPendingCSRs. If all CSRs fail, the caller at Line 104 sees no error and reports "CSR auto-approval active" — which is technically true but masks systemic issues. Consider returning an aggregated error (or a count of failures) so the condition message reflects the true state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csrapproval.go` around lines 148 - 220,
processPendingCSRs currently logs per-CSR errors from processCSR but always
returns nil, masking systemic failures; modify processPendingCSRs to track
failures (e.g., failureCount or collect error messages) while iterating over
bootstrapCSRs and servingCSRs and after both loops return a non-nil aggregated
error (or fmt.Errorf with the failure count and brief details) when any CSR
processing failed; ensure you update the return to include context (reference
processPendingCSRs and processCSR) and keep per-CSR log calls but also surface a
summary error to the caller so the caller can detect and report when
auto-approval failed for all/part of the CSRs.

248-262: NewValidator is instantiated per-CSR; hoist it out of the loop.

mgmtClient, hcClient, and dpuNamespace don't change between CSRs in the same poll cycle. Creating a Validator (and its embedded clients) for every CSR is wasteful. Create it once in processPendingCSRs and pass it into processCSR.

♻️ Proposed fix

In processPendingCSRs, create the validator once and pass it to processCSR:

 func (a *CSRApprover) processPendingCSRs(ctx context.Context, dpfhcp *provisioningv1alpha1.DPFHCPProvisioner, hcClient *kubernetes.Clientset) error {
 	log := logf.FromContext(ctx)
+	validator := NewValidator(a.mgmtClient, hcClient, dpfhcp.Spec.DPUClusterRef.Namespace)
 
 	// ... (existing code) ...
 
 	for _, csr := range bootstrapCSRs {
-		_, err := a.processCSR(ctx, dpfhcp, hcClient, &csr, "bootstrap")
+		_, err := a.processCSR(ctx, dpfhcp, hcClient, &csr, "bootstrap", validator)

Then in processCSR, accept and use the pre-built validator:

-func (a *CSRApprover) processCSR(ctx context.Context, dpfhcp *provisioningv1alpha1.DPFHCPProvisioner, hcClient *kubernetes.Clientset, csr *certv1.CertificateSigningRequest, csrType string) (bool, error) {
+func (a *CSRApprover) processCSR(ctx context.Context, dpfhcp *provisioningv1alpha1.DPFHCPProvisioner, hcClient *kubernetes.Clientset, csr *certv1.CertificateSigningRequest, csrType string, validator *Validator) (bool, error) {
 	// ...
-	validator := NewValidator(a.mgmtClient, hcClient, dpfhcp.Spec.DPUClusterRef.Namespace)
 	isBootstrapCSR := csr.Spec.SignerName == SignerBootstrap
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csrapproval.go` around lines 248 - 262,
NewValidator is being created inside processCSR for every CSR which is wasteful
because mgmtClient, hcClient and dpfhcp.Spec.DPUClusterRef.Namespace are
constant per poll; instead instantiate validator once in processPendingCSRs
(using mgmtClient, hcClient and the DPU namespace), then change processCSR to
accept a Validator parameter and remove the NewValidator call inside processCSR;
update all calls to processCSR to pass the prebuilt validator and ensure
processCSR uses the injected Validator (ValidateCSROwner) rather than creating
its own.
internal/controller/csrapproval/csr_content_validation.go (1)

108-166: Minor style inconsistency between validateBootstrapperIdentity and validateNodeIdentity group-checking logic.

validateBootstrapperIdentity uses a map[string]bool pattern to check required groups (Lines 116-133), while validateNodeIdentity uses individual boolean flags (Lines 147-163). Both are correct, but the inconsistency slightly hurts readability. Consider unifying to one pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csr_content_validation.go` around lines 108 -
166, Unify the group-checking pattern by changing validateNodeIdentity to use
the same map[string]bool approach as validateBootstrapperIdentity: keep the
username check (expectedUsername := SystemNodePrefix + hostname) but replace the
hasNodesGroup/hasAuthenticatedGroup booleans with a requiredGroups map
initialized with groupNodes and groupAuthenticated set to false, iterate
csr.Spec.Groups to mark found entries, then loop over requiredGroups to return
an error if any remain false; reference validateNodeIdentity,
validateBootstrapperIdentity, groupNodes and groupAuthenticated when making the
change.
internal/controller/csrapproval/client.go (2)

56-79: TOCTOU in GetHostedClusterClient — benign but may cause duplicate client creation.

Two concurrent goroutines can both miss the read-lock cache check (Lines 60-65), both create a client (Line 68), and both insert into the map (Lines 74-76) — the second write silently overwrites the first. The discarded client leaks its underlying transport connections until GC.

This is a minor concern since it only happens on the first access for a given key and the impact is a single wasted client creation. If you want to tighten it, a double-check after acquiring the write lock would suffice.

♻️ Double-check pattern
 	// Cache the client with write lock
 	cm.mu.Lock()
+	// Double-check: another goroutine may have inserted while we were creating
+	if existing, ok := cm.hcClients[key]; ok {
+		cm.mu.Unlock()
+		return existing, nil
+	}
 	cm.hcClients[key] = clientset
 	cm.mu.Unlock()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/client.go` around lines 56 - 79,
GetHostedClusterClient has a TOCTOU where two goroutines can both create a
client and one overwrites the other; after creating the new client via
createHostedClusterClient, acquire cm.mu.Lock() and re-check cm.hcClients[key] —
if an entry now exists, release the lock and return the existing client,
otherwise store the newly created client and return it; if you need to avoid
leaking the discarded client's connections, call a small cleanup helper (e.g.,
closeClient(newClient)) when you detect an existing entry.

172-205: Remove unused ctx and mgmtClient parameters from replaceServerWithInternalEndpoint.

The function signature accepts ctx context.Context and mgmtClient client.Client but neither is used in the function body. Removing them reduces noise and eliminates confusion about unnecessary dependencies.

♻️ Proposed fix
-func replaceServerWithInternalEndpoint(ctx context.Context, mgmtClient client.Client, kubeconfig *clientcmdapi.Config, hostedClusterNamespace, hostedClusterName string) error {
+func replaceServerWithInternalEndpoint(kubeconfig *clientcmdapi.Config, hostedClusterNamespace, hostedClusterName string) error {

Update the call site at line 107:

-	if err := replaceServerWithInternalEndpoint(ctx, cm.mgmtClient, kubeconfig, namespace, name); err != nil {
+	if err := replaceServerWithInternalEndpoint(kubeconfig, namespace, name); err != nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/client.go` around lines 172 - 205, The
function replaceServerWithInternalEndpoint currently declares unused parameters
ctx and mgmtClient; remove these parameters from the signature so it becomes
replaceServerWithInternalEndpoint(kubeconfig *clientcmdapi.Config,
hostedClusterNamespace, hostedClusterName string) error, update the function
body unchanged, and then update every call site that invokes
replaceServerWithInternalEndpoint to stop passing a context and client (pass
only the kubeconfig, hostedClusterNamespace and hostedClusterName). Also adjust
any unit tests or helper wrappers that reference this function signature to
match the new parameter list.
internal/controller/csrapproval/csr_owner_validation.go (1)

98-128: Replace List + iteration with direct Get lookups for better performance.

Both helpers list all objects then iterate to find by name. Use client.Get() for dpuExists and Nodes().Get() for nodeExists instead—these are O(1) indexed lookups at the API server versus linear scans, which matters when processing multiple CSRs per poll cycle in clusters with many DPUs/nodes.

♻️ Proposed fix

Add apierrors to imports:

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"sigs.k8s.io/controller-runtime/pkg/client"

	dpuprovisioningv1alpha1 "github.com/nvidia/doca-platform/api/provisioning/v1alpha1"
)

Then update the functions:

 func (v *Validator) dpuExists(ctx context.Context, hostname string) (bool, error) {
-	dpuList := &dpuprovisioningv1alpha1.DPUList{}
-	if err := v.mgmtClient.List(ctx, dpuList, client.InNamespace(v.dpuNamespace)); err != nil {
-		return false, err
-	}
-
-	for _, dpu := range dpuList.Items {
-		if dpu.Name == hostname {
-			return true, nil
-		}
+	dpu := &dpuprovisioningv1alpha1.DPU{}
+	err := v.mgmtClient.Get(ctx, client.ObjectKey{Namespace: v.dpuNamespace, Name: hostname}, dpu)
+	if err != nil {
+		if client.IgnoreNotFound(err) == nil {
+			return false, nil
+		}
+		return false, err
 	}
-
-	return false, nil
+	return true, nil
 }

 func (v *Validator) nodeExists(ctx context.Context, hostname string) (bool, error) {
-	nodeList, err := v.hostedClient.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
+	_, err := v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{})
 	if err != nil {
+		if apierrors.IsNotFound(err) {
+			return false, nil
+		}
 		return false, err
 	}
-
-	for _, node := range nodeList.Items {
-		if node.Name == hostname {
-			return true, nil
-		}
-	}
-
-	return false, nil
+	return true, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csr_owner_validation.go` around lines 98 -
128, Replace the inefficient list+iterate logic in dpuExists and nodeExists with
direct GETs: use v.mgmtClient.Get(ctx, types.NamespacedName{Name: hostname,
Namespace: v.dpuNamespace}, &dpuprovisioningv1alpha1.DPU{}) in dpuExists and
v.hostedClient.CoreV1().Nodes().Get(ctx, hostname, metav1.GetOptions{}) in
nodeExists, handle not-found using apierrors.IsNotFound to return (false, nil)
and return other errors unchanged; add the apierrors import
(k8s.io/apimachinery/pkg/api/errors) and reference the existing v.mgmtClient,
v.dpuNamespace, and v.hostedClient symbols when making these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@internal/controller/csrapproval/client.go`:
- Around line 33-53: The review metadata contains a duplicate review marker
([duplicate_comment]) and redundant approval tokens; remove the duplicate
comment/approval tag(s) so there's a single clear approval entry for this change
and no duplicated review lines, leaving the code as-is (ClientManager and
NewClientManager require no further changes).

In `@internal/controller/csrapproval/csr_content_validation.go`:
- Around line 260-292: The CSR IP validation previously duplicated logic; update
validateServingIdentities to call hasValidIPAddress(certReq.IPAddresses...) by
replacing any inline IP checks with calls to hasValidIPAddress (function:
hasValidIPAddress) and keep the existing hostname check in
validateServingIdentities; ensure hasValidIPAddress continues to reject
unspecified, loopback, and multicast addresses and return errors from
validateServingIdentities on any invalid IPs so the change is fully integrated
and tests cover both DNS name (hostname) and IP validation.

In `@internal/controller/csrapproval/csr_owner_validation.go`:
- Around line 52-96: ValidateCSROwner currently implements the intended
behavior: call v.dpuExists(ctx, hostname) and if found allow bootstrap CSRs
(isBootstrapCSR true) regardless of node state, otherwise for serving CSRs call
v.nodeExists(ctx, hostname) and require the node to exist; keep this logic
as-is, ensure the error wraps from dpuExists/nodeExists remain
(fmt.Errorf("failed to check DPU existence: %w", err) and fmt.Errorf("failed to
check Node existence: %w", err)), preserve the ValidationResult messages and use
v.dpuNamespace in the "no matching DPU" reason string to maintain informative
diagnostics.

In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 88-101: On TestConnection failure, ensure you invalidate the
cached hosted-cluster client by calling
a.clientManager.InvalidateClient(dpfhcp.Namespace, dpfhcp.Name) before setting
status/recording events so the next reconcile creates a fresh client; then call
a.setCondition(ctx, dpfhcp, metav1.ConditionFalse,
provisioningv1alpha1.ReasonHostedClusterNotReachable, fmt.Sprintf("Cannot
connect to hosted cluster: %v", err)) and a.recorder.Event(dpfhcp,
corev1.EventTypeWarning, "HostedClusterUnreachable", fmt.Sprintf("Cannot connect
to hosted cluster: %v", err)), and return ctrl.Result{}, nil (all within the
TestConnection error branch that currently logs via log.Error(err, "Hosted
cluster not reachable")).
- Around line 293-316: setCondition now correctly uses meta.SetStatusCondition
to avoid duplicate conditions; keep this implementation and the event-on-change
logic (in setCondition) as-is: only emit events via a.recorder.Event when
meta.SetStatusCondition(&dpfhcp.Status.Conditions, condition) returns true, and
only call a.mgmtClient.Status().Update(ctx, dpfhcp) in that branch; verify
references to setCondition, meta.SetStatusCondition, a.recorder.Event, and
a.mgmtClient.Status().Update are preserved and not reverted.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (8)
internal/controller/csrapproval/csr_content_validation_test.go (2)

180-327: Add a test case for ValidateServingCSR invalid organization (step 4 is untested).

ValidateServingCSR calls validateOrganization(certReq) (step 4), but all five ValidateServingCSR test cases use a correct "system:nodes" organization. An invalid organization would slip through undetected if that validation ever regressed.

✅ Proposed additional test case
+		Context("When CSR has invalid organization", func() {
+			It("should fail validation", func() {
+				hostname := "dpu-worker-01"
+				// DNS helper still uses the right hostname, but org is wrong
+				csrBytes := createServingCSRWithDNS("system:node:"+hostname, "wrong-org", hostname)
+
+				csr := &certv1.CertificateSigningRequest{
+					ObjectMeta: metav1.ObjectMeta{
+						Name: "test-serving-invalid-org",
+					},
+					Spec: certv1.CertificateSigningRequestSpec{
+						SignerName: SignerServing,
+						Username:   "system:node:" + hostname,
+						Groups: []string{
+							groupNodes,
+							groupAuthenticated,
+						},
+						Usages: []certv1.KeyUsage{
+							certv1.UsageDigitalSignature,
+							certv1.UsageServerAuth,
+						},
+						Request: csrBytes,
+					},
+				}
+
+				err := ValidateServingCSR(csr, hostname)
+				Expect(err).To(HaveOccurred())
+				Expect(err.Error()).To(ContainSubstring("invalid organization"))
+			})
+		})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csr_content_validation_test.go` around lines
180 - 327, The tests for ValidateServingCSR are missing a case that asserts
validateOrganization(certReq) fails when the CSR organization is not
"system:nodes"; add a new spec similar to the other Context blocks that builds a
CSR with createServingCSRWithDNS using a non-matching organization (e.g.
"wrong-org"), sets Spec.Username to "system:node:"+hostname and Groups to
include/not include as needed, calls ValidateServingCSR(csr, hostname) and
expects an error containing a message about invalid organization (or the exact
substring produced by validateOrganization); use a descriptive test name like
"When CSR has invalid organization" to locate it alongside the other Contexts
and reference ValidateServingCSR and validateOrganization in the test for
clarity.

27-177: Add a test case for ValidateBootstrapCSR CN/hostname mismatch (step 5 is untested).

The implementation's step 5 — validateCN(certReq, hostname) — has no corresponding failure test. Every other validation step (identity, usages, org) has a negative test, but a CSR whose CN does not match the expected hostname would silently pass in this test suite if the CN validation were ever broken.

✅ Proposed additional test case
+		Context("When CSR has CN that does not match the hostname", func() {
+			It("should fail validation", func() {
+				hostname := "dpu-worker-01"
+				// Create CSR bytes with a different CN
+				csrBytes := createTestCSRWithOrganization("system:node:other-node", "system:nodes")
+
+				csr := &certv1.CertificateSigningRequest{
+					ObjectMeta: metav1.ObjectMeta{
+						Name: "test-cn-mismatch",
+					},
+					Spec: certv1.CertificateSigningRequestSpec{
+						SignerName: SignerBootstrap,
+						Username:   BootstrapperUsername,
+						Groups: []string{
+							groupServiceAccounts,
+							groupServiceAccountsMachineConfig,
+							groupAuthenticated,
+						},
+						Usages: []certv1.KeyUsage{
+							certv1.UsageDigitalSignature,
+							certv1.UsageClientAuth,
+						},
+						Request: csrBytes,
+					},
+				}
+
+				err := ValidateBootstrapCSR(csr, hostname)
+				Expect(err).To(HaveOccurred())
+				Expect(err.Error()).To(ContainSubstring("CN validation failed"))
+			})
+		})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csr_content_validation_test.go` around lines
27 - 177, Add a negative test that covers step 5 (CN/hostname validation) by
creating a CSR whose organization and other fields are valid but whose common
name (CN) does not match the provided hostname; call ValidateBootstrapCSR(csr,
hostname) and assert it returns an error and that err.Error() contains a
substring indicating the CN/hostname mismatch (e.g., "CN" or "common name");
place this new test as a Context similar to the other negative cases and reuse
helpers like createTestCSRWithOrganization and constants such as SignerBootstrap
and BootstrapperUsername so it exercises validateCN(certReq, hostname).
internal/controller/csrapproval/hostname_test.go (1)

296-390: Consider extracting common CSR generation logic to reduce duplication.

The three helper functions (createTestCSR, createTestCSRWithOrganization, createServingCSRWithDNS) share nearly identical key generation, CSR creation, and PEM encoding logic. A single parameterized helper would reduce the ~90 lines of duplication.

♻️ Example consolidation
+// createTestCSRRaw creates a test CSR with the given template
+func createTestCSRRaw(template x509.CertificateRequest) []byte {
+	privateKey, err := rsa.GenerateKey(rand.Reader, 2048)
+	if err != nil {
+		panic(err)
+	}
+	csrBytes, err := x509.CreateCertificateRequest(rand.Reader, &template, privateKey)
+	if err != nil {
+		panic(err)
+	}
+	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: csrBytes})
+}
+
 func createTestCSR(cn string) []byte {
-	privateKey, err := rsa.GenerateKey(rand.Reader, 2048)
-	if err != nil {
-		panic(err)
-	}
-	template := x509.CertificateRequest{
+	return createTestCSRRaw(x509.CertificateRequest{
 		Subject: pkix.Name{CommonName: cn},
-	}
-	csrBytes, err := x509.CreateCertificateRequest(rand.Reader, &template, privateKey)
-	if err != nil {
-		panic(err)
-	}
-	pemBlock := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: csrBytes})
-	return pemBlock
+	})
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/hostname_test.go` around lines 296 - 390, The
three helpers createTestCSR, createTestCSRWithOrganization, and
createServingCSRWithDNS duplicate key generation, x509.CreateCertificateRequest
and PEM encoding; consolidate them by adding a single helper (e.g., new
createCSR helper) that takes parameters for commonName (cn string),
organizations ([]string) and dnsNames ([]string) and performs rsa.GenerateKey,
x509.CreateCertificateRequest and pem.EncodeToMemory once, returning []byte;
then change createTestCSR, createTestCSRWithOrganization and
createServingCSRWithDNS to call createCSR with the appropriate args (nil/empty
slices where not needed) so the duplicated logic is removed while keeping the
same panic-on-error behavior for tests.
internal/controller/csrapproval/client.go (2)

56-79: Minor TOCTOU in double-checked locking — two goroutines can create duplicate clients for the same key.

Between RUnlock() at Line 65 and Lock() at Line 74, a second goroutine can also miss the cache and create a redundant client. The extra client is silently overwritten and GC'd, so this isn't a correctness issue, but it wastes a kubeconfig parse + TCP connection setup.

A simple fix is to re-check the map after acquiring the write lock:

♻️ Proposed fix
 	// Create new client (outside lock to avoid holding lock during slow operation)
 	clientset, err := cm.createHostedClusterClient(ctx, namespace, name)
 	if err != nil {
 		return nil, err
 	}

 	// Cache the client with write lock
 	cm.mu.Lock()
+	// Re-check: another goroutine may have populated the cache while we were creating
+	if existing, ok := cm.hcClients[key]; ok {
+		cm.mu.Unlock()
+		return existing, nil
+	}
 	cm.hcClients[key] = clientset
 	cm.mu.Unlock()

 	return clientset, nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/client.go` around lines 56 - 79,
GetHostedClusterClient uses double-checked locking and can create duplicate
clients: after creating a client with createHostedClusterClient and before
storing it into cm.hcClients, re-acquire the write lock (cm.mu.Lock) and
re-check cm.hcClients[key]; if an entry now exists, return that existing client
and dispose/close the newly created one (if it supports closing) instead of
unconditionally overwriting; otherwise store the newly created client and return
it. Ensure you use the same key computed in GetHostedClusterClient and the same
lock fields cm.mu and map cm.hcClients in the fix.

172-205: Remove unused ctx and mgmtClient parameters from replaceServerWithInternalEndpoint.

The function performs pure in-memory kubeconfig transformation and does not reference either parameter. Removing them simplifies the signature and clarifies that no I/O is performed.

♻️ Proposed fix
-func replaceServerWithInternalEndpoint(ctx context.Context, mgmtClient client.Client, kubeconfig *clientcmdapi.Config, hostedClusterNamespace, hostedClusterName string) error {
+func replaceServerWithInternalEndpoint(kubeconfig *clientcmdapi.Config, hostedClusterNamespace, hostedClusterName string) error {

Update the call site at line 107:

-	if err := replaceServerWithInternalEndpoint(ctx, cm.mgmtClient, kubeconfig, namespace, name); err != nil {
+	if err := replaceServerWithInternalEndpoint(kubeconfig, namespace, name); err != nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/client.go` around lines 172 - 205, The
function replaceServerWithInternalEndpoint has unused parameters ctx and
mgmtClient; remove those parameters from its signature and definition so it only
accepts (kubeconfig *clientcmdapi.Config, hostedClusterNamespace,
hostedClusterName string) (or reorder to match current usage), update all call
sites to stop passing a context or management client, and run tests/compile to
ensure no remaining references to replaceServerWithInternalEndpoint(ctx,
mgmtClient, ...) remain.
internal/controller/csrapproval/controller.go (1)

101-107: Both controllers watch the same resource type — verify no event contention.

Both dpfhcpprovisioner and csrapproval controllers use For(&provisioningv1alpha1.DPFHCPProvisioner{}). This is supported by controller-runtime (each controller gets its own work queue), but every status update from either controller will trigger reconciliation in both controllers. Since the CSR approval controller updates the CSRAutoApprovalActive condition on the same CR, this creates a feedback loop where each CSR approval status update re-triggers both controllers.

This is functionally safe (the provisioner controller will no-op if nothing changed, and the CSR controller will just re-poll), but it doubles the reconciliation rate. Worth being aware of for observability/log noise at scale.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/controller.go` around lines 101 - 107, The
two controllers both watch DPFHCPProvisioner and status-only updates cause both
to reconcile; update CSRApprovalReconciler's SetupWithManager to add an event
filter so status-only changes (like CSRAutoApprovalActive flips) don't trigger
reconciliation for the other controller — for example replace the current
ctrl.NewControllerManagedBy(mgr).For(...) chain in SetupWithManager with one
that calls WithEventFilter(predicate.GenerationChangedPredicate{}) or a custom
predicate (predicate.Funcs) that ignores updates where only the
status/conditions changed, or that only requeues when the CSRAutoApprovalActive
condition itself changes; modify the SetupWithManager function in
CSRApprovalReconciler accordingly.
internal/controller/csrapproval/csrapproval.go (2)

107-113: Consider a distinct Reason for the transient-error condition state.

Setting ConditionTrue with ReasonCSRApprovalActive when processPendingCSRs returns an error conflates a degraded state with a healthy one. Alerting/monitoring rules keyed on CSRAutoApprovalActive=True will silently miss this degraded path. A separate reason (e.g., ReasonCSRApprovalDegraded) would let operators distinguish "active and healthy" from "active but encountering errors."

♻️ Proposed change
-		if condErr := a.setCondition(ctx, dpfhcp, metav1.ConditionTrue, provisioningv1alpha1.ReasonCSRApprovalActive,
-			fmt.Sprintf("CSR processing encountered transient error: %v", err)); condErr != nil {
+		if condErr := a.setCondition(ctx, dpfhcp, metav1.ConditionTrue, provisioningv1alpha1.ReasonCSRApprovalDegraded,
+			fmt.Sprintf("CSR processing encountered transient error: %v", err)); condErr != nil {

(Add ReasonCSRApprovalDegraded to the API constants.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csrapproval.go` around lines 107 - 113, The
current code marks the condition as ConditionTrue with ReasonCSRApprovalActive
even when processPendingCSRs returns an error; add a new reason constant (e.g.,
ReasonCSRApprovalDegraded) to the API constants and use it in the error branch
so setCondition(ctx, dpfhcp, metav1.ConditionTrue, ReasonCSRApprovalDegraded,
fmt.Sprintf(...)) is called when err != nil (leave the existing
ReasonCSRApprovalActive for the successful/healthy path); update any
references/tests to the reason name accordingly so operators can distinguish
active+healthy vs active+degraded.

153-156: Consider filtering pending CSRs server-side to reduce per-poll data transfer.

On every 30-second poll all CSRs are fetched and filtered in-memory. For hosted clusters with many historical (approved/denied) CSRs, this is wasteful. A field selector or label selector at the API level would reduce the payload.

♻️ Example — field selector for unapproved CSRs
-	csrList, err := hcClient.CertificatesV1().CertificateSigningRequests().List(ctx, metav1.ListOptions{})
+	// Filter server-side to CSRs that have not yet been approved/denied
+	csrList, err := hcClient.CertificatesV1().CertificateSigningRequests().List(ctx, metav1.ListOptions{
+		// Note: Kubernetes does not support status-based field selectors on CSRs natively;
+		// use a label convention on bootstrap/serving CSRs if available, or keep in-memory filter
+		// but at minimum set a ResourceVersion to avoid unnecessary full re-lists.
+		ResourceVersion: "0", // return from cache
+	})

(The most practical improvement is to add ResourceVersion: "0" so the apiserver can serve from its watch cache rather than etcd on every poll.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/csrapproval/csrapproval.go` around lines 153 - 156,
Currently all CSRs are fetched then filtered in-memory; change the List call to
request only pending CSRs and use the watch cache by passing a ListOptions with
a server-side selector and ResourceVersion "0". Update the
hcClient.CertificatesV1().CertificateSigningRequests().List(ctx,
metav1.ListOptions{}) invocation to pass metav1.ListOptions{FieldSelector:
"<selector-for-pending-csrs>", ResourceVersion: "0"} (or a label selector if you
add a label), so the API returns only unapproved/undecided CSRs and uses the
apiserver cache; keep the rest of the code that processes csrList unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/csrapproval/csr_content_validation.go`:
- Around line 222-231: The PEM block returned by pem.Decode(csr.Spec.Request)
isn't validated for type before attempting x509.ParseCertificateRequest, so add
a guard after pemBlock is non-nil that checks pemBlock.Type == "CERTIFICATE
REQUEST" (or the exact expected PEM type) and return a clear error if it does
not match; this prevents opaque ASN.1 parse errors when calling
x509.ParseCertificateRequest on the wrong PEM type.
- Around line 261-282: validateServingIdentities currently only ensures hostname
appears somewhere in certReq.DNSNames and allows extra SANs; change it to
require that certReq.DNSNames contains exactly one entry equal to the node
hostname (i.e., check len(certReq.DNSNames) == 1 and certReq.DNSNames[0] ==
hostname) and return an error otherwise; keep the existing IP address loop using
hasValidIPAddress(certReq.IPAddresses) unchanged and reuse the same error-return
pattern for invalid IPs.
- Around line 46-75: ValidateBootstrapCSR currently doesn't assert that parsed
certificate request SAN fields are empty; after parseCSRRequest(csr) add
explicit checks that certReq.DNSNames, certReq.IPAddresses,
certReq.EmailAddresses, and certReq.URIs are all empty and return a clear error
(e.g., "bootstrap CSR must not contain SANs: <which-field>") if any are
non-empty; place these checks before validateOrganization() and validateCN() so
malformed/malicious SANs are rejected early and include the specific certReq
field name in the error to aid debugging.
- Around line 77-106: ValidateServingCSR is missing CN validation like
ValidateBootstrapCSR does; after parsing the CSR with parseCSRRequest (certReq)
add a call to validateCN using the parsed certReq and the hostname, and return a
wrapped error (e.g., "CN validation failed: %w") if it fails, ensuring the CN
must be "system:node:<hostname>" to match ValidateBootstrapCSR's behavior.
- Around line 168-214: Both validateBootstrapUsages and validateServingUsages
currently only ensure required usages are present but do not reject
extra/disallowed usages; update each function to also iterate csr.Spec.Usages
and error if any usage is not in the allowed set (for validateBootstrapUsages
allowed set = {certv1.UsageDigitalSignature, certv1.UsageClientAuth}, for
validateServingUsages allowed set = {certv1.UsageDigitalSignature,
certv1.UsageServerAuth}). Return a clear fmt.Errorf when an unexpected usage is
found. Optionally factor the logic into a shared helper like validateUsages(csr
*certv1.CertificateSigningRequest, allowed map[certv1.KeyUsage]bool) used by
both functions to remove duplication.

In `@internal/controller/csrapproval/csr_owner_validation_test.go`:
- Around line 46-197: Tests create DPUs with mgmtClient.Create but call
mgmtClient.Delete at the end of the It blocks, which can leave orphaned
resources if an assertion fails; after every mgmtClient.Create call register a
Ginkgo DeferCleanup immediately (e.g. DeferCleanup(func(){
mgmtClient.Delete(ctx, dpu) })) so cleanup always runs, update each It that
creates dpu/dpu1/dpu2/dpu3 (the tests that call NewValidator and
validator.dpuExists) to defer cleanup right after the corresponding Create, and
remove the trailing inline Delete calls.

In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 309-313: Move the a.recorder.Event(dpfhcp, eventType, reason,
message) call so it happens only after a successful persist: call
a.mgmtClient.Status().Update(ctx, dpfhcp) first, check that it returned nil, and
only then emit the event; if Status().Update returns an error, return that error
without recording the event (to avoid ghost events). Ensure you keep the same
dpfhcp, eventType, reason, message values and preserve existing error
propagation behavior.

---

Duplicate comments:
In `@internal/controller/csrapproval/client.go`:
- Around line 207-214: TestConnection currently calls
clientset.Discovery().ServerVersion() which ignores the passed ctx; replace that
call with a context-aware REST request using the discovery REST client
(clientset.Discovery().RESTClient()). Perform a GET to the version endpoint
(e.g., AbsPath("/version") or the equivalent discovery path), execute the
request with Do(ctx) and handle the returned error (or unmarshal the response)
so the passed ctx deadline/cancellation is honored; update the TestConnection
function to use this RESTClient.Do(ctx) flow instead of ServerVersion().

In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 293-317: The setCondition method in CSRApprover should use
meta.SetStatusCondition to atomically find/update/append the
CSRAutoApprovalActive condition on dpfhcp.Status.Conditions and only emit events
and call a.mgmtClient.Status().Update(ctx, dpfhcp) when that call returns true;
update setCondition to construct the metav1.Condition (Type
provisioningv1alpha1.CSRAutoApprovalActive, Status, Reason, Message,
ObservedGeneration dpfhcp.Generation), call changed :=
meta.SetStatusCondition(&dpfhcp.Status.Conditions, condition), and if changed
then call a.recorder.Event(dpfhcp, eventType, reason, message) (with eventType
corev1.EventTypeWarning for metav1.ConditionFalse else corev1.EventTypeNormal)
and persist via a.mgmtClient.Status().Update(ctx, dpfhcp); otherwise return nil.
- Around line 199-217: The current loops call a.processCSR(ctx, dpfhcp,
hcClient, &csr, "...") but ignore its new (bool, error) result so approved CSRs
are not being counted; declare a local approvedCount (int) before iterating
bootstrapCSRs and servingCSRs, capture the first return value (e.g., approved,
err := a.processCSR(...)), increment approvedCount when approved == true, and
after processing both sets log or expose approvedCount for observability;
reference processCSR, bootstrapCSRs, servingCSRs and the local approvedCount
variable when making the change.
- Around line 89-101: The current code correctly invalidates the cached
hosted-cluster client on TestConnection failure, but please ensure the
invalidation call occurs before setting the condition and returning; verify the
presence of a.clientManager.InvalidateClient(dpfhcp.Namespace, dpfhcp.Name) in
the TestConnection error branch and that the function then calls
a.setCondition(...) and returns ctrl.Result{}, nil (no requeue); if you find
these steps out of order or missing in csrapproval.go (the TestConnection error
handling block), move/add the InvalidateClient call so it runs before
setCondition and the early return.

---

Nitpick comments:
In `@internal/controller/csrapproval/client.go`:
- Around line 56-79: GetHostedClusterClient uses double-checked locking and can
create duplicate clients: after creating a client with createHostedClusterClient
and before storing it into cm.hcClients, re-acquire the write lock (cm.mu.Lock)
and re-check cm.hcClients[key]; if an entry now exists, return that existing
client and dispose/close the newly created one (if it supports closing) instead
of unconditionally overwriting; otherwise store the newly created client and
return it. Ensure you use the same key computed in GetHostedClusterClient and
the same lock fields cm.mu and map cm.hcClients in the fix.
- Around line 172-205: The function replaceServerWithInternalEndpoint has unused
parameters ctx and mgmtClient; remove those parameters from its signature and
definition so it only accepts (kubeconfig *clientcmdapi.Config,
hostedClusterNamespace, hostedClusterName string) (or reorder to match current
usage), update all call sites to stop passing a context or management client,
and run tests/compile to ensure no remaining references to
replaceServerWithInternalEndpoint(ctx, mgmtClient, ...) remain.

In `@internal/controller/csrapproval/controller.go`:
- Around line 101-107: The two controllers both watch DPFHCPProvisioner and
status-only updates cause both to reconcile; update CSRApprovalReconciler's
SetupWithManager to add an event filter so status-only changes (like
CSRAutoApprovalActive flips) don't trigger reconciliation for the other
controller — for example replace the current
ctrl.NewControllerManagedBy(mgr).For(...) chain in SetupWithManager with one
that calls WithEventFilter(predicate.GenerationChangedPredicate{}) or a custom
predicate (predicate.Funcs) that ignores updates where only the
status/conditions changed, or that only requeues when the CSRAutoApprovalActive
condition itself changes; modify the SetupWithManager function in
CSRApprovalReconciler accordingly.

In `@internal/controller/csrapproval/csr_content_validation_test.go`:
- Around line 180-327: The tests for ValidateServingCSR are missing a case that
asserts validateOrganization(certReq) fails when the CSR organization is not
"system:nodes"; add a new spec similar to the other Context blocks that builds a
CSR with createServingCSRWithDNS using a non-matching organization (e.g.
"wrong-org"), sets Spec.Username to "system:node:"+hostname and Groups to
include/not include as needed, calls ValidateServingCSR(csr, hostname) and
expects an error containing a message about invalid organization (or the exact
substring produced by validateOrganization); use a descriptive test name like
"When CSR has invalid organization" to locate it alongside the other Contexts
and reference ValidateServingCSR and validateOrganization in the test for
clarity.
- Around line 27-177: Add a negative test that covers step 5 (CN/hostname
validation) by creating a CSR whose organization and other fields are valid but
whose common name (CN) does not match the provided hostname; call
ValidateBootstrapCSR(csr, hostname) and assert it returns an error and that
err.Error() contains a substring indicating the CN/hostname mismatch (e.g., "CN"
or "common name"); place this new test as a Context similar to the other
negative cases and reuse helpers like createTestCSRWithOrganization and
constants such as SignerBootstrap and BootstrapperUsername so it exercises
validateCN(certReq, hostname).

In `@internal/controller/csrapproval/csrapproval.go`:
- Around line 107-113: The current code marks the condition as ConditionTrue
with ReasonCSRApprovalActive even when processPendingCSRs returns an error; add
a new reason constant (e.g., ReasonCSRApprovalDegraded) to the API constants and
use it in the error branch so setCondition(ctx, dpfhcp, metav1.ConditionTrue,
ReasonCSRApprovalDegraded, fmt.Sprintf(...)) is called when err != nil (leave
the existing ReasonCSRApprovalActive for the successful/healthy path); update
any references/tests to the reason name accordingly so operators can distinguish
active+healthy vs active+degraded.
- Around line 153-156: Currently all CSRs are fetched then filtered in-memory;
change the List call to request only pending CSRs and use the watch cache by
passing a ListOptions with a server-side selector and ResourceVersion "0".
Update the hcClient.CertificatesV1().CertificateSigningRequests().List(ctx,
metav1.ListOptions{}) invocation to pass metav1.ListOptions{FieldSelector:
"<selector-for-pending-csrs>", ResourceVersion: "0"} (or a label selector if you
add a label), so the API returns only unapproved/undecided CSRs and uses the
apiserver cache; keep the rest of the code that processes csrList unchanged.

In `@internal/controller/csrapproval/hostname_test.go`:
- Around line 296-390: The three helpers createTestCSR,
createTestCSRWithOrganization, and createServingCSRWithDNS duplicate key
generation, x509.CreateCertificateRequest and PEM encoding; consolidate them by
adding a single helper (e.g., new createCSR helper) that takes parameters for
commonName (cn string), organizations ([]string) and dnsNames ([]string) and
performs rsa.GenerateKey, x509.CreateCertificateRequest and pem.EncodeToMemory
once, returning []byte; then change createTestCSR, createTestCSRWithOrganization
and createServingCSRWithDNS to call createCSR with the appropriate args
(nil/empty slices where not needed) so the duplicated logic is removed while
keeping the same panic-on-error behavior for tests.

key := namespace + "/" + name
cm.mu.Lock()
delete(cm.hcClients, key)
cm.mu.Unlock()

Should we have a guard condition here in case the key is not in the map?

Collaborator Author

Hi Jason, delete is safe. ref - https://go.dev/ref/spec#Deletion_of_map_elements

@omertuc
Member

omertuc commented Feb 19, 2026

What prevents an adversary from getting their CSRs approved?

@linoyaslan
Collaborator Author

What prevents an adversary from getting their CSRs approved?

Hi @omertuc , I took inspiration from the CSR approval logic in assisted-service. You can review the validations I added in this PR, both the content validation for each CSR type and the owner validation.
In our case, we validate the DPU object against the worker attempting to join the cluster, whereas in assisted-service they validate the Agent hostname against the joining worker. It's a very similar approach, with adjustments to fit our use case. reference - https://github.com/openshift/assisted-service/blob/master/internal/controller/controllers/csr_utils.go

@openshift-ci

openshift-ci Bot commented Feb 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jkary, linoyaslan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@omertuc
Member

omertuc commented Feb 24, 2026

I took inspiration from the CSR approval logic in assisted-service.

It's a very similar approach, with adjustments to fit our use case. reference - https://github.com/openshift/assisted-service/blob/master/internal/controller/controllers/csr_utils.go

There's a lot that seems wrong with how various OCP / OCP related systems approve CSRs in my opinion, but we don't know their threat models and what kind of considerations / limitations they had when they chose their solutions. Regardless, even if those solutions were secure and reasonable, I don't think we can simply copy them verbatim, as maybe doing what they did in their circumstances made sense but doesn't necessarily make sense in ours. Maybe we're less limited than them and there's more we can do to ensure security.

For example, in the example you linked they're limiting their CSR approval process to nodes that are in known states. Since these states are very time-limited, maybe the attack window is not big enough to be a concern. However, in this PR, if I understand correctly, we're indefinitely approving any CSR.

You can review the validations I added in this PR, both the content validation for each CSR type and the owner validation.

In our case, we validate the DPU object against the worker attempting to join the cluster, whereas in assisted-service they validate the Agent hostname against the joining worker.

Anyone can write anything in these fields. This will protect against random nodes trying to join by accident, but not against someone maliciously (yet trivially) crafting CSRs they want to get approved

@linoyaslan
Collaborator Author

linoyaslan commented Feb 24, 2026

Thanks @omertuc! So IIUC, you think we need the following here:

  1. Time-limited approval - do you mean to narrow the attack window? For example, approving CSRs only during a specific phase of the DPU? Do you have another idea for an appropriate time window?
  2. If the DPU hostname isn't sufficient for validation, what additional checks do you think we should add? I'm not sure we have anything else available to truly authenticate the DPU. @tsorya, I'd appreciate your thoughts on this as well.

@omertuc
Member

omertuc commented Feb 24, 2026

Time-limited approval - do you mean to narrow the attack window? for example, approve CSRs on specific phase of the DPU? do you have any another idea for the appropriate time window?

I didn't suggest that; I just mentioned it as an example of a difference between the system we copied from and ours, a difference which is critical when determining whether this is a good solution. Limiting the controller to a time window will make things better, but my question still remains: what stops an adversary from getting their CSRs approved by this controller?

If the DPU hostname isn't sufficient for validation, what additional checks do you think we should add?

IDK, we need to define our goals / our threat model and what we want to protect from. We can even say something like "we don't care about CSR security because nobody except devices with physical access to the machine (like the DPU itself) can even contact the hypershift control plane" (I don't know if that's true or not - if it's not, we should probably do that as well. If it is then we don't really necessarily have to worry about CSR security because the entire hosted cluster is fully internal and inaccessible from outside)

I'm not sure we have anything else available to truly authenticate the DPU.

We've experimented [1] before with injecting the private key used by kubelet to generate its CSR in the node ignition, then comparing the CSR public key against its matching private key before approving the CSR. We could theoretically do something similar here

[1] openshift/assisted-installer#895 - ignore the PR title and description, they're outdated - but the diff shows what I'm referring to. It never made it to assisted though

@tsorya
Member

tsorya commented Feb 24, 2026

@omertuc I totally understand your concerns here, but we can start with this option. It will certainly minimize the attack surface.
@linoyaslan can we add a check for the DPU object status too? So that we approve CSRs only from DPUs in the status that waits for the DPU to be added to the cluster, something like dpu_cluster_config?
And let's think about how we can improve it in the future - let's open a ticket?
@omertuc will this work for now?

@linoyaslan
Collaborator Author

linoyaslan commented Feb 24, 2026

@tsorya Yes, let me add the DPU status check in this PR, test it, and open a separate ticket for future improvements.
@omertuc I'd be happy to schedule a call to discuss the future improvements based on your comment above, as I have a few questions.

@linoyaslan
Collaborator Author

linoyaslan commented Feb 24, 2026

@tsorya as discussed, dedicated tickets were opened for phase validation (NVIDIA-575) and for the mechanism @omertuc raised (NVIDIA-576)

@tsorya tsorya merged commit 534b8e5 into rh-ecosystem-edge:main Feb 24, 2026
6 of 7 checks passed