Skip to content

Conversation

@jtschelling
Copy link
Contributor

@jtschelling jtschelling commented Dec 5, 2025

Summary

Type of Change

  • πŸ› Bug fix
  • ✨ New feature
  • πŸ’₯ Breaking change
  • πŸ“š Documentation
  • πŸ”§ Refactoring
  • πŸ”¨ Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • Added janitor-provider gRPC service, Helm chart, and Kubernetes manifests; integrated into build and dev tooling and image pipeline.
  • Configuration

    • New global feature flag to enable janitorProvider and CSP provider host wiring.
    • Main janitor deployment removed provider-specific CSP config and workload-identity annotations; janitor now delegates CSP ops to the provider.
  • Tests

    • Introduced a shared mock CSP gRPC test helper for unit tests.
  • Refactor

    • Standardized provider logging to structured slog.
  • Chores

    • Updated ignore entries for tooling and editor files.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 5, 2025

Warning

Rate limit exceeded

@jtschelling has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 11 seconds before requesting another review.

βŒ› How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 41135e2 and b06d7c1.

πŸ“’ Files selected for processing (4)
  • .gitignore
  • janitor-provider/pkg/csp/azure/azure.go
  • janitor/pkg/controller/rebootnode_controller.go
  • janitor/pkg/controller/test_utils.go
πŸ“ Walkthrough

Walkthrough

Adds a janitor-provider gRPC service and proto API, converts janitor controllers to call that service over gRPC, introduces a janitor-provider Helm chart and deployment, adds a bufconn-based mock CSP gRPC test harness, and updates build/tilt/packaging to include the new provider and cspProviderHost wiring.

Changes

Cohort / File(s) Summary
Build & Tooling
\.gitignore, .ko.yaml, api/Makefile, scripts/build-image-list.sh, scripts/buildko.sh, scripts/check-image-attestations.sh
Add ignores; include janitor-provider in ko builds, image lists, go.work and attestations; extend proto generation to api/proto/csp/v1alpha1; update build scripts.
CSP Proto API
api/proto/csp/v1alpha1/provider.proto
New proto defining CSPProviderService RPCs: SendRebootSignal, IsNodeReady, SendTerminateSignal and corresponding request/response messages; sets go_package.
Janitor-Provider Module & Server
janitor-provider/go.mod, janitor-provider/main.go, janitor-provider/Tiltfile
New Go module, gRPC server implementation (metrics + CSP handlers), and Tilt build integration.
CSP Backend Adapters
janitor-provider/pkg/csp/*
Ported CSP adapters into janitor-provider module; switched logging to log/slog; updated imports to janitor-provider paths; changed IsNodeReady param name from message β†’ requestID.
Helm Chart: janitor-provider
distros/kubernetes/nvsentinel/charts/janitor-provider/*
New Helm chart: Chart.yaml, _helpers.tpl, RBAC, ClusterRoleBinding, ServiceAccount, Deployment, Service, NetworkPolicy, and values.yaml including CSP/provider configuration and Nebius handling.
Parent Helm & Values
distros/kubernetes/nvsentinel/Chart.yaml, .../charts/janitor/templates/configmap.yaml, distros/kubernetes/nvsentinel/values.yaml, distros/kubernetes/nvsentinel/values-tilt.yaml
Add janitor-provider dependency and global.janitorProvider.enabled; wire cspProviderHost into controllers' config and tilt values.
Janitor Controllers
janitor/pkg/controller/rebootnode_controller.go, janitor/pkg/controller/terminatenode_controller.go
Replace local CSP interface with generated cspv1alpha1.CSPProviderServiceClient; create/manage gRPC connection in SetupWithManager; call typed RPCs from reconcile flows; add grpcConn fields and cleanup runnables.
Configuration
janitor/pkg/config/config.go
Add CSPProviderHost to GlobalConfig, RebootNodeControllerConfig, and TerminateNodeControllerConfig; propagate global into per-controller configs when unset.
Testing & Mocks
janitor/pkg/controller/test_utils.go, janitor/pkg/controller/suite_test.go, janitor/pkg/controller/*_test.go
Add bufconn-based MockCSPServer and MockCSPTestHelper with configurable behaviors and shared lifecycle; tests refactored to use shared mock gRPC server and updated assertions.
Tilt / Dev Workflow
tilt/Tiltfile, janitor-provider/Tiltfile
Include janitor-provider Tiltfile and register k8s_resource for janitor-provider.
Janitor Chart Simplification
distros/kubernetes/nvsentinel/charts/janitor/templates/deployment.yaml, .../serviceaccount.yaml, .../values.yaml
Remove in-chart CSP-specific envs, workload identity annotations, Nebius mounts/logic, and per-provider values so janitor relies on external janitor-provider.
Misc Chart Additions
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/networkpolicy.yaml, .../clusterrole*.yaml, .../service.yaml
New network policy, RBAC, and service templates for the janitor-provider chart.

Sequence Diagram(s)

sequenceDiagram
    participant Controller as Janitor Controller
    participant GRPCClient as gRPC Client
    participant Provider as janitor-provider (gRPC Server)
    participant Backend as CSP Backend Adapter

    Controller->>GRPCClient: IsNodeReady(node_name)
    GRPCClient->>Provider: IsNodeReadyRequest(node_name, request_id)
    Provider->>Backend: Check node readiness
    Backend-->>Provider: IsNodeReadyResponse(is_ready)
    Provider-->>GRPCClient: IsNodeReadyResponse(is_ready)
    GRPCClient-->>Controller: is_ready

    alt Node ready
        Controller->>GRPCClient: SendRebootSignal(node_name)
        GRPCClient->>Provider: SendRebootSignalRequest(node_name)
        Provider->>Backend: Trigger reboot
        Backend-->>Provider: SendRebootSignalResponse(request_id)
        Provider-->>GRPCClient: SendRebootSignalResponse(request_id)
        GRPCClient-->>Controller: request_id
    else Node not ready
        Controller-->>Controller: Requeue / retry
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

πŸ‡ I hopped a proto down the lane, a gRPC tune I hum,

Providers wake and charts unfurl when callers come.
Mock bufconn cushions every hop, metrics softly sing,
Controllers ring the janitor bell, and clouds begin to spring.
A carrot build, a tidy patch β€” hooray for springy spring!

πŸš₯ Pre-merge checks | βœ… 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'feat: grpc remediation service' is partially related to the changeset, referring to gRPC infrastructure additions but omits the major janitor-provider service extraction and CSP client refactoring that constitute the primary changes. Consider a more specific title such as 'feat: extract janitor-provider grpc service for CSP remediation' to better reflect the main architectural change of extracting CSP provider logic into a separate service.
βœ… Passed checks (1 passed)
Check name Status Explanation
Description Check βœ… Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❀️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@jtschelling jtschelling force-pushed the jtschelling/353-grpc-plugin branch from ff136b5 to 9e256c9 Compare December 19, 2025 18:40
@jtschelling jtschelling force-pushed the jtschelling/353-grpc-plugin branch from 0c561e6 to 034aba9 Compare December 29, 2025 18:40
@github-actions
Copy link

Merging this branch will increase overall coverage

Impacted Packages Coverage Ξ” πŸ€–
github.com/nvidia/nvsentinel/janitor/pkg/config 31.87% (+10.51%) πŸŽ‰
github.com/nvidia/nvsentinel/janitor/pkg/controller 23.51% (+1.76%) πŸ‘

Coverage by file

Changed files (no unit tests)

Changed File Coverage Ξ” Total Covered Missed πŸ€–
github.com/nvidia/nvsentinel/janitor/pkg/config/config.go 31.87% (+10.51%) 91 (-12) 29 (+7) 62 (-19) πŸŽ‰
github.com/nvidia/nvsentinel/janitor/pkg/controller/rebootnode_controller.go 22.53% (+0.98%) 679 (-147) 153 (-25) 526 (-122) πŸ‘
github.com/nvidia/nvsentinel/janitor/pkg/controller/terminatenode_controller.go 24.02% (+2.08%) 637 (-138) 153 (-17) 484 (-121) πŸ‘
github.com/nvidia/nvsentinel/janitor/pkg/controller/test_utils.go 23.38% (+4.63%) 308 (+276) 72 (+66) 236 (+210) πŸ‘

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@neerajnv
Copy link

Could you please add some details on testing including which CSP the code has been tested on. Could you also add details on the metrics you have added to track the operational aspects of the code and alerts you would recommend.

@jtschelling jtschelling force-pushed the jtschelling/353-grpc-plugin branch from 3ec0666 to 0bb7857 Compare January 5, 2026 19:30
@jtschelling jtschelling marked this pull request as ready for review January 5, 2026 22:45
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

Fix all issues with AI Agents πŸ€–
In @distros/kubernetes/nvsentinel/charts/janitor/templates/configmap.yaml:
- Line 50: The ConfigMap currently renders the top-level key cspProviderHost
unconditionally (cspProviderHost: {{ .Values.config.cspProviderHost }}) which
can produce an empty value; change it to only emit the key/value when
.Values.config.cspProviderHost is set (mirror the conditional pattern used for
the controller-specific fields) so the ConfigMap omits cspProviderHost entirely
if the value is nil/empty. Locate the cspProviderHost entry in the template and
wrap it with the same conditional guard used elsewhere (checking
.Values.config.cspProviderHost) to prevent empty entries from being written.

In @janitor-provider/go.mod:
- Line 67: The project uses github.com/golang-jwt/jwt/v4 at v4.5.1 which
contains a HIGH severity vulnerability; update the module requirement to v4.5.2
(replace the version string for github.com/golang-jwt/jwt/v4 in go.mod), then
run the Go module commands to fetch and lock the new version (e.g., go get
github.com/golang-jwt/jwt/[email protected] and go mod tidy) so the dependency is
upgraded and the CVE is mitigated.

In @janitor-provider/main.go:
- Around line 50-93: Extract the repeated node lookup into a helper on
janitorProviderServer, e.g. add a method getNode(ctx context.Context, nodeName
string) (*corev1.Node, error) that calls s.k8sClient.CoreV1().Nodes().Get and
returns the node or wrapped status error; then update SendRebootSignal,
IsNodeReady and SendTerminateSignal to call getNode(ctx, req.NodeName) (and pass
*node when the cspClient methods expect a value), removing the duplicated Get
logic and error handling in each handler.

In @janitor-provider/pkg/csp/azure/azure.go:
- Line 152: Replace the fmt.Sprintf call inside the slog.Info invocation with
structured logging fields: call slog.Info with a fixed message string (e.g.,
"node healthy") and pass the node name as a structured field using slog.String
(or slog.Any) referencing node.Name; update the single occurrence of
slog.Info(fmt.Sprintf(...)) to use the structured form so the node name is a
searchable field.

In @janitor/pkg/controller/rebootnode_controller.go:
- Line 335: Delete the debug stdout print call fmt.Println("controller context",
ctx) from rebootnode_controller.go; locate it in the Reconcile handler (or the
method containing the controller context) and remove the line, and if context
information is needed for debugging replace it with the controller's structured
logger (e.g., use r.Log.Info with relevant keys) rather than printing to stdout.

In @janitor/pkg/controller/terminatenode_controller.go:
- Around line 412-418: The gRPC client connection created in SetupWithManager is
never closed; add a field Conn *grpc.ClientConn to TerminateNodeReconciler,
assign the conn returned by grpc.NewClient there instead of discarding it, and
register the reconciler with the manager as a runnable that closes conn when the
manager context is canceled (implement Start(ctx context.Context) error that
<-ctx.Done() then conn.Close()). Apply the same change to RebootNodeReconciler
(add Conn field, store the connection, and close it on manager shutdown) so the
connection lifecycle is tied to the manager and no leak occurs.

In @janitor/pkg/controller/test_utils.go:
- Around line 221-225: The goroutine that calls server.Serve(lis) currently logs
only when errors.Is(err, grpc.ErrServerStopped) is true (graceful shutdown)
which is inverted; change the condition to log unexpected errors by checking
that err is non-nil and NOT errors.Is(err, grpc.ErrServerStopped) (e.g., if err
!= nil && !errors.Is(err, grpc.ErrServerStopped) then call log.Printf) so
server.Serve failures are reported while normal shutdown remains silent; update
the if around server.Serve(lis) and keep references to server.Serve,
grpc.ErrServerStopped, and log.Printf.
- Around line 202-252: The test helper leaks the gRPC client connection because
the conn returned when creating the client is not stored or closed; add a field
to MockCSPTestHelper (e.g., ClientConn *grpc.ClientConn or net.Conn type
matching the dialer) to save the conn created in NewMockCSPTestHelper, assign
the returned conn to that field before creating
cspv1alpha1.NewCSPProviderServiceClient(conn), and update Stop() to close that
connection (call ClientConn.Close() or the appropriate Close method) in addition
to grpcServer. Ensure the field name used matches your struct (e.g., ClientConn)
so Stop() can reference and close it safely.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
janitor/pkg/controller/rebootnode_controller.go (2)

184-235: Variable shadowing causes dead code block.

The nodeReadyErr variable is declared at line 184, but the := operator at line 190 creates a new shadowed variable scoped to the else block. The error handling at lines 218-235 checks the outer nodeReadyErr which is always nil after the refactoring, making this entire block unreachable dead code.

The early return at line 202 handles the error case, so the block at lines 218-235 is never executed.

πŸ”Ž Proposed fix - remove dead code block
 		var nodeReadyErr error
 
 		if r.Config.ManualMode {
 			cspReady = true
 			nodeReadyErr = nil
 		} else {
 			rsp, nodeReadyErr := r.CSPClient.IsNodeReady(ctx, &cspv1alpha1.IsNodeReadyRequest{
 				NodeName: node.Name,
 			})
 			if nodeReadyErr != nil {
 				logger.Error(nodeReadyErr, "failed to check if node is ready",
 					"node", node.Name)
 
 				rebootNode.Status.ConsecutiveFailures++
 				delay := getNextRequeueDelay(rebootNode.Status.ConsecutiveFailures)
 
 				result = ctrl.Result{RequeueAfter: delay}
 				// Update status and return early
 				return r.updateRebootNodeStatus(ctx, req, originalRebootNode, &rebootNode, result)
 			}
 
 			cspReady = rsp.IsReady
 		}
 
 		// Check if kubernetes reports the node is ready.
 		kubernetesReady := false
 
 		for _, condition := range node.Status.Conditions {
 			if condition.Type == corev1.NodeReady {
 				kubernetesReady = condition.Status == corev1.ConditionTrue
 			}
 		}
 
-		// nolint:gocritic // Migrated business logic with if-else chain
-		if nodeReadyErr != nil {
-			logger.Error(nodeReadyErr, "node ready status check failed",
-				"node", node.Name)
-
-			rebootNode.Status.ConsecutiveFailures++
-
-			rebootNode.SetCompletionTime()
-			rebootNode.SetCondition(metav1.Condition{
-				Type:               janitordgxcnvidiacomv1alpha1.RebootNodeConditionNodeReady,
-				Status:             metav1.ConditionFalse,
-				Reason:             "Failed",
-				Message:            fmt.Sprintf("Node status could not be checked from CSP: %s", nodeReadyErr),
-				LastTransitionTime: metav1.Now(),
-			})
-
-			metrics.GlobalMetrics.IncActionCount(metrics.ActionTypeReboot, metrics.StatusFailed, node.Name)
-
-			result = ctrl.Result{} // Don't requeue on failure
-		} else if cspReady && kubernetesReady {
+		if cspReady && kubernetesReady {

399-415: gRPC connection is never closed - potential resource leak.

The gRPC connection created at line 400 is not stored or closed during shutdown. The conn variable goes out of scope, making it impossible to close later. Consider registering a runnable with the manager to handle cleanup.

πŸ”Ž Proposed fix - add connection to reconciler and register cleanup
 type RebootNodeReconciler struct {
 	client.Client
 	Scheme    *runtime.Scheme
 	Config    *config.RebootNodeControllerConfig
 	CSPClient cspv1alpha1.CSPProviderServiceClient
+	cspConn   *grpc.ClientConn
 }

 // SetupWithManager sets up the controller with the Manager.
 func (r *RebootNodeReconciler) SetupWithManager(mgr ctrl.Manager) error {
 	conn, err := grpc.NewClient(r.Config.CSPProviderHost, grpc.WithTransportCredentials(insecure.NewCredentials()))
 	if err != nil {
 		return fmt.Errorf("failed to create CSP client: %w", err)
 	}

 	r.CSPClient = cspv1alpha1.NewCSPProviderServiceClient(conn)
+	r.cspConn = conn
+
+	// Register cleanup on manager shutdown
+	if err := mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
+		<-ctx.Done()
+		return conn.Close()
+	})); err != nil {
+		conn.Close()
+		return fmt.Errorf("failed to register connection cleanup: %w", err)
+	}

 	return ctrl.NewControllerManagedBy(mgr).

Note: This requires importing "sigs.k8s.io/controller-runtime/pkg/manager".

🧹 Nitpick comments (19)
.gitignore (1)

444-448: Approve additions; note redundancy in VSCode patterns.

The mise.toml entry and *.code-workspace patterns are sensible additions for environment management and editor configuration. However, .vscode/ (line 447) is redundant with the existing .vscode/* pattern at line 235β€”no need to add it.

Consider removing the .vscode/ entry at line 447 since .vscode/* at line 235 already covers directory contents (with exceptions for specific files at lines 236–240).

janitor-provider/go.mod (1)

16-16: Refactor tests to use standard library instead of stretchr/testify.

The testify dependency is currently used only for simple assertions in pkg/csp/client_test.go: assert.Equal(), require.NoError(), assert.NotNil(), and assert.Error(). These can all be replaced with standard testing package utilities (t.Error, t.Errorf, t.Fatalf with simple if checks). Per project learnings, avoid introducing testify for straightforward checks.

distros/kubernetes/nvsentinel/values-tilt.yaml (1)

287-288: Add inline documentation for cspProviderHost.

The cspProviderHost format (service.namespace.svc.cluster.local:port) is non-obvious and should include an inline comment explaining the expected format and purpose.

πŸ”Ž Proposed enhancement
   config:
+    # CSP provider gRPC endpoint in format: <service>.<namespace>.svc.cluster.local:<port>
     cspProviderHost: "janitor-provider.nvsentinel.svc.cluster.local:50051"

Based on learnings, include examples for non-obvious configurations in Helm chart documentation.

distros/kubernetes/nvsentinel/values.yaml (1)

73-74: Add inline documentation for janitorProvider flag.

The new janitorProvider.enabled flag should include an inline comment explaining its purpose, similar to other global feature flags in this file.

πŸ”Ž Proposed enhancement
   janitor:
     enabled: false
+  # janitorProvider enables the gRPC-based CSP provider service for janitor operations
   janitorProvider:
     enabled: false

Based on learnings, document all values in Helm chart values.yaml with inline comments.

api/Makefile (1)

71-91: Consider consolidating protoc invocations for better maintainability.

The current approach uses separate protoc commands for each proto directory (device and CSP). While this works correctly, it could be simplified by processing multiple proto paths in a single invocation, which would improve maintainability as more proto directories are added.

πŸ”Ž Proposed consolidation
 	cd proto && \
 	protoc \
 		-I . \
 		-I ../$(THIRD_PARTY_DIR) \
 		--plugin="protoc-gen-go=$(PROTOC_GEN_GO)" \
 		--plugin="protoc-gen-go-grpc=$(PROTOC_GEN_GO_GRPC)" \
 		--go_out=../$(GEN_DIR) \
 		--go_opt=paths=source_relative \
 		--go-grpc_out=../$(GEN_DIR) \
 		--go-grpc_opt=paths=source_relative \
-		device/v1alpha1/*.proto && \
-	protoc \
-		-I . \
-		-I ../$(THIRD_PARTY_DIR) \
-		--plugin="protoc-gen-go=$(PROTOC_GEN_GO)" \
-		--plugin="protoc-gen-go-grpc=$(PROTOC_GEN_GO_GRPC)" \
-		--go_out=../$(GEN_DIR) \
-		--go_opt=paths=source_relative \
-		--go-grpc_out=../$(GEN_DIR) \
-		--go-grpc_opt=paths=source_relative \
-		csp/v1alpha1/*.proto
+		device/v1alpha1/*.proto \
+		csp/v1alpha1/*.proto
distros/kubernetes/nvsentinel/charts/janitor-provider/Chart.yaml (2)

17-17: Consider updating the description for production use.

The description "Example provider for NVSentinel Janitor" suggests this is sample code rather than a production component. If this chart is intended for production deployment, consider a more descriptive name like "CSP provider service for NVSentinel Janitor" or "Janitor gRPC provider service".


15-19: Consider adding appVersion field.

While optional, the appVersion field helps track which version of the janitor-provider image this chart is designed to deploy. This is especially useful when the chart version and application version evolve independently.

πŸ”Ž Suggested addition
 apiVersion: v2
 name: janitor-provider
 description: Example provider for NVSentinel Janitor
 type: application
 version: 0.1.0
+appVersion: "0.1.0"
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yaml (1)

14-19: Consider documenting cloud provider IAM annotation requirements.

Based on the PR objectives mentioning CSP testing, the ServiceAccount may need annotations for cloud provider IAM role bindings (e.g., GCP Workload Identity's iam.gke.io/gcp-service-account, AWS IRSA's eks.amazonaws.com/role-arn, or Azure's azure.workload.identity/client-id).

Consider documenting these requirements in the values.yaml or adding conditional annotation support in the template.

Would you like me to generate a template snippet that conditionally adds annotations from values.yaml?

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml (1)

39-54: Consider adding a security context.

For defense-in-depth, consider adding a security context to run the container as non-root with a read-only root filesystem.

πŸ”Ž Recommended security context
         - name: janitor-provider
           image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default ((.Values.global).image).tag | default .Chart.AppVersion }}"
           imagePullPolicy: {{ .Values.image.pullPolicy }}
+          securityContext:
+            runAsNonRoot: true
+            runAsUser: 65532
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true
+            capabilities:
+              drop:
+                - ALL
           resources:
janitor-provider/pkg/csp/aws/aws.go (1)

110-127: Consider wrapping errors with additional context.

While the current error logging provides good context, wrapping errors with fmt.Errorf would preserve the error chain and provide richer context for upstream callers.

πŸ”Ž Example error wrapping
 	// Extract the instance ID from the provider ID
 	instanceID, err := parseAWSProviderID(providerID)
 	if err != nil {
-		slog.Error("Failed to parse provider ID", "error", err)
+		err = fmt.Errorf("failed to parse provider ID for node %s: %w", node.Name, err)
+		slog.Error("Failed to parse provider ID", "error", err, "node", node.Name)
 
 		return "", err
 	}
 
 	// Reboot the EC2 instance
 	slog.Info("Rebooting node", "node", node.Name, "instanceID", instanceID)
 
 	_, err = c.ec2.RebootInstances(ctx, &ec2.RebootInstancesInput{
 		InstanceIds: []string{instanceID},
 	})
 	if err != nil {
-		slog.Error("Failed to reboot instance", "error", err, "instanceID", instanceID)
+		err = fmt.Errorf("failed to reboot instance %s for node %s: %w", instanceID, node.Name, err)
+		slog.Error("Failed to reboot instance", "error", err, "instanceID", instanceID, "node", node.Name)
 
 		return "", err
 	}

Based on coding guidelines, which recommend "Wrap errors with context using fmt.Errorf("context: %w", err) in Go code."

janitor-provider/pkg/csp/azure/azure.go (1)

92-103: Consider wrapping errors with additional context.

Similar to the AWS implementation, wrapping errors with fmt.Errorf would provide richer context for upstream error handling while preserving the error chain.

πŸ”Ž Example error wrapping
 	// Extract the resource group and VM name from the provider ID
 	resourceGroup, vmName, instanceID, err := parseAzureProviderID(providerID)
 	if err != nil {
-		slog.Error("Failed to parse provider ID", "error", err)
+		err = fmt.Errorf("failed to parse Azure provider ID for node %s: %w", node.Name, err)
+		slog.Error("Failed to parse provider ID", "error", err, "node", node.Name)
 		return "", err
 	}
 
 	// Reboot the VM
 	_, err = vmssClient.BeginRestart(ctx, resourceGroup, vmName, instanceID, nil)
 	if err != nil {
-		slog.Error("Failed to send restart signal to node", "error", err, "node", vmName)
+		err = fmt.Errorf("failed to send restart signal to Azure VM %s: %w", vmName, err)
+		slog.Error("Failed to send restart signal to node", "error", err, "node", vmName)
 		return "", err
 	}

Based on coding guidelines, which recommend "Wrap errors with context using fmt.Errorf("context: %w", err) in Go code."

distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml (2)

63-70: Document probe configuration.

The liveness and readiness probes lack inline comments explaining their purpose and what endpoint they check. As per coding guidelines, all values should be documented.

πŸ”Ž Suggested documentation
+# Health check probes
 livenessProbe:
+  # HTTP endpoint to check if the container is alive
   httpGet:
     path: /
     port: http
 readinessProbe:
+  # HTTP endpoint to check if the container is ready to serve traffic
   httpGet:
     path: /
     port: http

Based on coding guidelines: Document all values in Helm chart values.yaml with inline comments.


85-87: Document the service port.

The service port (50051) should have an inline comment explaining it's the gRPC port for the CSP provider service.

πŸ”Ž Suggested documentation
 service:
+  # gRPC port for CSP provider service
   port: 50051

Based on coding guidelines: Document all values in Helm chart values.yaml with inline comments.

.ko.yaml (1)

157-172: Consider revising "Example" wording in descriptions

The descriptions reference "Example provider" and "Example implementation," but this appears to be production code rather than an example. Consider updating to reflect its actual purpose:

  • Line 164: "Example provider for NVSentinel Janitor"
  • Line 169: "Example implementation of the CSP provider interface for Janitor"
Suggested revision
     annotations:
-      org.opencontainers.image.description: "Example provider for NVSentinel Janitor"
+      org.opencontainers.image.description: "CSP provider service for NVSentinel Janitor"
     labels:
       org.opencontainers.image.source: "https://github.com/nvidia/nvsentinel"
       org.opencontainers.image.licenses: "Apache-2.0"
       org.opencontainers.image.title: "NVSentinel Janitor-Provider"
-      org.opencontainers.image.description: "Example implementation of the CSP provider interface for Janitor"
+      org.opencontainers.image.description: "CSP provider service implementing the gRPC interface for cloud operations"

Note: Based on the PR context, the single janitor-provider with CSP detection via environment aligns with the past review feedback about avoiding image explosion.

api/proto/csp/v1alpha1/provider.proto (1)

20-48: Consider adding field documentation

The proto definitions are clean and functional, but adding comments would improve maintainability and auto-generated documentation quality. Consider documenting:

  • Service RPCs with their purpose and behavior
  • Field meanings and constraints (e.g., node_name format expectations)
  • Return value semantics (e.g., what request_id represents)
Example documentation
+// CSPProviderService provides cloud service provider operations for node lifecycle management.
 service CSPProviderService {
+  // SendRebootSignal initiates a reboot operation for the specified node.
   rpc SendRebootSignal(SendRebootSignalRequest) returns (SendRebootSignalResponse) {}
+  // IsNodeReady checks whether the node has completed its reboot and is ready.
   rpc IsNodeReady(IsNodeReadyRequest) returns (IsNodeReadyResponse) {}
+  // SendTerminateSignal initiates termination of the specified node.
   rpc SendTerminateSignal(SendTerminateSignalRequest) returns (SendTerminateSignalResponse) {}
 }

 message SendRebootSignalRequest {
+  // node_name is the Kubernetes node name to reboot.
   string node_name = 1;
 }
janitor-provider/main.go (1)

99-99: Make port configurable

The port :50051 is hard-coded. Consider making it configurable via an environment variable with a sensible default.

Proposed change
+	port := os.Getenv("GRPC_PORT")
+	if port == "" {
+		port = "50051"
+	}
+	
-	lis, err := net.Listen("tcp", ":50051")
+	lis, err := net.Listen("tcp", ":"+port)
 	if err != nil {
 		slog.Error("Failed to listen", "error", err)
 		os.Exit(1)
 	}
+	
+	slog.Info("Listening for gRPC connections", "port", port)
janitor/pkg/controller/terminatenode_controller.go (2)

328-330: Consider adding input validation

The node name is passed directly to the gRPC request without validation. While the Kubernetes API will validate it, adding an explicit check here would make errors clearer.

+	if node.Name == "" {
+		return ctrl.Result{}, fmt.Errorf("node name is empty")
+	}
+
 	_, terminateErr := r.CSPClient.SendTerminateSignal(ctx, &cspv1alpha1.SendTerminateSignalRequest{
 		NodeName: node.Name,
 	})

413-413: Consider adding connection lifecycle management and documenting the in-cluster security boundary

The gRPC client uses insecure.NewCredentials() for communication with the CSP provider service (janitor-provider.nvsentinel.svc.cluster.local:50051). While this is acceptable for in-cluster Kubernetes service-to-service communication, the code lacks:

  1. Connection lifecycle management: The connection is created in SetupWithManager but never explicitly closed or gracefully shutdown
  2. Security assumption documentation: No comments explaining that this assumes in-cluster communication to a co-located trusted service
  3. TLS support option: Even for in-cluster communication, consider making TLS optional via configuration for defense-in-depth, especially given the sensitivity of operations (SendRebootSignal, SendTerminateSignal)

Add comments clarifying the security model and implement proper connection cleanup via controller shutdown hooks.

janitor/pkg/controller/test_utils.go (1)

235-237: Consider returning error instead of log.Fatalf.

log.Fatalf terminates the entire process immediately, preventing proper test cleanup and making debugging harder. Consider returning an error to give callers control over failure handling.

πŸ”Ž Proposed refactor
-func NewMockCSPTestHelper() *MockCSPTestHelper {
+func NewMockCSPTestHelper() (*MockCSPTestHelper, error) {
 	lis := bufconn.Listen(1024 * 1024)
 	server := grpc.NewServer()
 	mockServer := NewMockCSPServer()

 	cspv1alpha1.RegisterCSPProviderServiceServer(server, mockServer)

 	go func() {
 		if err := server.Serve(lis); err != nil && !errors.Is(err, grpc.ErrServerStopped) {
 			log.Printf("Mock CSP server exited with error: %v", err)
 		}
 	}()

 	// Create client connection
 	conn, err := grpc.NewClient(
 		"passthrough://bufnet",
 		grpc.WithContextDialer(func(context.Context, string) (net.Conn, error) {
 			return lis.Dial()
 		}),
 		grpc.WithTransportCredentials(insecure.NewCredentials()),
 	)
 	if err != nil {
-		log.Fatalf("Failed to create mock CSP client: %v", err)
+		server.GracefulStop()
+		return nil, fmt.Errorf("failed to create mock CSP client: %w", err)
 	}

 	return &MockCSPTestHelper{
 		Server:     mockServer,
 		Client:     cspv1alpha1.NewCSPProviderServiceClient(conn),
 		grpcServer: server,
 		listener:   lis,
-	}
+	}, nil
 }
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between fd4466e and d000cd6.

β›” Files ignored due to path filters (4)
  • api/gen/go/csp/v1alpha1/provider.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • api/gen/go/csp/v1alpha1/provider_grpc.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • janitor-provider/go.sum is excluded by !**/*.sum
  • janitor/go.sum is excluded by !**/*.sum
πŸ“’ Files selected for processing (41)
  • .gitignore
  • .ko.yaml
  • api/Makefile
  • api/proto/csp/v1alpha1/provider.proto
  • distros/kubernetes/nvsentinel/Chart.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/Chart.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/_helpers.tpl
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrolebinding.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/networkpolicy.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/service.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
  • distros/kubernetes/nvsentinel/charts/janitor/templates/configmap.yaml
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • janitor-provider/Tiltfile
  • janitor-provider/go.mod
  • janitor-provider/main.go
  • janitor-provider/pkg/csp/aws/aws.go
  • janitor-provider/pkg/csp/azure/azure.go
  • janitor-provider/pkg/csp/client.go
  • janitor-provider/pkg/csp/client_test.go
  • janitor-provider/pkg/csp/gcp/gcp.go
  • janitor-provider/pkg/csp/kind/kind.go
  • janitor-provider/pkg/csp/nebius/nebius.go
  • janitor-provider/pkg/csp/oci/oci.go
  • janitor-provider/pkg/model/csp.go
  • janitor/go.mod
  • janitor/pkg/config/config.go
  • janitor/pkg/controller/rebootnode_controller.go
  • janitor/pkg/controller/rebootnode_controller_test.go
  • janitor/pkg/controller/suite_test.go
  • janitor/pkg/controller/terminatenode_controller.go
  • janitor/pkg/controller/terminatenode_controller_test.go
  • janitor/pkg/controller/test_utils.go
  • scripts/build-image-list.sh
  • scripts/buildko.sh
  • scripts/check-image-attestations.sh
  • tilt/Tiltfile
🧰 Additional context used
πŸ““ Path-based instructions (4)
**/values.yaml

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

  • distros/kubernetes/nvsentinel/values.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • janitor-provider/pkg/csp/client.go
  • janitor-provider/pkg/csp/oci/oci.go
  • janitor/pkg/controller/terminatenode_controller_test.go
  • janitor/pkg/config/config.go
  • janitor-provider/pkg/csp/azure/azure.go
  • janitor-provider/pkg/csp/kind/kind.go
  • janitor-provider/pkg/csp/aws/aws.go
  • janitor/pkg/controller/terminatenode_controller.go
  • janitor-provider/pkg/csp/nebius/nebius.go
  • janitor-provider/pkg/csp/gcp/gcp.go
  • janitor-provider/main.go
  • janitor/pkg/controller/rebootnode_controller_test.go
  • janitor/pkg/controller/rebootnode_controller.go
  • janitor/pkg/controller/suite_test.go
  • janitor/pkg/controller/test_utils.go
**/go.mod

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

  • janitor-provider/go.mod
  • janitor/go.mod
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • janitor/pkg/controller/terminatenode_controller_test.go
  • janitor/pkg/controller/rebootnode_controller_test.go
  • janitor/pkg/controller/suite_test.go
🧠 Learnings (23)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code

Applied to files:

  • janitor-provider/pkg/csp/client.go
  • janitor-provider/pkg/csp/oci/oci.go
  • janitor-provider/pkg/csp/azure/azure.go
  • janitor-provider/pkg/csp/kind/kind.go
  • janitor-provider/pkg/csp/aws/aws.go
  • janitor-provider/pkg/csp/nebius/nebius.go
  • janitor-provider/pkg/csp/gcp/gcp.go
  • janitor-provider/main.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

  • janitor-provider/go.mod
  • janitor/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

  • janitor-provider/go.mod
  • janitor/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • janitor-provider/go.mod
  • janitor/go.mod
  • janitor-provider/main.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • janitor-provider/go.mod
  • janitor/pkg/controller/terminatenode_controller_test.go
  • janitor/pkg/controller/rebootnode_controller_test.go
  • janitor/pkg/controller/suite_test.go
  • janitor/pkg/controller/test_utils.go
πŸ“š Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

  • janitor-provider/go.mod
  • janitor/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • janitor-provider/go.mod
  • janitor/pkg/controller/terminatenode_controller.go
  • janitor/go.mod
  • janitor-provider/pkg/csp/gcp/gcp.go
  • api/Makefile
  • janitor/pkg/controller/rebootnode_controller.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • janitor-provider/go.mod
πŸ“š Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • janitor-provider/go.mod
  • janitor/go.mod
  • api/Makefile
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Include examples for non-obvious configurations in Helm chart documentation

Applied to files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/_helpers.tpl
  • distros/kubernetes/nvsentinel/charts/janitor/templates/configmap.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Document all values in Helm chart `values.yaml` with inline comments

Applied to files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/Chart.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/_helpers.tpl
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Note truthy value requirements in Helm chart documentation where applicable

Applied to files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
πŸ“š Learning: 2025-11-26T13:54:26.427Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: distros/kubernetes/nvsentinel/values-tilt.yaml:215-223
Timestamp: 2025-11-26T13:54:26.427Z
Learning: For values-tilt.yaml files: Keep documentation minimal and implementation-focused. Unlike production values.yaml files which need comprehensive inline comments, values-tilt.yaml is a Tilt-specific testing/development override file that doesn't require extensive documentation.

Applied to files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
  • janitor-provider/Tiltfile
πŸ“š Learning: 2025-11-19T06:54:22.355Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: distros/kubernetes/nvsentinel/charts/fault-remediation/files/log-collector-job.yaml:39-40
Timestamp: 2025-11-19T06:54:22.355Z
Learning: When reviewing Helm templates in subcharts (charts within a charts/ directory), always check the parent chart's values.yaml for `global` section definitions. In Helm, values defined under `global` in the parent chart are automatically inherited by all subcharts and accessible via `.Values.global.*` within subchart templates. This is a standard Helm feature for sharing configuration across multiple charts.

Applied to files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/_helpers.tpl
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Health monitors should include Helm chart in `distros/kubernetes/nvsentinel/charts/`

Applied to files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/Chart.yaml
  • distros/kubernetes/nvsentinel/Chart.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code

Applied to files:

  • janitor/pkg/controller/terminatenode_controller_test.go
  • janitor-provider/pkg/csp/nebius/nebius.go
  • janitor/pkg/controller/rebootnode_controller_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • janitor/pkg/controller/terminatenode_controller_test.go
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • janitor/pkg/controller/terminatenode_controller_test.go
  • janitor/pkg/controller/rebootnode_controller_test.go
  • janitor/pkg/controller/suite_test.go
πŸ“š Learning: 2025-11-26T08:03:16.858Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: log-collector/Tiltfile:18-23
Timestamp: 2025-11-26T08:03:16.858Z
Learning: In Tilt configurations, when a Docker build uses the repository root as the build context but the Dockerfile only copies files from a specific subdirectory, it's correct and more efficient to set deps to watch only that subdirectory (e.g., deps=["./"] when the Tiltfile is in that subdirectory) rather than watching the entire repository root. This avoids unnecessary rebuilds when unrelated files change.

Applied to files:

  • janitor-provider/Tiltfile
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Use semantic versioning for breaking changes in Protocol Buffer messages

Applied to files:

  • api/Makefile
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Include comprehensive comments for all fields in Protocol Buffer messages

Applied to files:

  • api/Makefile
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Define Protocol Buffer messages in `data-models/protobufs/` directory

Applied to files:

  • api/Makefile
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • janitor/pkg/controller/rebootnode_controller_test.go
🧬 Code graph analysis (7)
janitor/pkg/controller/terminatenode_controller_test.go (3)
janitor/pkg/controller/terminatenode_controller.go (1)
  • TerminateNodeReconciler (50-55)
janitor/pkg/config/config.go (2)
  • Config (26-30)
  • TerminateNodeControllerConfig (61-73)
janitor-provider/pkg/model/csp.go (1)
  • CSPClient (30-39)
janitor/pkg/controller/terminatenode_controller.go (3)
janitor-provider/pkg/model/csp.go (1)
  • CSPClient (30-39)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (2)
  • CSPProviderServiceClient (44-48)
  • NewCSPProviderServiceClient (54-56)
janitor/pkg/config/config.go (1)
  • Config (26-30)
janitor-provider/main.go (5)
janitor-provider/pkg/model/csp.go (1)
  • CSPClient (30-39)
commons/pkg/logger/logger.go (1)
  • SetDefaultStructuredLogger (73-75)
commons/pkg/server/server.go (1)
  • NewServer (375-396)
janitor-provider/pkg/csp/client.go (1)
  • New (46-69)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (1)
  • RegisterCSPProviderServiceServer (124-133)
janitor/pkg/controller/rebootnode_controller_test.go (5)
janitor/pkg/controller/rebootnode_controller.go (1)
  • RebootNodeReconciler (74-79)
janitor/pkg/config/config.go (2)
  • Config (26-30)
  • RebootNodeControllerConfig (46-58)
janitor-provider/pkg/model/csp.go (1)
  • CSPClient (30-39)
commons/pkg/server/server.go (1)
  • Server (87-97)
janitor/pkg/controller/test_utils.go (1)
  • DefaultFailureBehavior (61-67)
janitor/pkg/controller/rebootnode_controller.go (2)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (2)
  • CSPProviderServiceClient (44-48)
  • NewCSPProviderServiceClient (54-56)
janitor/pkg/config/config.go (1)
  • Config (26-30)
janitor/pkg/controller/suite_test.go (1)
janitor/pkg/controller/test_utils.go (2)
  • MockCSPTestHelper (204-209)
  • NewMockCSPTestHelper (214-245)
janitor/pkg/controller/test_utils.go (1)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (3)
  • CSPProviderServiceClient (44-48)
  • RegisterCSPProviderServiceServer (124-133)
  • NewCSPProviderServiceClient (54-56)
πŸͺ› ast-grep (0.40.3)
janitor-provider/main.go

[warning] 98-98: "Detected a network listener listening on 0.0.0.0 or an empty string.
This could unexpectedly expose the server publicly as it binds to all
available interfaces. Instead, specify another IP address that is not
0.0.0.0 nor the empty string."
Context: net.Listen("tcp", ":50051")
Note: [CWE-200] Exposure of Sensitive Information to an Unauthorized Actor [REFERENCES]
- https://owasp.org/Top10/A01_2021-Broken_Access_Control

(avoid-bind-to-all-interfaces-go)

πŸͺ› OSV Scanner (2.3.1)
janitor-provider/go.mod

[HIGH] 1-1: github.com/golang-jwt/jwt/v4 4.5.1: Excessive memory allocation during header parsing in github.com/golang-jwt/jwt

(GO-2025-3553)


[HIGH] 1-1: github.com/golang-jwt/jwt/v4 4.5.1: jwt-go allows excessive memory allocation during header parsing

(GHSA-mh63-6h87-95cp)

πŸͺ› YAMLlint (1.37.1)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml

[warning] 67-67: too many blank lines (1 > 0)

(empty-lines)


[error] 19-19: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrolebinding.yaml

[error] 19-19: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/networkpolicy.yaml

[warning] 30-30: too many blank lines (1 > 0)

(empty-lines)


[error] 17-17: syntax error: expected , but found ''

(syntax)

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yaml

[error] 19-19: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/service.yaml

[error] 19-19: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml

[error] 19-19: syntax error: expected the node content, but found '-'

(syntax)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

πŸ€– Fix all issues with AI Agents
In
@distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml:
- Around line 50-56: The env values for JANITOR_PROVIDER_PORT, METRICS_PORT and
CSP are rendered as non-strings; update the Helm template to ensure each value
is rendered as a string by applying the quote filter to the entire expression
(e.g., quote the result of .Values.service.port | default 50051,
((.Values.global).metricsPort) | default 2112, and .Values.csp | default "kind")
so Kubernetes receives string values for the environment variables.

In @janitor-provider/main.go:
- Around line 97-149: The main function lacks graceful shutdown: add OS signal
handling (SIGINT, SIGTERM) to create a cancellable root context, pass that
context to long-running components (use context for csp.New if it accepts it,
and cancel when shutting down), and shut down both the metrics server (srv) and
the gRPC server (svr) cleanly instead of exiting immediately; specifically, in
main() set up a signal.NotifyContext or a signal channel, run srv.Serve(ctx) and
svr.Serve(lis) in separate goroutines, on signal call srv.Shutdown(ctx) or
server.Stop/Close equivalent and call svr.GracefulStop()/Stop after allowing a
timeout, cancel context and close listeners, and log errors from those shutdown
steps before exiting.
- Around line 125-133: The metrics server Serve call is blocking so the gRPC
server setup never runs; change the flow to start the metrics server
asynchronously by launching srv.Serve(...) in a separate goroutine and capture
its error on a channel or context so you can log failures without blocking
startup; locate the server.NewServer(...) and srv.Serve(...) calls in main
(symbols: server.NewServer, srv.Serve) and replace the direct blocking call with
a go routine that sends any returned error to an error channel or uses a
context-aware goroutine, while allowing the subsequent gRPC initialization (grpc
server setup and grpcSrv.Serve) to proceed normally.
- Line 15: Add a package-level godoc comment immediately above the "package
main" declaration describing the package's purpose and high-level behavior (what
the janitor-provider binary does, its main responsibilities, and any important
runtime considerations); ensure the comment is a complete sentence starting with
"Package main ..." so it satisfies Go documentation conventions and linter
checks.
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between d000cd6 and 75f1edf.

β›” Files ignored due to path filters (1)
  • janitor-provider/go.sum is excluded by !**/*.sum
πŸ“’ Files selected for processing (5)
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/service.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
  • janitor-provider/go.mod
  • janitor-provider/main.go
βœ… Files skipped from review due to trivial changes (1)
  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • janitor-provider/main.go
**/go.mod

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

  • janitor-provider/go.mod
🧠 Learnings (11)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • janitor-provider/main.go
  • janitor-provider/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code

Applied to files:

  • janitor-provider/main.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • janitor-provider/main.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Each health monitor service should implement the gRPC service from `data-models/protobufs/`

Applied to files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

  • janitor-provider/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

  • janitor-provider/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • janitor-provider/go.mod
πŸ“š Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

  • janitor-provider/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • janitor-provider/go.mod
πŸ“š Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • janitor-provider/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • janitor-provider/go.mod
πŸͺ› OSV Scanner (2.3.1)
janitor-provider/go.mod

[HIGH] 1-1: github.com/golang-jwt/jwt/v4 4.5.1: Excessive memory allocation during header parsing in github.com/golang-jwt/jwt

(GO-2025-3553)


[HIGH] 1-1: github.com/golang-jwt/jwt/v4 4.5.1: jwt-go allows excessive memory allocation during header parsing

(GHSA-mh63-6h87-95cp)

πŸͺ› YAMLlint (1.37.1)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml

[error] 19-19: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/service.yaml

[error] 19-19: syntax error: expected the node content, but found '-'

(syntax)

πŸ”‡ Additional comments (3)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/service.yaml (1)

14-30: LGTM!

The Service template is well-structured with appropriate defaults (gRPC on 50051, metrics on 2112). The use of Helm helpers for naming and labeling ensures consistency across the chart.

Note: The YAMLlint syntax error on line 19 is a false positiveβ€”{{- include ... | nindent 4}} is valid Helm templating that the linter doesn't understand.

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml (1)

39-49: Container configuration looks good.

The image tag fallback chain provides flexibility, and the port definitions follow conventions (metrics: 2112, gRPC service: 50051). The YAMLlint error on line 19 is a false positive due to Helm templating syntax.

janitor-provider/go.mod (1)

69-69: Upgrade github.com/golang-jwt/jwt/v4 to v4.5.2 to fix HIGH severity vulnerability.

This dependency contains CVE-2025-30204 (GHSA-mh63-6h87-95cp), which allows excessive memory allocation during header parsing. The fix is available in v4.5.2.

πŸ”Ž Fix

Run the following commands to upgrade:

cd janitor-provider
go get github.com/golang-jwt/jwt/[email protected]
go mod tidy

Likely an incorrect or invalid review comment.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

πŸ€– Fix all issues with AI Agents
In @janitor-provider/go.mod:
- Line 70: Update the vulnerable dependency github.com/golang-jwt/jwt/v4 from
v4.5.1 to v4.5.2 in the module declaration (replace the version entry for
github.com/golang-jwt/jwt/v4 v4.5.1 with v4.5.2), then run go get
github.com/golang-jwt/jwt/[email protected] and go mod tidy to refresh the go.sum and
ensure the new version is resolved; verify with go mod graph or the provided
verification script to confirm no older v4.5.1 remains in the dependency graph.

In @janitor-provider/main.go:
- Line 15: Add a package-level godoc comment above the package main declaration
describing the purpose and high-level behavior of this package (what
janitor-provider does, intended audience, and any important usage/constraints);
place the comment immediately above the "package main" line so it becomes the
package-level documentation required by the coding guidelines.
- Line 76: The call to s.cspClient.IsNodeReady uses a hardcoded "Node is ready"
message; change IsNodeReadyRequest to include a request_id field in the proto,
regenerate stubs, have SendRebootSignal return that request_id (or capture its
returned request_id variable) and pass that request_id into
s.cspClient.IsNodeReady (replace the hardcoded string), and update all CSP
implementations (OCI/Azure/AWS/GCP/Nebius) and tests to consume the new
IsNodeReadyRequest.request_id semantics.
🧹 Nitpick comments (5)
distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml (3)

15-24: Add inline comments for core deployment values per coding guidelines.

The coding guidelines require all values in Helm chart values.yaml to be documented with inline comments. The CSP configuration sections have excellent documentation, but these core values are missing inline comments explaining their purpose.

πŸ”Ž Suggested inline comments
+# Number of janitor-provider replicas to deploy
 replicaCount: 1

 # Image configuration
 image:
+  # Container image repository
   repository: ghcr.io/nvidia/nvsentinel/janitor-provider
+  # Image pull policy (Always, IfNotPresent, Never)
   pullPolicy: IfNotPresent
+  # Image tag (defaults to Chart.appVersion if empty)
   tag: ""

+# Override the chart name
 nameOverride: ""
+# Override the full release name
 fullnameOverride: "janitor-provider"

Based on coding guidelines, document all values in Helm chart values.yaml with inline comments.


52-59: Consider adding a liveness probe for better pod lifecycle management.

The readinessProbe configuration looks appropriate for using the HTTP health endpoint from the metrics server. However, adding a livenessProbe would help Kubernetes detect and restart hung containers.

πŸ”Ž Suggested liveness probe
 readinessProbe:
   httpGet:
     path: /healthz
     port: metrics
   initialDelaySeconds: 5
   periodSeconds: 10
   timeoutSeconds: 3
   failureThreshold: 3

+livenessProbe:
+  httpGet:
+    path: /healthz
+    port: metrics
+  initialDelaySeconds: 15
+  periodSeconds: 20
+  timeoutSeconds: 5
+  failureThreshold: 3

37-70: Add inline comments for remaining configuration values.

Per coding guidelines, all values should have inline comments. These standard Kubernetes deployment values would benefit from brief descriptions.

πŸ”Ž Suggested inline comments
+# Additional annotations to add to pods
 podAnnotations: {}
+# Additional labels to add to pods
 podLabels: {}

+# Resource limits and requests for the container
 resources: {}
   # We usually recommend not to specify default resources...

...

+# Node selector for pod scheduling
 nodeSelector: {}

+# Tolerations for pod scheduling
 tolerations: []

+# Affinity rules for pod scheduling
 affinity: {}

 service:
+  # gRPC service port
   port: 50051

+# Additional environment variables to inject
 extraEnv: []

Based on coding guidelines, document all values in Helm chart values.yaml with inline comments.

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml (2)

141-142: Add liveness probe template for production resilience.

The readinessProbe is properly templated from values, but there's no corresponding livenessProbe template. For production deployments, a liveness probe helps Kubernetes detect and restart stuck containers. Consider adding a livenessProbe section in both the values.yaml and this template.

πŸ”Ž Proposed template addition
          readinessProbe:
            {{- toYaml .Values.readinessProbe | nindent 12 }}
+          {{- with .Values.livenessProbe }}
+          livenessProbe:
+            {{- toYaml . | nindent 12 }}
+          {{- end }}

156-156: Remove redundant restartPolicy: Always.

restartPolicy: Always is the default (and only valid value) for Deployment pod specs. This line can be safely removed to reduce template noise.

πŸ”Ž Proposed fix
           {{- end }}
-      restartPolicy: Always
       {{- with (((.Values.global).systemNodeSelector) | default .Values.nodeSelector) }}
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 75f1edf and ac89c73.

πŸ“’ Files selected for processing (8)
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
  • distros/kubernetes/nvsentinel/charts/janitor/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/janitor/templates/serviceaccount.yaml
  • distros/kubernetes/nvsentinel/charts/janitor/values.yaml
  • janitor-provider/go.mod
  • janitor-provider/main.go
πŸ’€ Files with no reviewable changes (3)
  • distros/kubernetes/nvsentinel/charts/janitor/templates/serviceaccount.yaml
  • distros/kubernetes/nvsentinel/charts/janitor/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/janitor/values.yaml
🧰 Additional context used
πŸ““ Path-based instructions (3)
**/go.mod

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

  • janitor-provider/go.mod
**/values.yaml

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • janitor-provider/main.go
🧠 Learnings (18)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Each health monitor service should implement the gRPC service from `data-models/protobufs/`

Applied to files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Note truthy value requirements in Helm chart documentation where applicable

Applied to files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

  • janitor-provider/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

  • janitor-provider/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • janitor-provider/go.mod
  • janitor-provider/main.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • janitor-provider/go.mod
πŸ“š Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

  • janitor-provider/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • janitor-provider/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • janitor-provider/go.mod
πŸ“š Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • janitor-provider/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Document all values in Helm chart `values.yaml` with inline comments

Applied to files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Include examples for non-obvious configurations in Helm chart documentation

Applied to files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
πŸ“š Learning: 2025-11-26T13:54:26.427Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: distros/kubernetes/nvsentinel/values-tilt.yaml:215-223
Timestamp: 2025-11-26T13:54:26.427Z
Learning: For values-tilt.yaml files: Keep documentation minimal and implementation-focused. Unlike production values.yaml files which need comprehensive inline comments, values-tilt.yaml is a Tilt-specific testing/development override file that doesn't require extensive documentation.

Applied to files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code

Applied to files:

  • janitor-provider/main.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

  • janitor-provider/main.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • janitor-provider/main.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • janitor-provider/main.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • janitor-provider/main.go
πŸͺ› OSV Scanner (2.3.1)
janitor-provider/go.mod

[HIGH] 1-1: github.com/golang-jwt/jwt/v4 4.5.1: Excessive memory allocation during header parsing in github.com/golang-jwt/jwt

(GO-2025-3553)


[HIGH] 1-1: github.com/golang-jwt/jwt/v4 4.5.1: jwt-go allows excessive memory allocation during header parsing

(GHSA-mh63-6h87-95cp)

πŸͺ› YAMLlint (1.37.1)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml

[error] 19-19: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yaml

[error] 19-19: syntax error: expected the node content, but found '-'

(syntax)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
  • GitHub Check: ko-build-test (health-monitors/csp-health-monitor, ./cmd/maintenance-notifier)
  • GitHub Check: ko-build-test (health-monitors/csp-health-monitor, ./cmd/csp-health-monitor)
  • GitHub Check: ko-build-test (fault-quarantine, .)
  • GitHub Check: container-build-test (gpu-health-monitor-dcgm3, make -C health-monitors/gpu-health-monitor docker...
  • GitHub Check: container-build-test (log-collector, make -C log-collector docker-build-log-collector)
  • GitHub Check: container-build-test (syslog-health-monitor, make -C health-monitors/syslog-health-monitor docker...
  • GitHub Check: container-build-test (file-server-cleanup, make -C log-collector docker-build-file-server-cleanup)
  • GitHub Check: container-build-test (gpu-health-monitor-dcgm4, make -C health-monitors/gpu-health-monitor docker...
  • GitHub Check: E2E Tests (AMD64 + MongoDB)
  • GitHub Check: E2E Tests (AMD64 + PostgreSQL)
  • GitHub Check: E2E Tests (ARM64 + PostgreSQL)
  • GitHub Check: E2E Tests (ARM64 + MongoDB)
  • GitHub Check: modules-lint-test (labeler)
  • GitHub Check: modules-lint-test (fault-remediation)
  • GitHub Check: modules-lint-test (janitor)
  • GitHub Check: modules-lint-test (platform-connectors)
  • GitHub Check: modules-lint-test (store-client)
  • GitHub Check: modules-lint-test (fault-quarantine)
  • GitHub Check: modules-lint-test (node-drainer)
  • GitHub Check: health-monitors-lint-test (kubernetes-object-monitor)
  • GitHub Check: health-monitors-lint-test (csp-health-monitor)
  • GitHub Check: CodeQL PR Analysis
πŸ”‡ Additional comments (4)
janitor-provider/go.mod (1)

1-24: Module structure follows guidelines appropriately.

The module is correctly set up as a separate Go module with semantic import versioning. The replace directives for local modules (api, commons) follow the monorepo pattern.

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml (1)

45-55: Container configuration looks good with proper port naming and quoting.

The container spec correctly defines named ports (metrics, service) and uses appropriate quoting for environment variable values. The image tag fallback chain provides flexibility across deployment scenarios.

janitor-provider/main.go (1)

151-181: Graceful shutdown implementation looks correct.

The errgroup-based concurrent server management with signal handling is well-implemented. Good design decision to log metrics server failures without terminating the main gRPC service, as metrics are non-critical for core functionality.

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yaml (1)

14-49: ServiceAccount template correctly implements CSP-specific workload identity annotations.

The conditional logic for AWS IRSA, GCP Workload Identity, Azure Workload Identity, and OCI Workload Identity annotations is well-structured. Each provider's annotations are guarded by appropriate checks for required values.

Nebius does not require ServiceAccount annotations for authenticationβ€”it uses environment variables (NEBIUS_SA_KEY_FILE, NEBIUS_IAM_TOKEN) and mounted secrets instead, so the template is correctly excluding Nebius-specific annotations.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
janitor/pkg/controller/rebootnode_controller.go (2)

190-206: Add timeout context to IsNodeReady gRPC call.

The gRPC call to IsNodeReady does not use a timeout context, despite the CSPOperationTimeout constant being defined (line 45). Without a timeout, this call can hang indefinitely if the CSP provider is slow or unresponsive, blocking the reconciliation loop and degrading controller responsiveness.

πŸ”Ž Proposed fix
 		} else {
+			cspCtx, cancel := context.WithTimeout(ctx, CSPOperationTimeout)
+			defer cancel()
-			rsp, nodeReadyErr := r.CSPClient.IsNodeReady(ctx, &cspv1alpha1.IsNodeReadyRequest{
+			rsp, nodeReadyErr := r.CSPClient.IsNodeReady(cspCtx, &cspv1alpha1.IsNodeReadyRequest{
 				NodeName: node.Name,
 			})
 			if nodeReadyErr != nil {

335-352: Add timeout context to SendRebootSignal gRPC call.

The gRPC call to SendRebootSignal does not use a timeout context, despite line 340 checking for context.DeadlineExceeded. This inconsistency suggests the code expects timeouts but doesn't enforce them. Without a timeout, the call can hang indefinitely, blocking the reconciliation loop.

πŸ”Ž Proposed fix
 			} else {
 				// Start the reboot process
 				metrics.GlobalMetrics.IncActionCount(metrics.ActionTypeReboot, metrics.StatusStarted, node.Name)
 				logger.Info("sending reboot signal to node",
 					"node", node.Name)
 
+				cspCtx, cancel := context.WithTimeout(ctx, CSPOperationTimeout)
+				defer cancel()
-				rsp, rebootErr := r.CSPClient.SendRebootSignal(ctx, &cspv1alpha1.SendRebootSignalRequest{
+				rsp, rebootErr := r.CSPClient.SendRebootSignal(cspCtx, &cspv1alpha1.SendRebootSignalRequest{
 					NodeName: node.Name,
 				})
 
 				// Check for timeout
 				if errors.Is(rebootErr, context.DeadlineExceeded) {
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between ac89c73 and df04258.

πŸ“’ Files selected for processing (1)
  • janitor/pkg/controller/rebootnode_controller.go
🧰 Additional context used
πŸ““ Path-based instructions (1)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • janitor/pkg/controller/rebootnode_controller.go
🧠 Learnings (1)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • janitor/pkg/controller/rebootnode_controller.go
🧬 Code graph analysis (1)
janitor/pkg/controller/rebootnode_controller.go (2)
janitor-provider/pkg/model/csp.go (1)
  • CSPClient (30-39)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (2)
  • CSPProviderServiceClient (44-48)
  • NewCSPProviderServiceClient (54-56)
πŸ”‡ Additional comments (1)
janitor/pkg/controller/rebootnode_controller.go (1)

78-78: LGTM: Field type updated for gRPC client.

The CSPClient field type has been correctly updated from the internal model to the generated gRPC client interface, aligning with the PR's objective to introduce a gRPC-based CSP provider service.

@jtschelling jtschelling force-pushed the jtschelling/353-grpc-plugin branch from 8d2858f to eae93f5 Compare January 9, 2026 18:26
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
janitor-provider/pkg/csp/gcp/gcp.go (1)

70-105: Clarify error handling logic in regex group extraction.

The error handling in lines 90-94 appears incorrect. The code checks if name != "" on line 90, then returns an error on line 93 saying "failed to extract required field %s from node.Spec.ProviderID" with name as the parameter. However, at this point name is guaranteed to be non-empty, so this error branch should never execute.

If the intent is to validate that all capture groups have matched values, the validation should check if match[i] is empty instead. The current logic makes the error handling unreachable.

πŸ”§ Suggested fix for error handling
 		for i, name := range groupNames {
 			// groupNames[0] is always an empty string
 			if i == 0 {
 				continue
 			}
 
-			if name != "" {
+			if name != "" && match[i] != "" {
 				result[name] = match[i]
-			} else {
+			} else if name != "" {
 				return nil, fmt.Errorf("failed to extract required field %s from node.Spec.ProviderID", name)
 			}
 		}
janitor/pkg/controller/test_utils.go (1)

15-15: Add package-level godoc.

The coding guidelines require package-level godoc for all Go packages.

As per coding guidelines, package-level godoc is required for all Go packages.

πŸ€– Fix all issues with AI agents
In @.gitignore:
- Around line 447-448: The `.vscode/` ignore rule you added conflicts with
existing negations like `!.vscode/settings.json`, `!.vscode/tasks.json`,
`!.vscode/launch.json`, and `!.vscode/extensions.json`, which prevents those
shared files from being committed; to preserve the shared-configuration pattern
remove the `.vscode/` entry and keep the negations, and also decide whether
workspace files should be sharedβ€”either remove or clarify the `*.code-workspace`
entry (or comment it) to match that intent.

In @.ko.yaml:
- Around line 157-173: The metadata for the janitor-provider image uses
"Example" which downplays production intent; update the annotations and labels
for the id "janitor-provider" (annotations/org.opencontainers.image.description
and labels/org.opencontainers.image.title and
labels/org.opencontainers.image.description) to consistent, production-ready
wording (e.g., "NVSentinel Janitor Provider" and a clear descriptive sentence)
so they match org.opencontainers.image.source/version/revision fields and avoid
implying non-production status.

In
@distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml:
- Around line 20-28: ClusterRole currently grants only get/list/watch on the
nodes resource but the TerminateNode controller requires the ability to delete
nodes; update the rules block in clusterrole.yaml (the ClusterRole that lists
resources: nodes and verbs: get, list, watch) to include the delete verb so the
controller can perform node deletions after CSP termination succeeds.

In
@distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml:
- Around line 116-137: Add an explicit precedence note to values.yaml
documenting Nebius credential selection: state that iamTokenSecretRef has
highest priority, then serviceAccountKeySecret, then serviceAccountKeyFile, and
iamToken is lowest; include the recommended usage (prefer
serviceAccountKeySecret/serviceAccountKeyFile for production) and place this
note near the existing Nebius credential docs alongside the existing keys
iamTokenSecretRef, iamToken, serviceAccountKeySecret and serviceAccountKeyFile
so the template precedence is clearly reflected.

In @distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml:
- Around line 72-191: The docs list "kwok" as a supported csp provider but there
is no csp.kwok blockβ€”either add an explicit placeholder block (csp.kwok: {} with
a short comment) or remove "kwok" from the provider list to avoid confusion;
also expand the nebius secret examples by replacing nebius.iamTokenSecretRef: {}
with a concrete inline example (e.g., iamTokenSecretRef: { name:
"nebius-janitor-token", key: "token" }) and add a similar example for
nebius.serviceAccountKeySecret (showing the secret key name like
sa-credentials.json), and state precedence explicitly (e.g., "When both
nebius.iamToken and nebius.iamTokenSecretRef are set, the secret reference takes
precedence") so readers know which value is used.
- Around line 15-71: Add inline comments for every top-level value in this
values.yaml (replicaCount, image.repository/pullPolicy/tag, nameOverride,
fullnameOverride, serviceAccount.* including automount/annotations/name,
podAnnotations, podLabels, resources, readinessProbe, nodeSelector, tolerations,
affinity, service.port, extraEnv) with short examples or expected types; for
extraEnv show the list-of-env format (name/value or valueFrom) and a brief
example. Also document the metrics/probe wiring: explain that
readinessProbe.httpGet.port: "metrics" refers to a named container port defined
in the Deployment/StatefulSet template (the container port named "metrics") and
state the default METRICS_PORT value used by the container, plus note that if
users override service.port or METRICS_PORT they must keep the named port
consistent for probes to work.

In @janitor-provider/go.mod:
- Line 70: Add a Go module replace directive to override the transitive
vulnerable dependency "github.com/golang-jwt/jwt/v4 v4.5.1" to the patched
version v4.5.2 by updating go.mod to include a replace for
github.com/golang-jwt/jwt/v4 => github.com/golang-jwt/jwt/v4 v4.5.2 so the build
uses v4.5.2 even if nebius/gosdk still requires v4.5.1; then run go mod tidy to
update go.sum and verify builds/tests pass.

In @janitor-provider/main.go:
- Around line 58-101: Requests to Nodes().Get in SendRebootSignal, IsNodeReady
and SendTerminateSignal always return codes.Internal; validate req.NodeName
first and return codes.InvalidArgument when empty, and after calling
s.k8sClient.CoreV1().Nodes().Get detect kubernetes NotFound errors (use
api/errors.IsNotFound(err)) and return codes.NotFound, otherwise keep returning
codes.Internal; keep the original error text in the status.Errorf messages so
callers see the underlying error.
- Around line 113-187: run() currently treats grpc.ErrServerStopped as an error
and has no env defaults; update the gRPC serve goroutine that calls
svr.Serve(lis) to treat grpc.ErrServerStopped as non-fatal (suppress/return nil
when errors.Is(err, grpc.ErrServerStopped)), ensure the metrics goroutine still
logs but continues, and add defaults for JANITOR_PROVIDER_PORT (50051) and
METRICS_PORT (8080) when reading them (use fallback if empty or invalid before
net.Listen/strconv.Atoi). Also change the error string from "failed to create
kubernetes clientset" to "failed to create in-cluster kubernetes config" in
run(), and update the PR template (.github/PULL_REQUEST_TEMPLATE.md) to add
fields for tested CSP(s) and Prometheus metrics added.

In @janitor/pkg/controller/terminatenode_controller.go:
- Around line 413-419: The gRPC client connection created via grpc.NewClient
(assigned to conn) is never closed; update the SetupWithManager
(TerminateNodeReconciler) flow to store the connection and register a manager
runnable that closes conn on context cancellation: after creating conn and
assigning r.CSPClient, create and register a connectionCloser (holding conn)
with mgr.Add or mgr.AddRunnable that waits on ctx.Done() and calls conn.Close(),
and ensure you handle registration errors by closing conn immediately on failure
to avoid leaks.
🧹 Nitpick comments (8)
janitor-provider/pkg/csp/aws/aws.go (2)

99-130: Consider adding more context when wrapping errors.

While the logging migration to slog is correctly implemented with structured fields, some errors could benefit from additional context when returned. For example:

  • Line 106: fmt.Errorf("no provider ID found for node %s: %w", node.Name, err) instead of creating a new error
  • Line 126: fmt.Errorf("failed to reboot EC2 instance %s: %w", instanceID, err) instead of returning the raw error

This aligns with the coding guideline to wrap errors with context using fmt.Errorf("context: %w", err).

♻️ Proposed error wrapping improvements
 func (c *Client) SendRebootSignal(ctx context.Context, node corev1.Node) (model.ResetSignalRequestRef, error) {
 	// Fetch the node's provider ID
 	providerID := node.Spec.ProviderID
 	if providerID == "" {
-		err := fmt.Errorf("no provider ID found for node %s", node.Name)
+		err := fmt.Errorf("failed to reboot node %s: no provider ID found", node.Name)
 		slog.Error("Failed to reboot node", "error", err)
 
 		return "", err
 	}
 
 	// Extract the instance ID from the provider ID
 	instanceID, err := parseAWSProviderID(providerID)
 	if err != nil {
-		slog.Error("Failed to parse provider ID", "error", err)
+		err = fmt.Errorf("failed to parse provider ID for node %s: %w", node.Name, err)
+		slog.Error("Failed to parse provider ID", "error", err, "node", node.Name)
 
 		return "", err
 	}
 
 	// Reboot the EC2 instance
 	slog.Info("Rebooting node", "node", node.Name, "instanceID", instanceID)
 
 	_, err = c.ec2.RebootInstances(ctx, &ec2.RebootInstancesInput{
 		InstanceIds: []string{instanceID},
 	})
 	if err != nil {
-		slog.Error("Failed to reboot instance", "error", err, "instanceID", instanceID)
+		err = fmt.Errorf("failed to reboot EC2 instance %s: %w", instanceID, err)
+		slog.Error("Failed to reboot instance", "error", err, "instanceID", instanceID, "node", node.Name)
 
 		return "", err
 	}

As per coding guidelines, wrap errors with context using fmt.Errorf("context: %w", err).


134-148: Consider adding observability to IsNodeReady.

The IsNodeReady function implements a 5-minute cooldown logic but lacks logging, which could make debugging difficult. Consider adding structured logging to track when the cooldown period is active vs. when the node is determined ready.

♻️ Suggested logging additions
 func (c *Client) IsNodeReady(ctx context.Context, node corev1.Node, message string) (bool, error) {
 	// Sending a reboot request to AWS doesn't update statuses immediately,
 	// the ec2 instance does not report that it isn't in a running state for some time
 	// and kubernetes still sees the node as ready. Wait five minutes before checking the status
 	storedTime, err := time.Parse(time.RFC3339, message)
 	if err != nil {
-		return false, err
+		err = fmt.Errorf("failed to parse reboot timestamp for node %s: %w", node.Name, err)
+		slog.Error("Failed to parse reboot timestamp", "error", err, "node", node.Name, "message", message)
+		return false, err
 	}
 
+	elapsed := time.Since(storedTime)
-	if time.Since(storedTime) < 5*time.Minute {
+	if elapsed < 5*time.Minute {
+		slog.Debug("Node still in cooldown period", "node", node.Name, "elapsed", elapsed, "required", 5*time.Minute)
 		return false, nil
 	}
 
+	slog.Info("Node cooldown period complete", "node", node.Name, "elapsed", elapsed)
 	return true, nil
 }
janitor-provider/go.mod (1)

16-16: Remove or refactor testify usage to standard testing package.

The janitor-provider tests use stretchr/testify exclusively for simple assertions: assert.Equal(), assert.Error(), assert.NotNil(), and require.NoError(). Per project learnings, the NVIDIA/NVSentinel repository discourages testify for simple comparison assertions. These can be replaced with straightforward standard testing package patterns (e.g., if expected != actual { t.Errorf(...) }), eliminating the dependency.

janitor-provider/pkg/csp/client.go (1)

45-69: Wrap the β€œprovider from env” error with context (or return status upstream), instead of returning the raw parse error.

Right now you log context but return err as-is; callers will lose β€œCSP env” context unless they also log. Consider:

  • return nil, fmt.Errorf("determine CSP provider from env: %w", err)

This aligns with the repo’s Go error-wrapping guideline.

janitor-provider/pkg/csp/gcp/gcp.go (1)

137-137: Consider adding more context to operation logs.

The info logs at lines 137 and 218 could benefit from additional structured fields such as project, zone, and operation type for better observability and debugging.

πŸ“Š Enhanced logging with additional context
-	slog.Info("Sending reset signal to", "node", nodeFields.instance)
+	slog.Info("Sending reset signal to node",
+		"node", nodeFields.instance,
+		"project", nodeFields.project,
+		"zone", nodeFields.zone)
-	slog.Info("Sending delete signal to", "node", nodeFields.instance)
+	slog.Info("Sending delete signal to node",
+		"node", nodeFields.instance,
+		"project", nodeFields.project,
+		"zone", nodeFields.zone)

Also applies to: 218-218

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml (2)

141-142: Consider adding a liveness probe.

The deployment includes a readiness probe but no liveness probe. For a long-running gRPC service, a liveness probe can help Kubernetes detect and recover from deadlocked or unresponsive container states that the readiness probe might not catch.

πŸ₯ Add liveness probe configuration
           readinessProbe:
             {{- toYaml .Values.readinessProbe | nindent 12 }}
+          {{- with .Values.livenessProbe }}
+          livenessProbe:
+            {{- toYaml . | nindent 12 }}
+          {{- end }}

Then in values.yaml, add a default liveness probe configuration such as:

livenessProbe:
  grpc:
    port: 50051
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

45-56: Consider adding security context for defense in depth.

The container specification lacks a security context. As a security best practice, consider adding security context settings such as:

  • runAsNonRoot: true
  • readOnlyRootFilesystem: true (if applicable)
  • allowPrivilegeEscalation: false
  • Appropriate capabilities restrictions

This is especially important for a service that makes external CSP API calls.

πŸ”’ Suggested security context
         - name: janitor-provider
           image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default ((.Values.global).image).tag | default .Chart.AppVersion }}"
           imagePullPolicy: {{ .Values.image.pullPolicy }}
+          securityContext:
+            {{- toYaml .Values.securityContext | nindent 12 }}
           resources:
             {{- toYaml .Values.resources | nindent 12 }}

And in values.yaml:

securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: false  # Set to true if the app doesn't need to write to filesystem
janitor/pkg/controller/rebootnode_controller_test.go (1)

122-126: Consider migrating to envtest for controller testing.

The test suite uses a fake Kubernetes client, but the coding guidelines specify using envtest for testing Kubernetes controllers instead of fake clients. While the fake client works for these scenarios, envtest provides a more realistic test environment that better catches integration issues.

As per coding guidelines, use envtest for testing Kubernetes controllers instead of fake clients.

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between df04258 and eae93f5.

β›” Files ignored due to path filters (4)
  • api/gen/go/csp/v1alpha1/provider.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • api/gen/go/csp/v1alpha1/provider_grpc.pb.go is excluded by !**/*.pb.go, !**/gen/**
  • janitor-provider/go.sum is excluded by !**/*.sum
  • janitor/go.sum is excluded by !**/*.sum
πŸ“’ Files selected for processing (44)
  • .gitignore
  • .ko.yaml
  • api/Makefile
  • api/proto/csp/v1alpha1/provider.proto
  • distros/kubernetes/nvsentinel/Chart.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/Chart.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/_helpers.tpl
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrolebinding.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/networkpolicy.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/service.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
  • distros/kubernetes/nvsentinel/charts/janitor/templates/configmap.yaml
  • distros/kubernetes/nvsentinel/charts/janitor/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/janitor/templates/serviceaccount.yaml
  • distros/kubernetes/nvsentinel/charts/janitor/values.yaml
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • janitor-provider/Tiltfile
  • janitor-provider/go.mod
  • janitor-provider/main.go
  • janitor-provider/pkg/csp/aws/aws.go
  • janitor-provider/pkg/csp/azure/azure.go
  • janitor-provider/pkg/csp/client.go
  • janitor-provider/pkg/csp/client_test.go
  • janitor-provider/pkg/csp/gcp/gcp.go
  • janitor-provider/pkg/csp/kind/kind.go
  • janitor-provider/pkg/csp/nebius/nebius.go
  • janitor-provider/pkg/csp/oci/oci.go
  • janitor-provider/pkg/model/csp.go
  • janitor/go.mod
  • janitor/pkg/config/config.go
  • janitor/pkg/controller/rebootnode_controller.go
  • janitor/pkg/controller/rebootnode_controller_test.go
  • janitor/pkg/controller/suite_test.go
  • janitor/pkg/controller/terminatenode_controller.go
  • janitor/pkg/controller/terminatenode_controller_test.go
  • janitor/pkg/controller/test_utils.go
  • scripts/build-image-list.sh
  • scripts/buildko.sh
  • scripts/check-image-attestations.sh
  • tilt/Tiltfile
πŸ’€ Files with no reviewable changes (3)
  • distros/kubernetes/nvsentinel/charts/janitor/templates/serviceaccount.yaml
  • distros/kubernetes/nvsentinel/charts/janitor/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/janitor/values.yaml
🚧 Files skipped from review as they are similar to previous changes (15)
  • janitor-provider/Tiltfile
  • janitor/pkg/controller/terminatenode_controller_test.go
  • scripts/build-image-list.sh
  • janitor-provider/pkg/csp/nebius/nebius.go
  • janitor/pkg/controller/suite_test.go
  • janitor-provider/pkg/csp/oci/oci.go
  • janitor-provider/pkg/csp/azure/azure.go
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/_helpers.tpl
  • api/proto/csp/v1alpha1/provider.proto
  • distros/kubernetes/nvsentinel/Chart.yaml
  • distros/kubernetes/nvsentinel/charts/janitor/templates/configmap.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • api/Makefile
  • janitor-provider/pkg/csp/kind/kind.go
  • distros/kubernetes/nvsentinel/charts/janitor-provider/Chart.yaml
🧰 Additional context used
πŸ““ Path-based instructions (4)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • janitor/pkg/config/config.go
  • janitor-provider/pkg/csp/aws/aws.go
  • janitor-provider/pkg/csp/client.go
  • janitor-provider/pkg/csp/gcp/gcp.go
  • janitor/pkg/controller/rebootnode_controller_test.go
  • janitor/pkg/controller/rebootnode_controller.go
  • janitor/pkg/controller/terminatenode_controller.go
  • janitor/pkg/controller/test_utils.go
  • janitor-provider/main.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • janitor/pkg/controller/rebootnode_controller_test.go
**/values.yaml

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
**/go.mod

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

  • janitor/go.mod
  • janitor-provider/go.mod
🧠 Learnings (21)
πŸ“š Learning: 2025-11-26T13:54:26.427Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: distros/kubernetes/nvsentinel/values-tilt.yaml:215-223
Timestamp: 2025-11-26T13:54:26.427Z
Learning: For values-tilt.yaml files: Keep documentation minimal and implementation-focused. Unlike production values.yaml files which need comprehensive inline comments, values-tilt.yaml is a Tilt-specific testing/development override file that doesn't require extensive documentation.

Applied to files:

  • tilt/Tiltfile
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code

Applied to files:

  • janitor-provider/pkg/csp/aws/aws.go
  • janitor-provider/pkg/csp/client.go
  • janitor-provider/pkg/csp/gcp/gcp.go
  • janitor-provider/main.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • janitor-provider/pkg/csp/client.go
  • janitor-provider/pkg/csp/gcp/gcp.go
  • janitor/pkg/controller/rebootnode_controller.go
  • janitor/go.mod
  • janitor/pkg/controller/terminatenode_controller.go
  • janitor-provider/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • janitor/pkg/controller/rebootnode_controller_test.go
  • janitor/pkg/controller/test_utils.go
  • janitor-provider/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • janitor/pkg/controller/rebootnode_controller_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code

Applied to files:

  • janitor/pkg/controller/rebootnode_controller_test.go
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • janitor/pkg/controller/rebootnode_controller_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Document all values in Helm chart `values.yaml` with inline comments

Applied to files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Include examples for non-obvious configurations in Helm chart documentation

Applied to files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Note truthy value requirements in Helm chart documentation where applicable

Applied to files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • janitor/pkg/controller/rebootnode_controller.go
  • janitor-provider/main.go
πŸ“š Learning: 2026-01-06T21:31:36.113Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:36.113Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.

Applied to files:

  • janitor/go.mod
  • .ko.yaml
  • janitor-provider/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

  • janitor/go.mod
  • janitor-provider/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

  • janitor/go.mod
  • janitor-provider/go.mod
πŸ“š Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

  • janitor/go.mod
  • janitor-provider/go.mod
πŸ“š Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • janitor/go.mod
  • janitor-provider/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • janitor/go.mod
  • janitor-provider/go.mod
  • janitor-provider/main.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Each health monitor service should implement the gRPC service from `data-models/protobufs/`

Applied to files:

  • distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
πŸ“š Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

  • .ko.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • janitor-provider/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

  • janitor-provider/main.go
🧬 Code graph analysis (4)
janitor/pkg/controller/rebootnode_controller_test.go (3)
janitor/pkg/controller/rebootnode_controller.go (1)
  • RebootNodeReconciler (74-79)
janitor-provider/pkg/model/csp.go (1)
  • CSPClient (30-39)
janitor/pkg/controller/test_utils.go (1)
  • DefaultFailureBehavior (61-67)
janitor/pkg/controller/rebootnode_controller.go (2)
janitor-provider/pkg/model/csp.go (1)
  • CSPClient (30-39)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (2)
  • CSPProviderServiceClient (44-48)
  • NewCSPProviderServiceClient (54-56)
janitor/pkg/controller/terminatenode_controller.go (3)
janitor-provider/pkg/model/csp.go (1)
  • CSPClient (30-39)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (2)
  • CSPProviderServiceClient (44-48)
  • NewCSPProviderServiceClient (54-56)
janitor/pkg/config/config.go (1)
  • Config (26-30)
janitor/pkg/controller/test_utils.go (2)
commons/pkg/server/server.go (2)
  • Server (87-97)
  • NewServer (375-396)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (3)
  • CSPProviderServiceClient (44-48)
  • RegisterCSPProviderServiceServer (124-133)
  • NewCSPProviderServiceClient (54-56)
πŸͺ› OSV Scanner (2.3.1)
janitor-provider/go.mod

[HIGH] 1-1: github.com/golang-jwt/jwt/v4 4.5.1: Excessive memory allocation during header parsing in github.com/golang-jwt/jwt

(GO-2025-3553)


[HIGH] 1-1: github.com/golang-jwt/jwt/v4 4.5.1: jwt-go allows excessive memory allocation during header parsing

(GHSA-mh63-6h87-95cp)

πŸͺ› YAMLlint (1.37.1)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yaml

[error] 19-19: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml

[error] 19-19: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml

[error] 19-19: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrolebinding.yaml

[error] 19-19: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/networkpolicy.yaml

[warning] 30-30: too many blank lines (1 > 0)

(empty-lines)


[error] 17-17: syntax error: expected , but found ''

(syntax)

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/service.yaml

[error] 19-19: syntax error: expected the node content, but found '-'

(syntax)

πŸ”‡ Additional comments (27)
scripts/check-image-attestations.sh (1)

69-69: LGTM! Janitor-provider image correctly added to attestation checks.

The addition of nvsentinel/janitor-provider to the Ko-built images list is appropriate and aligns with the PR's introduction of the janitor-provider gRPC service.

.gitignore (1)

443-444: βœ“ mise.toml addition is appropriate.

The addition of mise.toml to the Version Management Tools section is a legitimate change for version management tooling and poses no conflicts with existing entries.

tilt/Tiltfile (2)

118-118: LGTM!

The include statement follows the established pattern and is logically placed after the janitor include.


245-248: No additional resource dependencies are required for janitor-provider. The service is a gRPC provider that handles node lifecycle operations with the Kubernetes API and CSP clients, neither of which require separate Tilt resource dependencies. The current dependency on janitor is appropriate.

Likely an incorrect or invalid review comment.

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml (1)

14-19: Static analysis false positive - Helm syntax is correct.

The YAMLlint error on line 19 is a false positive. The {{- syntax is valid Helm templating (the dash trims leading whitespace). YAMLlint doesn't parse Helm templates, so it misinterprets this as invalid YAML.

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrolebinding.yaml (2)

14-19: Static analysis false positive - Helm syntax is correct.

Same as in clusterrole.yaml, the YAMLlint error on line 19 is a false positive. The {{- syntax is valid Helm templating.


20-27: LGTM! ClusterRoleBinding structure is correct and consistent.

The binding correctly references the ClusterRole and ServiceAccount using consistent naming templates. The namespace is properly scoped to the Helm release namespace.

scripts/buildko.sh (2)

51-51: LGTM!

The janitor-provider module is correctly added to both the go.work initialization and ko build inputs, consistent with other modules in the build pipeline.

Also applies to: 66-66


86-86: LGTM!

Proper newline termination ensures consistent output formatting for GitHub Actions workflows.

janitor-provider/go.mod (2)

125-127: LGTM!

The replace directives correctly point to local sibling modules, following the project's multi-module structure and the guideline to use commons/ for shared utilities across Go modules.


3-3: No action required. Go 1.25.0 is a valid, officially released version (August 12, 2025). The go.mod declaration is correct.

distros/kubernetes/nvsentinel/values-tilt.yaml (3)

80-81: LGTM!

Enables the janitor-provider subchart, consistent with other global feature flag patterns in the values file.


324-325: The janitor-provider service is correctly configured with port 50051. The Service template exists, the Deployment exposes the correct containerPort, and the network policy allows the port. The configuration in values-tilt.yaml is valid.


329-330: Document CSP testing status and add observability for janitor-provider service.

The janitor-provider service supports six CSPs (kind, aws, gcp, azure, oci, nebius), but testing coverage is incomplete. AWS, GCP, and kind have full UAT setups; azure and oci have stub directories only; nebius has no UAT configuration. Please document which CSPs are production-ready and add UAT for the remaining providers.

Additionally, janitor-provider exposes a /metrics endpoint via Prometheus but does not define custom metrics to monitor CSP operations. The janitor controller itself tracks janitor_actions_count and janitor_action_mttr_seconds, but janitor-provider lacks instrumentation for CSP API latency, error rates, and retry behavior. Consider adding:

  • CSP API call duration and error counters
  • Request-level metrics (reboot, terminate, ready-check operations)
  • Alerting rules for janitor-provider failures
janitor-provider/pkg/csp/aws/aws.go (1)

20-20: LGTM!

The migration to standard library log/slog follows the coding guideline to use structured logging, and the model import path correctly points to the janitor-provider module.

As per coding guidelines, use structured logging via log/slog in Go code.

Also applies to: 31-31

janitor/pkg/config/config.go (2)

34-37: LGTM!

The CSPProviderHost field additions follow consistent patterns across all configuration structs with appropriate struct tags and comments. The mapstructure:"cspProviderHost" tag correctly handles the camelCase field from YAML configuration.

Also applies to: 56-57, 71-72


106-114: LGTM!

The CSPProviderHost propagation logic correctly implements the inheritance pattern: controller-specific values take precedence, with fallback to the global configuration. This matches the existing pattern for NodeExclusions.

.ko.yaml (1)

55-88: Confirm whether health-events-analyzer is meant to remain buildable alongside csp-health-monitor.

This file still defines a health-events-analyzer build entry and adds csp-health-monitor. If the intent was a rename/move, we should remove (or clearly deprecate) the old build entry to avoid duplicated images and confusion.

janitor/go.mod (1)

5-20: Address gRPC version inconsistency: janitor pins v1.78.0 while most other modules use v1.77.0 indirectly.

The replace directives for internal modules (github.com/nvidia/nvsentinel/api, commons, data-models, store-client at lines 126–132) are an established monorepo pattern. However, verify that:

  • google.golang.org/grpc v1.78.0 in janitor is intentional and compatible with v1.77.0 used indirectly by other modules (api, data-models, store-client, etc.)
  • go mod tidy is explicitly run in CI workflows before build/test steps (currently available via Makefile target but not visible in workflow files)
  • CI/release tooling resolves the local replaces correctly when modules are referenced from outside the monorepo
janitor-provider/pkg/csp/gcp/gcp.go (1)

21-21: LGTM: Logging migration to slog.

The switch to log/slog aligns with the structured logging requirement in the coding guidelines, and the model import path update reflects the new janitor-provider package structure.

Based on coding guidelines, structured logging via log/slog is required for Go code.

Also applies to: 32-32

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/service.yaml (1)

14-30: LGTM: Service template is well-structured.

The service configuration properly exposes both the gRPC service port (50051) and metrics port (2112) with appropriate defaults. The static analysis syntax error on line 19 is a false positiveβ€”the Helm templating syntax {{- include "provider.labels" . | nindent 4}} is correct.

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml (2)

33-38: LGTM: Azure Workload Identity label correctly applied.

The conditional Azure Workload Identity label on the pod template aligns with the ServiceAccount annotation and follows Azure's workload identity pattern.


143-155: LGTM: Nebius secret mounting is properly implemented.

The conditional volume and volumeMount logic for Nebius service account keys is correct:

  • Only applied when provider is "nebius" and serviceAccountKeySecret is set
  • Properly mounted at /etc/nebius as read-only
  • Coordinates with the environment variable NEBIUS_SA_KEY_FILE set on line 120
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/networkpolicy.yaml (1)

26-29: Verify whether unrestricted ingress is intentional.

The NetworkPolicy allows ingress from any source (no from selector specified in the ingress rule on port 50051). This pattern is consistent throughout the codebase across multiple services (event-exporter-mock, csp-api-mock, and others), suggesting it may be intentional. However, confirm if janitor-provider should only accept connections from specific components (e.g., janitor controllers). If so, add a from selector with podSelector or namespaceSelector to restrict access.

distros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yaml (1)

20-49: No action neededβ€”rendered YAML is valid.

An empty annotations: field is valid YAML syntax and valid Kubernetes YAML. When serviceAccount.annotations is empty and CSP-specific conditions don't match, the template renders with annotations: containing a null value, which is perfectly acceptable. Testing confirms the output is valid YAML.

Likely an incorrect or invalid review comment.

janitor/pkg/controller/test_utils.go (2)

222-225: Verify error handling logic.

The condition on line 222 checks err != nil && errors.Is(err, grpc.ErrServerStopped), which logs errors only when the server is stopped. This seems invertedβ€”typically you'd want to log errors that are not expected shutdown errors.

Consider whether the logic should be:

if err != nil && !errors.Is(err, grpc.ErrServerStopped) {
    slog.Error("Mock CSP server exited with error", "error", err)
}

35-49: Add godoc comment for exported type.

The exported MockCSPBehavior type lacks a godoc comment.

As per coding guidelines, function comments are required for all exported Go functions and types.

β›” Skipped due to learnings
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
janitor/pkg/controller/rebootnode_controller.go (1)

192-233: Unreachable code due to variable shadowing and early return.

At line 192, nodeReadyErr is declared with := inside the if block, shadowing any outer variable. However, lines 220-233 reference nodeReadyErr outside that block, but this code is unreachable because:

  1. When r.Config.ManualMode is false, lines 192-204 handle the error case with an early return
  2. When r.Config.ManualMode is true, nodeReadyErr remains nil (from line 190-191)

The nodeReadyErr != nil check at line 220 will never be true after the early return at line 204. This appears to be dead code from the migration.

Proposed fix: Remove unreachable error handling block
 		if r.Config.ManualMode {
 			cspReady = true
 			nodeReadyErr = nil
 		} else {
 			rsp, nodeReadyErr := r.CSPClient.IsNodeReady(ctx, &cspv1alpha1.IsNodeReadyRequest{
 				NodeName: node.Name,
 			})
 			if nodeReadyErr != nil {
 				logger.Error(nodeReadyErr, "failed to check if node is ready",
 					"node", node.Name)

 				rebootNode.Status.ConsecutiveFailures++
 				delay := getNextRequeueDelay(rebootNode.Status.ConsecutiveFailures)

 				result = ctrl.Result{RequeueAfter: delay}
 				// Update status and return early
 				return r.updateRebootNodeStatus(ctx, req, originalRebootNode, &rebootNode, result)
 			}

 			cspReady = rsp.IsReady
 		}

 		// Check if kubernetes reports the node is ready.
 		kubernetesReady := false

 		for _, condition := range node.Status.Conditions {
 			if condition.Type == corev1.NodeReady {
 				kubernetesReady = condition.Status == corev1.ConditionTrue
 			}
 		}

-		// nolint:gocritic // Migrated business logic with if-else chain
-		if nodeReadyErr != nil {
-			logger.Error(nodeReadyErr, "node ready status check failed",
-				"node", node.Name)
-
-			rebootNode.Status.ConsecutiveFailures++
-
-			rebootNode.SetCompletionTime()
-			rebootNode.SetCondition(metav1.Condition{
-				Type:               janitordgxcnvidiacomv1alpha1.RebootNodeConditionNodeReady,
-				Status:             metav1.ConditionFalse,
-				Reason:             "Failed",
-				Message:            fmt.Sprintf("Node status could not be checked from CSP: %s", nodeReadyErr),
-				LastTransitionTime: metav1.Now(),
-			})
-
-			metrics.GlobalMetrics.IncActionCount(metrics.ActionTypeReboot, metrics.StatusFailed, node.Name)
-
-			result = ctrl.Result{} // Don't requeue on failure
-		} else if cspReady && kubernetesReady {
+		if cspReady && kubernetesReady {
πŸ€– Fix all issues with AI agents
In @janitor/pkg/controller/test_utils.go:
- Around line 222-226: The goroutine that starts the mock gRPC server uses an
inverted condition and will suppress real errors: in the anonymous func calling
server.Serve(lis) change the if so it logs only when err is non-nil and NOT
grpc.ErrServerStopped (e.g., if err != nil && !errors.Is(err,
grpc.ErrServerStopped)) and leave graceful shutdowns unlogged; reference the
call to server.Serve, the goroutine, grpc.ErrServerStopped, and the log.Printf
call when making this change.
🧹 Nitpick comments (4)
janitor/pkg/controller/terminatenode_controller.go (2)

414-428: gRPC setup is consistent with rebootnode_controller and properly implements cleanup.

The implementation correctly:

  • Creates gRPC connection with the configured host
  • Stores the connection for lifecycle management
  • Registers cleanup via manager.RunnableFunc to close on shutdown

This follows the same pattern as rebootnode_controller.go, which is good for consistency.

Consider extracting the duplicated gRPC client setup logic (lines 415-428) into a shared helper function to reduce code duplication across controllers. Both RebootNodeReconciler and TerminateNodeReconciler have identical setup code.

Example shared helper
// In a shared file like grpc_client.go
func SetupCSPClient(mgr ctrl.Manager, host string) (cspv1alpha1.CSPProviderServiceClient, *grpc.ClientConn, error) {
    conn, err := grpc.NewClient(host, grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        return nil, nil, fmt.Errorf("failed to create CSP client: %w", err)
    }

    if err := mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
        <-ctx.Done()
        return conn.Close()
    })); err != nil {
        conn.Close()
        return nil, nil, fmt.Errorf("failed to add grpc connection cleanup to manager: %w", err)
    }

    return cspv1alpha1.NewCSPProviderServiceClient(conn), conn, nil
}

330-332: Response discarded - verify if RequestId should be captured.

The SendTerminateSignal response is discarded. In contrast, SendRebootSignal in rebootnode_controller.go (line 367) uses rsp.RequestId as the condition message for audit/tracking purposes.

Consider whether the terminate operation should similarly capture the RequestId for traceability.

Proposed change to capture RequestId
-			_, terminateErr := r.CSPClient.SendTerminateSignal(ctx, &cspv1alpha1.SendTerminateSignalRequest{
+			rsp, terminateErr := r.CSPClient.SendTerminateSignal(ctx, &cspv1alpha1.SendTerminateSignalRequest{
 				NodeName: node.Name,
 			})

Then use rsp.RequestId in the success condition message at line 360:

-					Message:            "Terminate signal sent to CSP",
+					Message:            rsp.RequestId,
janitor/pkg/controller/test_utils.go (2)

236-238: log.Fatalf will terminate the entire test process.

If the gRPC client creation fails, log.Fatalf calls os.Exit(1), which abruptly terminates the test process without running cleanup or reporting proper test failures. Consider using panic instead, which Ginkgo can recover and report as a test failure, or restructure to return an error.

Proposed fix using panic
 	conn, err := grpc.NewClient(
 		"passthrough://bufnet",
 		grpc.WithContextDialer(func(context.Context, string) (net.Conn, error) {
 			return lis.Dial()
 		}),
 		grpc.WithTransportCredentials(insecure.NewCredentials()),
 	)
 	if err != nil {
-		log.Fatalf("Failed to create mock CSP client: %v", err)
+		panic(fmt.Sprintf("Failed to create mock CSP client: %v", err))
 	}

Note: This requires adding "fmt" to imports.


249-257: Consider closing the bufconn listener explicitly.

The Stop() method closes the gRPC server and connection but doesn't explicitly close the bufconn.Listener. While GracefulStop() should handle this internally, explicit cleanup ensures no resource leaks if the server wasn't started or stopped abnormally.

Proposed enhancement
 func (h *MockCSPTestHelper) Stop() {
 	if h.grpcServer != nil {
 		h.grpcServer.GracefulStop()
 	}
 	if h.grpcConn != nil {
-		h.grpcConn.Close()
+		_ = h.grpcConn.Close()
+	}
+	if h.listener != nil {
+		_ = h.listener.Close()
 	}
 }
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between eae93f5 and fd3b4dc.

πŸ“’ Files selected for processing (3)
  • janitor/pkg/controller/rebootnode_controller.go
  • janitor/pkg/controller/terminatenode_controller.go
  • janitor/pkg/controller/test_utils.go
🧰 Additional context used
πŸ““ Path-based instructions (1)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • janitor/pkg/controller/terminatenode_controller.go
  • janitor/pkg/controller/test_utils.go
  • janitor/pkg/controller/rebootnode_controller.go
🧠 Learnings (7)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • janitor/pkg/controller/terminatenode_controller.go
  • janitor/pkg/controller/test_utils.go
  • janitor/pkg/controller/rebootnode_controller.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • janitor/pkg/controller/terminatenode_controller.go
  • janitor/pkg/controller/rebootnode_controller.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code

Applied to files:

  • janitor/pkg/controller/test_utils.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • janitor/pkg/controller/test_utils.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code

Applied to files:

  • janitor/pkg/controller/test_utils.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • janitor/pkg/controller/test_utils.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • janitor/pkg/controller/rebootnode_controller.go
🧬 Code graph analysis (3)
janitor/pkg/controller/terminatenode_controller.go (3)
janitor-provider/pkg/model/csp.go (1)
  • CSPClient (30-39)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (2)
  • CSPProviderServiceClient (44-48)
  • NewCSPProviderServiceClient (54-56)
janitor/pkg/config/config.go (1)
  • Config (26-30)
janitor/pkg/controller/test_utils.go (1)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (3)
  • CSPProviderServiceClient (44-48)
  • RegisterCSPProviderServiceServer (124-133)
  • NewCSPProviderServiceClient (54-56)
janitor/pkg/controller/rebootnode_controller.go (2)
janitor-provider/pkg/model/csp.go (1)
  • CSPClient (30-39)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (2)
  • CSPProviderServiceClient (44-48)
  • NewCSPProviderServiceClient (54-56)
πŸ”‡ Additional comments (2)
janitor/pkg/controller/test_utils.go (1)

69-200: Well-structured mock server with proper thread-safety.

The MockCSPServer implementation:

  • Uses sync.RWMutex correctly for concurrent access
  • Separates read operations (RLock) from write operations (Lock)
  • Provides granular control via per-operation setters
  • Embeds UnimplementedCSPProviderServiceServer for forward compatibility
janitor/pkg/controller/rebootnode_controller.go (1)

401-414: gRPC connection setup is correct with proper cleanup registration.

The connection lifecycle is properly managed:

  • Connection created at controller setup with error handling
  • Cleanup registered via manager.RunnableFunc that waits for context cancellation
  • Connection stored for later use

The insecure credentials usage is intentional for in-cluster service-to-service communication; verify this aligns with your security posture.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
janitor-provider/pkg/csp/kind/kind.go (1)

83-157: Fix timeout detection: you’re checking ctx.Err() instead of dockerCtx.Err(). When dockerCtx times out, ctx.Err() may still be nil (Line 104-106, 120-122, 142-144), so you’ll misclassify timeouts as generic failures.

Proposed fix
@@
  output, err := cmd.Output()
  if err != nil {
-   if ctx.Err() == context.DeadlineExceeded {
+   if dockerCtx.Err() == context.DeadlineExceeded {
      return "", fmt.Errorf("timeout while listing containers: %w", err)
    }
@@
    if err := cmd.Run(); err != nil {
-     if ctx.Err() == context.DeadlineExceeded {
+     if dockerCtx.Err() == context.DeadlineExceeded {
        return "", fmt.Errorf("timeout while deleting container: %w", err)
      }
@@
    output, err = cmd.Output()
    if err != nil {
-     if ctx.Err() == context.DeadlineExceeded {
+     if dockerCtx.Err() == context.DeadlineExceeded {
        return "", fmt.Errorf("timeout while verifying container deletion: %w", err)
      }

Also: strings.Contains(string(output), containerName) (Line 113) can false-positive; consider splitting by newline and exact-matching names.

janitor-provider/pkg/csp/azure/azure.go (1)

108-159: Wrap and/or log requestID parse failures; avoid slog.Info(fmt.Sprintf(...)). As written, time.Parse(time.RFC3339, requestID) returns a raw error (Line 111-114) without context; and Line 152 logs a formatted string instead of structured fields.

Proposed fix
@@
 func (c *Client) IsNodeReady(ctx context.Context, node corev1.Node, requestID string) (bool, error) {
@@
   storedTime, err := time.Parse(time.RFC3339, requestID)
   if err != nil {
-    return false, err
+    slog.Error("Invalid requestID timestamp; cannot evaluate readiness window",
+      "error", err,
+      "node", node.Name,
+      "requestID", requestID,
+    )
+    return false, fmt.Errorf("parse requestID timestamp %q: %w", requestID, err)
   }
@@
       for _, status := range instanceView.Statuses {
         if *status.Code == "ProvisioningState/succeeded" {
-          slog.Info(fmt.Sprintf("Node %s is in a healthy state", node.Name))
+          slog.Info("Node is in a healthy state", "node", node.Name)
           return true, nil
         }
       }
janitor/pkg/controller/rebootnode_controller.go (2)

192-209: Add per-RPC deadlines; current reconcile can hang indefinitely. IsNodeReady is called with the raw reconcile ctx (Line 192), but no context.WithTimeout is applied, and the existing CSPOperationTimeout constant is no longer enforced. This can stall controller workers if the provider is slow/unreachable.

Proposed fix
@@
-      rsp, nodeReadyErr := r.CSPClient.IsNodeReady(ctx, &cspv1alpha1.IsNodeReadyRequest{
+      rpcCtx, cancel := context.WithTimeout(ctx, CSPOperationTimeout)
+      defer cancel()
+      rsp, nodeReadyErr := r.CSPClient.IsNodeReady(rpcCtx, &cspv1alpha1.IsNodeReadyRequest{
         NodeName:  node.Name,
         RequestId: rebootNode.GetCSPReqRef(),
       })

338-390: Same timeout issue for SendRebootSignal; current DeadlineExceeded branch is effectively dead. Since you don’t set a deadline on the RPC context (Line 338-340), errors.Is(rebootErr, context.DeadlineExceeded) (Line 343) will rarely/never trigger as intended, and the call can block indefinitely.

Proposed fix
@@
-        rsp, rebootErr := r.CSPClient.SendRebootSignal(ctx, &cspv1alpha1.SendRebootSignalRequest{
+        rpcCtx, cancel := context.WithTimeout(ctx, CSPOperationTimeout)
+        defer cancel()
+        rsp, rebootErr := r.CSPClient.SendRebootSignal(rpcCtx, &cspv1alpha1.SendRebootSignalRequest{
           NodeName: node.Name,
         })
πŸ€– Fix all issues with AI agents
In @janitor-provider/pkg/csp/azure/azure.go:
- Around line 256-260: The defer closure currently logs "failed to close http
client" when resp.Body.Close() fails; update the message to accurately reflect
what's being closedβ€”e.g., change the log in the defer that calls
resp.Body.Close() to something like "failed to close IMDS response body" and
keep the error field ("error", cerr) intact; locate the defer func that invokes
resp.Body.Close() in azure.go and replace only the log message text.
🧹 Nitpick comments (8)
janitor-provider/pkg/csp/aws/aws.go (1)

129-129: Consider using UTC for time consistency.

OCI's SendRebootSignal uses time.Now().UTC().Format(time.RFC3339) while AWS uses time.Now().Format(time.RFC3339) (local time). This inconsistency could cause subtle timezone issues in distributed environments.

Proposed fix
-	return model.ResetSignalRequestRef(time.Now().Format(time.RFC3339)), nil
+	return model.ResetSignalRequestRef(time.Now().UTC().Format(time.RFC3339)), nil
janitor-provider/pkg/csp/gcp/gcp.go (1)

176-185: Consider nil check for defensive coding.

The *op.Status dereference on line 181 assumes the status is always present. While this should be the case for GCP Compute operations, a nil check would provide safer error handling against unexpected API responses.

Proposed fix
 	op, err := zoneOperationsClient.Get(ctx, req)
 	if err != nil {
 		return false, err
 	}

-	if *op.Status == computepb.Operation_DONE {
+	if op.Status != nil && *op.Status == computepb.Operation_DONE {
 		return true, nil
 	}

 	return false, nil
janitor-provider/main.go (2)

117-120: Validate JANITOR_PROVIDER_PORT environment variable.

If JANITOR_PROVIDER_PORT is empty or unset, net.Listen will bind to ":0" (random port), which is likely unintended for a production service.

Proposed fix
+	port := os.Getenv("JANITOR_PROVIDER_PORT")
+	if port == "" {
+		return fmt.Errorf("JANITOR_PROVIDER_PORT environment variable is required")
+	}
+
-	lis, err := net.Listen("tcp", fmt.Sprintf(":%s", os.Getenv("JANITOR_PROVIDER_PORT")))
+	lis, err := net.Listen("tcp", fmt.Sprintf(":%s", port))
 	if err != nil {
 		return fmt.Errorf("failed to listen: %w", err)
 	}

58-71: Consider using appropriate gRPC status codes for K8s errors.

Currently, all K8s API errors return codes.Internal. When a node is not found, returning codes.NotFound would be more semantically correct and help clients handle errors appropriately.

Proposed fix for SendRebootSignal (apply similar pattern to other handlers)
+import (
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
+)

 func (s *janitorProviderServer) SendRebootSignal(ctx context.Context, req *cspv1alpha1.SendRebootSignalRequest) (*cspv1alpha1.SendRebootSignalResponse, error) {
 	slog.Info("Sending reboot signal", "node", req.NodeName)
 	node, err := s.k8sClient.CoreV1().Nodes().Get(ctx, req.NodeName, metav1.GetOptions{})
 	if err != nil {
+		if apierrors.IsNotFound(err) {
+			return nil, status.Errorf(codes.NotFound, "node not found: %v", err)
+		}
 		return nil, status.Errorf(codes.Internal, "failed to get node: %v", err)
 	}
janitor-provider/pkg/csp/kind/kind.go (1)

53-61: Consider keeping requestID meaningful (or document why it’s ignored). requestID is explicitly ignored in simulation mode (Line 57-58); that’s fine, but a short doc comment in the interface or here would prevent confusion when correlating logs/traces across CSPs.

janitor-provider/pkg/csp/azure/azure.go (2)

74-106: Prefer adding stable correlation fields in slog logs. Right now most error logs don’t include node.Name / parsed VM identifiers (e.g., Line 78-79, 86-88, 94-96). Including "node", node.Name (and possibly "providerID", providerID) would make ops/debug much easier.


193-219: Azure client creation errors are logged, but error returns aren’t wrapped with context. Consider returning fmt.Errorf("create default VMSS client: %w", err) at call sites (or wrap locally) so callers don’t have to infer where it failed.

janitor/pkg/controller/rebootnode_controller.go (1)

188-398: Operational metrics gap for the new gRPC hop. Since CSP calls moved behind gRPC, consider adding metrics for RPC latency/error counts (per method + status) and alerting on sustained error rate / high latency / connection failures, to match the reviewer ask.

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between fd3b4dc and 41135e2.

β›” Files ignored due to path filters (1)
  • api/gen/go/csp/v1alpha1/provider.pb.go is excluded by !**/*.pb.go, !**/gen/**
πŸ“’ Files selected for processing (10)
  • api/proto/csp/v1alpha1/provider.proto
  • janitor-provider/main.go
  • janitor-provider/pkg/csp/aws/aws.go
  • janitor-provider/pkg/csp/azure/azure.go
  • janitor-provider/pkg/csp/gcp/gcp.go
  • janitor-provider/pkg/csp/kind/kind.go
  • janitor-provider/pkg/csp/nebius/nebius.go
  • janitor-provider/pkg/csp/oci/oci.go
  • janitor-provider/pkg/model/csp.go
  • janitor/pkg/controller/rebootnode_controller.go
🧰 Additional context used
πŸ““ Path-based instructions (1)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • janitor-provider/pkg/model/csp.go
  • janitor-provider/main.go
  • janitor/pkg/controller/rebootnode_controller.go
  • janitor-provider/pkg/csp/aws/aws.go
  • janitor-provider/pkg/csp/oci/oci.go
  • janitor-provider/pkg/csp/kind/kind.go
  • janitor-provider/pkg/csp/gcp/gcp.go
  • janitor-provider/pkg/csp/nebius/nebius.go
  • janitor-provider/pkg/csp/azure/azure.go
🧠 Learnings (7)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • janitor-provider/main.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code

Applied to files:

  • janitor-provider/main.go
  • janitor-provider/pkg/csp/aws/aws.go
  • janitor-provider/pkg/csp/oci/oci.go
  • janitor-provider/pkg/csp/kind/kind.go
  • janitor-provider/pkg/csp/gcp/gcp.go
  • janitor-provider/pkg/csp/nebius/nebius.go
  • janitor-provider/pkg/csp/azure/azure.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • janitor-provider/main.go
  • janitor/pkg/controller/rebootnode_controller.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

  • janitor-provider/main.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • janitor/pkg/controller/rebootnode_controller.go
  • janitor-provider/pkg/csp/gcp/gcp.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • janitor/pkg/controller/rebootnode_controller.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code

Applied to files:

  • janitor-provider/pkg/csp/nebius/nebius.go
🧬 Code graph analysis (3)
janitor-provider/main.go (5)
janitor-provider/pkg/model/csp.go (1)
  • CSPClient (30-40)
api/gen/go/csp/v1alpha1/provider.pb.go (18)
  • SendRebootSignalRequest (38-43)
  • SendRebootSignalRequest (56-56)
  • SendRebootSignalRequest (71-73)
  • SendRebootSignalResponse (82-87)
  • SendRebootSignalResponse (100-100)
  • SendRebootSignalResponse (115-117)
  • IsNodeReadyRequest (126-132)
  • IsNodeReadyRequest (145-145)
  • IsNodeReadyRequest (160-162)
  • IsNodeReadyResponse (178-183)
  • IsNodeReadyResponse (196-196)
  • IsNodeReadyResponse (211-213)
  • SendTerminateSignalRequest (222-227)
  • SendTerminateSignalRequest (240-240)
  • SendTerminateSignalRequest (255-257)
  • SendTerminateSignalResponse (266-271)
  • SendTerminateSignalResponse (284-284)
  • SendTerminateSignalResponse (299-301)
commons/pkg/server/server.go (4)
  • NewServer (375-396)
  • WithPort (151-153)
  • WithPrometheusMetrics (212-216)
  • WithSimpleHealth (246-254)
janitor-provider/pkg/csp/client.go (1)
  • New (46-69)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (1)
  • RegisterCSPProviderServiceServer (124-133)
janitor/pkg/controller/rebootnode_controller.go (2)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (2)
  • CSPProviderServiceClient (44-48)
  • NewCSPProviderServiceClient (54-56)
api/gen/go/csp/v1alpha1/provider.pb.go (6)
  • IsNodeReadyRequest (126-132)
  • IsNodeReadyRequest (145-145)
  • IsNodeReadyRequest (160-162)
  • SendRebootSignalRequest (38-43)
  • SendRebootSignalRequest (56-56)
  • SendRebootSignalRequest (71-73)
janitor-provider/pkg/csp/azure/azure.go (6)
janitor-provider/pkg/model/csp.go (1)
  • ResetSignalRequestRef (24-24)
janitor-provider/pkg/csp/aws/aws.go (1)
  • Client (49-51)
janitor-provider/pkg/csp/gcp/gcp.go (1)
  • Client (40-40)
janitor-provider/pkg/csp/kind/kind.go (1)
  • Client (36-36)
janitor-provider/pkg/csp/nebius/nebius.go (1)
  • Client (60-63)
janitor-provider/pkg/csp/oci/oci.go (1)
  • Client (48-50)
πŸ”‡ Additional comments (8)
janitor-provider/pkg/model/csp.go (1)

34-36: LGTM!

The updated IsNodeReady signature with requestID parameter is well-documented and aligns with the proto API definition. The comment clearly explains that requestID is the reference returned by SendRebootSignal.

janitor-provider/pkg/csp/oci/oci.go (1)

135-150: LGTM!

The IsNodeReady implementation correctly parses the RFC3339 timestamp from requestID and applies the 5-minute cooldown period. The structured logging with slog.Error follows the coding guidelines.

janitor-provider/pkg/csp/nebius/nebius.go (1)

170-231: LGTM!

The IsNodeReady implementation correctly uses requestID as the instance ID for polling the Nebius instance state machine. The stop/start pattern with proper state transitions (STOPPING β†’ STOPPED β†’ STARTING β†’ RUNNING) is well-documented and handled appropriately. Structured logging with slog follows the coding guidelines.

janitor-provider/main.go (1)

154-186: LGTM on graceful shutdown handling.

The use of errgroup with separate goroutines for the metrics server, gRPC server, and shutdown handler is well-structured. The decision to log metrics server failures without terminating the service is reasonable, and GracefulStop() is correctly invoked on context cancellation.

api/proto/csp/v1alpha1/provider.proto (1)

20-48: LGTM!

The proto definitions are well-structured with clear service methods and message types. The IsNodeReadyRequest correctly includes both node_name and request_id fields to support correlation with previous SendRebootSignal calls. Field naming follows proto3 conventions.

janitor-provider/pkg/csp/kind/kind.go (1)

17-29: Good migration to log/slog and janitor-provider model import.

janitor-provider/pkg/csp/azure/azure.go (1)

17-36: slog + model import path update looks consistent with repo direction.

janitor/pkg/controller/rebootnode_controller.go (1)

74-81: gRPC API usage and lifecycle management are correct. The implementation properly uses grpc.NewClient() (the recommended API for grpc-go v1.78.0+), assigns clear ownership to the reconciler via r.grpcConn, and registers a single cleanup point with the manager that closes the connection on shutdown. There is no risk of double-close or races since no other component accesses the connection.

@github-actions
Copy link

github-actions bot commented Jan 9, 2026

Merging this branch will increase overall coverage

Impacted Packages Coverage Ξ” πŸ€–
github.com/nvidia/nvsentinel/janitor/pkg/controller 23.30% (+1.55%) πŸ‘

Coverage by file

Changed files (no unit tests)

Changed File Coverage Ξ” Total Covered Missed πŸ€–
github.com/nvidia/nvsentinel/janitor/pkg/controller/test_utils.go 24.36% (+5.61%) 312 (+280) 76 (+70) 236 (+210) πŸ‘

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@github-actions
Copy link

github-actions bot commented Jan 9, 2026

πŸ›‘οΈ CodeQL Analysis

🚨 Found 1 security alert(s)

πŸ”— View details

@lalitadithya lalitadithya merged commit 47f275b into NVIDIA:main Jan 9, 2026
69 of 71 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants