-
Notifications
You must be signed in to change notification settings - Fork 33
feat: grpc remediation service #490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: grpc remediation service #490
Conversation
|
Warning Rate limit exceeded@jtschelling has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 11 seconds before requesting another review. β How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. π¦ How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. π Files selected for processing (4)
π WalkthroughWalkthroughAdds a janitor-provider gRPC service and proto API, converts janitor controllers to call that service over gRPC, introduces a janitor-provider Helm chart and deployment, adds a bufconn-based mock CSP gRPC test harness, and updates build/tilt/packaging to include the new provider and cspProviderHost wiring. Changes
Sequence Diagram(s)sequenceDiagram
participant Controller as Janitor Controller
participant GRPCClient as gRPC Client
participant Provider as janitor-provider (gRPC Server)
participant Backend as CSP Backend Adapter
Controller->>GRPCClient: IsNodeReady(node_name)
GRPCClient->>Provider: IsNodeReadyRequest(node_name, request_id)
Provider->>Backend: Check node readiness
Backend-->>Provider: IsNodeReadyResponse(is_ready)
Provider-->>GRPCClient: IsNodeReadyResponse(is_ready)
GRPCClient-->>Controller: is_ready
alt Node ready
Controller->>GRPCClient: SendRebootSignal(node_name)
GRPCClient->>Provider: SendRebootSignalRequest(node_name)
Provider->>Backend: Trigger reboot
Backend-->>Provider: SendRebootSignalResponse(request_id)
Provider-->>GRPCClient: SendRebootSignalResponse(request_id)
GRPCClient-->>Controller: request_id
else Node not ready
Controller-->>Controller: Requeue / retry
end
Estimated code review effortπ― 4 (Complex) | β±οΈ ~60 minutes Poem
π₯ Pre-merge checks | β 1 | β 2β Failed checks (1 warning, 1 inconclusive)
β Passed checks (1 passed)
βοΈ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
ff136b5 to
9e256c9
Compare
0c561e6 to
034aba9
Compare
Merging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
|
Could you please add some details on testing including which CSP the code has been tested on. Could you also add details on the metrics you have added to track the operational aspects of the code and alerts you would recommend. |
3ec0666 to
0bb7857
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 10
Fix all issues with AI Agents π€
In @distros/kubernetes/nvsentinel/charts/janitor/templates/configmap.yaml:
- Line 50: The ConfigMap currently renders the top-level key cspProviderHost
unconditionally (cspProviderHost: {{ .Values.config.cspProviderHost }}) which
can produce an empty value; change it to only emit the key/value when
.Values.config.cspProviderHost is set (mirror the conditional pattern used for
the controller-specific fields) so the ConfigMap omits cspProviderHost entirely
if the value is nil/empty. Locate the cspProviderHost entry in the template and
wrap it with the same conditional guard used elsewhere (checking
.Values.config.cspProviderHost) to prevent empty entries from being written.
In @janitor-provider/go.mod:
- Line 67: The project uses github.com/golang-jwt/jwt/v4 at v4.5.1 which
contains a HIGH severity vulnerability; update the module requirement to v4.5.2
(replace the version string for github.com/golang-jwt/jwt/v4 in go.mod), then
run the Go module commands to fetch and lock the new version (e.g., go get
github.com/golang-jwt/jwt/[email protected] and go mod tidy) so the dependency is
upgraded and the CVE is mitigated.
In @janitor-provider/main.go:
- Around line 50-93: Extract the repeated node lookup into a helper on
janitorProviderServer, e.g. add a method getNode(ctx context.Context, nodeName
string) (*corev1.Node, error) that calls s.k8sClient.CoreV1().Nodes().Get and
returns the node or wrapped status error; then update SendRebootSignal,
IsNodeReady and SendTerminateSignal to call getNode(ctx, req.NodeName) (and pass
*node when the cspClient methods expect a value), removing the duplicated Get
logic and error handling in each handler.
In @janitor-provider/pkg/csp/azure/azure.go:
- Line 152: Replace the fmt.Sprintf call inside the slog.Info invocation with
structured logging fields: call slog.Info with a fixed message string (e.g.,
"node healthy") and pass the node name as a structured field using slog.String
(or slog.Any) referencing node.Name; update the single occurrence of
slog.Info(fmt.Sprintf(...)) to use the structured form so the node name is a
searchable field.
In @janitor/pkg/controller/rebootnode_controller.go:
- Line 335: Delete the debug stdout print call fmt.Println("controller context",
ctx) from rebootnode_controller.go; locate it in the Reconcile handler (or the
method containing the controller context) and remove the line, and if context
information is needed for debugging replace it with the controller's structured
logger (e.g., use r.Log.Info with relevant keys) rather than printing to stdout.
In @janitor/pkg/controller/terminatenode_controller.go:
- Around line 412-418: The gRPC client connection created in SetupWithManager is
never closed; add a field Conn *grpc.ClientConn to TerminateNodeReconciler,
assign the conn returned by grpc.NewClient there instead of discarding it, and
register the reconciler with the manager as a runnable that closes conn when the
manager context is canceled (implement Start(ctx context.Context) error that
<-ctx.Done() then conn.Close()). Apply the same change to RebootNodeReconciler
(add Conn field, store the connection, and close it on manager shutdown) so the
connection lifecycle is tied to the manager and no leak occurs.
In @janitor/pkg/controller/test_utils.go:
- Around line 221-225: The goroutine that calls server.Serve(lis) currently logs
only when errors.Is(err, grpc.ErrServerStopped) is true (graceful shutdown)
which is inverted; change the condition to log unexpected errors by checking
that err is non-nil and NOT errors.Is(err, grpc.ErrServerStopped) (e.g., if err
!= nil && !errors.Is(err, grpc.ErrServerStopped) then call log.Printf) so
server.Serve failures are reported while normal shutdown remains silent; update
the if around server.Serve(lis) and keep references to server.Serve,
grpc.ErrServerStopped, and log.Printf.
- Around line 202-252: The test helper leaks the gRPC client connection because
the conn returned when creating the client is not stored or closed; add a field
to MockCSPTestHelper (e.g., ClientConn *grpc.ClientConn or net.Conn type
matching the dialer) to save the conn created in NewMockCSPTestHelper, assign
the returned conn to that field before creating
cspv1alpha1.NewCSPProviderServiceClient(conn), and update Stop() to close that
connection (call ClientConn.Close() or the appropriate Close method) in addition
to grpcServer. Ensure the field name used matches your struct (e.g., ClientConn)
so Stop() can reference and close it safely.
Caution
Some comments are outside the diff and canβt be posted inline due to platform limitations.
β οΈ Outside diff range comments (2)
janitor/pkg/controller/rebootnode_controller.go (2)
184-235: Variable shadowing causes dead code block.The
nodeReadyErrvariable is declared at line 184, but the:=operator at line 190 creates a new shadowed variable scoped to theelseblock. The error handling at lines 218-235 checks the outernodeReadyErrwhich is alwaysnilafter the refactoring, making this entire block unreachable dead code.The early return at line 202 handles the error case, so the block at lines 218-235 is never executed.
π Proposed fix - remove dead code block
var nodeReadyErr error if r.Config.ManualMode { cspReady = true nodeReadyErr = nil } else { rsp, nodeReadyErr := r.CSPClient.IsNodeReady(ctx, &cspv1alpha1.IsNodeReadyRequest{ NodeName: node.Name, }) if nodeReadyErr != nil { logger.Error(nodeReadyErr, "failed to check if node is ready", "node", node.Name) rebootNode.Status.ConsecutiveFailures++ delay := getNextRequeueDelay(rebootNode.Status.ConsecutiveFailures) result = ctrl.Result{RequeueAfter: delay} // Update status and return early return r.updateRebootNodeStatus(ctx, req, originalRebootNode, &rebootNode, result) } cspReady = rsp.IsReady } // Check if kubernetes reports the node is ready. kubernetesReady := false for _, condition := range node.Status.Conditions { if condition.Type == corev1.NodeReady { kubernetesReady = condition.Status == corev1.ConditionTrue } } - // nolint:gocritic // Migrated business logic with if-else chain - if nodeReadyErr != nil { - logger.Error(nodeReadyErr, "node ready status check failed", - "node", node.Name) - - rebootNode.Status.ConsecutiveFailures++ - - rebootNode.SetCompletionTime() - rebootNode.SetCondition(metav1.Condition{ - Type: janitordgxcnvidiacomv1alpha1.RebootNodeConditionNodeReady, - Status: metav1.ConditionFalse, - Reason: "Failed", - Message: fmt.Sprintf("Node status could not be checked from CSP: %s", nodeReadyErr), - LastTransitionTime: metav1.Now(), - }) - - metrics.GlobalMetrics.IncActionCount(metrics.ActionTypeReboot, metrics.StatusFailed, node.Name) - - result = ctrl.Result{} // Don't requeue on failure - } else if cspReady && kubernetesReady { + if cspReady && kubernetesReady {
399-415: gRPC connection is never closed - potential resource leak.The gRPC connection created at line 400 is not stored or closed during shutdown. The
connvariable goes out of scope, making it impossible to close later. Consider registering a runnable with the manager to handle cleanup.π Proposed fix - add connection to reconciler and register cleanup
type RebootNodeReconciler struct { client.Client Scheme *runtime.Scheme Config *config.RebootNodeControllerConfig CSPClient cspv1alpha1.CSPProviderServiceClient + cspConn *grpc.ClientConn } // SetupWithManager sets up the controller with the Manager. func (r *RebootNodeReconciler) SetupWithManager(mgr ctrl.Manager) error { conn, err := grpc.NewClient(r.Config.CSPProviderHost, grpc.WithTransportCredentials(insecure.NewCredentials())) if err != nil { return fmt.Errorf("failed to create CSP client: %w", err) } r.CSPClient = cspv1alpha1.NewCSPProviderServiceClient(conn) + r.cspConn = conn + + // Register cleanup on manager shutdown + if err := mgr.Add(manager.RunnableFunc(func(ctx context.Context) error { + <-ctx.Done() + return conn.Close() + })); err != nil { + conn.Close() + return fmt.Errorf("failed to register connection cleanup: %w", err) + } return ctrl.NewControllerManagedBy(mgr).Note: This requires importing
"sigs.k8s.io/controller-runtime/pkg/manager".
π§Ή Nitpick comments (19)
.gitignore (1)
444-448: Approve additions; note redundancy in VSCode patterns.The
mise.tomlentry and*.code-workspacepatterns are sensible additions for environment management and editor configuration. However,.vscode/(line 447) is redundant with the existing.vscode/*pattern at line 235βno need to add it.Consider removing the
.vscode/entry at line 447 since.vscode/*at line 235 already covers directory contents (with exceptions for specific files at lines 236β240).janitor-provider/go.mod (1)
16-16: Refactor tests to use standard library instead of stretchr/testify.The testify dependency is currently used only for simple assertions in
pkg/csp/client_test.go:assert.Equal(),require.NoError(),assert.NotNil(), andassert.Error(). These can all be replaced with standardtestingpackage utilities (t.Error,t.Errorf,t.Fatalfwith simpleifchecks). Per project learnings, avoid introducing testify for straightforward checks.distros/kubernetes/nvsentinel/values-tilt.yaml (1)
287-288: Add inline documentation for cspProviderHost.The
cspProviderHostformat (service.namespace.svc.cluster.local:port) is non-obvious and should include an inline comment explaining the expected format and purpose.π Proposed enhancement
config: + # CSP provider gRPC endpoint in format: <service>.<namespace>.svc.cluster.local:<port> cspProviderHost: "janitor-provider.nvsentinel.svc.cluster.local:50051"Based on learnings, include examples for non-obvious configurations in Helm chart documentation.
distros/kubernetes/nvsentinel/values.yaml (1)
73-74: Add inline documentation for janitorProvider flag.The new
janitorProvider.enabledflag should include an inline comment explaining its purpose, similar to other global feature flags in this file.π Proposed enhancement
janitor: enabled: false + # janitorProvider enables the gRPC-based CSP provider service for janitor operations janitorProvider: enabled: falseBased on learnings, document all values in Helm chart
values.yamlwith inline comments.api/Makefile (1)
71-91: Consider consolidating protoc invocations for better maintainability.The current approach uses separate protoc commands for each proto directory (device and CSP). While this works correctly, it could be simplified by processing multiple proto paths in a single invocation, which would improve maintainability as more proto directories are added.
π Proposed consolidation
cd proto && \ protoc \ -I . \ -I ../$(THIRD_PARTY_DIR) \ --plugin="protoc-gen-go=$(PROTOC_GEN_GO)" \ --plugin="protoc-gen-go-grpc=$(PROTOC_GEN_GO_GRPC)" \ --go_out=../$(GEN_DIR) \ --go_opt=paths=source_relative \ --go-grpc_out=../$(GEN_DIR) \ --go-grpc_opt=paths=source_relative \ - device/v1alpha1/*.proto && \ - protoc \ - -I . \ - -I ../$(THIRD_PARTY_DIR) \ - --plugin="protoc-gen-go=$(PROTOC_GEN_GO)" \ - --plugin="protoc-gen-go-grpc=$(PROTOC_GEN_GO_GRPC)" \ - --go_out=../$(GEN_DIR) \ - --go_opt=paths=source_relative \ - --go-grpc_out=../$(GEN_DIR) \ - --go-grpc_opt=paths=source_relative \ - csp/v1alpha1/*.proto + device/v1alpha1/*.proto \ + csp/v1alpha1/*.protodistros/kubernetes/nvsentinel/charts/janitor-provider/Chart.yaml (2)
17-17: Consider updating the description for production use.The description "Example provider for NVSentinel Janitor" suggests this is sample code rather than a production component. If this chart is intended for production deployment, consider a more descriptive name like "CSP provider service for NVSentinel Janitor" or "Janitor gRPC provider service".
15-19: Consider adding appVersion field.While optional, the
appVersionfield helps track which version of the janitor-provider image this chart is designed to deploy. This is especially useful when the chart version and application version evolve independently.π Suggested addition
apiVersion: v2 name: janitor-provider description: Example provider for NVSentinel Janitor type: application version: 0.1.0 +appVersion: "0.1.0"distros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yaml (1)
14-19: Consider documenting cloud provider IAM annotation requirements.Based on the PR objectives mentioning CSP testing, the ServiceAccount may need annotations for cloud provider IAM role bindings (e.g., GCP Workload Identity's
iam.gke.io/gcp-service-account, AWS IRSA'seks.amazonaws.com/role-arn, or Azure'sazure.workload.identity/client-id).Consider documenting these requirements in the values.yaml or adding conditional annotation support in the template.
Would you like me to generate a template snippet that conditionally adds annotations from values.yaml?
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml (1)
39-54: Consider adding a security context.For defense-in-depth, consider adding a security context to run the container as non-root with a read-only root filesystem.
π Recommended security context
- name: janitor-provider image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default ((.Values.global).image).tag | default .Chart.AppVersion }}" imagePullPolicy: {{ .Values.image.pullPolicy }} + securityContext: + runAsNonRoot: true + runAsUser: 65532 + allowPrivilegeEscalation: false + readOnlyRootFilesystem: true + capabilities: + drop: + - ALL resources:janitor-provider/pkg/csp/aws/aws.go (1)
110-127: Consider wrapping errors with additional context.While the current error logging provides good context, wrapping errors with
fmt.Errorfwould preserve the error chain and provide richer context for upstream callers.π Example error wrapping
// Extract the instance ID from the provider ID instanceID, err := parseAWSProviderID(providerID) if err != nil { - slog.Error("Failed to parse provider ID", "error", err) + err = fmt.Errorf("failed to parse provider ID for node %s: %w", node.Name, err) + slog.Error("Failed to parse provider ID", "error", err, "node", node.Name) return "", err } // Reboot the EC2 instance slog.Info("Rebooting node", "node", node.Name, "instanceID", instanceID) _, err = c.ec2.RebootInstances(ctx, &ec2.RebootInstancesInput{ InstanceIds: []string{instanceID}, }) if err != nil { - slog.Error("Failed to reboot instance", "error", err, "instanceID", instanceID) + err = fmt.Errorf("failed to reboot instance %s for node %s: %w", instanceID, node.Name, err) + slog.Error("Failed to reboot instance", "error", err, "instanceID", instanceID, "node", node.Name) return "", err }Based on coding guidelines, which recommend "Wrap errors with context using
fmt.Errorf("context: %w", err)in Go code."janitor-provider/pkg/csp/azure/azure.go (1)
92-103: Consider wrapping errors with additional context.Similar to the AWS implementation, wrapping errors with
fmt.Errorfwould provide richer context for upstream error handling while preserving the error chain.π Example error wrapping
// Extract the resource group and VM name from the provider ID resourceGroup, vmName, instanceID, err := parseAzureProviderID(providerID) if err != nil { - slog.Error("Failed to parse provider ID", "error", err) + err = fmt.Errorf("failed to parse Azure provider ID for node %s: %w", node.Name, err) + slog.Error("Failed to parse provider ID", "error", err, "node", node.Name) return "", err } // Reboot the VM _, err = vmssClient.BeginRestart(ctx, resourceGroup, vmName, instanceID, nil) if err != nil { - slog.Error("Failed to send restart signal to node", "error", err, "node", vmName) + err = fmt.Errorf("failed to send restart signal to Azure VM %s: %w", vmName, err) + slog.Error("Failed to send restart signal to node", "error", err, "node", vmName) return "", err }Based on coding guidelines, which recommend "Wrap errors with context using
fmt.Errorf("context: %w", err)in Go code."distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml (2)
63-70: Document probe configuration.The liveness and readiness probes lack inline comments explaining their purpose and what endpoint they check. As per coding guidelines, all values should be documented.
π Suggested documentation
+# Health check probes livenessProbe: + # HTTP endpoint to check if the container is alive httpGet: path: / port: http readinessProbe: + # HTTP endpoint to check if the container is ready to serve traffic httpGet: path: / port: httpBased on coding guidelines: Document all values in Helm chart
values.yamlwith inline comments.
85-87: Document the service port.The service port (50051) should have an inline comment explaining it's the gRPC port for the CSP provider service.
π Suggested documentation
service: + # gRPC port for CSP provider service port: 50051Based on coding guidelines: Document all values in Helm chart
values.yamlwith inline comments..ko.yaml (1)
157-172: Consider revising "Example" wording in descriptionsThe descriptions reference "Example provider" and "Example implementation," but this appears to be production code rather than an example. Consider updating to reflect its actual purpose:
- Line 164: "Example provider for NVSentinel Janitor"
- Line 169: "Example implementation of the CSP provider interface for Janitor"
Suggested revision
annotations: - org.opencontainers.image.description: "Example provider for NVSentinel Janitor" + org.opencontainers.image.description: "CSP provider service for NVSentinel Janitor" labels: org.opencontainers.image.source: "https://github.com/nvidia/nvsentinel" org.opencontainers.image.licenses: "Apache-2.0" org.opencontainers.image.title: "NVSentinel Janitor-Provider" - org.opencontainers.image.description: "Example implementation of the CSP provider interface for Janitor" + org.opencontainers.image.description: "CSP provider service implementing the gRPC interface for cloud operations"Note: Based on the PR context, the single janitor-provider with CSP detection via environment aligns with the past review feedback about avoiding image explosion.
api/proto/csp/v1alpha1/provider.proto (1)
20-48: Consider adding field documentationThe proto definitions are clean and functional, but adding comments would improve maintainability and auto-generated documentation quality. Consider documenting:
- Service RPCs with their purpose and behavior
- Field meanings and constraints (e.g., node_name format expectations)
- Return value semantics (e.g., what request_id represents)
Example documentation
+// CSPProviderService provides cloud service provider operations for node lifecycle management. service CSPProviderService { + // SendRebootSignal initiates a reboot operation for the specified node. rpc SendRebootSignal(SendRebootSignalRequest) returns (SendRebootSignalResponse) {} + // IsNodeReady checks whether the node has completed its reboot and is ready. rpc IsNodeReady(IsNodeReadyRequest) returns (IsNodeReadyResponse) {} + // SendTerminateSignal initiates termination of the specified node. rpc SendTerminateSignal(SendTerminateSignalRequest) returns (SendTerminateSignalResponse) {} } message SendRebootSignalRequest { + // node_name is the Kubernetes node name to reboot. string node_name = 1; }janitor-provider/main.go (1)
99-99: Make port configurableThe port
:50051is hard-coded. Consider making it configurable via an environment variable with a sensible default.Proposed change
+ port := os.Getenv("GRPC_PORT") + if port == "" { + port = "50051" + } + - lis, err := net.Listen("tcp", ":50051") + lis, err := net.Listen("tcp", ":"+port) if err != nil { slog.Error("Failed to listen", "error", err) os.Exit(1) } + + slog.Info("Listening for gRPC connections", "port", port)janitor/pkg/controller/terminatenode_controller.go (2)
328-330: Consider adding input validationThe node name is passed directly to the gRPC request without validation. While the Kubernetes API will validate it, adding an explicit check here would make errors clearer.
+ if node.Name == "" { + return ctrl.Result{}, fmt.Errorf("node name is empty") + } + _, terminateErr := r.CSPClient.SendTerminateSignal(ctx, &cspv1alpha1.SendTerminateSignalRequest{ NodeName: node.Name, })
413-413: Consider adding connection lifecycle management and documenting the in-cluster security boundaryThe gRPC client uses
insecure.NewCredentials()for communication with the CSP provider service (janitor-provider.nvsentinel.svc.cluster.local:50051). While this is acceptable for in-cluster Kubernetes service-to-service communication, the code lacks:
- Connection lifecycle management: The connection is created in
SetupWithManagerbut never explicitly closed or gracefully shutdown- Security assumption documentation: No comments explaining that this assumes in-cluster communication to a co-located trusted service
- TLS support option: Even for in-cluster communication, consider making TLS optional via configuration for defense-in-depth, especially given the sensitivity of operations (SendRebootSignal, SendTerminateSignal)
Add comments clarifying the security model and implement proper connection cleanup via controller shutdown hooks.
janitor/pkg/controller/test_utils.go (1)
235-237: Consider returning error instead oflog.Fatalf.
log.Fatalfterminates the entire process immediately, preventing proper test cleanup and making debugging harder. Consider returning an error to give callers control over failure handling.π Proposed refactor
-func NewMockCSPTestHelper() *MockCSPTestHelper { +func NewMockCSPTestHelper() (*MockCSPTestHelper, error) { lis := bufconn.Listen(1024 * 1024) server := grpc.NewServer() mockServer := NewMockCSPServer() cspv1alpha1.RegisterCSPProviderServiceServer(server, mockServer) go func() { if err := server.Serve(lis); err != nil && !errors.Is(err, grpc.ErrServerStopped) { log.Printf("Mock CSP server exited with error: %v", err) } }() // Create client connection conn, err := grpc.NewClient( "passthrough://bufnet", grpc.WithContextDialer(func(context.Context, string) (net.Conn, error) { return lis.Dial() }), grpc.WithTransportCredentials(insecure.NewCredentials()), ) if err != nil { - log.Fatalf("Failed to create mock CSP client: %v", err) + server.GracefulStop() + return nil, fmt.Errorf("failed to create mock CSP client: %w", err) } return &MockCSPTestHelper{ Server: mockServer, Client: cspv1alpha1.NewCSPProviderServiceClient(conn), grpcServer: server, listener: lis, - } + }, nil }
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (4)
api/gen/go/csp/v1alpha1/provider.pb.gois excluded by!**/*.pb.go,!**/gen/**api/gen/go/csp/v1alpha1/provider_grpc.pb.gois excluded by!**/*.pb.go,!**/gen/**janitor-provider/go.sumis excluded by!**/*.sumjanitor/go.sumis excluded by!**/*.sum
π Files selected for processing (41)
.gitignore.ko.yamlapi/Makefileapi/proto/csp/v1alpha1/provider.protodistros/kubernetes/nvsentinel/Chart.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/Chart.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/_helpers.tpldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrolebinding.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/networkpolicy.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/service.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/values.yamldistros/kubernetes/nvsentinel/charts/janitor/templates/configmap.yamldistros/kubernetes/nvsentinel/values-tilt.yamldistros/kubernetes/nvsentinel/values.yamljanitor-provider/Tiltfilejanitor-provider/go.modjanitor-provider/main.gojanitor-provider/pkg/csp/aws/aws.gojanitor-provider/pkg/csp/azure/azure.gojanitor-provider/pkg/csp/client.gojanitor-provider/pkg/csp/client_test.gojanitor-provider/pkg/csp/gcp/gcp.gojanitor-provider/pkg/csp/kind/kind.gojanitor-provider/pkg/csp/nebius/nebius.gojanitor-provider/pkg/csp/oci/oci.gojanitor-provider/pkg/model/csp.gojanitor/go.modjanitor/pkg/config/config.gojanitor/pkg/controller/rebootnode_controller.gojanitor/pkg/controller/rebootnode_controller_test.gojanitor/pkg/controller/suite_test.gojanitor/pkg/controller/terminatenode_controller.gojanitor/pkg/controller/terminatenode_controller_test.gojanitor/pkg/controller/test_utils.goscripts/build-image-list.shscripts/buildko.shscripts/check-image-attestations.shtilt/Tiltfile
π§° Additional context used
π Path-based instructions (4)
**/values.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/values.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
janitor-provider/pkg/csp/client.gojanitor-provider/pkg/csp/oci/oci.gojanitor/pkg/controller/terminatenode_controller_test.gojanitor/pkg/config/config.gojanitor-provider/pkg/csp/azure/azure.gojanitor-provider/pkg/csp/kind/kind.gojanitor-provider/pkg/csp/aws/aws.gojanitor/pkg/controller/terminatenode_controller.gojanitor-provider/pkg/csp/nebius/nebius.gojanitor-provider/pkg/csp/gcp/gcp.gojanitor-provider/main.gojanitor/pkg/controller/rebootnode_controller_test.gojanitor/pkg/controller/rebootnode_controller.gojanitor/pkg/controller/suite_test.gojanitor/pkg/controller/test_utils.go
**/go.mod
π CodeRabbit inference engine (.github/copilot-instructions.md)
Use
go.modfor each service as a separate Go module with semantic import versioning
Files:
janitor-provider/go.modjanitor/go.mod
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
janitor/pkg/controller/terminatenode_controller_test.gojanitor/pkg/controller/rebootnode_controller_test.gojanitor/pkg/controller/suite_test.go
π§ Learnings (23)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code
Applied to files:
janitor-provider/pkg/csp/client.gojanitor-provider/pkg/csp/oci/oci.gojanitor-provider/pkg/csp/azure/azure.gojanitor-provider/pkg/csp/kind/kind.gojanitor-provider/pkg/csp/aws/aws.gojanitor-provider/pkg/csp/nebius/nebius.gojanitor-provider/pkg/csp/gcp/gcp.gojanitor-provider/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
janitor-provider/go.modjanitor/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
janitor-provider/go.modjanitor/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
janitor-provider/go.modjanitor/go.modjanitor-provider/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
janitor-provider/go.modjanitor/pkg/controller/terminatenode_controller_test.gojanitor/pkg/controller/rebootnode_controller_test.gojanitor/pkg/controller/suite_test.gojanitor/pkg/controller/test_utils.go
π Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
janitor-provider/go.modjanitor/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
janitor-provider/go.modjanitor/pkg/controller/terminatenode_controller.gojanitor/go.modjanitor-provider/pkg/csp/gcp/gcp.goapi/Makefilejanitor/pkg/controller/rebootnode_controller.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
janitor-provider/go.mod
π Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
janitor-provider/go.modjanitor/go.modapi/Makefile
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Include examples for non-obvious configurations in Helm chart documentation
Applied to files:
distros/kubernetes/nvsentinel/charts/janitor-provider/values.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/_helpers.tpldistros/kubernetes/nvsentinel/charts/janitor/templates/configmap.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Document all values in Helm chart `values.yaml` with inline comments
Applied to files:
distros/kubernetes/nvsentinel/charts/janitor-provider/values.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/Chart.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/_helpers.tpl
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Note truthy value requirements in Helm chart documentation where applicable
Applied to files:
distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
π Learning: 2025-11-26T13:54:26.427Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: distros/kubernetes/nvsentinel/values-tilt.yaml:215-223
Timestamp: 2025-11-26T13:54:26.427Z
Learning: For values-tilt.yaml files: Keep documentation minimal and implementation-focused. Unlike production values.yaml files which need comprehensive inline comments, values-tilt.yaml is a Tilt-specific testing/development override file that doesn't require extensive documentation.
Applied to files:
distros/kubernetes/nvsentinel/charts/janitor-provider/values.yamljanitor-provider/Tiltfile
π Learning: 2025-11-19T06:54:22.355Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: distros/kubernetes/nvsentinel/charts/fault-remediation/files/log-collector-job.yaml:39-40
Timestamp: 2025-11-19T06:54:22.355Z
Learning: When reviewing Helm templates in subcharts (charts within a charts/ directory), always check the parent chart's values.yaml for `global` section definitions. In Helm, values defined under `global` in the parent chart are automatically inherited by all subcharts and accessible via `.Values.global.*` within subchart templates. This is a standard Helm feature for sharing configuration across multiple charts.
Applied to files:
distros/kubernetes/nvsentinel/charts/janitor-provider/values.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/_helpers.tpl
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Health monitors should include Helm chart in `distros/kubernetes/nvsentinel/charts/`
Applied to files:
distros/kubernetes/nvsentinel/charts/janitor-provider/values.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/Chart.yamldistros/kubernetes/nvsentinel/Chart.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Applied to files:
janitor/pkg/controller/terminatenode_controller_test.gojanitor-provider/pkg/csp/nebius/nebius.gojanitor/pkg/controller/rebootnode_controller_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
janitor/pkg/controller/terminatenode_controller_test.go
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
janitor/pkg/controller/terminatenode_controller_test.gojanitor/pkg/controller/rebootnode_controller_test.gojanitor/pkg/controller/suite_test.go
π Learning: 2025-11-26T08:03:16.858Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: log-collector/Tiltfile:18-23
Timestamp: 2025-11-26T08:03:16.858Z
Learning: In Tilt configurations, when a Docker build uses the repository root as the build context but the Dockerfile only copies files from a specific subdirectory, it's correct and more efficient to set deps to watch only that subdirectory (e.g., deps=["./"] when the Tiltfile is in that subdirectory) rather than watching the entire repository root. This avoids unnecessary rebuilds when unrelated files change.
Applied to files:
janitor-provider/Tiltfile
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Use semantic versioning for breaking changes in Protocol Buffer messages
Applied to files:
api/Makefile
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Include comprehensive comments for all fields in Protocol Buffer messages
Applied to files:
api/Makefile
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Define Protocol Buffer messages in `data-models/protobufs/` directory
Applied to files:
api/Makefile
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
janitor/pkg/controller/rebootnode_controller_test.go
𧬠Code graph analysis (7)
janitor/pkg/controller/terminatenode_controller_test.go (3)
janitor/pkg/controller/terminatenode_controller.go (1)
TerminateNodeReconciler(50-55)janitor/pkg/config/config.go (2)
Config(26-30)TerminateNodeControllerConfig(61-73)janitor-provider/pkg/model/csp.go (1)
CSPClient(30-39)
janitor/pkg/controller/terminatenode_controller.go (3)
janitor-provider/pkg/model/csp.go (1)
CSPClient(30-39)api/gen/go/csp/v1alpha1/provider_grpc.pb.go (2)
CSPProviderServiceClient(44-48)NewCSPProviderServiceClient(54-56)janitor/pkg/config/config.go (1)
Config(26-30)
janitor-provider/main.go (5)
janitor-provider/pkg/model/csp.go (1)
CSPClient(30-39)commons/pkg/logger/logger.go (1)
SetDefaultStructuredLogger(73-75)commons/pkg/server/server.go (1)
NewServer(375-396)janitor-provider/pkg/csp/client.go (1)
New(46-69)api/gen/go/csp/v1alpha1/provider_grpc.pb.go (1)
RegisterCSPProviderServiceServer(124-133)
janitor/pkg/controller/rebootnode_controller_test.go (5)
janitor/pkg/controller/rebootnode_controller.go (1)
RebootNodeReconciler(74-79)janitor/pkg/config/config.go (2)
Config(26-30)RebootNodeControllerConfig(46-58)janitor-provider/pkg/model/csp.go (1)
CSPClient(30-39)commons/pkg/server/server.go (1)
Server(87-97)janitor/pkg/controller/test_utils.go (1)
DefaultFailureBehavior(61-67)
janitor/pkg/controller/rebootnode_controller.go (2)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (2)
CSPProviderServiceClient(44-48)NewCSPProviderServiceClient(54-56)janitor/pkg/config/config.go (1)
Config(26-30)
janitor/pkg/controller/suite_test.go (1)
janitor/pkg/controller/test_utils.go (2)
MockCSPTestHelper(204-209)NewMockCSPTestHelper(214-245)
janitor/pkg/controller/test_utils.go (1)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (3)
CSPProviderServiceClient(44-48)RegisterCSPProviderServiceServer(124-133)NewCSPProviderServiceClient(54-56)
πͺ ast-grep (0.40.3)
janitor-provider/main.go
[warning] 98-98: "Detected a network listener listening on 0.0.0.0 or an empty string.
This could unexpectedly expose the server publicly as it binds to all
available interfaces. Instead, specify another IP address that is not
0.0.0.0 nor the empty string."
Context: net.Listen("tcp", ":50051")
Note: [CWE-200] Exposure of Sensitive Information to an Unauthorized Actor [REFERENCES]
- https://owasp.org/Top10/A01_2021-Broken_Access_Control
(avoid-bind-to-all-interfaces-go)
πͺ OSV Scanner (2.3.1)
janitor-provider/go.mod
[HIGH] 1-1: github.com/golang-jwt/jwt/v4 4.5.1: Excessive memory allocation during header parsing in github.com/golang-jwt/jwt
(GO-2025-3553)
[HIGH] 1-1: github.com/golang-jwt/jwt/v4 4.5.1: jwt-go allows excessive memory allocation during header parsing
πͺ YAMLlint (1.37.1)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
[warning] 67-67: too many blank lines (1 > 0)
(empty-lines)
[error] 19-19: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrolebinding.yaml
[error] 19-19: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/networkpolicy.yaml
[warning] 30-30: too many blank lines (1 > 0)
(empty-lines)
[error] 17-17: syntax error: expected , but found ''
(syntax)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yaml
[error] 19-19: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/service.yaml
[error] 19-19: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml
[error] 19-19: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
π€ Fix all issues with AI Agents
In
@distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml:
- Around line 50-56: The env values for JANITOR_PROVIDER_PORT, METRICS_PORT and
CSP are rendered as non-strings; update the Helm template to ensure each value
is rendered as a string by applying the quote filter to the entire expression
(e.g., quote the result of .Values.service.port | default 50051,
((.Values.global).metricsPort) | default 2112, and .Values.csp | default "kind")
so Kubernetes receives string values for the environment variables.
In @janitor-provider/main.go:
- Around line 97-149: The main function lacks graceful shutdown: add OS signal
handling (SIGINT, SIGTERM) to create a cancellable root context, pass that
context to long-running components (use context for csp.New if it accepts it,
and cancel when shutting down), and shut down both the metrics server (srv) and
the gRPC server (svr) cleanly instead of exiting immediately; specifically, in
main() set up a signal.NotifyContext or a signal channel, run srv.Serve(ctx) and
svr.Serve(lis) in separate goroutines, on signal call srv.Shutdown(ctx) or
server.Stop/Close equivalent and call svr.GracefulStop()/Stop after allowing a
timeout, cancel context and close listeners, and log errors from those shutdown
steps before exiting.
- Around line 125-133: The metrics server Serve call is blocking so the gRPC
server setup never runs; change the flow to start the metrics server
asynchronously by launching srv.Serve(...) in a separate goroutine and capture
its error on a channel or context so you can log failures without blocking
startup; locate the server.NewServer(...) and srv.Serve(...) calls in main
(symbols: server.NewServer, srv.Serve) and replace the direct blocking call with
a go routine that sends any returned error to an error channel or uses a
context-aware goroutine, while allowing the subsequent gRPC initialization (grpc
server setup and grpcSrv.Serve) to proceed normally.
- Line 15: Add a package-level godoc comment immediately above the "package
main" declaration describing the package's purpose and high-level behavior (what
the janitor-provider binary does, its main responsibilities, and any important
runtime considerations); ensure the comment is a complete sentence starting with
"Package main ..." so it satisfies Go documentation conventions and linter
checks.
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (1)
janitor-provider/go.sumis excluded by!**/*.sum
π Files selected for processing (5)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/service.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/values.yamljanitor-provider/go.modjanitor-provider/main.go
β Files skipped from review due to trivial changes (1)
- distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
janitor-provider/main.go
**/go.mod
π CodeRabbit inference engine (.github/copilot-instructions.md)
Use
go.modfor each service as a separate Go module with semantic import versioning
Files:
janitor-provider/go.mod
π§ Learnings (11)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
janitor-provider/main.gojanitor-provider/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code
Applied to files:
janitor-provider/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
janitor-provider/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Each health monitor service should implement the gRPC service from `data-models/protobufs/`
Applied to files:
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
janitor-provider/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
janitor-provider/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
janitor-provider/go.mod
π Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
janitor-provider/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
janitor-provider/go.mod
π Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
janitor-provider/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
janitor-provider/go.mod
πͺ OSV Scanner (2.3.1)
janitor-provider/go.mod
[HIGH] 1-1: github.com/golang-jwt/jwt/v4 4.5.1: Excessive memory allocation during header parsing in github.com/golang-jwt/jwt
(GO-2025-3553)
[HIGH] 1-1: github.com/golang-jwt/jwt/v4 4.5.1: jwt-go allows excessive memory allocation during header parsing
πͺ YAMLlint (1.37.1)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
[error] 19-19: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/service.yaml
[error] 19-19: syntax error: expected the node content, but found '-'
(syntax)
π Additional comments (3)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/service.yaml (1)
14-30: LGTM!The Service template is well-structured with appropriate defaults (gRPC on 50051, metrics on 2112). The use of Helm helpers for naming and labeling ensures consistency across the chart.
Note: The YAMLlint syntax error on line 19 is a false positiveβ
{{- include ... | nindent 4}}is valid Helm templating that the linter doesn't understand.distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml (1)
39-49: Container configuration looks good.The image tag fallback chain provides flexibility, and the port definitions follow conventions (metrics: 2112, gRPC service: 50051). The YAMLlint error on line 19 is a false positive due to Helm templating syntax.
janitor-provider/go.mod (1)
69-69: Upgradegithub.com/golang-jwt/jwt/v4to v4.5.2 to fix HIGH severity vulnerability.This dependency contains CVE-2025-30204 (GHSA-mh63-6h87-95cp), which allows excessive memory allocation during header parsing. The fix is available in v4.5.2.
π Fix
Run the following commands to upgrade:
cd janitor-provider go get github.com/golang-jwt/jwt/[email protected] go mod tidyLikely an incorrect or invalid review comment.
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
π€ Fix all issues with AI Agents
In @janitor-provider/go.mod:
- Line 70: Update the vulnerable dependency github.com/golang-jwt/jwt/v4 from
v4.5.1 to v4.5.2 in the module declaration (replace the version entry for
github.com/golang-jwt/jwt/v4 v4.5.1 with v4.5.2), then run go get
github.com/golang-jwt/jwt/[email protected] and go mod tidy to refresh the go.sum and
ensure the new version is resolved; verify with go mod graph or the provided
verification script to confirm no older v4.5.1 remains in the dependency graph.
In @janitor-provider/main.go:
- Line 15: Add a package-level godoc comment above the package main declaration
describing the purpose and high-level behavior of this package (what
janitor-provider does, intended audience, and any important usage/constraints);
place the comment immediately above the "package main" line so it becomes the
package-level documentation required by the coding guidelines.
- Line 76: The call to s.cspClient.IsNodeReady uses a hardcoded "Node is ready"
message; change IsNodeReadyRequest to include a request_id field in the proto,
regenerate stubs, have SendRebootSignal return that request_id (or capture its
returned request_id variable) and pass that request_id into
s.cspClient.IsNodeReady (replace the hardcoded string), and update all CSP
implementations (OCI/Azure/AWS/GCP/Nebius) and tests to consume the new
IsNodeReadyRequest.request_id semantics.
π§Ή Nitpick comments (5)
distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml (3)
15-24: Add inline comments for core deployment values per coding guidelines.The coding guidelines require all values in Helm chart
values.yamlto be documented with inline comments. The CSP configuration sections have excellent documentation, but these core values are missing inline comments explaining their purpose.π Suggested inline comments
+# Number of janitor-provider replicas to deploy replicaCount: 1 # Image configuration image: + # Container image repository repository: ghcr.io/nvidia/nvsentinel/janitor-provider + # Image pull policy (Always, IfNotPresent, Never) pullPolicy: IfNotPresent + # Image tag (defaults to Chart.appVersion if empty) tag: "" +# Override the chart name nameOverride: "" +# Override the full release name fullnameOverride: "janitor-provider"Based on coding guidelines, document all values in Helm chart
values.yamlwith inline comments.
52-59: Consider adding a liveness probe for better pod lifecycle management.The readinessProbe configuration looks appropriate for using the HTTP health endpoint from the metrics server. However, adding a livenessProbe would help Kubernetes detect and restart hung containers.
π Suggested liveness probe
readinessProbe: httpGet: path: /healthz port: metrics initialDelaySeconds: 5 periodSeconds: 10 timeoutSeconds: 3 failureThreshold: 3 +livenessProbe: + httpGet: + path: /healthz + port: metrics + initialDelaySeconds: 15 + periodSeconds: 20 + timeoutSeconds: 5 + failureThreshold: 3
37-70: Add inline comments for remaining configuration values.Per coding guidelines, all values should have inline comments. These standard Kubernetes deployment values would benefit from brief descriptions.
π Suggested inline comments
+# Additional annotations to add to pods podAnnotations: {} +# Additional labels to add to pods podLabels: {} +# Resource limits and requests for the container resources: {} # We usually recommend not to specify default resources... ... +# Node selector for pod scheduling nodeSelector: {} +# Tolerations for pod scheduling tolerations: [] +# Affinity rules for pod scheduling affinity: {} service: + # gRPC service port port: 50051 +# Additional environment variables to inject extraEnv: []Based on coding guidelines, document all values in Helm chart
values.yamlwith inline comments.distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml (2)
141-142: Add liveness probe template for production resilience.The readinessProbe is properly templated from values, but there's no corresponding livenessProbe template. For production deployments, a liveness probe helps Kubernetes detect and restart stuck containers. Consider adding a livenessProbe section in both the values.yaml and this template.
π Proposed template addition
readinessProbe: {{- toYaml .Values.readinessProbe | nindent 12 }} + {{- with .Values.livenessProbe }} + livenessProbe: + {{- toYaml . | nindent 12 }} + {{- end }}
156-156: Remove redundantrestartPolicy: Always.
restartPolicy: Alwaysis the default (and only valid value) for Deployment pod specs. This line can be safely removed to reduce template noise.π Proposed fix
{{- end }} - restartPolicy: Always {{- with (((.Values.global).systemNodeSelector) | default .Values.nodeSelector) }}
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (8)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/values.yamldistros/kubernetes/nvsentinel/charts/janitor/templates/deployment.yamldistros/kubernetes/nvsentinel/charts/janitor/templates/serviceaccount.yamldistros/kubernetes/nvsentinel/charts/janitor/values.yamljanitor-provider/go.modjanitor-provider/main.go
π€ Files with no reviewable changes (3)
- distros/kubernetes/nvsentinel/charts/janitor/templates/serviceaccount.yaml
- distros/kubernetes/nvsentinel/charts/janitor/templates/deployment.yaml
- distros/kubernetes/nvsentinel/charts/janitor/values.yaml
π§° Additional context used
π Path-based instructions (3)
**/go.mod
π CodeRabbit inference engine (.github/copilot-instructions.md)
Use
go.modfor each service as a separate Go module with semantic import versioning
Files:
janitor-provider/go.mod
**/values.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
janitor-provider/main.go
π§ Learnings (18)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Each health monitor service should implement the gRPC service from `data-models/protobufs/`
Applied to files:
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Note truthy value requirements in Helm chart documentation where applicable
Applied to files:
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
janitor-provider/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
janitor-provider/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
janitor-provider/go.modjanitor-provider/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
janitor-provider/go.mod
π Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
janitor-provider/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
janitor-provider/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
janitor-provider/go.mod
π Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
janitor-provider/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Document all values in Helm chart `values.yaml` with inline comments
Applied to files:
distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Include examples for non-obvious configurations in Helm chart documentation
Applied to files:
distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
π Learning: 2025-11-26T13:54:26.427Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: distros/kubernetes/nvsentinel/values-tilt.yaml:215-223
Timestamp: 2025-11-26T13:54:26.427Z
Learning: For values-tilt.yaml files: Keep documentation minimal and implementation-focused. Unlike production values.yaml files which need comprehensive inline comments, values-tilt.yaml is a Tilt-specific testing/development override file that doesn't require extensive documentation.
Applied to files:
distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code
Applied to files:
janitor-provider/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages
Applied to files:
janitor-provider/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
janitor-provider/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code
Applied to files:
janitor-provider/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
janitor-provider/main.go
πͺ OSV Scanner (2.3.1)
janitor-provider/go.mod
[HIGH] 1-1: github.com/golang-jwt/jwt/v4 4.5.1: Excessive memory allocation during header parsing in github.com/golang-jwt/jwt
(GO-2025-3553)
[HIGH] 1-1: github.com/golang-jwt/jwt/v4 4.5.1: jwt-go allows excessive memory allocation during header parsing
πͺ YAMLlint (1.37.1)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
[error] 19-19: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yaml
[error] 19-19: syntax error: expected the node content, but found '-'
(syntax)
β° Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
- GitHub Check: ko-build-test (health-monitors/csp-health-monitor, ./cmd/maintenance-notifier)
- GitHub Check: ko-build-test (health-monitors/csp-health-monitor, ./cmd/csp-health-monitor)
- GitHub Check: ko-build-test (fault-quarantine, .)
- GitHub Check: container-build-test (gpu-health-monitor-dcgm3, make -C health-monitors/gpu-health-monitor docker...
- GitHub Check: container-build-test (log-collector, make -C log-collector docker-build-log-collector)
- GitHub Check: container-build-test (syslog-health-monitor, make -C health-monitors/syslog-health-monitor docker...
- GitHub Check: container-build-test (file-server-cleanup, make -C log-collector docker-build-file-server-cleanup)
- GitHub Check: container-build-test (gpu-health-monitor-dcgm4, make -C health-monitors/gpu-health-monitor docker...
- GitHub Check: E2E Tests (AMD64 + MongoDB)
- GitHub Check: E2E Tests (AMD64 + PostgreSQL)
- GitHub Check: E2E Tests (ARM64 + PostgreSQL)
- GitHub Check: E2E Tests (ARM64 + MongoDB)
- GitHub Check: modules-lint-test (labeler)
- GitHub Check: modules-lint-test (fault-remediation)
- GitHub Check: modules-lint-test (janitor)
- GitHub Check: modules-lint-test (platform-connectors)
- GitHub Check: modules-lint-test (store-client)
- GitHub Check: modules-lint-test (fault-quarantine)
- GitHub Check: modules-lint-test (node-drainer)
- GitHub Check: health-monitors-lint-test (kubernetes-object-monitor)
- GitHub Check: health-monitors-lint-test (csp-health-monitor)
- GitHub Check: CodeQL PR Analysis
π Additional comments (4)
janitor-provider/go.mod (1)
1-24: Module structure follows guidelines appropriately.The module is correctly set up as a separate Go module with semantic import versioning. The replace directives for local modules (api, commons) follow the monorepo pattern.
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml (1)
45-55: Container configuration looks good with proper port naming and quoting.The container spec correctly defines named ports (metrics, service) and uses appropriate quoting for environment variable values. The image tag fallback chain provides flexibility across deployment scenarios.
janitor-provider/main.go (1)
151-181: Graceful shutdown implementation looks correct.The errgroup-based concurrent server management with signal handling is well-implemented. Good design decision to log metrics server failures without terminating the main gRPC service, as metrics are non-critical for core functionality.
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yaml (1)
14-49: ServiceAccount template correctly implements CSP-specific workload identity annotations.The conditional logic for AWS IRSA, GCP Workload Identity, Azure Workload Identity, and OCI Workload Identity annotations is well-structured. Each provider's annotations are guarded by appropriate checks for required values.
Nebius does not require ServiceAccount annotations for authenticationβit uses environment variables (
NEBIUS_SA_KEY_FILE,NEBIUS_IAM_TOKEN) and mounted secrets instead, so the template is correctly excluding Nebius-specific annotations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and canβt be posted inline due to platform limitations.
β οΈ Outside diff range comments (2)
janitor/pkg/controller/rebootnode_controller.go (2)
190-206: Add timeout context to IsNodeReady gRPC call.The gRPC call to
IsNodeReadydoes not use a timeout context, despite theCSPOperationTimeoutconstant being defined (line 45). Without a timeout, this call can hang indefinitely if the CSP provider is slow or unresponsive, blocking the reconciliation loop and degrading controller responsiveness.π Proposed fix
} else { + cspCtx, cancel := context.WithTimeout(ctx, CSPOperationTimeout) + defer cancel() - rsp, nodeReadyErr := r.CSPClient.IsNodeReady(ctx, &cspv1alpha1.IsNodeReadyRequest{ + rsp, nodeReadyErr := r.CSPClient.IsNodeReady(cspCtx, &cspv1alpha1.IsNodeReadyRequest{ NodeName: node.Name, }) if nodeReadyErr != nil {
335-352: Add timeout context to SendRebootSignal gRPC call.The gRPC call to
SendRebootSignaldoes not use a timeout context, despite line 340 checking forcontext.DeadlineExceeded. This inconsistency suggests the code expects timeouts but doesn't enforce them. Without a timeout, the call can hang indefinitely, blocking the reconciliation loop.π Proposed fix
} else { // Start the reboot process metrics.GlobalMetrics.IncActionCount(metrics.ActionTypeReboot, metrics.StatusStarted, node.Name) logger.Info("sending reboot signal to node", "node", node.Name) + cspCtx, cancel := context.WithTimeout(ctx, CSPOperationTimeout) + defer cancel() - rsp, rebootErr := r.CSPClient.SendRebootSignal(ctx, &cspv1alpha1.SendRebootSignalRequest{ + rsp, rebootErr := r.CSPClient.SendRebootSignal(cspCtx, &cspv1alpha1.SendRebootSignalRequest{ NodeName: node.Name, }) // Check for timeout if errors.Is(rebootErr, context.DeadlineExceeded) {
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (1)
janitor/pkg/controller/rebootnode_controller.go
π§° Additional context used
π Path-based instructions (1)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
janitor/pkg/controller/rebootnode_controller.go
π§ Learnings (1)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
janitor/pkg/controller/rebootnode_controller.go
𧬠Code graph analysis (1)
janitor/pkg/controller/rebootnode_controller.go (2)
janitor-provider/pkg/model/csp.go (1)
CSPClient(30-39)api/gen/go/csp/v1alpha1/provider_grpc.pb.go (2)
CSPProviderServiceClient(44-48)NewCSPProviderServiceClient(54-56)
π Additional comments (1)
janitor/pkg/controller/rebootnode_controller.go (1)
78-78: LGTM: Field type updated for gRPC client.The CSPClient field type has been correctly updated from the internal model to the generated gRPC client interface, aligning with the PR's objective to introduce a gRPC-based CSP provider service.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
8d2858f to
eae93f5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 13
Caution
Some comments are outside the diff and canβt be posted inline due to platform limitations.
β οΈ Outside diff range comments (2)
janitor-provider/pkg/csp/gcp/gcp.go (1)
70-105: Clarify error handling logic in regex group extraction.The error handling in lines 90-94 appears incorrect. The code checks
if name != ""on line 90, then returns an error on line 93 saying "failed to extract required field %s from node.Spec.ProviderID" withnameas the parameter. However, at this pointnameis guaranteed to be non-empty, so this error branch should never execute.If the intent is to validate that all capture groups have matched values, the validation should check if
match[i]is empty instead. The current logic makes the error handling unreachable.π§ Suggested fix for error handling
for i, name := range groupNames { // groupNames[0] is always an empty string if i == 0 { continue } - if name != "" { + if name != "" && match[i] != "" { result[name] = match[i] - } else { + } else if name != "" { return nil, fmt.Errorf("failed to extract required field %s from node.Spec.ProviderID", name) } }janitor/pkg/controller/test_utils.go (1)
15-15: Add package-level godoc.The coding guidelines require package-level godoc for all Go packages.
As per coding guidelines, package-level godoc is required for all Go packages.
π€ Fix all issues with AI agents
In @.gitignore:
- Around line 447-448: The `.vscode/` ignore rule you added conflicts with
existing negations like `!.vscode/settings.json`, `!.vscode/tasks.json`,
`!.vscode/launch.json`, and `!.vscode/extensions.json`, which prevents those
shared files from being committed; to preserve the shared-configuration pattern
remove the `.vscode/` entry and keep the negations, and also decide whether
workspace files should be sharedβeither remove or clarify the `*.code-workspace`
entry (or comment it) to match that intent.
In @.ko.yaml:
- Around line 157-173: The metadata for the janitor-provider image uses
"Example" which downplays production intent; update the annotations and labels
for the id "janitor-provider" (annotations/org.opencontainers.image.description
and labels/org.opencontainers.image.title and
labels/org.opencontainers.image.description) to consistent, production-ready
wording (e.g., "NVSentinel Janitor Provider" and a clear descriptive sentence)
so they match org.opencontainers.image.source/version/revision fields and avoid
implying non-production status.
In
@distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml:
- Around line 20-28: ClusterRole currently grants only get/list/watch on the
nodes resource but the TerminateNode controller requires the ability to delete
nodes; update the rules block in clusterrole.yaml (the ClusterRole that lists
resources: nodes and verbs: get, list, watch) to include the delete verb so the
controller can perform node deletions after CSP termination succeeds.
In
@distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml:
- Around line 116-137: Add an explicit precedence note to values.yaml
documenting Nebius credential selection: state that iamTokenSecretRef has
highest priority, then serviceAccountKeySecret, then serviceAccountKeyFile, and
iamToken is lowest; include the recommended usage (prefer
serviceAccountKeySecret/serviceAccountKeyFile for production) and place this
note near the existing Nebius credential docs alongside the existing keys
iamTokenSecretRef, iamToken, serviceAccountKeySecret and serviceAccountKeyFile
so the template precedence is clearly reflected.
In @distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml:
- Around line 72-191: The docs list "kwok" as a supported csp provider but there
is no csp.kwok blockβeither add an explicit placeholder block (csp.kwok: {} with
a short comment) or remove "kwok" from the provider list to avoid confusion;
also expand the nebius secret examples by replacing nebius.iamTokenSecretRef: {}
with a concrete inline example (e.g., iamTokenSecretRef: { name:
"nebius-janitor-token", key: "token" }) and add a similar example for
nebius.serviceAccountKeySecret (showing the secret key name like
sa-credentials.json), and state precedence explicitly (e.g., "When both
nebius.iamToken and nebius.iamTokenSecretRef are set, the secret reference takes
precedence") so readers know which value is used.
- Around line 15-71: Add inline comments for every top-level value in this
values.yaml (replicaCount, image.repository/pullPolicy/tag, nameOverride,
fullnameOverride, serviceAccount.* including automount/annotations/name,
podAnnotations, podLabels, resources, readinessProbe, nodeSelector, tolerations,
affinity, service.port, extraEnv) with short examples or expected types; for
extraEnv show the list-of-env format (name/value or valueFrom) and a brief
example. Also document the metrics/probe wiring: explain that
readinessProbe.httpGet.port: "metrics" refers to a named container port defined
in the Deployment/StatefulSet template (the container port named "metrics") and
state the default METRICS_PORT value used by the container, plus note that if
users override service.port or METRICS_PORT they must keep the named port
consistent for probes to work.
In @janitor-provider/go.mod:
- Line 70: Add a Go module replace directive to override the transitive
vulnerable dependency "github.com/golang-jwt/jwt/v4 v4.5.1" to the patched
version v4.5.2 by updating go.mod to include a replace for
github.com/golang-jwt/jwt/v4 => github.com/golang-jwt/jwt/v4 v4.5.2 so the build
uses v4.5.2 even if nebius/gosdk still requires v4.5.1; then run go mod tidy to
update go.sum and verify builds/tests pass.
In @janitor-provider/main.go:
- Around line 58-101: Requests to Nodes().Get in SendRebootSignal, IsNodeReady
and SendTerminateSignal always return codes.Internal; validate req.NodeName
first and return codes.InvalidArgument when empty, and after calling
s.k8sClient.CoreV1().Nodes().Get detect kubernetes NotFound errors (use
api/errors.IsNotFound(err)) and return codes.NotFound, otherwise keep returning
codes.Internal; keep the original error text in the status.Errorf messages so
callers see the underlying error.
- Around line 113-187: run() currently treats grpc.ErrServerStopped as an error
and has no env defaults; update the gRPC serve goroutine that calls
svr.Serve(lis) to treat grpc.ErrServerStopped as non-fatal (suppress/return nil
when errors.Is(err, grpc.ErrServerStopped)), ensure the metrics goroutine still
logs but continues, and add defaults for JANITOR_PROVIDER_PORT (50051) and
METRICS_PORT (8080) when reading them (use fallback if empty or invalid before
net.Listen/strconv.Atoi). Also change the error string from "failed to create
kubernetes clientset" to "failed to create in-cluster kubernetes config" in
run(), and update the PR template (.github/PULL_REQUEST_TEMPLATE.md) to add
fields for tested CSP(s) and Prometheus metrics added.
In @janitor/pkg/controller/terminatenode_controller.go:
- Around line 413-419: The gRPC client connection created via grpc.NewClient
(assigned to conn) is never closed; update the SetupWithManager
(TerminateNodeReconciler) flow to store the connection and register a manager
runnable that closes conn on context cancellation: after creating conn and
assigning r.CSPClient, create and register a connectionCloser (holding conn)
with mgr.Add or mgr.AddRunnable that waits on ctx.Done() and calls conn.Close(),
and ensure you handle registration errors by closing conn immediately on failure
to avoid leaks.
π§Ή Nitpick comments (8)
janitor-provider/pkg/csp/aws/aws.go (2)
99-130: Consider adding more context when wrapping errors.While the logging migration to
slogis correctly implemented with structured fields, some errors could benefit from additional context when returned. For example:
- Line 106:
fmt.Errorf("no provider ID found for node %s: %w", node.Name, err)instead of creating a new error- Line 126:
fmt.Errorf("failed to reboot EC2 instance %s: %w", instanceID, err)instead of returning the raw errorThis aligns with the coding guideline to wrap errors with context using
fmt.Errorf("context: %w", err).β»οΈ Proposed error wrapping improvements
func (c *Client) SendRebootSignal(ctx context.Context, node corev1.Node) (model.ResetSignalRequestRef, error) { // Fetch the node's provider ID providerID := node.Spec.ProviderID if providerID == "" { - err := fmt.Errorf("no provider ID found for node %s", node.Name) + err := fmt.Errorf("failed to reboot node %s: no provider ID found", node.Name) slog.Error("Failed to reboot node", "error", err) return "", err } // Extract the instance ID from the provider ID instanceID, err := parseAWSProviderID(providerID) if err != nil { - slog.Error("Failed to parse provider ID", "error", err) + err = fmt.Errorf("failed to parse provider ID for node %s: %w", node.Name, err) + slog.Error("Failed to parse provider ID", "error", err, "node", node.Name) return "", err } // Reboot the EC2 instance slog.Info("Rebooting node", "node", node.Name, "instanceID", instanceID) _, err = c.ec2.RebootInstances(ctx, &ec2.RebootInstancesInput{ InstanceIds: []string{instanceID}, }) if err != nil { - slog.Error("Failed to reboot instance", "error", err, "instanceID", instanceID) + err = fmt.Errorf("failed to reboot EC2 instance %s: %w", instanceID, err) + slog.Error("Failed to reboot instance", "error", err, "instanceID", instanceID, "node", node.Name) return "", err }As per coding guidelines, wrap errors with context using
fmt.Errorf("context: %w", err).
134-148: Consider adding observability to IsNodeReady.The
IsNodeReadyfunction implements a 5-minute cooldown logic but lacks logging, which could make debugging difficult. Consider adding structured logging to track when the cooldown period is active vs. when the node is determined ready.β»οΈ Suggested logging additions
func (c *Client) IsNodeReady(ctx context.Context, node corev1.Node, message string) (bool, error) { // Sending a reboot request to AWS doesn't update statuses immediately, // the ec2 instance does not report that it isn't in a running state for some time // and kubernetes still sees the node as ready. Wait five minutes before checking the status storedTime, err := time.Parse(time.RFC3339, message) if err != nil { - return false, err + err = fmt.Errorf("failed to parse reboot timestamp for node %s: %w", node.Name, err) + slog.Error("Failed to parse reboot timestamp", "error", err, "node", node.Name, "message", message) + return false, err } + elapsed := time.Since(storedTime) - if time.Since(storedTime) < 5*time.Minute { + if elapsed < 5*time.Minute { + slog.Debug("Node still in cooldown period", "node", node.Name, "elapsed", elapsed, "required", 5*time.Minute) return false, nil } + slog.Info("Node cooldown period complete", "node", node.Name, "elapsed", elapsed) return true, nil }janitor-provider/go.mod (1)
16-16: Remove or refactor testify usage to standard testing package.The janitor-provider tests use
stretchr/testifyexclusively for simple assertions:assert.Equal(),assert.Error(),assert.NotNil(), andrequire.NoError(). Per project learnings, the NVIDIA/NVSentinel repository discourages testify for simple comparison assertions. These can be replaced with straightforward standardtestingpackage patterns (e.g.,if expected != actual { t.Errorf(...) }), eliminating the dependency.janitor-provider/pkg/csp/client.go (1)
45-69: Wrap the βprovider from envβ error with context (or returnstatusupstream), instead of returning the raw parse error.Right now you log context but return
erras-is; callers will lose βCSP envβ context unless they also log. Consider:
return nil, fmt.Errorf("determine CSP provider from env: %w", err)This aligns with the repoβs Go error-wrapping guideline.
janitor-provider/pkg/csp/gcp/gcp.go (1)
137-137: Consider adding more context to operation logs.The info logs at lines 137 and 218 could benefit from additional structured fields such as project, zone, and operation type for better observability and debugging.
π Enhanced logging with additional context
- slog.Info("Sending reset signal to", "node", nodeFields.instance) + slog.Info("Sending reset signal to node", + "node", nodeFields.instance, + "project", nodeFields.project, + "zone", nodeFields.zone)- slog.Info("Sending delete signal to", "node", nodeFields.instance) + slog.Info("Sending delete signal to node", + "node", nodeFields.instance, + "project", nodeFields.project, + "zone", nodeFields.zone)Also applies to: 218-218
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml (2)
141-142: Consider adding a liveness probe.The deployment includes a readiness probe but no liveness probe. For a long-running gRPC service, a liveness probe can help Kubernetes detect and recover from deadlocked or unresponsive container states that the readiness probe might not catch.
π₯ Add liveness probe configuration
readinessProbe: {{- toYaml .Values.readinessProbe | nindent 12 }} + {{- with .Values.livenessProbe }} + livenessProbe: + {{- toYaml . | nindent 12 }} + {{- end }}Then in values.yaml, add a default liveness probe configuration such as:
livenessProbe: grpc: port: 50051 initialDelaySeconds: 30 periodSeconds: 10 timeoutSeconds: 5 failureThreshold: 3
45-56: Consider adding security context for defense in depth.The container specification lacks a security context. As a security best practice, consider adding security context settings such as:
runAsNonRoot: truereadOnlyRootFilesystem: true(if applicable)allowPrivilegeEscalation: false- Appropriate
capabilitiesrestrictionsThis is especially important for a service that makes external CSP API calls.
π Suggested security context
- name: janitor-provider image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default ((.Values.global).image).tag | default .Chart.AppVersion }}" imagePullPolicy: {{ .Values.image.pullPolicy }} + securityContext: + {{- toYaml .Values.securityContext | nindent 12 }} resources: {{- toYaml .Values.resources | nindent 12 }}And in values.yaml:
securityContext: runAsNonRoot: true allowPrivilegeEscalation: false capabilities: drop: - ALL readOnlyRootFilesystem: false # Set to true if the app doesn't need to write to filesystemjanitor/pkg/controller/rebootnode_controller_test.go (1)
122-126: Consider migrating to envtest for controller testing.The test suite uses a fake Kubernetes client, but the coding guidelines specify using
envtestfor testing Kubernetes controllers instead of fake clients. While the fake client works for these scenarios,envtestprovides a more realistic test environment that better catches integration issues.As per coding guidelines, use
envtestfor testing Kubernetes controllers instead of fake clients.
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (4)
api/gen/go/csp/v1alpha1/provider.pb.gois excluded by!**/*.pb.go,!**/gen/**api/gen/go/csp/v1alpha1/provider_grpc.pb.gois excluded by!**/*.pb.go,!**/gen/**janitor-provider/go.sumis excluded by!**/*.sumjanitor/go.sumis excluded by!**/*.sum
π Files selected for processing (44)
.gitignore.ko.yamlapi/Makefileapi/proto/csp/v1alpha1/provider.protodistros/kubernetes/nvsentinel/Chart.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/Chart.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/_helpers.tpldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrolebinding.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/networkpolicy.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/service.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/values.yamldistros/kubernetes/nvsentinel/charts/janitor/templates/configmap.yamldistros/kubernetes/nvsentinel/charts/janitor/templates/deployment.yamldistros/kubernetes/nvsentinel/charts/janitor/templates/serviceaccount.yamldistros/kubernetes/nvsentinel/charts/janitor/values.yamldistros/kubernetes/nvsentinel/values-tilt.yamldistros/kubernetes/nvsentinel/values.yamljanitor-provider/Tiltfilejanitor-provider/go.modjanitor-provider/main.gojanitor-provider/pkg/csp/aws/aws.gojanitor-provider/pkg/csp/azure/azure.gojanitor-provider/pkg/csp/client.gojanitor-provider/pkg/csp/client_test.gojanitor-provider/pkg/csp/gcp/gcp.gojanitor-provider/pkg/csp/kind/kind.gojanitor-provider/pkg/csp/nebius/nebius.gojanitor-provider/pkg/csp/oci/oci.gojanitor-provider/pkg/model/csp.gojanitor/go.modjanitor/pkg/config/config.gojanitor/pkg/controller/rebootnode_controller.gojanitor/pkg/controller/rebootnode_controller_test.gojanitor/pkg/controller/suite_test.gojanitor/pkg/controller/terminatenode_controller.gojanitor/pkg/controller/terminatenode_controller_test.gojanitor/pkg/controller/test_utils.goscripts/build-image-list.shscripts/buildko.shscripts/check-image-attestations.shtilt/Tiltfile
π€ Files with no reviewable changes (3)
- distros/kubernetes/nvsentinel/charts/janitor/templates/serviceaccount.yaml
- distros/kubernetes/nvsentinel/charts/janitor/templates/deployment.yaml
- distros/kubernetes/nvsentinel/charts/janitor/values.yaml
π§ Files skipped from review as they are similar to previous changes (15)
- janitor-provider/Tiltfile
- janitor/pkg/controller/terminatenode_controller_test.go
- scripts/build-image-list.sh
- janitor-provider/pkg/csp/nebius/nebius.go
- janitor/pkg/controller/suite_test.go
- janitor-provider/pkg/csp/oci/oci.go
- janitor-provider/pkg/csp/azure/azure.go
- distros/kubernetes/nvsentinel/charts/janitor-provider/templates/_helpers.tpl
- api/proto/csp/v1alpha1/provider.proto
- distros/kubernetes/nvsentinel/Chart.yaml
- distros/kubernetes/nvsentinel/charts/janitor/templates/configmap.yaml
- distros/kubernetes/nvsentinel/values.yaml
- api/Makefile
- janitor-provider/pkg/csp/kind/kind.go
- distros/kubernetes/nvsentinel/charts/janitor-provider/Chart.yaml
π§° Additional context used
π Path-based instructions (4)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
janitor/pkg/config/config.gojanitor-provider/pkg/csp/aws/aws.gojanitor-provider/pkg/csp/client.gojanitor-provider/pkg/csp/gcp/gcp.gojanitor/pkg/controller/rebootnode_controller_test.gojanitor/pkg/controller/rebootnode_controller.gojanitor/pkg/controller/terminatenode_controller.gojanitor/pkg/controller/test_utils.gojanitor-provider/main.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
janitor/pkg/controller/rebootnode_controller_test.go
**/values.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
**/go.mod
π CodeRabbit inference engine (.github/copilot-instructions.md)
Use
go.modfor each service as a separate Go module with semantic import versioning
Files:
janitor/go.modjanitor-provider/go.mod
π§ Learnings (21)
π Learning: 2025-11-26T13:54:26.427Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: distros/kubernetes/nvsentinel/values-tilt.yaml:215-223
Timestamp: 2025-11-26T13:54:26.427Z
Learning: For values-tilt.yaml files: Keep documentation minimal and implementation-focused. Unlike production values.yaml files which need comprehensive inline comments, values-tilt.yaml is a Tilt-specific testing/development override file that doesn't require extensive documentation.
Applied to files:
tilt/Tiltfile
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code
Applied to files:
janitor-provider/pkg/csp/aws/aws.gojanitor-provider/pkg/csp/client.gojanitor-provider/pkg/csp/gcp/gcp.gojanitor-provider/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
janitor-provider/pkg/csp/client.gojanitor-provider/pkg/csp/gcp/gcp.gojanitor/pkg/controller/rebootnode_controller.gojanitor/go.modjanitor/pkg/controller/terminatenode_controller.gojanitor-provider/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
janitor/pkg/controller/rebootnode_controller_test.gojanitor/pkg/controller/test_utils.gojanitor-provider/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
janitor/pkg/controller/rebootnode_controller_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Applied to files:
janitor/pkg/controller/rebootnode_controller_test.go
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
janitor/pkg/controller/rebootnode_controller_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Document all values in Helm chart `values.yaml` with inline comments
Applied to files:
distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Include examples for non-obvious configurations in Helm chart documentation
Applied to files:
distros/kubernetes/nvsentinel/charts/janitor-provider/values.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Note truthy value requirements in Helm chart documentation where applicable
Applied to files:
distros/kubernetes/nvsentinel/charts/janitor-provider/values.yamldistros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
janitor/pkg/controller/rebootnode_controller.gojanitor-provider/main.go
π Learning: 2026-01-06T21:31:36.113Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:36.113Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.
Applied to files:
janitor/go.mod.ko.yamljanitor-provider/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
janitor/go.modjanitor-provider/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
janitor/go.modjanitor-provider/go.mod
π Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
janitor/go.modjanitor-provider/go.mod
π Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
janitor/go.modjanitor-provider/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
janitor/go.modjanitor-provider/go.modjanitor-provider/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Each health monitor service should implement the gRPC service from `data-models/protobufs/`
Applied to files:
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
π Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.
Applied to files:
.ko.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
janitor-provider/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages
Applied to files:
janitor-provider/main.go
𧬠Code graph analysis (4)
janitor/pkg/controller/rebootnode_controller_test.go (3)
janitor/pkg/controller/rebootnode_controller.go (1)
RebootNodeReconciler(74-79)janitor-provider/pkg/model/csp.go (1)
CSPClient(30-39)janitor/pkg/controller/test_utils.go (1)
DefaultFailureBehavior(61-67)
janitor/pkg/controller/rebootnode_controller.go (2)
janitor-provider/pkg/model/csp.go (1)
CSPClient(30-39)api/gen/go/csp/v1alpha1/provider_grpc.pb.go (2)
CSPProviderServiceClient(44-48)NewCSPProviderServiceClient(54-56)
janitor/pkg/controller/terminatenode_controller.go (3)
janitor-provider/pkg/model/csp.go (1)
CSPClient(30-39)api/gen/go/csp/v1alpha1/provider_grpc.pb.go (2)
CSPProviderServiceClient(44-48)NewCSPProviderServiceClient(54-56)janitor/pkg/config/config.go (1)
Config(26-30)
janitor/pkg/controller/test_utils.go (2)
commons/pkg/server/server.go (2)
Server(87-97)NewServer(375-396)api/gen/go/csp/v1alpha1/provider_grpc.pb.go (3)
CSPProviderServiceClient(44-48)RegisterCSPProviderServiceServer(124-133)NewCSPProviderServiceClient(54-56)
πͺ OSV Scanner (2.3.1)
janitor-provider/go.mod
[HIGH] 1-1: github.com/golang-jwt/jwt/v4 4.5.1: Excessive memory allocation during header parsing in github.com/golang-jwt/jwt
(GO-2025-3553)
[HIGH] 1-1: github.com/golang-jwt/jwt/v4 4.5.1: jwt-go allows excessive memory allocation during header parsing
πͺ YAMLlint (1.37.1)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yaml
[error] 19-19: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
[error] 19-19: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml
[error] 19-19: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrolebinding.yaml
[error] 19-19: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/networkpolicy.yaml
[warning] 30-30: too many blank lines (1 > 0)
(empty-lines)
[error] 17-17: syntax error: expected , but found ''
(syntax)
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/service.yaml
[error] 19-19: syntax error: expected the node content, but found '-'
(syntax)
π Additional comments (27)
scripts/check-image-attestations.sh (1)
69-69: LGTM! Janitor-provider image correctly added to attestation checks.The addition of
nvsentinel/janitor-providerto the Ko-built images list is appropriate and aligns with the PR's introduction of the janitor-provider gRPC service..gitignore (1)
443-444: β mise.toml addition is appropriate.The addition of
mise.tomlto the Version Management Tools section is a legitimate change for version management tooling and poses no conflicts with existing entries.tilt/Tiltfile (2)
118-118: LGTM!The include statement follows the established pattern and is logically placed after the janitor include.
245-248: No additional resource dependencies are required for janitor-provider. The service is a gRPC provider that handles node lifecycle operations with the Kubernetes API and CSP clients, neither of which require separate Tilt resource dependencies. The current dependency onjanitoris appropriate.Likely an incorrect or invalid review comment.
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml (1)
14-19: Static analysis false positive - Helm syntax is correct.The YAMLlint error on line 19 is a false positive. The
{{-syntax is valid Helm templating (the dash trims leading whitespace). YAMLlint doesn't parse Helm templates, so it misinterprets this as invalid YAML.distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrolebinding.yaml (2)
14-19: Static analysis false positive - Helm syntax is correct.Same as in clusterrole.yaml, the YAMLlint error on line 19 is a false positive. The
{{-syntax is valid Helm templating.
20-27: LGTM! ClusterRoleBinding structure is correct and consistent.The binding correctly references the ClusterRole and ServiceAccount using consistent naming templates. The namespace is properly scoped to the Helm release namespace.
scripts/buildko.sh (2)
51-51: LGTM!The janitor-provider module is correctly added to both the go.work initialization and ko build inputs, consistent with other modules in the build pipeline.
Also applies to: 66-66
86-86: LGTM!Proper newline termination ensures consistent output formatting for GitHub Actions workflows.
janitor-provider/go.mod (2)
125-127: LGTM!The replace directives correctly point to local sibling modules, following the project's multi-module structure and the guideline to use
commons/for shared utilities across Go modules.
3-3: No action required. Go 1.25.0 is a valid, officially released version (August 12, 2025). The go.mod declaration is correct.distros/kubernetes/nvsentinel/values-tilt.yaml (3)
80-81: LGTM!Enables the janitor-provider subchart, consistent with other global feature flag patterns in the values file.
324-325: The janitor-provider service is correctly configured with port 50051. The Service template exists, the Deployment exposes the correct containerPort, and the network policy allows the port. The configuration in values-tilt.yaml is valid.
329-330: Document CSP testing status and add observability for janitor-provider service.The janitor-provider service supports six CSPs (kind, aws, gcp, azure, oci, nebius), but testing coverage is incomplete. AWS, GCP, and kind have full UAT setups; azure and oci have stub directories only; nebius has no UAT configuration. Please document which CSPs are production-ready and add UAT for the remaining providers.
Additionally, janitor-provider exposes a
/metricsendpoint via Prometheus but does not define custom metrics to monitor CSP operations. The janitor controller itself tracksjanitor_actions_countandjanitor_action_mttr_seconds, but janitor-provider lacks instrumentation for CSP API latency, error rates, and retry behavior. Consider adding:
- CSP API call duration and error counters
- Request-level metrics (reboot, terminate, ready-check operations)
- Alerting rules for janitor-provider failures
janitor-provider/pkg/csp/aws/aws.go (1)
20-20: LGTM!The migration to standard library
log/slogfollows the coding guideline to use structured logging, and the model import path correctly points to the janitor-provider module.As per coding guidelines, use structured logging via
log/slogin Go code.Also applies to: 31-31
janitor/pkg/config/config.go (2)
34-37: LGTM!The
CSPProviderHostfield additions follow consistent patterns across all configuration structs with appropriate struct tags and comments. Themapstructure:"cspProviderHost"tag correctly handles the camelCase field from YAML configuration.Also applies to: 56-57, 71-72
106-114: LGTM!The CSPProviderHost propagation logic correctly implements the inheritance pattern: controller-specific values take precedence, with fallback to the global configuration. This matches the existing pattern for NodeExclusions.
.ko.yaml (1)
55-88: Confirm whetherhealth-events-analyzeris meant to remain buildable alongsidecsp-health-monitor.This file still defines a
health-events-analyzerbuild entry and addscsp-health-monitor. If the intent was a rename/move, we should remove (or clearly deprecate) the old build entry to avoid duplicated images and confusion.janitor/go.mod (1)
5-20: Address gRPC version inconsistency: janitor pins v1.78.0 while most other modules use v1.77.0 indirectly.The replace directives for internal modules (
github.com/nvidia/nvsentinel/api,commons,data-models,store-clientat lines 126β132) are an established monorepo pattern. However, verify that:
google.golang.org/grpc v1.78.0in janitor is intentional and compatible with v1.77.0 used indirectly by other modules (api, data-models, store-client, etc.)go mod tidyis explicitly run in CI workflows before build/test steps (currently available via Makefile target but not visible in workflow files)- CI/release tooling resolves the local replaces correctly when modules are referenced from outside the monorepo
janitor-provider/pkg/csp/gcp/gcp.go (1)
21-21: LGTM: Logging migration to slog.The switch to
log/slogaligns with the structured logging requirement in the coding guidelines, and the model import path update reflects the new janitor-provider package structure.Based on coding guidelines, structured logging via
log/slogis required for Go code.Also applies to: 32-32
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/service.yaml (1)
14-30: LGTM: Service template is well-structured.The service configuration properly exposes both the gRPC service port (50051) and metrics port (2112) with appropriate defaults. The static analysis syntax error on line 19 is a false positiveβthe Helm templating syntax
{{- include "provider.labels" . | nindent 4}}is correct.distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml (2)
33-38: LGTM: Azure Workload Identity label correctly applied.The conditional Azure Workload Identity label on the pod template aligns with the ServiceAccount annotation and follows Azure's workload identity pattern.
143-155: LGTM: Nebius secret mounting is properly implemented.The conditional volume and volumeMount logic for Nebius service account keys is correct:
- Only applied when provider is "nebius" and
serviceAccountKeySecretis set- Properly mounted at
/etc/nebiusas read-only- Coordinates with the environment variable
NEBIUS_SA_KEY_FILEset on line 120distros/kubernetes/nvsentinel/charts/janitor-provider/templates/networkpolicy.yaml (1)
26-29: Verify whether unrestricted ingress is intentional.The NetworkPolicy allows ingress from any source (no
fromselector specified in the ingress rule on port 50051). This pattern is consistent throughout the codebase across multiple services (event-exporter-mock, csp-api-mock, and others), suggesting it may be intentional. However, confirm if janitor-provider should only accept connections from specific components (e.g., janitor controllers). If so, add afromselector withpodSelectorornamespaceSelectorto restrict access.distros/kubernetes/nvsentinel/charts/janitor-provider/templates/serviceaccount.yaml (1)
20-49: No action neededβrendered YAML is valid.An empty
annotations:field is valid YAML syntax and valid Kubernetes YAML. WhenserviceAccount.annotationsis empty and CSP-specific conditions don't match, the template renders withannotations:containing a null value, which is perfectly acceptable. Testing confirms the output is valid YAML.Likely an incorrect or invalid review comment.
janitor/pkg/controller/test_utils.go (2)
222-225: Verify error handling logic.The condition on line 222 checks
err != nil && errors.Is(err, grpc.ErrServerStopped), which logs errors only when the server is stopped. This seems invertedβtypically you'd want to log errors that are not expected shutdown errors.Consider whether the logic should be:
if err != nil && !errors.Is(err, grpc.ErrServerStopped) { slog.Error("Mock CSP server exited with error", "error", err) }
35-49: Add godoc comment for exported type.The exported
MockCSPBehaviortype lacks a godoc comment.As per coding guidelines, function comments are required for all exported Go functions and types.
β Skipped due to learnings
Learnt from: CR Repo: NVIDIA/NVSentinel PR: 0 File: .github/copilot-instructions.md:0-0 Timestamp: 2025-11-24T22:20:48.152Z Learning: Applies to **/*.go : Function comments required for all exported Go functions
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/clusterrole.yaml
Show resolved
Hide resolved
distros/kubernetes/nvsentinel/charts/janitor-provider/templates/deployment.yaml
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and canβt be posted inline due to platform limitations.
β οΈ Outside diff range comments (1)
janitor/pkg/controller/rebootnode_controller.go (1)
192-233: Unreachable code due to variable shadowing and early return.At line 192,
nodeReadyErris declared with:=inside theifblock, shadowing any outer variable. However, lines 220-233 referencenodeReadyErroutside that block, but this code is unreachable because:
- When
r.Config.ManualModeis false, lines 192-204 handle the error case with an early return- When
r.Config.ManualModeis true,nodeReadyErrremainsnil(from line 190-191)The
nodeReadyErr != nilcheck at line 220 will never be true after the early return at line 204. This appears to be dead code from the migration.Proposed fix: Remove unreachable error handling block
if r.Config.ManualMode { cspReady = true nodeReadyErr = nil } else { rsp, nodeReadyErr := r.CSPClient.IsNodeReady(ctx, &cspv1alpha1.IsNodeReadyRequest{ NodeName: node.Name, }) if nodeReadyErr != nil { logger.Error(nodeReadyErr, "failed to check if node is ready", "node", node.Name) rebootNode.Status.ConsecutiveFailures++ delay := getNextRequeueDelay(rebootNode.Status.ConsecutiveFailures) result = ctrl.Result{RequeueAfter: delay} // Update status and return early return r.updateRebootNodeStatus(ctx, req, originalRebootNode, &rebootNode, result) } cspReady = rsp.IsReady } // Check if kubernetes reports the node is ready. kubernetesReady := false for _, condition := range node.Status.Conditions { if condition.Type == corev1.NodeReady { kubernetesReady = condition.Status == corev1.ConditionTrue } } - // nolint:gocritic // Migrated business logic with if-else chain - if nodeReadyErr != nil { - logger.Error(nodeReadyErr, "node ready status check failed", - "node", node.Name) - - rebootNode.Status.ConsecutiveFailures++ - - rebootNode.SetCompletionTime() - rebootNode.SetCondition(metav1.Condition{ - Type: janitordgxcnvidiacomv1alpha1.RebootNodeConditionNodeReady, - Status: metav1.ConditionFalse, - Reason: "Failed", - Message: fmt.Sprintf("Node status could not be checked from CSP: %s", nodeReadyErr), - LastTransitionTime: metav1.Now(), - }) - - metrics.GlobalMetrics.IncActionCount(metrics.ActionTypeReboot, metrics.StatusFailed, node.Name) - - result = ctrl.Result{} // Don't requeue on failure - } else if cspReady && kubernetesReady { + if cspReady && kubernetesReady {
π€ Fix all issues with AI agents
In @janitor/pkg/controller/test_utils.go:
- Around line 222-226: The goroutine that starts the mock gRPC server uses an
inverted condition and will suppress real errors: in the anonymous func calling
server.Serve(lis) change the if so it logs only when err is non-nil and NOT
grpc.ErrServerStopped (e.g., if err != nil && !errors.Is(err,
grpc.ErrServerStopped)) and leave graceful shutdowns unlogged; reference the
call to server.Serve, the goroutine, grpc.ErrServerStopped, and the log.Printf
call when making this change.
π§Ή Nitpick comments (4)
janitor/pkg/controller/terminatenode_controller.go (2)
414-428: gRPC setup is consistent with rebootnode_controller and properly implements cleanup.The implementation correctly:
- Creates gRPC connection with the configured host
- Stores the connection for lifecycle management
- Registers cleanup via
manager.RunnableFuncto close on shutdownThis follows the same pattern as
rebootnode_controller.go, which is good for consistency.Consider extracting the duplicated gRPC client setup logic (lines 415-428) into a shared helper function to reduce code duplication across controllers. Both
RebootNodeReconcilerandTerminateNodeReconcilerhave identical setup code.Example shared helper
// In a shared file like grpc_client.go func SetupCSPClient(mgr ctrl.Manager, host string) (cspv1alpha1.CSPProviderServiceClient, *grpc.ClientConn, error) { conn, err := grpc.NewClient(host, grpc.WithTransportCredentials(insecure.NewCredentials())) if err != nil { return nil, nil, fmt.Errorf("failed to create CSP client: %w", err) } if err := mgr.Add(manager.RunnableFunc(func(ctx context.Context) error { <-ctx.Done() return conn.Close() })); err != nil { conn.Close() return nil, nil, fmt.Errorf("failed to add grpc connection cleanup to manager: %w", err) } return cspv1alpha1.NewCSPProviderServiceClient(conn), conn, nil }
330-332: Response discarded - verify if RequestId should be captured.The
SendTerminateSignalresponse is discarded. In contrast,SendRebootSignalinrebootnode_controller.go(line 367) usesrsp.RequestIdas the condition message for audit/tracking purposes.Consider whether the terminate operation should similarly capture the
RequestIdfor traceability.Proposed change to capture RequestId
- _, terminateErr := r.CSPClient.SendTerminateSignal(ctx, &cspv1alpha1.SendTerminateSignalRequest{ + rsp, terminateErr := r.CSPClient.SendTerminateSignal(ctx, &cspv1alpha1.SendTerminateSignalRequest{ NodeName: node.Name, })Then use
rsp.RequestIdin the success condition message at line 360:- Message: "Terminate signal sent to CSP", + Message: rsp.RequestId,janitor/pkg/controller/test_utils.go (2)
236-238:log.Fatalfwill terminate the entire test process.If the gRPC client creation fails,
log.Fatalfcallsos.Exit(1), which abruptly terminates the test process without running cleanup or reporting proper test failures. Consider usingpanicinstead, which Ginkgo can recover and report as a test failure, or restructure to return an error.Proposed fix using panic
conn, err := grpc.NewClient( "passthrough://bufnet", grpc.WithContextDialer(func(context.Context, string) (net.Conn, error) { return lis.Dial() }), grpc.WithTransportCredentials(insecure.NewCredentials()), ) if err != nil { - log.Fatalf("Failed to create mock CSP client: %v", err) + panic(fmt.Sprintf("Failed to create mock CSP client: %v", err)) }Note: This requires adding
"fmt"to imports.
249-257: Consider closing the bufconn listener explicitly.The
Stop()method closes the gRPC server and connection but doesn't explicitly close thebufconn.Listener. WhileGracefulStop()should handle this internally, explicit cleanup ensures no resource leaks if the server wasn't started or stopped abnormally.Proposed enhancement
func (h *MockCSPTestHelper) Stop() { if h.grpcServer != nil { h.grpcServer.GracefulStop() } if h.grpcConn != nil { - h.grpcConn.Close() + _ = h.grpcConn.Close() + } + if h.listener != nil { + _ = h.listener.Close() } }
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (3)
janitor/pkg/controller/rebootnode_controller.gojanitor/pkg/controller/terminatenode_controller.gojanitor/pkg/controller/test_utils.go
π§° Additional context used
π Path-based instructions (1)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
janitor/pkg/controller/terminatenode_controller.gojanitor/pkg/controller/test_utils.gojanitor/pkg/controller/rebootnode_controller.go
π§ Learnings (7)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
janitor/pkg/controller/terminatenode_controller.gojanitor/pkg/controller/test_utils.gojanitor/pkg/controller/rebootnode_controller.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
janitor/pkg/controller/terminatenode_controller.gojanitor/pkg/controller/rebootnode_controller.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code
Applied to files:
janitor/pkg/controller/test_utils.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code
Applied to files:
janitor/pkg/controller/test_utils.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code
Applied to files:
janitor/pkg/controller/test_utils.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
janitor/pkg/controller/test_utils.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
janitor/pkg/controller/rebootnode_controller.go
𧬠Code graph analysis (3)
janitor/pkg/controller/terminatenode_controller.go (3)
janitor-provider/pkg/model/csp.go (1)
CSPClient(30-39)api/gen/go/csp/v1alpha1/provider_grpc.pb.go (2)
CSPProviderServiceClient(44-48)NewCSPProviderServiceClient(54-56)janitor/pkg/config/config.go (1)
Config(26-30)
janitor/pkg/controller/test_utils.go (1)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (3)
CSPProviderServiceClient(44-48)RegisterCSPProviderServiceServer(124-133)NewCSPProviderServiceClient(54-56)
janitor/pkg/controller/rebootnode_controller.go (2)
janitor-provider/pkg/model/csp.go (1)
CSPClient(30-39)api/gen/go/csp/v1alpha1/provider_grpc.pb.go (2)
CSPProviderServiceClient(44-48)NewCSPProviderServiceClient(54-56)
π Additional comments (2)
janitor/pkg/controller/test_utils.go (1)
69-200: Well-structured mock server with proper thread-safety.The
MockCSPServerimplementation:
- Uses
sync.RWMutexcorrectly for concurrent access- Separates read operations (
RLock) from write operations (Lock)- Provides granular control via per-operation setters
- Embeds
UnimplementedCSPProviderServiceServerfor forward compatibilityjanitor/pkg/controller/rebootnode_controller.go (1)
401-414: gRPC connection setup is correct with proper cleanup registration.The connection lifecycle is properly managed:
- Connection created at controller setup with error handling
- Cleanup registered via
manager.RunnableFuncthat waits for context cancellation- Connection stored for later use
The insecure credentials usage is intentional for in-cluster service-to-service communication; verify this aligns with your security posture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and canβt be posted inline due to platform limitations.
β οΈ Outside diff range comments (4)
janitor-provider/pkg/csp/kind/kind.go (1)
83-157: Fix timeout detection: youβre checkingctx.Err()instead ofdockerCtx.Err(). WhendockerCtxtimes out,ctx.Err()may still be nil (Line 104-106, 120-122, 142-144), so youβll misclassify timeouts as generic failures.Proposed fix
@@ output, err := cmd.Output() if err != nil { - if ctx.Err() == context.DeadlineExceeded { + if dockerCtx.Err() == context.DeadlineExceeded { return "", fmt.Errorf("timeout while listing containers: %w", err) } @@ if err := cmd.Run(); err != nil { - if ctx.Err() == context.DeadlineExceeded { + if dockerCtx.Err() == context.DeadlineExceeded { return "", fmt.Errorf("timeout while deleting container: %w", err) } @@ output, err = cmd.Output() if err != nil { - if ctx.Err() == context.DeadlineExceeded { + if dockerCtx.Err() == context.DeadlineExceeded { return "", fmt.Errorf("timeout while verifying container deletion: %w", err) }Also:
strings.Contains(string(output), containerName)(Line 113) can false-positive; consider splitting by newline and exact-matching names.janitor-provider/pkg/csp/azure/azure.go (1)
108-159: Wrap and/or logrequestIDparse failures; avoidslog.Info(fmt.Sprintf(...)). As written,time.Parse(time.RFC3339, requestID)returns a raw error (Line 111-114) without context; and Line 152 logs a formatted string instead of structured fields.Proposed fix
@@ func (c *Client) IsNodeReady(ctx context.Context, node corev1.Node, requestID string) (bool, error) { @@ storedTime, err := time.Parse(time.RFC3339, requestID) if err != nil { - return false, err + slog.Error("Invalid requestID timestamp; cannot evaluate readiness window", + "error", err, + "node", node.Name, + "requestID", requestID, + ) + return false, fmt.Errorf("parse requestID timestamp %q: %w", requestID, err) } @@ for _, status := range instanceView.Statuses { if *status.Code == "ProvisioningState/succeeded" { - slog.Info(fmt.Sprintf("Node %s is in a healthy state", node.Name)) + slog.Info("Node is in a healthy state", "node", node.Name) return true, nil } }janitor/pkg/controller/rebootnode_controller.go (2)
192-209: Add per-RPC deadlines; current reconcile can hang indefinitely.IsNodeReadyis called with the raw reconcilectx(Line 192), but nocontext.WithTimeoutis applied, and the existingCSPOperationTimeoutconstant is no longer enforced. This can stall controller workers if the provider is slow/unreachable.Proposed fix
@@ - rsp, nodeReadyErr := r.CSPClient.IsNodeReady(ctx, &cspv1alpha1.IsNodeReadyRequest{ + rpcCtx, cancel := context.WithTimeout(ctx, CSPOperationTimeout) + defer cancel() + rsp, nodeReadyErr := r.CSPClient.IsNodeReady(rpcCtx, &cspv1alpha1.IsNodeReadyRequest{ NodeName: node.Name, RequestId: rebootNode.GetCSPReqRef(), })
338-390: Same timeout issue forSendRebootSignal; current DeadlineExceeded branch is effectively dead. Since you donβt set a deadline on the RPC context (Line 338-340),errors.Is(rebootErr, context.DeadlineExceeded)(Line 343) will rarely/never trigger as intended, and the call can block indefinitely.Proposed fix
@@ - rsp, rebootErr := r.CSPClient.SendRebootSignal(ctx, &cspv1alpha1.SendRebootSignalRequest{ + rpcCtx, cancel := context.WithTimeout(ctx, CSPOperationTimeout) + defer cancel() + rsp, rebootErr := r.CSPClient.SendRebootSignal(rpcCtx, &cspv1alpha1.SendRebootSignalRequest{ NodeName: node.Name, })
π€ Fix all issues with AI agents
In @janitor-provider/pkg/csp/azure/azure.go:
- Around line 256-260: The defer closure currently logs "failed to close http
client" when resp.Body.Close() fails; update the message to accurately reflect
what's being closedβe.g., change the log in the defer that calls
resp.Body.Close() to something like "failed to close IMDS response body" and
keep the error field ("error", cerr) intact; locate the defer func that invokes
resp.Body.Close() in azure.go and replace only the log message text.
π§Ή Nitpick comments (8)
janitor-provider/pkg/csp/aws/aws.go (1)
129-129: Consider using UTC for time consistency.OCI's
SendRebootSignalusestime.Now().UTC().Format(time.RFC3339)while AWS usestime.Now().Format(time.RFC3339)(local time). This inconsistency could cause subtle timezone issues in distributed environments.Proposed fix
- return model.ResetSignalRequestRef(time.Now().Format(time.RFC3339)), nil + return model.ResetSignalRequestRef(time.Now().UTC().Format(time.RFC3339)), niljanitor-provider/pkg/csp/gcp/gcp.go (1)
176-185: Consider nil check for defensive coding.The
*op.Statusdereference on line 181 assumes the status is always present. While this should be the case for GCP Compute operations, a nil check would provide safer error handling against unexpected API responses.Proposed fix
op, err := zoneOperationsClient.Get(ctx, req) if err != nil { return false, err } - if *op.Status == computepb.Operation_DONE { + if op.Status != nil && *op.Status == computepb.Operation_DONE { return true, nil } return false, niljanitor-provider/main.go (2)
117-120: Validate JANITOR_PROVIDER_PORT environment variable.If
JANITOR_PROVIDER_PORTis empty or unset,net.Listenwill bind to":0"(random port), which is likely unintended for a production service.Proposed fix
+ port := os.Getenv("JANITOR_PROVIDER_PORT") + if port == "" { + return fmt.Errorf("JANITOR_PROVIDER_PORT environment variable is required") + } + - lis, err := net.Listen("tcp", fmt.Sprintf(":%s", os.Getenv("JANITOR_PROVIDER_PORT"))) + lis, err := net.Listen("tcp", fmt.Sprintf(":%s", port)) if err != nil { return fmt.Errorf("failed to listen: %w", err) }
58-71: Consider using appropriate gRPC status codes for K8s errors.Currently, all K8s API errors return
codes.Internal. When a node is not found, returningcodes.NotFoundwould be more semantically correct and help clients handle errors appropriately.Proposed fix for SendRebootSignal (apply similar pattern to other handlers)
+import ( + apierrors "k8s.io/apimachinery/pkg/api/errors" +) func (s *janitorProviderServer) SendRebootSignal(ctx context.Context, req *cspv1alpha1.SendRebootSignalRequest) (*cspv1alpha1.SendRebootSignalResponse, error) { slog.Info("Sending reboot signal", "node", req.NodeName) node, err := s.k8sClient.CoreV1().Nodes().Get(ctx, req.NodeName, metav1.GetOptions{}) if err != nil { + if apierrors.IsNotFound(err) { + return nil, status.Errorf(codes.NotFound, "node not found: %v", err) + } return nil, status.Errorf(codes.Internal, "failed to get node: %v", err) }janitor-provider/pkg/csp/kind/kind.go (1)
53-61: Consider keepingrequestIDmeaningful (or document why itβs ignored).requestIDis explicitly ignored in simulation mode (Line 57-58); thatβs fine, but a short doc comment in the interface or here would prevent confusion when correlating logs/traces across CSPs.janitor-provider/pkg/csp/azure/azure.go (2)
74-106: Prefer adding stable correlation fields in slog logs. Right now most error logs donβt includenode.Name/ parsed VM identifiers (e.g., Line 78-79, 86-88, 94-96). Including"node", node.Name(and possibly"providerID", providerID) would make ops/debug much easier.
193-219: Azure client creation errors are logged, but error returns arenβt wrapped with context. Consider returningfmt.Errorf("create default VMSS client: %w", err)at call sites (or wrap locally) so callers donβt have to infer where it failed.janitor/pkg/controller/rebootnode_controller.go (1)
188-398: Operational metrics gap for the new gRPC hop. Since CSP calls moved behind gRPC, consider adding metrics for RPC latency/error counts (per method + status) and alerting on sustained error rate / high latency / connection failures, to match the reviewer ask.
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (1)
api/gen/go/csp/v1alpha1/provider.pb.gois excluded by!**/*.pb.go,!**/gen/**
π Files selected for processing (10)
api/proto/csp/v1alpha1/provider.protojanitor-provider/main.gojanitor-provider/pkg/csp/aws/aws.gojanitor-provider/pkg/csp/azure/azure.gojanitor-provider/pkg/csp/gcp/gcp.gojanitor-provider/pkg/csp/kind/kind.gojanitor-provider/pkg/csp/nebius/nebius.gojanitor-provider/pkg/csp/oci/oci.gojanitor-provider/pkg/model/csp.gojanitor/pkg/controller/rebootnode_controller.go
π§° Additional context used
π Path-based instructions (1)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
janitor-provider/pkg/model/csp.gojanitor-provider/main.gojanitor/pkg/controller/rebootnode_controller.gojanitor-provider/pkg/csp/aws/aws.gojanitor-provider/pkg/csp/oci/oci.gojanitor-provider/pkg/csp/kind/kind.gojanitor-provider/pkg/csp/gcp/gcp.gojanitor-provider/pkg/csp/nebius/nebius.gojanitor-provider/pkg/csp/azure/azure.go
π§ Learnings (7)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
janitor-provider/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code
Applied to files:
janitor-provider/main.gojanitor-provider/pkg/csp/aws/aws.gojanitor-provider/pkg/csp/oci/oci.gojanitor-provider/pkg/csp/kind/kind.gojanitor-provider/pkg/csp/gcp/gcp.gojanitor-provider/pkg/csp/nebius/nebius.gojanitor-provider/pkg/csp/azure/azure.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
janitor-provider/main.gojanitor/pkg/controller/rebootnode_controller.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages
Applied to files:
janitor-provider/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
janitor/pkg/controller/rebootnode_controller.gojanitor-provider/pkg/csp/gcp/gcp.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
janitor/pkg/controller/rebootnode_controller.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Applied to files:
janitor-provider/pkg/csp/nebius/nebius.go
𧬠Code graph analysis (3)
janitor-provider/main.go (5)
janitor-provider/pkg/model/csp.go (1)
CSPClient(30-40)api/gen/go/csp/v1alpha1/provider.pb.go (18)
SendRebootSignalRequest(38-43)SendRebootSignalRequest(56-56)SendRebootSignalRequest(71-73)SendRebootSignalResponse(82-87)SendRebootSignalResponse(100-100)SendRebootSignalResponse(115-117)IsNodeReadyRequest(126-132)IsNodeReadyRequest(145-145)IsNodeReadyRequest(160-162)IsNodeReadyResponse(178-183)IsNodeReadyResponse(196-196)IsNodeReadyResponse(211-213)SendTerminateSignalRequest(222-227)SendTerminateSignalRequest(240-240)SendTerminateSignalRequest(255-257)SendTerminateSignalResponse(266-271)SendTerminateSignalResponse(284-284)SendTerminateSignalResponse(299-301)commons/pkg/server/server.go (4)
NewServer(375-396)WithPort(151-153)WithPrometheusMetrics(212-216)WithSimpleHealth(246-254)janitor-provider/pkg/csp/client.go (1)
New(46-69)api/gen/go/csp/v1alpha1/provider_grpc.pb.go (1)
RegisterCSPProviderServiceServer(124-133)
janitor/pkg/controller/rebootnode_controller.go (2)
api/gen/go/csp/v1alpha1/provider_grpc.pb.go (2)
CSPProviderServiceClient(44-48)NewCSPProviderServiceClient(54-56)api/gen/go/csp/v1alpha1/provider.pb.go (6)
IsNodeReadyRequest(126-132)IsNodeReadyRequest(145-145)IsNodeReadyRequest(160-162)SendRebootSignalRequest(38-43)SendRebootSignalRequest(56-56)SendRebootSignalRequest(71-73)
janitor-provider/pkg/csp/azure/azure.go (6)
janitor-provider/pkg/model/csp.go (1)
ResetSignalRequestRef(24-24)janitor-provider/pkg/csp/aws/aws.go (1)
Client(49-51)janitor-provider/pkg/csp/gcp/gcp.go (1)
Client(40-40)janitor-provider/pkg/csp/kind/kind.go (1)
Client(36-36)janitor-provider/pkg/csp/nebius/nebius.go (1)
Client(60-63)janitor-provider/pkg/csp/oci/oci.go (1)
Client(48-50)
π Additional comments (8)
janitor-provider/pkg/model/csp.go (1)
34-36: LGTM!The updated
IsNodeReadysignature withrequestIDparameter is well-documented and aligns with the proto API definition. The comment clearly explains thatrequestIDis the reference returned bySendRebootSignal.janitor-provider/pkg/csp/oci/oci.go (1)
135-150: LGTM!The
IsNodeReadyimplementation correctly parses the RFC3339 timestamp fromrequestIDand applies the 5-minute cooldown period. The structured logging withslog.Errorfollows the coding guidelines.janitor-provider/pkg/csp/nebius/nebius.go (1)
170-231: LGTM!The
IsNodeReadyimplementation correctly usesrequestIDas the instance ID for polling the Nebius instance state machine. The stop/start pattern with proper state transitions (STOPPING β STOPPED β STARTING β RUNNING) is well-documented and handled appropriately. Structured logging withslogfollows the coding guidelines.janitor-provider/main.go (1)
154-186: LGTM on graceful shutdown handling.The use of
errgroupwith separate goroutines for the metrics server, gRPC server, and shutdown handler is well-structured. The decision to log metrics server failures without terminating the service is reasonable, andGracefulStop()is correctly invoked on context cancellation.api/proto/csp/v1alpha1/provider.proto (1)
20-48: LGTM!The proto definitions are well-structured with clear service methods and message types. The
IsNodeReadyRequestcorrectly includes bothnode_nameandrequest_idfields to support correlation with previousSendRebootSignalcalls. Field naming follows proto3 conventions.janitor-provider/pkg/csp/kind/kind.go (1)
17-29: Good migration tolog/slogand janitor-provider model import.janitor-provider/pkg/csp/azure/azure.go (1)
17-36:slog+ model import path update looks consistent with repo direction.janitor/pkg/controller/rebootnode_controller.go (1)
74-81: gRPC API usage and lifecycle management are correct. The implementation properly usesgrpc.NewClient()(the recommended API for grpc-go v1.78.0+), assigns clear ownership to the reconciler viar.grpcConn, and registers a single cleanup point with the manager that closes the connection on shutdown. There is no risk of double-close or races since no other component accesses the connection.
Merging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
π‘οΈ CodeQL Analysisπ¨ Found 1 security alert(s) π View details |
Summary
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Configuration
Tests
Refactor
Chores
βοΈ Tip: You can customize this high-level summary in your review settings.