fix(remediation): use ctrl-runtime business logic #544
base: main
Walkthrough
Unifies startup under the controller-runtime manager; introduces controller-runtime-based annotation and remediation packages, templated remediation and log-collector orchestration, controller-runtime client usage for CR status checks, exported metrics, a reconciler refactor to injected clients, and Helm defaults that remove ctrlRuntimeEnabled. Tests are updated to use the new public APIs.
Sequence Diagram(s)
sequenceDiagram
participant Manager as Controller-Manager
participant Reconciler as FaultRemediationReconciler
participant Datastore as Event Datastore
participant RemClient as FaultRemediationClient
participant KubeAPI as Kubernetes API
Manager->>Reconciler: start reconcile (watch events)
Reconciler->>Datastore: fetch health event (EventWithToken)
Datastore-->>Reconciler: healthEventWithStatus
Reconciler->>RemClient: CreateMaintenanceResource(ctx, healthEventData)
RemClient->>KubeAPI: get Node / render & create CR (or dry-run)
KubeAPI-->>RemClient: CR created / error
RemClient->>KubeAPI: create/check LogCollector Job
KubeAPI-->>RemClient: Job status
RemClient-->>Reconciler: return (crName / error / ctrl.Result)
Reconciler->>Reconciler: update node annotation via AnnotationManager
Reconciler->>Manager: return ctrl.Result (requeue/done)
Estimated code review effort: 5 (Critical) | ~120 minutes
Pre-merge checks: 2 passed, 1 failed (1 warning)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
golangci-lint (2.5.0): level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"
if err != nil {
	slog.Error("Error updating node label to remediating", "error", err)
	processingErrors.WithLabelValues("label_update_error", nodeName).Inc()
	return "", err
I want to call this out: this introduces some behavior changes. It will now almost always return an error when it encounters one, so there are cases where reconciliation will fully retry rather than ignoring the error and continuing the process.
makes sense
slog.Info("CR completed or failed, allowing retry", "node", nodeName, "crName", groupState.MaintenanceCR)

if err := r.annotationManager.RemoveGroupFromState(ctx, nodeName, group); err != nil {
	//TODO: this also has built in client retry on the api request
Yeah, so it looks like some of the managers, like annotationManager, have built-in retries. I'm planning to add a flag on the manager struct that tells it whether to retry, so that we don't double up on retries when following this "error for retry" pattern.
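A minimal sketch of what such a flag could look like, using only the standard library. The struct, the `retryInternally` field, and `maxAttempts` are hypothetical names for illustration and are not taken from the PR; the real annotationManager wraps Kubernetes API calls.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict is a stand-in for a transient API error.
var errConflict = errors.New("conflict")

// annotationManager is a hypothetical, simplified manager; only the retry
// switch is modeled here.
type annotationManager struct {
	retryInternally bool // hypothetical flag: false when the caller requeues on error
	maxAttempts     int
}

// do runs op once, or up to maxAttempts times when retryInternally is set.
func (m *annotationManager) do(op func() error) error {
	attempts := 1
	if m.retryInternally && m.maxAttempts > 1 {
		attempts = m.maxAttempts
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
	}
	return fmt.Errorf("after %d attempt(s): %w", attempts, err)
}

func main() {
	calls := 0
	// With the flag off, the manager makes a single attempt and hands the
	// error back so the reconciler's own requeue is the only retry layer.
	m := &annotationManager{retryInternally: false, maxAttempts: 3}
	err := m.do(func() error { calls++; return errConflict })
	fmt.Println(calls, errors.Is(err, errConflict))
}
```

With the flag off, the only retry layer is the reconcile requeue, which is the point of the "error for retry" pattern.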
makes sense
lalitadithya left a comment
overall, I like the direction! Let's keep pushing forward on this
_ = r.runLogCollector(ctx, healthEvent)

nodeRemediatedStatus, _, err := r.performRemediation(ctx, healthEventWithStatus)
err = r.runLogCollector(ctx, healthEvent)
We actually don't want to retry on this one. If the log collection fails, that is fine; we can proceed with remediation. There can be cases where the log collection will never run to completion because the node is in very bad shape. If we retry on this, we may end up in an infinite loop.
Okay, I have some future refactors planned for the log collector anyway, but for now I'll make it not return an error if the log collector job fails to complete in time. It will still return an error and retry if there's, e.g., a failure on the API request to create the job resource.
Eventually I want to get the flow of the log collector to be something like this:
- Creates log collector Job (retry on error here)
- Requeues event to be re-processed after X min, this allows other events to be processed
- Checks if job finished, requeue if not (until timeout or complete)
- Eventually after completed or the timeout has passed, continue on
This would likely come later, once we're able to remove the "legacy" reconciler mode, because it would depend on the ctrl-runtime queuing functionality.
okay, that makes sense.
I ended up doing this in this PR: I created a ctrl-runtime version of each of the interfaces, so it seemed to fit here.
5d99015 to f0f60b1
Actionable comments posted: 9
Fix all issues with AI Agents
In @.idea/NVSentinel.iml:
- Around line 1-4: The repository is tracking IDE files .idea/NVSentinel.iml and
.idea/vcs.xml; update .gitignore to add a blanket ignore for .idea/ (or
uncomment/add the ".idea/" entry) so .iml and VCS configs are excluded, then
remove the tracked files from Git with git rm --cached .idea/NVSentinel.iml
.idea/vcs.xml and commit the change to stop committing IDE-specific config
moving forward.
In @fault-remediation/main.go:
- Around line 171-182: The retry loop around
components.FaultRemediationReconciler.Reconcile does not check gCtx cancellation
and can block shutdown; modify the loop to respect gCtx.Done() by exiting early
when the context is cancelled (check gCtx.Err() or select on gCtx.Done() before
each retry and before sleeping), return or break out of the loop when cancelled,
and avoid sleeping unconditionally by using a context-aware wait (select on
time.After(components.FaultRemediationReconciler.Config.UpdateRetryDelay) and
gCtx.Done()). Ensure you still log the last error but stop retrying if gCtx is
cancelled.
In @fault-remediation/pkg/annotation/annotation.go:
- Around line 66-74: GetRemediationState can fail leaving node nil which later
causes a panic when node.DeepCopy() is called; update the error path in
GetRemediationState handling so that when err != nil you either return the error
immediately or create a safe non-nil node placeholder before continuing (e.g.,
instantiate a new corev1.Node or ensure callers check for nil), and ensure
callers of RemediationStateAnnotation logic (where node.DeepCopy() is invoked)
only call DeepCopy on a non-nil node; reference GetRemediationState,
RemediationStateAnnotation, and node.DeepCopy to locate and fix the nil check
and handling.
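The nil-node hazard described for annotation.go can be illustrated with a small stand-in. The `node` type, `getNode`, and `setAnnotation` below are hypothetical simplifications of corev1.Node and the annotation manager, shown only to make the early-return shape concrete.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for an API lookup failure.
var errNotFound = errors.New("node not found")

// node is a hypothetical stand-in for corev1.Node.
type node struct {
	annotations map[string]string
}

// deepCopy returns an independent copy with a non-nil annotations map.
func (n *node) deepCopy() *node {
	out := &node{annotations: make(map[string]string, len(n.annotations))}
	for k, v := range n.annotations {
		out.annotations[k] = v
	}
	return out
}

// getNode returns a nil node together with an error when the lookup fails.
func getNode(nodes map[string]*node, name string) (*node, error) {
	n, ok := nodes[name]
	if !ok {
		return nil, errNotFound
	}
	return n, nil
}

// setAnnotation returns on lookup failure before the node is dereferenced,
// so deepCopy is only ever called on a non-nil node.
func setAnnotation(nodes map[string]*node, name, key, value string) error {
	n, err := getNode(nodes, name)
	if err != nil {
		return fmt.Errorf("get node %q: %w", name, err)
	}
	updated := n.deepCopy()
	updated.annotations[key] = value
	nodes[name] = updated
	return nil
}

func main() {
	nodes := map[string]*node{"node-a": {}}
	fmt.Println(setAnnotation(nodes, "node-a", "state", "remediating"))
	fmt.Println(setAnnotation(nodes, "missing", "state", "remediating"))
}
```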
In @fault-remediation/pkg/crstatus/checker.go:
- Around line 58-66: The ObjectKey for the GET only sets Name and omits
Namespace, which breaks lookups for namespaced CRs; update the client.ObjectKey
creation to include the CR namespace (e.g., client.ObjectKey{Name: crName,
Namespace: maintenanceResource.Namespace} or the appropriate local variable
holding the namespace) before calling c.client.Get, and ensure the warning log
(slog.Warn) also includes the namespace for clearer diagnostics; if the resource
is cluster-scoped, allow the Namespace to be empty when constructing the
ObjectKey.
In @fault-remediation/pkg/remediation/deprecated_remediation.go:
- Around line 266-293: The code calls AnnotationManager.UpdateRemediationState
twice: first using crName and then again using actualCRName after computing
actualCRName := createdCR.GetName(); remove the redundant first update (the
block that uses crName) and keep only the second update that sets the annotation
to actualCRName; ensure the logic still checks group :=
common.GetRemediationGroupForAction(healthEvent.RecommendedAction) and
c.AnnotationManager != nil before calling
AnnotationManager.UpdateRemediationState so the annotation is updated once with
the real CR name (refer to createdCR.GetName, crName, actualCRName, and
AnnotationManager.UpdateRemediationState to locate the code).
In @fault-remediation/pkg/remediation/remediation_test.go:
- Around line 100-127: Two test cases in remediation_test.go use the same name
"Successful rebootnode creation", causing ambiguous test output; change the
`name` field for one or both cases to be unique (e.g. include dryRun state) so
they read distinct names like "Successful rebootnode creation - non-dry-run" and
"Successful rebootnode creation - dry-run"; update the `name` values inside the
table-driven test entries that contain `nodeName: "test-node-1"` and `dryRun:
false`/`true` (the test case structs shown in the diff) to avoid duplicate test
case names.
In @fault-remediation/pkg/remediation/remediation.go:
- Around line 304-317: Between checking existingJobs.Items and calling
c.client.Create(ctx, job) there is a race: another reconcile may create the job
causing Create to return an IsAlreadyExists error; update the creation path in
the function handling existingJobs/ Create so that after err :=
c.client.Create(ctx, job) you check for apierrors.IsAlreadyExists(err) (or the
equivalent IsAlreadyExists helper your codebase uses) and treat that case as
success by returning ctrl.Result{RequeueAfter: 10 * time.Second}, nil (or
proceed as if job was created), while still returning other errors unchanged;
reference symbols: existingJobs.Items, c.client.Create, job, and
apierrors.IsAlreadyExists.
- Around line 183-208: The first annotation update block duplicates the later,
authoritative update and uses the provisional crName; remove the initial call
that checks common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
and calls c.annotationManager.UpdateRemediationState with crName, and keep only
the second update after obtaining actualCRName from maintenance.GetName so the
node annotation is set once with the real CR name (use the existing
healthEvent.NodeName, group and actualCRName with
c.annotationManager.UpdateRemediationState).
Nitpick comments (15)
.idea/vcs.xml (1)
1-6: IDE configuration files should be gitignored, not committed.
The .idea/vcs.xml file is an IntelliJ IDEA configuration artifact that is local to each developer's environment and should not be committed to version control. These files are typically auto-generated by the IDE, vary across developers, and create unnecessary noise and merge conflicts.
Add .idea/ (or at minimum .idea/vcs.xml) to .gitignore instead of committing this file. If the file was committed inadvertently, consider removing it from the repository with git rm --cached .idea/vcs.xml.
fault-remediation/pkg/annotation/annotation.go (2)
40-40: TODO comment should reference an issue.
Per coding guidelines, TODO comments should reference issues for tracking.
51-57: Silent error recovery may mask annotation data corruption.
Returning an empty state when JSON unmarshaling fails is defensive, but it silently discards potentially corrupted data. Consider returning the error to allow callers to decide how to handle it, or at minimum add a metric/alert for this condition.
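As a rough illustration of returning the unmarshal error instead of an empty state, consider the sketch below. The `remediationState` shape and `parseState` are assumptions made for the example; the real RemediationStateAnnotation type has its own fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// remediationState is a hypothetical, simplified annotation payload:
// remediation group name mapped to the CR created for it.
type remediationState struct {
	Groups map[string]string `json:"groups"`
}

// parseState decodes the annotation value. A missing annotation yields an
// empty state, but malformed JSON is reported to the caller instead of being
// silently replaced by an empty state.
func parseState(raw string) (*remediationState, error) {
	state := &remediationState{Groups: map[string]string{}}
	if raw == "" {
		return state, nil
	}
	if err := json.Unmarshal([]byte(raw), state); err != nil {
		return nil, fmt.Errorf("unmarshal remediation state annotation: %w", err)
	}
	if state.Groups == nil {
		// A JSON null for "groups" would otherwise leave a nil map.
		state.Groups = map[string]string{}
	}
	return state, nil
}

func main() {
	s, err := parseState(`{"groups":{"restart":"maintenance-node-a"}}`)
	fmt.Println(s.Groups["restart"], err)

	_, err = parseState(`{"groups":`)
	fmt.Println(err != nil)
}
```

Callers can then choose between retrying, alerting, or overwriting the corrupted value, rather than having that decision made silently.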
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
29-34: Add godoc comment for exported interface.
Per coding guidelines, exported functions and types require documentation comments. The interface methods would benefit from brief descriptions of their behavior and return semantics.
Proposed fix
+// FaultRemediationClientInterface defines the contract for fault remediation operations
+// including maintenance resource creation, log collection, and access to annotation/status components.
 type FaultRemediationClientInterface interface {
 	CreateMaintenanceResource(ctx context.Context, healthEventData *events.HealthEventData) (string, error)
 	RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
 	GetAnnotationManager() annotation.NodeAnnotationManagerInterface
 	GetStatusChecker() crstatus.CRStatusCheckerInterface
 }
31-31: Use eventID instead of eventId for Go naming conventions.
Go convention for acronyms in identifiers is to use all caps (e.g., eventID, httpURL).
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
7-9: Add godoc and use named parameters for clarity.
The interface lacks documentation. The unnamed string parameter is unclear: is it a CR name, node name, or identifier? Named parameters improve readability and self-documentation.
Proposed fix
+// CRStatusCheckerInterface defines the contract for checking CR status
+// to determine if creation should be skipped (e.g., when remediation is in progress).
 type CRStatusCheckerInterface interface {
-	ShouldSkipCRCreation(context.Context, string) bool
+	ShouldSkipCRCreation(ctx context.Context, crName string) bool
 }
fault-remediation/pkg/crstatus/crstatus_test.go (1)
120-212: Consider consolidating duplicate test cases.
TestCheckConditionCtrlRuntime duplicates the exact same test cases as TestCheckCondition. Consider extracting the shared test cases into a variable and reusing them, or using a parameterized approach that tests both checker types.
Example consolidation
var checkConditionTestCases = []struct {
	name     string
	cr       *unstructured.Unstructured
	expected bool
}{
	// ... shared test cases
}

func TestCheckCondition(t *testing.T) {
	cfg := &config.MaintenanceResource{CompleteConditionType: "Completed"}
	checker := NewCRStatusChecker(nil, nil, cfg, false)
	for _, tt := range checkConditionTestCases {
		t.Run(tt.name, func(t *testing.T) {
			assert.Equal(t, tt.expected, checker.checkCondition(tt.cr))
		})
	}
}

func TestCheckConditionCtrlRuntime(t *testing.T) {
	cfg := &config.MaintenanceResource{CompleteConditionType: "Completed"}
	checker := NewCtrlRuntimeCRStatusChecker(nil, cfg, false)
	for _, tt := range checkConditionTestCases {
		t.Run(tt.name, func(t *testing.T) {
			assert.Equal(t, tt.expected, checker.checkCondition(tt.cr))
		})
	}
}
fault-remediation/pkg/annotation/annotation_interface.go (1)
10-13: Consider adding a domain prefix to the annotation key.
Kubernetes best practices recommend using a domain prefix for custom annotations (e.g., nvsentinel.nvidia.com/latestFaultRemediationState) to avoid collisions with other tools and clearly indicate ownership.
Proposed fix
 const (
 	// AnnotationKey is the key for the node annotation that tracks remediation state
-	AnnotationKey = "latestFaultRemediationState"
+	AnnotationKey = "nvsentinel.nvidia.com/latestFaultRemediationState"
 )
fault-remediation/pkg/events/health_event.go (1)
1-3: Add package-level documentation.
Per coding guidelines, package-level godoc is required for all Go packages. Consider adding a brief description of what the events package provides.
Suggested documentation
+// Package events provides health event data types for fault remediation workflows.
 package events

 import "github.com/nvidia/nvsentinel/data-models/pkg/model"
fault-remediation/pkg/crstatus/deprecated_checker.go (3)
15-15: Missing package-level documentation.
Per coding guidelines, package-level godoc is required for all Go packages. Add a package comment describing the purpose of this package.
Suggested fix
+// Package crstatus provides functionality for checking the status of Custom Resources
+// to determine whether maintenance operations should be skipped based on existing CR state.
 package crstatus
50-75: Consider returning error for REST mapping failures instead of silently allowing creation.
When RESTMapping fails (line 62-66), the method logs an error but returns false, which allows CR creation to proceed. This could mask configuration issues. For the PR's goal of "throw errors to trigger retries," consider propagating this error.
77-91: Clarify the return value semantics.
When status or conditions are not found (lines 79-86), the method returns true (meaning "skip creation"). This seems counterintuitive, although if the CR exists but has no status yet, skipping creation is correct. A brief comment explaining this logic would improve maintainability.
fault-remediation/pkg/remediation/remediation.go (1)
362-387: Potential nil pointer dereference when checking job annotations.
Line 367 checks job.Annotations != nil && job.Annotations[...], but if the job was just created, Annotations might be nil, causing the metrics to be recorded. However, if Annotations is nil, the annotation won't exist, so the condition is correct. The issue is that after updating, if the update fails (line 375-378), the function returns false, err, potentially causing duplicate metrics on retry. Consider moving the metrics recording after the successful annotation update.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)
356-358: TODO comments indicate incomplete state transition handling.
Multiple TODO comments (lines 356-357, 418-419, 476-478, 507-508, 559-561) indicate that error handling for state transitions is being ignored. These should be tracked as follow-up work to ensure proper state management.
Do you want me to open a new issue to track these TODO items for proper state transition error handling?
fault-remediation/pkg/remediation/deprecated_remediation.go (1)
379-391: Missing labels on Job template spec.
Labels are set on job.Labels (line 384) but not on job.Spec.Template.Labels. This could affect label-based job selection in some scenarios, though the current List with MatchingLabels queries Job objects directly.
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files selected for processing (24)
- .idea/NVSentinel.iml
- .idea/vcs.xml
- fault-remediation/main.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/annotation/deprecated_annotation.go
- fault-remediation/pkg/crstatus/checker.go
- fault-remediation/pkg/crstatus/crstatus_interface.go
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/crstatus/deprecated_checker.go
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/remediation/deprecated_remediation.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/remediation/templates/log-collector-job.yaml
- fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
Files with no reviewable changes (1)
- fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
Additional context used
Path-based instructions (2)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/crstatus/crstatus_interface.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/annotation/deprecated_annotation.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/main.go
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/crstatus/deprecated_checker.go
- fault-remediation/pkg/crstatus/checker.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/remediation/deprecated_remediation.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior
Files:
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
Learnings (16)
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/reconciler/reconciler.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/remediation/deprecated_remediation.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
- fault-remediation/pkg/remediation/remediation_test.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/annotation/deprecated_annotation.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/remediation/deprecated_remediation.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
- fault-remediation/pkg/initializer/init.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code
Applied to files:
- fault-remediation/main.go
Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.
Applied to files:
- fault-remediation/main.go
- fault-remediation/pkg/remediation/deprecated_remediation.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/remediation/deprecated_remediation.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use separate informers for different Kubernetes resource types
Applied to files:
- fault-remediation/pkg/remediation/deprecated_remediation.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
- fault-remediation/pkg/remediation/deprecated_remediation.go
Code graph analysis (9)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (4)
- fault-remediation/pkg/events/health_event.go (1): HealthEventData (11-14)
- fault-remediation/pkg/annotation/annotation_interface.go (1): NodeAnnotationManagerInterface (16-21)
- fault-remediation/pkg/crstatus/crstatus_interface.go (1): CRStatusCheckerInterface (7-9)
- fault-remediation/pkg/config/config.go (1): MaintenanceResource (18-24)
fault-remediation/pkg/initializer/init.go (3)
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2): FaultRemediationClientInterface (29-34), TemplateData (37-44)
- fault-remediation/pkg/remediation/remediation.go (1): NewCtrlRuntimeRemediationClient (47-95)
- commons/pkg/statemanager/statemanager.go (1): StateManager (197-200)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
- fault-remediation/pkg/annotation/annotation_interface.go (3): RemediationStateAnnotation (24-26), AnnotationKey (12-12), EquivalenceGroupState (29-32)
fault-remediation/pkg/annotation/annotation.go (1)
- fault-remediation/pkg/annotation/annotation_interface.go (3): RemediationStateAnnotation (24-26), AnnotationKey (12-12), EquivalenceGroupState (29-32)
fault-remediation/main.go (3)
- fault-remediation/pkg/initializer/init.go (2): InitializationParams (38-44), InitializeAll (51-165)
- fault-remediation/pkg/reconciler/reconciler.go (1): FaultRemediationReconciler (61-69)
- commons/pkg/auditlogger/roundtripper.go (1): NewAuditingRoundTripper (42-47)
fault-remediation/pkg/crstatus/crstatus_test.go (2)
- fault-remediation/pkg/config/config.go (1): MaintenanceResource (18-24)
- fault-remediation/pkg/crstatus/checker.go (1): NewCtrlRuntimeCRStatusChecker (34-44)
fault-remediation/pkg/crstatus/deprecated_checker.go (1)
- fault-remediation/pkg/config/config.go (1): MaintenanceResource (18-24)
fault-remediation/pkg/crstatus/checker.go (1)
- fault-remediation/pkg/config/config.go (1): MaintenanceResource (18-24)
fault-remediation/pkg/remediation/deprecated_remediation.go (8)
- fault-remediation/pkg/config/config.go (1): Template (27-30)
- fault-remediation/pkg/annotation/annotation_interface.go (1): NodeAnnotationManagerInterface (16-21)
- fault-remediation/pkg/crstatus/deprecated_checker.go (2): CRStatusChecker (29-34), NewCRStatusChecker (36-48)
- fault-remediation/pkg/annotation/deprecated_annotation.go (1): NewNodeAnnotationManager (38-42)
- fault-remediation/pkg/crstatus/crstatus_interface.go (1): CRStatusCheckerInterface (7-9)
- fault-remediation/pkg/events/health_event.go (1): HealthEventData (11-14)
- fault-remediation/pkg/common/equivalence_groups.go (1): GetRemediationGroupForAction (35-45)
- fault-remediation/pkg/metrics/metrics.go (3): LogCollectorErrors (86-92), LogCollectorJobs (71-77), LogCollectorJobDuration (78-85)
YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
[error] 15-15: syntax error: expected , but found ''
(syntax)
Additional comments (42)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)
15-20: LGTM - Template structure is correct.
The Go template syntax with `{{.ApiGroup}}/{{.Version}}` is appropriate for template rendering. The YAMLlint error is expected since YAML linters cannot parse Go template placeholders until they are rendered.
Consider whether a `namespace` field should be added to metadata if this is a namespaced resource, or document that it's cluster-scoped.
fault-remediation/pkg/remediation/templates/log-collector-job.yaml (1)
16-32: Hardcoded test values - clarify if this is a test fixture or production template.
The namespace `test` and image `test:test` appear to be placeholder values. If this template is intended for production use (like `rebootnode-template.yaml`), these should be Go template placeholders (e.g., `{{.Namespace}}`, `{{.Image}}`). If this is purely a test fixture, consider moving it to a test data directory or adding a comment to clarify its purpose.
fault-remediation/pkg/annotation/annotation_interface.go (1)
15-32: LGTM - Clean interface and type definitions.
The interface is well-designed with clear method signatures. The data structures use appropriate JSON tags for serialization. Returning `*corev1.Node` alongside the state from `GetRemediationState` is a pragmatic choice to avoid redundant API calls in callers.
fault-remediation/pkg/events/health_event.go (1)
5-14: LGTM on struct definitions.
The two types appropriately separate JSON and BSON serialization concerns. Consider adding a comment for `HealthEventDoc` similar to the one on `HealthEventData` for consistency.
fault-remediation/main.go (2)
194-197: LGTM on auditing round tripper integration.
The HTTP transport is correctly wrapped with the auditing round tripper before manager creation, ensuring all Kubernetes API calls are audited.
228-239: LGTM on initialization and cleanup flow.
Components are properly initialized with the manager's client, and cleanup is correctly deferred to ensure datastore resources are released on exit.
fault-remediation/pkg/initializer/init.go (2)
58-60: LGTM on validation logic.
The guard correctly ensures a ctrl-runtime client is provided when ctrl-runtime mode is enabled, preventing runtime errors from missing dependencies.
88-112: LGTM on dual-mode client initialization.
The branching logic cleanly separates ctrl-runtime and Kubernetes client initialization paths, with appropriate error handling for each.
fault-remediation/pkg/remediation/remediation_test.go (2)
26-81: LGTM on client creation tests.
Good table-driven test coverage for template validation scenarios, including file existence checks and dry-run mode configuration.
217-413: LGTM on log collector job tests.
Comprehensive coverage of job lifecycle scenarios including creation, completion, failure, timeout, and duplicate job handling. The assertions properly verify both error conditions and expected job counts.
fault-remediation/pkg/metrics/metrics.go (2)
15-21: LGTM on package refactor.
Moving metrics to a dedicated package with exported identifiers enables cleaner separation of concerns and allows multiple packages to record metrics.
29-92: LGTM on exported metric variables.
Metrics are correctly exported while preserving the Prometheus metric names for backward compatibility with existing dashboards and alerts.
fault-remediation/pkg/annotation/deprecated_annotation.go (2)
82-125: LGTM on GetRemediationState signature change.
Returning the fetched node enables callers to avoid redundant API calls when they need both the annotation state and node object. The retry logic with `isRetryableError` properly handles transient failures.
44-60: LGTM on patchNodeWithRetry implementation.
The retry logic correctly uses exponential backoff via `retry.DefaultRetry` and appropriately logs retryable errors before continuing.
fault-remediation/pkg/reconciler/reconciler.go (5)
66-67: LGTM on public Config field and interface-based annotation manager.
Making `Config` public enables access from main.go for retry configuration. Using `annotation.NodeAnnotationManagerInterface` improves testability and supports both legacy and ctrl-runtime implementations.
170-193: LGTM on runLogCollector refactor.
Returning `ctrl.Result` and `error` allows proper propagation of requeue requests and errors for retry handling, aligning with the ctrl-runtime pattern.
237-244: Good use of errors.Join for combining errors.
Using `errors.Join` to combine `createMaintenanceResourceError` and label update errors ensures both failures are visible in logs and upstream error handling.
406-410: Verify error handling change aligns with retry intent.
Line 409 now returns the error instead of continuing, which will trigger retries. Per the coding guidelines, within `retry.RetryOnConflict` blocks, errors should not be wrapped to preserve retry behavior. However, this is outside such a block, so the unwrapped error is appropriate here.
438-441: Error propagation on RemoveGroupFromState failure.
Returning an error here will trigger retries when annotation cleanup fails. This is the intended behavior per the PR discussion to "throw error to trigger retry." The empty CR name return prevents false positive CR existence checks on retry.
fault-remediation/pkg/crstatus/deprecated_checker.go (2)
29-48: LGTM!
The struct definition and constructor follow Go conventions with proper field initialization. The use of `*restmapper.DeferredDiscoveryRESTMapper` aligns with the k8s.io/client-go patterns for dynamic resource mapping.
93-112: LGTM!
The `findConditionStatus` and `isTerminal` methods correctly implement the condition-checking logic. Terminal states ("True" or "False") properly indicate completion, while empty or "Unknown" states allow for CR creation retry.
fault-remediation/pkg/remediation/deprecated_remediation_test.go (5)
15-40: LGTM!
The package rename to `remediation` and import updates align with the PR's restructuring. The test file properly imports the new `events` package for `HealthEventData`.
197-251: Inconsistent capitalization in test strings.
Lines 198, 200, 210, 217, and 227 use inconsistent capitalization for "Config" (e.g., "in-cluster Config" vs typical "config"). This appears intentional per AI summary, but verify this matches actual error messages from the Kubernetes client libraries.
306-344: LGTM!
The test setup correctly uses the renamed public fields (`Clientset`, `KubeClient`, `RestMapper`, `DryRunMode`, `Template`, `TemplateData`) and properly configures the mock client for testing CR creation.
346-369: LGTM!
The test correctly uses `events.HealthEventData` and validates the updated `CreateMaintenanceResource` signature returning `(string, error)`. The assertion logic properly handles both success and failure cases.
373-418: LGTM!
Tests for `RunLogCollectorJob` correctly use the updated signature with `eventId` parameter and handle the `(interface{}, error)` return pattern.
fault-remediation/pkg/crstatus/checker.go (2)
28-44: LGTM!
The struct and constructor properly implement the controller-runtime based CR status checker with appropriate field initialization.
71-105: LGTM!
The condition-checking methods are identical to the deprecated version, maintaining behavioral parity between the two implementations.
fault-remediation/pkg/remediation/remediation.go (2)
38-95: LGTM!
The `CtrlRuntimeRemediationClient` struct and constructor are well-structured. Template loading, dry-run configuration, and dependency initialization (annotation manager, status checker) are properly handled with appropriate error checking.
437-478: LGTM!
The timeout checking logic with configurable `LOG_COLLECTOR_TIMEOUT` environment variable and proper fallback to default is well implemented. The annotation-based guard against duplicate metrics recording is a good pattern.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)
208-217: LGTM!
The controller-runtime Manager setup with envtest follows best practices. Disabling metrics binding with `BindAddress: "0"` is appropriate for tests.
302-315: LGTM!
The `createTestRemediationClient` helper properly constructs `remediation.TemplateData` and uses `NewCtrlRuntimeRemediationClient` with the controller-runtime client.
880-888: LGTM!
The metrics assertions correctly use the dedicated `metrics` package constants and verify that events are properly counted across different status types (created, skipped).
fault-remediation/pkg/remediation/deprecated_remediation.go (3)
62-74: LGTM!
The struct field exports align with the API surface changes documented in the AI summary. The `nodeExistsFunc` override for testing is a good pattern.
165-171: LGTM!
The accessor methods properly return the interface types, enabling dependency injection and testing.
506-518: Intentional nil error return for non-fatal log collector failures.
The code explicitly returns `nil` error for timeout (line 509) and job completion/failure (line 517) to allow reconciliation to continue. This aligns with the PR description's intent. The `slog.Error` calls ensure visibility into these issues.
fault-remediation/pkg/reconciler/reconciler_test.go (6)
40-66: LGTM!
The `MockK8sClient` properly implements the updated interface signatures with `events.HealthEventData`, `ctrl.Result`, and the accessor methods returning interface types.
102-134: LGTM!
The `MockNodeAnnotationManager` correctly implements the updated `GetRemediationState` signature returning `(*annotation.RemediationStateAnnotation, *corev1.Node, error)` and uses the annotation package types.
190-234: LGTM!
The `TestNewReconciler` test properly uses table-driven testing and validates both dry-run enabled and disabled scenarios with the updated return signature.
336-390: LGTM!
The `TestPerformRemediationWithSuccess` test correctly validates the success path with the updated API, including the `HealthEventDoc` conversion and CR name assertion.
601-666: LGTM!
The `TestRunLogCollectorJobErrorScenarios` test properly validates the `ctrl.Result` return pattern including the requeue scenario with `RequeueAfter`.
946-1009: LGTM!
The `TestLogCollectorOnlyCalledWhenShouldCreateCR` test validates the fix for Issue #441, ensuring log collector is only called when `shouldCreateCR` is true to prevent duplicate jobs.
f0f60b1 to b05f1bf
Actionable comments posted: 14
Fix all issues with AI Agents
In @.gitignore:
- Line 143: The .gitignore now contains a blanket `.idea/` entry but still keeps
redundant granular `.idea/*` patterns (e.g., `.idea/mongoSettings.xml`,
`.idea/replstate.xml`, `.idea/**/sonarlint/`) which are unnecessary; remove all
specific `.idea/` file and subdirectory patterns that are subsumed by the
`.idea/` rule so the file is clean and maintainable while keeping the single
`.idea/` line.
In @fault-remediation/main.go:
- Around line 171-182: The retry loop around
components.FaultRemediationReconciler.Reconcile does not observe gCtx
cancellation; modify the loop that uses UpdateMaxRetries and UpdateRetryDelay to
break/return when gCtx is done by selecting on gCtx.Done() before retrying:
after a failed Reconcile, replace the unconditional time.Sleep(...) with a
context-aware wait using select { case
<-time.After(components.FaultRemediationReconciler.Config.UpdateRetryDelay): /*
continue */ case <-gCtx.Done(): /* exit loop/return to allow graceful shutdown
*/ }, and also check <-gCtx.Done() at the top of the retry iteration to avoid
starting another attempt when the context is cancelled.
In @fault-remediation/pkg/annotation/annotation.go:
- Around line 66-74: GetRemediationState may fail and leave `node` empty; do not
swallow that error and continue using `node.DeepCopy()`. In the block handling
`state, node, err := m.GetRemediationState(ctx, nodeName)` remove the fallback
that creates an empty `RemediationStateAnnotation` and instead return the error
immediately (propagate `err`) so callers won't operate on an invalid `node`;
keep `RemediationStateAnnotation` initialization only where a valid node/state
is present and ensure any subsequent use of `node.DeepCopy()` occurs after a
successful GetRemediationState.
In @fault-remediation/pkg/crstatus/checker.go:
- Around line 58-66: The lookup uses client.ObjectKey with only Name, causing
failures for namespaced CRs; include the resource namespace when building the
key so c.client.Get(ctx, key, obj) can find namespaced objects. Update the
ObjectKey construction (used before calling c.client.Get) to set Namespace from
the MaintenanceResource (e.g., maintenanceResource.Namespace or the variable
holding the config.Namespace) while keeping Name as crName, so both namespaced
and cluster-scoped CRs are handled.
In @fault-remediation/pkg/crstatus/deprecated_checker.go:
- Around line 77-91: The checkCondition logic is inverted: in
CRStatusChecker.checkCondition you should treat missing status/conditions as
non-terminal (do not skip) and consider terminal condition statuses as reasons
to skip; change the early returns for failed/unfound unstructured.NestedMap and
NestedSlice to return false, and replace the final return of
"!c.isTerminal(conditionStatus)" with "c.isTerminal(conditionStatus)"; keep
using findConditionStatus and isTerminal to compute the conditionStatus and
determine terminality.
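The behaviour this prompt asks for can be sketched on a plain `map[string]any`, which is the shape an unstructured object carries. This is only an illustration of the proposed logic, not the repository's implementation: the real code uses `unstructured.NestedMap` and `unstructured.NestedSlice`, the function signatures here are simplified, and whether "missing status means do not skip" is the intended semantics is not settled by this review.

```go
package main

import "fmt"

// findConditionStatus returns the "status" of the condition whose "type"
// equals condType, or "" when no such condition exists.
func findConditionStatus(conditions []any, condType string) string {
	for _, c := range conditions {
		cond, ok := c.(map[string]any)
		if !ok {
			continue
		}
		if t, _ := cond["type"].(string); t == condType {
			s, _ := cond["status"].(string)
			return s
		}
	}
	return ""
}

// isTerminal reports whether a condition status is final ("True" or "False").
func isTerminal(status string) bool {
	return status == "True" || status == "False"
}

// checkCondition follows the prompt: a missing status or conditions list is
// treated as non-terminal (false); otherwise terminality decides the result.
func checkCondition(obj map[string]any, condType string) bool {
	status, ok := obj["status"].(map[string]any)
	if !ok {
		return false
	}
	conditions, ok := status["conditions"].([]any)
	if !ok {
		return false
	}
	return isTerminal(findConditionStatus(conditions, condType))
}

func main() {
	done := map[string]any{"status": map[string]any{"conditions": []any{
		map[string]any{"type": "Completed", "status": "True"},
	}}}
	fmt.Println(checkCondition(done, "Completed"))
	fmt.Println(checkCondition(map[string]any{}, "Completed"))
}
```

The condition type `"Completed"` matches the `CompleteConditionType` value used in the tests quoted elsewhere in this review.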
In @fault-remediation/pkg/events/health_event.go:
- Around line 5-8: Add a godoc comment immediately above the exported type
HealthEventDoc that briefly describes what the struct represents (e.g., a
persistent/document representation of a health event), mentions its ID field and
embedded model.HealthEventWithStatus, and any important JSON serialization
behavior; place the comment directly above the HealthEventDoc declaration so it
satisfies Go documentation guidelines.
- Around line 1-3: Add a package-level godoc comment at the top of
health_event.go describing the purpose and responsibilities of the events
package (e.g., what types of events it models and how callers should use it).
Insert a single-line or multi-line comment beginning with "Package events"
immediately above the package events declaration so the package-level
documentation appears in godoc and satisfies the coding guidelines.
In @fault-remediation/pkg/initializer/init.go:
- Line 114: The current slog.Info("Successfully initialized k8s client") is
misleading when UseCtrlRuntime is true because no k8s client is created in that
mode; change the logging around where UseCtrlRuntime is checked (referencing the
UseCtrlRuntime flag/variable and the slog.Info call) to emit a conditional
message: if UseCtrlRuntime is true log something like "Running in
controller-runtime mode; no standalone k8s client initialized", otherwise keep
"Successfully initialized k8s client". Ensure you update only the message logic
near the existing slog.Info invocation so it accurately reflects which path ran.
In @fault-remediation/pkg/metrics/metrics.go:
- Line 30: Update the TODO in metrics.go to reference a tracking issue: replace
the existing "//TODO: evaluate and remove redundant metrics with ctrl-runtime
defaults" with a TODO that includes the issue ID and brief context (e.g. "//
TODO(issue-1234): evaluate and remove redundant metrics with ctrl-runtime
defaults") so the task is traceable; ensure the issue ID matches the repo's
issue tracker and keep the explanatory text unchanged beyond adding the issue
reference.
In @fault-remediation/pkg/remediation/deprecated_remediation.go:
- Around line 266-293: The code updates the node annotation twice: once using
crName and again using actualCRName; remove the first UpdateRemediationState
call (the block that uses crName immediately after computing group :=
common.GetRemediationGroupForAction(...)) so only the subsequent update that
uses actualCRName := createdCR.GetName() remains; ensure you keep the group
computation and the second UpdateRemediationState call (and its error handling)
intact and remove the duplicate slog.Warn/return branch associated with the
first update.
In @fault-remediation/pkg/remediation/remediation.go:
- Around line 310-317: The create path in the reconciler currently calls
c.client.Create(ctx, job) after checking len(existingJobs.Items) == 0 and does
not handle the race where another reconciler created the job and Create returns
an already-exists error; update the c.client.Create error handling to detect
apierrors.IsAlreadyExists(err) (k8s.io/apimachinery/pkg/api/errors) and treat
that case as success by returning ctrl.Result{RequeueAfter: 10 * time.Second},
nil (same behavior as successful create) instead of returning the error, while
still returning real errors unchanged.
- Around line 183-208: The first call that updates the node annotation using the
provisional crName is redundant; remove the initial block that calls
annotationManager.UpdateRemediationState with crName (the block after group :=
common.GetRemediationGroupForAction(healthEvent.RecommendedAction) that uses
crName) and keep only the later update that uses the authoritative actualCRName
from maintenance.GetName(); ensure you still check group != "" and
c.annotationManager != nil and return the existing error handling (slog.Warn and
return err) when the single UpdateRemediationState call fails.
In @fault-remediation/pkg/remediation/templates/log-collector-job.yaml:
- Line 20: Replace the hardcoded namespace value "test" in the
log-collector-job.yaml template with a parameterized Go template variable (e.g.,
change the "namespace: test" entry to use "{{.Namespace}}"), update the
template's expected data model to include Namespace, and ensure any code or
deployment that renders this template (the caller that supplies template data)
passes the Namespace value; this mirrors the approach used in
rebootnode-template.yaml so the template becomes environment-agnostic.
- Line 32: The job template currently hardcodes the placeholder image string
"test:test"; replace this with a real image reference or a templated parameter
so deployments use the correct log-collector image. Update the image field in
the log-collector job template (the line containing "image: test:test") to
reference a configurable variable (e.g., a Helm value or environment variable
like logCollector.image / .Values.logCollector.image) or the intended production
image name, and ensure any charts/values or deployment tooling are updated to
provide that value.
Nitpick comments (6)
fault-remediation/pkg/events/health_event.go (1)
10-14: Clarify godoc comment to distinguish HealthEventData from HealthEventDoc.
The godoc comment doesn't explain the distinction between `HealthEventData` (BSON-tagged) and `HealthEventDoc` (JSON-tagged). Consider documenting the intended use case for each type to improve maintainability.
Proposed improvement
-// HealthEventData represents health event data with string ID for compatibility
+// HealthEventData represents health event data with BSON "_id,omitempty" tag for MongoDB storage.
+// Use HealthEventDoc for JSON-based representations.
 type HealthEventData struct {
 	ID                          string `bson:"_id,omitempty"`
 	model.HealthEventWithStatus `bson:",inline"`
 }
fault-remediation/pkg/crstatus/crstatus_test.go (1)
120-212: Consider reducing test duplication with a helper function.
`TestCheckConditionCtrlRuntime` duplicates ~90 lines from `TestCheckCondition`. Consider extracting a helper function that accepts the checker as a parameter to test both implementations with the same test cases, improving maintainability and ensuring test case parity.
Proposed refactor
+func testCheckCondition(t *testing.T, checker interface {
+	checkCondition(*unstructured.Unstructured) bool
+}) {
+	tests := []struct {
+		name     string
+		cr       *unstructured.Unstructured
+		expected bool
+	}{
+		{
+			name: "no status returns skip - in progress",
+			cr: &unstructured.Unstructured{
+				Object: map[string]any{
+					"metadata": map[string]any{"name": "test-cr"},
+				},
+			},
+			expected: true,
+		},
+		// ... rest of test cases
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			result := checker.checkCondition(tt.cr)
+			assert.Equal(t, tt.expected, result)
+		})
+	}
+}
+
 func TestCheckCondition(t *testing.T) {
 	cfg := &config.MaintenanceResource{
 		CompleteConditionType: "Completed",
 	}
 	checker := NewCRStatusChecker(nil, nil, cfg, false)
-
-	tests := []struct {
-		// ... test cases
-	}
-
-	for _, tt := range tests {
-		t.Run(tt.name, func(t *testing.T) {
-			result := checker.checkCondition(tt.cr)
-			assert.Equal(t, tt.expected, result)
-		})
-	}
+	testCheckCondition(t, checker)
 }

 func TestCheckConditionCtrlRuntime(t *testing.T) {
 	cfg := &config.MaintenanceResource{
 		CompleteConditionType: "Completed",
 	}
 	checker := NewCtrlRuntimeCRStatusChecker(nil, cfg, false)
-
-	tests := []struct {
-		// ... duplicate test cases
-	}
-
-	for _, tt := range tests {
-		t.Run(tt.name, func(t *testing.T) {
-			result := checker.checkCondition(tt.cr)
-			assert.Equal(t, tt.expected, result)
-		})
-	}
+	testCheckCondition(t, checker)
 }
fault-remediation/pkg/initializer/init.go (1)
88-112: Consider extracting TemplateData initialization to reduce duplication.
The `TemplateData` struct initialization (lines 91-95 and 103-106) is duplicated across both initialization paths. Extract it to a variable before the conditional to improve maintainability.
Proposed refactor
+	templateData := remediation.TemplateData{
+		TemplateMountPath:   tomlConfig.Template.MountPath,
+		TemplateFileName:    tomlConfig.Template.FileName,
+		MaintenanceResource: tomlConfig.MaintenanceResource,
+	}
+
 	if params.UseCtrlRuntime {
 		remediationClient, err = remediation.NewCtrlRuntimeRemediationClient(
 			ctrlruntimeClient,
-			params.DryRun, remediation.TemplateData{
-				TemplateMountPath:   tomlConfig.Template.MountPath,
-				TemplateFileName:    tomlConfig.Template.FileName,
-				MaintenanceResource: tomlConfig.MaintenanceResource,
-			})
+			params.DryRun,
+			templateData)
 		if err != nil {
 			return nil, fmt.Errorf("error while initializing ctrl runtime client: %w", err)
 		}
 	} else {
 		remediationClient, clientSet, err = remediation.NewK8sClient(
 			params.KubeconfigPath,
 			params.DryRun,
-			remediation.TemplateData{
-				TemplateMountPath:   tomlConfig.Template.MountPath,
-				TemplateFileName:    tomlConfig.Template.FileName,
-				MaintenanceResource: tomlConfig.MaintenanceResource,
-			},
+			templateData,
 		)
 		if err != nil {
 			return nil, fmt.Errorf("error while initializing kubernetes client: %w", err)
 		}
 	}
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)
17-54: Consider grouping imports according to Go conventions.
The imports have non-standard ordering with `context` followed by a blank line, then `log`. Standard Go convention groups stdlib imports together, then a blank line, then external packages, then a blank line, then internal packages.
1109-1111: Avoid using `time.Sleep` for synchronization in tests.
Using `time.Sleep(500 * time.Millisecond)` for waiting on event processing is fragile. Consider using `assert.Eventually` consistently for better test reliability.
Suggested approach
-	// Allow time for second event to be processed (should be deduplicated)
-	time.Sleep(500 * time.Millisecond)
+	// Wait for second event to be processed (should be deduplicated)
+	time.Sleep(100 * time.Millisecond) // Brief pause to allow processing
Or better, add an `Eventually` check for a specific condition that indicates processing completed.
fault-remediation/pkg/remediation/templates/log-collector-job.yaml (1)
28-32: Consider adding resource limits and additional parameterization.
To improve cluster stability and operational flexibility:
- Define CPU and memory resource requests and limits for the log-collector container
- Consider parameterizing additional fields such as `ttlSecondsAfterFinished`, `serviceAccountName`, and container arguments/env vars if they vary across deployments
Example resource limits
 containers:
   - name: log-collector
     image: {{.Image}}
+    resources:
+      requests:
+        memory: "128Mi"
+        cpu: "100m"
+      limits:
+        memory: "256Mi"
+        cpu: "200m"
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files selected for processing (23)
- .gitignore
- fault-remediation/main.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/annotation/deprecated_annotation.go
- fault-remediation/pkg/crstatus/checker.go
- fault-remediation/pkg/crstatus/crstatus_interface.go
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/crstatus/deprecated_checker.go
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/remediation/deprecated_remediation.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/remediation/templates/log-collector-job.yaml
- fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
Files with no reviewable changes (1)
- fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
Files skipped from review as they are similar to previous changes (4)
- fault-remediation/pkg/crstatus/crstatus_interface.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/annotation/annotation_interface.go
Additional context used
Path-based instructions (2)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/annotation/deprecated_annotation.go
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/crstatus/deprecated_checker.go
- fault-remediation/main.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/crstatus/checker.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/remediation/deprecated_remediation.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
Learnings (17)
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/remediation/deprecated_remediation.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
- fault-remediation/pkg/initializer/init.go
Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
fault-remediation/pkg/metrics/metrics.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
fault-remediation/pkg/events/health_event.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/reconciler/reconciler.go
Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
fault-remediation/pkg/events/health_event.go
fault-remediation/pkg/reconciler/reconciler.go
Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
fault-remediation/pkg/annotation/deprecated_annotation.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/remediation/deprecated_remediation_test.go
fault-remediation/pkg/remediation/remediation.go
fault-remediation/pkg/reconciler/reconciler.go
fault-remediation/pkg/remediation/deprecated_remediation.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
fault-remediation/pkg/crstatus/crstatus_test.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/remediation/deprecated_remediation_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
fault-remediation/pkg/remediation/deprecated_remediation.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
fault-remediation/pkg/crstatus/crstatus_test.go
fault-remediation/pkg/remediation/deprecated_remediation_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
fault-remediation/pkg/crstatus/crstatus_test.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/remediation/deprecated_remediation_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/remediation/deprecated_remediation.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
fault-remediation/main.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
fault-remediation/main.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code
Applied to files:
fault-remediation/main.go
Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.
Applied to files:
fault-remediation/main.go
fault-remediation/pkg/remediation/deprecated_remediation.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use separate informers for different Kubernetes resource types
Applied to files:
fault-remediation/pkg/remediation/deprecated_remediation.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
fault-remediation/pkg/remediation/deprecated_remediation.go
Code graph analysis (11)
fault-remediation/pkg/initializer/init.go (4)
  fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
    FaultRemediationClientInterface(29-34), TemplateData(37-44)
  fault-remediation/pkg/remediation/remediation.go (1)
    NewCtrlRuntimeRemediationClient(47-95)
  fault-remediation/pkg/config/config.go (2)
    Template(27-30), MaintenanceResource(18-24)
  commons/pkg/statemanager/statemanager.go (2)
    StateManager(197-200), NewStateManager(206-210)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
  fault-remediation/pkg/annotation/annotation_interface.go (3)
    RemediationStateAnnotation(24-26), AnnotationKey(12-12), EquivalenceGroupState(29-32)
fault-remediation/pkg/crstatus/crstatus_test.go (2)
  fault-remediation/pkg/config/config.go (1)
    MaintenanceResource(18-24)
  fault-remediation/pkg/crstatus/checker.go (1)
    NewCtrlRuntimeCRStatusChecker(34-44)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (8)
  health-monitors/kubernetes-object-monitor/pkg/cel/environment.go (1)
    Environment(30-35)
  health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
    NewManager(41-43)
  fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
    FaultRemediationClientInterface(29-34), TemplateData(37-44)
  fault-remediation/pkg/remediation/remediation.go (1)
    NewCtrlRuntimeRemediationClient(47-95)
  commons/pkg/statemanager/statemanager.go (2)
    StateManager(197-200), RemediatingLabelValue(171-171)
  fault-remediation/pkg/events/health_event.go (1)
    HealthEventDoc(5-8)
  fault-remediation/pkg/annotation/annotation_interface.go (1)
    AnnotationKey(12-12)
  fault-remediation/pkg/metrics/metrics.go (4)
    TotalEventsReceived(33-38), EventHandlingDuration(62-68), EventsProcessed(39-45), ProcessingErrors(46-52)
fault-remediation/pkg/crstatus/deprecated_checker.go (1)
  fault-remediation/pkg/config/config.go (1)
    MaintenanceResource(18-24)
fault-remediation/pkg/remediation/deprecated_remediation_test.go (4)
  fault-remediation/pkg/config/config.go (1)
    Template(27-30)
  fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
    TemplateData(37-44)
  fault-remediation/pkg/events/health_event.go (1)
    HealthEventData(11-14)
  fault-remediation/pkg/remediation/deprecated_remediation.go (1)
    FaultRemediationClient(62-74)
fault-remediation/pkg/annotation/annotation.go (1)
  fault-remediation/pkg/annotation/annotation_interface.go (3)
    RemediationStateAnnotation(24-26), AnnotationKey(12-12), EquivalenceGroupState(29-32)
fault-remediation/pkg/reconciler/reconciler_test.go (4)
  fault-remediation/pkg/events/health_event.go (2)
    HealthEventData(11-14), HealthEventDoc(5-8)
  fault-remediation/pkg/annotation/annotation_interface.go (3)
    NodeAnnotationManagerInterface(16-21), RemediationStateAnnotation(24-26), EquivalenceGroupState(29-32)
  fault-remediation/pkg/crstatus/crstatus_interface.go (1)
    CRStatusCheckerInterface(7-9)
  data-models/pkg/protos/health_event.pb.go (2)
    RecommendedAction_RESTART_BM(48-48), RecommendedAction_NONE(43-43)
fault-remediation/pkg/crstatus/checker.go (1)
  fault-remediation/pkg/config/config.go (1)
    MaintenanceResource(18-24)
fault-remediation/pkg/reconciler/reconciler.go (8)
  fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
    FaultRemediationClientInterface(29-34)
  fault-remediation/pkg/annotation/annotation_interface.go (1)
    NodeAnnotationManagerInterface(16-21)
  fault-remediation/pkg/metrics/metrics.go (7)
    EventHandlingDuration(62-68), TotalEventsReceived(33-38), TotalUnsupportedRemediationActions(53-59), ProcessingErrors(46-52), EventsProcessed(39-45), CRStatusSkipped(26-26), CRStatusCreated(25-25)
  commons/pkg/statemanager/statemanager.go (1)
    StateManager(197-200)
  data-models/pkg/protos/health_event.pb.go (1)
    RecommendedAction_NONE(43-43)
  fault-remediation/pkg/events/health_event.go (2)
    HealthEventDoc(5-8), HealthEventData(11-14)
  data-models/pkg/model/health_event_extentions.go (1)
    HealthEventWithStatus(53-57)
  store-client/pkg/client/convenience.go (1)
    UpdateHealthEventStatus(31-35)
fault-remediation/pkg/remediation/deprecated_remediation.go (6)
  fault-remediation/pkg/config/config.go (1)
    Template(27-30)
  fault-remediation/pkg/annotation/annotation_interface.go (1)
    NodeAnnotationManagerInterface(16-21)
  fault-remediation/pkg/crstatus/crstatus_interface.go (1)
    CRStatusCheckerInterface(7-9)
  fault-remediation/pkg/events/health_event.go (1)
    HealthEventData(11-14)
  fault-remediation/pkg/common/equivalence_groups.go (1)
    GetRemediationGroupForAction(35-45)
  fault-remediation/pkg/metrics/metrics.go (2)
    LogCollectorJobs(71-77), LogCollectorJobDuration(78-85)
YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
[error] 15-15: syntax error: expected , but found ''
(syntax)
Additional comments (29)
.gitignore (1)
143-143: Scope mismatch: .gitignore changes don't align with PR objectives.
This file's changes focus on consolidating IDE configuration patterns, which is unrelated to the PR's stated objectives around remediation retry behavior and error handling. The PR summary describes refactoring fault remediation with dual execution modes and controller-runtime integration, not .gitignore cleanup.
If this .gitignore consolidation is intentional as part of broader housekeeping, clarify this in the PR description. Alternatively, consider submitting it as a separate, focused PR to keep scopes distinct.
fault-remediation/pkg/metrics/metrics.go (1)
33-92: LGTM: Metric variable exports are well-structured.
The renamed metric variables follow Go naming conventions and properly integrate with the controller-runtime metrics registry. The metric definitions (names, help texts, labels) are preserved correctly.
fault-remediation/pkg/annotation/deprecated_annotation.go (3)
83-125: LGTM: GetRemediationState signature change properly implemented.
The updated method signature correctly returns `*corev1.Node` alongside the remediation state. All return paths (success and error) properly handle the additional return value, and error wrapping follows Go conventions with `%w`.
131-131: LGTM: Call sites correctly updated for new GetRemediationState signature.
Both `UpdateRemediationState` and `RemoveGroupFromState` properly handle the additional `*corev1.Node` return value by using `_` to discard it, which is appropriate since these methods don't require the node reference.
Also applies to: 186-186
45-80: LGTM: Robust retry logic with appropriate error classification.
The retry implementation correctly handles common transient Kubernetes API errors (conflicts, timeouts, rate limiting, service unavailability) with exponential backoff. Error wrapping follows Go conventions with `%w` to preserve error chains.
fault-remediation/pkg/initializer/init.go (3)
58-60: LGTM: Proper validation of ctrl-runtime client.
The validation correctly ensures that a ctrl-runtime client is provided when `UseCtrlRuntime` is enabled, preventing nil pointer errors in the ctrl-runtime initialization path.
148-157: LGTM: Reconciler configuration properly wired.
The reconciler configuration correctly integrates the remediation client abstraction, retry settings, and log collector flag. (Note: StateManager nil issue addressed in separate comment.)
152-153: StateManager will panic with nil clientSet in ctrl-runtime mode.
When `UseCtrlRuntime` is true, `clientSet` remains nil but is passed to `statemanager.NewStateManager(clientSet)`. StateManager methods directly call `clientSet.CoreV1().Nodes().Get()` without nil checks, causing a panic at runtime. The TODO acknowledges this but doesn't prevent the runtime error.
Either implement a ctrl-runtime version of StateManager or add nil checks to guard StateManager method calls in ctrl-runtime mode.
Likely an incorrect or invalid review comment.
fault-remediation/pkg/remediation/deprecated_remediation_test.go (1)
15-476: Test updates correctly reflect the refactored API.
The test changes appropriately adapt to the new public API surface, including:
- Field capitalizations (Clientset, KubeClient, RestMapper, DryRunMode, Template, TemplateData)
- Updated `CreateMaintenanceResource` signature returning `(string, error)` and accepting `events.HealthEventData`
- Updated `RunLogCollectorJob` signature including the `eventId` parameter and returning `(ctrl.Result, error)`
The test logic and assertions remain sound.
fault-remediation/pkg/remediation/remediation.go (2)
47-95: LGTM: Well-structured constructor with proper validation.
The constructor correctly:
- Validates template file existence before reading
- Handles template parsing errors
- Initializes dry-run mode appropriately
- Sets up annotation manager and status checker components
322-478: LGTM: Robust log collector status checking with metric guards.
The status checking implementation correctly:
- Handles complete, failed, and timeout states separately
- Uses annotation-based guards to prevent duplicate metric recording across reconciliations
- Configures timeout via environment variable with safe fallback
- Requeues appropriately when job is still running
fault-remediation/pkg/crstatus/checker.go (2)
28-44: LGTM - Clean refactoring to controller-runtime client.
The struct and constructor are well-structured with the controller-runtime client integration. The simplified field set improves maintainability.
71-85: LGTM - Condition checking logic is correct.
The `checkCondition` method properly handles missing status/conditions by returning `true` (allowing CR creation), and correctly delegates to `isTerminal` for status evaluation.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)
208-217: LGTM - Controller-runtime manager setup is correct.
The manager is properly configured with the test environment config, scheme, and disabled metrics server (BindAddress: "0"). The client is correctly obtained from the manager.
302-315: LGTM - Test remediation client creation properly uses new API.
The `createTestRemediationClient` correctly constructs `remediation.TemplateData` and uses `remediation.NewCtrlRuntimeRemediationClient` with the controller-runtime client.
fault-remediation/pkg/remediation/deprecated_remediation.go (2)
506-517: Verify intentional error suppression for log collector failures.
The code logs errors but returns `nil` for both timeout and job failure cases. Based on the past review discussion, this is intentional to allow reconciliation to continue. However, consider whether returning `ctrl.Result{Requeue: true}` might be more appropriate for transient failures.
Confirm that swallowing errors here aligns with the intended behavior discussed in past reviews: allowing remediation to proceed even if log collection fails.
62-74: LGTM - Struct fields properly exposed with consistent naming.
The `FaultRemediationClient` struct fields are well-organized with clear naming conventions. The `nodeExistsFunc` allows for test overrides.
fault-remediation/pkg/reconciler/reconciler.go (6)
203-211: Error propagation enables retry behavior as intended.
Returning the error from `UpdateNVSentinelStateNodeLabel` allows controller-runtime to retry the reconciliation. This aligns with the PR objective.
220-242: Good error aggregation pattern with `errors.Join`.
The code properly handles the case where CR creation fails by:
- Recording the error but continuing to update state
- Using `errors.Join` to combine errors when both operations fail
- Returning the CR creation error after state update
This ensures state is updated even on failure while still propagating errors for retry.
406-410: Error propagation from `GetRemediationState` enables retry.
Previously this may have silently continued; now it returns the error to trigger a retry. The underscore for the unused node return value is appropriate.
438-441: Error propagation from `RemoveGroupFromState` enables retry.
Returning `false, "", err` instead of just logging ensures the reconciler will retry on annotation update failures.
328-330: Verify log collector error handling aligns with intended behavior.
The code returns early if `err != nil` or `!result.IsZero()`. Per past review discussions, log collector failures should not block remediation. Confirm this matches the current implementation in `RunLogCollectorJob`.
Based on the `deprecated_remediation.go` changes, `RunLogCollectorJob` now returns a `nil` error on timeout/failure, so this early return would only trigger on actual API errors during job creation.
332-344: Good error handling with status propagation.
The pattern of:
- Attempting remediation
- Updating status regardless of outcome
- Joining errors if both fail
- Returning the original error
Ensures status is always updated while still triggering retries on failure.
fault-remediation/pkg/reconciler/reconciler_test.go (5)
40-66: LGTM - Mock implementations properly updated for new interfaces.
The `MockK8sClient` correctly implements the updated `FaultRemediationClientInterface` with the new return types for `CreateMaintenanceResource` and `RunLogCollectorJob`.
102-134: LGTM - MockNodeAnnotationManager implements the new interface correctly.
The mock properly returns the 3-tuple `(*RemediationStateAnnotation, *corev1.Node, error)` from `GetRemediationState`, with `nil` for the node, which is appropriate for unit tests.
383-390: LGTM - Test properly constructs HealthEventDoc from HealthEventData.
The conversion pattern from `events.HealthEventData` to `events.HealthEventDoc` is correctly applied in the test.
443-446: Test correctly expects error on remediation failure.
The test verifies that `performRemediation` returns an error when `CreateMaintenanceResource` fails, confirming the retry behavior.
489-491: Test correctly expects error on label update failure.
The test now expects an error when `UpdateNVSentinelStateNodeLabel` fails, validating the new retry-on-error behavior introduced by this PR.
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)
1-20: Template is missing required `force` field in spec.
The RebootNode CRD (janitor.dgxc.nvidia.com/v1alpha1) requires both `force` (boolean, defaults to false) and `nodeName` fields in the spec. Your template only includes `nodeName`, which will cause validation errors when creating RebootNode resources.
Add the missing field:
Required fix
apiVersion: {{.ApiGroup}}/{{.Version}}
kind: RebootNode
metadata:
  name: maintenance-{{.NodeName}}-{{.HealthEventID}}
spec:
  nodeName: {{.NodeName}}
  force: false
Also verify that the `ApiGroup` and `Version` placeholders are populated with `janitor.dgxc.nvidia.com` and `v1alpha1` respectively during template rendering.
Skipped due to learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
b05f1bf to
8788f69
Actionable comments posted: 13
Fix all issues with AI agents
In @commons/pkg/statemanager/statemanager_test.go:
- Around line 281-302: Remove the dead commented-out test block for
TestUpdateNVSentinelStateNodeLabelWithUpdateFailureCtrlRuntime: delete the
entire commented function (including references to ctrlRuntimeStateManager,
fakeClient, and the UpdateNVSentinelStateNodeLabel calls) from
statemanager_test.go; if the test work is intended later, open a tracking issue
referencing this test name and move any needed notes there instead of leaving
commented code.
In @commons/pkg/statemanager/statemanager.go:
- Around line 310-388: The
ctrlRuntimeStateManager.UpdateNVSentinelStateNodeLabel implementation lacks
retry-on-conflict handling; wrap the Get/Modify/Update sequence in a retry loop
(e.g., retry.OnError with errors.IsConflict) so transient optimistic-concurrency
conflicts are retried, introduce a local nodeModified bool to track whether the
label was actually changed/removed inside the retry closure, and ensure the
function returns that nodeModified and the final error from the retry call; keep
using validateStateTransition, manager.client.Get and manager.client.Update
inside the retry closure and return validationErr only after a successful
update.
In @fault-remediation/main.go:
- Around line 171-182: The retry loop around
components.FaultRemediationReconciler.Reconcile does not respect gCtx
cancellation because it uses time.Sleep; change the backoff to a context-aware
wait by replacing the
time.Sleep(components.FaultRemediationReconciler.Config.UpdateRetryDelay) with a
select that waits on
time.After(components.FaultRemediationReconciler.Config.UpdateRetryDelay) and on
gCtx.Done(), and if gCtx is cancelled return or break out of the loop; ensure
you check gCtx.Done() before each retry and abort retries when the context is
done so graceful shutdown on SIGTERM is honored.
In @fault-remediation/pkg/annotation/annotation.go:
- Around line 66-74: GetRemediationState can fail and currently the code
swallows the error and proceeds with an empty RemediationStateAnnotation while
using the invalid node value returned from the failed call later (e.g.,
node.DeepCopy()), which can cause incorrect behavior; change the handling in the
caller so that when m.GetRemediationState(ctx, nodeName) returns an error you
immediately return that error (or wrap and return it) instead of continuing with
an empty state, ensuring you do not call methods like node.DeepCopy() on the
invalid node; update the function containing this logic to propagate the error
from GetRemediationState rather than creating a placeholder
RemediationStateAnnotation.
In @fault-remediation/pkg/crstatus/checker.go:
- Around line 58-66: The ObjectKey used for c.client.Get is missing the
Namespace, so lookups for namespaced CRs fail; update the key construction (the
client.ObjectKey passed to c.client.Get where obj, gvk and crName are used) to
include the namespace from the MaintenanceResource (e.g., use the resource's
Namespace field or crNamespace variable) so the lookup uses both Name and
Namespace before calling c.client.Get; ensure the Namespace is set only when
non-empty for cluster-scoped resources.
In @fault-remediation/pkg/crstatus/crstatus_interface.go:
- Around line 1-9: The package crstatus lacks a package-level godoc comment; add
a brief package comment immediately above "package crstatus" describing the
package's purpose (e.g., utilities for checking/handling CR status), mention the
exported interface CRStatusCheckerInterface and its method
ShouldSkipCRCreation(context.Context, string) so docs are clear, keeping the
comment concise and in godoc style.
In @fault-remediation/pkg/crstatus/deprecated_checker.go:
- Around line 77-91: The checkCondition method in CRStatusChecker has inverted
skip logic: when status or conditions are missing (in the nested map/slice
checks in checkCondition) it currently returns true but should return false
(treat missing as non-terminal), and the final return should not negate
isTerminal; replace "return !c.isTerminal(conditionStatus)" with "return
c.isTerminal(conditionStatus)"; locate these in checkCondition (calls to
unstructured.NestedMap, unstructured.NestedSlice, findConditionStatus and
isTerminal) and invert those boolean returns accordingly.
In @fault-remediation/pkg/events/health_event.go:
- Around line 5-8: Add a godoc comment immediately above the exported type
HealthEventDoc describing its purpose and fields; mention that it represents a
HealthEvent document with an ID and inlined model.HealthEventWithStatus (so
readers understand the JSON tags and inline embedding). Ensure the comment
starts with "HealthEventDoc" and is a complete sentence per Go conventions.
- Around line 1-3: Add a package-level godoc comment above the package
declaration for package events in health_event.go that briefly documents the
package purpose and intended usage (e.g., what health events are represented and
how consumers should use this package). Ensure the comment is a full sentence
starting with "Package events ..." and sits immediately above the existing
"package events" line so godoc tools pick it up; update any existing top-of-file
comments if present to follow the "Package events ..." convention.
In @fault-remediation/pkg/initializer/init.go:
- Line 117: The log message "Successfully initialized k8s client" is misleading
when UseCtrlRuntime is true; update the logging in the client initialization
(where slog.Info is called) to check the UseCtrlRuntime flag and log a precise
message (e.g., "Successfully initialized ctrl-runtime remediation client" when
UseCtrlRuntime is true, otherwise "Successfully initialized k8s client"),
referencing the UseCtrlRuntime boolean and the existing slog.Info call to locate
the spot to change.
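One way to make the message follow the flag, sketched with the standard `log/slog` package. The helper `clientInitMessage` is hypothetical; only the two message strings come from the prompt above.

```go
package main

import "log/slog"

// clientInitMessage picks the log line that matches the client actually built.
func clientInitMessage(useCtrlRuntime bool) string {
	if useCtrlRuntime {
		return "Successfully initialized ctrl-runtime remediation client"
	}
	return "Successfully initialized k8s client"
}

func main() {
	useCtrlRuntime := true // stands in for the UseCtrlRuntime setting
	slog.Info(clientInitMessage(useCtrlRuntime))
}
```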
In @fault-remediation/pkg/remediation/deprecated_remediation.go:
- Around line 266-293: The code calls UpdateRemediationState twice: first using
crName and then again using actualCRName from createdCR.GetName(); remove the
first redundant update (the block that uses crName) so only the second
UpdateRemediationState call remains, keeping the group lookup via
common.GetRemediationGroupForAction(healthEvent.RecommendedAction) and the
c.AnnotationManager checks intact and returning errors as currently done in the
second block.
In @fault-remediation/pkg/remediation/remediation.go:
- Around line 183-192: Remove the redundant provisional annotation update:
delete the block that calls c.annotationManager.UpdateRemediationState(ctx,
healthEvent.NodeName, group, crName) (the update using the provisional crName)
and its error handling; keep the later authoritative update that uses
actualCRName retrieved from maintenance.GetName() so only the single, final
UpdateRemediationState call remains. Ensure any related variables (crName) are
still set if needed elsewhere, but do not perform the early
UpdateRemediationState call with the provisional name.
Nitpick comments (9)
fault-remediation/pkg/crstatus/crstatus_test.go (1)
120-212: Consider extracting shared test cases to reduce duplication.
The test logic is correct and comprehensive. However, the test cases are identical to `TestCheckCondition` (lines 26-118). Consider extracting the shared test cases into a variable or helper function to reduce duplication and improve maintainability.
Proposed refactor to share test cases
+// Shared test cases for both checker variants
+func getConditionTestCases() []struct {
+	name     string
+	cr       *unstructured.Unstructured
+	expected bool
+} {
+	return []struct {
+		name     string
+		cr       *unstructured.Unstructured
+		expected bool
+	}{
+		{
+			name: "no status returns skip - in progress",
+			cr: &unstructured.Unstructured{
+				Object: map[string]any{
+					"metadata": map[string]any{"name": "test-cr"},
+				},
+			},
+			expected: true,
+		},
+		// ... rest of test cases
+	}
+}
+
 func TestCheckCondition(t *testing.T) {
 	cfg := &config.MaintenanceResource{
 		CompleteConditionType: "Completed",
 	}
 	checker := NewCRStatusChecker(nil, nil, cfg, false)
-
-	tests := []struct {
-		name     string
-		cr       *unstructured.Unstructured
-		expected bool
-	}{
-		// ... test cases
-	}
+	tests := getConditionTestCases()
 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
@@ -113,93 +115,7 @@ func TestCheckConditionCtrlRuntime(t *testing.T) {
 	// ... same setup
 	checker := NewCtrlRuntimeCRStatusChecker(nil, cfg, false)
-
-	tests := []struct {
-		name     string
-		cr       *unstructured.Unstructured
-		expected bool
-	}{
-		// ... identical test cases
-	}
+	tests := getConditionTestCases()
 	for _, tt := range tests {
 		// ... same loop
commons/pkg/statemanager/statemanager_test.go (1)
17-32: Import ordering does not follow Go conventions.
Standard library imports should be grouped separately from third-party imports. The `k8s.io/apimachinery/pkg/types` import on line 20 is mixed with the standard library block.
Suggested fix
 import (
 	"context"
 	"fmt"
-	"k8s.io/apimachinery/pkg/types"
 	"testing"

 	"github.com/stretchr/testify/assert"
 	v1 "k8s.io/api/core/v1"
 	"k8s.io/apimachinery/pkg/api/errors"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/runtime"
 	"k8s.io/apimachinery/pkg/runtime/schema"
+	"k8s.io/apimachinery/pkg/types"
 	"k8s.io/client-go/kubernetes/fake"
 	ktesting "k8s.io/client-go/testing"
 	ctrlFake "sigs.k8s.io/controller-runtime/pkg/client/fake"
 )
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
15-15: Missing package-level godoc.
As per coding guidelines, package-level godoc is required for all Go packages.
Proposed fix
+// Package remediation provides interfaces and implementations for fault remediation
+// operations, including maintenance resource creation and log collection.
 package remediation
31-31: Parameter name should use Go acronym convention.
Go convention is to use `ID` (all caps) for acronyms, not `Id`.
Proposed fix
-	RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
+	RunLogCollectorJob(ctx context.Context, nodeName string, eventID string) (ctrl.Result, error)
commons/pkg/statemanager/statemanager.go (1)
144-156: Import ordering does not follow Go conventions.
Standard library and third-party imports are mixed. Group standard library imports together, separated from third-party imports.
Proposed fix
 import (
 	"context"
 	"fmt"
-	corev1 "k8s.io/api/core/v1"
-	"k8s.io/apimachinery/pkg/types"
 	"log/slog"
-	"sigs.k8s.io/controller-runtime/pkg/client"

+	corev1 "k8s.io/api/core/v1"
 	"k8s.io/apimachinery/pkg/api/errors"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/types"
 	"k8s.io/client-go/kubernetes"
 	"k8s.io/client-go/util/retry"
+	"sigs.k8s.io/controller-runtime/pkg/client"
 )
fault-remediation/pkg/annotation/annotation.go (1)
1-1: Missing package-level godoc.
As per coding guidelines, package-level godoc is required for all Go packages.
Proposed fix
+// Package annotation provides node annotation management for tracking remediation state
+// across fault remediation operations.
 package annotation
fault-remediation/main.go (1)
199-199: TODO comment should reference an issue.
As per coding guidelines, TODO comments in Go code should reference issues for tracking.
Proposed fix
-	//TODO: setup informers for node and job
+	//TODO(#issue_number): setup informers for node and job
fault-remediation/pkg/remediation/deprecated_remediation.go (1)
208-210: Add context when returning template execution error.
Per coding guidelines, wrap errors with context for better traceability.
Suggested enhancement
 if err = c.Template.Execute(&buf, c.TemplateData); err != nil {
 	slog.Error("Failed to execute maintenance Template", "error", err)
-	return "", err
+	return "", fmt.Errorf("failed to execute maintenance template: %w", err)
 }
fault-remediation/pkg/reconciler/reconciler.go (1)
203-211: Consider adding context when returning label update error.
Per coding guidelines, wrapping errors with context improves traceability.
Suggested enhancement
 _, err := r.Config.StateManager.UpdateNVSentinelStateNodeLabel(ctx, healthEventWithStatus.HealthEvent.NodeName, statemanager.RemediatingLabelValue, false)
 if err != nil {
 	slog.Error("Error updating node label to remediating", "error", err)
 	metrics.ProcessingErrors.WithLabelValues("label_update_error", nodeName).Inc()
-	return "", err
+	return "", fmt.Errorf("failed to update node label to remediating: %w", err)
 }
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files ignored due to path filters (1)
- `commons/go.sum` is excluded by `!**/*.sum`
Files selected for processing (26)
- .gitignore
- commons/go.mod
- commons/pkg/statemanager/statemanager.go
- commons/pkg/statemanager/statemanager_test.go
- fault-remediation/main.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/annotation/deprecated_annotation.go
- fault-remediation/pkg/crstatus/checker.go
- fault-remediation/pkg/crstatus/crstatus_interface.go
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/crstatus/deprecated_checker.go
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/remediation/deprecated_remediation.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/remediation/templates/log-collector-job.yaml
- fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
Files with no reviewable changes (1)
- fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
Files skipped from review as they are similar to previous changes (4)
- fault-remediation/pkg/remediation/remediation_test.go
- .gitignore
- fault-remediation/pkg/remediation/templates/log-collector-job.yaml
- fault-remediation/pkg/annotation/annotation_interface.go
Additional context used
Path-based instructions (3)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/crstatus/crstatus_test.go
- commons/pkg/statemanager/statemanager_test.go
- fault-remediation/pkg/crstatus/crstatus_interface.go
- fault-remediation/main.go
- commons/pkg/statemanager/statemanager.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/crstatus/deprecated_checker.go
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/annotation/deprecated_annotation.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
- fault-remediation/pkg/remediation/deprecated_remediation.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/crstatus/checker.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
- fault-remediation/pkg/crstatus/crstatus_test.go
- commons/pkg/statemanager/statemanager_test.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/go.mod
CodeRabbit inference engine (.github/copilot-instructions.md)
Use `go.mod` for each service as a separate Go module with semantic import versioning
Files:
- commons/go.mod
Learnings (25)
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/reconciler/reconciler.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages
Applied to files:
- fault-remediation/pkg/events/health_event.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/metrics/metrics.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
- fault-remediation/pkg/crstatus/crstatus_test.go
- commons/pkg/statemanager/statemanager_test.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
- fault-remediation/pkg/remediation/deprecated_remediation.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
- fault-remediation/pkg/crstatus/crstatus_test.go
- commons/pkg/statemanager/statemanager_test.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2026-01-06T21:31:36.113Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:36.113Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.
Applied to files:
- commons/go.mod
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
- commons/go.mod
- fault-remediation/pkg/initializer/init.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
- commons/go.mod
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
- commons/go.mod
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
- commons/pkg/statemanager/statemanager_test.go
- commons/pkg/statemanager/statemanager.go
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/remediation/deprecated_remediation.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
- commons/pkg/statemanager/statemanager_test.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
- fault-remediation/pkg/remediation/deprecated_remediation.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
- fault-remediation/main.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
- fault-remediation/main.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code
Applied to files:
- fault-remediation/main.go
Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.
Applied to files:
- fault-remediation/main.go
- fault-remediation/pkg/remediation/deprecated_remediation.go
Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code
Applied to files:
- fault-remediation/pkg/metrics/metrics.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code
Applied to files:
- fault-remediation/pkg/metrics/metrics.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use separate informers for different Kubernetes resource types
Applied to files:
- fault-remediation/pkg/remediation/deprecated_remediation.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
- fault-remediation/pkg/remediation/deprecated_remediation.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
- fault-remediation/pkg/remediation/deprecated_remediation.go
Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
- fault-remediation/pkg/reconciler/reconciler.go
Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Code graph analysis (10)
fault-remediation/pkg/crstatus/crstatus_test.go (2)
- fault-remediation/pkg/config/config.go (1): MaintenanceResource (18-24)
- fault-remediation/pkg/crstatus/checker.go (1): NewCtrlRuntimeCRStatusChecker (34-44)
commons/pkg/statemanager/statemanager_test.go (1)
- commons/pkg/statemanager/statemanager.go (9): QuarantinedLabelValue (166-166), NVSentinelStateLabelKey (159-159), DrainingLabelValue (169-169), NVSentinelStateLabelValue (162-162), DrainSucceededLabelValue (170-170), DrainFailedLabelValue (171-171), RemediatingLabelValue (174-174), RemediationSucceededLabelValue (175-175), RemediationFailedLabelValue (176-176)
fault-remediation/main.go (3)
- fault-remediation/pkg/reconciler/reconciler.go (1): FaultRemediationReconciler (61-69)
- commons/pkg/auditlogger/roundtripper.go (1): NewAuditingRoundTripper (42-47)
- health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1): NewManager (41-43)
fault-remediation/pkg/crstatus/deprecated_checker.go (1)
- fault-remediation/pkg/config/config.go (1): MaintenanceResource (18-24)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
- fault-remediation/pkg/annotation/annotation_interface.go (3): RemediationStateAnnotation (24-26), AnnotationKey (12-12), EquivalenceGroupState (29-32)
fault-remediation/pkg/remediation/deprecated_remediation_test.go (4)
- fault-remediation/pkg/config/config.go (1): Template (27-30)
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1): TemplateData (37-44)
- fault-remediation/pkg/events/health_event.go (1): HealthEventData (11-14)
- fault-remediation/pkg/remediation/deprecated_remediation.go (1): FaultRemediationClient (62-74)
fault-remediation/pkg/annotation/annotation.go (1)
- fault-remediation/pkg/annotation/annotation_interface.go (3): RemediationStateAnnotation (24-26), AnnotationKey (12-12), EquivalenceGroupState (29-32)
fault-remediation/pkg/reconciler/reconciler_test.go (4)
- fault-remediation/pkg/events/health_event.go (2): HealthEventData (11-14), HealthEventDoc (5-8)
- fault-remediation/pkg/annotation/annotation_interface.go (3): NodeAnnotationManagerInterface (16-21), RemediationStateAnnotation (24-26), EquivalenceGroupState (29-32)
- fault-remediation/pkg/crstatus/crstatus_interface.go (1): CRStatusCheckerInterface (7-9)
- data-models/pkg/protos/health_event.pb.go (2): RecommendedAction_RESTART_BM (48-48), RecommendedAction_NONE (43-43)
fault-remediation/pkg/reconciler/reconciler.go (8)
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1): FaultRemediationClientInterface (29-34)
- fault-remediation/pkg/annotation/annotation_interface.go (1): NodeAnnotationManagerInterface (16-21)
- fault-remediation/pkg/metrics/metrics.go (5): EventHandlingDuration (62-68), TotalEventsReceived (33-38), TotalUnsupportedRemediationActions (53-59), ProcessingErrors (46-52), EventsProcessed (39-45)
- commons/pkg/statemanager/statemanager.go (1): StateManager (200-203)
- data-models/pkg/protos/health_event.pb.go (1): RecommendedAction_NONE (43-43)
- fault-remediation/pkg/events/health_event.go (2): HealthEventDoc (5-8), HealthEventData (11-14)
- data-models/pkg/model/health_event_extentions.go (1): HealthEventWithStatus (53-57)
- store-client/pkg/storewatcher/watch_store.go (1): ChangeStreamWatcher (43-57)
fault-remediation/pkg/crstatus/checker.go (1)
- fault-remediation/pkg/config/config.go (1): MaintenanceResource (18-24)
YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
[error] 15-15: syntax error: expected , but found ''
(syntax)
Additional comments (37)
commons/go.mod (1)
26-26: Dual json-patch versions are expected transitive dependencies and require no action.
Both `github.com/evanphx/json-patch/v5 v5.9.11` and `gopkg.in/evanphx/json-patch.v4 v4.13.0` are indirect (transitive) dependencies pulled in by different upstream packages. The presence of both versions is normal in Go module management when different dependencies require different versions. Neither is directly used in the commons module code, and consolidation is not possible without changing upstream dependencies. The v5.9.11 version is stable and actively maintained.
Likely an incorrect or invalid review comment.
fault-remediation/pkg/crstatus/checker.go (3)
28-44: LGTM!
The type definition and constructor are well-structured, correctly using the controller-runtime client and properly initializing all fields.
71-85: LGTM!
The condition checking logic correctly handles missing status/conditions and delegates to helper methods for status evaluation.
87-106: LGTM!
The helper methods correctly implement condition status lookup and terminal state checking with appropriate type assertions and fallback behavior.
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)
1-20: LGTM! Static analysis error is a false positive.
The template structure is correct. The YAMLlint syntax error on line 15 is a false positive: the file uses Go template placeholders (e.g., `{{.ApiGroup}}`) that will be rendered before being parsed as YAML.
fault-remediation/pkg/events/health_event.go (1)
10-14: LGTM!
The `HealthEventData` type is properly documented and the struct definition is correct with appropriate BSON tags for MongoDB compatibility.
commons/pkg/statemanager/statemanager_test.go (5)
269-279: LGTM!
Good test coverage for the ctrl-runtime path's Get failure scenario, using the fake client without pre-created objects to simulate a node not found error.
304-330: LGTM!
Comprehensive test for adding a label via the ctrl-runtime path with proper verification of the final node state.
332-361: LGTM!
Proper test for label removal with verification that the label no longer exists on the node.
449-495: LGTM!
Good table-driven test pattern for verifying label removal from all possible states works without validation errors in the ctrl-runtime path.
497-572: LGTM!
Comprehensive state transition test covering both valid and invalid transitions, with proper verification that labels are set even for unexpected transitions.
fault-remediation/pkg/initializer/init.go (3)
51-60: LGTM!
Good defensive validation ensuring the ctrl-runtime client is provided when `UseCtrlRuntime` is true. This prevents nil pointer panics downstream.
83-115: LGTM!
The dual-mode initialization properly separates the ctrl-runtime and k8s client paths, fixing the previous issue where `clientSet` could be nil in ctrl-runtime mode. Each path now correctly initializes its required dependencies.
151-160: LGTM!
The reconciler configuration now correctly receives the abstracted `RemediationClient` and `StateManager`, enabling both initialization paths to work with the same reconciler interface.
fault-remediation/pkg/metrics/metrics.go (2)
15-27: LGTM!
Clean package reorganization with appropriate exported constants for CR status tracking.
33-92: LGTM!
Metrics properly exported and registered with the controller-runtime metrics registry. The naming follows Prometheus conventions with the `fault_remediation_` prefix.
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
29-44: LGTM on interface design.
The interface provides a clean abstraction over remediation operations with appropriate accessors for annotation management and status checking. The `TemplateData` struct properly embeds `config.MaintenanceResource` for template rendering.
commons/pkg/statemanager/statemanager.go (1)
298-306: LGTM on struct and constructor.
Clean implementation following the same pattern as `NewStateManager`.
fault-remediation/pkg/annotation/annotation.go (2)
107-132: LGTM!
Clean implementation for clearing the remediation state annotation with proper nil-check and patch-based update.
134-169: LGTM!
Well-structured logic to remove a specific group, with automatic cleanup when no groups remain.
fault-remediation/main.go (3)
194-197: LGTM!
Good integration of the auditing round-tripper wrapper for request auditing in the ctrl-runtime path.
228-244: LGTM!
Proper initialization flow with deferred cleanup for the datastore components in the ctrl-runtime path.
122-136: LGTM!
Clean separation of the non-ctrl-runtime initialization with proper deferred cleanup.
fault-remediation/pkg/annotation/deprecated_annotation.go (2)
46-59: LGTM!
The retry logic with `retry.OnError` is correctly implemented, and wrapping errors with `%w` preserves the error chain for retry detection.
83-124: LGTM!
The signature update to return the Node object alongside the state is correctly implemented and aligns with the new interface definition. This enables owner-reference-based operations in remediation flows.
fault-remediation/pkg/remediation/remediation.go (1)
375-377: LGTM: Error handling aligns with PR objectives.
The Update calls at lines 375-377, 413-416, and 465-468 correctly return errors without wrapping, which will trigger retries in the reconciliation loop as intended by the PR title "retry on errors and throw errors to trigger retries."
Also applies to: 413-416, 465-468
fault-remediation/pkg/remediation/deprecated_remediation_test.go (1)
15-418: LGTM!
The test updates correctly reflect the refactoring from private to exported fields in the FaultRemediationClient struct, and the usage of the new events.HealthEventData type.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)
1-1380: LGTM!
The e2e test suite is well-structured and follows coding guidelines:
- Uses `envtest` for testing Kubernetes controllers as recommended
- Correctly handles the updated `GetRemediationState` signature throughout
- Comprehensive test coverage for various reconciliation scenarios
- Proper metrics validation
fault-remediation/pkg/remediation/deprecated_remediation.go (2)
365-504: LGTM: Comprehensive metrics instrumentation.
The metrics instrumentation throughout `RunLogCollectorJob` provides good observability with appropriate labels (error types, node names, job outcomes).
254-264: LGTM: Proper AlreadyExists error handling.
The extraction of AlreadyExists handling into `handleCRCreateAlreadyExists` improves code organization and ensures consistent annotation updates when a CR already exists.
fault-remediation/pkg/reconciler/reconciler.go (4)
407-441: Verify error handling strategy is intentional.
The function returns different `shouldCreateCR` values depending on the error type:
- Line 409: Returns `true` (allow creation) when `GetRemediationState` fails
- Line 440: Returns `false` (prevent creation) when `RemoveGroupFromState` fails
This appears intentional (fail-open for read errors, fail-closed for write errors), but could benefit from inline comments explaining the reasoning.
170-193: LGTM: runLogCollector signature updated for controller-runtime integration.
The signature changes enable proper requeue handling via `ctrl.Result` and improve log collector job labeling with `eventUID`.
328-344: LGTM: Proper error aggregation with errors.Join.
The error handling ensures status updates are always attempted even when remediation fails, and properly aggregates multiple errors using `errors.Join` for comprehensive error reporting.
101-352: LGTM: Comprehensive metrics instrumentation.
Metrics are consistently recorded throughout the reconciliation flow with appropriate labels for error types, node names, and status values, providing good observability.
fault-remediation/pkg/reconciler/reconciler_test.go (3)
40-66: LGTM: Mock interfaces updated to match new signatures.
The mock implementations correctly reflect the updated interface methods with `events.HealthEventData`, `ctrl.Result` return types, and new annotation/status checker interfaces.
102-134: LGTM: Mock annotation manager correctly implements new interface.
The mock properly returns the expanded 3-tuple from `GetRemediationState` and uses the correct types from the `annotation` package.
190-1009: LGTM: Test cases comprehensively updated for new interfaces.
All test cases correctly use `events.HealthEventData` and `events.HealthEventDoc` types, mock the new return signatures, and validate the updated error handling and return patterns.
fault-remediation/main.go
Outdated
for i := 1; i <= components.FaultRemediationReconciler.Config.UpdateMaxRetries; i++ {
	_, err = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
	if err == nil {
		break
	}

	slog.Error("Error processing event", "event", event, "error", err)

	if i < components.FaultRemediationReconciler.Config.UpdateMaxRetries {
		time.Sleep(components.FaultRemediationReconciler.Config.UpdateRetryDelay)
	}
}
Retry loop does not respect context cancellation.
The retry loop will continue even if the context is cancelled (e.g., on SIGTERM). Replace time.Sleep with a context-aware wait to allow graceful shutdown during retries.
Proposed fix
for i := 1; i <= components.FaultRemediationReconciler.Config.UpdateMaxRetries; i++ {
+ select {
+ case <-gCtx.Done():
+ slog.Info("Context cancelled, stopping retry loop")
+ return gCtx.Err()
+ default:
+ }
+
_, err = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
if err == nil {
break
}
slog.Error("Error processing event", "event", event, "error", err)
if i < components.FaultRemediationReconciler.Config.UpdateMaxRetries {
- time.Sleep(components.FaultRemediationReconciler.Config.UpdateRetryDelay)
+ select {
+ case <-time.After(components.FaultRemediationReconciler.Config.UpdateRetryDelay):
+ case <-gCtx.Done():
+ slog.Info("Context cancelled during retry delay")
+ return gCtx.Err()
+ }
}
 }
Prompt for AI Agents
In @fault-remediation/main.go around lines 171 - 182, The retry loop around
components.FaultRemediationReconciler.Reconcile does not respect gCtx
cancellation because it uses time.Sleep; change the backoff to a context-aware
wait by replacing the
time.Sleep(components.FaultRemediationReconciler.Config.UpdateRetryDelay) with a
select that waits on
time.After(components.FaultRemediationReconciler.Config.UpdateRetryDelay) and on
gCtx.Done(), and if gCtx is cancelled return or break out of the loop; ensure
you check gCtx.Done() before each retry and abort retries when the context is
done so graceful shutdown on SIGTERM is honored.
8788f69 to
1b3ff7f
Compare
Actionable comments posted: 6
Caution
Some comments are outside the diff and can't be posted inline due to platform limitations.
Outside diff range comments (2)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
45-60: Return unwrapped errors in retry blocks to preserve retry behavior.
Line 55 wraps the error inside a `retry.OnError` block. Per coding guidelines, errors should be returned without wrapping within retry blocks to preserve retry behavior, as wrapping can interfere with error type checking.
Proposed fix
 func (m *NodeAnnotationManager) patchNodeWithRetry(ctx context.Context, nodeName string, patch []byte) error {
-    return retry.OnError(retry.DefaultRetry, isRetryableError, func() error {
+    err := retry.OnError(retry.DefaultRetry, isRetryableError, func() error {
         _, err := m.kubeClient.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{})
         if err != nil && isRetryableError(err) {
             slog.Warn("Retryable error patching node annotation. Retrying...", "node", nodeName, "error", err)
         }
-        if err != nil {
-            return fmt.Errorf("failed to patch node %s: %w", nodeName, err)
-        }
-
-        return nil
+        return err
     })
+    if err != nil {
+        return fmt.Errorf("failed to patch node %s: %w", nodeName, err)
+    }
+    return nil
 }
Based on coding guidelines.
fault-remediation/pkg/remediation/deprecated_remediation_test.go (1)
225-235: Correct template filename capitalization to match actual file.
The code references `"rebootnode-Template.yaml"` (capital T), but the actual template file is `rebootnode-template.yaml` (lowercase). Update line 227 to use the correct lowercase filename.
Fix all issues with AI agents
In @fault-remediation/pkg/annotation/annotation.go:
- Around line 66-73: The code after the GetRemediationState error return is
dead: remove the unreachable initialization of state (the
RemediationStateAnnotation with EquivalenceGroups) that appears after "return
err"; if your intent was to recover and use a default state instead of returning
an error, assign state = &RemediationStateAnnotation{EquivalenceGroups:
make(map[string]EquivalenceGroupState)} before returning and change the "return
err" to "return nil" (and update callers accordingly); reference symbols:
m.GetRemediationState, state, RemediationStateAnnotation, EquivalenceGroups,
EquivalenceGroupState.
In @fault-remediation/pkg/crstatus/deprecated_checker.go:
- Around line 77-91: In CRStatusChecker.checkCondition, the skip/allow logic is
inverted: when status or conditions are missing you should treat the resource as
non-terminal (do not skip) and when a conditionStatus is terminal you should
skip. Change the two early returns that currently return true to return false
(for the status/conditions-not-found or error cases), and change the final
return from !c.isTerminal(conditionStatus) to c.isTerminal(conditionStatus);
this uses the existing methods findConditionStatus and isTerminal to determine
terminality correctly.
In @fault-remediation/pkg/reconciler/reconciler_test.go:
- Around line 48-50: The local interface named CRStatusCheckerInterface is
unused and duplicates crstatus.CRStatusCheckerInterface used by the mock; remove
the local type definition of CRStatusCheckerInterface from the test file and
ensure all references (e.g., the mock that currently expects
crstatus.CRStatusCheckerInterface) use the package-scoped
crstatus.CRStatusCheckerInterface, updating imports if necessary so the test
only relies on crstatus.CRStatusCheckerInterface.
In @fault-remediation/pkg/remediation/deprecated_remediation.go:
- Around line 266-293: The code performs two redundant annotation updates: first
calling c.AnnotationManager.UpdateRemediationState with crName, then again with
actualCRName after obtaining actualCRName := createdCR.GetName(); remove the
first update block (the if that checks group != "" and calls
UpdateRemediationState with crName) and the redundant reassignment of group
before the second update, leaving only the single UpdateRemediationState call
that uses actualCRName (keep references to common.GetRemediationGroupForAction,
createdCR.GetName, and c.AnnotationManager.UpdateRemediationState) so the node
annotation is set once with the real CR name and the function still returns
actualCRName, nil.
In @fault-remediation/pkg/remediation/remediation_test.go:
- Around line 56-58: The test calls NewCtrlRuntimeRemediationClient with a
TemplateData that sets TemplateFileName to "rebootnode-Template.yaml" which
mismatches the actual file name; change the TemplateData.TemplateFileName value
to "rebootnode-template.yaml" (lowercase "template") so the test loads the
correct template file.
In @fault-remediation/pkg/remediation/remediation.go:
- Around line 356-376: The nil-check logic incorrectly skips metrics when
job.Annotations is nil; change the guard from "job.Annotations != nil &&
job.Annotations[jobMetricsAlreadyCountedAnnotation] != trueStringVal" to
"job.Annotations == nil || job.Annotations[jobMetricsAlreadyCountedAnnotation]
!= trueStringVal" so freshly-created jobs without annotations still enter the
block (you can keep the existing updateJob.Annotations initialization). Apply
this same fix in the analogous checks inside checkLogCollectorFailed and
checkLogCollectorTimedOut so they use the same "annotations == nil ||
annotations[key] != trueStringVal" condition.
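The nil-annotations guard from the last item can be checked in isolation. In this sketch `countedKey`, `buggyShouldRecord`, and `fixedShouldRecord` are hypothetical names, not identifiers from `remediation.go`; the only language fact relied on is that indexing a nil map in Go is safe and yields the zero value.

```go
package main

import "fmt"

// countedKey is a hypothetical annotation key used only for this sketch.
const countedKey = "metrics-already-counted"

// buggyShouldRecord mirrors the flagged guard: a nil map short-circuits to
// false, so a job that has no annotations yet never records its metrics.
func buggyShouldRecord(annotations map[string]string) bool {
	return annotations != nil && annotations[countedKey] != "true"
}

// fixedShouldRecord mirrors the suggested guard: a nil map counts as
// "not yet recorded".
func fixedShouldRecord(annotations map[string]string) bool {
	return annotations == nil || annotations[countedKey] != "true"
}

func main() {
	var fresh map[string]string // nil, like a newly created job
	counted := map[string]string{countedKey: "true"}

	fmt.Println(buggyShouldRecord(fresh), fixedShouldRecord(fresh))
	fmt.Println(buggyShouldRecord(counted), fixedShouldRecord(counted))
}
```

Since reading from a nil map returns the empty string, the comparison `annotations[countedKey] != "true"` on its own would behave the same as the fixed guard; the explicit nil check mainly documents intent.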
Nitpick comments (21)
commons/pkg/statemanager/statemanager_test.go (1)
508-508: Use consistent context creation.
This test uses `context.TODO()` while all other test functions in this file use `context.Background()` (lines 270, 282, 310, 341, 370, 399, 441). Maintain consistency across the test suite.
Proposed fix
- ctx := context.TODO()
+ ctx := context.Background()
commons/pkg/statemanager/statemanager.go (1)
387-387: Prefer explicit nil return for clarity.
At this point in the code, `err` is guaranteed to be `nil` (any non-nil error would have triggered an early return at line 374). For clarity and explicitness, consider returning `true, nil` directly.
Proposed fix
- return true, err
+ return true, nil
fault-remediation/pkg/remediation/deprecated_remediation.go (1)
506-517: Consider returning errors instead of swallowing them for proper retry handling.
Lines 506-517 swallow errors from log collector job timeout and failure by returning `ctrl.Result{}, nil`. While the comments indicate this is intentional to allow reconciliation to continue, completely swallowing errors prevents proper retry/backoff behavior and can mask issues.
Consider returning the error with a requeue request:
return ctrl.Result{Requeue: true, RequeueAfter: 5*time.Minute}, fmt.Errorf(...)
This allows the controller-runtime to apply exponential backoff and proper error tracking while still continuing reconciliation.
fault-remediation/pkg/crstatus/crstatus_test.go (1)
120-212: Eliminate test duplication with a parameterized helper.
The entire test function `TestCheckConditionCtrlRuntime` duplicates `TestCheckCondition` with only the checker initialization differing. This violates DRY and creates a maintenance burden.
Refactor to use table-driven test with multiple checkers
-func TestCheckConditionCtrlRuntime(t *testing.T) {
-    cfg := &config.MaintenanceResource{
-        CompleteConditionType: "Completed",
-    }
-    checker := NewCtrlRuntimeCRStatusChecker(nil, cfg, false)
-
-    tests := []struct {
-        name     string
-        cr       *unstructured.Unstructured
-        expected bool
-    }{
-        // ... duplicate test cases ...
-    }
-
-    for _, tt := range tests {
-        t.Run(tt.name, func(t *testing.T) {
-            result := checker.checkCondition(tt.cr)
-            assert.Equal(t, tt.expected, result)
-        })
-    }
-}
+func TestCheckCondition(t *testing.T) {
+    cfg := &config.MaintenanceResource{
+        CompleteConditionType: "Completed",
+    }
+
+    checkers := []struct {
+        name    string
+        checker interface{ checkCondition(*unstructured.Unstructured) bool }
+    }{
+        {
+            name:    "K8s",
+            checker: NewCRStatusChecker(nil, nil, cfg, false),
+        },
+        {
+            name:    "CtrlRuntime",
+            checker: NewCtrlRuntimeCRStatusChecker(nil, cfg, false),
+        },
+    }
+
+    tests := []struct {
+        name     string
+        cr       *unstructured.Unstructured
+        expected bool
+    }{
+        // ... single set of test cases ...
+    }
+
+    for _, c := range checkers {
+        t.Run(c.name, func(t *testing.T) {
+            for _, tt := range tests {
+                t.Run(tt.name, func(t *testing.T) {
+                    result := c.checker.checkCondition(tt.cr)
+                    assert.Equal(t, tt.expected, result)
+                })
+            }
+        })
+    }
+}
Based on coding guidelines, as per the learnings for table-driven tests when testing multiple scenarios in Go.
fault-remediation/pkg/remediation/remediation_test.go (1)
114-126: Make test name more descriptive.
The test name "dry run" is less descriptive than recommended. Consider following the pattern from the past review comment suggestion.
Suggested improvement
 {
-    name: "dry run",
+    name: "Successful rebootnode creation - dry run",
     nodeName: "test-node-1",
     dryRun: true,
     recommendedAction: protos.RecommendedAction_RESTART_BM,
Based on coding guidelines, as per the naming format `TestFunctionName_Scenario_ExpectedBehavior` for Go tests.
fault-remediation/pkg/initializer/init.go (2)
89-115: Extract TemplateData construction to reduce duplication.
The `TemplateData` construction (lines 92-96 and 105-109) is duplicated in both initialization branches. This creates a maintenance burden if the template configuration structure changes.
Extract template data construction
+    templateData := remediation.TemplateData{
+        TemplateMountPath:   tomlConfig.Template.MountPath,
+        TemplateFileName:    tomlConfig.Template.FileName,
+        MaintenanceResource: tomlConfig.MaintenanceResource,
+    }
+
     if params.UseCtrlRuntime {
         remediationClient, err = remediation.NewCtrlRuntimeRemediationClient(
             ctrlruntimeClient,
-            params.DryRun, remediation.TemplateData{
-                TemplateMountPath:   tomlConfig.Template.MountPath,
-                TemplateFileName:    tomlConfig.Template.FileName,
-                MaintenanceResource: tomlConfig.MaintenanceResource,
-            })
+            params.DryRun,
+            templateData)
         if err != nil {
             return nil, fmt.Errorf("error while initializing ctrl runtime client: %w", err)
         }
         stateManager = statemanager.NewCtrlRuntimeStateManager(ctrlruntimeClient)
     } else {
         remediationClient, clientSet, err = remediation.NewK8sClient(
             params.KubeconfigPath,
             params.DryRun,
-            remediation.TemplateData{
-                TemplateMountPath:   tomlConfig.Template.MountPath,
-                TemplateFileName:    tomlConfig.Template.FileName,
-                MaintenanceResource: tomlConfig.MaintenanceResource,
-            },
+            templateData,
         )
117-117: Make log message more informative about initialization mode.
The generic log message doesn't indicate which client type was initialized, making it harder to debug initialization issues.
Proposed improvement
-    slog.Info("Successfully initialized client")
+    if params.UseCtrlRuntime {
+        slog.Info("Successfully initialized ctrl-runtime remediation client")
+    } else {
+        slog.Info("Successfully initialized Kubernetes remediation client")
+    }
fault-remediation/pkg/remediation/remediation.go (1)
357-367: Error from annotation update causes early return without recording metrics.
If `c.client.Update(ctx, updateJob)` fails at line 364, the function returns `false, err` without recording the success metric. This means a successful job completion may not be tracked if the annotation update fails. Consider recording metrics before attempting the annotation update.
Proposed refactor
 if completeCondition != nil && completeCondition.Status == metav1.ConditionTrue {
     slog.Info("Log collector job completed successfully", "job", job.Name)
-    // Use job's actual duration instead of custom tracking
-    // reconciliation can be called multiple times so use annotation to make sure we're not duplicate recording metrics
-    if job.Annotations != nil && job.Annotations[jobMetricsAlreadyCountedAnnotation] != trueStringVal {
+    if job.Annotations == nil || job.Annotations[jobMetricsAlreadyCountedAnnotation] != trueStringVal {
+        duration := job.Status.CompletionTime.Sub(job.Status.StartTime.Time).Seconds()
+        metrics.LogCollectorJobs.WithLabelValues(nodeName, "success").Inc()
+        metrics.LogCollectorJobDuration.WithLabelValues(nodeName, "success").Observe(duration)
+
         updateJob := job.DeepCopy()
         if updateJob.Annotations == nil {
             updateJob.Annotations = map[string]string{}
         }
-
         updateJob.Annotations[jobMetricsAlreadyCountedAnnotation] = trueStringVal
-
-        err := c.client.Update(ctx, updateJob)
-        if err != nil {
-            return false, err
+        if err := c.client.Update(ctx, updateJob); err != nil {
+            slog.Warn("Failed to mark job metrics as recorded", "job", job.Name, "error", err)
+            // Continue - metrics already recorded
         }
-
-        duration := job.Status.CompletionTime.Sub(job.Status.StartTime.Time).Seconds()
-
-        metrics.LogCollectorJobs.WithLabelValues(nodeName, "success").Inc()
-        metrics.LogCollectorJobDuration.WithLabelValues(nodeName, "success").Observe(duration)
     }
-
     return true, nil
 }
fault-remediation/pkg/remediation/deprecated_remediation_test.go (1)
17-40: Import grouping does not follow Go conventions.
The `events` import at line 19 is placed between standard library imports and third-party imports. Go convention is to group imports: standard library, then external packages, then internal packages.
Proposed fix
 import (
     "context"
-    "github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
     "testing"
     "text/template"

     "github.com/google/uuid"
     "github.com/stretchr/testify/assert"
     corev1 "k8s.io/api/core/v1"
     metameta "k8s.io/apimachinery/pkg/api/meta"
     metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
     "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
     "k8s.io/apimachinery/pkg/runtime/schema"
     "k8s.io/client-go/discovery"
     "k8s.io/client-go/discovery/cached/memory"
     "k8s.io/client-go/dynamic"
     "k8s.io/client-go/kubernetes/fake"
     "k8s.io/client-go/rest"
     "k8s.io/client-go/restmapper"

     "github.com/nvidia/nvsentinel/data-models/pkg/model"
     "github.com/nvidia/nvsentinel/data-models/pkg/protos"
     "github.com/nvidia/nvsentinel/fault-remediation/pkg/config"
+    "github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
 )
fault-remediation/pkg/reconciler/reconciler_test.go (1)
17-38: Import grouping does not follow Go conventions.
Imports are mixed: `errors` and `testing`/`time` are separated by internal package imports. Group standard library imports together, followed by external packages, then internal packages.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)
17-54: Import grouping does not follow Go conventions.
Imports are scattered with blank lines in unexpected places (lines 19, 23). Standard library, external, and internal packages should be in separate groups.
168-179: Consider consolidating package-level test variables.
Multiple package-level variables are declared across lines 168-179. Consider grouping related variables or using a test fixture struct to improve readability and make test setup/teardown clearer.
fault-remediation/pkg/reconciler/reconciler.go (9)
183-190: Consider error wrapping for context.
The error from `RunLogCollectorJob` is returned without additional context. As per coding guidelines, errors should be wrapped using `fmt.Errorf("context: %w", err)` to provide clarity about where the failure occurred.
Proposed error wrapping
 result, err := r.Config.RemediationClient.RunLogCollectorJob(ctx, healthEvent.NodeName, eventUID)
 if err != nil {
     slog.Error("Log collector job failed for node", "node", healthEvent.NodeName, "error", err)
-    return ctrl.Result{}, err
+    return ctrl.Result{}, fmt.Errorf("log collector job failed for node %s: %w", healthEvent.NodeName, err)
 }
203-211: Add error context for label update failure.
The error from `UpdateNVSentinelStateNodeLabel` is returned without wrapping. Adding context will make debugging easier when this operation fails during the "remediating" state update.
Proposed error wrapping
 _, err := r.Config.StateManager.UpdateNVSentinelStateNodeLabel(ctx, healthEventWithStatus.HealthEvent.NodeName, statemanager.RemediatingLabelValue, false)
 if err != nil {
     slog.Error("Error updating node label to remediating", "error", err)
     metrics.ProcessingErrors.WithLabelValues("label_update_error", nodeName).Inc()
-    return "", err
+    return "", fmt.Errorf("failed to update node label to remediating for node %s: %w", nodeName, err)
 }
220-244: Good error handling pattern but consider adding more context.
The deferred state update pattern (attempting to set label to "failed" even when CR creation fails) is solid and aligns with the PR objective to handle errors properly. However, the returned errors lack context.
Enhanced error wrapping for clarity
 _, err = r.Config.StateManager.UpdateNVSentinelStateNodeLabel(ctx, healthEventWithStatus.HealthEvent.NodeName, remediationLabelValue, false)
 if err != nil {
     slog.Error("Error updating node label", "label", remediationLabelValue, "error", err)
     metrics.ProcessingErrors.WithLabelValues("label_update_error", nodeName).Inc()
-    return "", errors.Join(createMaintenanceResourceError, err)
+    labelErr := fmt.Errorf("failed to update node label to %s for node %s: %w", remediationLabelValue, nodeName, err)
+    return "", errors.Join(createMaintenanceResourceError, labelErr)
 }
 if createMaintenanceResourceError != nil {
-    return "", createMaintenanceResourceError
+    return "", fmt.Errorf("failed to create maintenance resource for node %s: %w", nodeName, createMaintenanceResourceError)
 }
259-264: Add error context for state clearing failure.
The error from `ClearRemediationState` should be wrapped with context per coding guidelines to aid debugging.
Proposed error wrapping
 if err := r.annotationManager.ClearRemediationState(ctx, nodeName); err != nil {
     slog.Error("Failed to clear remediation state for node", "node", nodeName, "error", err)
-    return ctrl.Result{}, err
+    return ctrl.Result{}, fmt.Errorf("failed to clear remediation state for node %s: %w", nodeName, err)
 }
267-272: Add error context for mark processed failure.
Per coding guidelines, wrap the error with context to clarify which operation failed.
Proposed error wrapping
 if err := watcherInstance.MarkProcessed(context.Background(), resumeToken); err != nil {
     metrics.ProcessingErrors.WithLabelValues("mark_processed_error", nodeName).Inc()
     slog.Error("Error updating resume token", "error", err)
-    return ctrl.Result{}, err
+    return ctrl.Result{}, fmt.Errorf("failed to mark cancellation event as processed for node %s: %w", nodeName, err)
 }
328-344: Good error propagation for retry behavior.
The handling correctly propagates errors from `runLogCollector` and `performRemediation` to trigger retries, and uses `errors.Join` to combine multiple failures. The pattern of updating remediation status even when `performRemediation` fails is solid.
However, for consistency with coding guidelines, consider adding context to the final error returns to clarify which operation failed in the remediation flow.
291-296: Pattern: Multiple MarkProcessed errors lack context.
These four locations return errors from `MarkProcessed` or `checkExistingCRStatus` without wrapping. While the error propagation is correct for retry behavior, adding context would improve debuggability.
Consider wrapping these errors with context about the operation and node name, similar to the pattern suggested in other review comments.
Also applies to: 302-307, 316-321, 348-353
364-367: Add error context for document ID extraction.
For consistency with the error handling at line 385 and coding guidelines, wrap this error with context.
Proposed error wrapping
 documentID, err := utils.ExtractDocumentID(eventWithToken.Event)
 if err != nil {
-    return err
+    return fmt.Errorf("failed to extract document ID from event: %w", err)
 }
406-410: Add error context for annotation operations.
Both `GetRemediationState` and `RemoveGroupFromState` errors should be wrapped with context per coding guidelines.
Proposed error wrapping
 state, _, err := r.annotationManager.GetRemediationState(ctx, nodeName)
 if err != nil {
     slog.Error("Error getting remediation state", "node", nodeName, "error", err)
-    return true, "", err
+    return true, "", fmt.Errorf("failed to get remediation state for node %s: %w", nodeName, err)
 }
 // ... later ...
 if err = r.annotationManager.RemoveGroupFromState(ctx, nodeName, group); err != nil {
     slog.Error("Failed to remove CR from annotation", "error", err)
-    return false, "", err
+    return false, "", fmt.Errorf("failed to remove group %s from remediation state for node %s: %w", group, nodeName, err)
 }
Also applies to: 438-441
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files ignored due to path filters (1)
commons/go.sum is excluded by !**/*.sum
Files selected for processing (26)
- .gitignore
- commons/go.mod
- commons/pkg/statemanager/statemanager.go
- commons/pkg/statemanager/statemanager_test.go
- fault-remediation/main.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/annotation/deprecated_annotation.go
- fault-remediation/pkg/crstatus/checker.go
- fault-remediation/pkg/crstatus/crstatus_interface.go
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/crstatus/deprecated_checker.go
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/remediation/deprecated_remediation.go
- fault-remediation/pkg/remediation/deprecated_remediation_test.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/remediation/templates/log-collector-job.yaml
- fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
Files with no reviewable changes (1)
- fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
Files skipped from review due to trivial changes (1)
- fault-remediation/pkg/remediation/templates/log-collector-job.yaml
Files skipped from review as they are similar to previous changes (3)
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/annotation/annotation_interface.go
- commons/go.mod
Additional context used
Path-based instructions (2)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
fault-remediation/pkg/annotation/annotation.go
fault-remediation/pkg/crstatus/crstatus_interface.go
fault-remediation/pkg/remediation/fault_remediation_client_interface.go
fault-remediation/pkg/initializer/init.go
commons/pkg/statemanager/statemanager_test.go
fault-remediation/pkg/crstatus/crstatus_test.go
fault-remediation/main.go
fault-remediation/pkg/annotation/deprecated_annotation.go
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/metrics/metrics.go
fault-remediation/pkg/remediation/remediation.go
fault-remediation/pkg/remediation/deprecated_remediation.go
fault-remediation/pkg/crstatus/deprecated_checker.go
commons/pkg/statemanager/statemanager.go
fault-remediation/pkg/remediation/deprecated_remediation_test.go
fault-remediation/pkg/crstatus/checker.go
fault-remediation/pkg/reconciler/reconciler_test.go
fault-remediation/pkg/reconciler/reconciler.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
commons/pkg/statemanager/statemanager_test.go
fault-remediation/pkg/crstatus/crstatus_test.go
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/remediation/deprecated_remediation_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learnings (23)
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages
Applied to files:
fault-remediation/pkg/crstatus/crstatus_interface.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
fault-remediation/pkg/crstatus/crstatus_interface.go
fault-remediation/pkg/metrics/metrics.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code
Applied to files:
fault-remediation/pkg/crstatus/crstatus_interface.go
fault-remediation/pkg/metrics/metrics.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
fault-remediation/pkg/initializer/init.go
commons/pkg/statemanager/statemanager_test.go
fault-remediation/pkg/remediation/deprecated_remediation.go
commons/pkg/statemanager/statemanager.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
fault-remediation/pkg/initializer/init.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
commons/pkg/statemanager/statemanager_test.go
fault-remediation/pkg/crstatus/crstatus_test.go
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/remediation/deprecated_remediation.go
fault-remediation/pkg/remediation/deprecated_remediation_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
commons/pkg/statemanager/statemanager_test.go
fault-remediation/pkg/crstatus/crstatus_test.go
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/remediation/deprecated_remediation_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
commons/pkg/statemanager/statemanager_test.go
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/remediation/remediation.go
fault-remediation/pkg/remediation/deprecated_remediation.go
fault-remediation/pkg/reconciler/reconciler.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
fault-remediation/pkg/crstatus/crstatus_test.go
fault-remediation/pkg/remediation/deprecated_remediation_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
fault-remediation/main.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
fault-remediation/main.go
commons/pkg/statemanager/statemanager.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code
Applied to files:
fault-remediation/main.go
Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.
Applied to files:
fault-remediation/main.go
fault-remediation/pkg/remediation/deprecated_remediation.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
fault-remediation/pkg/remediation/remediation_test.go
Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/metrics/metrics.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code
Applied to files:
fault-remediation/pkg/metrics/metrics.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use separate informers for different Kubernetes resource types
Applied to files:
fault-remediation/pkg/remediation/deprecated_remediation.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
fault-remediation/pkg/remediation/deprecated_remediation.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
fault-remediation/pkg/remediation/deprecated_remediation.go
Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Code graph analysis (13)
fault-remediation/pkg/annotation/annotation.go (1)
  fault-remediation/pkg/annotation/annotation_interface.go (3)
    RemediationStateAnnotation (24-26), AnnotationKey (12-12), EquivalenceGroupState (29-32)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (3)
  fault-remediation/pkg/annotation/annotation_interface.go (1)
    NodeAnnotationManagerInterface (16-21)
  fault-remediation/pkg/crstatus/crstatus_interface.go (1)
    CRStatusCheckerInterface (9-11)
  fault-remediation/pkg/config/config.go (1)
    MaintenanceResource (18-24)
fault-remediation/pkg/initializer/init.go (3)
  fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
    FaultRemediationClientInterface (29-34), TemplateData (37-44)
  fault-remediation/pkg/remediation/remediation.go (1)
    NewCtrlRuntimeRemediationClient (47-95)
  fault-remediation/pkg/config/config.go (2)
    Template (27-30), MaintenanceResource (18-24)
fault-remediation/pkg/crstatus/crstatus_test.go (2)
  fault-remediation/pkg/config/config.go (1)
    MaintenanceResource (18-24)
  fault-remediation/pkg/crstatus/checker.go (1)
    NewCtrlRuntimeCRStatusChecker (34-44)
fault-remediation/main.go (4)
  fault-remediation/pkg/initializer/init.go (2)
    InitializationParams (38-44), InitializeAll (51-168)
  fault-remediation/pkg/reconciler/reconciler.go (1)
    FaultRemediationReconciler (61-69)
  commons/pkg/auditlogger/roundtripper.go (1)
    NewAuditingRoundTripper (42-47)
  health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
    NewManager (41-43)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
  fault-remediation/pkg/annotation/annotation_interface.go (3)
    RemediationStateAnnotation (24-26), AnnotationKey (12-12), EquivalenceGroupState (29-32)
fault-remediation/pkg/remediation/remediation_test.go (5)
  fault-remediation/pkg/remediation/remediation.go (2)
    NewCtrlRuntimeRemediationClient (47-95), CtrlRuntimeRemediationClient (38-45)
  fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
    TemplateData (37-44)
  fault-remediation/pkg/config/config.go (1)
    MaintenanceResource (18-24)
  data-models/pkg/protos/health_event.pb.go (1)
    RecommendedAction_RESTART_BM (48-48)
  fault-remediation/pkg/events/health_event.go (1)
    HealthEventData (12-15)
fault-remediation/pkg/remediation/deprecated_remediation.go (11)
  fault-remediation/pkg/config/config.go (1)
    Template (27-30)
  fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
    TemplateData (37-44)
  fault-remediation/pkg/annotation/annotation_interface.go (1)
    NodeAnnotationManagerInterface (16-21)
  fault-remediation/pkg/crstatus/deprecated_checker.go (2)
    CRStatusChecker (29-34), NewCRStatusChecker (36-48)
  fault-remediation/pkg/annotation/deprecated_annotation.go (1)
    NewNodeAnnotationManager (38-42)
  fault-remediation/pkg/crstatus/crstatus_interface.go (1)
    CRStatusCheckerInterface (9-11)
  fault-remediation/pkg/reconciler/reconciler_test.go (1)
    CRStatusCheckerInterface (48-50)
  fault-remediation/pkg/events/health_event.go (1)
    HealthEventData (12-15)
  platform-connectors/pkg/pipeline/factory.go (1)
    Create (32-39)
  fault-remediation/pkg/common/equivalence_groups.go (1)
    GetRemediationGroupForAction (35-45)
  fault-remediation/pkg/metrics/metrics.go (3)
    LogCollectorErrors (86-92), LogCollectorJobs (71-77), LogCollectorJobDuration (78-85)
fault-remediation/pkg/crstatus/deprecated_checker.go (1)
  fault-remediation/pkg/config/config.go (1)
    MaintenanceResource (18-24)
fault-remediation/pkg/remediation/deprecated_remediation_test.go (4)
  fault-remediation/pkg/config/config.go (1)
    Template (27-30)
  fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
    TemplateData (37-44)
  fault-remediation/pkg/events/health_event.go (1)
    HealthEventData (12-15)
  fault-remediation/pkg/remediation/deprecated_remediation.go (1)
    FaultRemediationClient (62-74)
fault-remediation/pkg/crstatus/checker.go (1)
  fault-remediation/pkg/config/config.go (1)
    MaintenanceResource (18-24)
fault-remediation/pkg/reconciler/reconciler_test.go (4)
  fault-remediation/pkg/events/health_event.go (2)
    HealthEventData (12-15), HealthEventDoc (6-9)
  fault-remediation/pkg/annotation/annotation_interface.go (3)
    NodeAnnotationManagerInterface (16-21), RemediationStateAnnotation (24-26), EquivalenceGroupState (29-32)
  fault-remediation/pkg/crstatus/crstatus_interface.go (1)
    CRStatusCheckerInterface (9-11)
  data-models/pkg/protos/health_event.pb.go (2)
    RecommendedAction_RESTART_BM (48-48), RecommendedAction_NONE (43-43)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (4)
  fault-remediation/pkg/remediation/remediation.go (1)
    NewCtrlRuntimeRemediationClient (47-95)
  commons/pkg/statemanager/statemanager.go (4)
    NewStateManager (209-213), StateManager (200-203), DrainSucceededLabelValue (170-170), RemediatingLabelValue (174-174)
  fault-remediation/pkg/annotation/annotation_interface.go (1)
    AnnotationKey (12-12)
  fault-remediation/pkg/metrics/metrics.go (6)
    TotalEventsReceived (33-38), EventHandlingDuration (62-68), EventsProcessed (39-45), CRStatusCreated (25-25), CRStatusSkipped (26-26), ProcessingErrors (46-52)
YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
[error] 15-15: syntax error: expected , but found ''
(syntax)
Additional comments (38)
commons/pkg/statemanager/statemanager_test.go (1)
269-549: Comprehensive test coverage for controller-runtime state manager.
These tests provide excellent coverage of the controller-runtime state manager implementation, mirroring the existing client-go tests and ensuring both implementations behave consistently. The table-driven approach for state transition validation is particularly well-structured.
commons/pkg/statemanager/statemanager.go (2)
298-306: Clean constructor pattern for controller-runtime state manager.
The type declaration and constructor follow established patterns and properly return the StateManager interface, enabling flexible client selection.
310-388: Well-structured controller-runtime implementation.
The implementation properly mirrors the client-go version while adapting to controller-runtime patterns. The error handling, logging, and validation logic are consistent with the existing state manager. As discussed in previous reviews, relying on controller-runtime's reconciliation retry mechanism is an appropriate design choice for this context.
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)
15-15: YAMLlint error is a false positive for Go template syntax.
The yamllint syntax error on line 15 is expected because `{{.ApiGroup}}/{{.Version}}` is Go template syntax that will be rendered at runtime. This is not a valid concern.
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
83-125: LGTM: GetRemediationState signature update is consistent.
The updated signature correctly returns the node alongside the state, and all return paths properly include the node value. Callers appropriately handle the extra return value.
fault-remediation/pkg/remediation/deprecated_remediation.go (2)
254-316: LGTM: AlreadyExists handling is well-structured.
The new `handleCRCreateAlreadyExists` helper properly handles the case where a CR already exists by updating the node annotation and logging appropriately. The separation into a helper method improves readability.
379-385: No issue found - constants are properly defined in the same package.
The constants `logCollectorNodeLabel` and `logCollectorEventLabel` are defined in `remediation.go` (lines 32-33) within the same package. In Go, package-level constants defined in one file are accessible from all other files in the same package without explicit imports. This code will compile successfully.
Likely an incorrect or invalid review comment.
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
29-34: LGTM: Clean interface design.
The `FaultRemediationClientInterface` provides a well-defined contract with appropriate method signatures for remediation operations, annotation management, and status checking.
36-44: LGTM: Well-structured TemplateData definition.
The `TemplateData` struct properly embeds `config.MaintenanceResource` and includes all necessary fields for template rendering and health event tracking.
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
1-11: LGTM! Package documentation and interface are well-defined.
The package-level godoc has been added as requested in the previous review, and the interface follows Go conventions with a clear, focused contract.
fault-remediation/pkg/metrics/metrics.go (1)
15-93: LGTM! Metric exports and package reorganization are correct.
The package rename and metric variable exports follow Go conventions. The TODO comment at line 30 was already flagged in a previous review and is outside the scope of these changes.
fault-remediation/main.go (3)
122-136: LGTM!
The non-controller-runtime setup properly initializes components, sets up deferred cleanup, and handles the metrics server and event processing. The structure is clean and follows the expected patterns.
228-239: LGTM!
The controller-runtime setup correctly initializes components with the manager's client and properly defers cleanup. The wiring with `mgr.GetClient()` aligns with the controller-runtime pattern.
194-197: The current code is correct.
`rest.Config.Wrap()` is a convenience method that modifies the config's transport wrapping in-place by composing wrappers onto `WrapTransport`. It does not return a value requiring reassignment. The approach used here is the idiomatic way to add HTTP transport wrappers in client-go, and it properly stacks multiple wrappers when needed.
fault-remediation/pkg/annotation/annotation.go (3)
15-25: LGTM!
The struct and constructor follow Go conventions and properly initialize the manager with the controller-runtime client.
106-131: LGTM!
`ClearRemediationState` properly handles the nil annotations case and uses the merge-from patch pattern correctly.
133-168: LGTM!
`RemoveGroupFromState` correctly handles the case where no groups remain by delegating to `ClearRemediationState`, and properly propagates errors.
fault-remediation/pkg/remediation/remediation.go (3)
46-95: LGTM!
The constructor properly validates template existence, parses the template, initializes dry-run mode, and sets up annotation manager and status checker. Error handling is thorough with context-wrapped errors.
105-200: LGTM!
`CreateMaintenanceResource` properly handles dry-run mode, creates owner references for garbage collection, handles `IsAlreadyExists` errors gracefully, and updates node annotation with the actual CR name. The previous duplicate annotation update issue has been addressed.
246-309: LGTM!
`launchLogCollectorJob` properly reads the manifest, sets labels for deduplication, handles the case of multiple existing jobs, and requeues after creation to check status later.
fault-remediation/pkg/remediation/deprecated_remediation_test.go (2)
347-369: LGTM!
Test correctly updated to use `events.HealthEventData` and handles the new `(string, error)` return signature from `CreateMaintenanceResource`.
396-418: LGTM!
Tests properly updated to use `eventId` parameter and handle the `(ctrl.Result, error)` return type from `RunLogCollectorJob`.
fault-remediation/pkg/crstatus/checker.go (3)
28-44: LGTM!
The struct and constructor properly initialize the status checker with the controller-runtime client and configuration.
58-69: LGTM!
The `ObjectKey` now correctly includes both `Name` and `Namespace` from the config, addressing the previous review concern about missing namespace for namespaced CRs.
71-105: LGTM!
The condition checking logic correctly traverses the unstructured status map, finds the configured condition type, and determines if the status is terminal.
fault-remediation/pkg/reconciler/reconciler_test.go (3)
40-66: LGTM!
`MockK8sClient` properly updated with new method signatures using `events.HealthEventData`, `ctrl.Result`, and the correct interface types for annotation manager and status checker.
102-134: LGTM!
`MockNodeAnnotationManager` properly implements the updated interface with the new `GetRemediationState` signature returning `(*annotation.RemediationStateAnnotation, *corev1.Node, error)`.
601-666: LGTM!
`TestRunLogCollectorJobErrorScenarios` is a well-structured table-driven test that covers success, failure, and requeue scenarios with proper assertions on the `ctrl.Result` return type.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (4)
208-217: Manager created but not started before using GetClient.
The manager is created at lines 208-213 and `GetClient()` is called at line 217, but the manager isn't started until the goroutine at lines 250-254. While controller-runtime allows this pattern, be aware that the client's cache won't be synced until the manager starts, which could cause timing issues in tests.
302-315: LGTM!
`createTestRemediationClient` properly uses `remediation.NewCtrlRuntimeRemediationClient` with the shared controller-runtime client and returns the interface type. Template data configuration is correct.
880-888: LGTM!
Metrics assertions properly use the dedicated metrics package symbols (`metrics.TotalEventsReceived`, `metrics.EventsProcessed`, etc.) and verify the expected behavior for CR creation and deduplication.
1312-1328: LGTM!
`cleanupNodeAnnotations` helper properly uses `annotation.AnnotationKey` for cleanup, maintaining consistency with the rest of the codebase.
fault-remediation/pkg/reconciler/reconciler.go (6)
25-28: LGTM: Clean interface-based refactoring.
The addition of new internal packages and the shift to interface-based dependencies (RemediationClient, annotationManager) improves testability and modularity. The exposed Config field and initialization flow are consistent.
Also applies to: 51-51, 66-67, 83-84
92-127: LGTM: Proper event parsing and routing.
The Reconcile method correctly records metrics, handles parse errors by marking them processed (preventing infinite retries on malformed events), and routes to appropriate handlers based on node quarantine status.
156-165: Verify: Should label update errors trigger retry?
The error from `UpdateNVSentinelStateNodeLabel` is logged and recorded in metrics but not propagated. Given the PR's goal to "throw errors to trigger retries", consider whether label update failures should prevent event processing and trigger a retry.
If the label update is critical for cluster state consistency, propagating the error might be appropriate. The current implementation could lead to events being skipped even when state updates fail.
Based on the PR objectives and the importance of label consistency, should this error be propagated to the caller?
383-386: LGTM: Proper error wrapping with context.
This correctly follows the coding guideline to wrap errors with context using `fmt.Errorf("context: %w", err)`.
448-498: LGTM: Robust error handling with proper classification.
The method correctly:
- Wraps errors with context (line 491)
- Records metrics with appropriate error labels based on failure type
- Marks malformed events as processed to prevent queue blocking
- Returns structured HealthEventDoc type
503-584: LGTM: Clean utility methods.
The utility methods properly handle:
- Watcher lifecycle management
- Error aggregation in CloseAll using errors.Join
- Controller-runtime integration with proper channel adaptation
- Goroutine cleanup on context cancellation or channel closure
func (c *CRStatusChecker) checkCondition(obj *unstructured.Unstructured) bool {
	status, found, err := unstructured.NestedMap(obj.Object, "status")
	if err != nil || !found {
		return true
	}

	conditions, found, err := unstructured.NestedSlice(status, "conditions")
	if err != nil || !found {
		return true
	}

	conditionStatus := c.findConditionStatus(conditions)

	return !c.isTerminal(conditionStatus)
}
Critical: Inverted skip/allow logic in checkCondition remains unfixed.
This critical issue was previously identified but not yet addressed. The return values in checkCondition are inverted:
- Lines 80 and 85: Return `true` (skip CR creation) when status or conditions are missing, but missing status/conditions should be treated as non-terminal (do not skip). Should return `false`.
- Line 90: Returns `!c.isTerminal(conditionStatus)`, meaning skip when NOT terminal. But terminal status should trigger skip. Should return `c.isTerminal(conditionStatus)`.
Proposed fix
func (c *CRStatusChecker) checkCondition(obj *unstructured.Unstructured) bool {
status, found, err := unstructured.NestedMap(obj.Object, "status")
if err != nil || !found {
- return true
+ return false
}
conditions, found, err := unstructured.NestedSlice(status, "conditions")
if err != nil || !found {
- return true
+ return false
}
conditionStatus := c.findConditionStatus(conditions)
- return !c.isTerminal(conditionStatus)
+ return c.isTerminal(conditionStatus)
}
Committable suggestion
IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
func (c *CRStatusChecker) checkCondition(obj *unstructured.Unstructured) bool {
	status, found, err := unstructured.NestedMap(obj.Object, "status")
	if err != nil || !found {
		return false
	}
	conditions, found, err := unstructured.NestedSlice(status, "conditions")
	if err != nil || !found {
		return false
	}
	conditionStatus := c.findConditionStatus(conditions)
	return c.isTerminal(conditionStatus)
}
Prompt for AI Agents
In @fault-remediation/pkg/crstatus/deprecated_checker.go around lines 77 - 91,
In CRStatusChecker.checkCondition, the skip/allow logic is inverted: when status
or conditions are missing you should treat the resource as non-terminal (do not
skip) and when a conditionStatus is terminal you should skip. Change the two
early returns that currently return true to return false (for the
status/conditions-not-found or error cases), and change the final return from
!c.isTerminal(conditionStatus) to c.isTerminal(conditionStatus); this uses the
existing methods findConditionStatus and isTerminal to determine terminality
correctly.
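To make the proposed semantics concrete, here is a self-contained sketch that mirrors the suggested fix using plain maps instead of `unstructured.Unstructured`. It is an illustration of the review comment, not the merged code: the helper names `conditionStatus` and `shouldSkip`, the condition type `"Completed"`, and the set of terminal statuses are all assumptions, and the map layout assumes the usual `status.conditions[].type/status` shape.

```go
package main

import "fmt"

// conditionStatus returns the status string of the first condition whose
// "type" equals wantType. The second result is false when status, conditions,
// or the matching condition is missing or malformed.
func conditionStatus(obj map[string]any, wantType string) (string, bool) {
	status, ok := obj["status"].(map[string]any)
	if !ok {
		return "", false
	}

	conditions, ok := status["conditions"].([]any)
	if !ok {
		return "", false
	}

	for _, raw := range conditions {
		cond, ok := raw.(map[string]any)
		if !ok {
			continue
		}
		if t, _ := cond["type"].(string); t == wantType {
			s, ok := cond["status"].(string)
			return s, ok
		}
	}

	return "", false
}

// shouldSkip follows the semantics proposed in the comment above: missing
// status or conditions means "do not skip"; otherwise skip only when the
// condition status is in the (hypothetical) terminal set.
func shouldSkip(obj map[string]any, wantType string, terminal map[string]bool) bool {
	s, found := conditionStatus(obj, wantType)
	if !found {
		return false
	}
	return terminal[s]
}

func main() {
	terminal := map[string]bool{"True": true, "False": true} // assumed terminal values

	noStatus := map[string]any{}
	done := map[string]any{"status": map[string]any{
		"conditions": []any{map[string]any{"type": "Completed", "status": "True"}},
	}}

	fmt.Println("missing status -> skip:", shouldSkip(noStatus, "Completed", terminal))
	fmt.Println("terminal status -> skip:", shouldSkip(done, "Completed", terminal))
}
```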
 	group := common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
 	if group != "" && c.AnnotationManager != nil {
 		if err = c.AnnotationManager.UpdateRemediationState(ctx, healthEvent.NodeName,
 			group, crName); err != nil {
 			slog.Warn("Failed to update node annotation", "node", healthEvent.NodeName,
 				"error", err)

 			return "", err
 		}
 	}

 	// Get the actual name of the created CR
 	actualCRName := createdCR.GetName()
 	log.Printf("Created Maintenance CR %s successfully for node %s", actualCRName, healthEvent.NodeName)

 	// Update node annotation with CR reference
-	group := common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
-	if group != "" && c.annotationManager != nil {
-		if err := c.annotationManager.UpdateRemediationState(ctx, healthEvent.NodeName,
+	group = common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
+	if group != "" && c.AnnotationManager != nil {
+		if err = c.AnnotationManager.UpdateRemediationState(ctx, healthEvent.NodeName,
 			group, actualCRName); err != nil {
 			slog.Warn("Failed to update node annotation", "node", healthEvent.NodeName,
 				"error", err)

 			return "", err
 		}
 	}

 	return actualCRName, nil
Duplicate annotation update - only the second update is necessary.
Lines 266-275 update the annotation with crName, then lines 282-291 update again with actualCRName. Since actualCRName := createdCR.GetName() provides the actual CR name from the created resource, only the second update is needed. Additionally, line 282 redundantly re-assigns group.
Suggested fix
- group := common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
- if group != "" && c.AnnotationManager != nil {
- if err = c.AnnotationManager.UpdateRemediationState(ctx, healthEvent.NodeName,
- group, crName); err != nil {
- slog.Warn("Failed to update node annotation", "node", healthEvent.NodeName,
- "error", err)
-
- return "", err
- }
- }
-
// Get the actual name of the created CR
actualCRName := createdCR.GetName()
log.Printf("Created Maintenance CR %s successfully for node %s", actualCRName, healthEvent.NodeName)
// Update node annotation with CR reference
- group = common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
+ group := common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
if group != "" && c.AnnotationManager != nil {
if err = c.AnnotationManager.UpdateRemediationState(ctx, healthEvent.NodeName,
group, actualCRName); err != nil {
Committable suggestion
IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
	// Get the actual name of the created CR
	actualCRName := createdCR.GetName()
	log.Printf("Created Maintenance CR %s successfully for node %s", actualCRName, healthEvent.NodeName)
	// Update node annotation with CR reference
	group := common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
	if group != "" && c.AnnotationManager != nil {
		if err = c.AnnotationManager.UpdateRemediationState(ctx, healthEvent.NodeName,
			group, actualCRName); err != nil {
			slog.Warn("Failed to update node annotation", "node", healthEvent.NodeName,
				"error", err)
			return "", err
		}
	}
	return actualCRName, nil
Prompt for AI Agents
In @fault-remediation/pkg/remediation/deprecated_remediation.go around lines 266
- 293, The code performs two redundant annotation updates: first calling
c.AnnotationManager.UpdateRemediationState with crName, then again with
actualCRName after obtaining actualCRName := createdCR.GetName(); remove the
first update block (the if that checks group != "" and calls
UpdateRemediationState with crName) and the redundant reassignment of group
before the second update, leaving only the single UpdateRemediationState call
that uses actualCRName (keep references to common.GetRemediationGroupForAction,
createdCR.GetName, and c.AnnotationManager.UpdateRemediationState) so the node
annotation is set once with the real CR name and the function still returns
actualCRName, nil.
	newStateLabelValue NVSentinelStateLabelValue,
	removeStateLabel bool,
) (bool, error) {
	node := &corev1.Node{}
Instead of duplicating the logic, can we instead get a Kubernetes client from controller runtime manager and use that:
client, err := kubernetes.NewForConfig(mgr.GetConfig())
It is not ideal, but it will allow us to keep the logic in a single function, and we don't have to make bug fixes in two places. Once we have all the modules cut over, we can do a final refactor to use the controller-runtime client, what do you think?
Yeah that's a good call
Part of the reason was that the retries here would conflict with the built-in controller-runtime retries when an error is returned, but that's probably worth the tradeoff of not having the duplicate logic.
Same comment as earlier about converting clients so that we don't have duplicated logic
ended up consolidating on ctrl-runtime
| var obj map[string]any | ||
| if err = yaml.Unmarshal(buf.Bytes(), &obj); err != nil { | ||
| slog.Error("Failed to unmarshal YAML", "error", err) | ||
| return "", err |
Can we wrap errors please?
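For illustration only, a minimal sketch of the kind of wrapping being asked for. The `decodeManifest` and `isSyntaxError` helpers are hypothetical (they are not code from this PR), and `encoding/json` stands in for the YAML decoder in the diff so the example has no external dependencies:

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// decodeManifest is a hypothetical helper, not code from this PR.
// encoding/json stands in for the YAML decoder used in the diff.
func decodeManifest(data []byte) (map[string]any, error) {
	var obj map[string]any
	if err := json.Unmarshal(data, &obj); err != nil {
		// %w adds context while keeping the original error in the chain.
		return nil, fmt.Errorf("failed to unmarshal rendered manifest: %w", err)
	}

	return obj, nil
}

// isSyntaxError reports whether err wraps a *json.SyntaxError.
func isSyntaxError(err error) bool {
	var syntaxErr *json.SyntaxError

	return errors.As(err, &syntaxErr)
}

func main() {
	_, err := decodeManifest([]byte(`{"kind": `))
	fmt.Println("error:", err)
	fmt.Println("still a syntax error:", isSyntaxError(err))
}
```

Because the error is wrapped with `%w` rather than formatted with `%v`, callers can still inspect the underlying cause with `errors.Is` or `errors.As`.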
I'm wondering if we should just bite the bullet and go all in on controller-runtime for this module... I'm not really a fan of all the duplication we have due to the use of the "kubernetes" client vs. the controller-runtime client. TBH, it is much more than I had initially anticipated. I'm worried that bug fixes will become complex until we get rid of one of the flows. What do you think?
Yeah agreed I think that would be much cleaner at this point
6806866 to
560b1dd
Compare
Actionable comments posted: 17
Caution
Some comments are outside the diff and can't be posted inline due to platform limitations.
Outside diff range comments (2)
fault-remediation/pkg/initializer/init.go (1)
38-43: Blocker: InitializationParams.config is unexported, likely breaking callers outside initializer.
If InitializeAll is called from another package (typical), InitializationParams{config: ...} won't compile. Rename/export it (e.g., RestConfig *rest.Config).
Proposed diff
 type InitializationParams struct {
-	config *rest.Config
+	RestConfig *rest.Config
 	TomlConfigPath string
 	DryRun bool
 	EnableLogCollector bool
 }
fault-remediation/pkg/crstatus/checker.go (1)
46-74: Incomplete comment and potential namespace handling for cluster-scoped resources.
Line 50 has an incomplete comment:
// note, if
This should be completed or removed.
Line 67 uses resource.Namespace for the ObjectKey, but the config.MaintenanceResource struct has a Scope field that indicates whether the resource is cluster-scoped or namespaced. For cluster-scoped resources, ensure that Namespace is empty; otherwise, the lookup may fail silently. The current code logs and returns false (allowing creation), which provides a safe fallback, but you may want to explicitly handle scope.
Suggested fix for the incomplete comment
- return true // note, if
+ return true // No configuration found; skip CR creation
Fix all issues with AI agents
In @.gitignore:
- Line 143: The .gitignore now contains a broad ".idea/" rule but still keeps
many specific .idea/* entries (e.g., ".idea/replstate.xml", ".idea/sonarlint/",
".idea/sonarIssues.xml"), causing redundancy and making the PR summary
inaccurate; fix by either removing all the specific .idea/* lines so the single
".idea/" entry covers them, or remove the general ".idea/" entry and keep the
explicit granular rules; apply the chosen approach consistently in the .gitignore
and update the PR summary to accurately reflect the consolidation or retained
granularity.
In @fault-remediation/main.go:
- Line 111: Replace the bare TODO comment "//TODO: setup informers for node and
job" with a TODO that references a tracked issue (e.g., include the issue number
or full issue URL) so it reads something like "// TODO: setup informers for node
and job - tracked in ISSUE-1234" (or a repo issue URL); ensure the comment
includes the canonical issue identifier per project guidelines and keep the same
intent text.
In @fault-remediation/pkg/annotation/annotation_test.go:
- Around line 146-168: The test TestClearRemediationState calls
AnnotationManager.ClearRemediationState and then immediately reassigns err when
calling client.Get, so the result of ClearRemediationState is never asserted;
after invoking annotationManager.ClearRemediationState(context.TODO(), nodeName)
capture and assert the call succeeded (e.g., assert.NoError(t, err) or
require.NoError(t, err)) before you reuse err for the client.Get call to ensure
failures in ClearRemediationState are detected.
In @fault-remediation/pkg/annotation/annotation.go:
- Around line 63-73: In UpdateRemediationState, the code assigns a default state
on the error path then immediately returns the error, making the assignment
dead/unused; either remove the unreachable assignment (delete the state =
&RemediationStateAnnotation{...} line) so the function simply logs and returns
the error, or if the intended behavior is to continue with a default state
instead of returning, initialize state and node to sane defaults and clear err
(do not return) so the rest of UpdateRemediationState can proceed; refer to
UpdateRemediationState, the local variables state/node/err and the call to
GetRemediationState to locate the change.
- Line 40: Replace the bare TODO comment in annotation.go ("TODO: maybe split
this up so it's not returning both node and state") with a TODO that references
the tracked issue ID or URL (e.g., "TODO: track in ISSUE-1234" or "TODO:
https://.../issues/1234") so the note points to a concrete ticket; update the
comment near the function or return logic that currently returns both node and
state (where the TODO is located) to include that issue reference.
- Around line 27-60: In GetRemediationState the JSON unmarshal error is
currently only logged and the function returns a nil error, hiding corrupt
annotations; change the handler for json.Unmarshal failure (in the block
referencing AnnotationKey and RemediationStateAnnotation) to return a wrapped
error (e.g. fmt.Errorf("failed to unmarshal remediation annotation for node %s:
%w", nodeName, err)) so the caller can retry/fail, or alternatively attempt to
clear the bad annotation via the client (m.client.Update) and return a clear
success/error outcome; ensure you return a non-nil error when unmarshal fails and
keep references to node and nodeName for context.
In @fault-remediation/pkg/initializer/init.go:
- Around line 82-97: The remediation client error message is misleading and the
kube client error is returned unwrapped; update the
remediation.NewRemediationClient error handling to return a clear, specific
message like "error initializing remediation client" (or similar) instead of
"ctrl runtime client", and wrap the kubernetes.NewForConfig error with context
using fmt.Errorf("error creating kube client: %w", err) so both failures provide
actionable context; adjust the return statements around
remediation.NewRemediationClient and kubernetes.NewForConfig accordingly in
init.go.
In @fault-remediation/pkg/reconciler/reconciler_test.go:
- Around line 49-51: Remove the locally defined CRStatusCheckerInterface type
declaration and replace any local usages with the imported
crstatus.CRStatusCheckerInterface; delete the type block "type
CRStatusCheckerInterface interface { IsSuccessful(ctx context.Context, crName
string) bool }" and ensure all references in the test (e.g., mock variables,
function signatures) use crstatus.CRStatusCheckerInterface, and if necessary
adjust imports to avoid unused-import or missing-symbol errors.
In @fault-remediation/pkg/remediation/fault_remediation_client_interface.go:
- Around line 29-35: Exported type FaultRemediationClientInterface lacks a godoc
comment; add a one-line Go doc comment immediately above the type declaration
that begins with "FaultRemediationClientInterface" and briefly describes its
purpose and role (e.g., that it defines methods for creating maintenance
resources, running log collector jobs, and providing access to annotation
manager, status checker, and config). Ensure the comment follows Go convention
(starts with the type name) and references the interface as a whole; leave the
existing method signatures unchanged.
In @fault-remediation/pkg/remediation/remediation_test.go:
- Around line 526-527: The test assertion uses assert.Equal with arguments
reversed; change the call in remediation_test.go from assert.Equal(t,
result.RequeueAfter, tt.requeueTime) to assert.Equal(t, tt.requeueTime,
result.RequeueAfter) so the expected value (tt.requeueTime) is first and the
actual (result.RequeueAfter) is second for clearer failure output.
- Around line 29-97: The table-driven test leaves tt.client nil which makes
TestNewCtrlRuntimeClient brittle; either populate tt.client with a real fake
controller-runtime client before calling NewRemediationClient or remove the
unused client field from the test cases. To fix, in TestNewCtrlRuntimeClient
initialize tt.client using controller-runtime's fake client builder (e.g.,
fake.NewClientBuilder().WithScheme(yourScheme).WithObjects(...).Build()) for
each subtest that needs a non-nil client and pass that into
NewRemediationClient, or simplify the table by deleting the client field and
always pass nil if the constructor is expected to accept nil.
In @fault-remediation/pkg/remediation/remediation.go:
- Around line 116-139: Validate and harden loadAndParseTemplate: ensure fileName
is a plain base name (no path separators or parent refs) and refuse values
containing "/" or "\" or ".." (or compare filepath.Base(fileName) == fileName),
then build the path and verify the resolved absolute path is inside the
mountPath root before reading; also set the template option to fail on missing
keys by calling tmpl = template.New(templateName).Option("missingkey=error")
prior to Parse so templates error on unknown data.
- Around line 43-114: Add proper godoc comments above the exported
FaultRemediationClient type and the NewRemediationClient function: for
FaultRemediationClient add a one-line summary describing its role (e.g.,
"FaultRemediationClient manages remediation templates, annotation manager and
status checking for remediation actions.") and for NewRemediationClient add a
summary plus brief parameter/return description (single sentence each)
explaining the client, dryRun flag, and remediationConfig parameters and what
the function returns (constructed *FaultRemediationClient or error). Ensure the
comments start immediately above the declarations and follow Go doc style (begin
with the symbol name).
- Around line 456-491: In FaultRemediationClient.checkLogCollectorComplete: fix
the annotation gating and nil-time panics by treating a missing annotations map
as "not counted" (i.e., always attempt to set jobMetricsAlreadyCountedAnnotation
when its value != trueStringVal), create or initialize updateJob.Annotations
before setting the flag, and perform nil checks for job.Status.StartTime and
job.Status.CompletionTime before computing duration (skip metrics or use safe
zero/early-return if either is nil); ensure you still call c.client.Update with
the patched annotation and only record metrics when duration is computed safely
and the annotation was not already true.
In @fault-remediation/pkg/remediation/templates/rebootnode-template.yaml:
- Around line 15-20: The YAML scalars that begin with template substitutions
must be quoted to avoid invalid YAML; update the template to quote the
apiVersion and nodeName values so they become apiVersion:
"{{.ApiGroup}}/{{.Version}}" and spec.nodeName: "{{.NodeName}}". Locate the
apiVersion line and the spec nodeName line in the RebootNode template and wrap
the entire substitution (including the slash) in double quotes; keep other
fields (e.g., metadata.name) unchanged unless they also begin with a brace.
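The hardening suggested above for loadAndParseTemplate can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `validateTemplateFileName`, `resolveInside`, and `renderStrict` are hypothetical names, and the template data is assumed to be a map so that `missingkey=error` takes effect.

```go
package main

import (
	"bytes"
	"fmt"
	"path/filepath"
	"strings"
	"text/template"
)

// validateTemplateFileName (hypothetical) accepts only a plain base name:
// no separators, no parent references, nothing filepath.Base would change.
func validateTemplateFileName(fileName string) error {
	if fileName == "." ||
		strings.ContainsAny(fileName, `/\`) ||
		strings.Contains(fileName, "..") ||
		filepath.Base(fileName) != fileName {
		return fmt.Errorf("invalid template file name %q", fileName)
	}

	return nil
}

// resolveInside (hypothetical) joins fileName onto mountPath and confirms
// the resolved path is still located under mountPath.
func resolveInside(mountPath, fileName string) (string, error) {
	if err := validateTemplateFileName(fileName); err != nil {
		return "", err
	}

	root, err := filepath.Abs(mountPath)
	if err != nil {
		return "", fmt.Errorf("failed to resolve mount path %q: %w", mountPath, err)
	}

	full := filepath.Join(root, fileName)

	rel, err := filepath.Rel(root, full)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("template path %q escapes %q", full, root)
	}

	return full, nil
}

// renderStrict (hypothetical) renders text with missingkey=error, so a map
// key absent from data is reported as an error instead of "<no value>".
func renderStrict(name, text string, data map[string]any) (string, error) {
	tmpl, err := template.New(name).Option("missingkey=error").Parse(text)
	if err != nil {
		return "", fmt.Errorf("failed to parse template %s: %w", name, err)
	}

	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("failed to render template %s: %w", name, err)
	}

	return buf.String(), nil
}

func main() {
	path, err := resolveInside("/etc/templates", "rebootnode-template.yaml")
	fmt.Println(path, err)

	_, err = resolveInside("/etc/templates", "../secrets.yaml")
	fmt.Println(err)

	out, err := renderStrict("reboot", `nodeName: "{{.NodeName}}"`, map[string]any{"NodeName": "gpu-node-1"})
	fmt.Println(out, err)
}
```

Note that `missingkey=error` only affects map lookups; with struct data, a reference to a missing field already fails at execution time.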
Nitpick comments (11)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)
104-108: Drop unnecessary quotes for consistency.
path: "/readyz" and port: "health" work, but quoting is inconsistent with livenessProbe and adds noise.
Proposed diff
 readinessProbe:
   httpGet:
-    path: "/readyz"
-    port: "health"
+    path: /readyz
+    port: health
fault-remediation/pkg/remediation/remediation_test.go (1)
22-24: Consider envtest for higher-fidelity behavior where it matters.
Repo guidance suggests envtest over fake clients for Kubernetes controller behavior; these tests validate controller-runtime client flows, status, and Jobs, so fake-client semantics may miss real API behavior.
Also applies to: 243-246, 505-509
fault-remediation/pkg/annotation/annotation_test.go (2)
15-202: Align test names with the repo's naming convention.
Consider renaming to TestGetRemediationState_NodeNotFound_ReturnsError, etc., to match TestFunctionName_Scenario_ExpectedBehavior.
10-10: Consider envtest if these tests are intended to validate controller behavior.
Repo guidance prefers envtest over fake clients; if you're relying on real API semantics for annotations/patches, the fake client can diverge.
fault-remediation/main.go (1)
147-151: Cleanup runs with a cancelled ctx; use a fresh timeout context for CloseAll.
On SIGTERM, ctx is cancelled; reconciler.CloseAll(ctx) may be unable to close gracefully.
Proposed tweak
 defer func() {
-	if err := reconciler.CloseAll(ctx); err != nil {
+	shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
+	if err := reconciler.CloseAll(shutdownCtx); err != nil {
 		slog.Error("failed to close datastore components", "error", err)
 	}
 }()
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)
17-26: Non-standard import grouping.
The imports have inconsistent grouping with blank lines in unexpected places. Go convention groups imports into standard library, external packages, and internal packages, each separated by a single blank line.
Suggested import grouping
 import (
 	"context"
-
 	"log"
 	"os"
 	"path/filepath"
-	"sigs.k8s.io/controller-runtime/pkg/client"
 	"sync"
 	"testing"
 	"time"
+
+	"sigs.k8s.io/controller-runtime/pkg/client"
 	...
370-372: TODO comments should reference issues.
Per coding guidelines, TODO comments should reference GitHub issues. These TODOs indicate potential issues with state transition handling that should be tracked.
-	// TODO: ignoring error otherwise need to properly walk state transitions
-	_, _ = stateManager.UpdateNVSentinelStateNodeLabel(ctx, nodeName, statemanager.DrainSucceededLabelValue, false)
+	// TODO(#XXX): ignoring error otherwise need to properly walk state transitions
+	_, _ = stateManager.UpdateNVSentinelStateNodeLabel(ctx, nodeName, statemanager.DrainSucceededLabelValue, false)
1123-1125: Potential test flakiness with fixed sleep duration.
Using time.Sleep(500 * time.Millisecond) for synchronization can lead to flaky tests. Consider using assert.Eventually or require.Eventually with appropriate timeout and polling intervals for more reliable eventual consistency checks.
fault-remediation/pkg/reconciler/reconciler_test.go (2)
17-26: Non-standard import grouping.
Similar to the e2e test file, imports have inconsistent grouping. Standard library imports should be grouped together, followed by external packages, then internal packages.
Suggested import ordering
 import (
 	"context"
-	"errors"
 	"fmt"
-	"github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation"
-	"github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
-	corev1 "k8s.io/api/core/v1"
-	ctrl "sigs.k8s.io/controller-runtime"
 	"testing"
 	"time"
+	"errors"
+
+	corev1 "k8s.io/api/core/v1"
+	ctrl "sigs.k8s.io/controller-runtime"
+	...
+
+	"github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation"
+	"github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
 	...
120-135: MockNodeAnnotationManager always returns nil for Node.
The GetRemediationState mock always returns nil for the *corev1.Node return value. This is acceptable for current unit tests but may cause issues if tests need to verify node-related behavior. Consider documenting this limitation or adding a configurable node return value.
fault-remediation/pkg/reconciler/reconciler.go (1)
276-279: High cyclomatic complexity flagged by nolint.
The // nolint: cyclop // todo comment indicates this function has high complexity. While this is noted and not blocking, consider refactoring handleRemediationEvent in a follow-up to improve maintainability, for example by extracting the log collector, CR creation, and status update flows into separate helper methods.
Would you like me to open an issue to track the refactoring of handleRemediationEvent to reduce complexity?
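As a hedged illustration of the main.go nitpick above (cleanup running with an already-cancelled ctx), the sketch below contrasts the two approaches. `closeAll` is a hypothetical stand-in for reconciler.CloseAll, assumed here to abort when its context is done; the real method may behave differently.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// closeAll is a hypothetical stand-in for reconciler.CloseAll. Like many
// context-aware calls, it gives up immediately if ctx is already done.
func closeAll(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("close aborted: %w", err)
	}

	return nil
}

// closeWithFreshContext gives cleanup its own deadline, independent of the
// run context that a shutdown signal has already cancelled.
func closeWithFreshContext(timeout time.Duration) error {
	shutdownCtx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	return closeAll(shutdownCtx)
}

// demo cancels a run context (as a SIGTERM handler would) and returns the
// cleanup result for each approach.
func demo() (withRunCtx, withFreshCtx error) {
	runCtx, cancel := context.WithCancel(context.Background())
	cancel()

	return closeAll(runCtx), closeWithFreshContext(30 * time.Second)
}

func main() {
	withRunCtx, withFreshCtx := demo()
	fmt.Println("cleanup with cancelled run ctx:", withRunCtx)
	fmt.Println("cleanup with fresh timeout ctx:", withFreshCtx)
}
```

Deriving the shutdown context from context.Background() rather than from the run context is the point of the suggestion: a child of a cancelled context is cancelled from the start.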
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files ignored due to path filters (1)
- commons/go.sum is excluded by !**/*.sum
Files selected for processing (29)
- .gitignore
- commons/go.mod
- commons/pkg/statemanager/statemanager.go
- distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
- distros/kubernetes/nvsentinel/values-full.yaml
- distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
- distros/kubernetes/nvsentinel/values-tilt.yaml
- distros/kubernetes/nvsentinel/values.yaml
- fault-remediation/main.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/annotation/annotation_test.go
- fault-remediation/pkg/crstatus/checker.go
- fault-remediation/pkg/crstatus/crstatus_interface.go
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/reconciler/annotation.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/remediation.go
- fault-remediation/pkg/reconciler/remediation_test.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/remediation/templates/log-collector-job.yaml
- fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
Files with no reviewable changes (7)
- distros/kubernetes/nvsentinel/values.yaml
- distros/kubernetes/nvsentinel/values-full.yaml
- fault-remediation/pkg/reconciler/remediation_test.go
- distros/kubernetes/nvsentinel/values-tilt.yaml
- distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
- fault-remediation/pkg/reconciler/annotation.go
- fault-remediation/pkg/reconciler/remediation.go
Files skipped from review due to trivial changes (2)
- commons/go.mod
- fault-remediation/pkg/remediation/templates/log-collector-job.yaml
Files skipped from review as they are similar to previous changes (5)
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/crstatus/crstatus_interface.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/crstatus/crstatus_test.go
- commons/pkg/statemanager/statemanager.go
Additional context used
Path-based instructions (2)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
- fault-remediation/pkg/annotation/annotation_test.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/main.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/crstatus/checker.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior
Files:
- fault-remediation/pkg/annotation/annotation_test.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learnings (14)
Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
- fault-remediation/pkg/annotation/annotation_test.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/annotation/annotation_interface.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
- fault-remediation/pkg/annotation/annotation_test.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_test.go
Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.
Applied to files:
- fault-remediation/pkg/initializer/init.go
Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
- fault-remediation/pkg/reconciler/reconciler.go
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
- fault-remediation/pkg/reconciler/reconciler.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
- fault-remediation/main.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
- fault-remediation/main.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Code graph analysis (7)
fault-remediation/pkg/annotation/annotation_test.go (2)
fault-remediation/pkg/annotation/annotation_interface.go (3)
RemediationStateAnnotation(23-25)
AnnotationKey(11-11)
EquivalenceGroupState(28-34)
fault-remediation/pkg/annotation/annotation.go (1)
AnnotationManager(16-18)
fault-remediation/pkg/remediation/remediation_test.go (4)
fault-remediation/pkg/config/config.go (2)
Template(47-50)
MaintenanceResource(27-44)
data-models/pkg/protos/health_event.pb.go (1)
RecommendedAction_COMPONENT_RESET(44-44)
fault-remediation/pkg/remediation/remediation.go (2)
NewRemediationClient(56-114)
FaultRemediationClient(43-54)
fault-remediation/pkg/events/health_event.go (1)
HealthEventData(12-15)
fault-remediation/pkg/initializer/init.go (2)
fault-remediation/pkg/remediation/remediation.go (1)
NewRemediationClient(56-114)
commons/pkg/statemanager/statemanager.go (2)
NewStateManager(205-209)
StateManager(196-199)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
RemediationStateAnnotation(23-25)
AnnotationKey(11-11)
EquivalenceGroupState(28-34)
fault-remediation/main.go (3)
fault-remediation/pkg/initializer/init.go (2)
InitializationParams(38-43)
InitializeAll(49-148)
commons/pkg/auditlogger/roundtripper.go (1)
NewAuditingRoundTripper(42-47)
fault-remediation/pkg/reconciler/reconciler.go (1)
FaultRemediationReconciler(61-69)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
MaintenanceResource(27-44)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (3)
fault-remediation/pkg/events/health_event.go (1)
HealthEventData(12-15)
fault-remediation/pkg/annotation/annotation_interface.go (1)
NodeAnnotationManagerInterface(15-20)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
CRStatusCheckerInterface(9-11)
YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
[error] 15-15: syntax error: expected , but found ''
(syntax)
Additional comments (16)
.gitignore (1)
143-143: Verify scope: this change appears orthogonal to PR objectives.
The PR objectives describe retry behavior and fault-remediation system refactoring, but this change affects only .gitignore. Confirm that core remediation/retry logic changes are included elsewhere in the PR.
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)
96-111: Verify the app actually serves /readyz on the health port after this chart change.
This chart now hard-wires readiness to path: "/readyz" and port: "health" (Lines 106-107). If the binary doesn't expose /readyz on the health listener (or uses a different scheme/port), pods will flap.
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)
17-20: Confirm namespace handling for namespaced actions.
IfRebootNodeis namespaced and the client doesnβt set namespace after rendering, omittingmetadata.namespacewill cause create failures (or default-namespace surprises). Consider addingnamespace: "{{.Namespace}}"gated by scope, or ensure the code always injects namespace for namespaced actions.fault-remediation/pkg/annotation/annotation_interface.go (1)
9-34: LGTM: clear, minimal public contract and JSON shape.
AnnotationKey constant + typed RemediationStateAnnotation / EquivalenceGroupState matches intended annotation payload.
fault-remediation/pkg/annotation/annotation.go (1)
62-105: and
fault-remediation/pkg/crstatus/checker.go (2)
76-90: LGTM: Condition checking logic is correct.
The checkCondition method properly extracts nested status/conditions from the unstructured object and delegates to findConditionStatus and isTerminal. The boolean logic correctly returns true (skip creation) when terminal conditions aren't met.
28-44: LGTM: Clean refactor to controller-runtime client.
The struct and constructor are properly updated to use client.Client from controller-runtime. This aligns with the PR's goal of consolidating on controller-runtime patterns.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)
208-217: LGTM: Proper controller-runtime manager setup for tests.
The manager is correctly initialized with the envtest config and metrics disabled (BindAddress: "0"), which is appropriate for test environments. Using envtest aligns with the coding guidelines.
302-329: LGTM: Test remediation client factory properly updated.
The createTestRemediationClient function correctly uses the new remediation.NewRemediationClient with controller-runtime client and returns the interface type. Template configuration is properly structured.
fault-remediation/pkg/reconciler/reconciler_test.go (2)
350-404: LGTM: Tests properly updated for new performRemediation signature.
The tests correctly handle the updated performRemediation signature that returns (string, error). Both success and failure paths are covered with appropriate assertions.
576-613: LGTM: Log collector test properly updated for new signature.
The test correctly uses the new RunLogCollectorJob signature returning (ctrl.Result, error) and properly verifies the conditional invocation based on EnableLogCollector and action type.
fault-remediation/pkg/reconciler/reconciler.go (5)
203-211: Early return on label update failure may leave state inconsistent.
If the label update to RemediatingLabelValue fails, the function returns immediately without attempting CR creation. This could leave the node in the previous state while the event is retried. Consider whether this is the desired behavior, or if the CR creation should still be attempted (with the label error logged but not blocking).
However, given the PR intent to "throw errors to trigger retries," this early return will cause controller-runtime to requeue and retry the entire operation, which is a valid approach for achieving consistency.
Verify that the retry behavior is acceptable for your use case. If label updates are flaky, this could cause repeated retries before CR creation succeeds.
218-245: LGTM: Error handling properly defers CR creation error for state update.
The code correctly captures createMaintenanceResourceError but continues to update the state label before returning. Using errors.Join to combine errors when both the CR creation and label update fail is the right approach. This ensures the caller receives all relevant error information for logging/debugging.
328-356: LGTM: Log collector and remediation flow properly chain errors.
The updated flow correctly:
- Returns early if log collector fails or requests requeue
- Captures performRemediationErr and uses it to determine nodeRemediatedStatus
- Uses errors.Join when both remediation and status update fail
- Returns the remediation error after status update attempt
This supports the PR goal of throwing errors to trigger retries.
418-459: Error handling in checkExistingCRStatus now returns errors instead of silently allowing.
The changes on lines 421 and 453-455 now return errors instead of silently allowing CR creation. This is a behavioral change:
- Line 421: GetRemediationState failure now returns error (previously may have allowed creation)
- Line 455: RemoveGroupFromState failure now returns (false, "", err) instead of allowing creation
This aligns with the PR intent to trigger retries on errors. However, ensure this doesn't cause infinite retry loops if these operations consistently fail for a node.
Verify that transient failures in annotation operations will eventually succeed on retry, and that permanent failures (e.g., node doesn't exist) are handled appropriately upstream.
66-87: LGTM: Public Config field and constructor updates.
The Config field is now exported (capitalized), and the constructor properly initializes both Config and annotationManager from the provided configuration. This aligns with the broader refactor to expose configuration for testing and external access.
assert.Equal(t, result.RequeueAfter, tt.requeueTime)
Swap assert.Equal argument order (expected, actual).
assert.Equal(t, result.RequeueAfter, tt.requeueTime) should be assert.Equal(t, tt.requeueTime, result.RequeueAfter) for clearer failures.
Proposed diff
- assert.Equal(t, result.RequeueAfter, tt.requeueTime)
+ assert.Equal(t, tt.requeueTime, result.RequeueAfter)
Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation_test.go around lines 526 -
527, The test assertion uses assert.Equal with arguments reversed; change the
call in remediation_test.go from assert.Equal(t, result.RequeueAfter,
tt.requeueTime) to assert.Equal(t, tt.requeueTime, result.RequeueAfter) so the
expected value (tt.requeueTime) is first and the actual (result.RequeueAfter) is
second for clearer failure output.
type FaultRemediationClient struct {
	client client.Client
	dryRunMode []string

	// Multi-template support
	remediationConfig config.TomlConfig
	templates map[string]*template.Template // map from template file name to parsed template
	templateMountPath string

	annotationManager annotation.NodeAnnotationManagerInterface
	statusChecker *crstatus.CRStatusChecker
}

func NewRemediationClient(
	client client.Client,
	dryRun bool,
	remediationConfig config.TomlConfig,
) (*FaultRemediationClient, error) {
	// Determine template mount path
	templateMountPath := remediationConfig.Template.MountPath
	if templateMountPath == "" {
		return nil, fmt.Errorf("template mount path is not configured")
	}

	// Pre-load and parse all templates
	templates := make(map[string]*template.Template)

	// Load templates for multi-template actions
	for actionName, maintenanceResource := range remediationConfig.RemediationActions {
		if maintenanceResource.TemplateFileName == "" {
			return nil, fmt.Errorf("remediation action %s is missing template file configuration", actionName)
		}

		tmpl, err := loadAndParseTemplate(templateMountPath, maintenanceResource.TemplateFileName, actionName)
		if err != nil {
			return nil, fmt.Errorf("failed to load template for action %s: %w", actionName, err)
		}

		templates[actionName] = tmpl
	}

	// Validate namespace configuration for namespaced resources
	for actionName, maintenanceResource := range remediationConfig.RemediationActions {
		if maintenanceResource.Scope == "Namespaced" && maintenanceResource.Namespace == "" {
			return nil, fmt.Errorf("remediation action %s is namespaced but missing namespace configuration", actionName)
		}
	}

	ctrlRuntimeRemediationClient := &FaultRemediationClient{
		client: client,
		templates: templates,
		templateMountPath: templateMountPath,
		remediationConfig: remediationConfig,
	}

	if dryRun {
		ctrlRuntimeRemediationClient.dryRunMode = []string{metav1.DryRunAll}
	} else {
		ctrlRuntimeRemediationClient.dryRunMode = []string{}
	}

	// Initialize annotation manager
	ctrlRuntimeRemediationClient.annotationManager = annotation.NewAnnotationManager(client)

	ctrlRuntimeRemediationClient.statusChecker = crstatus.NewCRStatusChecker(
		client,
		remediationConfig.RemediationActions,
		dryRun,
	)

	return ctrlRuntimeRemediationClient, nil
}
Add missing godoc for exported FaultRemediationClient and NewRemediationClient.
Both are exported but lack leading comments (per guidelines).
Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go around lines 43 - 114, Add
proper godoc comments above the exported FaultRemediationClient type and the
NewRemediationClient function: for FaultRemediationClient add a one-line summary
describing its role (e.g., "FaultRemediationClient manages remediation
templates, annotation manager and status checking for remediation actions.") and
for NewRemediationClient add a summary plus brief parameter/return description
(single sentence each) explaining the client, dryRun flag, and remediationConfig
parameters and what the function returns (constructed *FaultRemediationClient or
error). Ensure the comments start immediately above the declarations and follow
Go doc style (begin with the symbol name).
// loadAndParseTemplate loads and parses a template file
func loadAndParseTemplate(mountPath, fileName, templateName string) (*template.Template, error) {
	templatePath := filepath.Join(mountPath, fileName)

	// Check if the template file exists
	if _, err := os.Stat(templatePath); os.IsNotExist(err) {
		return nil, fmt.Errorf("template file does not exist: %s", templatePath)
	}

	// Read and parse the template
	templateContent, err := os.ReadFile(templatePath)
	if err != nil {
		return nil, fmt.Errorf("error reading template file: %w", err)
	}

	tmpl := template.New(templateName)

	tmpl, err = tmpl.Parse(string(templateContent))
	if err != nil {
		return nil, fmt.Errorf("error parsing template: %w", err)
	}

	return tmpl, nil
}
Harden template loading: prevent path traversal; fail on missing keys.
filepath.Join(mountPath, fileName) will accept ../... from config; also text/template defaults can silently render <no value>. Validate fileName is a base name (no separators) and consider tmpl.Option("missingkey=error").
Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go around lines 116 - 139,
Validate and harden loadAndParseTemplate: ensure fileName is a plain base name
(no path separators or parent refs) and refuse values containing "/" or "\" or
".." (or compare filepath.Base(fileName) == fileName), then build the path and
verify the resolved absolute path is inside the mountPath root before reading;
also set the template option to fail on missing keys by calling tmpl =
template.New(templateName).Option("missingkey=error") prior to Parse so
templates error on unknown data.
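A minimal sketch of the two hardening steps suggested above, using only the standard library. The helper names validateTemplateFileName and renderStrict are hypothetical and not part of the PR. Note that missingkey=error affects map lookups; with struct data such as TemplateData, a reference to a nonexistent field already fails at execution time.

```go
package main

import (
	"bytes"
	"fmt"
	"path/filepath"
	"strings"
	"text/template"
)

// validateTemplateFileName (hypothetical helper) rejects names that could
// escape the template mount directory.
func validateTemplateFileName(fileName string) error {
	if fileName == "" || fileName == "." || fileName == ".." {
		return fmt.Errorf("invalid template file name %q", fileName)
	}
	// Reject both separator styles, then confirm the name is already a base name.
	if strings.ContainsAny(fileName, `/\`) || filepath.Base(fileName) != fileName {
		return fmt.Errorf("template file name %q must not contain path separators", fileName)
	}
	return nil
}

// renderStrict (hypothetical helper) parses content and fails if the
// template references a key that is absent from the data map.
func renderStrict(name, content string, data map[string]string) (string, error) {
	tmpl, err := template.New(name).Option("missingkey=error").Parse(content)
	if err != nil {
		return "", fmt.Errorf("error parsing template: %w", err)
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("error rendering template: %w", err)
	}
	return buf.String(), nil
}

func main() {
	fmt.Println(validateTemplateFileName("../secret.yaml") != nil)
	out, err := renderStrict("demo", "node: {{.NodeName}}", map[string]string{"NodeName": "node-1"})
	fmt.Println(out, err)
	_, err = renderStrict("demo", "ns: {{.Namespace}}", map[string]string{})
	fmt.Println(err != nil)
}
```

The base-name check runs before any filesystem access, so a malicious configuration value is rejected without touching the disk.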
func (c *FaultRemediationClient) CreateMaintenanceResource(
	ctx context.Context,
	healthEventData *events.HealthEventData,
) (string, error) {
	healthEvent := healthEventData.HealthEvent
	healthEventID := healthEventData.ID

	// Generate CR name
	crName := fmt.Sprintf("maintenance-%s-%s", healthEvent.NodeName, healthEventID)

	// Skip custom resource creation if dry-run is enabled
	if len(c.dryRunMode) > 0 {
		slog.Info("DRY-RUN: Skipping custom resource creation", "node", healthEvent.NodeName)
		return crName, nil
	}

	recommendedActionName := healthEvent.RecommendedAction.String()

	maintenanceResource, selectedTemplate, actionKey, err :=
		c.selectRemediationActionAndTemplate(recommendedActionName, healthEvent.NodeName)
	if err != nil {
		return "", fmt.Errorf("error selecting remediation action and template: %w", err)
	}

	// Get the node object to extract UID for owner reference
	// This also verifies the node exists before creating CR
	node, err := c.getNodeForOwnerReference(ctx, healthEvent.NodeName)
	if err != nil {
		slog.Warn("Failed to get node for owner reference, skipping CR creation",
			"node", healthEvent.NodeName,
			"error", err)

		return "", fmt.Errorf("failed to get node for owner reference: %w", err)
	}

	slog.Info("Creating maintenance CR",
		"node", healthEvent.NodeName,
		"template", actionKey,
		"nodeUID", node.UID)

	templateData := TemplateData{
		NodeName: healthEvent.NodeName,
		HealthEventID: healthEventID,
		RecommendedAction: healthEvent.RecommendedAction,
		RecommendedActionName: recommendedActionName,

		ApiGroup: maintenanceResource.ApiGroup,
		Version: maintenanceResource.Version,
		Kind: maintenanceResource.Kind,
		Namespace: maintenanceResource.Namespace,
	}

	maintenance, yamlStr, err := renderMaintenanceFromTemplate(selectedTemplate, templateData)
	if err != nil {
		slog.Error("Failed to render maintenance template",
			"template", actionKey,
			"error", err)

		return "", fmt.Errorf("error rendering maintenance template: %w", err)
	}

	slog.Debug("Generated YAML from template",
		"template", actionKey,
		"yaml", yamlStr)

	setNodeOwnerRef(maintenance, node)

	err = c.client.Create(ctx, maintenance)
	if err != nil {
		if apierrors.IsAlreadyExists(err) {
			slog.Info("Maintenance CR already exists for node, treating as success", "CR",
				crName, "node", healthEvent.NodeName)
		} else {
			return "", err
		}
	}

	actualCRName := maintenance.GetName()
	slog.Info("Created Maintenance CR successfully",
		"crName", actualCRName,
		"node", healthEvent.NodeName,
		"template", actionKey)

	group := common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
	if group != "" && c.annotationManager != nil {
		if err = c.annotationManager.UpdateRemediationState(ctx, healthEvent.NodeName,
			group, actualCRName, recommendedActionName); err != nil {
			slog.Warn("Failed to update node annotation", "node", healthEvent.NodeName,
				"error", err)

			return "", err
		}
	}

	return actualCRName, nil
}
Avoid logging full rendered YAML; also ensure CR name/namespace are deterministic.
Line 211-214 logs yamlStr, which can leak secrets/credentials embedded in templates. Additionally, crName is computed but not enforced on the rendered object; AlreadyExists handling can return an empty/incorrect actualCRName if the template uses generateName or omits metadata.name.
Also applies to: 211-214
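One possible way to act on the logging concern, sketched with the standard library only: log the size and a short digest of the rendered document instead of its content. The helper yamlDigest and the sample manifest are hypothetical; the PR does not contain them.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"log/slog"
)

// yamlDigest (hypothetical helper) returns a short SHA-256 fingerprint of
// the rendered content, so log lines can be correlated without exposing
// the content itself.
func yamlDigest(rendered string) string {
	sum := sha256.Sum256([]byte(rendered))
	return hex.EncodeToString(sum[:])[:12]
}

func main() {
	// Hypothetical rendered manifest used only for this demonstration.
	rendered := "apiVersion: v1\nkind: Example\n"

	// Log metadata about the rendered YAML rather than the YAML itself.
	slog.Debug("Generated YAML from template",
		"template", "demo",
		"bytes", len(rendered),
		"sha256", yamlDigest(rendered))

	fmt.Println(len(yamlDigest(rendered)))
}
```

A digest still lets an operator confirm that two reconcile attempts rendered identical output.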
func (c *FaultRemediationClient) launchLogCollectorJob(
	ctx context.Context,
	nodeName string,
	eventUID string,
) (batchv1.Job, ctrl.Result, error) {
	// Read Job manifest
	manifestPath := os.Getenv(LogCollectorManifestPathEnv)
	if manifestPath == "" {
		manifestPath = filepath.Join(c.templateMountPath, "log-collector-job.yaml")
	}

	content, err := os.ReadFile(manifestPath)
	if err != nil {
		metrics.LogCollectorErrors.WithLabelValues("manifest_read_error", nodeName).Inc()
		return batchv1.Job{}, ctrl.Result{}, fmt.Errorf("failed to read log collector manifest: %w", err)
	}

	// Create Job from manifest using strong types
	job := &batchv1.Job{}
	if err = yaml.Unmarshal(content, job); err != nil {
		metrics.LogCollectorErrors.WithLabelValues("manifest_unmarshal_error", nodeName).Inc()
		return batchv1.Job{}, ctrl.Result{}, fmt.Errorf("failed to unmarshal Job manifest: %w", err)
	}

	// Set target node
	job.Spec.Template.Spec.NodeName = nodeName
	labels := map[string]string{
		logCollectorNodeLabel: nodeName,
		logCollectorEventLabel: eventUID,
	}

	job.Labels = labels
	// Get job if exists otherwise create
	existingJobs := &batchv1.JobList{}

	err = c.client.List(
		ctx,
		existingJobs,
		client.MatchingLabels(labels),
		client.InNamespace(job.GetNamespace()),
	)
	if err != nil {
		return batchv1.Job{}, ctrl.Result{}, err
	}

	// There should not be multiple jobs for same event, in this case return error
	// this will then requeue and wait until the jobs clear
	if len(existingJobs.Items) > 1 {
		return batchv1.Job{},
			ctrl.Result{},
			fmt.Errorf("expecting zero or one log collector job per event per node, found %v", existingJobs.Items)
	}

	if len(existingJobs.Items) == 0 {
		err = c.client.Create(ctx, job)
		if err != nil {
			return batchv1.Job{}, ctrl.Result{}, err
		}
		// if created, requeue to check status later
		return batchv1.Job{}, ctrl.Result{RequeueAfter: 10 * time.Second}, nil
	}

	return existingJobs.Items[0], ctrl.Result{}, nil
}
Log-collector Job namespace/labels handling is fragile (can fail or break manifests).
- Line 397-398: client.InNamespace(job.GetNamespace()) with empty namespace can error or list unexpectedly; validate job.Namespace is set (Jobs are namespaced).
- Line 389: job.Labels = labels overwrites manifest labels; merge instead.
- Line 408: error formats full existingJobs.Items (very large + may leak env/config); report names/UIDs instead.
4cea34a to
aaf78c2
Compare
Actionable comments posted: 19
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can't be posted inline due to platform limitations.
Outside diff range comments (10)
fault-remediation/pkg/crstatus/checker.go (2)
15-26: Add package-level documentation.
The package lacks a godoc comment describing its purpose and functionality.
As per coding guidelines, package-level godoc is required for all Go packages. Add documentation explaining that this package provides CR status checking functionality for determining whether remediation CR creation should be skipped.
46-74: Add missing package and function documentation; complete the incomplete comment.
Three issues require attention:
- Line 50: Incomplete comment // note, if should be removed or completed.
- Missing godoc for exported function ShouldSkipCRCreation (required per Go conventions).
- Missing package-level godoc for the crstatus package (required per coding guidelines).
The error handling at lines 69-71 is intentional by design: the function returns bool and is part of an interface that explicitly specifies this return type. The fail-open behavior (logging and returning false on Get error) is a safe default that allows upstream retry logic to proceed.
Proposed improvements
- Add package-level documentation at the top of the file (after copyright header).
- Add godoc for the exported function.
- Remove or complete the incomplete comment at line 50:
-func (c *CRStatusChecker) ShouldSkipCRCreation(ctx context.Context, actionName string, crName string) bool {
+// ShouldSkipCRCreation determines whether creating a new CR should be skipped based on
+// the existence and status of an existing CR for the given action and name.
+func (c *CRStatusChecker) ShouldSkipCRCreation(ctx context.Context, actionName string, crName string) bool {
 	resource, exists := c.remediationActions[actionName]
 	if !exists {
 		slog.Error("No remediation configuration found for action", "action", actionName)
-		return true // note, if
+		return true
 	}
fault-remediation/main.go (1)
103-143: Critical: InitializationParams.config is never set.
The InitializationParams struct expects a config *rest.Config field (as seen in fault-remediation/pkg/initializer/init.go line 38), but this field is never populated before passing params to InitializeAll at line 140. This will cause the initialization to use a nil config, leading to a runtime error when trying to create the Kubernetes client at init.go line 90.
Proposed fix
 func setupCtrlRuntimeManagement(ctx context.Context, params initializer.InitializationParams) error {
 	slog.Info("Running in controller runtime managed mode")
 	cfg := ctrl.GetConfigOrDie()
+	params.config = cfg
 	cfg.Wrap(func(rt http.RoundTripper) http.RoundTripper {
 		return auditlogger.NewAuditingRoundTripper(rt)
 	})
Note: You may also need to export the config field in InitializationParams (change it to Config) to follow Go naming conventions for exported struct fields.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)
17-52: Potential compile blocker: missing annotation2 / remediation2 imports (aliases used but not imported).
This file references annotation2.AnnotationKey and remediation2.NewRemediationClient(...), but those packages/aliases aren't present in the shown import list.
Proposed fix (imports)
 import (
 	"context"
 	"log"
 	"os"
 	"path/filepath"
 	"sync"
 	"testing"
 	"time"
 	"github.com/prometheus/client_golang/prometheus"
 	dto "github.com/prometheus/client_model/go"
 	"github.com/stretchr/testify/assert"
 	"github.com/stretchr/testify/require"
 	corev1 "k8s.io/api/core/v1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/runtime/schema"
 	"k8s.io/client-go/dynamic"
 	"k8s.io/client-go/kubernetes"
 	"k8s.io/client-go/kubernetes/scheme"
 	"k8s.io/client-go/rest"
 	ctrl "sigs.k8s.io/controller-runtime"
 	"sigs.k8s.io/controller-runtime/pkg/envtest"
 	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
+	annotation2 "github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation"
 	"github.com/nvidia/nvsentinel/commons/pkg/statemanager"
 	"github.com/nvidia/nvsentinel/data-models/pkg/model"
 	"github.com/nvidia/nvsentinel/data-models/pkg/protos"
 	"github.com/nvidia/nvsentinel/fault-remediation/pkg/common"
 	"github.com/nvidia/nvsentinel/fault-remediation/pkg/config"
 	"github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
 	"github.com/nvidia/nvsentinel/fault-remediation/pkg/metrics"
+	remediation2 "github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation"
 	"github.com/nvidia/nvsentinel/store-client/pkg/datastore"
 	"github.com/nvidia/nvsentinel/store-client/pkg/testutils"
 )
Also applies to: 166-177, 299-327
179-265: Avoid flakiness: wait for manager cache sync before using mgr.GetClient() in tests.
Without a cache sync barrier, reads via the ctrl-runtime client (used by the remediation/annotation managers) can race startup.
Proposed fix (start manager + wait for cache sync)
 	mgr, err := ctrl.NewManager(testEnv.Config, ctrl.Options{
 		Scheme: scheme.Scheme,
 		Metrics: metricsserver.Options{
 			BindAddress: "0",
 		},
 	})
 	if err != nil {
 		panic(err)
 	}
 	ctrlRuntimeClient = mgr.GetClient()
+	mgrErrCh := make(chan error, 1)
+	go func() {
+		mgrErrCh <- mgr.Start(testContext)
+	}()
+
+	if ok := mgr.GetCache().WaitForCacheSync(testContext); !ok {
+		log.Fatalf("Failed to sync manager cache")
+	}
+
+	// Optional: fail fast if manager exits early
+	select {
+	case err := <-mgrErrCh:
+		if err != nil {
+			log.Fatalf("Manager exited: %v", err)
+		}
+	default:
+	}
 	remediationClient, err := createTestRemediationClient(false)
 	if err != nil {
 		log.Fatalf("Failed to create remediation client: %v", err)
 	}
 	...
-	go func() {
-		if err := mgr.Start(testContext); err != nil {
-			log.Fatalf("Failed to start the test environment manager: %v", err)
-		}
-	}()
Also applies to: 206-216
fault-remediation/pkg/reconciler/reconciler_test.go (2)
17-38: Compile blocker + mock drift: add missing annotation2 import; align log-collector mock signature; set ActionName in mock state.
The file uses annotation2.* types but doesn't import it; also the new production logic relies on eventId and groupState.ActionName, which the mocks currently drop.
Proposed fix (imports + mock signatures)
 import (
 	"context"
 	"errors"
 	"fmt"
+	annotation2 "github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation"
 	"github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
 	corev1 "k8s.io/api/core/v1"
 	ctrl "sigs.k8s.io/controller-runtime"
 	"testing"
 	"time"
 	...
 )
 type MockK8sClient struct {
 	createMaintenanceResourceFn func(ctx context.Context, healthEventData *events.HealthEventData) (string, error)
-	runLogCollectorJobFn func(ctx context.Context, nodeName string) (ctrl.Result, error)
+	runLogCollectorJobFn func(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
 	annotationManagerOverride annotation2.NodeAnnotationManagerInterface
 	realStatusChecker crstatus.CRStatusCheckerInterface
 }
 func (m *MockK8sClient) RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error) {
-	return m.runLogCollectorJobFn(ctx, nodeName)
+	return m.runLogCollectorJobFn(ctx, nodeName, eventId)
 }
 func (m *MockNodeAnnotationManager) GetRemediationState(ctx context.Context, nodeName string) (*annotation2.RemediationStateAnnotation, *corev1.Node, error) {
 	...
 	return &annotation2.RemediationStateAnnotation{
 		EquivalenceGroups: map[string]annotation2.EquivalenceGroupState{
 			"restart": {
 				MaintenanceCR: m.existingCR,
 				CreatedAt: time.Now(),
+				ActionName: protos.RecommendedAction_RESTART_BM.String(),
 			},
 		},
 	}, nil, nil
 }
Also applies to: 40-66, 115-135
461-504: Test comment contradicts expectation (label update failures): update wording or assertion.
The comment says remediation "should still succeed", but the test asserts err != nil. Pick one to match intended behavior.
Proposed fix (comment only)
-	// Even with label update errors, remediation should still succeed
+	// Label update errors should be surfaced to trigger a retry (even if CR creation succeeded).
 	_, err := r.performRemediation(ctx, healthEventDoc)
 	assert.Error(t, err)
fault-remediation/pkg/reconciler/reconciler.go (3)
17-43: Compile blocker: missing annotation2 / remediation2 imports (types referenced).
ReconcilerConfig.RemediationClient remediation2.FaultRemediationClientInterface and annotationManager annotation2.NodeAnnotationManagerInterface require imports that aren't present in the shown import block.
Proposed fix (imports)
 import (
 	"context"
 	"errors"
 	"fmt"
 	"log/slog"
 	"strings"
 	"time"
+	annotation2 "github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation"
 	"github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
 	"github.com/nvidia/nvsentinel/fault-remediation/pkg/metrics"
+	remediation2 "github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation"
 	"k8s.io/client-go/util/workqueue"
 	ctrl "sigs.k8s.io/controller-runtime"
 	...
 )
Also applies to: 45-67, 69-85
245-273: Don't use context.Background() for MarkProcessed (can hang shutdown); prefer ctx or a short timeout.
This aligns with the repo's shutdown/cancellation guidance.
Proposed fix (use ctx with timeout)
 func (r *FaultRemediationReconciler) handleCancellationEvent(...)(ctrl.Result, error) {
 	...
-	if err := watcherInstance.MarkProcessed(context.Background(), resumeToken); err != nil {
+	markCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
+	defer cancel()
+	if err := watcherInstance.MarkProcessed(markCtx, resumeToken); err != nil {
 		metrics.ProcessingErrors.WithLabelValues("mark_processed_error", nodeName).Inc()
 		...
 	}
 	...
 }
 func (r *FaultRemediationReconciler) parseHealthEvent(...)(events.HealthEventDoc, error) {
 	...
-	if markErr := watcherInstance.MarkProcessed(context.Background(), eventWithToken.ResumeToken); markErr != nil {
+	markCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+	defer cancel()
+	if markErr := watcherInstance.MarkProcessed(markCtx, eventWithToken.ResumeToken); markErr != nil {
 		...
 	}
 	...
 }
(If you want cancellation to abort marking, use ctx instead of context.Background() in parseHealthEvent too.)
Also applies to: 459-511
399-457: Hardening: guard GetConfig() and annotationManager nils in checkExistingCRStatus.
A nil tomlConfig (or nil annotationManager) will panic. Returning "allow creation" is typically safer in those cases.
Proposed fix (nil guards)
 tomlConfig := r.Config.RemediationClient.GetConfig()
+if tomlConfig == nil {
+	slog.Warn("Remediation config is nil; allowing creation", "node", nodeName, "action", actionName)
+	return true, "", nil
+}
+if r.annotationManager == nil {
+	slog.Warn("Annotation manager is nil; allowing creation", "node", nodeName)
+	return true, "", nil
+}
Fix all issues with AI agents
In @fault-remediation/main.go:
- Line 111: Replace the TODO comment "//TODO: setup informers for node and job"
with a TODO that references the tracking GitHub issue (e.g., "//
TODO(#<issue-number>): setup informers for node and job") so the work is
traceable; update the exact comment string in fault-remediation/main.go to
include the issue number and a short owner or milestone if applicable.
- Around line 147-151: Ensure cleanup runs even on early returns by registering
the CloseAll defer as soon as a non-nil reconciler exists or by explicitly
closing on error: after calling InitializeAll and receiving (reconciler, err),
if err != nil and reconciler != nil call reconciler.CloseAll(ctx) before
returning; otherwise, if reconciler is non-nil and initialization succeeded,
then immediately defer reconciler.CloseAll(ctx). This guarantees
reconciler.CloseAll is invoked for partially-initialized state while still
deferring cleanup for the successful path.
In @fault-remediation/pkg/annotation/annotation_interface.go:
- Line 1: Add a package-level godoc comment for the package "annotation" at the
top of annotation_interface.go (and mirror in annotation.go) describing the
package purpose and public API; insert a concise paragraph comment immediately
above the "package annotation" declaration that explains what the annotation
package provides, typical usage, and any important behavior or invariants so it
satisfies Go's package-level documentation guideline.
In @fault-remediation/pkg/annotation/annotation.go:
- Line 40: Replace the bare TODO in annotation.go with a TODO that references a
tracking issue (e.g., "TODO(#1234): split return of node and state" or a full
issue URL) so it complies with guidelines; update the comment near the package
annotation (or the function that returns both node and state) to reference the
created issue ID/URL or remove the TODO if you instead implement the change now.
- Line 1: Add a package-level godoc comment immediately above the "package
annotation" declaration in annotation.go: write a concise, complete-sentence
comment that begins with "Package annotation ..." and describes the package's
purpose, responsibilities, and intended usage (mention exported types or
functions briefly if applicable) following Go documentation conventions.
In @fault-remediation/pkg/crstatus/checker.go:
- Around line 34-44: Add a godoc comment for the exported constructor
NewCRStatusChecker that explains its purpose (constructs and returns a
*CRStatusChecker), describes the parameters (client client.Client,
remediationActions map[string]config.MaintenanceResource, dryRun bool) and
states the return value (*CRStatusChecker); place the comment immediately above
the NewCRStatusChecker function and follow Go doc conventions (start the comment
with "NewCRStatusChecker" and keep it concise).
In @fault-remediation/pkg/crstatus/crstatus_interface.go:
- Around line 9-11: The exported interface CRStatusCheckerInterface and its
method ShouldSkipCRCreation lack godoc and parameter names; add a package-level
or type-level comment for CRStatusCheckerInterface and a method comment for
ShouldSkipCRCreation describing purpose and behavior, and update the method
signature to include parameter names (e.g., ctx context.Context, namespace
string, name string) so callers and linters have clear documentation and
readable parameters.
In @fault-remediation/pkg/events/health_event.go:
- Around line 1-3: Add a package-level godoc comment immediately above "package
events" in health_event.go that briefly describes this package's purpose, e.g.,
that package events defines event data structures for health/monitoring events
and documents the different serialization tags (JSON, protobuf, etc.) used on
those types β so the package has a clear, guideline-compliant description.
In @fault-remediation/pkg/initializer/init.go:
- Around line 90-93: The error returned from kubernetes.NewForConfig in init.go
is returned raw; wrap it with context per guidelines. Replace the direct return
of err after calling kubernetes.NewForConfig(params.config) so that the function
returns fmt.Errorf("creating kube client with provided config: %w", err) (or
similar contextual message), ensuring you import fmt if not already and keep the
kubeClient, err := kubernetes.NewForConfig(params.config) assignment intact.
In @fault-remediation/pkg/metrics/metrics.go:
- Line 30: Replace the bare TODO comment in metrics.go ("//TODO: evaluate and
remove redundant metrics with ctrl-runtime defaults") with a TODO that
references a tracking GitHub issue (e.g., "// TODO(#<issue-number>): evaluate
and remove redundant metrics with ctrl-runtime defaults") so the task is
tracked; update the comment text to include the issue number and, if helpful, a
short owner or ETA tag.
In @fault-remediation/pkg/reconciler/reconciler_e2e_test.go:
- Around line 299-327: The test uses a relative "./templates" path which breaks
when go test runs from a different working directory; in
createTestRemediationClient resolve the Template.MountPath to an absolute path
anchored to the test file by using runtime.Caller to get the test file directory
and filepath.Join to build the templates path, then set
remediationConfig.Template.MountPath to that resolved path before calling
remediation2.NewRemediationClient; update any imports accordingly (runtime,
path/filepath).
In @fault-remediation/pkg/reconciler/reconciler.go:
- Around line 193-243: The bug is that performRemediation treats any
post-CR-step error (like a node-label update failure) as overall remediation
failure, which can set FaultRemediated=false even though
CreateMaintenanceResource succeeded; change the error flow in performRemediation
so CR creation success is honored: call
RemediationClient.CreateMaintenanceResource and capture
createMaintenanceResourceError, then attempt UpdateNVSentinelStateNodeLabel; if
the label update fails but createMaintenanceResourceError is nil, log/emit
metrics but return the created crName and nil error (do not join/return the
label error); only return an error (or joined errors) when
CreateMaintenanceResource itself failed (or both creation and label update
failed), referencing the symbols performRemediation, CreateMaintenanceResource,
createMaintenanceResourceError, UpdateNVSentinelStateNodeLabel,
remediationLabelValue and ensure the caller's FaultRemediated logic can rely on
performRemediationErr==nil meaning the CR was created.
In @fault-remediation/pkg/remediation/fault_remediation_client_interface.go:
- Around line 28-34: Add a godoc comment for the exported interface
FaultRemediationClientInterface: write a concise sentence starting with
"FaultRemediationClientInterface ..." that describes the interface's purpose and
behavior, and optionally a brief phrase listing what its key methods do
(CreateMaintenanceResource, RunLogCollectorJob, GetAnnotationManager,
GetStatusChecker, GetConfig). Place the comment immediately above the type
declaration so it satisfies Go export documentation guidelines.
- Line 31: The interface refers to annotation2.NodeAnnotationManagerInterface
but the import alias is missing; add the import alias declaration annotation2
"github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation" to the file's
import block so the GetAnnotationManager() return type resolves correctly (this
aligns with usage in GetAnnotationManager and other files like remediation.go
and reconciler.go).
In @fault-remediation/pkg/remediation/remediation.go:
- Line 462: Replace the bare "// todo" comment next to the nolint directive with
a TODO that references an existing issue or ticket (e.g., change
"//nolint:nestif // todo" to something like "//nolint:nestif // TODO: track ref
#<issue-number> or https://.../issues/<id>"), so update the comment in
remediation.go where the "//nolint:nestif // todo" appears to include the issue
identifier or URL.
- Line 148: Replace the bare "// nolint: cyclop // todo" comment with a TODO
that references a concrete issue/ID and a short rationale so it complies with
coding guidelines; for example change the "nolint: cyclop" comment to include
"TODO: <issue-number> - <brief reason>" (keeping the nolint directive), ensuring
the comment associated with the nolint: cyclop directive documents the issue
tracker ID and why the cyclomatic complexity exemption is needed.
- Line 500: Replace the bare "//nolint:nestif // todo" comment with a TODO that
references the tracking issue or ticket (e.g., "TODO: link to ISSUE-NNNN" or
include the issue URL) so it complies with the Go coding guidelines; update the
nolint line (the "//nolint:nestif" comment adjacent to the remediation
function/block) to include a clear TODO with the issue identifier and optionally
a short rationale so future maintainers can find the related discussion.
- Line 553: Replace the bare "//nolint:nestif // todo" comment with a TODO that
references a real issue ID or URL (for example "//nolint:nestif // TODO: issue
#12345" or "//nolint:nestif // TODO: https://tracker/PROJECT/issues/12345") so
the intent and follow-up are traceable; if no issue exists, create one
describing why the nested-if needs refactoring and reference that new issue in
the TODO comment where the current "//nolint:nestif // todo" appears in
remediation.go.
- Line 1: The package lacks a package-level godoc comment: add a concise package
comment immediately above the "package remediation" declaration that describes
the purpose and responsibilities of the remediation package (what it does, key
abstractions or intended use), so tools like godoc and linters will pick it up;
ensure the comment starts with "Package remediation" to follow Go documentation
conventions.
Minor comments (7)
fault-remediation/pkg/annotation/annotation_test.go-160-167 (1)
160-167: Assert error from ClearRemediationState.
The test should verify that ClearRemediationState succeeded before checking the annotation was removed.
Proposed fix
 err := annotationManager.ClearRemediationState(context.TODO(), nodeName)
+assert.NoError(t, err)
 err = client.Get(context.TODO(), types.NamespacedName{
fault-remediation/pkg/annotation/annotation_test.go-199-201 (1)
199-201: Add error assertion before using state.
Similar to TestUpdateRemediationState, verify that GetRemediationState succeeded before accessing the result.
Proposed fix
 state, _, err := annotationManager.GetRemediationState(context.TODO(), nodeName)
+assert.NoError(t, err)
 assert.NotContains(t, state.EquivalenceGroups, removedGroup)
 assert.Contains(t, state.EquivalenceGroups, notRemovedGroup)
fault-remediation/pkg/annotation/annotation_test.go-140-143 (1)
140-143: Add error assertion before using state.
The test should verify that GetRemediationState succeeded before accessing state.EquivalenceGroups to avoid a potential nil pointer dereference.
Proposed fix
 state, _, err := annotationManager.GetRemediationState(context.TODO(), nodeName)
+assert.NoError(t, err)
 assert.Contains(t, state.EquivalenceGroups, group)
 assert.Equal(t, crName, state.EquivalenceGroups[group].MaintenanceCR)
 assert.Equal(t, actionName, state.EquivalenceGroups[group].ActionName)
fault-remediation/pkg/annotation/annotation.go-50-57 (1)
50-57: Unmarshal failure silently returns empty state.
When JSON unmarshal fails (line 51), the error is logged but not returned, causing the caller to receive an empty state instead of being notified of the corruption. This could mask data integrity issues in the annotation.
Consider returning the unmarshal error to the caller so they can handle corrupted annotations appropriately.
Proposed fix
 var state RemediationStateAnnotation
 if err = json.Unmarshal([]byte(annotationValue), &state); err != nil {
-    slog.Error("Failed to unmarshal annotation", "node", nodeName, "error", err)
-    // Return empty state if unmarshal fails
-    return &RemediationStateAnnotation{
-        EquivalenceGroups: make(map[string]EquivalenceGroupState),
-    }, node, nil
+    return nil, node, fmt.Errorf("failed to unmarshal remediation state annotation: %w", err)
 }
fault-remediation/pkg/annotation/annotation.go-66-73 (1)
66-73: Dead code: state assignment followed by immediate error return.
Lines 69-71 assign an empty state, but line 72 immediately returns the error, making the assignment unreachable. Either remove the assignment and return early, or remove the return to use the empty state as a fallback.
Proposed fix (option 1: return early)
 // Get current state
 state, node, err := m.GetRemediationState(ctx, nodeName)
 if err != nil {
-    slog.Warn("Failed to get current remediation state", "node", nodeName, "error", err)
-    state = &RemediationStateAnnotation{
-        EquivalenceGroups: make(map[string]EquivalenceGroupState),
-    }
-    return err
+    return fmt.Errorf("failed to get current remediation state: %w", err)
 }
Proposed fix (option 2: use fallback and continue)
 // Get current state
 state, node, err := m.GetRemediationState(ctx, nodeName)
 if err != nil {
     slog.Warn("Failed to get current remediation state", "node", nodeName, "error", err)
     state = &RemediationStateAnnotation{
         EquivalenceGroups: make(map[string]EquivalenceGroupState),
     }
-    return err
 }
fault-remediation/pkg/reconciler/reconciler.go-275-277 (1)
275-277: Replace `// nolint: cyclop // todo` with an issue reference (or remove if no longer needed).
As per coding guidelines, TODOs should reference issues.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go-368-370 (1)
368-370: TODOs should be tracked (and "ignore error" should be justified) to avoid hiding regressions.
Repo guideline: TODOs should reference issues. As per coding guidelines, consider turning these into a small helper that asserts the expected state transitions, or at least link an issue ID.
Also applies to: 430-432, 488-491, 519-521, 571-574
Nitpick comments (15)
fault-remediation/pkg/events/health_event.go (1)
5-15: Consider enhancing type documentation.
While basic comments exist for the exported types, they could be more descriptive about the intended use cases and the difference between the two types.
Enhanced documentation example
-// HealthEventDoc represents health event data with JSON "_id" tag for document-based storage.
+// HealthEventDoc represents a health event suitable for JSON serialization.
+// The ID field uses the json:"_id" tag for document-based storage systems.
 type HealthEventDoc struct {
     ID string `json:"_id"`
     model.HealthEventWithStatus `json:",inline"`
 }

-// HealthEventData represents health event data with string ID for compatibility
+// HealthEventData represents a health event suitable for BSON serialization.
+// The ID field uses the bson:"_id,omitempty" tag for MongoDB compatibility.
 type HealthEventData struct {
     ID string `bson:"_id,omitempty"`
     model.HealthEventWithStatus `bson:",inline"`
 }
fault-remediation/pkg/initializer/init.go (1)
96-96: Clarify log message.
The log message "Successfully initialized client" is vague given that multiple clients are initialized (remediationClient, kubeClient). Consider being more specific about which clients were initialized or using a more general message like "Successfully initialized Kubernetes clients".
fault-remediation/pkg/remediation/remediation_test.go (1)
29-97: Consider using standard testing package assertions for simple checks.
Per repository learnings, avoid introducing the testify dependency for simple equality/inequality checks. Use `t.Error`, `t.Errorf`, `t.Fatal` for straightforward assertions like nil checks and equality comparisons.
Based on learnings, the repository prefers standard testing package assertions for simple scenarios.
fault-remediation/pkg/remediation/remediation.go (7)
216-224: Wrap error with context before returning.
Line 222 returns the error directly without wrapping it with context. This makes debugging harder as the error loses the context of where it occurred.
Proposed fix
 err = c.client.Create(ctx, maintenance)
 if err != nil {
     if apierrors.IsAlreadyExists(err) {
         slog.Info("Maintenance CR already exists for node, treating as success", "CR", crName, "node", healthEvent.NodeName)
     } else {
-        return "", err
+        return "", fmt.Errorf("failed to create maintenance CR: %w", err)
     }
 }
As per coding guidelines, wrap errors with context using `fmt.Errorf("context: %w", err)`.
232-241: Wrap error with context before returning.
Line 239 returns the error from UpdateRemediationState without additional context. Consider wrapping it to indicate the failure occurred during annotation update after CR creation.
Proposed fix
 group := common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
 if group != "" && c.annotationManager != nil {
     if err = c.annotationManager.UpdateRemediationState(ctx, healthEvent.NodeName, group, actualCRName, recommendedActionName); err != nil {
         slog.Warn("Failed to update node annotation", "node", healthEvent.NodeName, "error", err)
-        return "", err
+        return "", fmt.Errorf("failed to update node annotation after CR creation: %w", err)
     }
 }
As per coding guidelines, wrap errors with context using `fmt.Errorf("context: %w", err)`.
392-400: Wrap error with context before returning.
Line 399 returns the error from `c.client.List` without wrapping it with context. This makes debugging harder.
Proposed fix
 err = c.client.List(
     ctx,
     existingJobs,
     client.MatchingLabels(labels),
     client.InNamespace(job.GetNamespace()),
 )
 if err != nil {
-    return batchv1.Job{}, ctrl.Result{}, err
+    return batchv1.Job{}, ctrl.Result{}, fmt.Errorf("failed to list existing log collector jobs: %w", err)
 }
As per coding guidelines, wrap errors with context using `fmt.Errorf("context: %w", err)`.
410-417: Wrap error with context before returning.
Line 413 returns the error from `c.client.Create` without wrapping it with context.
Proposed fix
 if len(existingJobs.Items) == 0 {
     err = c.client.Create(ctx, job)
     if err != nil {
-        return batchv1.Job{}, ctrl.Result{}, err
+        return batchv1.Job{}, ctrl.Result{}, fmt.Errorf("failed to create log collector job: %w", err)
     }
     // if created, requeue to check status later
     return batchv1.Job{}, ctrl.Result{RequeueAfter: 10 * time.Second}, nil
 }
As per coding guidelines, wrap errors with context using `fmt.Errorf("context: %w", err)`.
467-478: Wrap error with context before returning.
Line 476 returns the error from `c.client.Update` without wrapping it with context. This error is returned from within the metrics-recording logic for completed jobs.
Proposed fix
 err := c.client.Update(ctx, updateJob)
 if err != nil {
-    return false, err
+    return false, fmt.Errorf("failed to update job annotation for metrics tracking: %w", err)
 }
As per coding guidelines, wrap errors with context using `fmt.Errorf("context: %w", err)`.
505-516: Wrap error with context before returning.
Line 514 returns the error from `c.client.Update` without wrapping it with context. This error is returned from within the metrics-recording logic for failed jobs.
Proposed fix
 err := c.client.Update(ctx, updateJob)
 if err != nil {
-    return false, err
+    return false, fmt.Errorf("failed to update job annotation for metrics tracking: %w", err)
 }
As per coding guidelines, wrap errors with context using `fmt.Errorf("context: %w", err)`.
557-568: Wrap error with context before returning.
Line 565 returns the error from `c.client.Update` without wrapping it with context. This error is returned from within the metrics-recording logic for timed-out jobs.
Proposed fix
 err := c.client.Update(ctx, updateJob)
 if err != nil {
-    return false, err
+    return false, fmt.Errorf("failed to update job annotation for metrics tracking: %w", err)
 }
As per coding guidelines, wrap errors with context using `fmt.Errorf("context: %w", err)`.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)
1324-1340: Annotation cleanup may need conflict retry (optional).
If the reconciler updates the same Node concurrently, `Nodes().Update(...)` can conflict; consider `retry.RetryOnConflict` around the update even in tests to reduce flakes.
fault-remediation/pkg/reconciler/reconciler_test.go (3)
575-612: Test likely no longer matches production behavior: log collector on `NONE`.
`FaultRemediationReconciler.runLogCollector(...)` returns early on `RecommendedAction_NONE`, so this test's "Start loop behavior" simulation is misleading. Consider removing it or rewriting it to assert the log collector is not invoked for NONE.
614-679: Good coverage for log-collector gating/error paths; consider asserting `eventId` propagation.
Once the mock accepts `eventId`, these tests can validate it's passed through (regression-proofing the retry/dedup logic).
Also applies to: 681-705, 707-743
821-884: Dedup tests don't exercise the status-checker path (fields like `crSucceeded` are currently unused).
Consider adding a real `StatusChecker` mock and asserting `ShouldSkipCRCreation(...)` is called with the stored `ActionName` + CR name so the new cross-action semantics are actually verified.
Also applies to: 886-957
fault-remediation/pkg/reconciler/reconciler.go (1)
569-597: Optional: consider buffering `out` in `AdaptEvents` to reduce backpressure.
With an unbuffered `out`, a slow controller can block the goroutine and indirectly block the watcher read loop.
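A minimal sketch of the buffering idea, using a generic adapter over `int` events. `adaptEvents` and the buffer size are hypothetical; the real `AdaptEvents` signature and event type are not shown in this review.

```go
package main

import "fmt"

// adaptEvents forwards values from in to a buffered output channel so that a
// briefly slow consumer does not immediately block the forwarding goroutine.
// The buffer size is an assumed tuning parameter.
func adaptEvents(in <-chan int, buffer int) <-chan int {
	out := make(chan int, buffer)
	go func() {
		defer close(out)
		for ev := range in {
			out <- ev
		}
	}()
	return out
}

func main() {
	in := make(chan int)
	out := adaptEvents(in, 4)
	go func() {
		for i := 0; i < 3; i++ {
			in <- i
		}
		close(in)
	}()
	sum := 0
	for ev := range out {
		sum += ev
	}
	fmt.Println(sum) // prints 3
}
```

A buffer only absorbs short bursts; a consumer that stays slow will still exert backpressure once the buffer fills.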
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files ignored due to path filters (1)
- commons/go.sum is excluded by !**/*.sum
Files selected for processing (29)
- .gitignore
- commons/go.mod
- distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
- distros/kubernetes/nvsentinel/values-full.yaml
- distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
- distros/kubernetes/nvsentinel/values-tilt.yaml
- distros/kubernetes/nvsentinel/values.yaml
- fault-remediation/main.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/annotation/annotation_test.go
- fault-remediation/pkg/crstatus/checker.go
- fault-remediation/pkg/crstatus/crstatus_interface.go
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/reconciler/annotation.go
- fault-remediation/pkg/reconciler/annotation_test.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/remediation.go
- fault-remediation/pkg/reconciler/remediation_test.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/remediation/templates/log-collector-job.yaml
- fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
Files with no reviewable changes (8)
- distros/kubernetes/nvsentinel/values-tilt.yaml
- distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
- distros/kubernetes/nvsentinel/values-full.yaml
- fault-remediation/pkg/reconciler/remediation_test.go
- fault-remediation/pkg/reconciler/annotation_test.go
- distros/kubernetes/nvsentinel/values.yaml
- fault-remediation/pkg/reconciler/remediation.go
- fault-remediation/pkg/reconciler/annotation.go
Files skipped from review as they are similar to previous changes (3)
- commons/go.mod
- .gitignore
- fault-remediation/pkg/remediation/templates/log-collector-job.yaml
Additional context used
Path-based instructions (2)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
fault-remediation/pkg/crstatus/crstatus_interface.go
fault-remediation/pkg/annotation/annotation_test.go
fault-remediation/pkg/events/health_event.go
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/annotation/annotation_interface.go
fault-remediation/pkg/metrics/metrics.go
fault-remediation/pkg/remediation/fault_remediation_client_interface.go
fault-remediation/pkg/crstatus/crstatus_test.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/initializer/init.go
fault-remediation/pkg/crstatus/checker.go
fault-remediation/main.go
fault-remediation/pkg/annotation/annotation.go
fault-remediation/pkg/reconciler/reconciler_test.go
fault-remediation/pkg/remediation/remediation.go
fault-remediation/pkg/reconciler/reconciler.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
fault-remediation/pkg/annotation/annotation_test.go
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/crstatus/crstatus_test.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
Learnings (17)
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages
Applied to files:
fault-remediation/pkg/crstatus/crstatus_interface.go
fault-remediation/pkg/events/health_event.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
fault-remediation/pkg/crstatus/crstatus_interface.go
fault-remediation/pkg/events/health_event.go
fault-remediation/pkg/metrics/metrics.go
fault-remediation/pkg/remediation/fault_remediation_client_interface.go
fault-remediation/pkg/remediation/remediation.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code
Applied to files:
fault-remediation/pkg/crstatus/crstatus_interface.go
fault-remediation/pkg/events/health_event.go
fault-remediation/pkg/metrics/metrics.go
fault-remediation/pkg/remediation/fault_remediation_client_interface.go
fault-remediation/main.go
fault-remediation/pkg/annotation/annotation.go
fault-remediation/pkg/remediation/remediation.go
Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
fault-remediation/pkg/annotation/annotation_test.go
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/annotation/annotation_interface.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/annotation/annotation.go
fault-remediation/pkg/reconciler/reconciler_test.go
fault-remediation/pkg/remediation/remediation.go
fault-remediation/pkg/reconciler/reconciler.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
fault-remediation/pkg/annotation/annotation_test.go
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/crstatus/crstatus_test.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
fault-remediation/pkg/events/health_event.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/reconciler/reconciler.go
Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
fault-remediation/pkg/events/health_event.go
fault-remediation/pkg/reconciler/reconciler.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/initializer/init.go
fault-remediation/pkg/reconciler/reconciler_test.go
Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
fault-remediation/pkg/metrics/metrics.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
fault-remediation/pkg/reconciler/reconciler.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code
Applied to files:
fault-remediation/pkg/metrics/metrics.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/initializer/init.go
Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/reconciler/reconciler.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Applied to files:
fault-remediation/pkg/initializer/init.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
fault-remediation/main.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
fault-remediation/main.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
fault-remediation/pkg/reconciler/reconciler_test.go
Code graph analysis (10)
fault-remediation/pkg/annotation/annotation_test.go (2)
  fault-remediation/pkg/annotation/annotation_interface.go (3)
    RemediationStateAnnotation(23-25), AnnotationKey(11-11), EquivalenceGroupState(28-34)
  fault-remediation/pkg/annotation/annotation.go (1)
    NodeAnnotationManager(16-18)
fault-remediation/pkg/remediation/remediation_test.go (1)
  fault-remediation/pkg/remediation/remediation.go (2)
    NewRemediationClient(55-113), FaultRemediationClient(42-53)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
  fault-remediation/pkg/events/health_event.go (1)
    HealthEventData(12-15)
  fault-remediation/pkg/annotation/annotation_interface.go (1)
    NodeAnnotationManagerInterface(15-20)
fault-remediation/pkg/crstatus/crstatus_test.go (1)
  fault-remediation/pkg/crstatus/checker.go (1)
    NewCRStatusChecker(34-44)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (7)
  health-monitors/kubernetes-object-monitor/pkg/cel/environment.go (1)
    Environment(30-35)
  health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
    NewManager(41-43)
  fault-remediation/pkg/config/config.go (3)
    TomlConfig(59-71), Template(47-50), MaintenanceResource(27-44)
  fault-remediation/pkg/remediation/remediation.go (1)
    NewRemediationClient(55-113)
  commons/pkg/statemanager/statemanager.go (3)
    NewStateManager(206-210), StateManager(197-200), RemediatingLabelValue(171-171)
  fault-remediation/pkg/events/health_event.go (1)
    HealthEventDoc(6-9)
  fault-remediation/pkg/metrics/metrics.go (4)
    TotalEventsReceived(33-38), EventHandlingDuration(62-68), EventsProcessed(39-45), ProcessingErrors(46-52)
fault-remediation/pkg/initializer/init.go (2)
  fault-remediation/pkg/remediation/remediation.go (1)
    NewRemediationClient(55-113)
  commons/pkg/statemanager/statemanager.go (2)
    NewStateManager(206-210), StateManager(197-200)
fault-remediation/pkg/crstatus/checker.go (1)
  fault-remediation/pkg/config/config.go (1)
    MaintenanceResource(27-44)
fault-remediation/pkg/annotation/annotation.go (1)
  fault-remediation/pkg/annotation/annotation_interface.go (3)
    RemediationStateAnnotation(23-25), AnnotationKey(11-11), EquivalenceGroupState(28-34)
fault-remediation/pkg/reconciler/reconciler_test.go (2)
  fault-remediation/pkg/events/health_event.go (2)
    HealthEventData(12-15), HealthEventDoc(6-9)
  fault-remediation/pkg/annotation/annotation_interface.go (3)
    NodeAnnotationManagerInterface(15-20), RemediationStateAnnotation(23-25), EquivalenceGroupState(28-34)
fault-remediation/pkg/reconciler/reconciler.go (5)
  fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
    FaultRemediationClientInterface(28-34)
  fault-remediation/pkg/annotation/annotation_interface.go (1)
    NodeAnnotationManagerInterface(15-20)
  fault-remediation/pkg/metrics/metrics.go (3)
    EventHandlingDuration(62-68), TotalEventsReceived(33-38), ProcessingErrors(46-52)
  commons/pkg/statemanager/statemanager.go (1)
    StateManager(197-200)
  fault-remediation/pkg/events/health_event.go (2)
    HealthEventDoc(6-9), HealthEventData(12-15)
YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
[error] 15-15: syntax error: expected , but found ''
(syntax)
Additional comments (12)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)
1-20: LGTM! Template structure is correct.
The YAMLlint syntax error is a false positive: Go template syntax (e.g., `{{.ApiGroup}}`) is not valid YAML until rendered. The template correctly defines a RebootNode CR with appropriate placeholders for dynamic substitution.
fault-remediation/pkg/crstatus/crstatus_test.go (1)
34-34: Constructor update aligns with controller-runtime migration.
The constructor call correctly reflects the new signature in checker.go, which now accepts a single controller-runtime client instead of separate dynamic client and REST mapper parameters.
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)
96-111: LGTM! Probe configuration correctly simplified for unconditional controller-runtime usage.
The changes remove conditional logic from liveness and readiness probes, fixing them to use the "health" port and "/readyz" path. This aligns with the broader migration to exclusively use controller-runtime initialization.
fault-remediation/pkg/annotation/annotation_test.go (1)
15-120: Well-structured table-driven test.
The test provides comprehensive coverage for GetRemediationState, including error cases (node not found), missing annotations, invalid JSON, and valid state parsing. Good use of table-driven testing pattern.
fault-remediation/pkg/metrics/metrics.go (1)
29-93: LGTM: Metrics extraction and export.
The extraction of metrics into a dedicated package with exported identifiers is a clean refactoring that improves modularity. The metric definitions (names, labels, help text) are preserved correctly.
fault-remediation/main.go (1)
95-98: The `config` field in `InitializationParams` does not need to be exported. The field is only used within the `initializer` package itself (at `init.go:90`), and it is not being set from outside the package in `main.go`. The field is left as a zero value (nil) in the instantiation at `main.go:89-93`, so exporting it would not be necessary.
Likely an incorrect or invalid review comment.
fault-remediation/pkg/initializer/init.go (1)
81-94: Initialization flow with controller-runtime is correct.
Verification confirms that downstream code properly uses the controller-runtime client for CR operations (annotation management, status checking) through the RemediationClient, and the Kubernetes clientset for node label updates through StateManager. Both clients serve their intended purposes without mixing concerns.
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
36-49: LGTM!
The TemplateData struct is well-documented with clear section comments explaining the purpose of each field group.
fault-remediation/pkg/remediation/remediation.go (3)
55-113: LGTM!
The constructor properly validates configuration, pre-loads templates, and returns errors to the caller. Error wrapping follows Go best practices using `fmt.Errorf` with `%w`.
115-138: LGTM!
Template loading includes proper error handling with context wrapping using `fmt.Errorf` with `%w`, following Go best practices.
338-355: LGTM!
The function properly orchestrates the log collector workflow and returns errors to trigger controller retries, aligning with the PR's goal to "throw errors to trigger retries."
fault-remediation/pkg/reconciler/reconciler.go (1)
169-191: No changes needed.
`ctrl.Result.IsZero()` is available in controller-runtime v0.22.4, and the current approach of returning `ctrl.Result{}` for empty results is idiomatic and compatible with the pinned version.
    slog.Info("Running in controller runtime managed mode")

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        //TODO: setup informers for node and job
Refactor suggestion | Major
TODO comment should reference an issue.
Per coding guidelines, TODO comments should reference a GitHub issue for tracking.
Based on coding guidelines.
Prompt for AI Agents
In @fault-remediation/main.go at line 111, Replace the TODO comment "//TODO:
setup informers for node and job" with a TODO that references the tracking
GitHub issue (e.g., "// TODO(#<issue-number>): setup informers for node and
job") so the work is traceable; update the exact comment string in
fault-remediation/main.go to include the issue number and a short owner or
milestone if applicable.
package events

import "github.com/nvidia/nvsentinel/data-models/pkg/model"
Refactor suggestion | Major
Add package-level documentation.
The package lacks a godoc comment describing its purpose.
As per coding guidelines, package-level godoc is required for all Go packages. Add documentation explaining that this package defines event data structures for health events with different serialization tags.
Proposed documentation
+// Package events provides data structures for representing health events
+// with support for different serialization formats (JSON and BSON).
 package events
Committable suggestion
IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-package events
-import "github.com/nvidia/nvsentinel/data-models/pkg/model"
+// Package events provides data structures for representing health events
+// with support for different serialization formats (JSON and BSON).
+package events
+import "github.com/nvidia/nvsentinel/data-models/pkg/model"
Prompt for AI Agents
In @fault-remediation/pkg/events/health_event.go around lines 1 - 3, Add a
package-level godoc comment immediately above "package events" in
health_event.go that briefly describes this package's purpose, e.g., that
package events defines event data structures for health/monitoring events and
documents the different serialization tags (JSON, protobuf, etc.) used on those
types, so the package has a clear, guideline-compliant description.
// createTestRemediationClient creates a real FaultRemediationClient for e2e tests
func createTestRemediationClient(dryRun bool) (*FaultRemediationClient, error) {

    // Create discovery client for RESTMapper
    discoveryClient, err := discovery.NewDiscoveryClientForConfig(testRestConfig)
    if err != nil {
        return nil, err
    }

    cachedClient := memory.NewMemCacheClient(discoveryClient)
    mapper := restmapper.NewDeferredDiscoveryRESTMapper(cachedClient)

    templatePath := filepath.Join("templates", "rebootnode-template.yaml")
    templateContent, err := os.ReadFile(templatePath)
    if err != nil {
        return nil, err
    }

    tmpl, err := template.New("maintenance").Parse(string(templateContent))
    if err != nil {
        return nil, err
    }

    // Create remediation config with the test template
func createTestRemediationClient(dryRun bool) (remediation2.FaultRemediationClientInterface, error) {
    remediationConfig := config.TomlConfig{
        Template: config.Template{
            MountPath: "./templates",
            FileName: "rebootnode-template.yaml",
        },
        RemediationActions: map[string]config.MaintenanceResource{
            "RESTART_BM": {
                ApiGroup: "janitor.dgxc.nvidia.com",
                Version: "v1alpha1",
                Kind: "RebootNode",
                TemplateFileName: "test.yaml",
                TemplateFileName: "rebootnode-template.yaml",
                CompleteConditionType: "NodeReady",
                EquivalenceGroup: "restart",
            },
            "COMPONENT_RESET": {
                ApiGroup: "janitor.dgxc.nvidia.com",
                Version: "v1alpha1",
                Kind: "RebootNode",
                TemplateFileName: "gpu-reset.yaml",
                TemplateFileName: "rebootnode-template.yaml",
                CompleteConditionType: "NodeReady",
                EquivalenceGroup: "restart",
            },
        },
    }

    // Create templates map
    templates := map[string]*template.Template{
        "RESTART_BM": tmpl,
        "COMPONENT_RESET": tmpl, // Use same template for testing
    }

    client := &FaultRemediationClient{
        clientset: testDynamic,
        kubeClient: testClient,
        restMapper: mapper,
        remediationConfig: remediationConfig,
        templates: templates,
        templateMountPath: "/tmp",
        annotationManager: NewNodeAnnotationManager(testClient),
        statusChecker: crstatus.NewCRStatusChecker(testDynamic, mapper, remediationConfig.RemediationActions, dryRun),
    }

    if dryRun {
        client.dryRunMode = []string{metav1.DryRunAll}
    } else {
        client.dryRunMode = []string{}
    }

    return client, nil
    return remediation2.NewRemediationClient(ctrlRuntimeClient, dryRun, remediationConfig)
}
Make template paths robust: `"./templates"` is sensitive to the `go test` working directory.
This is a common CI-only failure mode; prefer resolving relative to the test file location.
Proposed fix (resolve mount path via runtime.Caller)
+import (
+    ...
+    "runtime"
+    ...
+)
 func createTestRemediationClient(dryRun bool) (remediation2.FaultRemediationClientInterface, error) {
+    _, thisFile, _, _ := runtime.Caller(0)
+    templatesDir := filepath.Join(filepath.Dir(thisFile), "templates")
+
     remediationConfig := config.TomlConfig{
         Template: config.Template{
-            MountPath: "./templates",
+            MountPath: templatesDir,
             FileName: "rebootnode-template.yaml",
         },
         RemediationActions: map[string]config.MaintenanceResource{
             ...
         },
     }
     return remediation2.NewRemediationClient(ctrlRuntimeClient, dryRun, remediationConfig)
 }
Committable suggestion skipped: line range outside the PR's diff.
Prompt for AI Agents
In @fault-remediation/pkg/reconciler/reconciler_e2e_test.go around lines 299 -
327, The test uses a relative "./templates" path which breaks when go test runs
from a different working directory; in createTestRemediationClient resolve the
Template.MountPath to an absolute path anchored to the test file by using
runtime.Caller to get the test file directory and filepath.Join to build the
templates path, then set remediationConfig.Template.MountPath to that resolved
path before calling remediation2.NewRemediationClient; update any imports
accordingly (runtime, path/filepath).
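As a standalone illustration of the `runtime.Caller` approach suggested above, the following minimal sketch resolves a directory relative to the source file rather than the process working directory. The helper name `templatesDir` is hypothetical and not part of the PR, and unlike the proposed fix it checks the `ok` result instead of discarding it.

```go
package main

import (
	"fmt"
	"path/filepath"
	"runtime"
)

// templatesDir returns the path of a "templates" directory located next to
// this source file, using the file path recorded at compile time. The result
// does not depend on the working directory `go test` happens to run in.
// The directory is not checked for existence (illustrative helper only).
func templatesDir() (string, error) {
	_, thisFile, _, ok := runtime.Caller(0)
	if !ok {
		return "", fmt.Errorf("runtime.Caller could not identify the calling file")
	}

	return filepath.Join(filepath.Dir(thisFile), "templates"), nil
}

func main() {
	dir, err := templatesDir()
	if err != nil {
		fmt.Println("error:", err)
		return
	}

	fmt.Println(dir)
}
```

One caveat under this approach: binaries built with `-trimpath` record shortened file paths, so the result may not be an absolute path in that case.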
@@ -0,0 +1,582 @@
package remediation
Refactor suggestion | Major
Add package-level godoc comment.
As per coding guidelines, package-level godoc is required for all Go packages.
Suggested package documentation
+// Package remediation provides functionality for managing fault remediation workflows,
+// including maintenance resource creation, log collection, and remediation state tracking.
 package remediation
Committable suggestion
IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-package remediation
+// Package remediation provides functionality for managing fault remediation workflows,
+// including maintenance resource creation, log collection, and remediation state tracking.
+package remediation
Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go at line 1, The package
lacks a package-level godoc comment: add a concise package comment immediately
above the "package remediation" declaration that describes the purpose and
responsibilities of the remediation package (what it does, key abstractions or
intended use), so tools like godoc and linters will pick it up; ensure the
comment starts with "Package remediation" to follow Go documentation
conventions.
    return c.statusChecker
}

// nolint: cyclop // todo
Refactor suggestion | Major
TODO comment should reference an issue.
As per coding guidelines, TODO comments should reference issues in Go code.
As per coding guidelines, TODO comments should reference issues.
Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go at line 148, Replace the
bare "// nolint: cyclop // todo" comment with a TODO that references a concrete
issue/ID and a short rationale so it complies with coding guidelines; for
example change the "nolint: cyclop" comment to include "TODO: <issue-number> -
<brief reason>" (keeping the nolint directive), ensuring the comment associated
with the nolint: cyclop directive documents the issue tracker ID and why the
cyclomatic complexity exemption is needed.
    conditions []metav1.Condition,
) (bool, error) {
    completeCondition := meta.FindStatusCondition(conditions, string(batchv1.JobComplete))
    //nolint:nestif // todo
Refactor suggestion | Major
TODO comment should reference an issue.
As per coding guidelines, TODO comments should reference issues in Go code.
As per coding guidelines, TODO comments should reference issues.
Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go at line 462, Replace the
bare "// todo" comment next to the nolint directive with a TODO that references
an existing issue or ticket (e.g., change "//nolint:nestif // todo" to something
like "//nolint:nestif // TODO: track ref #<issue-number> or
https://.../issues/<id>"), so update the comment in remediation.go where the
"//nolint:nestif // todo" appears to include the issue identifier or URL.
) (bool, error) {
    // check if failed
    failedCondition := meta.FindStatusCondition(conditions, string(batchv1.JobFailed))
    //nolint:nestif // todo
Refactor suggestion | Major
TODO comment should reference an issue.
As per coding guidelines, TODO comments should reference issues in Go code.
As per coding guidelines, TODO comments should reference issues.
Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go at line 500, Replace the
bare "//nolint:nestif // todo" comment with a TODO that references the tracking
issue or ticket (e.g., "TODO: link to ISSUE-NNNN" or include the issue URL) so
it complies with the Go coding guidelines; update the nolint line (the
"//nolint:nestif" comment adjacent to the remediation function/block) to include
a clear TODO with the issue identifier and optionally a short rationale so
future maintainers can find the related discussion.
    }

    // check timeout
    //nolint:nestif // todo
Refactor suggestion | Major
TODO comment should reference an issue.
As per coding guidelines, TODO comments should reference issues in Go code.
As per coding guidelines, TODO comments should reference issues.
Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go at line 553, Replace the
bare "//nolint:nestif // todo" comment with a TODO that references a real issue
ID or URL (for example "//nolint:nestif // TODO: issue #12345" or
"//nolint:nestif // TODO: https://tracker/PROJECT/issues/12345") so the intent
and follow-up are traceable; if no issue exists, create one describing why the
nested-if needs refactoring and reference that new issue in the TODO comment
where the current "//nolint:nestif // todo" appears in remediation.go.
59aab7a to 5947ed5
Actionable comments posted: 13
Caution
Some comments are outside the diff and can't be posted inline due to platform limitations.
Outside diff range comments (7)
fault-remediation/pkg/initializer/init.go (1)
40-45: Unexported `config` field in exported struct prevents external initialization.
The `config` field is unexported (lowercase `c`), but `InitializationParams` is an exported struct. External callers cannot set this field when constructing the params. Either export the field or provide a constructor function.
Proposed fix - export the field
 type InitializationParams struct {
-    config *rest.Config
+    Config *rest.Config
     TomlConfigPath string
     DryRun bool
     EnableLogCollector bool
 }
Then update usages within this file to use `params.Config` instead of `params.config`.
fault-remediation/main.go (1)
54-54: `kubeconfigPath` flag is parsed but never used.
The `--kubeconfig` flag is parsed into `kubeconfigPath` but the code uses `ctrl.GetConfigOrDie()` instead, which ignores this flag. Either use `kubeconfigPath` via `clientcmd.BuildConfigFromFlags` or remove the unused flag.
Proposed fix - use the flag
 func setupCtrlRuntimeManagement(ctx context.Context, params initializer.InitializationParams) error {
     slog.Info("Running in controller runtime managed mode")
-    cfg := ctrl.GetConfigOrDie()
+    cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
+    if err != nil {
+        return fmt.Errorf("failed to build config: %w", err)
+    }
     cfg.Wrap(func(rt http.RoundTripper) http.RoundTripper {
This requires importing `"k8s.io/client-go/tools/clientcmd"`.
Also applies to: 183-183
fault-remediation/pkg/crstatus/checker.go (1)
28-76: Nil-client safety: `ShouldSkipCRCreation` can panic in tests or miswired setups.
Line 70: if `c.client` is nil, `c.client.Get(...)` will panic. Consider a defensive guard returning "don't skip" (i.e., allow create) with a warn log.
Proposed fix
 func (c *CRStatusChecker) ShouldSkipCRCreation(ctx context.Context, actionName string, crName string) bool {
+    if c.client == nil {
+        slog.Warn("CRStatusChecker client is nil, allowing create", "crName", crName, "action", actionName)
+        return false
+    }
fault-remediation/pkg/reconciler/reconciler.go (3)
395-459: Backward-compat: `groupState.ActionName` may be empty for older annotations.
Line 443-445: if the stored annotation predates `ActionName`, `storedActionName` can be empty, making status checks fail open/oddly. Consider falling back to current `actionName`.
Proposed fix
-    storedActionName := groupState.ActionName
+    storedActionName := groupState.ActionName
+    if storedActionName == "" {
+        storedActionName = actionName
+    }
     shouldSkip := statusChecker.ShouldSkipCRCreation(ctx, storedActionName, groupState.MaintenanceCR)
463-513: Incorrect error message on parse failure (and TODO/nolint issue refs).
Line 492: returning `fmt.Errorf("error updating resume token: %w", err)` is misleading (the failure is parsing, not marking processed). Also, if any nolint/TODO tags exist in this file, they should reference an issue per guidelines.
Proposed fix
-    return result, fmt.Errorf("error updating resume token: %w", err)
+    return result, fmt.Errorf("error parsing health event: %w", err)
524-538: Nil safety in `CloseAll`: `r.ds`/`r.Watcher` can be nil.
Given tests construct the reconciler with `ds=nil`, `CloseAll` should guard against nil before calling `Close`.
Proposed fix
 func (r *FaultRemediationReconciler) CloseAll(ctx context.Context) error {
     var errs []error
-    if err := r.ds.Close(ctx); err != nil {
+    if r.ds != nil {
+        if err := r.ds.Close(ctx); err != nil {
         slog.Error("failed to close datastore", "error", err)
         errs = append(errs, err)
-    }
+        }
+    }
-    if err := r.Watcher.Close(ctx); err != nil {
+    if r.Watcher != nil {
+        if err := r.Watcher.Close(ctx); err != nil {
         slog.Error("failed to close Watcher", "error", err)
         errs = append(errs, err)
-    }
+        }
+    }
     return errors.Join(errs...)
 }
fault-remediation/pkg/reconciler/reconciler_test.go (1)
572-610: Test does not reflect actual reconciler behavior for NONE actions with log collector enabled.
The test manually calls `RunLogCollectorJob` and expects it to be invoked for `RecommendedAction_NONE` events when `EnableLogCollector` is true (line 604-606). However, the production `runLogCollector` method returns early for NONE actions and never reaches the call to `RunLogCollectorJob`. Update the test to either: (1) remove the expectation that `RunLogCollectorJob` is called for NONE actions, or (2) adjust production logic if NONE should actually trigger log collection.
Fix all issues with AI agents
In @fault-remediation/main.go:
- Around line 89-93: The InitializationParams constructed in main builds
TomlConfigPath/DryRun/EnableLogCollector but omits the Config field, so
InitializeAll (which calls kubernetes.NewForConfig(params.Config)) receives nil;
fix by setting the Config field to the controller config (cfg returned from
ctrl.GetConfigOrDie()) when creating the InitializationParams instance or
otherwise pass cfg into setupCtrlRuntimeManagement/InitializeAll; update the
struct literal that creates params to include Config: cfg (or the correct
exported field name) so kubernetes.NewForConfig(params.Config) no longer
dereferences nil.
In @fault-remediation/pkg/annotation/annotation_test.go:
- Around line 200-202: After calling
annotationManager.GetRemediationState(context.TODO(), nodeName) you must assert
the call returned no error before inspecting state; add an assertion like
assert.NoError(t, err) (or require.NoError if you prefer test termination)
immediately after the GetRemediationState call so subsequent checks against
state, removedGroup and notRemovedGroup are valid.
- Around line 140-143: After calling annotationManager.GetRemediationState,
assert the returned error is nil to avoid masking failures, e.g., right after
state, _, err := annotationManager.GetRemediationState(context.TODO(), nodeName)
add assert.NoError(t, err) (or require.NoError(t, err)) before the subsequent
assertions on state. Ensure you use the same test assertion package already in
the file.
In @fault-remediation/pkg/annotation/annotation.go:
- Around line 62-102: UpdateRemediationState: wrap the entire
get-modify-patch read-modify-write cycle in
retry.RetryOnConflict(retry.DefaultBackoff, func() error { ... }) inside
UpdateRemediationState; inside the retry closure call m.GetRemediationState(ctx,
nodeName) on each attempt, modify state.EquivalenceGroups[group] as before,
marshal to JSON, set node.Annotations[AnnotationKey] on a DeepCopy of the
freshly fetched node, then call m.client.Patch(ctx, node,
client.MergeFrom(patchBase)) and return that error from the closure; after the
retry call return any error from RetryOnConflict. Also add the import for
k8s.io/client-go/util/retry and ensure existing log statements remain
outside/after the retry as appropriate.
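The read-modify-write pattern described in this item can be sketched without any Kubernetes dependency. Everything below is a hypothetical stand-in: `store` is a toy in-memory object store with optimistic concurrency, `errConflict` plays the role of a 409 Conflict, and `retryOnConflict` only mimics the shape of client-go's `retry.RetryOnConflict`; none of it is the PR's code. The point it illustrates is that the fetch happens inside the closure, so each attempt works from a fresh copy, and the conflict error is returned unwrapped so the retry helper can still recognize it.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's 409 Conflict response.
var errConflict = errors.New("conflict: object was modified")

// node is a simplified, hypothetical stand-in for a Kubernetes Node.
type node struct {
	resourceVersion int
	annotations     map[string]string
}

// store is a toy in-memory API server. racesLeft simulates that many
// concurrent writers slipping in just before our update lands.
type store struct {
	current   node
	racesLeft int
}

// get returns a deep copy of the stored node.
func (s *store) get() node {
	c := node{resourceVersion: s.current.resourceVersion, annotations: map[string]string{}}
	for k, v := range s.current.annotations {
		c.annotations[k] = v
	}
	return c
}

// update succeeds only if the caller's copy is based on the latest version.
func (s *store) update(n node) error {
	if s.racesLeft > 0 {
		s.racesLeft--
		s.current.resourceVersion++ // another writer got there first
	}
	if n.resourceVersion != s.current.resourceVersion {
		return errConflict
	}
	n.resourceVersion++
	s.current = n
	return nil
}

// retryOnConflict reruns fn while it returns errConflict, up to attempts times.
func retryOnConflict(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

// setAnnotation keeps the whole get-modify-update cycle inside the closure,
// so every attempt starts from a freshly read object.
func setAnnotation(s *store, key, value string) error {
	return retryOnConflict(5, func() error {
		n := s.get()
		n.annotations[key] = value
		return s.update(n) // returned unwrapped so the conflict stays detectable
	})
}

func main() {
	s := &store{current: node{annotations: map[string]string{}}, racesLeft: 2}
	err := setAnnotation(s, "example.com/state", "v1")
	fmt.Println("error:", err, "value:", s.current.annotations["example.com/state"])
}
```

With two simulated races the first two attempts conflict and the third succeeds; if the number of races exceeds the attempt budget, the conflict error is surfaced to the caller.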
In @fault-remediation/pkg/crstatus/crstatus_interface.go:
- Around line 9-11: The exported CRStatusCheckerInterface and its method
ShouldSkipCRCreation lack documentation and unnamed parameters; add a
package-style godoc comment for CRStatusCheckerInterface and a descriptive
comment for ShouldSkipCRCreation, and update the method signature to use named
parameters (e.g., ShouldSkipCRCreation(ctx context.Context, namespace string,
resourceName string) bool) so callers see parameter intent; ensure comments
describe purpose, parameters, and return value and follow Go doc conventions.
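A minimal sketch of what a documented interface with named parameters could look like follows. The parameter names `actionName` and `crName` are taken from the `ShouldSkipCRCreation` signature quoted in the outside-diff comment on checker.go; the doc-comment wording, the `inProgressSet` implementation, and the `demo` helper are assumptions for illustration only and do not describe the real checker's semantics.

```go
package main

import (
	"context"
	"fmt"
)

// CRStatusCheckerInterface decides whether a new maintenance CR needs to be
// created. (Doc wording is an assumption, not the project's actual comment.)
type CRStatusCheckerInterface interface {
	// ShouldSkipCRCreation reports whether creating a new CR can be skipped
	// because the CR named crName, created for actionName, is still being
	// handled. It returns false when creation should proceed.
	ShouldSkipCRCreation(ctx context.Context, actionName string, crName string) bool
}

// inProgressSet is a toy implementation backed by a set of CR names that are
// considered in progress.
type inProgressSet map[string]bool

// ShouldSkipCRCreation returns true only for CR names present in the set.
func (s inProgressSet) ShouldSkipCRCreation(_ context.Context, _ string, crName string) bool {
	return s[crName]
}

// demo exercises the interface with a background context.
func demo(crName string) bool {
	var checker CRStatusCheckerInterface = inProgressSet{"reboot-node-a": true}
	return checker.ShouldSkipCRCreation(context.Background(), "RESTART_BM", crName)
}

func main() {
	fmt.Println(demo("reboot-node-a"), demo("reboot-node-b"))
}
```

Naming the parameters costs nothing at the call site but makes the two string arguments hard to swap by accident when reading the interface alone.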
In @fault-remediation/pkg/initializer/init.go:
- Around line 90-92: The error message is misleading: update the fmt.Errorf call
that wraps err (the return nil, fmt.Errorf("error while initializing ctrl
runtime client: %w", err)) to reference the remediation client instead; change
the message to something like "error while initializing remediation client: %w"
so the wrapped error (err) correctly reflects the remediation client
initialization failure.
In @fault-remediation/pkg/metrics/metrics.go:
- Line 30: Update the TODO comment on the redundant metrics line to include the
issue tracker reference: change the bare "//TODO: evaluate and remove redundant
metrics with ctrl-runtime defaults" to include the issue ID (e.g. "// TODO:
evaluate and remove redundant metrics with ctrl-runtime defaults (ISSUE-1234)")
so the comment references the corresponding issue; ensure you replace ISSUE-1234
with the actual issue number and keep the comment format consistent with other
TODOs in the repo.
In @fault-remediation/pkg/reconciler/reconciler_e2e_test.go:
- Around line 368-373: Replace the bare "TODO:" comments in
reconciler_e2e_test.go with tracked-issue references per repo guidelines: find
the calls near UpdateNVSentinelStateNodeLabel(...) and r.performRemediation(...)
(and the other TODO locations around lines shown) and update each TODO to
include an issue ID or URL (e.g., "TODO: ISSUE-1234 - reason") or create the
corresponding issue and reference it; ensure every TODO now includes the issue
identifier and a brief one-line context so reviewers can find the associated
ticket.
In @fault-remediation/pkg/reconciler/reconciler.go:
- Around line 170-193: The runLogCollector function incorrectly wraps a nil
error when RunLogCollectorJob returns a non-empty ctrl.Result but nil err,
causing normal requeue to be treated as an error; change the logic around
r.Config.RemediationClient.RunLogCollectorJob so you only call fmt.Errorf and
return an error when err != nil (log and wrap err via fmt.Errorf("%w", err)),
and otherwise return the returned result with a nil error (i.e., return result,
nil) from runLogCollector.
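The failure mode in this item is easy to reproduce in isolation: `fmt.Errorf` with `%w` and a nil operand still returns a non-nil error, so a plain requeue request turns into a reported failure. The sketch below uses hypothetical stand-ins (`result` for `ctrl.Result`, `runJob` for `RunLogCollectorJob`) and is not the reconciler's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// result is a hypothetical stand-in for controller-runtime's ctrl.Result.
type result struct{ requeue bool }

// runJob is a hypothetical job runner that asks for a requeue without failing.
func runJob() (result, error) { return result{requeue: true}, nil }

// buggy wraps err whenever the result is non-empty, even when err is nil.
func buggy() (result, error) {
	res, err := runJob()
	if err != nil || res.requeue {
		return res, fmt.Errorf("log collector: %w", err)
	}
	return res, nil
}

// fixed wraps only real errors and passes a requeue through with a nil error.
func fixed() (result, error) {
	res, err := runJob()
	if err != nil {
		return result{}, fmt.Errorf("log collector: %w", err)
	}
	return res, nil
}

func main() {
	_, err := buggy()
	fmt.Println("buggy returned an error:", err != nil,
		"| wrapped cause is nil:", errors.Unwrap(err) == nil)

	res, err := fixed()
	fmt.Println("fixed requeue:", res.requeue, "| error:", err)
}
```

In the fixed variant the caller can distinguish "try again later" from "something went wrong", which is what a controller's requeue handling relies on.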
In @fault-remediation/pkg/remediation/remediation_test.go:
- Around line 377-381: The "Skip creation with dry run" test case fails because
it omits templateDir so RunLogCollectorJob errors while loading templates
instead of exercising dry-run logic; update that test case to include a valid
templateDir (e.g., the same template path used by other cases or a
testdata/templates fixture) so the template loads successfully and the test
reaches the dry-run branch, leaving dryRun: true and expectedError: false
unchanged.
In @fault-remediation/pkg/remediation/remediation.go:
- Around line 149-245: The file has an unreferenced nolint TODO and a
non-deterministic CR name: update the `// nolint: cyclop // todo` comment (and
any other `//nolint:* // todo` occurrences) to include a tracking issue/ID
reference, and make CR naming deterministic in CreateMaintenanceResource by
ensuring the rendered object has a name: after calling
renderMaintenanceFromTemplate(...) and before setNodeOwnerRef(...), check if
maintenance.GetName() is empty and if so call maintenance.SetName(crName) so
that Create(ctx, maintenance) will not fail if the template omitted
metadata.name.
- Around line 423-454: The checkLogCollectorStatus currently treats found==true
as an error and wraps nil into an error; update checkLogCollectorStatus so each
call to checkLogCollectorComplete, checkLogCollectorFailed, and
checkLogCollectorTimedOut only returns a wrapped error when err != nil, and if
found==true return success (e.g., return ctrl.Result{}, nil) rather than an
error; specifically change the three blocks that now read "if found || err !=
nil { return ..., fmt.Errorf(...%w, err) }" to first check "if err != nil {
return ctrl.Result{}, fmt.Errorf(...%w, err) }" and then "if found { return
ctrl.Result{}, nil }" so that completed/failed/timed-out jobs are treated as
terminal success by checkLogCollectorStatus.
- Around line 456-579: The annotation check and timestamp usage are unsafe: in
checkLogCollectorComplete, checkLogCollectorFailed and checkLogCollectorTimedOut
you currently gate metrics with job.Annotations != nil &&
job.Annotations[jobMetricsAlreadyCountedAnnotation] != trueStringVal which skips
recording forever when Annotations is nil, and you dereference
job.Status.StartTime / CompletionTime without nil checks risking panics. Fix by
reading the annotation value safely (val, ok :=
job.Annotations[jobMetricsAlreadyCountedAnnotation]; if !ok || val !=
trueStringVal { ... }) and still ensure updateJob.Annotations is initialized
before writing; and compute durations only after guarding StartTime (if
job.Status.StartTime != nil) and CompletionTime (if non-nil use
CompletionTime.Sub(start) else use time.Since(start)), falling back to a safe
zero or timeout-based duration if StartTime is nil; apply these changes in
functions checkLogCollectorComplete, checkLogCollectorFailed, and
checkLogCollectorTimedOut referencing jobMetricsAlreadyCountedAnnotation and
trueStringVal.
Nitpick comments (12)
fault-remediation/pkg/events/health_event.go (1)
11-15: Clarify the purpose of HealthEventData in the comment. The comment "for compatibility" is vague. Consider specifying what it is compatible with (e.g., BSON serialization for MongoDB storage).
Proposed improvement
-// HealthEventData represents health event data with string ID for compatibility
+// HealthEventData represents health event data with BSON tags for MongoDB storage compatibility.
 type HealthEventData struct {
 	ID string `bson:"_id,omitempty"`
 	model.HealthEventWithStatus `bson:",inline"`
 }
fault-remediation/pkg/annotation/annotation_test.go (1)
104-117: Simplify control flow by removing `else` after `return`. The `else` block is unnecessary after `return` in the error case.
Suggested refactor
 if tt.expectError {
 	assert.Error(t, err)
 	return
-} else {
-	assert.NoError(t, err)
-	assert.Equal(t, tt.node.Name, node.Name)
-	assert.Equal(t, len(tt.expectedState.EquivalenceGroups), len(resultState.EquivalenceGroups))
-	for expectedKey, expectedValue := range tt.expectedState.EquivalenceGroups {
-		assert.Equal(t, expectedValue.MaintenanceCR, resultState.EquivalenceGroups[expectedKey].MaintenanceCR)
-		assert.Equal(t, expectedValue.ActionName, resultState.EquivalenceGroups[expectedKey].ActionName)
-		assert.Equal(t, expectedValue.CreatedAt.Unix(), resultState.EquivalenceGroups[expectedKey].CreatedAt.Unix())
-	}
 }
+assert.NoError(t, err)
+assert.Equal(t, tt.node.Name, node.Name)
+assert.Equal(t, len(tt.expectedState.EquivalenceGroups), len(resultState.EquivalenceGroups))
+for expectedKey, expectedValue := range tt.expectedState.EquivalenceGroups {
+	assert.Equal(t, expectedValue.MaintenanceCR, resultState.EquivalenceGroups[expectedKey].MaintenanceCR)
+	assert.Equal(t, expectedValue.ActionName, resultState.EquivalenceGroups[expectedKey].ActionName)
+	assert.Equal(t, expectedValue.CreatedAt.Unix(), resultState.EquivalenceGroups[expectedKey].CreatedAt.Unix())
+}
 })
fault-remediation/pkg/remediation/remediation_test.go (1)
521-526: Asserting `RequeueAfter` on error cases may be unreliable. When `tt.expectedError` is true, the `result` struct may not be properly initialized. Consider only asserting `RequeueAfter` for non-error cases.
Suggested refactor
 if tt.expectedError {
 	assert.Error(t, err)
-} else {
+	return
+}
 	assert.NoError(t, err)
-}
 assert.Equal(t, result.RequeueAfter, tt.requeueTime)
fault-remediation/main.go (1)
111-111: TODO comment should reference an issue. As per coding guidelines, TODO comments should reference issues in Go code.
fault-remediation/pkg/annotation/annotation.go (2)
1-1: Missing package-level godoc. As per coding guidelines, package-level godoc is required for all Go packages.
Proposed fix
+// Package annotation provides utilities for managing remediation state
+// annotations on Kubernetes nodes.
 package annotation
40-40: TODO comment should reference an issue. As per coding guidelines, TODO comments should reference issues in Go code.
fault-remediation/pkg/annotation/annotation_interface.go (1)
1-1: Missing package-level godoc. As per coding guidelines, package-level godoc is required for all Go packages. Since this file defines the interface, it would be appropriate to add the package doc here.
Proposed fix
+// Package annotation provides interfaces and types for managing remediation
+// state annotations on Kubernetes nodes.
 package annotation
fault-remediation/pkg/remediation/remediation.go (3)
1-115: Add package-level godoc + exported API comments (repo guideline).
Line 1: `package remediation` should have a `// Package remediation ...` comment, and exported funcs/methods (Line 56, 141, 145, 150, 340, 581) should have doc comments. As per coding guidelines.
49-54: `templates` map comment mismatches actual key; `dryRunMode []string` looks like accidental complexity.
Line 49: comment says "map from template file name", but the key used is `actionName` (Line 71-82). Consider fixing the comment and switching `dryRunMode` to `dryRun bool` (it's only used as a boolean flag in this file). Also applies to: 56-114
358-421: Don't overwrite Job labels from manifest; merge instead (may break selectors/ownership).
Line 384-390: `job.Labels = labels` discards labels from the YAML manifest (which might be required by selectors/GC/ops tooling). Prefer merging.
fault-remediation/pkg/reconciler/reconciler_test.go (2)
41-63: Mock signature mismatch: `runLogCollectorJobFn` ignores `eventId` (can hide bugs).
Line 44-55: `RunLogCollectorJob(ctx, nodeName, eventId)` drops `eventId` when calling the mock function. Recommend including `eventId` in `runLogCollectorJobFn` so tests can assert it's passed through correctly.
818-954: Deduplication tests don't actually exercise status-check behavior.
`TestCRBasedDeduplication` / `TestCrossActionRemediationWithEquivalenceGroups` set `crSucceeded` etc., but the mock provides no `CRStatusCheckerInterface`, so assertions reduce to "allow create when status checker is nil". Consider adding a simple mock implementing `ShouldSkipCRCreation` to cover both "skip" and "allow retry" paths.
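The Job-label nitpick for remediation.go (lines 358-421) above suggests merging rather than overwriting. A minimal sketch of that merge follows; `mergeLabels` is a hypothetical helper name and the label keys are made up for illustration, so this shows the pattern rather than the PR's actual code.

```go
package main

import "fmt"

// mergeLabels returns a new map holding the manifest labels plus the
// controller-supplied ones. Controller labels win on key conflicts, and
// neither input map is mutated. Ranging over a nil map is a no-op in Go.
func mergeLabels(manifest, extra map[string]string) map[string]string {
	merged := make(map[string]string, len(manifest)+len(extra))
	for k, v := range manifest {
		merged[k] = v
	}
	for k, v := range extra {
		merged[k] = v
	}
	return merged
}

func main() {
	// Hypothetical labels: the first set stands for what the YAML manifest
	// declares, the second for what the controller adds at runtime.
	fromManifest := map[string]string{"app": "log-collector", "team": "ops"}
	fromController := map[string]string{"node": "node-a", "event-id": "abc123"}

	fmt.Println(mergeLabels(fromManifest, fromController))
}
```

With this shape, `job.Labels = mergeLabels(job.Labels, labels)` would keep anything the manifest declared while still applying the runtime labels.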
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files ignored due to path filters (1)
`commons/go.sum` is excluded by `!**/*.sum`
Files selected for processing (29)
.gitignore
commons/go.mod
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
distros/kubernetes/nvsentinel/values-full.yaml
distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
distros/kubernetes/nvsentinel/values-tilt.yaml
distros/kubernetes/nvsentinel/values.yaml
fault-remediation/main.go
fault-remediation/pkg/annotation/annotation.go
fault-remediation/pkg/annotation/annotation_interface.go
fault-remediation/pkg/annotation/annotation_test.go
fault-remediation/pkg/crstatus/checker.go
fault-remediation/pkg/crstatus/crstatus_interface.go
fault-remediation/pkg/crstatus/crstatus_test.go
fault-remediation/pkg/events/health_event.go
fault-remediation/pkg/initializer/init.go
fault-remediation/pkg/metrics/metrics.go
fault-remediation/pkg/reconciler/annotation.go
fault-remediation/pkg/reconciler/annotation_test.go
fault-remediation/pkg/reconciler/reconciler.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
fault-remediation/pkg/reconciler/remediation.go
fault-remediation/pkg/reconciler/remediation_test.go
fault-remediation/pkg/remediation/fault_remediation_client_interface.go
fault-remediation/pkg/remediation/remediation.go
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/remediation/templates/log-collector-job.yaml
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
Files with no reviewable changes (8)
- distros/kubernetes/nvsentinel/values-full.yaml
- fault-remediation/pkg/reconciler/remediation_test.go
- distros/kubernetes/nvsentinel/values-tilt.yaml
- distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
- distros/kubernetes/nvsentinel/values.yaml
- fault-remediation/pkg/reconciler/annotation_test.go
- fault-remediation/pkg/reconciler/remediation.go
- fault-remediation/pkg/reconciler/annotation.go
Files skipped from review due to trivial changes (1)
- commons/go.mod
Files skipped from review as they are similar to previous changes (3)
- .gitignore
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/remediation/templates/log-collector-job.yaml
Additional context used
Path-based instructions (2)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go:
Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
fault-remediation/pkg/initializer/init.go
fault-remediation/main.go
fault-remediation/pkg/annotation/annotation.go
fault-remediation/pkg/remediation/fault_remediation_client_interface.go
fault-remediation/pkg/metrics/metrics.go
fault-remediation/pkg/annotation/annotation_test.go
fault-remediation/pkg/remediation/remediation.go
fault-remediation/pkg/events/health_event.go
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/crstatus/checker.go
fault-remediation/pkg/annotation/annotation_interface.go
fault-remediation/pkg/crstatus/crstatus_interface.go
fault-remediation/pkg/reconciler/reconciler.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go:
Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
fault-remediation/pkg/annotation/annotation_test.go
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
Learnings (21)
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Applied to files:
fault-remediation/pkg/initializer/init.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
fault-remediation/pkg/initializer/init.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.
Applied to files:
fault-remediation/pkg/initializer/init.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
fault-remediation/main.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
fault-remediation/main.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code
Applied to files:
fault-remediation/main.go
fault-remediation/pkg/annotation/annotation.go
fault-remediation/pkg/metrics/metrics.go
fault-remediation/pkg/remediation/remediation.go
fault-remediation/pkg/events/health_event.go
fault-remediation/pkg/annotation/annotation_interface.go
fault-remediation/pkg/crstatus/crstatus_interface.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
fault-remediation/main.go
Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
fault-remediation/pkg/annotation/annotation.go
fault-remediation/pkg/annotation/annotation_test.go
fault-remediation/pkg/remediation/remediation.go
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/annotation/annotation_interface.go
fault-remediation/pkg/reconciler/reconciler.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages
Applied to files:
fault-remediation/pkg/annotation/annotation.go
fault-remediation/pkg/remediation/remediation.go
fault-remediation/pkg/events/health_event.go
fault-remediation/pkg/annotation/annotation_interface.go
fault-remediation/pkg/crstatus/crstatus_interface.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
fault-remediation/pkg/annotation/annotation.go
fault-remediation/pkg/remediation/fault_remediation_client_interface.go
fault-remediation/pkg/metrics/metrics.go
fault-remediation/pkg/remediation/remediation.go
fault-remediation/pkg/events/health_event.go
fault-remediation/pkg/crstatus/checker.go
fault-remediation/pkg/annotation/annotation_interface.go
fault-remediation/pkg/crstatus/crstatus_interface.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code
Applied to files:
fault-remediation/pkg/annotation/annotation.go
fault-remediation/pkg/metrics/metrics.go
fault-remediation/pkg/remediation/remediation.go
fault-remediation/pkg/annotation/annotation_interface.go
Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
fault-remediation/pkg/metrics/metrics.go
fault-remediation/pkg/reconciler/reconciler.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
fault-remediation/pkg/annotation/annotation_test.go
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
fault-remediation/pkg/events/health_event.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
fault-remediation/pkg/events/health_event.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
fault-remediation/pkg/remediation/remediation_test.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Health monitors should include Helm chart in `distros/kubernetes/nvsentinel/charts/`
Applied to files:
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
fault-remediation/pkg/reconciler/reconciler_test.go
Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Code graph analysis (9)
fault-remediation/pkg/initializer/init.go (2)
- fault-remediation/pkg/remediation/remediation.go (1): NewRemediationClient (56-114)
- commons/pkg/statemanager/statemanager.go (2): NewStateManager (206-210), StateManager (197-200)
fault-remediation/pkg/annotation/annotation.go (1)
- fault-remediation/pkg/annotation/annotation_interface.go (3): RemediationStateAnnotation (24-26), AnnotationKey (12-12), EquivalenceGroupState (29-35)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (3)
- fault-remediation/pkg/events/health_event.go (1): HealthEventData (12-15)
- fault-remediation/pkg/annotation/annotation_interface.go (1): NodeAnnotationManagerInterface (16-21)
- fault-remediation/pkg/crstatus/crstatus_interface.go (1): CRStatusCheckerInterface (9-11)
fault-remediation/pkg/annotation/annotation_test.go (2)
- fault-remediation/pkg/annotation/annotation_interface.go (3): RemediationStateAnnotation (24-26), AnnotationKey (12-12), EquivalenceGroupState (29-35)
- fault-remediation/pkg/annotation/annotation.go (1): NodeAnnotationManager (16-18)
fault-remediation/pkg/remediation/remediation_test.go (4)
- fault-remediation/pkg/config/config.go (2): Template (47-50), MaintenanceResource (27-44)
- data-models/pkg/protos/health_event.pb.go (2): RecommendedAction_RESTART_BM (48-48), RecommendedAction_COMPONENT_RESET (44-44)
- fault-remediation/pkg/remediation/remediation.go (2): NewRemediationClient (56-114), FaultRemediationClient (43-54)
- fault-remediation/pkg/events/health_event.go (1): HealthEventData (12-15)
fault-remediation/pkg/crstatus/checker.go (1)
- fault-remediation/pkg/config/config.go (1): MaintenanceResource (27-44)
fault-remediation/pkg/reconciler/reconciler.go (6)
- fault-remediation/pkg/annotation/annotation_interface.go (1): NodeAnnotationManagerInterface (16-21)
- fault-remediation/pkg/metrics/metrics.go (7): EventHandlingDuration (62-68), TotalEventsReceived (33-38), TotalUnsupportedRemediationActions (53-59), ProcessingErrors (46-52), EventsProcessed (39-45), CRStatusSkipped (26-26), CRStatusCreated (25-25)
- data-models/pkg/protos/health_event.pb.go (1): RecommendedAction_NONE (43-43)
- fault-remediation/pkg/events/health_event.go (2): HealthEventDoc (6-9), HealthEventData (12-15)
- data-models/pkg/model/health_event_extentions.go (1): HealthEventWithStatus (53-57)
- store-client/pkg/utils/document_utils.go (1): ExtractDocumentID (59-84)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (4)
- fault-remediation/pkg/remediation/remediation.go (1): NewRemediationClient (56-114)
- commons/pkg/statemanager/statemanager.go (2): NewStateManager (206-210), StateManager (197-200)
- fault-remediation/pkg/annotation/annotation_interface.go (1): AnnotationKey (12-12)
- fault-remediation/pkg/metrics/metrics.go (6): TotalEventsReceived (33-38), EventHandlingDuration (62-68), EventsProcessed (39-45), CRStatusCreated (25-25), CRStatusSkipped (26-26), TotalUnsupportedRemediationActions (53-59)
fault-remediation/pkg/reconciler/reconciler_test.go (3)
- fault-remediation/pkg/events/health_event.go (2): HealthEventData (12-15), HealthEventDoc (6-9)
- fault-remediation/pkg/annotation/annotation_interface.go (3): NodeAnnotationManagerInterface (16-21), RemediationStateAnnotation (24-26), EquivalenceGroupState (29-35)
- fault-remediation/pkg/crstatus/crstatus_interface.go (1): CRStatusCheckerInterface (9-11)
YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
[error] 15-15: syntax error: expected , but found ''
(syntax)
Additional comments (9)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)
99-99: LGTM! Health and readiness probe configuration unified. The removal of conditional logic around the `ctrlRuntimeEnabled` flag simplifies the deployment template and aligns with the controller-runtime consolidation. Also applies to: 106-107
fault-remediation/pkg/metrics/metrics.go (1)
33-92: LGTM! Metrics properly exported for package-level access. The exported metric variables follow Go naming conventions and enable proper access from consuming packages. The migration to a dedicated `metrics` package with exported identifiers is a sound refactoring.
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)
15-20: LGTM! Go template syntax is correct. The template uses valid Go template placeholders that will be rendered into proper YAML. The static analysis syntax error is a false positive: YAMLlint cannot parse Go template directives.
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
29-35: Interface design looks good. The interface methods are well-defined with appropriate return types. The shift to returning `(string, error)` from `CreateMaintenanceResource` and `(ctrl.Result, error)` from `RunLogCollectorJob` provides better control flow and error handling.
37-50: TemplateData struct is well-documented. The struct provides clear field documentation and appropriate separation between node/event data and CRD routing metadata.
fault-remediation/pkg/annotation/annotation.go (1)
50-57: Silent failure on unmarshal error may mask data corruption. When `json.Unmarshal` fails, the code logs an error but returns an empty state, which could lead to overwriting valid data. Consider returning an error or using a different annotation key to avoid data loss.
fault-remediation/pkg/annotation/annotation_interface.go (1)
10-35: Interface and type definitions are well-structured. The interface design is clean with appropriate method signatures. The struct types have proper JSON tags for serialization. The `AnnotationKey` constant is appropriately documented.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)
299-327: Verify template mount path works under `go test` (likely wrong relative path).
Line 303: `MountPath: "./templates"` is relative to the package working directory at test runtime; if templates actually live under `fault-remediation/pkg/remediation/templates`, this will fail in CI. Please verify and, if needed, switch to a path that exists from the package dir (e.g., `filepath.Join("..", "remediation", "templates")`).
fault-remediation/pkg/reconciler/reconciler_test.go (1)
343-344: No action needed: `t.Context()` is fully supported. The repository requires Go 1.25 (with toolchain go1.25.3) across all modules. `testing.T.Context()` was added in Go 1.24, so it is available and compatible with the project's Go version. Likely an incorrect or invalid review comment.
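The annotation.go comment above (lines 50-57) recommends surfacing unmarshal failures instead of returning an empty state. A minimal sketch of that behavior follows, under assumptions: `remediationState` is a simplified, hypothetical stand-in for `RemediationStateAnnotation`, the JSON field name is assumed, and `parseState` is an illustrative helper rather than the package's real function.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// remediationState is a simplified, hypothetical stand-in for
// RemediationStateAnnotation; the JSON tag is an assumption.
type remediationState struct {
	EquivalenceGroups map[string]string `json:"equivalenceGroups"`
}

// parseState returns an error on malformed JSON so the caller cannot
// mistake a corrupted annotation for "no state" and overwrite it.
func parseState(raw string) (*remediationState, error) {
	if raw == "" {
		// No annotation yet: an empty state is the correct answer here.
		return &remediationState{EquivalenceGroups: map[string]string{}}, nil
	}

	var state remediationState
	if err := json.Unmarshal([]byte(raw), &state); err != nil {
		return nil, fmt.Errorf("unmarshal remediation state annotation: %w", err)
	}

	if state.EquivalenceGroups == nil {
		state.EquivalenceGroups = map[string]string{}
	}

	return &state, nil
}

func main() {
	// Truncated JSON stands in for a corrupted annotation value.
	if _, err := parseState(`{"equivalenceGroups":`); err != nil {
		fmt.Println("refusing to overwrite:", err)
	}

	state, err := parseState(`{"equivalenceGroups":{"restart":"cr-1"}}`)
	fmt.Println(state.EquivalenceGroups["restart"], err)
}
```

Distinguishing the empty-annotation case from the malformed case is the point of the design: only the former is safe to treat as a fresh state.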
e9af50e to 81189df
Actionable comments posted: 11
Caution
Some comments are outside the diff and can't be posted inline due to platform limitations.
Outside diff range comments (2)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)
104-111: Fix extra space in readiness probe path. Line 106 has an extra space before the path value (`path:  "/readyz"`). This should use a single space after the colon for standard YAML formatting.
Proposed fix
 readinessProbe:
   httpGet:
-    path:  "/readyz"
+    path: "/readyz"
     port: "health"
   initialDelaySeconds: 5
fault-remediation/pkg/crstatus/checker.go (1)
47-76: Add validation to ensure cluster-scoped resources have empty namespace. The `validateScope` function in `fault-remediation/pkg/config/config.go` (lines 133-148) only validates that namespaced resources have a non-empty namespace, but does not validate the inverse: that cluster-scoped resources must have an empty namespace. This allows cluster-scoped remediation actions to be configured with a non-empty namespace value, which would cause the `client.Get` call in `ShouldSkipCRCreation` (line 68) to fail or behave unexpectedly.
Enhance `validateScope` to also validate that cluster-scoped resources (where `scope == "Cluster"`) have an empty namespace, or explicitly clear the namespace for cluster-scoped ObjectKey construction.
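The two-sided scope check described above might look like the following. This is a sketch under assumptions, not the contents of config.go: `maintenanceResource` is a hypothetical subset of `config.MaintenanceResource`, the `"Namespaced"` scope value is assumed (only `"Cluster"` appears in the comment), and the error messages are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// maintenanceResource is a hypothetical subset of config.MaintenanceResource.
type maintenanceResource struct {
	Scope     string
	Namespace string
}

// validateScope enforces both directions: namespaced resources need a
// namespace, and cluster-scoped resources must not carry one.
func validateScope(r maintenanceResource) error {
	switch r.Scope {
	case "Namespaced": // assumed value for the namespaced scope
		if r.Namespace == "" {
			return errors.New("namespaced resource requires a non-empty namespace")
		}
	case "Cluster":
		if r.Namespace != "" {
			return fmt.Errorf("cluster-scoped resource must not set namespace %q", r.Namespace)
		}
	default:
		return fmt.Errorf("unknown scope %q", r.Scope)
	}

	return nil
}

func main() {
	cases := []maintenanceResource{
		{Scope: "Cluster"},
		{Scope: "Cluster", Namespace: "default"},
		{Scope: "Namespaced", Namespace: "nvsentinel"},
		{Scope: "Namespaced"},
	}

	for _, c := range cases {
		fmt.Printf("%+v -> %v\n", c, validateScope(c))
	}
}
```

Rejecting the misconfiguration at load time is one option; the alternative the comment mentions, clearing the namespace when building the object key for cluster-scoped resources, would be more lenient but hides the config mistake.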
Fix all issues with AI agents
In @fault-remediation/main.go:
- Line 184: The kubeconfigPath flag is defined but never used; either remove the
flag or wire it into config loading. To wire it, replace the direct call to
ctrl.GetConfigOrDie() with logic that uses clientcmd.BuildConfigFromFlags("",
kubeconfigPath) when kubeconfigPath is non-empty and falls back to
ctrl.GetConfigOrDie() otherwise (ensuring you handle and log
BuildConfigFromFlags errors); alternatively, delete the flag declaration
(kubeconfigPath and flag.StringVar) so no unused flag remains.
In @fault-remediation/pkg/annotation/annotation_interface.go:
- Line 1: Add a package-level godoc comment for the annotation package: place a
clear, one-paragraph comment above the "package annotation" declaration (in
either annotation_interface.go or annotation.go) describing the package purpose
and high-level behavior; ensure it follows Go doc conventions (complete
sentence, starts with "Package annotation ...") so godoc tools and linters
recognize it.
In @fault-remediation/pkg/annotation/annotation.go:
- Line 1: Add a package-level godoc comment above the "package annotation"
declaration that briefly describes the purpose and responsibilities of the
annotation package (what it provides, key concepts, and any important usage
notes); ensure the comment is a complete sentence or two starting with "Package
annotation ..." so it follows Go doc conventions and tools like godoc and go vet
will recognize it.
In @fault-remediation/pkg/crstatus/checker.go:
- Around line 34-44: Add a GoDoc comment above the exported constructor
NewCRStatusChecker that briefly describes what CRStatusChecker does and the
purpose of the constructor, and document the parameters (client client.Client,
remediationActions map[string]config.MaintenanceResource, dryRun bool) and the
returned *CRStatusChecker; place the comment in standard godoc form starting
with "NewCRStatusChecker ..." so tools like godoc and linters recognize it.
In @fault-remediation/pkg/metrics/metrics.go:
- Line 30: Replace the standalone TODO comment in metrics.go (package metrics)
with a TODO that references a tracked issue ID or URL (e.g.,
"TODO(#<issue-number>): evaluate and remove redundant metrics with ctrl-runtime
defaults"). If the issue doesn't exist, create one first and put that issue
number or link in the comment so the TODO follows the repository guideline for
Go code.
In @fault-remediation/pkg/reconciler/reconciler_e2e_test.go:
- Around line 368-369: Several TODO comments in reconciler_e2e_test.go lack
issue references; create separate tracker issues for (a) state transition
handling, (b) StateManager error behavior, and (c) potential state machine bug,
then update each TODO to reference the new issue ID. Specifically, for
occurrences around the UpdateNVSentinelStateNodeLabel usage (symbol:
stateManager.UpdateNVSentinelStateNodeLabel) and the labels like
statemanager.DrainSucceededLabelValue where errors are currently ignored, either
handle the returned error properly in the test or replace the ignore with a TODO
that references the created issue (e.g., TODO(issue-123): reason), and do the
same for other TODOs tied to state transitions and state manager behavior in the
test file.
In @fault-remediation/pkg/reconciler/reconciler.go:
- Line 66: The Reconciler's exported field Config is unused outside
initialization and should be made private to reduce API surface: rename the
struct field from Config to config (type ReconcilerConfig) and update all
internal references (e.g., in constructors like NewReconciler/createReconciler
and methods that currently use r.Config) to r.config; ensure any tests or other
package code that relied on the exported field are updated or, if external
access is required, expose only the needed pieces (e.g., keep annotationManager
private and provide specific accessor methods instead of exposing the whole
config).
In @fault-remediation/pkg/remediation/fault_remediation_client_interface.go:
- Line 15: The package declaration "package remediation" is missing a
package-level godoc comment; add a concise package comment immediately above the
package line describing the purpose and responsibility of the remediation
package (e.g., what it provides, key abstractions or behavior) so the package
"remediation" has proper package-level documentation per Go guidelines.
In @fault-remediation/pkg/remediation/templates/log-collector-job.yaml:
- Around line 30-32: The container spec for the "log-collector" container
currently hardcodes image: test:test; change it to accept a template parameter
(e.g., replace the literal with a template/variable like {{
.Values.logCollector.image }} or {{ .Image }} depending on your templating
system) and add an explicit imagePullPolicy (e.g., imagePullPolicy: IfNotPresent
or Always) so image source and pull behavior are configurable at runtime; update
any README/values usage accordingly to ensure the new parameter is documented
and supplied.
- Around line 18-20: The manifest hardcodes namespace: test in the log-collector
job template; replace that literal with a parameterized template variable (e.g.,
use a template placeholder for the namespace in the same file) and ensure
callers that render this template supply the namespace (or provide a sensible
default), or if this YAML is only a test fixture add a clarifying comment above
the namespace field indicating it is intentionally fixed for tests; update any
code that renders or references this template to pass the new variable (the
field to change is namespace in the same resource that has generateName:
log-collector-job-).
Nitpick comments (8)
.gitignore (1)
143-143: Clarify intent: blanket rule makes specific patterns redundant.
The AI summary indicates that multiple specific JetBrains patterns should be replaced with the blanket `.idea/` rule (line 143). However, the code still contains more specific patterns (e.g., `.idea/**/mongoSettings.xml`, `.idea/replstate.xml`, `.idea/sonarlint/`, etc.) on lines 149, 164, 167-168, 171, 186-188, 191, and 197.
Git's hierarchical ignore semantics mean that `.idea/` already ignores all contents under `.idea/`, rendering the specific patterns redundant. Either the summary is incomplete, or these patterns should be removed for cleaner maintenance.
If the intent is to use only the blanket rule, consider removing the now-redundant specific patterns:
Optional cleanup to remove redundant patterns
 # Mongo Explorer plugin
-.idea/**/mongoSettings.xml
-
 # File-based project format
 *.iws
 ...
 # Cursive Clojure plugin
-.idea/replstate.xml
-
 # SonarLint plugin
-.idea/sonarlint/
-.idea/**/sonarlint/
-
 # SonarQube Plugin
-.idea/**/sonarIssues.xml
-
 ...
 # Markdown Navigator plugin
-.idea/**/markdown-navigator.xml
-.idea/**/markdown-navigator-enh.xml
-.idea/**/markdown-navigator/
-
 # Cache file creation bug
-.idea/$CACHE_FILE$
-
 ...
 # Azure Toolkit for IntelliJ plugin
-.idea/**/azureSettings.xml
-
Keep only `.idea/` (line 143) and the plugin-specific comments for documentation.
Also applies to: 149-149, 164-164, 167-168, 171-171, 186-188, 191-191, 197-197
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)
96-103: Consider quoting the port value for consistency.
Line 99 uses an unquoted port name `health`, while line 107 quotes it as `"health"`. For consistency and clarity, consider quoting the port name in the liveness probe as well.
Proposed fix for consistency
 livenessProbe:
   httpGet:
     path: /healthz
-    port: health
+    port: "health"
   initialDelaySeconds: 15
   periodSeconds: 20
fault-remediation/pkg/annotation/annotation_test.go (1)
6-6: Consider standard testing package for assertions.
Per repository learnings, testify should be reserved for complex scenarios requiring richer diagnostics. While this test does verify multiple fields, consider using standard testing package assertions (`t.Error`, `t.Errorf`) for the equality checks.
Based on learnings, avoid introducing testify dependency for simple equality checks across the repository.
Also applies to: 106-106, 109-116
fault-remediation/pkg/initializer/init.go (1)
51-51: TODO comment should reference an issue.
Per coding guidelines, TODO comments should reference specific issues for tracking purposes.
As per coding guidelines, TODO comments should reference issues in Go code.
fault-remediation/main.go (1)
105-105: TODO comment should reference an issue.
Per coding guidelines, TODO comments should reference specific issues for tracking purposes. Also, add a space after `//` (the comment currently reads `//TODO`) for consistency with Go conventions.
As per coding guidelines, TODO comments should reference issues in Go code.
fault-remediation/pkg/annotation/annotation.go (1)
40-40: TODO comment should reference an issue.
Per coding guidelines, TODO comments should reference specific issues for tracking purposes.
As per coding guidelines, TODO comments should reference issues in Go code.
fault-remediation/pkg/reconciler/reconciler_test.go (1)
53-55: Consider using the eventId parameter in mock.
The `RunLogCollectorJob` mock ignores the `eventId` parameter when calling the mock function. This could mask issues in tests where the event ID is expected to be propagated correctly.
Suggested improvement
-func (m *MockK8sClient) RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error) {
-	return m.runLogCollectorJobFn(ctx, nodeName)
+func (m *MockK8sClient) RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error) {
+	return m.runLogCollectorJobFn(ctx, nodeName, eventId)
 }
And update the function signature:
-	runLogCollectorJobFn func(ctx context.Context, nodeName string) (ctrl.Result, error)
+	runLogCollectorJobFn func(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
fault-remediation/pkg/reconciler/reconciler.go (1)
474-482: Consider structured error types for metric labeling.
The error label determination relies on string matching of error messages, which is fragile if error messages change. Consider using structured error types or error codes for more robust metric labeling.
Example approach
Define error types in the eventutil package:
var (
	ErrMarshalFailed   = errors.New("marshal failed")
	ErrUnmarshalFailed = errors.New("unmarshal failed")
)
Then use `errors.Is()` for classification:
if errors.Is(err, eventutil.ErrMarshalFailed) {
	errorLabel = "marshal_error"
} else if errors.Is(err, eventutil.ErrUnmarshalFailed) {
	errorLabel = "unmarshal_doc_error"
}
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files ignored due to path filters (1)
- commons/go.sum is excluded by `!**/*.sum`
Files selected for processing (29)
- .gitignore
- commons/go.mod
- distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
- distros/kubernetes/nvsentinel/values-full.yaml
- distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
- distros/kubernetes/nvsentinel/values-tilt.yaml
- distros/kubernetes/nvsentinel/values.yaml
- fault-remediation/main.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/annotation/annotation_test.go
- fault-remediation/pkg/crstatus/checker.go
- fault-remediation/pkg/crstatus/crstatus_interface.go
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/reconciler/annotation.go
- fault-remediation/pkg/reconciler/annotation_test.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/remediation.go
- fault-remediation/pkg/reconciler/remediation_test.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/remediation/templates/log-collector-job.yaml
- fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
Files with no reviewable changes (8)
- distros/kubernetes/nvsentinel/values-full.yaml
- distros/kubernetes/nvsentinel/values-tilt.yaml
- fault-remediation/pkg/reconciler/remediation_test.go
- distros/kubernetes/nvsentinel/values.yaml
- fault-remediation/pkg/reconciler/annotation.go
- distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
- fault-remediation/pkg/reconciler/remediation.go
- fault-remediation/pkg/reconciler/annotation_test.go
Files skipped from review due to trivial changes (1)
- commons/go.mod
Files skipped from review as they are similar to previous changes (4)
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/crstatus/crstatus_interface.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/remediation/remediation.go
Additional context used
Path-based instructions (2)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/annotation/annotation_test.go
- fault-remediation/main.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/crstatus/checker.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/annotation/annotation_test.go
Learnings (20)
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/annotation/annotation_test.go
Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/annotation/annotation_test.go
- fault-remediation/pkg/reconciler/reconciler.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages
Applied to files:
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/annotation/annotation.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/crstatus/checker.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code
Applied to files:
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/main.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code
Applied to files:
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/annotation/annotation.go
Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/crstatus/checker.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Applied to files:
- fault-remediation/pkg/initializer/init.go
Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.
Applied to files:
- fault-remediation/pkg/initializer/init.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/remediation/templates/log-collector-job.yaml
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/main.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
- fault-remediation/main.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
- fault-remediation/main.go
Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.
Applied to files:
- fault-remediation/main.go
Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
- fault-remediation/pkg/reconciler/reconciler.go
Code graph analysis (7)
fault-remediation/pkg/crstatus/crstatus_test.go (1)
- fault-remediation/pkg/crstatus/checker.go (1): NewCRStatusChecker(34-44)
fault-remediation/pkg/annotation/annotation.go (1)
- fault-remediation/pkg/annotation/annotation_interface.go (3): RemediationStateAnnotation(24-26), AnnotationKey(12-12), EquivalenceGroupState(29-35)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (3)
- fault-remediation/pkg/events/health_event.go (1): HealthEventData(12-15)
- fault-remediation/pkg/annotation/annotation_interface.go (1): NodeAnnotationManagerInterface(16-21)
- fault-remediation/pkg/crstatus/crstatus_interface.go (1): CRStatusCheckerInterface(9-11)
fault-remediation/pkg/reconciler/reconciler_test.go (3)
- fault-remediation/pkg/events/health_event.go (2): HealthEventData(12-15), HealthEventDoc(6-9)
- fault-remediation/pkg/annotation/annotation_interface.go (3): NodeAnnotationManagerInterface(16-21), RemediationStateAnnotation(24-26), EquivalenceGroupState(29-35)
- fault-remediation/pkg/crstatus/crstatus_interface.go (1): CRStatusCheckerInterface(9-11)
fault-remediation/pkg/annotation/annotation_test.go (2)
- fault-remediation/pkg/annotation/annotation_interface.go (3): RemediationStateAnnotation(24-26), AnnotationKey(12-12), EquivalenceGroupState(29-35)
- fault-remediation/pkg/annotation/annotation.go (1): NodeAnnotationManager(16-18)
fault-remediation/main.go (2)
- commons/pkg/auditlogger/roundtripper.go (1): NewAuditingRoundTripper(42-47)
- fault-remediation/pkg/initializer/init.go (2): InitializationParams(40-45), InitializeAll(52-152)
fault-remediation/pkg/crstatus/checker.go (1)
- fault-remediation/pkg/config/config.go (1): MaintenanceResource(27-44)
YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
[error] 15-15: syntax error: expected , but found ''
(syntax)
Additional comments (28)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (2)
15-20: Static analysis false positive: YAMLlint doesn't understand Go templates.
The YAMLlint syntax error is expected and can be safely ignored. This file is a Go text/template that will be rendered before YAML parsing, so the `{{.ApiGroup}}`, `{{.Version}}`, `{{.NodeName}}`, and `{{.HealthEventID}}` placeholders are intentional and correct.
18-18: No validation needed for DNS-1123 compliance.
The resource name `maintenance-{{.NodeName}}-{{.HealthEventID}}` is already compliant with Kubernetes DNS-1123 naming rules. `NodeName` is sourced from Kubernetes Node objects, which are pre-validated by the Kubernetes API server to comply with DNS-1123 rules. `HealthEventID` is generated as a UUID (e.g., `68b000bd-9ce2-46f8-94f1-1e51532255f8`), which inherently uses only lowercase alphanumeric characters and hyphens, making it DNS-1123 compliant. The full resource name (example: `maintenance-test-node-1-68b000bd-9ce2-46f8-94f1-1e51532255f8`) is well under the 253-character limit and matches the DNS-1123 subdomain pattern. No sanitization is required.
Likely an incorrect or invalid review comment.
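The compliance claim can be spot-checked with a small standard-library sketch. Kubernetes ships its own validator (`validation.IsDNS1123Subdomain` in `k8s.io/apimachinery/pkg/util/validation`); the version below re-implements the subdomain pattern with `regexp` only so that it runs without external modules, and `isDNS1123Subdomain` is a hypothetical helper name, not part of the PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// maxSubdomainLen is the DNS-1123 subdomain length limit cited in the review.
const maxSubdomainLen = 253

// dns1123Subdomain matches dot-separated labels made of lowercase
// alphanumerics and hyphens, each starting and ending with an alphanumeric.
var dns1123Subdomain = regexp.MustCompile(
	`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// isDNS1123Subdomain is a hypothetical local check; production code would
// normally call the apimachinery validator instead.
func isDNS1123Subdomain(name string) bool {
	return len(name) <= maxSubdomainLen && dns1123Subdomain.MatchString(name)
}

func main() {
	name := fmt.Sprintf("maintenance-%s-%s",
		"test-node-1", "68b000bd-9ce2-46f8-94f1-1e51532255f8")
	fmt.Println(name, isDNS1123Subdomain(name))

	// Uppercase letters and underscores are rejected.
	fmt.Println("Maintenance_Node", isDNS1123Subdomain("Maintenance_Node"))
}
```

The example name from the comment passes the check, while a name containing uppercase letters or underscores does not.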
fault-remediation/pkg/metrics/metrics.go (2)
15-15: LGTM: Package rename improves organization.
Renaming from `reconciler` to `metrics` better reflects the package's purpose and aligns with the controller-runtime migration.
33-93: All callers have been properly updated to use exported metric names.
The metric refactoring from unexported to exported (e.g., `totalEventsReceived` → `TotalEventsReceived`) is complete throughout the fault-remediation module. All references in `reconciler.go`, `remediation.go`, and test files correctly use the new exported names with their respective label values.
fault-remediation/pkg/crstatus/checker.go (1)
28-32: LGTM: Controller-runtime client migration.
The migration from dynamic client to controller-runtime client aligns with modern Kubernetes controller patterns and improves type safety.
fault-remediation/pkg/remediation/templates/log-collector-job.yaml (1)
16-29: Job structure follows best practices.
The Job manifest structure is well-formed with appropriate settings:
- `generateName` for unique job instances
- `ttlSecondsAfterFinished` for automatic cleanup
- `restartPolicy: Never` appropriate for batch jobs
- Service account specified for proper RBAC
The main concerns are the hardcoded values flagged in other comments.
fault-remediation/pkg/crstatus/crstatus_test.go (1)
33-34: LGTM: Constructor call updated correctly.
The test has been properly updated to match the new `NewCRStatusChecker` signature, passing `nil` for the client parameter, which is appropriate since this test focuses on the condition-checking logic that doesn't require client operations.
fault-remediation/pkg/initializer/init.go (2)
85-99: Client initialization looks good.
The remediation client and Kubernetes client initialization properly uses error wrapping and follows the expected controller-runtime integration pattern.
40-56: All callers have been properly updated for the breaking API changes.
The single caller of `InitializeAll` in `fault-remediation/main.go:141` has been correctly updated to pass all three required parameters: `ctx`, `params`, and `mgr.GetClient()`. The new signature is properly implemented and called.
fault-remediation/main.go (1)
97-167: Controller-runtime integration looks solid.
The `setupCtrlRuntimeManagement` function properly implements:
- Auditing round-tripper integration
- Health and readiness checks
- Proper shutdown handling with deferred cleanup
- Context-based lifecycle management
fault-remediation/pkg/annotation/annotation.go (1)
27-166: Implementation looks solid.
The annotation manager methods properly implement:
- Error wrapping with context using `fmt.Errorf`
- Structured logging with appropriate log levels
- Safe node mutation using `DeepCopy` before patching
- Proper cleanup (clearing annotation when no groups remain)
fault-remediation/pkg/annotation/annotation_interface.go (1)
10-35: Interface and type definitions look good.
The annotation interface and types are well-designed with:
- Clear interface contract for annotation management
- Appropriate JSON tags for serialization
- Good documentation for exported items
- Proper use of `time.Time` for timestamp tracking
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
29-35: LGTM! Interface updates align with controller-runtime patterns.
The interface method signatures have been properly updated to:
- Use the new `events.HealthEventData` type for health event handling
- Return `ctrl.Result` from `RunLogCollectorJob` for controller-runtime integration
- Adopt new annotation and status checker interfaces
These changes are consistent with the broader refactoring to controller-runtime-based orchestration.
37-50: LGTM! Well-documented template data structure.
The `TemplateData` struct is properly documented with clear field groupings for node/event data and CRD routing metadata. The structure supports template-driven remediation workflows introduced in this PR.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (6)
167-177: LGTM! Test infrastructure aligned with controller-runtime.
The global test variables are properly declared, including the new `ctrlRuntimeClient` obtained from the controller-runtime manager. This setup supports the migration to controller-runtime-based testing.
206-215: LGTM! Controller-runtime manager properly configured.
The manager setup correctly:
- Disables metrics server (`BindAddress: "0"`) for test isolation
- Obtains the controller-runtime client via `mgr.GetClient()`
- Uses the test environment configuration
This aligns with the broader migration to controller-runtime-based architecture.
300-327: LGTM! Test client creation aligned with new remediation API.
The helper function correctly:
- Returns the public `remediation.FaultRemediationClientInterface`
- Uses the new template configuration structure with `TemplateFileName`
- Constructs the client with the controller-runtime client
These changes support template-driven remediation workflows.
752-753: LGTM! Metrics properly migrated to the metrics package.
The test correctly validates metric instrumentation using the new `metrics` package. Counter and histogram metrics are properly accessed and verified with appropriate label values.
Also applies to: 892-900
329-606: LGTM! Comprehensive test coverage for CR-based deduplication.
The test suite properly validates:
- First event creates annotations and CRs
- Subsequent events are deduplicated based on CR status
- Failed CRs allow retry after cleanup
- Cross-action deduplication within equivalence groups
Test structure follows best practices with descriptive names and clear assertions.
Based on coding guidelines.
378-378: Investigate undefined `annotation2` reference in test assertions.
The code uses `annotation2.AnnotationKey` in 7 assertions (lines 378, 659, 781, 838, 880, 1227, 1335), but `annotation2` is not imported in the imports section (lines 17-60) and no variable or constant declaration for it was found in the file. Verify that `annotation2` is properly defined or imported; otherwise, the code will fail to compile.
fault-remediation/pkg/reconciler/reconciler_test.go (3)
116-131: LGTM! Mock annotation manager properly implements the interface.
The mock correctly implements the updated `GetRemediationState` signature with three return values. Returning `nil` for the node is acceptable for unit tests that don't require the node object.
393-399: LGTM! Tests properly convert between HealthEventData and HealthEventDoc.
The tests correctly create `HealthEventData` for internal use and wrap it into `HealthEventDoc` with an ID when calling `performRemediation`. This pattern aligns with the separation of concerns in the events package.
Also applies to: 449-456, 494-500
637-673: LGTM! Comprehensive log collector test coverage.
The test cases properly validate:
- Success path with no errors
- Failure path with error returns
- Requeue scenarios with `ctrl.Result{RequeueAfter: ...}`
ctrl.Result{RequeueAfter: ...}This ensures the log collector integration works correctly with controller-runtime patterns.
fault-remediation/pkg/reconciler/reconciler.go (5)
92-127: LGTM! Reconcile method properly structured with metrics and error handling.
The method correctly:
- Instruments timing with EventHandlingDuration
- Increments event counter
- Handles parsing errors gracefully
- Adds safety checks for nil pointers
- Routes to appropriate handlers (cancellation vs remediation)
171-193: LGTM! Log collector method properly integrated with controller-runtime.
The method correctly:
- Returns ctrl.Result to support requeue behavior
- Checks configuration before executing
- Propagates both result and error from the log collector job
- Uses structured logging with appropriate context
196-245: LGTM! Remediation method properly handles state transitions and errors.
The method correctly:
- Uses new events.HealthEventDoc type
- Updates state labels before and after remediation
- Combines multiple errors with errors.Join
- Tracks failures with metrics
- Ensures state labels reflect the outcome even on CR creation failure
The pattern of continuing to update state labels after CR creation errors is appropriate for maintaining consistent node state.
328-330: LGTM! Log collector positioning prevents duplicate jobs.
The log collector is now called only after verifying that a new CR should be created (shouldCreateCR check). This correctly prevents duplicate log-collector jobs when multiple events arrive for the same node, addressing the issue mentioned in the test comments.
395-459: LGTM! CR status checking properly implements deduplication logic.
The method correctly:
- Retrieves configuration and state using the updated interfaces
- Checks equivalence groups for existing CRs
- Uses the status checker to determine if CR creation should be skipped
- Removes failed/completed CR references from state to allow retries
- Wraps errors with appropriate context using %w
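The deduplication steps listed above might look roughly like the following. This is a simplified sketch under stated assumptions: `groupState`, `statusChecker`, and this `shouldCreateCR` signature are hypothetical stand-ins, since the real types and method are not reproduced in this review.

```go
package main

import "fmt"

// groupState is a hypothetical stand-in for the per-group annotation entry.
type groupState struct{ CRName string }

// statusChecker reports whether an existing CR means creation should be skipped.
type statusChecker func(crName string) (skip bool)

// shouldCreateCR returns true when no in-progress CR exists for the group.
// A stale entry (CR failed or completed) is removed so that a retry is possible.
func shouldCreateCR(state map[string]groupState, group string, check statusChecker) bool {
	existing, ok := state[group]
	if !ok {
		return true // first event for this equivalence group
	}
	if check(existing.CRName) {
		return false // an existing CR is still in progress: deduplicate
	}
	delete(state, group) // drop the stale reference to allow a retry
	return true
}

func main() {
	state := map[string]groupState{"restart": {CRName: "maintenance-node1-abc"}}
	inProgress := func(string) bool { return true }
	finished := func(string) bool { return false }

	fmt.Println(shouldCreateCR(state, "restart", inProgress))
	fmt.Println(shouldCreateCR(state, "restart", finished))
	fmt.Println(len(state))
}
```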
| ) | ||
|
|
||
| flag.StringVar(&kubeconfigPath, "kubeconfig-path", "", "path to kubeconfig file") | ||
| flag.StringVar(&kubeconfigPath, "kubeconfig", "", "path to kubeconfig file") |
Analysis chain
Script executed:
#!/bin/bash
# Check all references to kubeconfigPath variable
rg -nP '\bkubeconfigPath\b' --type go
Repository: NVIDIA/NVSentinel
Length of output: 3067
Script executed:
cat -n fault-remediation/main.go | sed -n '50,120p'
Repository: NVIDIA/NVSentinel
Length of output: 2658
Remove unused kubeconfig flag or wire it to config loading.
The kubeconfigPath flag is defined at line 184 but never referenced in the code. Line 100 uses ctrl.GetConfigOrDie(), which respects the KUBECONFIG environment variable and in-cluster configuration automatically, but does not use the kubeconfigPath variable. This creates a misleading flag that users may set without effect. Either remove the flag since ctrl.GetConfigOrDie() handles standard mechanisms, or wire it to use clientcmd.BuildConfigFromFlags() if explicit kubeconfig path support is intended.
Prompt for AI Agents
In @fault-remediation/main.go at line 184, The kubeconfigPath flag is defined
but never used; either remove the flag or wire it into config loading. To wire
it, replace the direct call to ctrl.GetConfigOrDie() with logic that uses
clientcmd.BuildConfigFromFlags("", kubeconfigPath) when kubeconfigPath is
non-empty and falls back to ctrl.GetConfigOrDie() otherwise (ensuring you handle
and log BuildConfigFromFlags errors); alternatively, delete the flag declaration
(kubeconfigPath and flag.StringVar) so no unused flag remains.
| @@ -0,0 +1,35 @@ | |||
| package annotation | |||
Refactor suggestion | Major
Package-level godoc missing.
While package-level godoc only needs to appear once per package, ensure one file in the annotation package (either this file or annotation.go) contains package-level documentation.
As per coding guidelines, package-level godoc is required for all Go packages.
Prompt for AI Agents
In @fault-remediation/pkg/annotation/annotation_interface.go at line 1, Add a
package-level godoc comment for the annotation package: place a clear,
one-paragraph comment above the "package annotation" declaration (in either
annotation_interface.go or annotation.go) describing the package purpose and
high-level behavior; ensure it follows Go doc conventions (complete sentence,
starts with "Package annotation ...") so godoc tools and linters recognize it.
| @@ -0,0 +1,166 @@ | |||
| package annotation | |||
Refactor suggestion | Major
Add package-level godoc.
As per coding guidelines, package-level godoc is required for all Go packages.
Suggested package documentation
+// Package annotation provides functionality for managing node remediation state
+// through Kubernetes node annotations. It enables tracking of ongoing remediation
+// actions across equivalence groups.
 package annotation
Committable suggestion
IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| package annotation | |
| // Package annotation provides functionality for managing node remediation state | |
| // through Kubernetes node annotations. It enables tracking of ongoing remediation | |
| // actions across equivalence groups. | |
| package annotation |
Prompt for AI Agents
In @fault-remediation/pkg/annotation/annotation.go at line 1, Add a
package-level godoc comment above the "package annotation" declaration that
briefly describes the purpose and responsibilities of the annotation package
(what it provides, key concepts, and any important usage notes); ensure the
comment is a complete sentence or two starting with "Package annotation ..." so
it follows Go doc conventions and tools like godoc and go vet will recognize it.
| func NewCRStatusChecker( | ||
| dynamicClient dynamic.Interface, | ||
| restMapper *restmapper.DeferredDiscoveryRESTMapper, | ||
| client client.Client, | ||
| remediationActions map[string]config.MaintenanceResource, | ||
| dryRun bool, | ||
| ) *CRStatusChecker { | ||
| return &CRStatusChecker{ | ||
| dynamicClient: dynamicClient, | ||
| restMapper: restMapper, | ||
| client: client, | ||
| remediationActions: remediationActions, | ||
| dryRun: dryRun, | ||
| } | ||
| } |
Add godoc comment for exported constructor.
As per coding guidelines, function comments are required for all exported Go functions. Add a godoc comment describing the constructor's purpose and parameters.
Suggested godoc
+// NewCRStatusChecker creates a new CRStatusChecker with the provided client,
+// remediation action configuration, and dry-run mode setting.
func NewCRStatusChecker(
client client.Client,
remediationActions map[string]config.MaintenanceResource,
Based on coding guidelines: Function comments required for all exported Go functions.
Committable suggestion
IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func NewCRStatusChecker( | |
| dynamicClient dynamic.Interface, | |
| restMapper *restmapper.DeferredDiscoveryRESTMapper, | |
| client client.Client, | |
| remediationActions map[string]config.MaintenanceResource, | |
| dryRun bool, | |
| ) *CRStatusChecker { | |
| return &CRStatusChecker{ | |
| dynamicClient: dynamicClient, | |
| restMapper: restMapper, | |
| client: client, | |
| remediationActions: remediationActions, | |
| dryRun: dryRun, | |
| } | |
| } | |
| // NewCRStatusChecker creates a new CRStatusChecker with the provided client, | |
| // remediation action configuration, and dry-run mode setting. | |
| func NewCRStatusChecker( | |
| client client.Client, | |
| remediationActions map[string]config.MaintenanceResource, | |
| dryRun bool, | |
| ) *CRStatusChecker { | |
| return &CRStatusChecker{ | |
| client: client, | |
| remediationActions: remediationActions, | |
| dryRun: dryRun, | |
| } | |
| } |
Prompt for AI Agents
In @fault-remediation/pkg/crstatus/checker.go around lines 34 - 44, Add a GoDoc
comment above the exported constructor NewCRStatusChecker that briefly describes
what CRStatusChecker does and the purpose of the constructor, and document the
parameters (client client.Client, remediationActions
map[string]config.MaintenanceResource, dryRun bool) and the returned
*CRStatusChecker; place the comment in standard godoc form starting with
"NewCRStatusChecker ..." so tools like godoc and linters recognize it.
| // TODO: ignoring error otherwise need to properly walk state transitions | ||
| _, _ = stateManager.UpdateNVSentinelStateNodeLabel(ctx, nodeName, statemanager.DrainSucceededLabelValue, false) |
TODO comments should reference issues.
Multiple TODO comments appear throughout the test file without issue references. As per coding guidelines, TODO comments should reference issues for tracking.
Consider creating issues for:
- State transition handling (lines 368, 430, 488, 571)
- StateManager error behavior (lines 489, 572)
- Potential state machine bug (line 519)
Based on coding guidelines.
Also applies to: 430-431, 488-490, 519-520, 571-573
Prompt for AI Agents
In @fault-remediation/pkg/reconciler/reconciler_e2e_test.go around lines 368 -
369, Several TODO comments in reconciler_e2e_test.go lack issue references;
create separate tracker issues for (a) state transition handling, (b)
StateManager error behavior, and (c) potential state machine bug, then update
each TODO to reference the new issue ID. Specifically, for occurrences around
the UpdateNVSentinelStateNodeLabel usage (symbol:
stateManager.UpdateNVSentinelStateNodeLabel) and the labels like
statemanager.DrainSucceededLabelValue where errors are currently ignored, either
handle the returned error properly in the test or replace the ignore with a TODO
that references the created issue (e.g., TODO(issue-123): reason), and do the
same for other TODOs tied to state transitions and state manager behavior in the
test file.
| healthEventStore datastore.HealthEventStore | ||
| config ReconcilerConfig | ||
| annotationManager NodeAnnotationManagerInterface | ||
| Config ReconcilerConfig |
Analysis chain
Script executed:
#!/bin/bash
# Check if Config field is accessed externally
rg -n "\.Config\." --type go -g '!*_test.go' -g '!fault-remediation/pkg/reconciler/reconciler.go' fault-remediation/
Repository: NVIDIA/NVSentinel
Length of output: 43
Script executed:
#!/bin/bash
# Read the relevant section of the reconciler.go file to see struct definition and usage
cd fault-remediation && wc -l pkg/reconciler/reconciler.go
Repository: NVIDIA/NVSentinel
Length of output: 93
Script executed:
#!/bin/bash
# View the struct definition around line 66
cd fault-remediation && sed -n '55,95p' pkg/reconciler/reconciler.go
Repository: NVIDIA/NVSentinel
Length of output: 1606
Script executed:
#!/bin/bash
# Search for imports of FaultRemediationReconciler from other packages
rg "FaultRemediationReconciler" --type go -g '!*_test.go' -g '!fault-remediation/pkg/reconciler/reconciler.go'
Repository: NVIDIA/NVSentinel
Length of output: 459
Script executed:
#!/bin/bash
# Check if Config field is accessed in main.go
rg "\.Config" fault-remediation/main.go
Repository: NVIDIA/NVSentinel
Length of output: 43
Script executed:
#!/bin/bash
# Check if Config field is accessed in init.go
rg "\.Config" fault-remediation/pkg/initializer/init.go
Repository: NVIDIA/NVSentinel
Length of output: 152
Script executed:
#!/bin/bash
# Check if Watcher field is accessed to understand the pattern of public fields
rg "\.Watcher" fault-remediation/ --type go -g '!*_test.go'
Repository: NVIDIA/NVSentinel
Length of output: 880
Make Config field private to maintain encapsulation.
The Config field is exported but never accessed after initialization. Only config.RemediationClient.GetAnnotationManager() is extracted and stored in the private annotationManager field. Since the configuration is not accessed elsewhere, consider making it private (config instead of Config) to reduce the public API surface and prevent external packages from depending on internal configuration details.
Prompt for AI Agents
In @fault-remediation/pkg/reconciler/reconciler.go at line 66, The Reconciler's
exported field Config is unused outside initialization and should be made
private to reduce API surface: rename the struct field from Config to config
(type ReconcilerConfig) and update all internal references (e.g., in
constructors like NewReconciler/createReconciler and methods that currently use
r.Config) to r.config; ensure any tests or other package code that relied on the
exported field are updated or, if external access is required, expose only the
needed pieces (e.g., keep annotationManager private and provide specific
accessor methods instead of exposing the whole config).
| // limitations under the License. | ||
|
|
||
| package reconciler | ||
| package remediation |
Add package-level documentation.
The package declaration lacks a package-level godoc comment. As per coding guidelines, all Go packages require package-level documentation.
Suggested package documentation
+// Package remediation provides interfaces and types for fault remediation operations,
+// including maintenance resource creation, log collection, and node annotation management.
 package remediation
Based on coding guidelines.
Committable suggestion
IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| package remediation | |
| // Package remediation provides interfaces and types for fault remediation operations, | |
| // including maintenance resource creation, log collection, and node annotation management. | |
| package remediation |
Prompt for AI Agents
In @fault-remediation/pkg/remediation/fault_remediation_client_interface.go at
line 15, The package declaration "package remediation" is missing a
package-level godoc comment; add a concise package comment immediately above the
package line describing the purpose and responsibility of the remediation
package (e.g., what it provides, key abstractions or behavior) so the package
"remediation" has proper package-level documentation per Go guidelines.
| metadata: | ||
| generateName: log-collector-job- | ||
| namespace: test |
Refactor suggestion | Major
Hardcoded namespace should be parameterized.
The namespace is hardcoded to test. If this template is used at runtime to create Jobs, the namespace should be parameterized. If this is only a test fixture, consider adding a comment to clarify.
Suggested parameterization approach
If this is a Go template, replace with a template variable:
 metadata:
   generateName: log-collector-job-
-  namespace: test
+  namespace: {{ .Namespace }}
Or if this is intended only for testing, add a comment:
 metadata:
   generateName: log-collector-job-
+  # Note: This is a test fixture; namespace is intentionally hardcoded
   namespace: test
Committable suggestion
IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| metadata: | |
| generateName: log-collector-job- | |
| namespace: test | |
| metadata: | |
| generateName: log-collector-job- | |
| namespace: {{ .Namespace }} |
Prompt for AI Agents
In @fault-remediation/pkg/remediation/templates/log-collector-job.yaml around
lines 18 - 20, The manifest hardcodes namespace: test in the log-collector job
template; replace that literal with a parameterized template variable (e.g., use
a template placeholder for the namespace in the same file) and ensure callers
that render this template supply the namespace (or provide a sensible default),
or if this YAML is only a test fixture add a clarifying comment above the
namespace field indicating it is intentionally fixed for tests; update any code
that renders or references this template to pass the new variable (the field to
change is namespace in the same resource that has generateName:
log-collector-job-).
| containers: | ||
| - name: log-collector | ||
| image: test:test |
Refactor suggestion | Major
Hardcoded image should be parameterized and add imagePullPolicy.
The container image is hardcoded to test:test. If this template is used at runtime, the image should be parameterized. Additionally, consider adding an imagePullPolicy for better control over image pulling behavior.
Suggested parameterization
If this is a Go template:
 containers:
 - name: log-collector
-  image: test:test
+  image: {{ .Image }}
+  imagePullPolicy: {{ .ImagePullPolicy }}
Committable suggestion
IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| containers: | |
| - name: log-collector | |
| image: test:test | |
| containers: | |
| - name: log-collector | |
| image: {{ .Image }} | |
| imagePullPolicy: {{ .ImagePullPolicy }} |
Prompt for AI Agents
In @fault-remediation/pkg/remediation/templates/log-collector-job.yaml around
lines 30 - 32, The container spec for the "log-collector" container currently
hardcodes image: test:test; change it to accept a template parameter (e.g.,
replace the literal with a template/variable like {{ .Values.logCollector.image
}} or {{ .Image }} depending on your templating system) and add an explicit
imagePullPolicy (e.g., imagePullPolicy: IfNotPresent or Always) so image source
and pull behavior are configurable at runtime; update any README/values usage
accordingly to ensure the new parameter is documented and supplied.
Signed-off-by: Igor Velichkovich <[email protected]>
81189df to 48914e5
Actionable comments posted: 8
Fix all issues with AI agents
In @fault-remediation/pkg/annotation/annotation.go:
- Line 40: The TODO comment "// TODO: maybe split this up so it's not returning
both node and state" is not actionable; replace it with either a linked issue
TODO (e.g., "// TODO(#1234): consider splitting ReturnNodeAndState into two
functions to avoid returning both node and state") or remove it if no issue
exists; locate the comment near the function that returns both node and state
(the function that currently returns node and state in
pkg/annotation/annotation.go) and update the comment to reference the issue
number or change it to a concrete task with owner/ETA, ensuring the TODO format
matches the repository convention (issue number or shortlink).
- Around line 42-60: After unmarshalling into RemediationStateAnnotation, ensure
state.EquivalenceGroups is non-nil to avoid panics: if json.Unmarshal succeeds
but state.EquivalenceGroups == nil, initialize it with
make(map[string]EquivalenceGroupState). Apply the same nil-check/initialization
in the other unmarshalling spot noted (also around lines 72-78). This prevents
UpdateRemediationState or any assignment into EquivalenceGroups from causing a
nil-map panic.
- Around line 63-95: UpdateRemediationState (and likewise RemoveGroupFromState)
currently performs a read-modify-patch that can lose concurrent updates; wrap
the Get→mutate→Patch sequence in a retry.RetryOnConflict loop so on a conflict
you re-read the node/state and retry the mutation, using the
node.DeepCopy()→client.MergeFrom(patchBase) approach inside the retry body;
ensure you return errors from the retry body directly (not wrapped) so
RetryOnConflict can detect conflicts and retry, and only wrap the final returned
error if needed after the retry loop completes.
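The nil-map guard requested for annotation.go lines 42-60 can be sketched with `encoding/json` alone. The struct shapes here are assumptions: only the type names `RemediationStateAnnotation` and `EquivalenceGroupState` and the `EquivalenceGroups` field appear in the review, so the JSON tags, the `MaintenanceCR` field, and `parseState` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EquivalenceGroupState is a hypothetical shape for one group's entry.
type EquivalenceGroupState struct {
	MaintenanceCR string `json:"maintenanceCR"`
}

// RemediationStateAnnotation is a hypothetical shape for the annotation payload.
type RemediationStateAnnotation struct {
	EquivalenceGroups map[string]EquivalenceGroupState `json:"equivalenceGroups"`
}

// parseState decodes the annotation value and guarantees a writable map,
// because "{}" or an explicit null both leave the map nil after Unmarshal.
func parseState(raw string) (*RemediationStateAnnotation, error) {
	state := &RemediationStateAnnotation{}
	if raw != "" {
		if err := json.Unmarshal([]byte(raw), state); err != nil {
			return nil, fmt.Errorf("unmarshal remediation state: %w", err)
		}
	}
	if state.EquivalenceGroups == nil {
		state.EquivalenceGroups = make(map[string]EquivalenceGroupState)
	}
	return state, nil
}

func main() {
	state, err := parseState(`{"equivalenceGroups":null}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Without the guard, this assignment would panic on a nil map.
	state.EquivalenceGroups["restart"] = EquivalenceGroupState{MaintenanceCR: "cr-1"}
	fmt.Println(len(state.EquivalenceGroups))
}
```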
In @fault-remediation/pkg/metrics/metrics.go:
- Line 30: The TODO comment in metrics.go ("//TODO: evaluate and remove
redundant metrics with ctrl-runtime defaults") must reference an issue; update
that comment to include the tracker reference (e.g., issue number or URL) and a
short context so it follows the Go comment guideline; replace the plain TODO with
a TODO that includes the issue id (or link) and the same brief action so
reviewers can find the discussion.
In @fault-remediation/pkg/reconciler/reconciler.go:
- Around line 328-330: The current handling of r.runLogCollector(ctx,
healthEvent, healthEventWithStatus.ID) wraps err regardless of whether err is
nil, producing messages like "<nil>" when result is non-zero but err==nil;
change the logic to first check if err != nil and return result,
fmt.Errorf("error running log collector: %w", err) in that branch, and otherwise
if !result.IsZero() return result, nil so a non-zero result is propagated
without a spurious wrapped nil error. Ensure you update the code block around
the call to r.runLogCollector and use the existing result and err variables.
In @fault-remediation/pkg/remediation/fault_remediation_client_interface.go:
- Around line 29-35: Add GoDoc comments: write a descriptive comment above the
exported type FaultRemediationClientInterface explaining its purpose and
intended use, then add short GoDoc comments for each exported method
(CreateMaintenanceResource, RunLogCollectorJob, GetAnnotationManager,
GetStatusChecker, GetConfig) describing what they do, their key parameters and
return values; ensure each comment begins with the exact method/type name to
satisfy golint/go vet requirements.
In @fault-remediation/pkg/remediation/remediation.go:
- Around line 465-500: checkLogCollectorComplete can panic when computing
duration because job.Status.StartTime (and possibly job.Status.CompletionTime)
can be nil; before calling job.Status.StartTime.Time or
job.Status.CompletionTime.Sub(...) add a nil-check: only compute and call
metrics.LogCollectorJobDuration.WithLabelValues(...).Observe(duration) when both
job.Status.StartTime != nil and job.Status.CompletionTime != nil, otherwise skip
observing duration (you may still increment metrics.LogCollectorJobs count if
desired). Update the code in checkLogCollectorComplete to guard the duration
calculation and observation (referencing job.Status.StartTime,
job.Status.CompletionTime, metrics.LogCollectorJobs,
metrics.LogCollectorJobDuration, and jobMetricsAlreadyCountedAnnotation) to
avoid the nil pointer dereference.
In @fault-remediation/pkg/remediation/templates/rebootnode-template.yaml:
- Around line 15-20: The YAML template contains unquoted Go template
placeholders (e.g., {{.ApiGroup}}, {{.Version}}, metadata.name:
maintenance-{{.NodeName}}-{{.HealthEventID}}, and spec.nodeName: {{.NodeName}})
which make the file fail static YAML linters; fix by wrapping each placeholder
value in quotes so the template is valid YAML while rendered output remains the
same (quote the ApiGroup/Version values and the metadata.name and spec.nodeName
placeholders), and do not add metadata.namespace since RebootNode is
cluster-scoped.
Nitpick comments (13)
fault-remediation/pkg/events/health_event.go (1)
5-15: json:",inline" is likely redundant/confusing for Go encoding/json.
Embedded structs are already inlined by default; the inline option isn't a standard encoding/json directive. Consider dropping the JSON tag (or confirm this is intentional for a specific serializer).
fault-remediation/pkg/annotation/annotation_test.go (1)
15-20: Test style: prefer require for fatal checks + more descriptive test names. E.g., require.NoError before dereferencing results, and rename to TestGetRemediationState_NodeMissing_ReturnsError-style per guidelines.
Also applies to: 122-207
fault-remediation/pkg/remediation/remediation_test.go (1)
346-541: Consider envtest for deeper integration testing.
While the current fake client approach is appropriate for unit testing, consider adding integration tests using envtest for more realistic controller behavior validation, especially for complex job lifecycle scenarios.
Based on learnings, use envtest for testing Kubernetes controllers instead of fake clients.
fault-remediation/pkg/remediation/remediation.go (5)
33-41: Missing package-level godoc.
As per coding guidelines, package-level godoc is required for all Go packages.
Add package documentation
+// Package remediation provides fault remediation client functionality for creating
+// maintenance resources and orchestrating log collector jobs in Kubernetes clusters.
 package remediation
141-147: Missing godoc for exported functions.
As per coding guidelines, function comments are required for all exported Go functions.
Add function documentation
+// GetAnnotationManager returns the node annotation manager used for remediation state tracking.
 func (c *FaultRemediationClient) GetAnnotationManager() annotation.NodeAnnotationManagerInterface {
     return c.annotationManager
 }

+// GetStatusChecker returns the CR status checker used for determining CR completion status.
 func (c *FaultRemediationClient) GetStatusChecker() crstatus.CRStatusCheckerInterface {
     return c.statusChecker
 }
149-245: CreateMaintenanceResource has comprehensive logic with proper error handling.
The function correctly:
- Handles dry-run mode early
- Validates node existence before CR creation
- Handles AlreadyExists error gracefully
- Updates annotation state after CR creation
The nolint: cyclop comment should reference a tracking issue per coding guidelines.
Add issue reference to TODO comment
-// nolint: cyclop // todo
+// nolint: cyclop // TODO(issue-XXX): refactor to reduce cyclomatic complexity
 func (c *FaultRemediationClient) CreateMaintenanceResource(
358-421: Log collector job creation has potential issues.
- Line 408: The error message includes the full job list which could be verbose in logs
- The job labels are set after unmarshaling but the job name from the manifest is used, which could cause collisions
Improve error message and consider unique job naming
 if len(existingJobs.Items) > 1 {
     return batchv1.Job{}, ctrl.Result{},
-        fmt.Errorf("expecting zero or one log collector job per event per node, found %v", existingJobs.Items)
+        fmt.Errorf("expecting zero or one log collector job per event per node, found %d jobs", len(existingJobs.Items))
 }
590-592: Missing godoc for exported function GetConfig.
Add function documentation
+// GetConfig returns the parsed remediation TOML configuration.
 func (c *FaultRemediationClient) GetConfig() *config.TomlConfig {
     return &c.remediationConfig
 }
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)
17-54: Import ordering could be improved.
The import at line 23 (sigs.k8s.io/controller-runtime/pkg/client) is placed incorrectly between standard library imports. Consider grouping imports properly.
Fix import grouping
 import (
     "context"
-    "log"
     "os"
     "path/filepath"
-    "sigs.k8s.io/controller-runtime/pkg/client"
     "sync"
     "testing"
     "time"

     "github.com/prometheus/client_golang/prometheus"
     dto "github.com/prometheus/client_model/go"
     "github.com/stretchr/testify/assert"
     "github.com/stretchr/testify/require"
     corev1 "k8s.io/api/core/v1"
     metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
     "k8s.io/apimachinery/pkg/runtime/schema"
     "k8s.io/client-go/dynamic"
     "k8s.io/client-go/kubernetes"
     "k8s.io/client-go/kubernetes/scheme"
     "k8s.io/client-go/rest"
     ctrl "sigs.k8s.io/controller-runtime"
+    "sigs.k8s.io/controller-runtime/pkg/client"
     "sigs.k8s.io/controller-runtime/pkg/envtest"
370-371: TODO comments should reference tracking issues.
Multiple TODO comments exist without issue references. As per coding guidelines, TODO comments should reference issues in Go code.
Consider creating tracking issues and updating these TODOs:
- Line 370-371: State transition handling
- Line 432-433, 490-492, 521-522, 573-575: State manager error handling
Also applies to: 432-433, 490-492, 521-522, 573-575, 653-654, 708-709
1124-1125: Using time.Sleep for synchronization is fragile.
At line 1124, time.Sleep(500 * time.Millisecond) is used to wait for event processing. Consider using assert.Eventually for more reliable async waiting.
Replace sleep with Eventually
-    // Allow time for second event to be processed (should be deduplicated)
-    time.Sleep(500 * time.Millisecond)
+    // Wait for second event to be processed (should be deduplicated)
+    time.Sleep(100 * time.Millisecond) // Brief pause for event to be queued
fault-remediation/pkg/reconciler/reconciler_test.go (1)
17-39: Import ordering needs attention.
Imports at lines 21-24 are placed incorrectly between standard library and third-party imports.
Fix import grouping
 import (
     "context"
     "errors"
     "fmt"
+    "testing"
+    "time"
+
+    "github.com/google/uuid"
+    "github.com/stretchr/testify/assert"
+    corev1 "k8s.io/api/core/v1"
+    "k8s.io/utils/ptr"
+    ctrl "sigs.k8s.io/controller-runtime"
+
     "github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation"
     "github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
-    corev1 "k8s.io/api/core/v1"
-    ctrl "sigs.k8s.io/controller-runtime"
-    "testing"
-    "time"
-
-    "github.com/google/uuid"
-    "github.com/stretchr/testify/assert"
-    "k8s.io/utils/ptr"
-
     "github.com/nvidia/nvsentinel/commons/pkg/statemanager"
fault-remediation/pkg/reconciler/reconciler.go (1)
277-356: handleRemediationEvent has proper error propagation.
The function correctly:
- Uses errors.Join at line 339 to combine errors
- Propagates requeue results from log collector at line 328-330
- Updates metrics appropriately
The nolint: cyclop comment should reference a tracking issue.
Add issue reference
-// nolint: cyclop // todo
+// nolint: cyclop // TODO(issue-XXX): refactor to reduce cyclomatic complexity
 func (r *FaultRemediationReconciler) handleRemediationEvent(
Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Files ignored due to path filters (1)
- commons/go.sum is excluded by !**/*.sum
Files selected for processing (29)
- .gitignore
- commons/go.mod
- distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
- distros/kubernetes/nvsentinel/values-full.yaml
- distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
- distros/kubernetes/nvsentinel/values-tilt.yaml
- distros/kubernetes/nvsentinel/values.yaml
- fault-remediation/main.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/annotation/annotation_interface.go
- fault-remediation/pkg/annotation/annotation_test.go
- fault-remediation/pkg/crstatus/checker.go
- fault-remediation/pkg/crstatus/crstatus_interface.go
- fault-remediation/pkg/crstatus/crstatus_test.go
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/reconciler/annotation.go
- fault-remediation/pkg/reconciler/annotation_test.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/remediation.go
- fault-remediation/pkg/reconciler/remediation_test.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/remediation/templates/log-collector-job.yaml
- fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
Files with no reviewable changes (8)
- distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
- distros/kubernetes/nvsentinel/values-tilt.yaml
- fault-remediation/pkg/reconciler/remediation_test.go
- distros/kubernetes/nvsentinel/values-full.yaml
- fault-remediation/pkg/reconciler/remediation.go
- fault-remediation/pkg/reconciler/annotation.go
- distros/kubernetes/nvsentinel/values.yaml
- fault-remediation/pkg/reconciler/annotation_test.go
Files skipped from review as they are similar to previous changes (7)
- fault-remediation/pkg/crstatus/crstatus_interface.go
- fault-remediation/main.go
- fault-remediation/pkg/remediation/templates/log-collector-job.yaml
- fault-remediation/pkg/annotation/annotation_interface.go
- commons/go.mod
- fault-remediation/pkg/crstatus/crstatus_test.go
- .gitignore
Additional context used
Path-based instructions (2)
**/*.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/annotation/annotation_test.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/crstatus/checker.go
**/*_test.go
CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
- fault-remediation/pkg/annotation/annotation_test.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learnings (19)
Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
- fault-remediation/pkg/events/health_event.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/crstatus/checker.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code
Applied to files:
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/crstatus/checker.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages
Applied to files:
- fault-remediation/pkg/events/health_event.go
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/remediation/remediation.go
Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/annotation/annotation_test.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go
- fault-remediation/pkg/remediation/remediation.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code
Applied to files:
- fault-remediation/pkg/annotation/annotation.go
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/remediation/remediation.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
- fault-remediation/pkg/annotation/annotation.go
Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
- fault-remediation/pkg/annotation/annotation_test.go
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
- fault-remediation/pkg/remediation/remediation_test.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
- fault-remediation/pkg/metrics/metrics.go
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_test.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
Applied to files:
- fault-remediation/pkg/reconciler/reconciler.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Applied to files:
- fault-remediation/pkg/initializer/init.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
- fault-remediation/pkg/initializer/init.go
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
- fault-remediation/pkg/crstatus/checker.go
Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.
Applied to files:
- fault-remediation/pkg/initializer/init.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
- fault-remediation/pkg/reconciler/reconciler_e2e_test.go
Code graph analysis (10)
fault-remediation/pkg/annotation/annotation.go (1)
- fault-remediation/pkg/annotation/annotation_interface.go (3): RemediationStateAnnotation (24-26), AnnotationKey (12-12), EquivalenceGroupState (29-35)
fault-remediation/pkg/annotation/annotation_test.go (2)
- fault-remediation/pkg/annotation/annotation_interface.go (3): RemediationStateAnnotation (24-26), AnnotationKey (12-12), EquivalenceGroupState (29-35)
- fault-remediation/pkg/annotation/annotation.go (1): NodeAnnotationManager (16-18)
fault-remediation/pkg/remediation/remediation_test.go (2)
- fault-remediation/pkg/config/config.go (1): Template (47-50)
- fault-remediation/pkg/remediation/remediation.go (2): NewRemediationClient (56-114), FaultRemediationClient (43-54)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (3)
- fault-remediation/pkg/events/health_event.go (1): HealthEventData (12-15)
- fault-remediation/pkg/annotation/annotation_interface.go (1): NodeAnnotationManagerInterface (16-21)
- fault-remediation/pkg/crstatus/crstatus_interface.go (1): CRStatusCheckerInterface (9-11)
fault-remediation/pkg/remediation/remediation.go (1)
- fault-remediation/pkg/crstatus/crstatus_interface.go (1): CRStatusCheckerInterface (9-11)
fault-remediation/pkg/reconciler/reconciler_test.go (4)
- fault-remediation/pkg/events/health_event.go (2): HealthEventData (12-15), HealthEventDoc (6-9)
- fault-remediation/pkg/annotation/annotation_interface.go (3): NodeAnnotationManagerInterface (16-21), RemediationStateAnnotation (24-26), EquivalenceGroupState (29-35)
- fault-remediation/pkg/crstatus/crstatus_interface.go (1): CRStatusCheckerInterface (9-11)
- data-models/pkg/protos/health_event.pb.go (2): RecommendedAction_RESTART_BM (48-48), RecommendedAction_NONE (43-43)
fault-remediation/pkg/reconciler/reconciler.go (5)
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1): FaultRemediationClientInterface (29-35)
- fault-remediation/pkg/annotation/annotation_interface.go (1): NodeAnnotationManagerInterface (16-21)
- fault-remediation/pkg/metrics/metrics.go (5): EventHandlingDuration (62-68), TotalEventsReceived (33-38), TotalUnsupportedRemediationActions (53-59), ProcessingErrors (46-52), EventsProcessed (39-45)
- fault-remediation/pkg/events/health_event.go (2): HealthEventDoc (6-9), HealthEventData (12-15)
- data-models/pkg/model/health_event_extentions.go (1): HealthEventWithStatus (53-57)
fault-remediation/pkg/initializer/init.go (3)
- fault-remediation/pkg/remediation/remediation.go (1): NewRemediationClient (56-114)
- commons/pkg/statemanager/statemanager.go (2): NewStateManager (206-210), StateManager (197-200)
- store-client/pkg/datastore/config.go (1): LoadDatastoreConfig (27-44)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (7)
- fault-remediation/pkg/reconciler/reconciler.go (2): FaultRemediationReconciler (61-69), ReconcilerConfig (47-56)
- health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1): NewManager (41-43)
- fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1): FaultRemediationClientInterface (29-35)
- commons/pkg/statemanager/statemanager.go (3): NewStateManager (206-210), StateManager (197-200), RemediatingLabelValue (171-171)
- fault-remediation/pkg/events/health_event.go (1): HealthEventDoc (6-9)
- fault-remediation/pkg/annotation/annotation_interface.go (1): AnnotationKey (12-12)
- fault-remediation/pkg/metrics/metrics.go (5): TotalEventsReceived (33-38), EventHandlingDuration (62-68), EventsProcessed (39-45), TotalUnsupportedRemediationActions (53-59), ProcessingErrors (46-52)
fault-remediation/pkg/crstatus/checker.go (1)
- fault-remediation/pkg/config/config.go (1): MaintenanceResource (27-44)
YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
[error] 15-15: syntax error: expected , but found ''
(syntax)
Additional comments (32)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)
91-111: Verify probes/ports match the controller-runtime bind addresses.
The chart now always probes `/healthz` and `/readyz` on the `health` port; please confirm the binary serves these endpoints on `((.Values.global).healthPort) | default 9440` in all deployments (and that metrics remain on the metrics port).
fault-remediation/pkg/annotation/annotation_test.go (1)
3-13: Tests appropriately use `fake` client for unit testing a utility manager.
The guideline to prefer `envtest` applies to testing Kubernetes controllers (reconcilers); `NodeAnnotationManager` is a client wrapper utility. The actual controller behavior is tested with `envtest` in `reconciler_e2e_test.go`. These unit tests with `fake` client are appropriate and follow standard Go testing patterns.
Minor improvement: Consider naming tests following the `TestFunctionName_Scenario_ExpectedBehavior` format (e.g., `TestGetRemediationState_NodeNotFound_ReturnsError` instead of `TestGetRemediationState`).
fault-remediation/pkg/metrics/metrics.go (1)
33-93: LGTM! Metrics exported for cross-package usage.
The metric variables have been appropriately exported (capitalized) to support the new controller-runtime-based architecture. The metric definitions, labels, and help text remain unchanged, ensuring backward compatibility.
fault-remediation/pkg/initializer/init.go (2)
52-56: LGTM! Improved dependency injection pattern.
The updated signature accepts a controller-runtime client, enabling better integration with the controller-runtime manager and improving testability through dependency injection.
85-102: LGTM! Clear initialization flow with proper error handling.
The initialization sequence is logical:
- Remediation client created first (with the controller-runtime client)
- Kubernetes client created from the provided config
- StateManager wired with the Kubernetes client
All error paths are properly wrapped with context.
fault-remediation/pkg/remediation/remediation_test.go (3)
29-97: LGTM! Comprehensive test coverage for client initialization.
The table-driven test covers key scenarios: non-existent templates, existing templates, and dry-run mode. The test validates both error paths and successful initialization, including proper dry-run mode propagation.
99-193: LGTM! Thorough validation of template configuration edge cases.
These end-to-end tests effectively validate error handling for missing and empty template file configurations, ensuring robust initialization failure paths.
195-344: LGTM! Well-structured test for CR creation flows.
The test comprehensively covers:
- Error handling when nodes don't exist
- Successful CR creation
- Dry-run behavior validation
- GVK verification
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
37-50: LGTM! Well-documented template data structure.
The `TemplateData` struct is clearly documented and logically groups node/event data with CRD routing metadata for template rendering.
fault-remediation/pkg/crstatus/checker.go (3)
28-44: LGTM! Clean migration to controller-runtime client.
The struct and constructor have been appropriately refactored to use the controller-runtime client, simplifying the implementation by removing the need for dynamic client and REST mapper.
46-76: LGTM! Well-documented and robust status checking logic.
The refactored `ShouldSkipCRCreation` method:
- Uses controller-runtime's client.Get with ObjectKey for cleaner resource retrieval
- Includes proper GVK context in error logging
- Handles dry-run mode appropriately
- Maintains the same logical behavior with improved clarity
78-113: LGTM! Condition checking logic preserved.
The condition evaluation logic remains intact and correct, with only the parameter name updated for clarity (cr → obj).
fault-remediation/pkg/remediation/remediation.go (6)
1-31: Package structure and imports look good.
The package is well-organized with appropriate imports for Kubernetes API interactions, controller-runtime, and internal dependencies. The use of `sigs.k8s.io/yaml` for YAML handling is appropriate for Kubernetes resources.
43-54: FaultRemediationClient struct is well-designed.
The struct properly encapsulates all dependencies needed for remediation operations. The use of interfaces (`annotation.NodeAnnotationManagerInterface`) enables testability.
56-114: Constructor validates configuration appropriately.
The `NewRemediationClient` function performs proper validation of template configuration and preloads templates at startup, which is good for fail-fast behavior. The dry-run mode handling is correct.
502-545: checkLogCollectorFailed correctly handles nil CompletionTime.
Good defensive coding at lines 530-534 to handle the case where CompletionTime might be nil for failed jobs.
547-588: Timeout parsing silently falls back to default on error.
The warning log at line 558 is good, but consider whether a misconfigured timeout should be a hard error in production.
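116-139: Function `loadAndParseTemplate` is non-exported and does not require godoc per guidelines.
Regarding the path traversal concern: the function is safe because `templateMountPath` originates from Kubernetes ConfigMap mounts (trusted) and `fileName` from the TOML configuration file. Both are validated during initialization: config validation at lines 112-131 in `config.go` already checks that the constructed template path exists via `os.Stat()` before templates are loaded in `NewRemediationClient()`.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)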
templateMountPathoriginates from Kubernetes ConfigMap mounts (trusted) andfileNamefrom the TOML configuration file. Both are validated during initializationβconfig validation at line 112β131 inconfig.goalready checks that the constructed template path exists viaos.Stat()before templates are loaded inNewRemediationClient().fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)
181-267: TestMain setup is comprehensive and uses envtest correctly.Good use of
envtestfor testing Kubernetes controllers as per coding guidelines. The manager setup with disabled metrics binding (BindAddress: "0") is appropriate for tests.
301-329: createTestRemediationClient helper is well-structured.The test helper properly configures the remediation client with test-appropriate settings.
731-939: TestFullReconcilerWithMockedMongoDB_E2E is comprehensive.This test covers the complete flow including:
- CR creation and deduplication
- Event processing and metrics tracking
- Annotation management
- Unquarantine event handling
The use of
assert.Eventuallyfor async operations is appropriate.fault-remediation/pkg/reconciler/reconciler_test.go (5)
41-76: MockK8sClient properly implements updated interface.
The mock correctly implements the `FaultRemediationClientInterface` with the new signatures including `ctrl.Result` returns.
112-131: MockNodeAnnotationManager updated correctly.
The mock now returns `(*annotation.RemediationStateAnnotation, *corev1.Node, error)` matching the interface. The second return value (`*corev1.Node`) is consistently returned as `nil`, which is appropriate for these unit tests.
200-244: TestNewReconciler uses table-driven tests appropriately.
Good use of table-driven tests as per coding guidelines. The test validates both dry-run and non-dry-run scenarios.
572-609: TestRunLogCollectorOnNoneActionWhenEnabled tests NONE action handling.
The test correctly verifies that log collector is invoked for NONE actions when enabled. The manual simulation of the Start loop behavior at lines 604-606 is appropriate for unit testing.
611-676: TestRunLogCollectorJobErrorScenarios covers requeue scenarios.
The test at lines 637-643 tests the requeue result case, which aligns with the new `ctrl.Result` return type.
fault-remediation/pkg/reconciler/reconciler.go (6)
37-45: Imports properly updated for new package structure.
The imports correctly reference the new `annotation`, `events`, `metrics`, and `remediation` packages.
170-193: runLogCollector properly returns ctrl.Result for requeue support.
The updated signature `(ctrl.Result, error)` allows the caller to handle requeue scenarios from the log collector job.
195-245: performRemediation error handling improved with errors.Join.
The use of `errors.Join` at line 237 properly aggregates multiple errors. However, there's an inconsistency: when `createMaintenanceResourceError` is non-nil but label update succeeds, the function falls through to line 241 and returns the error, which is correct.
461-513: parseHealthEvent properly wraps errors and handles resume token.
The error handling correctly:
- Categorizes errors with appropriate labels for metrics
- Marks events as processed even on parse failure to prevent infinite retries
- Returns wrapped errors with context
540-569: SetupWithManager correctly integrates with controller-runtime.
The typed channel source and handler setup is correct for controller-runtime integration. The watcher is started before setting up the controller, which ensures events are captured.
61-69: The `Config` field exposure is intentional and necessary for testing.
The field is actively accessed in test files (`reconciler_test.go`) for test setup and verification, accessing `RemediationClient` and `EnableLogCollector`. Exposing `Config` is justified and follows standard Go testing patterns where tests need direct access to configure and validate behavior. No change needed.
Likely an incorrect or invalid review comment.
	if err != nil {
		return nil, node, fmt.Errorf("failed to get node %s: %w", nodeName, err)
	}
	// TODO: maybe split this up so it's not returning both node and state
TODO should reference an issue. Please replace with an issue-linked TODO (or remove if not actionable). Based on learnings / coding guidelines.
Prompt for AI Agents
In @fault-remediation/pkg/annotation/annotation.go at line 40, The TODO comment
"// TODO: maybe split this up so it's not returning both node and state" is not
actionable; replace it with either a linked issue TODO (e.g., "// TODO(#1234):
consider splitting ReturnNodeAndState into two functions to avoid returning both
node and state") or remove it if no issue exists; locate the comment near the
function that returns both node and state (the function that currently returns
node and state in pkg/annotation/annotation.go) and update the comment to
reference the issue number or change it to a concrete task with owner/ETA,
ensuring the TODO format matches the repository convention (issue number or
shortlink).
	annotationValue, exists := node.Annotations[AnnotationKey]
	if !exists {
		// No annotation means no active remediations
		return &RemediationStateAnnotation{
			EquivalenceGroups: make(map[string]EquivalenceGroupState),
		}, node, nil
	}

	var state RemediationStateAnnotation
	if err = json.Unmarshal([]byte(annotationValue), &state); err != nil {
		slog.Error("Failed to unmarshal annotation", "node", nodeName, "error", err)
		// Return empty state if unmarshal fails
		return &RemediationStateAnnotation{
			EquivalenceGroups: make(map[string]EquivalenceGroupState),
		}, node, nil
	}

	return &state, node, nil
}
Critical: possible nil-map panic when annotation JSON lacks equivalenceGroups. json.Unmarshal can produce state.EquivalenceGroups == nil, and UpdateRemediationState will panic on assignment.
Proposed fix
 	var state RemediationStateAnnotation
 	if err = json.Unmarshal([]byte(annotationValue), &state); err != nil {
 		slog.Error("Failed to unmarshal annotation", "node", nodeName, "error", err)
 		// Return empty state if unmarshal fails
 		return &RemediationStateAnnotation{
 			EquivalenceGroups: make(map[string]EquivalenceGroupState),
 		}, node, nil
 	}
+	if state.EquivalenceGroups == nil {
+		state.EquivalenceGroups = make(map[string]EquivalenceGroupState)
+	}
 	return &state, node, nil
Also applies to: 72-78
Prompt for AI Agents
In @fault-remediation/pkg/annotation/annotation.go around lines 42 - 60, After
unmarshalling into RemediationStateAnnotation, ensure state.EquivalenceGroups is
non-nil to avoid panics: if json.Unmarshal succeeds but state.EquivalenceGroups
== nil, initialize it with make(map[string]EquivalenceGroupState). Apply the
same nil-check/initialization in the other unmarshalling spot noted (also around
lines 72-78). This prevents UpdateRemediationState or any assignment into
EquivalenceGroups from causing a nil-map panic.
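The nil-map hazard described above can be reproduced without any Kubernetes dependencies. In the sketch below, `groupState`, `stateAnnotation`, and `parseState` are simplified stand-ins for the PR's types, and the `maintenanceCR` JSON key is an assumption; only the `equivalenceGroups` key comes from the review comment.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// groupState and stateAnnotation are simplified, hypothetical stand-ins for
// EquivalenceGroupState and RemediationStateAnnotation.
type groupState struct {
	MaintenanceCR string `json:"maintenanceCR"`
}

type stateAnnotation struct {
	EquivalenceGroups map[string]groupState `json:"equivalenceGroups"`
}

// parseState unmarshals raw and guarantees a non-nil map, so callers can
// assign into EquivalenceGroups without panicking.
func parseState(raw string) (*stateAnnotation, error) {
	var s stateAnnotation
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return nil, fmt.Errorf("unmarshal remediation state: %w", err)
	}
	// "{}" and {"equivalenceGroups": null} both leave the map nil.
	if s.EquivalenceGroups == nil {
		s.EquivalenceGroups = make(map[string]groupState)
	}
	return &s, nil
}

func main() {
	s, err := parseState(`{}`)
	if err != nil {
		panic(err)
	}
	// Without the nil check in parseState this assignment would panic.
	s.EquivalenceGroups["restart"] = groupState{MaintenanceCR: "cr-1"}
	fmt.Println(len(s.EquivalenceGroups))
}
```

Putting the guard in the read path means every caller that mutates the state is covered, instead of each write site needing its own check.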
func (m *NodeAnnotationManager) UpdateRemediationState(ctx context.Context, nodeName string,
	group string, crName string, actionName string) error {
	// Get current state
	state, node, err := m.GetRemediationState(ctx, nodeName)
	if err != nil {
		slog.Warn("Failed to get current remediation state", "node", nodeName, "error", err)
		return fmt.Errorf("failed to get current remediation state: %w", err)
	}

	// Update state for the group
	state.EquivalenceGroups[group] = EquivalenceGroupState{
		MaintenanceCR: crName,
		CreatedAt:     time.Now().UTC(),
		ActionName:    actionName,
	}

	// Marshal to JSON
	stateJSON, err := json.Marshal(state)
	if err != nil {
		return fmt.Errorf("failed to marshal remediation state: %w", err)
	}

	patchBase := node.DeepCopy()
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}

	node.Annotations[AnnotationKey] = string(stateJSON)

	if err = m.client.Patch(ctx, node, client.MergeFrom(patchBase)); err != nil {
		return fmt.Errorf("failed to patch node annotation: %w", err)
	}
Analysis chain
Script executed:
cd fault-remediation && git ls-files | grep -E "annotation\.go|annotation" | head -20
Repository: NVIDIA/NVSentinel
Length of output: 162
Script executed:
find . -name "annotation.go" -type f
Repository: NVIDIA/NVSentinel
Length of output: 109
Script executed:
wc -l fault-remediation/pkg/annotation/annotation.go
Repository: NVIDIA/NVSentinel
Length of output: 111
Script executed:
head -200 fault-remediation/pkg/annotation/annotation.go
Repository: NVIDIA/NVSentinel
Length of output: 4812
Script executed:
grep -n "retry\|RetryOnConflict" fault-remediation/pkg/annotation/annotation.go
Repository: NVIDIA/NVSentinel
Length of output: 43
Script executed:
grep -rn "RetryOnConflict" fault-remediation --include="*.go" | head -10
Repository: NVIDIA/NVSentinel
Length of output: 43
Script executed:
grep -n "AnnotationKey" fault-remediation/pkg/annotation/annotation.go
Repository: NVIDIA/NVSentinel
Length of output: 281
Script executed:
cat -n fault-remediation/pkg/annotation/annotation.go | sed -n '1,30p'
Repository: NVIDIA/NVSentinel
Length of output: 983
Script executed:
find . -name "*.go" -path "*/annotation/*" -type f
Repository: NVIDIA/NVSentinel
Length of output: 297
Script executed:
cat fault-remediation/pkg/annotation/annotation_interface.go
Repository: NVIDIA/NVSentinel
Length of output: 1291
Script executed:
grep -rn "UpdateRemediationState\|RemoveGroupFromState" fault-remediation --include="*.go" | grep -v "annotation.go"
Repository: NVIDIA/NVSentinel
Length of output: 1481
Script executed:
grep -rn "client.Patch" fault-remediation --include="*.go" -A 1 -B 1 | head -30
Repository: NVIDIA/NVSentinel
Length of output: 954
Script executed:
grep -n "retry\|Retry" fault-remediation/pkg -r --include="*.go" | head -20
Repository: NVIDIA/NVSentinel
Length of output: 2214
Script executed:
cat -n fault-remediation/pkg/annotation/annotation_test.go | sed -n '122,160p'
Repository: NVIDIA/NVSentinel
Length of output: 1497
Concurrent updates of equivalence groups can cause lost writes. Two callers updating different groups in UpdateRemediationState (or RemoveGroupFromState) can overwrite each other's changes, because both functions read, modify, and write a single annotation value. The entire state is serialized to JSON and patched as one unit; if two processes fetch, modify, and patch concurrently, the last patch wins regardless of which groups were updated.
Wrap both functions' Get → mutate → Patch sequences in retry.RetryOnConflict so that a conflict on patch triggers a re-read of the current state before retrying. Note that a plain client.MergeFrom patch does not carry the resourceVersion, so the API server will not report a conflict on its own; build the patch with client.MergeFromWithOptions(patchBase, client.MergeFromWithOptimisticLock{}) for the retry to take effect. Return errors unwrapped within the retry block to preserve the retry mechanism.
Prompt for AI Agents
In @fault-remediation/pkg/annotation/annotation.go around lines 63 - 95,
UpdateRemediationState (and likewise RemoveGroupFromState) currently performs a
read-modify-patch that can lose concurrent updates; wrap the Get→mutate→Patch
sequence in a retry.RetryOnConflict loop so on a conflict you re-read the
node/state and retry the mutation, using the
node.DeepCopy()→client.MergeFrom(patchBase) approach inside the retry body;
ensure you return errors from the retry body directly (not wrapped) so
RetryOnConflict can detect conflicts and retry, and only wrap the final returned
error if needed after the retry loop completes.
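The retry pattern requested above can be illustrated without a cluster. The following is a minimal, stdlib-only sketch: `store`, `fakeNode`, `errConflict`, and `retryOnConflict` are hypothetical stand-ins for the API server, the Node object, a 409 Conflict response, and `retry.RetryOnConflict`. It is not the project's code; it only shows why the whole read-modify-write has to sit inside the retry body.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errConflict stands in for the API server's 409 Conflict response.
var errConflict = errors.New("conflict: resource version changed")

// fakeNode is a hypothetical in-memory stand-in for a Node object.
type fakeNode struct {
	resourceVersion int
	groups          map[string]string // stand-in for the JSON annotation state
}

// store mimics optimistic concurrency: a write succeeds only when the
// caller's resourceVersion matches the stored one.
type store struct {
	mu   sync.Mutex
	node fakeNode
}

// get returns a deep copy of the stored node.
func (s *store) get() fakeNode {
	s.mu.Lock()
	defer s.mu.Unlock()
	cp := fakeNode{resourceVersion: s.node.resourceVersion, groups: map[string]string{}}
	for k, v := range s.node.groups {
		cp.groups[k] = v
	}
	return cp
}

// update rejects writes that were computed from a stale read.
func (s *store) update(n fakeNode) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if n.resourceVersion != s.node.resourceVersion {
		return errConflict
	}
	n.resourceVersion++
	s.node = n
	return nil
}

// retryOnConflict re-runs fn while it returns errConflict, up to attempts times.
func retryOnConflict(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

// setGroup performs the whole read-modify-write inside the retry body.
func setGroup(s *store, group, crName string) error {
	return retryOnConflict(5, func() error {
		n := s.get() // re-read on every attempt
		n.groups[group] = crName
		return s.update(n) // returned unwrapped so the conflict stays detectable
	})
}

func main() {
	s := &store{node: fakeNode{groups: map[string]string{}}}
	var wg sync.WaitGroup
	for _, g := range []string{"restart", "reset"} {
		wg.Add(1)
		go func(group string) {
			defer wg.Done()
			if err := setGroup(s, group, "maintenance-"+group); err != nil {
				fmt.Println("update failed:", err)
			}
		}(g)
	}
	wg.Wait()
	fmt.Println(len(s.get().groups)) // prints 2: neither writer lost its group
}
```

If the read happened once outside the retry body, the second writer would keep resubmitting its stale copy and either fail every attempt or, without the version check, silently drop the first writer's group.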
)

var (
	//TODO: evaluate and remove redundant metrics with ctrl-runtime defaults
TODO comment should reference an issue.
As per coding guidelines, TODO comments should reference issues in Go code. Please add an issue reference to this TODO.
Based on learnings, TODO comments should reference issues in Go code.
Prompt for AI Agents
In @fault-remediation/pkg/metrics/metrics.go at line 30, The TODO comment in
metrics.go ("//TODO: evaluate and remove redundant metrics with ctrl-runtime
defaults") must reference an issue; update that comment to include the tracker
reference (e.g., issue number or URL) and a short context so it follows the Go
comment guideline; replace the plain TODO with a TODO that includes the issue id
(or link) and the same brief action so reviewers can find the discussion.
	if result, err := r.runLogCollector(ctx, healthEvent, healthEventWithStatus.ID); err != nil || !result.IsZero() {
		return result, fmt.Errorf("error running log collector: %w", err)
	}
Error wrapping in runLogCollector result handling can turn a nil error into a non-nil one.
At line 329, if err is nil but result.IsZero() returns false, fmt.Errorf("error running log collector: %w", err) still returns a non-nil error whose message ends in %!w(<nil>), so a requeue-only result is reported as a failure.
Fix error handling for non-zero result without error
-	if result, err := r.runLogCollector(ctx, healthEvent, healthEventWithStatus.ID); err != nil || !result.IsZero() {
-		return result, fmt.Errorf("error running log collector: %w", err)
+	if result, err := r.runLogCollector(ctx, healthEvent, healthEventWithStatus.ID); err != nil || !result.IsZero() {
+		if err != nil {
+			return result, fmt.Errorf("error running log collector: %w", err)
+		}
+		return result, nil
 	}
Prompt for AI Agents
In @fault-remediation/pkg/reconciler/reconciler.go around lines 328 - 330, The
current handling of r.runLogCollector(ctx, healthEvent,
healthEventWithStatus.ID) wraps err regardless of whether err is nil, producing
messages like "<nil>" when result is non-zero but err==nil; change the logic to
first check if err != nil and return result, fmt.Errorf("error running log
collector: %w", err) in that branch, and otherwise if !result.IsZero() return
result, nil so a non-zero result is propagated without a spurious wrapped nil
error. Ensure you update the code block around the call to r.runLogCollector and
use the existing result and err variables.
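The behaviour described above is easy to reproduce in isolation. The sketch below is stdlib-only; `result` is a hypothetical stand-in for `ctrl.Result`, and `buggy` and `fixed` are illustrative names that mirror the two versions of the condition rather than the reconciler's actual functions.

```go
package main

import (
	"fmt"
	"time"
)

// result is a hypothetical stand-in for ctrl.Result.
type result struct {
	RequeueAfter time.Duration
}

// IsZero reports whether no requeue was requested.
func (r result) IsZero() bool { return r == result{} }

// buggy mirrors the original condition: it wraps err even when err is nil.
func buggy(res result, err error) (result, error) {
	if err != nil || !res.IsZero() {
		return res, fmt.Errorf("error running log collector: %w", err)
	}
	return res, nil
}

// fixed wraps only a real error; a requeue request passes through with a nil error.
func fixed(res result, err error) (result, error) {
	if err != nil {
		return res, fmt.Errorf("error running log collector: %w", err)
	}
	return res, nil
}

func main() {
	requeue := result{RequeueAfter: 10 * time.Second}

	_, err := buggy(requeue, nil)
	fmt.Println(err) // error running log collector: %!w(<nil>)

	_, err = fixed(requeue, nil)
	fmt.Println(err == nil) // true
}
```

In the reconciler the early return for a non-zero result still has to happen, as in the suggested diff; the sketch folds both paths into one return only to keep it short.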
 type FaultRemediationClientInterface interface {
-	CreateMaintenanceResource(ctx context.Context, healthEventData *HealthEventData) (bool, string)
-	RunLogCollectorJob(ctx context.Context, nodeName string) error
-	GetAnnotationManager() NodeAnnotationManagerInterface
-	GetStatusChecker() *crstatus.CRStatusChecker
+	CreateMaintenanceResource(ctx context.Context, healthEventData *events.HealthEventData) (string, error)
+	RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
+	GetAnnotationManager() annotation.NodeAnnotationManagerInterface
+	GetStatusChecker() crstatus.CRStatusCheckerInterface
 	GetConfig() *config.TomlConfig
 }
Add godoc for the exported interface.
The FaultRemediationClientInterface and its methods lack documentation. As per coding guidelines, function comments are required for all exported Go functions and interfaces.
Based on learnings, function comments required for all exported Go functions.
Suggested documentation
+// FaultRemediationClientInterface defines the contract for fault remediation operations,
+// including CR creation, log collection, and state management.
 type FaultRemediationClientInterface interface {
+	// CreateMaintenanceResource creates a maintenance CR for the given health event and returns the CR name.
 	CreateMaintenanceResource(ctx context.Context, healthEventData *events.HealthEventData) (string, error)
+	// RunLogCollectorJob orchestrates log collection for a node and event, returning a reconcile result.
 	RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
+	// GetAnnotationManager returns the annotation manager for node remediation state tracking.
 	GetAnnotationManager() annotation.NodeAnnotationManagerInterface
+	// GetStatusChecker returns the CR status checker for determining CR creation eligibility.
 	GetStatusChecker() crstatus.CRStatusCheckerInterface
+	// GetConfig returns the remediation configuration.
 	GetConfig() *config.TomlConfig
 }
Prompt for AI Agents
In @fault-remediation/pkg/remediation/fault_remediation_client_interface.go
around lines 29 - 35, Add GoDoc comments: write a descriptive comment above the
exported type FaultRemediationClientInterface explaining its purpose and
intended use, then add short GoDoc comments for each exported method
(CreateMaintenanceResource, RunLogCollectorJob, GetAnnotationManager,
GetStatusChecker, GetConfig) describing what they do, their key parameters and
return values; ensure each comment begins with the exact method/type name to
satisfy golint/go vet requirements.
func (c *FaultRemediationClient) checkLogCollectorComplete(
	ctx context.Context,
	nodeName string,
	job batchv1.Job,
	conditions []metav1.Condition,
) (bool, error) {
	completeCondition := meta.FindStatusCondition(conditions, string(batchv1.JobComplete))
	//nolint:nestif // todo
	if completeCondition != nil && completeCondition.Status == metav1.ConditionTrue {
		slog.Info("Log collector job completed successfully", "job", job.Name)
		// Use job's actual duration instead of custom tracking
		// reconciliation can be called multiple times so use annotation to make sure we're not duplicate recording metrics
		if job.Annotations == nil || job.Annotations[jobMetricsAlreadyCountedAnnotation] != trueStringVal {
			updateJob := job.DeepCopy()
			if updateJob.Annotations == nil {
				updateJob.Annotations = map[string]string{}
			}

			updateJob.Annotations[jobMetricsAlreadyCountedAnnotation] = trueStringVal

			err := c.client.Update(ctx, updateJob)
			if err != nil {
				return false, err
			}

			duration := job.Status.CompletionTime.Sub(job.Status.StartTime.Time).Seconds()

			metrics.LogCollectorJobs.WithLabelValues(nodeName, "success").Inc()
			metrics.LogCollectorJobDuration.WithLabelValues(nodeName, "success").Observe(duration)
		}

		return true, nil
	}

	return false, nil
}
Potential nil pointer dereference in checkLogCollectorComplete.
At line 490, job.Status.StartTime could be nil if the job has not started yet, and job.Status.CompletionTime is a pointer as well; dereferencing either one while computing the duration would cause a panic.
Add nil check for StartTime
 	updateJob.Annotations[jobMetricsAlreadyCountedAnnotation] = trueStringVal
 	err := c.client.Update(ctx, updateJob)
 	if err != nil {
 		return false, err
 	}
-	duration := job.Status.CompletionTime.Sub(job.Status.StartTime.Time).Seconds()
+	var duration float64
+	if job.Status.StartTime != nil && job.Status.CompletionTime != nil {
+		duration = job.Status.CompletionTime.Sub(job.Status.StartTime.Time).Seconds()
+	}
 	metrics.LogCollectorJobs.WithLabelValues(nodeName, "success").Inc()
 	metrics.LogCollectorJobDuration.WithLabelValues(nodeName, "success").Observe(duration)
Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go around lines 465 - 500,
checkLogCollectorComplete can panic when computing duration because
job.Status.StartTime (and possibly job.Status.CompletionTime) can be nil; before
calling job.Status.StartTime.Time or job.Status.CompletionTime.Sub(...) add a
nil-check: only compute and call
metrics.LogCollectorJobDuration.WithLabelValues(...).Observe(duration) when both
job.Status.StartTime != nil and job.Status.CompletionTime != nil, otherwise skip
observing duration (you may still increment metrics.LogCollectorJobs count if
desired). Update the code in checkLogCollectorComplete to guard the duration
calculation and observation (referencing job.Status.StartTime,
job.Status.CompletionTime, metrics.LogCollectorJobs,
metrics.LogCollectorJobDuration, and jobMetricsAlreadyCountedAnnotation) to
avoid the nil pointer dereference.
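One way to follow the "skip observing" variant described in the prompt is to isolate the guard in a small helper that reports whether a duration is available at all. This is a sketch under assumptions: `jobStatus` and `jobDurationSeconds` are hypothetical names, and plain `*time.Time` fields are used in place of the `*metav1.Time` fields on the real Job status so the example stays stdlib-only.

```go
package main

import (
	"fmt"
	"time"
)

// jobStatus is a hypothetical, trimmed-down stand-in for the Job status.
// The real fields are *metav1.Time; *time.Time keeps the sketch stdlib-only.
type jobStatus struct {
	StartTime      *time.Time
	CompletionTime *time.Time
}

// jobDurationSeconds returns the run time and true when both timestamps are
// set, or 0 and false when either one is missing.
func jobDurationSeconds(s jobStatus) (float64, bool) {
	if s.StartTime == nil || s.CompletionTime == nil {
		return 0, false
	}
	return s.CompletionTime.Sub(*s.StartTime).Seconds(), true
}

func main() {
	// Arbitrary example timestamps.
	start := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	end := start.Add(90 * time.Second)

	if d, ok := jobDurationSeconds(jobStatus{StartTime: &start, CompletionTime: &end}); ok {
		// The histogram Observe call would go here.
		fmt.Println("observe", d)
	}

	if _, ok := jobDurationSeconds(jobStatus{CompletionTime: &end}); !ok {
		// The success counter can still be incremented in this branch.
		fmt.Println("skip duration: start time not set")
	}
}
```

Compared with the diff above, which observes a zero duration when a timestamp is missing, returning a boolean keeps artificial zero samples out of the histogram. Which behaviour is preferable is a judgment call for the maintainers.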
apiVersion: {{.ApiGroup}}/{{.Version}}
kind: RebootNode
metadata:
  name: maintenance-{{.NodeName}}-{{.HealthEventID}}
spec:
  nodeName: {{.NodeName}}
Analysis chain
Script executed:
find . -type f -name "*rebootnode*" -o -name "*reboot*" | head -20
Repository: NVIDIA/NVSentinel
Length of output: 575
Script executed:
git ls-files | grep -i rebootnode
Repository: NVIDIA/NVSentinel
Length of output: 559
Script executed:
git ls-files | grep -i template | grep -E "\.yaml|\.yml" | head -20
Repository: NVIDIA/NVSentinel
Length of output: 1532
Script executed:
cat -n ./fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
Repository: NVIDIA/NVSentinel
Length of output: 962
Script executed:
cat -n ./janitor/api/v1alpha1/rebootnode_types.go
Repository: NVIDIA/NVSentinel
Length of output: 9092
Script executed:
cat -n ./distros/kubernetes/nvsentinel/charts/janitor/crds/janitor.dgxc.nvidia.com_rebootnodes.yaml | head -50
Repository: NVIDIA/NVSentinel
Length of output: 2019
Script executed:
cat -n ./fault-remediation/pkg/reconciler/templates/rebootnode-template.yaml
Repository: NVIDIA/NVSentinel
Length of output: 962
Script executed:
find . -type f -name "*.go" -path "*/remediation/*" -o -path "*/reconciler/*" | xargs grep -l "template\|Template" | head -10
Repository: NVIDIA/NVSentinel
Length of output: 681
Script executed:
rg -n "ExecuteTemplate|text/template|template\.Parse" --type go | grep -i remediation | head -20
Repository: NVIDIA/NVSentinel
Length of output: 203
Script executed:
cat -n ./fault-remediation/pkg/remediation/remediation.go
Repository: NVIDIA/NVSentinel
Length of output: 22499
Script executed:
cat -n ./fault-remediation/pkg/remediation/remediation_test.go
Repository: NVIDIA/NVSentinel
Length of output: 19207
Quote template placeholders to satisfy YAML static analysis.
The template file is not valid YAML as written: a value that begins with {{ is read as the start of a flow mapping, so static linters flag it. Quoting the placeholders makes the template file itself parseable, and the rendered manifest still parses to the same values.
Suggested fix
-apiVersion: {{.ApiGroup}}/{{.Version}}
+apiVersion: "{{.ApiGroup}}/{{.Version}}"
 kind: RebootNode
 metadata:
-  name: maintenance-{{.NodeName}}-{{.HealthEventID}}
+  name: "maintenance-{{.NodeName}}-{{.HealthEventID}}"
 spec:
-  nodeName: {{.NodeName}}
+  nodeName: "{{.NodeName}}"
Note: RebootNode is cluster-scoped, so metadata.namespace is not applicable here.
Committable suggestion
IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Current:
apiVersion: {{.ApiGroup}}/{{.Version}}
kind: RebootNode
metadata:
  name: maintenance-{{.NodeName}}-{{.HealthEventID}}
spec:
  nodeName: {{.NodeName}}
Suggested:
apiVersion: "{{.ApiGroup}}/{{.Version}}"
kind: RebootNode
metadata:
  name: "maintenance-{{.NodeName}}-{{.HealthEventID}}"
spec:
  nodeName: "{{.NodeName}}"
Tools
YAMLlint (1.37.1)
[error] 15-15: syntax error: expected , but found ''
(syntax)
Prompt for AI Agents
In @fault-remediation/pkg/remediation/templates/rebootnode-template.yaml around
lines 15 - 20, The YAML template contains unquoted Go template placeholders
(e.g., {{.ApiGroup}}, {{.Version}}, metadata.name:
maintenance-{{.NodeName}}-{{.HealthEventID}}, and spec.nodeName: {{.NodeName}})
which make the file fail static YAML linters; fix by wrapping each placeholder
value in quotes so the template is valid YAML while rendered output remains the
same (quote the ApiGroup/Version values and the metadata.name and spec.nodeName
placeholders), and do not add metadata.namespace since RebootNode is
cluster-scoped.
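To check that quoting does not change what gets created, the quoted template can be rendered with the standard text/template package. The sketch below is illustrative only: `templateData` and `render` are hypothetical names, the four field names are taken from the template, and the API group and version values are examples based on the CRD file path and API package seen in the analysis above, not confirmed configuration.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// quotedTemplate is the suggested template with every placeholder quoted.
const quotedTemplate = `apiVersion: "{{.ApiGroup}}/{{.Version}}"
kind: RebootNode
metadata:
  name: "maintenance-{{.NodeName}}-{{.HealthEventID}}"
spec:
  nodeName: "{{.NodeName}}"
`

// templateData carries the four fields the template references.
// The struct name is hypothetical; the field names come from the template.
type templateData struct {
	ApiGroup      string
	Version       string
	NodeName      string
	HealthEventID string
}

// render executes the template and returns the manifest text.
func render(d templateData) (string, error) {
	tmpl, err := template.New("rebootnode").Parse(quotedTemplate)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, d); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Example values only; the node name and event ID are placeholders.
	out, err := render(templateData{
		ApiGroup:      "janitor.dgxc.nvidia.com",
		Version:       "v1alpha1",
		NodeName:      "node-1",
		HealthEventID: "event-123",
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

The rendered text keeps the double quotes, and a YAML parser reads each quoted scalar as the same string the unquoted form would produce. Quoting also keeps values that would otherwise be read as another type, such as a purely numeric node name, as strings.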