Conversation

@ivelichkovich (Contributor) commented Dec 11, 2025

Summary

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • [ X ] 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • [ X ] Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • Multi-template remediation with templated CR creation and per-event log-collector job orchestration.
    • Node annotation manager and public remediation/status/annotation interfaces for safer integrations.
  • Improvements

    • Unified controller-runtime startup flow and renamed kubeconfig flag to --kubeconfig.
    • Prometheus metrics exported and standardized health/ready probe behavior; helm defaults simplified.
  • Tests

    • Expanded unit and e2e tests for remediation, templates, annotations, and log-collector workflows.

✏️ Tip: You can customize this high-level summary in your review settings.

copy-pr-bot bot commented Dec 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot (Contributor) commented Dec 11, 2025

Walkthrough

Unifies startup under the controller-runtime manager; introduces controller-runtime-based annotation and remediation packages, templated remediation and log-collector orchestration, controller-runtime client usage for CR status checks, exported metrics, a reconciler refactor to injected clients, and Helm defaults that remove ctrlRuntimeEnabled. Tests updated to use the new public APIs.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Application Initialization | fault-remediation/main.go | Startup consolidated to controller-runtime (setupCtrlRuntimeManagement), removed dual-path startup and enableControllerRuntime flag, added auditing round-tripper, unified health/metrics wiring, renamed kubeconfig flag to --kubeconfig. |
| Annotation Package (new) | fault-remediation/pkg/annotation/annotation_interface.go, fault-remediation/pkg/annotation/annotation.go, fault-remediation/pkg/annotation/annotation_test.go | New NodeAnnotationManagerInterface and NodeAnnotationManager managing per-node remediation state via a node annotation; JSON state structs and unit tests added. |
| CR Status Checking | fault-remediation/pkg/crstatus/* | Replaced dynamic client/RESTMapper usage with controller-runtime client.Client; updated CRStatusChecker constructor/signature and added CRStatusCheckerInterface; condition checks use client.Get on unstructured objects. |
| Remediation Package (new) | fault-remediation/pkg/remediation/*, fault-remediation/pkg/remediation/remediation.go, fault-remediation/pkg/remediation/remediation_test.go, fault-remediation/pkg/remediation/fault_remediation_client_interface.go, fault-remediation/pkg/remediation/templates/* | New FaultRemediationClient and public interface; template loading/parsing, multi-template remediation, ownerRef attachment, dry-run support; CreateMaintenanceResource, RunLogCollectorJob with job orchestration/monitoring; comprehensive tests and templates added. |
| Events Types | fault-remediation/pkg/events/health_event.go | Added HealthEventDoc and HealthEventData wrappers with JSON/BSON tags for health-event payloads including status. |
| Reconciler Updates | fault-remediation/pkg/reconciler/* | Reconciler refactored to use new public packages/interfaces (remediation, annotation, crstatus, events, metrics); many method signatures updated; reconciler now uses injected remediation client and annotation manager; tests updated to controller-runtime manager/client. |
| Initializer | fault-remediation/pkg/initializer/init.go | InitializeAll signature changed to accept ctrlruntimeClient.Client and *rest.Config; remediation client and state manager wired using controller-runtime client. |
| Metrics | fault-remediation/pkg/metrics/metrics.go | Package renamed to metrics and metric variables exported (CamelCase) for cross-package use. |
| Removed legacy reconciler impls/tests | fault-remediation/pkg/reconciler/annotation.go, fault-remediation/pkg/reconciler/remediation.go, fault-remediation/pkg/reconciler/*_test.go | Deleted old reconciler-local annotation/remediation implementations and associated tests (functionality migrated to new packages). |
| Helm / Deployment config | distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml, distros/.../values*.yaml | Removed ctrlRuntimeEnabled flag and conditional args; metrics/health ports always exposed; readiness/liveness probes use fixed paths/ports. |
| Miscellaneous | .gitignore, commons/go.mod | Simplified .idea ignores; minor go.mod formatting tweak. |

Sequence Diagram(s)

sequenceDiagram
  participant Manager as Controller-Manager
  participant Reconciler as FaultRemediationReconciler
  participant Datastore as Event Datastore
  participant RemClient as FaultRemediationClient
  participant KubeAPI as Kubernetes API

  Manager->>Reconciler: start reconcile (watch events)
  Reconciler->>Datastore: fetch health event (EventWithToken)
  Datastore-->>Reconciler: healthEventWithStatus
  Reconciler->>RemClient: CreateMaintenanceResource(ctx, healthEventData)
  RemClient->>KubeAPI: get Node / render & create CR (or dry-run)
  KubeAPI-->>RemClient: CR created / error
  RemClient->>KubeAPI: create/check LogCollector Job
  KubeAPI-->>RemClient: Job status
  RemClient-->>Reconciler: return (crName / error / ctrl.Result)
  Reconciler->>Reconciler: update node annotation via AnnotationManager
  Reconciler->>Manager: return ctrl.Result (requeue/done)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐇
I nibble at templates, hop through code so bright,
Manager wakes, reconciles through the night.
Annotations whisper the node's little tale,
Jobs gather logs and follow the trail.
Hooray, new packages bloom, and tests set sail.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.44%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately describes the main refactoring: consolidating fault-remediation to use controller-runtime centrally instead of dual-path startup logic. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • πŸ“ Generate docstrings

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


if err != nil {
slog.Error("Error updating node label to remediating", "error", err)
processingErrors.WithLabelValues("label_update_error", nodeName).Inc()
return "", err
Contributor Author

so want to call this out:

this will have some behavior changes, now it'll pretty much always return an error when it experiences an error so there's cases where reconciliation will then fully retry rather than ignoring the error and continuing the process

Collaborator

makes sense

slog.Info("CR completed or failed, allowing retry", "node", nodeName, "crName", groupState.MaintenanceCR)

if err := r.annotationManager.RemoveGroupFromState(ctx, nodeName, group); err != nil {
//TODO: this also has built in client retry on the api request
Contributor Author

Yeah so it looks like some of the managers like annotationManager have built in retries. I'm planning to have a flag set on the manager struct that tells it to retry or not so that we don't double up on retries when following this "error for retry" pattern

Collaborator

makes sense

Collaborator

@lalitadithya lalitadithya left a comment

overall, I like the direction! Let's keep pushing forward on this

_ = r.runLogCollector(ctx, healthEvent)

nodeRemediatedStatus, _, err := r.performRemediation(ctx, healthEventWithStatus)
err = r.runLogCollector(ctx, healthEvent)
Collaborator

We actually don't want to retry on this one. If the log collection fails, that is fine; we can proceed with remediation. There can be cases where the log collection will never run to completion because the node is in very bad shape. If we retry on this, we may end up in an infinite loop.

Contributor Author

Okay, I have some future refactors planned for the log collector anyway, but for now I'll make it not return an error if the log collector job fails to complete in time, while still returning an error and retrying if there's, e.g., a failure on the API request to create the job resource.

Eventually I want to get the flow of the log collector to be something like this:

  1. Creates log collector Job (retry on error here)
  2. Requeues event to be re-processed after X min, this allows other events to be processed
  3. Checks if job finished, requeue if not (until timeout or complete)
  4. Eventually after completed or the timeout has passed, continue on

Contributor Author

This would likely come later once we're able to remove the "legacy" reconciler mode because it would depend on the ctrl-runtime queuing functionality

Collaborator

okay, that makes sense.

Contributor Author

I ended up doing this in this PR, created a ctrl runtime version of each of the interfaces so it seemed to fit here

@ivelichkovich ivelichkovich force-pushed the remediationerrors branch 4 times, most recently from 5d99015 to f0f60b1 January 6, 2026 02:01
@ivelichkovich ivelichkovich marked this pull request as ready for review January 6, 2026 02:01
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

Fix all issues with AI Agents 🤖
In @.idea/NVSentinel.iml:
- Around line 1-4: The repository is tracking IDE files .idea/NVSentinel.iml and
.idea/vcs.xml; update .gitignore to add a blanket ignore for .idea/ (or
uncomment/add the ".idea/" entry) so .iml and VCS configs are excluded, then
remove the tracked files from Git with git rm --cached .idea/NVSentinel.iml
.idea/vcs.xml and commit the change to stop committing IDE-specific config
moving forward.

In @fault-remediation/main.go:
- Around line 171-182: The retry loop around
components.FaultRemediationReconciler.Reconcile does not check gCtx cancellation
and can block shutdown; modify the loop to respect gCtx.Done() by exiting early
when the context is cancelled (check gCtx.Err() or select on gCtx.Done() before
each retry and before sleeping), return or break out of the loop when cancelled,
and avoid sleeping unconditionally by using a context-aware wait (select on
time.After(components.FaultRemediationReconciler.Config.UpdateRetryDelay) and
gCtx.Done()). Ensure you still log the last error but stop retrying if gCtx is
cancelled.

In @fault-remediation/pkg/annotation/annotation.go:
- Around line 66-74: GetRemediationState can fail leaving node nil which later
causes a panic when node.DeepCopy() is called; update the error path in
GetRemediationState handling so that when err != nil you either return the error
immediately or create a safe non-nil node placeholder before continuing (e.g.,
instantiate a new corev1.Node or ensure callers check for nil), and ensure
callers of RemediationStateAnnotation logic (where node.DeepCopy() is invoked)
only call DeepCopy on a non-nil node; reference GetRemediationState,
RemediationStateAnnotation, and node.DeepCopy to locate and fix the nil check
and handling.

In @fault-remediation/pkg/crstatus/checker.go:
- Around line 58-66: The ObjectKey for the GET only sets Name and omits
Namespace, which breaks lookups for namespaced CRs; update the client.ObjectKey
creation to include the CR namespace (e.g., client.ObjectKey{Name: crName,
Namespace: maintenanceResource.Namespace} or the appropriate local variable
holding the namespace) before calling c.client.Get, and ensure the warning log
(slog.Warn) also includes the namespace for clearer diagnostics; if the resource
is cluster-scoped, allow the Namespace to be empty when constructing the
ObjectKey.

In @fault-remediation/pkg/remediation/deprecated_remediation.go:
- Around line 266-293: The code calls AnnotationManager.UpdateRemediationState
twice: first using crName and then again using actualCRName after computing
actualCRName := createdCR.GetName(); remove the redundant first update (the
block that uses crName) and keep only the second update that sets the annotation
to actualCRName; ensure the logic still checks group :=
common.GetRemediationGroupForAction(healthEvent.RecommendedAction) and
c.AnnotationManager != nil before calling
AnnotationManager.UpdateRemediationState so the annotation is updated once with
the real CR name (refer to createdCR.GetName, crName, actualCRName, and
AnnotationManager.UpdateRemediationState to locate the code).

In @fault-remediation/pkg/remediation/remediation_test.go:
- Around line 100-127: Two test cases in remediation_test.go use the same name
"Successful rebootnode creation", causing ambiguous test output; change the
`name` field for one or both cases to be unique (e.g. include dryRun state) so
they read distinct names like "Successful rebootnode creation - non-dry-run" and
"Successful rebootnode creation - dry-run"; update the `name` values inside the
table-driven test entries that contain `nodeName: "test-node-1"` and `dryRun:
false`/`true` (the test case structs shown in the diff) to avoid duplicate test
case names.

In @fault-remediation/pkg/remediation/remediation.go:
- Around line 304-317: Between checking existingJobs.Items and calling
c.client.Create(ctx, job) there is a race: another reconcile may create the job
causing Create to return an IsAlreadyExists error; update the creation path in
the function handling existingJobs/ Create so that after err :=
c.client.Create(ctx, job) you check for apierrors.IsAlreadyExists(err) (or the
equivalent IsAlreadyExists helper your codebase uses) and treat that case as
success by returning ctrl.Result{RequeueAfter: 10 * time.Second}, nil (or
proceed as if job was created), while still returning other errors unchanged;
reference symbols: existingJobs.Items, c.client.Create, job, and
apierrors.IsAlreadyExists.
- Around line 183-208: The first annotation update block duplicates the later,
authoritative update and uses the provisional crName; remove the initial call
that checks common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
and calls c.annotationManager.UpdateRemediationState with crName, and keep only
the second update after obtaining actualCRName from maintenance.GetName so the
node annotation is set once with the real CR name (use the existing
healthEvent.NodeName, group and actualCRName with
c.annotationManager.UpdateRemediationState).
🧹 Nitpick comments (15)
.idea/vcs.xml (1)

1-6: IDE configuration files should be gitignored, not committed.

The .idea/vcs.xml file is an IntelliJ IDEA configuration artifact that is local to each developer's environment and should not be committed to version control. These files are typically auto-generated by the IDE, vary across developers, and create unnecessary noise and merge conflicts.

Add .idea/ (or at minimum .idea/vcs.xml) to .gitignore instead of committing this file. If the file was committed inadvertently, consider removing it from the repository with git rm --cached .idea/vcs.xml.

fault-remediation/pkg/annotation/annotation.go (2)

40-40: TODO comment should reference an issue.

Per coding guidelines, TODO comments should reference issues for tracking.


51-57: Silent error recovery may mask annotation data corruption.

Returning an empty state when JSON unmarshaling fails is defensive, but it silently discards potentially corrupted data. Consider returning the error to allow callers to decide how to handle it, or at minimum add a metric/alert for this condition.
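
One way to surface the parse failure, sketched with only the standard library. The `remediationState` shape and the `parseState` name are assumptions for illustration; the PR's actual state struct may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// remediationState is a hypothetical shape for the annotation payload.
type remediationState struct {
	Groups map[string]string `json:"groups"`
}

// parseState returns an empty state for a missing annotation, but reports
// malformed JSON to the caller instead of silently discarding it.
func parseState(raw string) (*remediationState, error) {
	state := &remediationState{Groups: map[string]string{}}
	if raw == "" {
		return state, nil
	}
	if err := json.Unmarshal([]byte(raw), state); err != nil {
		return nil, fmt.Errorf("parsing remediation state annotation: %w", err)
	}
	if state.Groups == nil {
		// A JSON null would otherwise leave the map unusable for writes.
		state.Groups = map[string]string{}
	}
	return state, nil
}

func main() {
	state, err := parseState(`{"groups":{"restart":"cr-1"}}`)
	fmt.Println(state.Groups["restart"], err)

	_, err = parseState("{not json")
	fmt.Println(err != nil)
}
```

Callers can then choose between failing the reconcile and resetting the state, rather than having that choice made for them.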

fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)

29-34: Add godoc comment for exported interface.

Per coding guidelines, exported functions and types require documentation comments. The interface methods would benefit from brief descriptions of their behavior and return semantics.

🔎 Proposed fix
+// FaultRemediationClientInterface defines the contract for fault remediation operations
+// including maintenance resource creation, log collection, and access to annotation/status components.
 type FaultRemediationClientInterface interface {
 	CreateMaintenanceResource(ctx context.Context, healthEventData *events.HealthEventData) (string, error)
 	RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
 	GetAnnotationManager() annotation.NodeAnnotationManagerInterface
 	GetStatusChecker() crstatus.CRStatusCheckerInterface
 }

31-31: Use eventID instead of eventId for Go naming conventions.

Go convention for acronyms in identifiers is to use all caps (e.g., eventID, httpURL).

fault-remediation/pkg/crstatus/crstatus_interface.go (1)

7-9: Add godoc and use named parameters for clarity.

The interface lacks documentation. The unnamed string parameter is unclear - is it a CR name, node name, or identifier? Named parameters improve readability and self-documentation.

🔎 Proposed fix
+// CRStatusCheckerInterface defines the contract for checking CR status
+// to determine if creation should be skipped (e.g., when remediation is in progress).
 type CRStatusCheckerInterface interface {
-	ShouldSkipCRCreation(context.Context, string) bool
+	ShouldSkipCRCreation(ctx context.Context, crName string) bool
 }
fault-remediation/pkg/crstatus/crstatus_test.go (1)

120-212: Consider consolidating duplicate test cases.

TestCheckConditionCtrlRuntime duplicates the exact same test cases as TestCheckCondition. Consider extracting the shared test cases into a variable and reusing them, or using a parameterized approach that tests both checker types.

🔎 Example consolidation
var checkConditionTestCases = []struct {
    name     string
    cr       *unstructured.Unstructured
    expected bool
}{
    // ... shared test cases
}

func TestCheckCondition(t *testing.T) {
    cfg := &config.MaintenanceResource{CompleteConditionType: "Completed"}
    checker := NewCRStatusChecker(nil, nil, cfg, false)
    for _, tt := range checkConditionTestCases {
        t.Run(tt.name, func(t *testing.T) {
            assert.Equal(t, tt.expected, checker.checkCondition(tt.cr))
        })
    }
}

func TestCheckConditionCtrlRuntime(t *testing.T) {
    cfg := &config.MaintenanceResource{CompleteConditionType: "Completed"}
    checker := NewCtrlRuntimeCRStatusChecker(nil, cfg, false)
    for _, tt := range checkConditionTestCases {
        t.Run(tt.name, func(t *testing.T) {
            assert.Equal(t, tt.expected, checker.checkCondition(tt.cr))
        })
    }
}
fault-remediation/pkg/annotation/annotation_interface.go (1)

10-13: Consider adding a domain prefix to the annotation key.

Kubernetes best practices recommend using a domain prefix for custom annotations (e.g., nvsentinel.nvidia.com/latestFaultRemediationState) to avoid collisions with other tools and clearly indicate ownership.

🔎 Proposed fix
 const (
 	// AnnotationKey is the key for the node annotation that tracks remediation state
-	AnnotationKey = "latestFaultRemediationState"
+	AnnotationKey = "nvsentinel.nvidia.com/latestFaultRemediationState"
 )
fault-remediation/pkg/events/health_event.go (1)

1-3: Add package-level documentation.

Per coding guidelines, package-level godoc is required for all Go packages. Consider adding a brief description of what the events package provides.

Suggested documentation
+// Package events provides health event data types for fault remediation workflows.
 package events
 
 import "github.com/nvidia/nvsentinel/data-models/pkg/model"
fault-remediation/pkg/crstatus/deprecated_checker.go (3)

15-15: Missing package-level documentation.

Per coding guidelines, package-level godoc is required for all Go packages. Add a package comment describing the purpose of this package.

🔎 Suggested fix
+// Package crstatus provides functionality for checking the status of Custom Resources
+// to determine whether maintenance operations should be skipped based on existing CR state.
 package crstatus

50-75: Consider returning error for REST mapping failures instead of silently allowing creation.

When RESTMapping fails (line 62-66), the method logs an error but returns false, which allows CR creation to proceed. This could mask configuration issues. For the PR's goal of "throw errors to trigger retries," consider propagating this error.


77-91: Clarify the return value semantics.

When status or conditions are not found (lines 79-86), the method returns true (meaning "skip creation"). This seems counterintuitive at first, but if the CR exists and has no status yet, skipping creation is correct. However, a brief comment explaining this logic would improve maintainability.

fault-remediation/pkg/remediation/remediation.go (1)

362-387: Potential nil pointer dereference when checking job annotations.

Line 367 checks job.Annotations != nil && job.Annotations[...], but if the job was just created, Annotations might be nil, causing the metrics to be recorded. However, if Annotations is nil, the annotation won't exist, so the condition is correct. The issue is that after updating, if the update fails (line 375-378), the function returns false, err, potentially causing duplicate metrics on retry. Consider moving the metrics recording after the successful annotation update.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

356-358: TODO comments indicate incomplete state transition handling.

Multiple TODO comments (lines 356-357, 418-419, 476-478, 507-508, 559-561) indicate that error handling for state transitions is being ignored. These should be tracked as follow-up work to ensure proper state management.

Do you want me to open a new issue to track these TODO items for proper state transition error handling?

fault-remediation/pkg/remediation/deprecated_remediation.go (1)

379-391: Missing labels on Job template spec.

Labels are set on job.Labels (line 384) but not on job.Spec.Template.Labels. This could affect label-based job selection in some scenarios, though the current List with MatchingLabels queries Job objects directly.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd4466e and f0f60b1.

📒 Files selected for processing (24)
  • .idea/NVSentinel.iml
  • .idea/vcs.xml
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/crstatus/deprecated_checker.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
💤 Files with no reviewable changes (1)
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/main.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/crstatus/deprecated_checker.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
🧠 Learnings (16)
πŸ“š Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use separate informers for different Kubernetes resource types

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
🧬 Code graph analysis (9)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (4)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (11-14)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (7-9)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/initializer/init.go (3)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
  • FaultRemediationClientInterface (29-34)
  • TemplateData (37-44)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewCtrlRuntimeRemediationClient (47-95)
commons/pkg/statemanager/statemanager.go (1)
  • StateManager (197-200)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-32)
fault-remediation/main.go (3)
fault-remediation/pkg/initializer/init.go (2)
  • InitializationParams (38-44)
  • InitializeAll (51-165)
fault-remediation/pkg/reconciler/reconciler.go (1)
  • FaultRemediationReconciler (61-69)
commons/pkg/auditlogger/roundtripper.go (1)
  • NewAuditingRoundTripper (42-47)
fault-remediation/pkg/crstatus/crstatus_test.go (2)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/crstatus/checker.go (1)
  • NewCtrlRuntimeCRStatusChecker (34-44)
fault-remediation/pkg/crstatus/deprecated_checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/remediation/deprecated_remediation.go (8)
fault-remediation/pkg/config/config.go (1)
  • Template (27-30)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/crstatus/deprecated_checker.go (2)
  • CRStatusChecker (29-34)
  • NewCRStatusChecker (36-48)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
  • NewNodeAnnotationManager (38-42)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (7-9)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (11-14)
fault-remediation/pkg/common/equivalence_groups.go (1)
  • GetRemediationGroupForAction (35-45)
fault-remediation/pkg/metrics/metrics.go (3)
  • LogCollectorErrors (86-92)
  • LogCollectorJobs (71-77)
  • LogCollectorJobDuration (78-85)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (42)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)

15-20: LGTM - Template structure is correct.

The Go template syntax with {{.ApiGroup}}/{{.Version}} is appropriate for template rendering. The YAMLlint error is expected since YAML linters cannot parse Go template placeholders until they are rendered.

Consider whether a namespace field should be added to metadata if this is a namespaced resource, or document that it's cluster-scoped.
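
As an illustration of why the rendered output lints cleanly even though the raw file does not, here is a minimal sketch using Go's standard text/template package. Only the {{.ApiGroup}}/{{.Version}} placeholder form comes from the template under review; the kind, the NodeName field, the templateData struct, and the sample values are assumptions made for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// templateData is a hypothetical stand-in for the data handed to the template.
type templateData struct {
	ApiGroup string
	Version  string
	NodeName string
}

// rawTemplate mimics the placeholder style of the reviewed file. A YAML linter
// sees the unrendered braces and reports a syntax error.
const rawTemplate = `apiVersion: {{.ApiGroup}}/{{.Version}}
kind: RebootNode
metadata:
  name: reboot-{{.NodeName}}
`

// render substitutes the placeholders and returns plain YAML text.
func render(data templateData) (string, error) {
	tmpl, err := template.New("rebootnode").Parse(rawTemplate)
	if err != nil {
		return "", fmt.Errorf("parse template: %w", err)
	}

	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("execute template: %w", err)
	}

	return buf.String(), nil
}

func main() {
	out, err := render(templateData{ApiGroup: "example.com", Version: "v1alpha1", NodeName: "node-1"})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Linting the rendered string rather than the template file would avoid the false positive, if that is worth wiring into CI.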

fault-remediation/pkg/remediation/templates/log-collector-job.yaml (1)

16-32: Hardcoded test values - clarify if this is a test fixture or production template.

The namespace "test" and image "test:test" appear to be placeholder values. If this template is intended for production use (like rebootnode-template.yaml), these should be Go template placeholders (e.g., {{.Namespace}}, {{.Image}}). If this is purely a test fixture, consider moving it to a test data directory or adding a comment to clarify its purpose.

fault-remediation/pkg/annotation/annotation_interface.go (1)

15-32: LGTM - Clean interface and type definitions.

The interface is well-designed with clear method signatures. The data structures use appropriate JSON tags for serialization. Returning *corev1.Node alongside the state from GetRemediationState is a pragmatic choice to avoid redundant API calls in callers.

fault-remediation/pkg/events/health_event.go (1)

5-14: LGTM on struct definitions.

The two types appropriately separate JSON and BSON serialization concerns. Consider adding a comment for HealthEventDoc similar to the one on HealthEventData for consistency.

fault-remediation/main.go (2)

194-197: LGTM on auditing round tripper integration.

The HTTP transport is correctly wrapped with the auditing round tripper before manager creation, ensuring all Kubernetes API calls are audited.


228-239: LGTM on initialization and cleanup flow.

Components are properly initialized with the manager's client, and cleanup is correctly deferred to ensure datastore resources are released on exit.

fault-remediation/pkg/initializer/init.go (2)

58-60: LGTM on validation logic.

The guard correctly ensures a ctrl-runtime client is provided when ctrl-runtime mode is enabled, preventing runtime errors from missing dependencies.


88-112: LGTM on dual-mode client initialization.

The branching logic cleanly separates ctrl-runtime and Kubernetes client initialization paths, with appropriate error handling for each.

fault-remediation/pkg/remediation/remediation_test.go (2)

26-81: LGTM on client creation tests.

Good table-driven test coverage for template validation scenarios, including file existence checks and dry-run mode configuration.


217-413: LGTM on log collector job tests.

Comprehensive coverage of job lifecycle scenarios including creation, completion, failure, timeout, and duplicate job handling. The assertions properly verify both error conditions and expected job counts.

fault-remediation/pkg/metrics/metrics.go (2)

15-21: LGTM on package refactor.

Moving metrics to a dedicated package with exported identifiers enables cleaner separation of concerns and allows multiple packages to record metrics.


29-92: LGTM on exported metric variables.

Metrics are correctly exported while preserving the Prometheus metric names for backward compatibility with existing dashboards and alerts.

fault-remediation/pkg/annotation/deprecated_annotation.go (2)

82-125: LGTM on GetRemediationState signature change.

Returning the fetched node enables callers to avoid redundant API calls when they need both the annotation state and node object. The retry logic with isRetryableError properly handles transient failures.


44-60: LGTM on patchNodeWithRetry implementation.

The retry logic correctly uses exponential backoff via retry.DefaultRetry and appropriately logs retryable errors before continuing.

fault-remediation/pkg/reconciler/reconciler.go (5)

66-67: LGTM on public Config field and interface-based annotation manager.

Making Config public enables access from main.go for retry configuration. Using annotation.NodeAnnotationManagerInterface improves testability and supports both legacy and ctrl-runtime implementations.


170-193: LGTM on runLogCollector refactor.

Returning ctrl.Result and error allows proper propagation of requeue requests and errors for retry handling, aligning with the ctrl-runtime pattern.


237-244: Good use of errors.Join for combining errors.

Using errors.Join to combine createMaintenanceResourceError and label update errors ensures both failures are visible in logs and upstream error handling.
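
For readers unfamiliar with the pattern, a minimal sketch of how errors.Join (Go 1.20+) behaves. The two sentinel errors and the combine helper are hypothetical names for illustration, not identifiers from the reconciler.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for the two failure sources.
var (
	errCreateCR    = errors.New("create maintenance resource failed")
	errLabelUpdate = errors.New("update node label failed")
)

// combine joins both errors. errors.Join drops nil arguments and returns nil
// when every argument is nil, so the success path needs no special casing.
func combine(createErr, labelErr error) error {
	return errors.Join(createErr, labelErr)
}

func main() {
	err := combine(errCreateCR, errLabelUpdate)

	// The joined error prints each message on its own line.
	fmt.Println(err)

	// errors.Is still matches each wrapped error individually.
	fmt.Println(errors.Is(err, errCreateCR), errors.Is(err, errLabelUpdate))

	// With no failures the result is a plain nil error.
	fmt.Println(combine(nil, nil) == nil)
}
```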


406-410: Verify error handling change aligns with retry intent.

Line 409 now returns the error instead of continuing, which will trigger retries. The coding guidelines only forbid wrapping errors inside retry.RetryOnConflict blocks, so that retry behavior is preserved. This return sits outside such a block, so that constraint does not apply and returning the error unwrapped is fine here.


438-441: Error propagation on RemoveGroupFromState failure.

Returning an error here will trigger retries when annotation cleanup fails. This is the intended behavior per the PR discussion to "throw error to trigger retry." The empty CR name return prevents false positive CR existence checks on retry.

fault-remediation/pkg/crstatus/deprecated_checker.go (2)

29-48: LGTM!

The struct definition and constructor follow Go conventions with proper field initialization. The use of *restmapper.DeferredDiscoveryRESTMapper aligns with the k8s.io/client-go patterns for dynamic resource mapping.


93-112: LGTM!

The findConditionStatus and isTerminal methods correctly implement the condition-checking logic. Terminal states ("True" or "False") properly indicate completion, while empty or "Unknown" states allow for CR creation retry.

fault-remediation/pkg/remediation/deprecated_remediation_test.go (5)

15-40: LGTM!

The package rename to remediation and import updates align with the PR's restructuring. The test file properly imports the new events package for HealthEventData.


197-251: Inconsistent capitalization in test strings.

Lines 198, 200, 210, 217, and 227 use inconsistent capitalization for "Config" (e.g., "in-cluster Config" vs typical "config"). This appears intentional per AI summary, but verify this matches actual error messages from the Kubernetes client libraries.


306-344: LGTM!

The test setup correctly uses the renamed public fields (Clientset, KubeClient, RestMapper, DryRunMode, Template, TemplateData) and properly configures the mock client for testing CR creation.


346-369: LGTM!

The test correctly uses events.HealthEventData and validates the updated CreateMaintenanceResource signature returning (string, error). The assertion logic properly handles both success and failure cases.


373-418: LGTM!

Tests for RunLogCollectorJob correctly use the updated signature with eventId parameter and handle the (interface{}, error) return pattern.

fault-remediation/pkg/crstatus/checker.go (2)

28-44: LGTM!

The struct and constructor properly implement the controller-runtime based CR status checker with appropriate field initialization.


71-105: LGTM!

The condition-checking methods are identical to the deprecated version, maintaining behavioral parity between the two implementations.

fault-remediation/pkg/remediation/remediation.go (2)

38-95: LGTM!

The CtrlRuntimeRemediationClient struct and constructor are well-structured. Template loading, dry-run configuration, and dependency initialization (annotation manager, status checker) are properly handled with appropriate error checking.


437-478: LGTM!

The timeout checking logic with configurable LOG_COLLECTOR_TIMEOUT environment variable and proper fallback to default is well implemented. The annotation-based guard against duplicate metrics recording is a good pattern.
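
A minimal sketch of the environment-variable-with-fallback pattern described here. It assumes the value is a Go duration string such as "90s"; the 10-minute default, the timeoutFromEnv name, and the injected lookup function are illustrative assumptions, not the PR's actual values.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultLogCollectorTimeout is a hypothetical default for illustration.
const defaultLogCollectorTimeout = 10 * time.Minute

// timeoutFromEnv reads LOG_COLLECTOR_TIMEOUT through the supplied lookup
// function and falls back to the default when the variable is unset, empty,
// unparsable, or not positive. Injecting lookup keeps the function testable
// without mutating the process environment.
func timeoutFromEnv(lookup func(string) (string, bool)) time.Duration {
	raw, ok := lookup("LOG_COLLECTOR_TIMEOUT")
	if !ok || raw == "" {
		return defaultLogCollectorTimeout
	}

	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return defaultLogCollectorTimeout
	}

	return d
}

func main() {
	// In production the lookup is simply os.LookupEnv.
	fmt.Println(timeoutFromEnv(os.LookupEnv))
}
```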

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)

208-217: LGTM!

The controller-runtime Manager setup with envtest follows best practices. Disabling metrics binding with BindAddress: "0" is appropriate for tests.


302-315: LGTM!

The createTestRemediationClient helper properly constructs remediation.TemplateData and uses NewCtrlRuntimeRemediationClient with the controller-runtime client.


880-888: LGTM!

The metrics assertions correctly use the dedicated metrics package constants and verify that events are properly counted across different status types (created, skipped).

fault-remediation/pkg/remediation/deprecated_remediation.go (3)

62-74: LGTM!

The struct field exports align with the API surface changes documented in the AI summary. The nodeExistsFunc override for testing is a good pattern.


165-171: LGTM!

The accessor methods properly return the interface types, enabling dependency injection and testing.


506-518: Intentional nil error return for non-fatal log collector failures.

The code explicitly returns nil error for timeout (line 509) and job completion/failure (line 517) to allow reconciliation to continue. This aligns with the PR description's intent. The slog.Error calls ensure visibility into these issues.

fault-remediation/pkg/reconciler/reconciler_test.go (6)

40-66: LGTM!

The MockK8sClient properly implements the updated interface signatures with events.HealthEventData, ctrl.Result, and the accessor methods returning interface types.


102-134: LGTM!

The MockNodeAnnotationManager correctly implements the updated GetRemediationState signature returning (*annotation.RemediationStateAnnotation, *corev1.Node, error) and uses the annotation package types.


190-234: LGTM!

The TestNewReconciler test properly uses table-driven testing and validates both dry-run enabled and disabled scenarios with the updated return signature.


336-390: LGTM!

The TestPerformRemediationWithSuccess test correctly validates the success path with the updated API, including the HealthEventDoc conversion and CR name assertion.


601-666: LGTM!

The TestRunLogCollectorJobErrorScenarios test properly validates the ctrl.Result return pattern including the requeue scenario with RequeueAfter.


946-1009: LGTM!

The TestLogCollectorOnlyCalledWhenShouldCreateCR test validates the fix for Issue #441, ensuring log collector is only called when shouldCreateCR is true to prevent duplicate jobs.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 14

🤖 Fix all issues with AI Agents
In @.gitignore:
- Line 143: The .gitignore now contains a blanket `.idea/` entry but still keeps
redundant granular `.idea/*` patterns (e.g., `.idea/mongoSettings.xml`,
`.idea/replstate.xml`, `.idea/**/sonarlint/`) which are unnecessary; remove all
specific `.idea/` file and subdirectory patterns that are subsumed by the
`.idea/` rule so the file is clean and maintainable while keeping the single
`.idea/` line.

In @fault-remediation/main.go:
- Around line 171-182: The retry loop around
components.FaultRemediationReconciler.Reconcile does not observe gCtx
cancellation; modify the loop that uses UpdateMaxRetries and UpdateRetryDelay to
break/return when gCtx is done by selecting on gCtx.Done() before retrying:
after a failed Reconcile, replace the unconditional time.Sleep(...) with a
context-aware wait using select { case
<-time.After(components.FaultRemediationReconciler.Config.UpdateRetryDelay): /*
continue */ case <-gCtx.Done(): /* exit loop/return to allow graceful shutdown
*/ }, and also check <-gCtx.Done() at the top of the retry iteration to avoid
starting another attempt when the context is cancelled.

In @fault-remediation/pkg/annotation/annotation.go:
- Around line 66-74: GetRemediationState may fail and leave `node` empty; do not
swallow that error and continue using `node.DeepCopy()`. In the block handling
`state, node, err := m.GetRemediationState(ctx, nodeName)` remove the fallback
that creates an empty `RemediationStateAnnotation` and instead return the error
immediately (propagate `err`) so callers won't operate on an invalid `node`;
keep `RemediationStateAnnotation` initialization only where a valid node/state
is present and ensure any subsequent use of `node.DeepCopy()` occurs after a
successful GetRemediationState.

In @fault-remediation/pkg/crstatus/checker.go:
- Around line 58-66: The lookup uses client.ObjectKey with only Name, causing
failures for namespaced CRs; include the resource namespace when building the
key so c.client.Get(ctx, key, obj) can find namespaced objects. Update the
ObjectKey construction (used before calling c.client.Get) to set Namespace from
the MaintenanceResource (e.g., maintenanceResource.Namespace or the variable
holding the config.Namespace) while keeping Name as crName, so both namespaced
and cluster-scoped CRs are handled.

In @fault-remediation/pkg/crstatus/deprecated_checker.go:
- Around line 77-91: The checkCondition logic is inverted: in
CRStatusChecker.checkCondition you should treat missing status/conditions as
non-terminal (do not skip) and consider terminal condition statuses as reasons
to skip; change the early returns for failed/unfound unstructured.NestedMap and
NestedSlice to return false, and replace the final return of
"!c.isTerminal(conditionStatus)" with "c.isTerminal(conditionStatus)"; keep
using findConditionStatus and isTerminal to compute the conditionStatus and
determine terminality.

In @fault-remediation/pkg/events/health_event.go:
- Around line 5-8: Add a godoc comment immediately above the exported type
HealthEventDoc that briefly describes what the struct represents (e.g., a
persistent/document representation of a health event), mentions its ID field and
embedded model.HealthEventWithStatus, and any important JSON serialization
behavior; place the comment directly above the HealthEventDoc declaration so it
satisfies Go documentation guidelines.
- Around line 1-3: Add a package-level godoc comment at the top of
health_event.go describing the purpose and responsibilities of the events
package (e.g., what types of events it models and how callers should use it).
Insert a single-line or multi-line comment beginning with "Package events"
immediately above the package events declaration so the package-level
documentation appears in godoc and satisfies the coding guidelines.

In @fault-remediation/pkg/initializer/init.go:
- Line 114: The current slog.Info("Successfully initialized k8s client") is
misleading when UseCtrlRuntime is true because no k8s client is created in that
mode; change the logging around where UseCtrlRuntime is checked (referencing the
UseCtrlRuntime flag/variable and the slog.Info call) to emit a conditional
message: if UseCtrlRuntime is true log something like "Running in
controller-runtime mode; no standalone k8s client initialized", otherwise keep
"Successfully initialized k8s client". Ensure you update only the message logic
near the existing slog.Info invocation so it accurately reflects which path ran.

In @fault-remediation/pkg/metrics/metrics.go:
- Line 30: Update the TODO in metrics.go to reference a tracking issue: replace
the existing "//TODO: evaluate and remove redundant metrics with ctrl-runtime
defaults" with a TODO that includes the issue ID and brief context (e.g. "//
TODO(issue-1234): evaluate and remove redundant metrics with ctrl-runtime
defaults") so the task is traceable; ensure the issue ID matches the repo's
issue tracker and keep the explanatory text unchanged beyond adding the issue
reference.

In @fault-remediation/pkg/remediation/deprecated_remediation.go:
- Around line 266-293: The code updates the node annotation twice: once using
crName and again using actualCRName; remove the first UpdateRemediationState
call (the block that uses crName immediately after computing group :=
common.GetRemediationGroupForAction(...)) so only the subsequent update that
uses actualCRName := createdCR.GetName() remains; ensure you keep the group
computation and the second UpdateRemediationState call (and its error handling)
intact and remove the duplicate slog.Warn/return branch associated with the
first update.

In @fault-remediation/pkg/remediation/remediation.go:
- Around line 310-317: The create path in the reconciler currently calls
c.client.Create(ctx, job) after checking len(existingJobs.Items) == 0 and does
not handle the race where another reconciler created the job and Create returns
an already-exists error; update the c.client.Create error handling to detect
apierrors.IsAlreadyExists(err) (k8s.io/apimachinery/pkg/api/errors) and treat
that case as success by returning ctrl.Result{RequeueAfter: 10 * time.Second},
nil (same behavior as successful create) instead of returning the error, while
still returning real errors unchanged.
- Around line 183-208: The first call that updates the node annotation using the
provisional crName is redundant; remove the initial block that calls
annotationManager.UpdateRemediationState with crName (the block after group :=
common.GetRemediationGroupForAction(healthEvent.RecommendedAction) that uses
crName) and keep only the later update that uses the authoritative actualCRName
from maintenance.GetName(); ensure you still check group != "" and
c.annotationManager != nil and return the existing error handling (slog.Warn and
return err) when the single UpdateRemediationState call fails.

In @fault-remediation/pkg/remediation/templates/log-collector-job.yaml:
- Line 20: Replace the hardcoded namespace value "test" in the
log-collector-job.yaml template with a parameterized Go template variable (e.g.,
change the "namespace: test" entry to use "{{.Namespace}}"), update the
template's expected data model to include Namespace, and ensure any code or
deployment that renders this template (the caller that supplies template data)
passes the Namespace value; this mirrors the approach used in
rebootnode-template.yaml so the template becomes environment-agnostic.
- Line 32: The job template currently hardcodes the placeholder image string
"test:test"; replace this with a real image reference or a templated parameter
so deployments use the correct log-collector image. Update the image field in
the log-collector job template (the line containing "image: test:test") to
reference a configurable variable (e.g., a Helm value or environment variable
like logCollector.image / .Values.logCollector.image) or the intended production
image name, and ensure any charts/values or deployment tooling are updated to
provide that value.
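
For the fault-remediation/main.go item above, a minimal sketch of a context-aware retry loop. It uses only the standard library; retryWithContext, its parameters, and the two demo functions are hypothetical names, and the real loop would call the reconciler's Reconcile method with the configured UpdateMaxRetries and UpdateRetryDelay.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithContext calls fn up to maxRetries times, waiting delay between
// attempts. It returns early with the context error once ctx is cancelled.
func retryWithContext(ctx context.Context, maxRetries int, delay time.Duration, fn func(context.Context) error) error {
	if maxRetries < 1 {
		maxRetries = 1
	}

	var lastErr error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		// Do not start another attempt once shutdown has begun.
		if err := ctx.Err(); err != nil {
			return err
		}
		if lastErr = fn(ctx); lastErr == nil {
			return nil
		}
		if attempt == maxRetries {
			break
		}
		// Context-aware wait instead of an unconditional time.Sleep.
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}

	return fmt.Errorf("giving up after %d attempts: %w", maxRetries, lastErr)
}

// demoSucceedsOnThirdAttempt fails twice, then succeeds.
func demoSucceedsOnThirdAttempt() (int, error) {
	calls := 0
	err := retryWithContext(context.Background(), 5, time.Millisecond, func(context.Context) error {
		calls++
		if calls < 3 {
			return errors.New("transient failure")
		}
		return nil
	})
	return calls, err
}

// demoCancelledDuringWait cancels the context while the first attempt fails,
// so the hour-long delay is never waited out.
func demoCancelledDuringWait() (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	calls := 0
	err := retryWithContext(ctx, 5, time.Hour, func(context.Context) error {
		calls++
		cancel() // simulate shutdown arriving mid-attempt
		return errors.New("transient failure")
	})
	return calls, err
}

func main() {
	fmt.Println(demoSucceedsOnThirdAttempt())
	fmt.Println(demoCancelledDuringWait())
}
```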
🧹 Nitpick comments (6)
fault-remediation/pkg/events/health_event.go (1)

10-14: Clarify godoc comment to distinguish HealthEventData from HealthEventDoc.

The godoc comment doesn't explain the distinction between HealthEventData (BSON-tagged) and HealthEventDoc (JSON-tagged). Consider documenting the intended use case for each type to improve maintainability.

🔎 Proposed improvement
-// HealthEventData represents health event data with string ID for compatibility
+// HealthEventData represents health event data with BSON "_id,omitempty" tag for MongoDB storage.
+// Use HealthEventDoc for JSON-based representations.
 type HealthEventData struct {
 	ID                          string `bson:"_id,omitempty"`
 	model.HealthEventWithStatus `bson:",inline"`
 }
fault-remediation/pkg/crstatus/crstatus_test.go (1)

120-212: Consider reducing test duplication with a helper function.

TestCheckConditionCtrlRuntime duplicates ~90 lines from TestCheckCondition. Consider extracting a helper function that accepts the checker as a parameter to test both implementations with the same test cases, improving maintainability and ensuring test case parity.

🔎 Proposed refactor
+func testCheckCondition(t *testing.T, checker interface {
+	checkCondition(*unstructured.Unstructured) bool
+}) {
+	tests := []struct {
+		name     string
+		cr       *unstructured.Unstructured
+		expected bool
+	}{
+		{
+			name: "no status returns skip - in progress",
+			cr: &unstructured.Unstructured{
+				Object: map[string]any{
+					"metadata": map[string]any{"name": "test-cr"},
+				},
+			},
+			expected: true,
+		},
+		// ... rest of test cases
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			result := checker.checkCondition(tt.cr)
+			assert.Equal(t, tt.expected, result)
+		})
+	}
+}
+
 func TestCheckCondition(t *testing.T) {
 	cfg := &config.MaintenanceResource{
 		CompleteConditionType: "Completed",
 	}
 	checker := NewCRStatusChecker(nil, nil, cfg, false)
-
-	tests := []struct {
-		// ... test cases
-	}
-
-	for _, tt := range tests {
-		t.Run(tt.name, func(t *testing.T) {
-			result := checker.checkCondition(tt.cr)
-			assert.Equal(t, tt.expected, result)
-		})
-	}
+	testCheckCondition(t, checker)
 }
 
 func TestCheckConditionCtrlRuntime(t *testing.T) {
 	cfg := &config.MaintenanceResource{
 		CompleteConditionType: "Completed",
 	}
 	checker := NewCtrlRuntimeCRStatusChecker(nil, cfg, false)
-
-	tests := []struct {
-		// ... duplicate test cases
-	}
-
-	for _, tt := range tests {
-		t.Run(tt.name, func(t *testing.T) {
-			result := checker.checkCondition(tt.cr)
-			assert.Equal(t, tt.expected, result)
-		})
-	}
+	testCheckCondition(t, checker)
 }
fault-remediation/pkg/initializer/init.go (1)

88-112: Consider extracting TemplateData initialization to reduce duplication.

The TemplateData struct initialization (lines 91-95 and 103-106) is duplicated across both initialization paths. Extract it to a variable before the conditional to improve maintainability.

🔎 Proposed refactor
+	templateData := remediation.TemplateData{
+		TemplateMountPath:   tomlConfig.Template.MountPath,
+		TemplateFileName:    tomlConfig.Template.FileName,
+		MaintenanceResource: tomlConfig.MaintenanceResource,
+	}
+
 	if params.UseCtrlRuntime {
 		remediationClient, err = remediation.NewCtrlRuntimeRemediationClient(
 			ctrlruntimeClient,
-			params.DryRun, remediation.TemplateData{
-				TemplateMountPath:   tomlConfig.Template.MountPath,
-				TemplateFileName:    tomlConfig.Template.FileName,
-				MaintenanceResource: tomlConfig.MaintenanceResource,
-			})
+			params.DryRun,
+			templateData)
 		if err != nil {
 			return nil, fmt.Errorf("error while initializing ctrl runtime client: %w", err)
 		}
 	} else {
 		remediationClient, clientSet, err = remediation.NewK8sClient(
 			params.KubeconfigPath,
 			params.DryRun,
-			remediation.TemplateData{
-				TemplateMountPath:   tomlConfig.Template.MountPath,
-				TemplateFileName:    tomlConfig.Template.FileName,
-				MaintenanceResource: tomlConfig.MaintenanceResource,
-			},
+			templateData,
 		)
 		if err != nil {
 			return nil, fmt.Errorf("error while initializing kubernetes client: %w", err)
 		}
 	}
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

17-54: Consider grouping imports according to Go conventions.

The imports have non-standard ordering with context followed by a blank line, then log. Standard Go convention groups stdlib imports together, then a blank line, then external packages, then a blank line, then internal packages.


1109-1111: Avoid using time.Sleep for synchronization in tests.

Using time.Sleep(500 * time.Millisecond) for waiting on event processing is fragile. Consider using assert.Eventually consistently for better test reliability.

🔎 Suggested approach
-	// Allow time for second event to be processed (should be deduplicated)
-	time.Sleep(500 * time.Millisecond)
+	// Wait for second event to be processed (should be deduplicated)
+	time.Sleep(100 * time.Millisecond) // Brief pause to allow processing

Or better, add an Eventually check for a specific condition that indicates processing completed.

fault-remediation/pkg/remediation/templates/log-collector-job.yaml (1)

28-32: Consider adding resource limits and additional parameterization.

To improve cluster stability and operational flexibility:

  1. Define CPU and memory resource requests and limits for the log-collector container
  2. Consider parameterizing additional fields such as ttlSecondsAfterFinished, serviceAccountName, and container arguments/env vars if they vary across deployments
🔎 Example resource limits
       containers:
         - name: log-collector
           image: {{.Image}}
+          resources:
+            requests:
+              memory: "128Mi"
+              cpu: "100m"
+            limits:
+              memory: "256Mi"
+              cpu: "200m"
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0f60b1 and b05f1bf.

📒 Files selected for processing (23)
  • .gitignore
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/crstatus/deprecated_checker.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
💤 Files with no reviewable changes (1)
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/annotation/annotation_interface.go
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/crstatus/deprecated_checker.go
  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
🧠 Learnings (17)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use separate informers for different Kubernetes resource types

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
🧬 Code graph analysis (11)
fault-remediation/pkg/initializer/init.go (4)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
  • FaultRemediationClientInterface (29-34)
  • TemplateData (37-44)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewCtrlRuntimeRemediationClient (47-95)
fault-remediation/pkg/config/config.go (2)
  • Template (27-30)
  • MaintenanceResource (18-24)
commons/pkg/statemanager/statemanager.go (2)
  • StateManager (197-200)
  • NewStateManager (206-210)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/crstatus/crstatus_test.go (2)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/crstatus/checker.go (1)
  • NewCtrlRuntimeCRStatusChecker (34-44)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (8)
health-monitors/kubernetes-object-monitor/pkg/cel/environment.go (1)
  • Environment (30-35)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (41-43)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
  • FaultRemediationClientInterface (29-34)
  • TemplateData (37-44)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewCtrlRuntimeRemediationClient (47-95)
commons/pkg/statemanager/statemanager.go (2)
  • StateManager (197-200)
  • RemediatingLabelValue (171-171)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventDoc (5-8)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • AnnotationKey (12-12)
fault-remediation/pkg/metrics/metrics.go (4)
  • TotalEventsReceived (33-38)
  • EventHandlingDuration (62-68)
  • EventsProcessed (39-45)
  • ProcessingErrors (46-52)
fault-remediation/pkg/crstatus/deprecated_checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/remediation/deprecated_remediation_test.go (4)
fault-remediation/pkg/config/config.go (1)
  • Template (27-30)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • TemplateData (37-44)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (11-14)
fault-remediation/pkg/remediation/deprecated_remediation.go (1)
  • FaultRemediationClient (62-74)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/reconciler/reconciler_test.go (4)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventData (11-14)
  • HealthEventDoc (5-8)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • NodeAnnotationManagerInterface (16-21)
  • RemediationStateAnnotation (24-26)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (7-9)
data-models/pkg/protos/health_event.pb.go (2)
  • RecommendedAction_RESTART_BM (48-48)
  • RecommendedAction_NONE (43-43)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/reconciler/reconciler.go (8)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • FaultRemediationClientInterface (29-34)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/metrics/metrics.go (7)
  • EventHandlingDuration (62-68)
  • TotalEventsReceived (33-38)
  • TotalUnsupportedRemediationActions (53-59)
  • ProcessingErrors (46-52)
  • EventsProcessed (39-45)
  • CRStatusSkipped (26-26)
  • CRStatusCreated (25-25)
commons/pkg/statemanager/statemanager.go (1)
  • StateManager (197-200)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_NONE (43-43)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventDoc (5-8)
  • HealthEventData (11-14)
data-models/pkg/model/health_event_extentions.go (1)
  • HealthEventWithStatus (53-57)
store-client/pkg/client/convenience.go (1)
  • UpdateHealthEventStatus (31-35)
fault-remediation/pkg/remediation/deprecated_remediation.go (6)
fault-remediation/pkg/config/config.go (1)
  • Template (27-30)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (7-9)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (11-14)
fault-remediation/pkg/common/equivalence_groups.go (1)
  • GetRemediationGroupForAction (35-45)
fault-remediation/pkg/metrics/metrics.go (2)
  • LogCollectorJobs (71-77)
  • LogCollectorJobDuration (78-85)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (29)
.gitignore (1)

143-143: ⚠️ Scope mismatch: .gitignore changes don't align with PR objectives.

This file's changes focus on consolidating IDE configuration patterns, which is unrelated to the PR's stated objectives around remediation retry behavior and error handling. The PR summary describes refactoring fault remediation with dual execution modes and controller-runtime integration, not .gitignore cleanup.

If this .gitignore consolidation is intentional as part of broader housekeeping, clarify this in the PR description. Alternatively, consider submitting it as a separate, focused PR to keep scopes distinct.

fault-remediation/pkg/metrics/metrics.go (1)

33-92: LGTM: Metric variable exports are well-structured.

The renamed metric variables follow Go naming conventions and properly integrate with the controller-runtime metrics registry. The metric definitions (names, help texts, labels) are preserved correctly.

fault-remediation/pkg/annotation/deprecated_annotation.go (3)

83-125: LGTM: GetRemediationState signature change properly implemented.

The updated method signature correctly returns *corev1.Node alongside the remediation state. All return paths (success and error) properly handle the additional return value, and error wrapping follows Go conventions with %w.


131-131: LGTM: Call sites correctly updated for new GetRemediationState signature.

Both UpdateRemediationState and RemoveGroupFromState properly handle the additional *corev1.Node return value by using _ to discard it, which is appropriate since these methods don't require the node reference.

Also applies to: 186-186


45-80: LGTM: Robust retry logic with appropriate error classification.

The retry implementation correctly handles common transient Kubernetes API errors (conflicts, timeouts, rate limiting, service unavailability) with exponential backoff. Error wrapping follows Go conventions with %w to preserve error chains.

fault-remediation/pkg/initializer/init.go (3)

58-60: LGTM: Proper validation of ctrl-runtime client.

The validation correctly ensures that a ctrl-runtime client is provided when UseCtrlRuntime is enabled, preventing nil pointer errors in the ctrl-runtime initialization path.


148-157: LGTM: Reconciler configuration properly wired.

The reconciler configuration correctly integrates the remediation client abstraction, retry settings, and log collector flag. (Note: StateManager nil issue addressed in separate comment.)


152-153: StateManager will panic with nil clientSet in ctrl-runtime mode.

When UseCtrlRuntime is true, clientSet remains nil but is passed to statemanager.NewStateManager(clientSet). StateManager methods directly call clientSet.CoreV1().Nodes().Get() without nil checks, causing a panic at runtime. The TODO acknowledges this but doesn't prevent the runtime error.

Either implement a ctrl-runtime version of StateManager or add nil checks to guard StateManager method calls in ctrl-runtime mode.

Likely an incorrect or invalid review comment.

fault-remediation/pkg/remediation/deprecated_remediation_test.go (1)

15-476: Test updates correctly reflect the refactored API.

The test changes appropriately adapt to the new public API surface, including:

  • Field capitalizations (Clientset, KubeClient, RestMapper, DryRunMode, Template, TemplateData)
  • Updated CreateMaintenanceResource signature returning (string, error) and accepting events.HealthEventData
  • Updated RunLogCollectorJob signature including eventId parameter and returning (ctrl.Result, error)

The test logic and assertions remain sound.

fault-remediation/pkg/remediation/remediation.go (2)

47-95: LGTM: Well-structured constructor with proper validation.

The constructor correctly:

  • Validates template file existence before reading
  • Handles template parsing errors
  • Initializes dry-run mode appropriately
  • Sets up annotation manager and status checker components

322-478: LGTM: Robust log collector status checking with metric guards.

The status checking implementation correctly:

  • Handles complete, failed, and timeout states separately
  • Uses annotation-based guards to prevent duplicate metric recording across reconciliations
  • Configures timeout via environment variable with safe fallback
  • Requeues appropriately when job is still running
fault-remediation/pkg/crstatus/checker.go (2)

28-44: LGTM - Clean refactoring to controller-runtime client.

The struct and constructor are well-structured with the controller-runtime client integration. The simplified field set improves maintainability.


71-85: LGTM - Condition checking logic is correct.

The checkCondition method properly handles missing status/conditions by returning true (allowing CR creation), and correctly delegates to isTerminal for status evaluation.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

208-217: LGTM - Controller-runtime manager setup is correct.

The manager is properly configured with the test environment config, scheme, and disabled metrics server (BindAddress: "0"). The client is correctly obtained from the manager.


302-315: LGTM - Test remediation client creation properly uses new API.

The createTestRemediationClient correctly constructs remediation.TemplateData and uses remediation.NewCtrlRuntimeRemediationClient with the controller-runtime client.

fault-remediation/pkg/remediation/deprecated_remediation.go (2)

506-517: Verify intentional error suppression for log collector failures.

The code logs errors but returns nil for both timeout and job failure cases. Based on the past review discussion, this is intentional to allow reconciliation to continue. However, consider whether returning ctrl.Result{Requeue: true} might be more appropriate for transient failures.

Confirm that swallowing errors here aligns with the intended behavior discussed in past reviews - allowing remediation to proceed even if log collection fails.


62-74: LGTM - Struct fields properly exposed with consistent naming.

The FaultRemediationClient struct fields are well-organized with clear naming conventions. The nodeExistsFunc allows for test overrides.

fault-remediation/pkg/reconciler/reconciler.go (6)

203-211: Error propagation enables retry behavior as intended.

Returning the error from UpdateNVSentinelStateNodeLabel allows controller-runtime to retry the reconciliation. This aligns with the PR objective.


220-242: Good error aggregation pattern with errors.Join.

The code properly handles the case where CR creation fails by:

  1. Recording the error but continuing to update state
  2. Using errors.Join to combine errors when both operations fail
  3. Returning the CR creation error after state update

This ensures state is updated even on failure while still propagating errors for retry.


406-410: Error propagation from GetRemediationState enables retry.

Previously this may have silently continued; now it returns the error to trigger a retry. The underscore for the unused node return value is appropriate.


438-441: Error propagation from RemoveGroupFromState enables retry.

Returning false, "", err instead of just logging ensures the reconciler will retry on annotation update failures.


328-330: Verify log collector error handling aligns with intended behavior.

The code returns early if err != nil or !result.IsZero(). Per past review discussions, log collector failures should not block remediation. Confirm this matches the current implementation in RunLogCollectorJob.

Based on the deprecated_remediation.go changes, RunLogCollectorJob now returns nil error on timeout/failure, so this early return would only trigger on actual API errors during job creation.


332-344: Good error handling with status propagation.

The pattern of:

  1. Attempting remediation
  2. Updating status regardless of outcome
  3. Joining errors if both fail
  4. Returning the original error

Ensures status is always updated while still triggering retries on failure.

fault-remediation/pkg/reconciler/reconciler_test.go (5)

40-66: LGTM - Mock implementations properly updated for new interfaces.

The MockK8sClient correctly implements the updated FaultRemediationClientInterface with the new return types for CreateMaintenanceResource and RunLogCollectorJob.


102-134: LGTM - MockNodeAnnotationManager implements the new interface correctly.

The mock properly returns the 3-tuple (*RemediationStateAnnotation, *corev1.Node, error) from GetRemediationState, with nil for the node which is appropriate for unit tests.


383-390: LGTM - Test properly constructs HealthEventDoc from HealthEventData.

The conversion pattern from events.HealthEventData to events.HealthEventDoc is correctly applied in the test.


443-446: Test correctly expects error on remediation failure.

The test verifies that performRemediation returns an error when CreateMaintenanceResource fails, confirming the retry behavior.


489-491: Test correctly expects error on label update failure.

The test now expects an error when UpdateNVSentinelStateNodeLabel fails, validating the new retry-on-error behavior introduced by this PR.

fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)

1-20: Template is missing required force field in spec.

The RebootNode CRD (janitor.dgxc.nvidia.com/v1alpha1) requires both force (boolean, defaults to false) and nodeName fields in the spec. Your template only includes nodeName, which will cause validation errors when creating RebootNode resources.

Add the missing field:

Required fix
apiVersion: {{.ApiGroup}}/{{.Version}}
kind: RebootNode
metadata:
  name: maintenance-{{.NodeName}}-{{.HealthEventID}}
spec:
  nodeName: {{.NodeName}}
  force: false

Also verify that ApiGroup and Version placeholders are populated with janitor.dgxc.nvidia.com and v1alpha1 respectively during template rendering.

⛔ Skipped due to learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 13

🤖 Fix all issues with AI agents
In @commons/pkg/statemanager/statemanager_test.go:
- Around line 281-302: Remove the dead commented-out test block for
TestUpdateNVSentinelStateNodeLabelWithUpdateFailureCtrlRuntime: delete the
entire commented function (including references to ctrlRuntimeStateManager,
fakeClient, and the UpdateNVSentinelStateNodeLabel calls) from
statemanager_test.go; if the test work is intended later, open a tracking issue
referencing this test name and move any needed notes there instead of leaving
commented code.

In @commons/pkg/statemanager/statemanager.go:
- Around line 310-388: The
ctrlRuntimeStateManager.UpdateNVSentinelStateNodeLabel implementation lacks
retry-on-conflict handling; wrap the Get/Modify/Update sequence in a retry loop
(e.g., retry.OnError with errors.IsConflict) so transient optimistic-concurrency
conflicts are retried, introduce a local nodeModified bool to track whether the
label was actually changed/removed inside the retry closure, and ensure the
function returns that nodeModified and the final error from the retry call; keep
using validateStateTransition, manager.client.Get and manager.client.Update
inside the retry closure and return validationErr only after a successful
update.

In @fault-remediation/main.go:
- Around line 171-182: The retry loop around
components.FaultRemediationReconciler.Reconcile does not respect gCtx
cancellation because it uses time.Sleep; change the backoff to a context-aware
wait by replacing the
time.Sleep(components.FaultRemediationReconciler.Config.UpdateRetryDelay) with a
select that waits on
time.After(components.FaultRemediationReconciler.Config.UpdateRetryDelay) and on
gCtx.Done(), and if gCtx is cancelled return or break out of the loop; ensure
you check gCtx.Done() before each retry and abort retries when the context is
done so graceful shutdown on SIGTERM is honored.

In @fault-remediation/pkg/annotation/annotation.go:
- Around line 66-74: GetRemediationState can fail and currently the code
swallows the error and proceeds with an empty RemediationStateAnnotation while
using the invalid node value returned from the failed call later (e.g.,
node.DeepCopy()), which can cause incorrect behavior; change the handling in the
caller so that when m.GetRemediationState(ctx, nodeName) returns an error you
immediately return that error (or wrap and return it) instead of continuing with
an empty state, ensuring you do not call methods like node.DeepCopy() on the
invalid node; update the function containing this logic to propagate the error
from GetRemediationState rather than creating a placeholder
RemediationStateAnnotation.

In @fault-remediation/pkg/crstatus/checker.go:
- Around line 58-66: The ObjectKey used for c.client.Get is missing the
Namespace, so lookups for namespaced CRs fail; update the key construction (the
client.ObjectKey passed to c.client.Get where obj, gvk and crName are used) to
include the namespace from the MaintenanceResource (e.g., use the resource's
Namespace field or crNamespace variable) so the lookup uses both Name and
Namespace before calling c.client.Get; ensure the Namespace is set only when
non-empty for cluster-scoped resources.

In @fault-remediation/pkg/crstatus/crstatus_interface.go:
- Around line 1-9: The package crstatus lacks a package-level godoc comment; add
a brief package comment immediately above "package crstatus" describing the
package's purpose (e.g., utilities for checking/handling CR status), mention the
exported interface CRStatusCheckerInterface and its method
ShouldSkipCRCreation(context.Context, string) so docs are clear, keeping the
comment concise and in godoc style.

In @fault-remediation/pkg/crstatus/deprecated_checker.go:
- Around line 77-91: The checkCondition method in CRStatusChecker has inverted
skip logic: when status or conditions are missing (in the nested map/slice
checks in checkCondition) it currently returns true but should return false
(treat missing as non-terminal), and the final return should not negate
isTerminal β€” replace "return !c.isTerminal(conditionStatus)" with "return
c.isTerminal(conditionStatus)"; locate these in checkCondition (calls to
unstructured.NestedMap, unstructured.NestedSlice, findConditionStatus and
isTerminal) and invert those boolean returns accordingly.

In @fault-remediation/pkg/events/health_event.go:
- Around line 5-8: Add a godoc comment immediately above the exported type
HealthEventDoc describing its purpose and fields; mention that it represents a
HealthEvent document with an ID and inlined model.HealthEventWithStatus (so
readers understand the JSON tags and inline embedding). Ensure the comment
starts with "HealthEventDoc" and is a complete sentence per Go conventions.
- Around line 1-3: Add a package-level godoc comment above the package
declaration for package events in health_event.go that briefly documents the
package purpose and intended usage (e.g., what health events are represented and
how consumers should use this package). Ensure the comment is a full sentence
starting with "Package events ..." and sits immediately above the existing
"package events" line so godoc tools pick it up; update any existing top-of-file
comments if present to follow the "Package events ..." convention.

In @fault-remediation/pkg/initializer/init.go:
- Line 117: The log message "Successfully initialized k8s client" is misleading
when UseCtrlRuntime is true; update the logging in the client initialization
(where slog.Info is called) to check the UseCtrlRuntime flag and log a precise
message (e.g., "Successfully initialized ctrl-runtime remediation client" when
UseCtrlRuntime is true, otherwise "Successfully initialized k8s client"),
referencing the UseCtrlRuntime boolean and the existing slog.Info call to locate
the spot to change.

In @fault-remediation/pkg/remediation/deprecated_remediation.go:
- Around line 266-293: The code calls UpdateRemediationState twice: first using
crName and then again using actualCRName from createdCR.GetName(); remove the
first redundant update (the block that uses crName) so only the second
UpdateRemediationState call remains, keeping the group lookup via
common.GetRemediationGroupForAction(healthEvent.RecommendedAction) and the
c.AnnotationManager checks intact and returning errors as currently done in the
second block.

In @fault-remediation/pkg/remediation/remediation.go:
- Around line 183-192: Remove the redundant provisional annotation update:
delete the block that calls c.annotationManager.UpdateRemediationState(ctx,
healthEvent.NodeName, group, crName) (the update using the provisional crName)
and its error handling; keep the later authoritative update that uses
actualCRName retrieved from maintenance.GetName() so only the single, final
UpdateRemediationState call remains. Ensure any related variables (crName) are
still set if needed elsewhere, but do not perform the early
UpdateRemediationState call with the provisional name.
🧹 Nitpick comments (9)
fault-remediation/pkg/crstatus/crstatus_test.go (1)

120-212: Consider extracting shared test cases to reduce duplication.

The test logic is correct and comprehensive. However, the test cases are identical to TestCheckCondition (lines 26-118). Consider extracting the shared test cases into a variable or helper function to reduce duplication and improve maintainability.

♻️ Proposed refactor to share test cases
+// Shared test cases for both checker variants
+func getConditionTestCases() []struct {
+	name     string
+	cr       *unstructured.Unstructured
+	expected bool
+} {
+	return []struct {
+		name     string
+		cr       *unstructured.Unstructured
+		expected bool
+	}{
+		{
+			name: "no status returns skip - in progress",
+			cr: &unstructured.Unstructured{
+				Object: map[string]any{
+					"metadata": map[string]any{"name": "test-cr"},
+				},
+			},
+			expected: true,
+		},
+		// ... rest of test cases
+	}
+}
+
 func TestCheckCondition(t *testing.T) {
 	cfg := &config.MaintenanceResource{
 		CompleteConditionType: "Completed",
 	}
 	checker := NewCRStatusChecker(nil, nil, cfg, false)
-
-	tests := []struct {
-		name     string
-		cr       *unstructured.Unstructured
-		expected bool
-	}{
-		// ... test cases
-	}
+	tests := getConditionTestCases()
 
 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
@@ -113,93 +115,7 @@
 
 func TestCheckConditionCtrlRuntime(t *testing.T) {
 	// ... same setup
 	checker := NewCtrlRuntimeCRStatusChecker(nil, cfg, false)
-
-	tests := []struct {
-		name     string
-		cr       *unstructured.Unstructured
-		expected bool
-	}{
-		// ... identical test cases
-	}
+	tests := getConditionTestCases()
 
 	for _, tt := range tests {
 		// ... same loop
commons/pkg/statemanager/statemanager_test.go (1)

17-32: Import ordering does not follow Go conventions.

Standard library imports should be grouped separately from third-party imports. The k8s.io/apimachinery/pkg/types import on line 20 is mixed with the standard library block.

♻️ Suggested fix
 import (
 	"context"
 	"fmt"
-	"k8s.io/apimachinery/pkg/types"
 	"testing"
 
 	"github.com/stretchr/testify/assert"
 	v1 "k8s.io/api/core/v1"
 	"k8s.io/apimachinery/pkg/api/errors"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/runtime"
 	"k8s.io/apimachinery/pkg/runtime/schema"
+	"k8s.io/apimachinery/pkg/types"
 	"k8s.io/client-go/kubernetes/fake"
 	ktesting "k8s.io/client-go/testing"
 	ctrlFake "sigs.k8s.io/controller-runtime/pkg/client/fake"
 )
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)

15-15: Missing package-level godoc.

As per coding guidelines, package-level godoc is required for all Go packages.

📝 Proposed fix
+// Package remediation provides interfaces and implementations for fault remediation
+// operations, including maintenance resource creation and log collection.
 package remediation

31-31: Parameter name should use Go acronym convention.

Go convention is to use ID (all caps) for acronyms, not Id.

♻️ Proposed fix
-	RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
+	RunLogCollectorJob(ctx context.Context, nodeName string, eventID string) (ctrl.Result, error)
commons/pkg/statemanager/statemanager.go (1)

144-156: Import ordering does not follow Go conventions.

Standard library and third-party imports are mixed. Group standard library imports together, separated from third-party imports.

♻️ Proposed fix
 import (
 	"context"
 	"fmt"
-	corev1 "k8s.io/api/core/v1"
-	"k8s.io/apimachinery/pkg/types"
 	"log/slog"
-	"sigs.k8s.io/controller-runtime/pkg/client"
 
+	corev1 "k8s.io/api/core/v1"
 	"k8s.io/apimachinery/pkg/api/errors"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/types"
 	"k8s.io/client-go/kubernetes"
 	"k8s.io/client-go/util/retry"
+	"sigs.k8s.io/controller-runtime/pkg/client"
 )
fault-remediation/pkg/annotation/annotation.go (1)

1-1: Missing package-level godoc.

As per coding guidelines, package-level godoc is required for all Go packages.

📝 Proposed fix
+// Package annotation provides node annotation management for tracking remediation state
+// across fault remediation operations.
 package annotation
fault-remediation/main.go (1)

199-199: TODO comment should reference an issue.

As per coding guidelines, TODO comments in Go code should reference issues for tracking.

📝 Proposed fix
-	//TODO: setup informers for node and job
+	//TODO(#issue_number): setup informers for node and job
fault-remediation/pkg/remediation/deprecated_remediation.go (1)

208-210: Add context when returning template execution error.

Per coding guidelines, wrap errors with context for better traceability.

📝 Suggested enhancement
 	if err = c.Template.Execute(&buf, c.TemplateData); err != nil {
 		slog.Error("Failed to execute maintenance Template", "error", err)
-		return "", err
+		return "", fmt.Errorf("failed to execute maintenance template: %w", err)
 	}
fault-remediation/pkg/reconciler/reconciler.go (1)

203-211: Consider adding context when returning label update error.

Per coding guidelines, wrapping errors with context improves traceability.

📝 Suggested enhancement
 	_, err := r.Config.StateManager.UpdateNVSentinelStateNodeLabel(ctx,
 		healthEventWithStatus.HealthEvent.NodeName,
 		statemanager.RemediatingLabelValue, false)
 	if err != nil {
 		slog.Error("Error updating node label to remediating", "error", err)
 		metrics.ProcessingErrors.WithLabelValues("label_update_error", nodeName).Inc()
 
-		return "", err
+		return "", fmt.Errorf("failed to update node label to remediating: %w", err)
 	}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b05f1bf and 8788f69.

⛔ Files ignored due to path filters (1)
  • commons/go.sum is excluded by !**/*.sum
📒 Files selected for processing (26)
  • .gitignore
  • commons/go.mod
  • commons/pkg/statemanager/statemanager.go
  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/crstatus/deprecated_checker.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
💤 Files with no reviewable changes (1)
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • fault-remediation/pkg/remediation/remediation_test.go
  • .gitignore
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/annotation/annotation_interface.go
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/main.go
  • commons/pkg/statemanager/statemanager.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/crstatus/deprecated_checker.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/go.mod

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

  • commons/go.mod
🧠 Learnings (25)
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

  • fault-remediation/pkg/events/health_event.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/metrics/metrics.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2026-01-06T21:31:36.113Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:36.113Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.

Applied to files:

  • commons/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • commons/go.mod
  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

  • commons/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

  • commons/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • commons/pkg/statemanager/statemanager_test.go
  • commons/pkg/statemanager/statemanager.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use separate informers for different Kubernetes resource types

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧬 Code graph analysis (10)
fault-remediation/pkg/crstatus/crstatus_test.go (2)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/crstatus/checker.go (1)
  • NewCtrlRuntimeCRStatusChecker (34-44)
commons/pkg/statemanager/statemanager_test.go (1)
commons/pkg/statemanager/statemanager.go (9)
  • QuarantinedLabelValue (166-166)
  • NVSentinelStateLabelKey (159-159)
  • DrainingLabelValue (169-169)
  • NVSentinelStateLabelValue (162-162)
  • DrainSucceededLabelValue (170-170)
  • DrainFailedLabelValue (171-171)
  • RemediatingLabelValue (174-174)
  • RemediationSucceededLabelValue (175-175)
  • RemediationFailedLabelValue (176-176)
fault-remediation/main.go (3)
fault-remediation/pkg/reconciler/reconciler.go (1)
  • FaultRemediationReconciler (61-69)
commons/pkg/auditlogger/roundtripper.go (1)
  • NewAuditingRoundTripper (42-47)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (41-43)
fault-remediation/pkg/crstatus/deprecated_checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/remediation/deprecated_remediation_test.go (4)
fault-remediation/pkg/config/config.go (1)
  • Template (27-30)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • TemplateData (37-44)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (11-14)
fault-remediation/pkg/remediation/deprecated_remediation.go (1)
  • FaultRemediationClient (62-74)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/reconciler/reconciler_test.go (4)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventData (11-14)
  • HealthEventDoc (5-8)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • NodeAnnotationManagerInterface (16-21)
  • RemediationStateAnnotation (24-26)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (7-9)
data-models/pkg/protos/health_event.pb.go (2)
  • RecommendedAction_RESTART_BM (48-48)
  • RecommendedAction_NONE (43-43)
fault-remediation/pkg/reconciler/reconciler.go (8)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • FaultRemediationClientInterface (29-34)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/metrics/metrics.go (5)
  • EventHandlingDuration (62-68)
  • TotalEventsReceived (33-38)
  • TotalUnsupportedRemediationActions (53-59)
  • ProcessingErrors (46-52)
  • EventsProcessed (39-45)
commons/pkg/statemanager/statemanager.go (1)
  • StateManager (200-203)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_NONE (43-43)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventDoc (5-8)
  • HealthEventData (11-14)
data-models/pkg/model/health_event_extentions.go (1)
  • HealthEventWithStatus (53-57)
store-client/pkg/storewatcher/watch_store.go (1)
  • ChangeStreamWatcher (43-57)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (37)
commons/go.mod (1)

26-26: Dual json-patch versions are expected transitive dependencies and require no action.

Both github.com/evanphx/json-patch/v5 v5.9.11 and gopkg.in/evanphx/json-patch.v4 v4.13.0 are indirect (transitive) dependencies pulled in by different upstream packages. The presence of both versions is normal in Go module management when different dependencies require different versions. Neither is directly used in the commons module code, and consolidation is not possible without changing upstream dependencies. The v5.9.11 version is stable and actively maintained.

Likely an incorrect or invalid review comment.

fault-remediation/pkg/crstatus/checker.go (3)

28-44: LGTM!

The type definition and constructor are well-structured, correctly using the controller-runtime client and properly initializing all fields.


71-85: LGTM!

The condition checking logic correctly handles missing status/conditions and delegates to helper methods for status evaluation.


87-106: LGTM!

The helper methods correctly implement condition status lookup and terminal state checking with appropriate type assertions and fallback behavior.

fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)

1-20: LGTM! Static analysis error is a false positive.

The template structure is correct. The YAMLlint syntax error on line 15 is a false positive: the file uses Go template placeholders (e.g., {{.ApiGroup}}) that will be rendered before being parsed as YAML.
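To illustrate why a linter that reads the raw file trips over it, here is a minimal sketch that renders a manifest with `text/template` before anything treats it as YAML. The template text, the `resource` struct, and its field values are hypothetical; only the `{{.ApiGroup}}` placeholder comes from the review.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// resource is a hypothetical stand-in for the data passed to the template.
type resource struct {
	ApiGroup string
	Version  string
	NodeName string
}

// rawTemplate is an illustrative manifest, not the real rebootnode template.
// A YAML parser reading it directly sees "{{" as the start of a flow mapping.
const rawTemplate = `apiVersion: {{.ApiGroup}}/{{.Version}}
kind: RebootNode
spec:
  nodeName: {{.NodeName}}
`

// render executes the template so the output contains no placeholders.
func render(data resource) (string, error) {
	tmpl, err := template.New("manifest").Parse(rawTemplate)
	if err != nil {
		return "", fmt.Errorf("parse template: %w", err)
	}

	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("execute template: %w", err)
	}

	return buf.String(), nil
}

func main() {
	out, err := render(resource{ApiGroup: "example.com", Version: "v1", NodeName: "node-1"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}

	// Only this rendered text is meant to be valid YAML.
	fmt.Print(out)
}
```

Linting the rendered output instead of the template source would avoid this class of false positive.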

fault-remediation/pkg/events/health_event.go (1)

10-14: LGTM!

The HealthEventData type is properly documented and the struct definition is correct with appropriate BSON tags for MongoDB compatibility.

commons/pkg/statemanager/statemanager_test.go (5)

269-279: LGTM!

Good test coverage for the ctrl-runtime path's Get failure scenario, using the fake client without pre-created objects to simulate a node not found error.


304-330: LGTM!

Comprehensive test for adding a label via the ctrl-runtime path with proper verification of the final node state.


332-361: LGTM!

Proper test for label removal with verification that the label no longer exists on the node.


449-495: LGTM!

Good table-driven test pattern for verifying label removal from all possible states works without validation errors in the ctrl-runtime path.


497-572: LGTM!

Comprehensive state transition test covering both valid and invalid transitions, with proper verification that labels are set even for unexpected transitions.

fault-remediation/pkg/initializer/init.go (3)

51-60: LGTM!

Good defensive validation ensuring the ctrl-runtime client is provided when UseCtrlRuntime is true. This prevents nil pointer panics downstream.


83-115: LGTM!

The dual-mode initialization properly separates the ctrl-runtime and k8s client paths, fixing the previous issue where clientSet could be nil in ctrl-runtime mode. Each path now correctly initializes its required dependencies.


151-160: LGTM!

The reconciler configuration now correctly receives the abstracted RemediationClient and StateManager, enabling both initialization paths to work with the same reconciler interface.

fault-remediation/pkg/metrics/metrics.go (2)

15-27: LGTM!

Clean package reorganization with appropriate exported constants for CR status tracking.


33-92: LGTM!

Metrics properly exported and registered with the controller-runtime metrics registry. The naming follows Prometheus conventions with the fault_remediation_ prefix.

fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)

29-44: LGTM on interface design.

The interface provides a clean abstraction over remediation operations with appropriate accessors for annotation management and status checking. The TemplateData struct properly embeds config.MaintenanceResource for template rendering.

commons/pkg/statemanager/statemanager.go (1)

298-306: LGTM on struct and constructor.

Clean implementation following the same pattern as NewStateManager.

fault-remediation/pkg/annotation/annotation.go (2)

107-132: LGTM!

Clean implementation for clearing the remediation state annotation with proper nil-check and patch-based update.


134-169: LGTM!

Well-structured logic to remove a specific group, with automatic cleanup when no groups remain.

fault-remediation/main.go (3)

194-197: LGTM!

Good integration of the auditing round-tripper wrapper for request auditing in the ctrl-runtime path.


228-244: LGTM!

Proper initialization flow with deferred cleanup for the datastore components in the ctrl-runtime path.


122-136: LGTM!

Clean separation of the non-ctrl-runtime initialization with proper deferred cleanup.

fault-remediation/pkg/annotation/deprecated_annotation.go (2)

46-59: LGTM!

The retry logic with retry.OnError is correctly implemented, and wrapping errors with %w preserves the error chain for retry detection.


83-124: LGTM!

The signature update to return the Node object alongside the state is correctly implemented and aligns with the new interface definition. This enables owner-reference-based operations in remediation flows.

fault-remediation/pkg/remediation/remediation.go (1)

375-377: LGTM: Error handling aligns with PR objectives.

The Update calls at lines 375-377, 413-416, and 465-468 correctly return errors without wrapping, which will trigger retries in the reconciliation loop as intended by the PR title "retry on errors and throw errors to trigger retries."

Also applies to: 413-416, 465-468

fault-remediation/pkg/remediation/deprecated_remediation_test.go (1)

15-418: LGTM!

The test updates correctly reflect the refactoring from private to exported fields in the FaultRemediationClient struct, and the usage of the new events.HealthEventData type.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

1-1380: LGTM!

The e2e test suite is well-structured and follows coding guidelines:

  • Uses envtest for testing Kubernetes controllers as recommended
  • Correctly handles the updated GetRemediationState signature throughout
  • Comprehensive test coverage for various reconciliation scenarios
  • Proper metrics validation
fault-remediation/pkg/remediation/deprecated_remediation.go (2)

365-504: LGTM: Comprehensive metrics instrumentation.

The metrics instrumentation throughout RunLogCollectorJob provides good observability with appropriate labels (error types, node names, job outcomes).


254-264: LGTM: Proper AlreadyExists error handling.

The extraction of AlreadyExists handling into handleCRCreateAlreadyExists improves code organization and ensures consistent annotation updates when a CR already exists.

fault-remediation/pkg/reconciler/reconciler.go (4)

407-441: Verify error handling strategy is intentional.

The function returns different shouldCreateCR values depending on the error type:

  • Line 409: Returns true (allow creation) when GetRemediationState fails
  • Line 440: Returns false (prevent creation) when RemoveGroupFromState fails

This appears intentional (fail-open for read errors, fail-closed for write errors), but could benefit from inline comments explaining the reasoning.


170-193: LGTM: runLogCollector signature updated for controller-runtime integration.

The signature changes enable proper requeue handling via ctrl.Result and improve log collector job labeling with eventUID.


328-344: LGTM: Proper error aggregation with errors.Join.

The error handling ensures status updates are always attempted even when remediation fails, and properly aggregates multiple errors using errors.Join for comprehensive error reporting.


101-352: LGTM: Comprehensive metrics instrumentation.

Metrics are consistently recorded throughout the reconciliation flow with appropriate labels for error types, node names, and status values, providing good observability.

fault-remediation/pkg/reconciler/reconciler_test.go (3)

40-66: LGTM: Mock interfaces updated to match new signatures.

The mock implementations correctly reflect the updated interface methods with events.HealthEventData, ctrl.Result return types, and new annotation/status checker interfaces.


102-134: LGTM: Mock annotation manager correctly implements new interface.

The mock properly returns the expanded 3-tuple from GetRemediationState and uses the correct types from the annotation package.


190-1009: LGTM: Test cases comprehensively updated for new interfaces.

All test cases correctly use events.HealthEventData and events.HealthEventDoc types, mock the new return signatures, and validate the updated error handling and return patterns.

Comment on lines 171 to 182
for i := 1; i <= components.FaultRemediationReconciler.Config.UpdateMaxRetries; i++ {
	_, err = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
	if err == nil {
		break
	}

	slog.Error("Error processing event", "event", event, "error", err)

	if i < components.FaultRemediationReconciler.Config.UpdateMaxRetries {
		time.Sleep(components.FaultRemediationReconciler.Config.UpdateRetryDelay)
	}
}

⚠️ Potential issue | 🟑 Minor

Retry loop does not respect context cancellation.

The retry loop will continue even if the context is cancelled (e.g., on SIGTERM). Replace time.Sleep with a context-aware wait to allow graceful shutdown during retries.

🔧 Proposed fix
 		for i := 1; i <= components.FaultRemediationReconciler.Config.UpdateMaxRetries; i++ {
+			select {
+			case <-gCtx.Done():
+				slog.Info("Context cancelled, stopping retry loop")
+				return gCtx.Err()
+			default:
+			}
+
 			_, err = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
 			if err == nil {
 				break
 			}

 			slog.Error("Error processing event", "event", event, "error", err)

 			if i < components.FaultRemediationReconciler.Config.UpdateMaxRetries {
-				time.Sleep(components.FaultRemediationReconciler.Config.UpdateRetryDelay)
+				select {
+				case <-time.After(components.FaultRemediationReconciler.Config.UpdateRetryDelay):
+				case <-gCtx.Done():
+					slog.Info("Context cancelled during retry delay")
+					return gCtx.Err()
+				}
 			}
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-		for i := 1; i <= components.FaultRemediationReconciler.Config.UpdateMaxRetries; i++ {
-			_, err = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
-			if err == nil {
-				break
-			}
-			slog.Error("Error processing event", "event", event, "error", err)
-			if i < components.FaultRemediationReconciler.Config.UpdateMaxRetries {
-				time.Sleep(components.FaultRemediationReconciler.Config.UpdateRetryDelay)
-			}
-		}
+		for i := 1; i <= components.FaultRemediationReconciler.Config.UpdateMaxRetries; i++ {
+			select {
+			case <-gCtx.Done():
+				slog.Info("Context cancelled, stopping retry loop")
+				return gCtx.Err()
+			default:
+			}
+			_, err = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
+			if err == nil {
+				break
+			}
+			slog.Error("Error processing event", "event", event, "error", err)
+			if i < components.FaultRemediationReconciler.Config.UpdateMaxRetries {
+				select {
+				case <-time.After(components.FaultRemediationReconciler.Config.UpdateRetryDelay):
+				case <-gCtx.Done():
+					slog.Info("Context cancelled during retry delay")
+					return gCtx.Err()
+				}
+			}
+		}
🤖 Prompt for AI Agents
In @fault-remediation/main.go around lines 171 - 182, The retry loop around
components.FaultRemediationReconciler.Reconcile does not respect gCtx
cancellation because it uses time.Sleep; change the backoff to a context-aware
wait by replacing the
time.Sleep(components.FaultRemediationReconciler.Config.UpdateRetryDelay) with a
select that waits on
time.After(components.FaultRemediationReconciler.Config.UpdateRetryDelay) and on
gCtx.Done(), and if gCtx is cancelled return or break out of the loop; ensure
you check gCtx.Done() before each retry and abort retries when the context is
done so graceful shutdown on SIGTERM is honored.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)

45-60: Return unwrapped errors in retry blocks to preserve retry behavior.

Line 55 wraps the error inside a retry.OnError block. Per coding guidelines, errors should be returned without wrapping within retry blocks to preserve retry behavior, as wrapping can interfere with error type checking.

🔧 Proposed fix
 func (m *NodeAnnotationManager) patchNodeWithRetry(ctx context.Context, nodeName string, patch []byte) error {
-	return retry.OnError(retry.DefaultRetry, isRetryableError, func() error {
+	err := retry.OnError(retry.DefaultRetry, isRetryableError, func() error {
 		_, err := m.kubeClient.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{})
 		if err != nil && isRetryableError(err) {
 			slog.Warn("Retryable error patching node annotation. Retrying...",
 				"node", nodeName,
 				"error", err)
 		}
 
-		if err != nil {
-			return fmt.Errorf("failed to patch node %s: %w", nodeName, err)
-		}
-
-		return nil
+		return err
 	})
+	if err != nil {
+		return fmt.Errorf("failed to patch node %s: %w", nodeName, err)
+	}
+	return nil
 }

Based on coding guidelines.
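To make the concern concrete, the sketch below shows how wrapping inside the retried function can defeat a retry predicate that compares errors by identity. Everything here is hypothetical: `retryOnError` is a simplified stand-in rather than the client-go helper, and whether the real `isRetryableError` is affected depends on how it inspects errors. A predicate built on `errors.Is` or `errors.As` would still see through `%w`.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict is a hypothetical sentinel standing in for a retryable API error.
var errConflict = errors.New("conflict")

// retryOnError is a simplified, hypothetical retry helper: it re-runs fn
// while retryable(err) is true, up to attempts times.
func retryOnError(attempts int, retryable func(error) bool, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil || !retryable(err) {
			return err
		}
	}
	return err
}

// strictMatch compares by identity, so it does not see through wrapping.
func strictMatch(err error) bool { return err == errConflict }

// countCalls reports how many times the retried function ran.
func countCalls(wrap bool) int {
	calls := 0
	_ = retryOnError(3, strictMatch, func() error {
		calls++
		if wrap {
			// Wrapping here hides the sentinel from an identity comparison.
			return fmt.Errorf("failed to patch node: %w", errConflict)
		}
		return errConflict
	})
	return calls
}

func main() {
	fmt.Println("unwrapped attempts:", countCalls(false)) // retried until attempts run out
	fmt.Println("wrapped attempts:", countCalls(true))    // predicate stops matching after one call
}
```

Wrapping once, outside the retry block, keeps the context in the final error without affecting the predicate.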

fault-remediation/pkg/remediation/deprecated_remediation_test.go (1)

225-235: Correct template filename capitalization to match actual file.

The code references "rebootnode-Template.yaml" (capital T), but the actual template file is rebootnode-template.yaml (lowercase). Update line 227 to use the correct lowercase filename.

🤖 Fix all issues with AI agents
In @fault-remediation/pkg/annotation/annotation.go:
- Around line 66-73: The code after the GetRemediationState error return is
dead: remove the unreachable initialization of state (the
RemediationStateAnnotation with EquivalenceGroups) that appears after "return
err"; if your intent was to recover and use a default state instead of returning
an error, assign state = &RemediationStateAnnotation{EquivalenceGroups:
make(map[string]EquivalenceGroupState)} before returning and change the "return
err" to "return nil" (and update callers accordingly); reference symbols:
m.GetRemediationState, state, RemediationStateAnnotation, EquivalenceGroups,
EquivalenceGroupState.

In @fault-remediation/pkg/crstatus/deprecated_checker.go:
- Around line 77-91: In CRStatusChecker.checkCondition, the skip/allow logic is
inverted: when status or conditions are missing you should treat the resource as
non-terminal (do not skip) and when a conditionStatus is terminal you should
skip. Change the two early returns that currently return true to return false
(for the status/conditions-not-found or error cases), and change the final
return from !c.isTerminal(conditionStatus) to c.isTerminal(conditionStatus);
this uses the existing methods findConditionStatus and isTerminal to determine
terminality correctly.

In @fault-remediation/pkg/reconciler/reconciler_test.go:
- Around line 48-50: The local interface named CRStatusCheckerInterface is
unused and duplicates crstatus.CRStatusCheckerInterface used by the mock; remove
the local type definition of CRStatusCheckerInterface from the test file and
ensure all references (e.g., the mock that currently expects
crstatus.CRStatusCheckerInterface) use the package-scoped
crstatus.CRStatusCheckerInterface, updating imports if necessary so the test
only relies on crstatus.CRStatusCheckerInterface.

In @fault-remediation/pkg/remediation/deprecated_remediation.go:
- Around line 266-293: The code performs two redundant annotation updates: first
calling c.AnnotationManager.UpdateRemediationState with crName, then again with
actualCRName after obtaining actualCRName := createdCR.GetName(); remove the
first update block (the if that checks group != "" and calls
UpdateRemediationState with crName) and the redundant reassignment of group
before the second update, leaving only the single UpdateRemediationState call
that uses actualCRName (keep references to common.GetRemediationGroupForAction,
createdCR.GetName, and c.AnnotationManager.UpdateRemediationState) so the node
annotation is set once with the real CR name and the function still returns
actualCRName, nil.

In @fault-remediation/pkg/remediation/remediation_test.go:
- Around line 56-58: The test calls NewCtrlRuntimeRemediationClient with a
TemplateData that sets TemplateFileName to "rebootnode-Template.yaml" which
mismatches the actual file name; change the TemplateData.TemplateFileName value
to "rebootnode-template.yaml" (lowercase "template") so the test loads the
correct template file.

In @fault-remediation/pkg/remediation/remediation.go:
- Around line 356-376: The nil-check logic incorrectly skips metrics when
job.Annotations is nil; change the guard from "job.Annotations != nil &&
job.Annotations[jobMetricsAlreadyCountedAnnotation] != trueStringVal" to
"job.Annotations == nil || job.Annotations[jobMetricsAlreadyCountedAnnotation]
!= trueStringVal" so freshly-created jobs without annotations still enter the
block (you can keep the existing updateJob.Annotations initialization). Apply
this same fix in the analogous checks inside checkLogCollectorFailed and
checkLogCollectorTimedOut so they use the same "annotations == nil ||
annotations[key] != trueStringVal" condition.
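The last item in the list above comes down to a one-operator difference in the guard. A minimal sketch with hypothetical names (`countedKey`, `buggyShouldRecord`, `shouldRecord`) shows the two behaviors side by side:

```go
package main

import "fmt"

// countedKey is a hypothetical annotation key used only for this example.
const countedKey = "example.com/metrics-counted"

// buggyShouldRecord mirrors the guard described above: it returns false when
// the annotations map is nil, so a job without annotations is never counted.
func buggyShouldRecord(annotations map[string]string) bool {
	return annotations != nil && annotations[countedKey] != "true"
}

// shouldRecord returns true unless the marker annotation is already set.
// Reading from a nil map is safe in Go and yields the zero value, so the
// explicit nil check is kept only to mirror the wording of the fix.
func shouldRecord(annotations map[string]string) bool {
	return annotations == nil || annotations[countedKey] != "true"
}

func main() {
	var fresh map[string]string // a freshly created job with no annotations
	counted := map[string]string{countedKey: "true"}

	fmt.Println(buggyShouldRecord(fresh), shouldRecord(fresh))
	fmt.Println(buggyShouldRecord(counted), shouldRecord(counted))
}
```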
🧹 Nitpick comments (21)
commons/pkg/statemanager/statemanager_test.go (1)

508-508: Use consistent context creation.

This test uses context.TODO() while all other test functions in this file use context.Background() (lines 270, 282, 310, 341, 370, 399, 441). Maintain consistency across the test suite.

♻️ Proposed fix
-		ctx := context.TODO()
+		ctx := context.Background()
commons/pkg/statemanager/statemanager.go (1)

387-387: Prefer explicit nil return for clarity.

At this point in the code, err is guaranteed to be nil (any non-nil error would have triggered an early return at line 374). For clarity and explicitness, consider returning true, nil directly.

♻️ Proposed fix
-	return true, err
+	return true, nil
fault-remediation/pkg/remediation/deprecated_remediation.go (1)

506-517: Consider returning errors instead of swallowing them for proper retry handling.

Lines 506-517 swallow errors from log collector job timeout and failure by returning ctrl.Result{}, nil. While the comments indicate this is intentional to allow reconciliation to continue, completely swallowing errors prevents proper retry/backoff behavior and can mask issues.

Consider returning the error with a requeue request:

return ctrl.Result{Requeue: true, RequeueAfter: 5*time.Minute}, fmt.Errorf(...)

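This allows controller-runtime to apply exponential backoff and proper error tracking while still continuing reconciliation. Note that controller-runtime ignores the returned Result when the error is non-nil, so the RequeueAfter value above would not take effect; returning the error alone is enough to trigger a backoff-based requeue.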

fault-remediation/pkg/crstatus/crstatus_test.go (1)

120-212: Eliminate test duplication with a parameterized helper.

The entire test function TestCheckConditionCtrlRuntime duplicates TestCheckCondition with only the checker initialization differing. This violates DRY and creates a maintenance burden.

♻️ Refactor to use table-driven test with multiple checkers
-func TestCheckConditionCtrlRuntime(t *testing.T) {
-	cfg := &config.MaintenanceResource{
-		CompleteConditionType: "Completed",
-	}
-	checker := NewCtrlRuntimeCRStatusChecker(nil, cfg, false)
-
-	tests := []struct {
-		name     string
-		cr       *unstructured.Unstructured
-		expected bool
-	}{
-		// ... duplicate test cases ...
-	}
-
-	for _, tt := range tests {
-		t.Run(tt.name, func(t *testing.T) {
-			result := checker.checkCondition(tt.cr)
-			assert.Equal(t, tt.expected, result)
-		})
-	}
-}
+func TestCheckCondition(t *testing.T) {
+	cfg := &config.MaintenanceResource{
+		CompleteConditionType: "Completed",
+	}
+
+	checkers := []struct {
+		name    string
+		checker interface{ checkCondition(*unstructured.Unstructured) bool }
+	}{
+		{
+			name:    "K8s",
+			checker: NewCRStatusChecker(nil, nil, cfg, false),
+		},
+		{
+			name:    "CtrlRuntime",
+			checker: NewCtrlRuntimeCRStatusChecker(nil, cfg, false),
+		},
+	}
+
+	tests := []struct {
+		name     string
+		cr       *unstructured.Unstructured
+		expected bool
+	}{
+		// ... single set of test cases ...
+	}
+
+	for _, c := range checkers {
+		t.Run(c.name, func(t *testing.T) {
+			for _, tt := range tests {
+				t.Run(tt.name, func(t *testing.T) {
+					result := c.checker.checkCondition(tt.cr)
+					assert.Equal(t, tt.expected, result)
+				})
+			}
+		})
+	}
+}

Based on coding guidelines and learnings: use table-driven tests when testing multiple scenarios in Go.

fault-remediation/pkg/remediation/remediation_test.go (1)

114-126: Make test name more descriptive.

The test name "dry run" is less descriptive than recommended. Consider following the pattern from the past review comment suggestion.

📝 Suggested improvement
 		{
-			name:              "dry run",
+			name:              "Successful rebootnode creation - dry run",
 			nodeName:          "test-node-1",
 			dryRun:            true,
 			recommendedAction: protos.RecommendedAction_RESTART_BM,

Based on coding guidelines: Go test names follow the format TestFunctionName_Scenario_ExpectedBehavior.

fault-remediation/pkg/initializer/init.go (2)

89-115: Extract TemplateData construction to reduce duplication.

The TemplateData construction (lines 92-96 and 105-109) is duplicated in both initialization branches. This creates a maintenance burden if the template configuration structure changes.

♻️ Extract template data construction
+	templateData := remediation.TemplateData{
+		TemplateMountPath:   tomlConfig.Template.MountPath,
+		TemplateFileName:    tomlConfig.Template.FileName,
+		MaintenanceResource: tomlConfig.MaintenanceResource,
+	}
+
 	if params.UseCtrlRuntime {
 		remediationClient, err = remediation.NewCtrlRuntimeRemediationClient(
 			ctrlruntimeClient,
-			params.DryRun, remediation.TemplateData{
-				TemplateMountPath:   tomlConfig.Template.MountPath,
-				TemplateFileName:    tomlConfig.Template.FileName,
-				MaintenanceResource: tomlConfig.MaintenanceResource,
-			})
+			params.DryRun,
+			templateData)
 		if err != nil {
 			return nil, fmt.Errorf("error while initializing ctrl runtime client: %w", err)
 		}
 		stateManager = statemanager.NewCtrlRuntimeStateManager(ctrlruntimeClient)
 	} else {
 		remediationClient, clientSet, err = remediation.NewK8sClient(
 			params.KubeconfigPath,
 			params.DryRun,
-			remediation.TemplateData{
-				TemplateMountPath:   tomlConfig.Template.MountPath,
-				TemplateFileName:    tomlConfig.Template.FileName,
-				MaintenanceResource: tomlConfig.MaintenanceResource,
-			},
+			templateData,
 		)

117-117: Make log message more informative about initialization mode.

The generic log message doesn't indicate which client type was initialized, making it harder to debug initialization issues.

📝 Proposed improvement
-	slog.Info("Successfully initialized client")
+	if params.UseCtrlRuntime {
+		slog.Info("Successfully initialized ctrl-runtime remediation client")
+	} else {
+		slog.Info("Successfully initialized Kubernetes remediation client")
+	}
fault-remediation/pkg/remediation/remediation.go (1)

357-367: Error from annotation update causes early return without recording metrics.

If c.client.Update(ctx, updateJob) fails at line 364, the function returns false, err without recording the success metric. This means a successful job completion may not be tracked if the annotation update fails. Consider recording metrics before attempting the annotation update.

♻️ Proposed refactor
 	if completeCondition != nil && completeCondition.Status == metav1.ConditionTrue {
 		slog.Info("Log collector job completed successfully", "job", job.Name)
-		// Use job's actual duration instead of custom tracking
-		// reconciliation can be called multiple times so use annotation to make sure we're not duplicate recording metrics
-		if job.Annotations != nil && job.Annotations[jobMetricsAlreadyCountedAnnotation] != trueStringVal {
+		if job.Annotations == nil || job.Annotations[jobMetricsAlreadyCountedAnnotation] != trueStringVal {
+			duration := job.Status.CompletionTime.Sub(job.Status.StartTime.Time).Seconds()
+			metrics.LogCollectorJobs.WithLabelValues(nodeName, "success").Inc()
+			metrics.LogCollectorJobDuration.WithLabelValues(nodeName, "success").Observe(duration)
+
 			updateJob := job.DeepCopy()
 			if updateJob.Annotations == nil {
 				updateJob.Annotations = map[string]string{}
 			}
-
 			updateJob.Annotations[jobMetricsAlreadyCountedAnnotation] = trueStringVal
-
-			err := c.client.Update(ctx, updateJob)
-			if err != nil {
-				return false, err
+			if err := c.client.Update(ctx, updateJob); err != nil {
+				slog.Warn("Failed to mark job metrics as recorded", "job", job.Name, "error", err)
+				// Continue - metrics already recorded
 			}
-
-			duration := job.Status.CompletionTime.Sub(job.Status.StartTime.Time).Seconds()
-
-			metrics.LogCollectorJobs.WithLabelValues(nodeName, "success").Inc()
-			metrics.LogCollectorJobDuration.WithLabelValues(nodeName, "success").Observe(duration)
 		}
-
 		return true, nil
 	}
fault-remediation/pkg/remediation/deprecated_remediation_test.go (1)

17-40: Import grouping does not follow Go conventions.

The events import at line 19 is placed between standard library imports and third-party imports. Go convention is to group imports: standard library, then external packages, then internal packages.

♻️ Proposed fix
 import (
 	"context"
-	"github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
 	"testing"
 	"text/template"

 	"github.com/google/uuid"
 	"github.com/stretchr/testify/assert"
 	corev1 "k8s.io/api/core/v1"
 	metameta "k8s.io/apimachinery/pkg/api/meta"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
 	"k8s.io/apimachinery/pkg/runtime/schema"
 	"k8s.io/client-go/discovery"
 	"k8s.io/client-go/discovery/cached/memory"
 	"k8s.io/client-go/dynamic"
 	"k8s.io/client-go/kubernetes/fake"
 	"k8s.io/client-go/rest"
 	"k8s.io/client-go/restmapper"

 	"github.com/nvidia/nvsentinel/data-models/pkg/model"
 	"github.com/nvidia/nvsentinel/data-models/pkg/protos"
 	"github.com/nvidia/nvsentinel/fault-remediation/pkg/config"
+	"github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
 )
fault-remediation/pkg/reconciler/reconciler_test.go (1)

17-38: Import grouping does not follow Go conventions.

Imports are mixed: errors and testing/time are separated by internal package imports. Group standard library imports together, followed by external packages, then internal packages.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

17-54: Import grouping does not follow Go conventions.

Imports are scattered with blank lines in unexpected places (lines 19, 23). Standard library, external, and internal packages should be in separate groups.


168-179: Consider consolidating package-level test variables.

Multiple package-level variables are declared across lines 168-179. Consider grouping related variables or using a test fixture struct to improve readability and make test setup/teardown clearer.

fault-remediation/pkg/reconciler/reconciler.go (9)

183-190: Consider error wrapping for context.

The error from RunLogCollectorJob is returned without additional context. As per coding guidelines, errors should be wrapped using fmt.Errorf("context: %w", err) to provide clarity about where the failure occurred.

📝 Proposed error wrapping
 	result, err := r.Config.RemediationClient.RunLogCollectorJob(ctx, healthEvent.NodeName, eventUID)
 	if err != nil {
 		slog.Error("Log collector job failed for node",
 			"node", healthEvent.NodeName,
 			"error", err)
 
-		return ctrl.Result{}, err
+		return ctrl.Result{}, fmt.Errorf("log collector job failed for node %s: %w", healthEvent.NodeName, err)
 	}

203-211: Add error context for label update failure.

The error from UpdateNVSentinelStateNodeLabel is returned without wrapping. Adding context will make debugging easier when this operation fails during the "remediating" state update.

📝 Proposed error wrapping
 	_, err := r.Config.StateManager.UpdateNVSentinelStateNodeLabel(ctx,
 		healthEventWithStatus.HealthEvent.NodeName,
 		statemanager.RemediatingLabelValue, false)
 	if err != nil {
 		slog.Error("Error updating node label to remediating", "error", err)
 		metrics.ProcessingErrors.WithLabelValues("label_update_error", nodeName).Inc()
 
-		return "", err
+		return "", fmt.Errorf("failed to update node label to remediating for node %s: %w", nodeName, err)
 	}

220-244: Good error handling pattern but consider adding more context.

The deferred state update pattern (attempting to set label to "failed" even when CR creation fails) is solid and aligns with the PR objective to handle errors properly. However, the returned errors lack context.

📝 Enhanced error wrapping for clarity
 	_, err = r.Config.StateManager.UpdateNVSentinelStateNodeLabel(ctx,
 		healthEventWithStatus.HealthEvent.NodeName,
 		remediationLabelValue, false)
 	if err != nil {
 		slog.Error("Error updating node label",
 			"label", remediationLabelValue,
 			"error", err)
 		metrics.ProcessingErrors.WithLabelValues("label_update_error", nodeName).Inc()
 
-		return "", errors.Join(createMaintenanceResourceError, err)
+		labelErr := fmt.Errorf("failed to update node label to %s for node %s: %w", remediationLabelValue, nodeName, err)
+		return "", errors.Join(createMaintenanceResourceError, labelErr)
 	}
 
 	if createMaintenanceResourceError != nil {
-		return "", createMaintenanceResourceError
+		return "", fmt.Errorf("failed to create maintenance resource for node %s: %w", nodeName, createMaintenanceResourceError)
 	}

259-264: Add error context for state clearing failure.

The error from ClearRemediationState should be wrapped with context per coding guidelines to aid debugging.

📝 Proposed error wrapping
 	if err := r.annotationManager.ClearRemediationState(ctx, nodeName); err != nil {
 		slog.Error("Failed to clear remediation state for node",
 			"node", nodeName,
 			"error", err)
 
-		return ctrl.Result{}, err
+		return ctrl.Result{}, fmt.Errorf("failed to clear remediation state for node %s: %w", nodeName, err)
 	}

267-272: Add error context for mark processed failure.

Per coding guidelines, wrap the error with context to clarify which operation failed.

📝 Proposed error wrapping
 	if err := watcherInstance.MarkProcessed(context.Background(), resumeToken); err != nil {
 		metrics.ProcessingErrors.WithLabelValues("mark_processed_error", nodeName).Inc()
 		slog.Error("Error updating resume token", "error", err)
 
-		return ctrl.Result{}, err
+		return ctrl.Result{}, fmt.Errorf("failed to mark cancellation event as processed for node %s: %w", nodeName, err)
 	}

328-344: Good error propagation for retry behavior.

The handling correctly propagates errors from runLogCollector and performRemediation to trigger retries, and uses errors.Join to combine multiple failures. The pattern of updating remediation status even when performRemediation fails is solid.

However, for consistency with coding guidelines, consider adding context to the final error returns to clarify which operation failed in the remediation flow.


291-296: Pattern: Multiple MarkProcessed errors lack context.

These four locations return errors from MarkProcessed or checkExistingCRStatus without wrapping. While the error propagation is correct for retry behavior, adding context would improve debuggability.

Consider wrapping these errors with context about the operation and node name, similar to the pattern suggested in other review comments.

Also applies to: 302-307, 316-321, 348-353


364-367: Add error context for document ID extraction.

For consistency with the error handling at line 385 and coding guidelines, wrap this error with context.

πŸ“ Proposed error wrapping
 	documentID, err := utils.ExtractDocumentID(eventWithToken.Event)
 	if err != nil {
-		return err
+		return fmt.Errorf("failed to extract document ID from event: %w", err)
 	}

406-410: Add error context for annotation operations.

Both GetRemediationState and RemoveGroupFromState errors should be wrapped with context per coding guidelines.

πŸ“ Proposed error wrapping
 	state, _, err := r.annotationManager.GetRemediationState(ctx, nodeName)
 	if err != nil {
 		slog.Error("Error getting remediation state", "node", nodeName, "error", err)
-		return true, "", err
+		return true, "", fmt.Errorf("failed to get remediation state for node %s: %w", nodeName, err)
 	}

 	// ... later ...

 	if err = r.annotationManager.RemoveGroupFromState(ctx, nodeName, group); err != nil {
 		slog.Error("Failed to remove CR from annotation", "error", err)
-		return false, "", err
+		return false, "", fmt.Errorf("failed to remove group %s from remediation state for node %s: %w", group, nodeName, err)
 	}

Also applies to: 438-441

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8788f69 and 1b3ff7f.

⛔ Files ignored due to path filters (1)
  • commons/go.sum is excluded by !**/*.sum
📒 Files selected for processing (26)
  • .gitignore
  • commons/go.mod
  • commons/pkg/statemanager/statemanager.go
  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/crstatus/deprecated_checker.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
💤 Files with no reviewable changes (1)
  • fault-remediation/pkg/reconciler/fault_remediation_client_interface.go
✅ Files skipped from review due to trivial changes (1)
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • commons/go.mod
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/initializer/init.go
  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/deprecated_annotation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/crstatus/deprecated_checker.go
  • commons/pkg/statemanager/statemanager.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧠 Learnings (23)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_interface.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/metrics/metrics.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/metrics/metrics.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • commons/pkg/statemanager/statemanager.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • commons/pkg/statemanager/statemanager_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/deprecated_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/main.go
  • commons/pkg/statemanager/statemanager.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use separate informers for different Kubernetes resource types

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • fault-remediation/pkg/remediation/deprecated_remediation.go
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧬 Code graph analysis (13)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (3)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (9-11)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/initializer/init.go (3)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
  • FaultRemediationClientInterface (29-34)
  • TemplateData (37-44)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewCtrlRuntimeRemediationClient (47-95)
fault-remediation/pkg/config/config.go (2)
  • Template (27-30)
  • MaintenanceResource (18-24)
fault-remediation/pkg/crstatus/crstatus_test.go (2)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/crstatus/checker.go (1)
  • NewCtrlRuntimeCRStatusChecker (34-44)
fault-remediation/main.go (4)
fault-remediation/pkg/initializer/init.go (2)
  • InitializationParams (38-44)
  • InitializeAll (51-168)
fault-remediation/pkg/reconciler/reconciler.go (1)
  • FaultRemediationReconciler (61-69)
commons/pkg/auditlogger/roundtripper.go (1)
  • NewAuditingRoundTripper (42-47)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (41-43)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/remediation/remediation_test.go (5)
fault-remediation/pkg/remediation/remediation.go (2)
  • NewCtrlRuntimeRemediationClient (47-95)
  • CtrlRuntimeRemediationClient (38-45)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • TemplateData (37-44)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_RESTART_BM (48-48)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (12-15)
fault-remediation/pkg/remediation/deprecated_remediation.go (11)
fault-remediation/pkg/config/config.go (1)
  • Template (27-30)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • TemplateData (37-44)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/crstatus/deprecated_checker.go (2)
  • CRStatusChecker (29-34)
  • NewCRStatusChecker (36-48)
fault-remediation/pkg/annotation/deprecated_annotation.go (1)
  • NewNodeAnnotationManager (38-42)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (9-11)
fault-remediation/pkg/reconciler/reconciler_test.go (1)
  • CRStatusCheckerInterface (48-50)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (12-15)
platform-connectors/pkg/pipeline/factory.go (1)
  • Create (32-39)
fault-remediation/pkg/common/equivalence_groups.go (1)
  • GetRemediationGroupForAction (35-45)
fault-remediation/pkg/metrics/metrics.go (3)
  • LogCollectorErrors (86-92)
  • LogCollectorJobs (71-77)
  • LogCollectorJobDuration (78-85)
fault-remediation/pkg/crstatus/deprecated_checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/remediation/deprecated_remediation_test.go (4)
fault-remediation/pkg/config/config.go (1)
  • Template (27-30)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • TemplateData (37-44)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (12-15)
fault-remediation/pkg/remediation/deprecated_remediation.go (1)
  • FaultRemediationClient (62-74)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (18-24)
fault-remediation/pkg/reconciler/reconciler_test.go (4)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventData (12-15)
  • HealthEventDoc (6-9)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • NodeAnnotationManagerInterface (16-21)
  • RemediationStateAnnotation (24-26)
  • EquivalenceGroupState (29-32)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (9-11)
data-models/pkg/protos/health_event.pb.go (2)
  • RecommendedAction_RESTART_BM (48-48)
  • RecommendedAction_NONE (43-43)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (4)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewCtrlRuntimeRemediationClient (47-95)
commons/pkg/statemanager/statemanager.go (4)
  • NewStateManager (209-213)
  • StateManager (200-203)
  • DrainSucceededLabelValue (170-170)
  • RemediatingLabelValue (174-174)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • AnnotationKey (12-12)
fault-remediation/pkg/metrics/metrics.go (6)
  • TotalEventsReceived (33-38)
  • EventHandlingDuration (62-68)
  • EventsProcessed (39-45)
  • CRStatusCreated (25-25)
  • CRStatusSkipped (26-26)
  • ProcessingErrors (46-52)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (38)
commons/pkg/statemanager/statemanager_test.go (1)

269-549: Comprehensive test coverage for controller-runtime state manager.

These tests provide excellent coverage of the controller-runtime state manager implementation, mirroring the existing client-go tests and ensuring both implementations behave consistently. The table-driven approach for state transition validation is particularly well-structured.

commons/pkg/statemanager/statemanager.go (2)

298-306: Clean constructor pattern for controller-runtime state manager.

The type declaration and constructor follow established patterns and properly return the StateManager interface, enabling flexible client selection.


310-388: Well-structured controller-runtime implementation.

The implementation properly mirrors the client-go version while adapting to controller-runtime patterns. The error handling, logging, and validation logic are consistent with the existing state manager. As discussed in previous reviews, relying on controller-runtime's reconciliation retry mechanism is an appropriate design choice for this context.

fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)

15-15: YAMLlint error is a false positive for Go template syntax.

The yamllint syntax error on line 15 is expected because {{.ApiGroup}}/{{.Version}} is Go template syntax that will be rendered at runtime. This is not a valid concern.
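
To illustrate why the raw file fails YAML linting but the rendered output does not, here is a minimal sketch using `text/template`. Only the `{{.ApiGroup}}/{{.Version}}` expression comes from the template under review; the shortened manifest, the `kind` value, the `templateData` type, and the sample values are assumptions for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// crTemplate is a shortened, hypothetical manifest; only the apiVersion
// expression mirrors the line quoted in the review.
const crTemplate = "apiVersion: {{.ApiGroup}}/{{.Version}}\nkind: RebootNode\n"

// templateData holds only the two fields referenced by the template above.
type templateData struct {
	ApiGroup string
	Version  string
}

// render parses and executes the template, returning the rendered YAML text.
func render(data templateData) (string, error) {
	tmpl, err := template.New("cr").Parse(crTemplate)
	if err != nil {
		return "", fmt.Errorf("failed to parse template: %w", err)
	}

	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("failed to render template: %w", err)
	}

	return buf.String(), nil
}

func main() {
	out, err := render(templateData{ApiGroup: "example.com", Version: "v1alpha1"})
	if err != nil {
		panic(err)
	}
	// After rendering, the apiVersion line is an ordinary YAML scalar.
	fmt.Print(out)
}
```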

fault-remediation/pkg/annotation/deprecated_annotation.go (1)

83-125: LGTM: GetRemediationState signature update is consistent.

The updated signature correctly returns the node alongside the state, and all return paths properly include the node value. Callers appropriately handle the extra return value.

fault-remediation/pkg/remediation/deprecated_remediation.go (2)

254-316: LGTM: AlreadyExists handling is well-structured.

The new handleCRCreateAlreadyExists helper properly handles the case where a CR already exists by updating the node annotation and logging appropriately. The separation into a helper method improves readability.


379-385: No issue found - constants are properly defined in the same package.

The constants logCollectorNodeLabel and logCollectorEventLabel are defined in remediation.go (lines 32-33) within the same package. In Go, package-level constants defined in one file are accessible from all other files in the same package without explicit imports. This code will compile successfully.

Likely an incorrect or invalid review comment.

fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)

29-34: LGTM: Clean interface design.

The FaultRemediationClientInterface provides a well-defined contract with appropriate method signatures for remediation operations, annotation management, and status checking.


36-44: LGTM: Well-structured TemplateData definition.

The TemplateData struct properly embeds config.MaintenanceResource and includes all necessary fields for template rendering and health event tracking.

fault-remediation/pkg/crstatus/crstatus_interface.go (1)

1-11: LGTM! Package documentation and interface are well-defined.

The package-level godoc has been added as requested in the previous review, and the interface follows Go conventions with a clear, focused contract.

fault-remediation/pkg/metrics/metrics.go (1)

15-93: LGTM! Metric exports and package reorganization are correct.

The package rename and metric variable exports follow Go conventions. The TODO comment at line 30 was already flagged in a previous review and is outside the scope of these changes.

fault-remediation/main.go (3)

122-136: LGTM!

The non-controller-runtime setup properly initializes components, sets up deferred cleanup, and handles the metrics server and event processing. The structure is clean and follows the expected patterns.


228-239: LGTM!

The controller-runtime setup correctly initializes components with the manager's client and properly defers cleanup. The wiring with mgr.GetClient() aligns with the controller-runtime pattern.


194-197: The current code is correct. rest.Config.Wrap() is a convenience method that modifies the config's transport wrapping in-place by composing wrappers onto WrapTransport. It does not return a value requiring reassignment. The approach used here is the idiomatic way to add HTTP transport wrappers in client-go, and it properly stacks multiple wrappers when needed.

fault-remediation/pkg/annotation/annotation.go (3)

15-25: LGTM!

The struct and constructor follow Go conventions and properly initialize the manager with the controller-runtime client.


106-131: LGTM!

ClearRemediationState properly handles the nil annotations case and uses the merge-from patch pattern correctly.


133-168: LGTM!

RemoveGroupFromState correctly handles the case where no groups remain by delegating to ClearRemediationState, and properly propagates errors.

fault-remediation/pkg/remediation/remediation.go (3)

46-95: LGTM!

The constructor properly validates template existence, parses the template, initializes dry-run mode, and sets up annotation manager and status checker. Error handling is thorough with context-wrapped errors.


105-200: LGTM!

CreateMaintenanceResource properly handles dry-run mode, creates owner references for garbage collection, handles IsAlreadyExists errors gracefully, and updates node annotation with the actual CR name. The previous duplicate annotation update issue has been addressed.


246-309: LGTM!

launchLogCollectorJob properly reads the manifest, sets labels for deduplication, handles the case of multiple existing jobs, and requeues after creation to check status later.

fault-remediation/pkg/remediation/deprecated_remediation_test.go (2)

347-369: LGTM!

Test correctly updated to use events.HealthEventData and handles the new (string, error) return signature from CreateMaintenanceResource.


396-418: LGTM!

Tests properly updated to use eventId parameter and handle the (ctrl.Result, error) return type from RunLogCollectorJob.

fault-remediation/pkg/crstatus/checker.go (3)

28-44: LGTM!

The struct and constructor properly initialize the status checker with the controller-runtime client and configuration.


58-69: LGTM!

The ObjectKey now correctly includes both Name and Namespace from the config, addressing the previous review concern about missing namespace for namespaced CRs.


71-105: LGTM!

The condition checking logic correctly traverses the unstructured status map, finds the configured condition type, and determines if the status is terminal.
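
A rough sketch of this kind of traversal over an unstructured object, using plain maps rather than the controller-runtime types. The `isTerminal` name and the rule that `"True"` or `"False"` counts as terminal are assumptions for illustration, not the checker's actual logic.

```go
package main

import "fmt"

// isTerminal reports whether obj has a condition of the given type whose
// status is "True" or "False". Treating those two values as terminal is an
// assumption made for this sketch.
func isTerminal(obj map[string]any, condType string) bool {
	status, ok := obj["status"].(map[string]any)
	if !ok {
		return false
	}

	conditions, ok := status["conditions"].([]any)
	if !ok {
		return false
	}

	for _, c := range conditions {
		cond, ok := c.(map[string]any)
		if !ok {
			continue
		}
		if t, _ := cond["type"].(string); t != condType {
			continue
		}
		// First condition of the configured type decides the result.
		s, _ := cond["status"].(string)
		return s == "True" || s == "False"
	}

	return false
}

func main() {
	obj := map[string]any{
		"status": map[string]any{
			"conditions": []any{
				map[string]any{"type": "Ready", "status": "Unknown"},
				map[string]any{"type": "Completed", "status": "True"},
			},
		},
	}
	fmt.Println(isTerminal(obj, "Completed"), isTerminal(obj, "Ready"))
}
```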

fault-remediation/pkg/reconciler/reconciler_test.go (3)

40-66: LGTM!

MockK8sClient properly updated with new method signatures using events.HealthEventData, ctrl.Result, and the correct interface types for annotation manager and status checker.


102-134: LGTM!

MockNodeAnnotationManager properly implements the updated interface with the new GetRemediationState signature returning (*annotation.RemediationStateAnnotation, *corev1.Node, error).


601-666: LGTM!

TestRunLogCollectorJobErrorScenarios is a well-structured table-driven test that covers success, failure, and requeue scenarios with proper assertions on the ctrl.Result return type.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (4)

208-217: Manager created but not started before using GetClient.

The manager is created at lines 208-213 and GetClient() is called at line 217, but the manager isn't started until the goroutine at lines 250-254. While controller-runtime allows this pattern, be aware that the client's cache won't be synced until the manager starts, which could cause timing issues in tests.


302-315: LGTM!

createTestRemediationClient properly uses remediation.NewCtrlRuntimeRemediationClient with the shared controller-runtime client and returns the interface type. Template data configuration is correct.


880-888: LGTM!

Metrics assertions properly use the dedicated metrics package symbols (metrics.TotalEventsReceived, metrics.EventsProcessed, etc.) and verify the expected behavior for CR creation and deduplication.


1312-1328: LGTM!

cleanupNodeAnnotations helper properly uses annotation.AnnotationKey for cleanup, maintaining consistency with the rest of the codebase.

fault-remediation/pkg/reconciler/reconciler.go (6)

25-28: LGTM: Clean interface-based refactoring.

The addition of new internal packages and the shift to interface-based dependencies (RemediationClient, annotationManager) improves testability and modularity. The exposed Config field and initialization flow are consistent.

Also applies to: 51-51, 66-67, 83-84


92-127: LGTM: Proper event parsing and routing.

The Reconcile method correctly records metrics, handles parse errors by marking them processed (preventing infinite retries on malformed events), and routes to appropriate handlers based on node quarantine status.


156-165: Verify: Should label update errors trigger retry?

The error from UpdateNVSentinelStateNodeLabel is logged and recorded in metrics but not propagated. Given the PR's goal to "throw errors to trigger retries", consider whether label update failures should prevent event processing and trigger a retry.

If the label update is critical for cluster state consistency, propagating the error might be appropriate. The current implementation could lead to events being skipped even when state updates fail.

Based on the PR objectives and the importance of label consistency, should this error be propagated to the caller?


383-386: LGTM: Proper error wrapping with context.

This correctly follows the coding guideline to wrap errors with context using fmt.Errorf("context: %w", err).


448-498: LGTM: Robust error handling with proper classification.

The method correctly:

  • Wraps errors with context (line 491)
  • Records metrics with appropriate error labels based on failure type
  • Marks malformed events as processed to prevent queue blocking
  • Returns structured HealthEventDoc type
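The classification described in this list can be sketched with the standard library alone. This is a minimal illustration, not the reconciler's actual code: `errMalformedEvent`, `healthEvent`, `parseEvent`, and `shouldRetry` are hypothetical names, and the real parser works on datastore events rather than raw JSON bytes.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// errMalformedEvent is a hypothetical sentinel for events that can never be parsed.
var errMalformedEvent = errors.New("malformed event")

// healthEvent is a hypothetical, trimmed-down stand-in for the real event document.
type healthEvent struct {
	NodeName string `json:"nodeName"`
}

// parseEvent decodes raw bytes and wraps every failure with context via %w,
// so callers can still detect the sentinel with errors.Is.
func parseEvent(raw []byte) (healthEvent, error) {
	var ev healthEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return healthEvent{}, fmt.Errorf("parsing health event (%v): %w", err, errMalformedEvent)
	}
	if ev.NodeName == "" {
		return healthEvent{}, fmt.Errorf("health event has no nodeName: %w", errMalformedEvent)
	}
	return ev, nil
}

// shouldRetry reports whether a failed event is worth requeueing.
// Malformed events are marked processed instead, so they cannot block the queue.
func shouldRetry(err error) bool {
	return err != nil && !errors.Is(err, errMalformedEvent)
}

func main() {
	_, err := parseEvent([]byte(`{not json`))
	fmt.Println("error:", err)
	fmt.Println("retry malformed:", shouldRetry(err))
	fmt.Println("retry transient:", shouldRetry(errors.New("connection refused")))
}
```

Keeping the sentinel at the end of the wrap chain lets the caller choose between "mark processed" and "return the error for a retry" without string matching.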

503-584: LGTM: Clean utility methods.

The utility methods properly handle:

  • Watcher lifecycle management
  • Error aggregation in CloseAll using errors.Join
  • Controller-runtime integration with proper channel adaptation
  • Goroutine cleanup on context cancellation or channel closure
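For readers unfamiliar with the `errors.Join` aggregation mentioned in this list, here is a small sketch of the pattern, assuming Go 1.20 or later. The `closer` interface, `fakeCloser`, and `closeAll` are hypothetical stand-ins and do not reproduce the reconciler's `CloseAll`.

```go
package main

import (
	"errors"
	"fmt"
)

// closer is a hypothetical stand-in for a watcher or datastore handle.
type closer interface {
	Close() error
}

// fakeCloser returns a preconfigured error from Close.
type fakeCloser struct{ err error }

func (f fakeCloser) Close() error { return f.err }

// errWatcher is an example failure used by main.
var errWatcher = errors.New("watcher already closed")

// closeAll closes every component, even when an earlier one fails,
// and returns all failures combined (nil when nothing failed).
func closeAll(components ...closer) error {
	var errs []error
	for i, c := range components {
		if err := c.Close(); err != nil {
			errs = append(errs, fmt.Errorf("closing component %d: %w", i, err))
		}
	}
	return errors.Join(errs...)
}

func main() {
	err := closeAll(fakeCloser{}, fakeCloser{err: errWatcher})
	fmt.Println(err)
	fmt.Println(errors.Is(err, errWatcher))
}
```

Because `errors.Join` returns nil when it is given no non-nil errors, the success path needs no special case.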

Comment on lines 77 to 91
func (c *CRStatusChecker) checkCondition(obj *unstructured.Unstructured) bool {
status, found, err := unstructured.NestedMap(obj.Object, "status")
if err != nil || !found {
return true
}

conditions, found, err := unstructured.NestedSlice(status, "conditions")
if err != nil || !found {
return true
}

conditionStatus := c.findConditionStatus(conditions)

return !c.isTerminal(conditionStatus)
}
Contributor

⚠️ Potential issue | 🔴 Critical

Critical: Inverted skip/allow logic in checkCondition remains unfixed.

This critical issue was previously identified but not yet addressed. The return values in checkCondition are inverted:

  • Lines 80 and 85: Return true (skip CR creation) when status or conditions are missing, but missing status/conditions should be treated as non-terminal (do not skip). Should return false.
  • Line 90: Returns !c.isTerminal(conditionStatus), meaning skip when NOT terminal. But terminal status should trigger skip. Should return c.isTerminal(conditionStatus).
🐛 Proposed fix
 func (c *CRStatusChecker) checkCondition(obj *unstructured.Unstructured) bool {
 	status, found, err := unstructured.NestedMap(obj.Object, "status")
 	if err != nil || !found {
-		return true
+		return false
 	}
 
 	conditions, found, err := unstructured.NestedSlice(status, "conditions")
 	if err != nil || !found {
-		return true
+		return false
 	}
 
 	conditionStatus := c.findConditionStatus(conditions)
 
-	return !c.isTerminal(conditionStatus)
+	return c.isTerminal(conditionStatus)
 }
🤖 Prompt for AI Agents
In @fault-remediation/pkg/crstatus/deprecated_checker.go around lines 77 - 91,
In CRStatusChecker.checkCondition, the skip/allow logic is inverted: when status
or conditions are missing you should treat the resource as non-terminal (do not
skip) and when a conditionStatus is terminal you should skip. Change the two
early returns that currently return true to return false (for the
status/conditions-not-found or error cases), and change the final return from
!c.isTerminal(conditionStatus) to c.isTerminal(conditionStatus); this uses the
existing methods findConditionStatus and isTerminal to determine terminality
correctly.
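The corrected skip semantics from this comment can be illustrated without Kubernetes dependencies. The sketch below uses plain `map[string]any` values in place of `unstructured.Unstructured`; `shouldSkip` and `terminalStatuses` are hypothetical, and the set of terminal statuses is an assumption, since the real checker decides terminality in its own `isTerminal` method.

```go
package main

import "fmt"

// terminalStatuses is an assumption for illustration only.
var terminalStatuses = map[string]bool{"True": true, "False": true}

// shouldSkip reports whether CR creation should be skipped for obj.
// Missing status or conditions count as non-terminal, so creation is allowed.
func shouldSkip(obj map[string]any, conditionType string) bool {
	status, ok := obj["status"].(map[string]any)
	if !ok {
		return false
	}
	conditions, ok := status["conditions"].([]any)
	if !ok {
		return false
	}
	for _, raw := range conditions {
		cond, ok := raw.(map[string]any)
		if !ok || cond["type"] != conditionType {
			continue
		}
		s, _ := cond["status"].(string)
		// Skip only when the matching condition is terminal.
		return terminalStatuses[s]
	}
	return false
}

func main() {
	done := map[string]any{"status": map[string]any{"conditions": []any{
		map[string]any{"type": "Completed", "status": "True"},
	}}}
	fmt.Println(shouldSkip(map[string]any{}, "Completed"))
	fmt.Println(shouldSkip(done, "Completed"))
}
```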

Comment on lines 266 to 293
group := common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
if group != "" && c.AnnotationManager != nil {
if err = c.AnnotationManager.UpdateRemediationState(ctx, healthEvent.NodeName,
group, crName); err != nil {
slog.Warn("Failed to update node annotation", "node", healthEvent.NodeName,
"error", err)

return "", err
}
}

// Get the actual name of the created CR
actualCRName := createdCR.GetName()
log.Printf("Created Maintenance CR %s successfully for node %s", actualCRName, healthEvent.NodeName)

// Update node annotation with CR reference
group := common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
if group != "" && c.annotationManager != nil {
if err := c.annotationManager.UpdateRemediationState(ctx, healthEvent.NodeName,
group = common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
if group != "" && c.AnnotationManager != nil {
if err = c.AnnotationManager.UpdateRemediationState(ctx, healthEvent.NodeName,
group, actualCRName); err != nil {
slog.Warn("Failed to update node annotation", "node", healthEvent.NodeName,
"error", err)

return "", err
}
}

return actualCRName, nil
Contributor

⚠️ Potential issue | 🟠 Major

Duplicate annotation update - only the second update is necessary.

Lines 266-275 update the annotation with crName, then lines 282-291 update again with actualCRName. Since actualCRName := createdCR.GetName() provides the actual CR name from the created resource, only the second update is needed. Additionally, line 282 redundantly re-assigns group.

🔎 Suggested fix
-	group := common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
-	if group != "" && c.AnnotationManager != nil {
-		if err = c.AnnotationManager.UpdateRemediationState(ctx, healthEvent.NodeName,
-			group, crName); err != nil {
-			slog.Warn("Failed to update node annotation", "node", healthEvent.NodeName,
-				"error", err)
-
-			return "", err
-		}
-	}
-
 	// Get the actual name of the created CR
 	actualCRName := createdCR.GetName()
 	log.Printf("Created Maintenance CR %s successfully for node %s", actualCRName, healthEvent.NodeName)
 
 	// Update node annotation with CR reference
-	group = common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
+	group := common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
 	if group != "" && c.AnnotationManager != nil {
 		if err = c.AnnotationManager.UpdateRemediationState(ctx, healthEvent.NodeName,
 			group, actualCRName); err != nil {
🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/deprecated_remediation.go around lines 266
- 293, The code performs two redundant annotation updates: first calling
c.AnnotationManager.UpdateRemediationState with crName, then again with
actualCRName after obtaining actualCRName := createdCR.GetName(); remove the
first update block (the if that checks group != "" and calls
UpdateRemediationState with crName) and the redundant reassignment of group
before the second update, leaving only the single UpdateRemediationState call
that uses actualCRName (keep references to common.GetRemediationGroupForAction,
createdCR.GetName, and c.AnnotationManager.UpdateRemediationState) so the node
annotation is set once with the real CR name and the function still returns
actualCRName, nil.
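The single-update flow requested here might look roughly like the following. It is a sketch under stated assumptions: `annotationUpdater`, `recorder`, and `annotateCreatedCR` are hypothetical names, and the `context.Context` parameter of the real `UpdateRemediationState` is dropped to keep the example short.

```go
package main

import "fmt"

// annotationUpdater is a hypothetical, narrowed version of the annotation manager.
type annotationUpdater interface {
	UpdateRemediationState(node, group, crName string) error
}

// recorder captures calls so the example can show how many updates happen.
type recorder struct{ calls []string }

func (r *recorder) UpdateRemediationState(node, group, crName string) error {
	r.calls = append(r.calls, crName)
	return nil
}

// annotateCreatedCR records the CR on the node exactly once, using the name
// reported by the created object rather than the requested name.
func annotateCreatedCR(m annotationUpdater, node, group, actualCRName string) (string, error) {
	if group == "" || m == nil {
		return actualCRName, nil
	}
	if err := m.UpdateRemediationState(node, group, actualCRName); err != nil {
		return "", fmt.Errorf("updating remediation annotation on node %s: %w", node, err)
	}
	return actualCRName, nil
}

func main() {
	r := &recorder{}
	name, err := annotateCreatedCR(r, "node-1", "restart", "maintenance-node-1-abc12")
	fmt.Println(name, err, len(r.calls))
}
```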

newStateLabelValue NVSentinelStateLabelValue,
removeStateLabel bool,
) (bool, error) {
node := &corev1.Node{}
Collaborator

Instead of duplicating the logic, can we instead get a Kubernetes client from controller runtime manager and use that:

client, err := kubernetes.NewForConfig(mgr.GetConfig())

It is not ideal, but it will allow us to keep the logic in a single function, and we don't have to make bug fixes in two places. Once we have all the modules cut over, we can do a final refactor to use the controller-runtime client, what do you think?

Contributor Author

Yeah that's a good call

Contributor Author

Part of the reason was that the retries here would conflict with the built-in ctrl-runtime retries when an error is returned, but that's probably worth the tradeoff of not having the duplicate logic.

Collaborator

Same comment as earlier about converting clients so that we don't have duplicated logic

Contributor Author

ended up consolidating on ctrl-runtime

var obj map[string]any
if err = yaml.Unmarshal(buf.Bytes(), &obj); err != nil {
slog.Error("Failed to unmarshal YAML", "error", err)
return "", err
Collaborator

Can we wrap errors please?

Collaborator

I'm wondering if we should just bite the bullet and go all in on controller-runtime for this module... I'm not really a fan of all the duplication we have due to use of the "kubernetes" client vs. the controller-runtime client. TBH, it is much more than I had initially anticipated. I'm worried that bug fixes will become complex until we get rid of one of the flows. What do you think?

Contributor Author

Yeah agreed I think that would be much cleaner at this point

@ivelichkovich force-pushed the remediationerrors branch 4 times, most recently from 6806866 to 560b1dd (January 9, 2026 00:34)
Contributor

@coderabbitai bot left a comment

Actionable comments posted: 17

Caution

Some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
fault-remediation/pkg/initializer/init.go (1)

38-43: Blocker: InitializationParams.config is unexported, likely breaking callers outside initializer.
If InitializeAll is called from another package (typical), InitializationParams{config: ...} won't compile. Rename/export it (e.g., RestConfig *rest.Config).

Proposed diff
 type InitializationParams struct {
-	config             *rest.Config
+	RestConfig         *rest.Config
 	TomlConfigPath     string
 	DryRun             bool
 	EnableLogCollector bool
 }
fault-remediation/pkg/crstatus/checker.go (1)

46-74: Incomplete comment and potential namespace handling for cluster-scoped resources.

  1. Line 50 has an incomplete comment (// note, if); this should be completed or removed.

  2. Line 67 uses resource.Namespace for the ObjectKey, but the config.MaintenanceResource struct has a Scope field that indicates whether the resource is cluster-scoped or namespaced. For cluster-scoped resources, ensure that Namespace is empty; otherwise, the lookup may fail silently. The current code logs and returns false (allowing creation), which provides a safe fallback, but you may want to explicitly handle scope.

Suggested fix for the incomplete comment
-	return true // note, if
+	return true // No configuration found; skip CR creation
🤖 Fix all issues with AI agents
In @.gitignore:
- Line 143: The .gitignore now contains a broad ".idea/" rule but still keeps
many specific .idea/* entries (e.g., ".idea/replstate.xml", ".idea/sonarlint/",
".idea/sonarIssues.xml"), causing redundancy and making the PR summary
inaccurate; fix by either removing all the specific .idea/* lines so the single
".idea/" entry covers them, or remove the general ".idea/" entry and keep the
explicit granular rules; apply the chosen approach consistently in the .gitignore
and update the PR summary to accurately reflect the consolidation or retained
granularity.

In @fault-remediation/main.go:
- Line 111: Replace the bare TODO comment "//TODO: setup informers for node and
job" with a TODO that references a tracked issue (e.g., include the issue number
or full issue URL) so it reads something like "// TODO: setup informers for node
and job - tracked in ISSUE-1234" (or a repo issue URL); ensure the comment
includes the canonical issue identifier per project guidelines and keep the same
intent text.

In @fault-remediation/pkg/annotation/annotation_test.go:
- Around line 146-168: The test TestClearRemediationState calls
AnnotationManager.ClearRemediationState and then immediately reassigns err when
calling client.Get, so the result of ClearRemediationState is never asserted;
after invoking annotationManager.ClearRemediationState(context.TODO(), nodeName)
capture and assert the call succeeded (e.g., assert.NoError(t, err) or
require.NoError(t, err)) before you reuse err for the client.Get call to ensure
failures in ClearRemediationState are detected.

In @fault-remediation/pkg/annotation/annotation.go:
- Around line 63-73: In UpdateRemediationState, the code assigns a default state
on the error path then immediately returns the error, making the assignment
dead/unused; either remove the unreachable assignment (delete the state =
&RemediationStateAnnotation{...} line) so the function simply logs and returns
the error, or if the intended behavior is to continue with a default state
instead of returning, initialize state and node to sane defaults and clear err
(do not return) so the rest of UpdateRemediationState can proceed; refer to
UpdateRemediationState, the local variables state/node/err and the call to
GetRemediationState to locate the change.
- Line 40: Replace the bare TODO comment in annotation.go ("TODO: maybe split
this up so it's not returning both node and state") with a TODO that references
the tracked issue ID or URL (e.g., "TODO: track in ISSUE-1234" or "TODO:
https://.../issues/1234") so the note points to a concrete ticket; update the
comment near the function or return logic that currently returns both node and
state (where the TODO is located) to include that issue reference.
- Around line 27-60: In GetRemediationState the JSON unmarshal error is
currently only logged and the function returns a nil error, hiding corrupt
annotations; change the handler for json.Unmarshal failure (in the block
referencing AnnotationKey and RemediationStateAnnotation) to return a wrapped
error (e.g. fmt.Errorf("failed to unmarshal remediation annotation for node %s:
%w", nodeName, err)) so the caller can retry/fail, or alternatively attempt to
clear the bad annotation via the client (m.client.Update) and return a clear
success/error outcome; ensure you return a non-nil error when unmarshal fails and
keep references to node and nodeName for context.
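A minimal sketch of the "return the unmarshal error" behavior requested for GetRemediationState follows. It works on a plain annotations map instead of a `corev1.Node`, and `annotationKey`, `remediationState`, and `stateFromAnnotations` are hypothetical names, so treat it as an illustration of the error path only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// annotationKey is a hypothetical key; the real one is annotation.AnnotationKey.
const annotationKey = "example.com/remediation-state"

// remediationState is a hypothetical, simplified annotation payload.
type remediationState struct {
	Groups map[string]string `json:"groups"`
}

// stateFromAnnotations returns an empty state when the annotation is absent
// and a wrapped error when it is present but corrupt, so callers can retry or fail.
func stateFromAnnotations(nodeName string, annotations map[string]string) (*remediationState, error) {
	state := &remediationState{Groups: map[string]string{}}
	raw, ok := annotations[annotationKey]
	if !ok {
		return state, nil
	}
	if err := json.Unmarshal([]byte(raw), state); err != nil {
		return nil, fmt.Errorf("failed to unmarshal remediation annotation for node %s: %w", nodeName, err)
	}
	return state, nil
}

func main() {
	_, err := stateFromAnnotations("node-1", map[string]string{annotationKey: "{"})
	fmt.Println(err)
}
```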

In @fault-remediation/pkg/initializer/init.go:
- Around line 82-97: The remediation client error message is misleading and the
kube client error is returned unwrapped; update the
remediation.NewRemediationClient error handling to return a clear, specific
message like "error initializing remediation client" (or similar) instead of
"ctrl runtime client", and wrap the kubernetes.NewForConfig error with context
using fmt.Errorf("error creating kube client: %w", err) so both failures provide
actionable context; adjust the return statements around
remediation.NewRemediationClient and kubernetes.NewForConfig accordingly in
init.go.

In @fault-remediation/pkg/reconciler/reconciler_test.go:
- Around line 49-51: Remove the locally defined CRStatusCheckerInterface type
declaration and replace any local usages with the imported
crstatus.CRStatusCheckerInterface; delete the type block "type
CRStatusCheckerInterface interface { IsSuccessful(ctx context.Context, crName
string) bool }" and ensure all references in the test (e.g., mock variables,
function signatures) use crstatus.CRStatusCheckerInterface, and if necessary
adjust imports to avoid unused-import or missing-symbol errors.

In @fault-remediation/pkg/remediation/fault_remediation_client_interface.go:
- Around line 29-35: Exported type FaultRemediationClientInterface lacks a godoc
comment; add a one-line Go doc comment immediately above the type declaration
that begins with "FaultRemediationClientInterface" and briefly describes its
purpose and role (e.g., that it defines methods for creating maintenance
resources, running log collector jobs, and providing access to annotation
manager, status checker, and config). Ensure the comment follows Go convention
(starts with the type name) and references the interface as a whole; leave the
existing method signatures unchanged.

In @fault-remediation/pkg/remediation/remediation_test.go:
- Around line 526-527: The test assertion uses assert.Equal with arguments
reversed; change the call in remediation_test.go from assert.Equal(t,
result.RequeueAfter, tt.requeueTime) to assert.Equal(t, tt.requeueTime,
result.RequeueAfter) so the expected value (tt.requeueTime) is first and the
actual (result.RequeueAfter) is second for clearer failure output.
- Around line 29-97: The table-driven test leaves tt.client nil which makes
TestNewCtrlRuntimeClient brittle; either populate tt.client with a real fake
controller-runtime client before calling NewRemediationClient or remove the
unused client field from the test cases. To fix, in TestNewCtrlRuntimeClient
initialize tt.client using controller-runtime's fake client builder (e.g.,
fake.NewClientBuilder().WithScheme(yourScheme).WithObjects(...).Build()) for
each subtest that needs a non-nil client and pass that into
NewRemediationClient, or simplify the table by deleting the client field and
always pass nil if the constructor is expected to accept nil.

In @fault-remediation/pkg/remediation/remediation.go:
- Around line 116-139: Validate and harden loadAndParseTemplate: ensure fileName
is a plain base name (no path separators or parent refs) and refuse values
containing "/" or "\" or ".." (or compare filepath.Base(fileName) == fileName),
then build the path and verify the resolved absolute path is inside the
mountPath root before reading; also set the template option to fail on missing
keys by calling tmpl = template.New(templateName).Option("missingkey=error")
prior to Parse so templates error on unknown data.
- Around line 43-114: Add proper godoc comments above the exported
FaultRemediationClient type and the NewRemediationClient function: for
FaultRemediationClient add a one-line summary describing its role (e.g.,
"FaultRemediationClient manages remediation templates, annotation manager and
status checking for remediation actions.") and for NewRemediationClient add a
summary plus brief parameter/return description (single sentence each)
explaining the client, dryRun flag, and remediationConfig parameters and what
the function returns (constructed *FaultRemediationClient or error). Ensure the
comments start immediately above the declarations and follow Go doc style (begin
with the symbol name).
- Around line 456-491: In FaultRemediationClient.checkLogCollectorComplete: fix
the annotation gating and nil-time panics by treating a missing annotations map
as "not counted" (i.e., always attempt to set jobMetricsAlreadyCountedAnnotation
when its value != trueStringVal), create or initialize updateJob.Annotations
before setting the flag, and perform nil checks for job.Status.StartTime and
job.Status.CompletionTime before computing duration (skip metrics or use safe
zero/early-return if either is nil); ensure you still call c.client.Update with
the patched annotation and only record metrics when duration is computed safely
and the annotation was not already true.

In @fault-remediation/pkg/remediation/templates/rebootnode-template.yaml:
- Around line 15-20: The YAML scalars that begin with template substitutions
must be quoted to avoid invalid YAML; update the template to quote the
apiVersion and nodeName values so they become apiVersion:
"{{.ApiGroup}}/{{.Version}}" and spec.nodeName: "{{.NodeName}}". Locate the
apiVersion line and the spec nodeName line in the RebootNode template and wrap
the entire substitution (including the slash) in double quotes; keep other
fields (e.g., metadata.name) unchanged unless they also begin with a brace.
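The template-hardening advice for loadAndParseTemplate and the quoted-scalar advice for the RebootNode template can be tried out with `text/template` alone. The helpers below (`validateTemplateFileName`, `renderStrict`) are hypothetical, and the sketch omits the additional check that the resolved absolute path stays inside the mount path.

```go
package main

import (
	"bytes"
	"fmt"
	"path/filepath"
	"strings"
	"text/template"
)

// validateTemplateFileName accepts only a plain base name: no separators,
// no parent references, and nothing that filepath.Base would change.
func validateTemplateFileName(name string) error {
	if name == "" || name == "." || name == ".." ||
		strings.ContainsAny(name, `/\`) || filepath.Base(name) != name {
		return fmt.Errorf("invalid template file name %q", name)
	}
	return nil
}

// renderStrict parses text with missingkey=error, so executing the template
// fails when the data map lacks a referenced key.
func renderStrict(name, text string, data map[string]any) (string, error) {
	tmpl, err := template.New(name).Option("missingkey=error").Parse(text)
	if err != nil {
		return "", fmt.Errorf("parsing template %s: %w", name, err)
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("executing template %s: %w", name, err)
	}
	return buf.String(), nil
}

func main() {
	fmt.Println(validateTemplateFileName("../secret.yaml"))
	out, err := renderStrict("reboot", `nodeName: "{{.NodeName}}"`, map[string]any{"NodeName": "node-1"})
	fmt.Println(out, err)
	_, err = renderStrict("reboot", `nodeName: "{{.NodeName}}"`, map[string]any{})
	fmt.Println(err)
}
```

Quoting the substitution inside the template text keeps the rendered scalar valid YAML even when the value starts with a brace or another special character.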
🧹 Nitpick comments (11)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)

104-108: Drop unnecessary quotes for consistency.
path: "/readyz" and port: "health" work, but quoting is inconsistent with livenessProbe and adds noise.

Proposed diff
           readinessProbe:
             httpGet:
-              path:  "/readyz"
-              port: "health"
+              path: /readyz
+              port: health
fault-remediation/pkg/remediation/remediation_test.go (1)

22-24: Consider envtest for higher-fidelity behavior where it matters.
Repo guidance suggests envtest over fake clients for Kubernetes controller behavior; these tests validate controller-runtime client flows, status, and Jobs, so fake-client semantics may miss real API behavior.

Also applies to: 243-246, 505-509

fault-remediation/pkg/annotation/annotation_test.go (2)

15-202: Align test names with the repo’s naming convention.
Consider renaming to TestGetRemediationState_NodeNotFound_ReturnsError, etc., to match TestFunctionName_Scenario_ExpectedBehavior.


10-10: Consider envtest if these tests are intended to validate controller behavior.
Repo guidance prefers envtest over fake clients; if you're relying on real API semantics for annotations/patches, fake client can diverge.

fault-remediation/main.go (1)

147-151: Cleanup runs with a cancelled ctx; use a fresh timeout context for CloseAll.
On SIGTERM, ctx is cancelled; reconciler.CloseAll(ctx) may be unable to close gracefully.

Proposed tweak
 defer func() {
-  if err := reconciler.CloseAll(ctx); err != nil {
+  shutdownCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+  defer cancel()
+  if err := reconciler.CloseAll(shutdownCtx); err != nil {
     slog.Error("failed to close datastore components", "error", err)
   }
 }()
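To see why the fresh context matters, the following self-contained sketch contrasts cleanup under an already-cancelled context with cleanup under a new timeout context. `closeStores` and `cleanupOutcomes` are hypothetical, and the cleanup function is assumed to give up as soon as its context is done, which may be stricter than the real `CloseAll`.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// closeStores is a hypothetical cleanup that refuses to run once its context is done.
func closeStores(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("cleanup aborted: %w", err)
	}
	return nil
}

// cleanupOutcomes runs the cleanup twice: once with a context that was
// cancelled (as after SIGTERM) and once with a fresh 30-second timeout context.
func cleanupOutcomes() (withCancelled, withFresh error) {
	parent, cancel := context.WithCancel(context.Background())
	cancel() // simulates the signal handler cancelling the main context

	fresh, cancelFresh := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancelFresh()

	return closeStores(parent), closeStores(fresh)
}

func main() {
	withCancelled, withFresh := cleanupOutcomes()
	fmt.Println("cancelled ctx:", withCancelled)
	fmt.Println("fresh ctx:", withFresh)
}
```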
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)

17-26: Non-standard import grouping.

The imports have inconsistent grouping with blank lines in unexpected places. Go convention groups imports into standard library, external packages, and internal packages, each separated by a single blank line.

Suggested import grouping
 import (
 	"context"
-
 	"log"
 	"os"
 	"path/filepath"
-	"sigs.k8s.io/controller-runtime/pkg/client"
 	"sync"
 	"testing"
 	"time"
+
+	"sigs.k8s.io/controller-runtime/pkg/client"
 	...

370-372: TODO comments should reference issues.

Per coding guidelines, TODO comments should reference GitHub issues. These TODOs indicate potential issues with state transition handling that should be tracked.

-	// TODO: ignoring error otherwise need to properly walk state transitions
-	_, _ = stateManager.UpdateNVSentinelStateNodeLabel(ctx, nodeName, statemanager.DrainSucceededLabelValue, false)
+	// TODO(#XXX): ignoring error otherwise need to properly walk state transitions
+	_, _ = stateManager.UpdateNVSentinelStateNodeLabel(ctx, nodeName, statemanager.DrainSucceededLabelValue, false)

1123-1125: Potential test flakiness with fixed sleep duration.

Using time.Sleep(500 * time.Millisecond) for synchronization can lead to flaky tests. Consider using assert.Eventually or require.Eventually with appropriate timeout and polling intervals for more reliable eventual consistency checks.
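The polling idea behind `assert.Eventually` can be shown with a hand-rolled helper. This is a standard-library stand-in written for illustration, not testify's implementation; `eventually`, `waitForBackgroundWork`, and `waitForNever` are hypothetical names, and `atomic.Bool` assumes Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// eventually polls cond every tick until it returns true or timeout elapses.
func eventually(cond func() bool, timeout, tick time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

// waitForBackgroundWork simulates asynchronous work that finishes after a
// short delay and waits for it by polling instead of a fixed sleep.
func waitForBackgroundWork() bool {
	var done atomic.Bool
	go func() {
		time.Sleep(20 * time.Millisecond)
		done.Store(true)
	}()
	return eventually(done.Load, 2*time.Second, 5*time.Millisecond)
}

// waitForNever shows the timeout path: the condition never becomes true.
func waitForNever() bool {
	return eventually(func() bool { return false }, 30*time.Millisecond, 5*time.Millisecond)
}

func main() {
	fmt.Println(waitForBackgroundWork())
	fmt.Println(waitForNever())
}
```

Polling returns as soon as the condition holds, so the generous timeout only costs time when the test is actually failing.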

fault-remediation/pkg/reconciler/reconciler_test.go (2)

17-26: Non-standard import grouping.

Similar to the e2e test file, imports have inconsistent grouping. Standard library imports should be grouped together, followed by external packages, then internal packages.

Suggested import ordering
 import (
 	"context"
 	"errors"
 	"fmt"
-	"github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation"
-	"github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
-	corev1 "k8s.io/api/core/v1"
-	ctrl "sigs.k8s.io/controller-runtime"
 	"testing"
 	"time"
+
+	corev1 "k8s.io/api/core/v1"
+	ctrl "sigs.k8s.io/controller-runtime"
+	...
+
+	"github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation"
+	"github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
 	...

120-135: MockNodeAnnotationManager always returns nil for Node.

The GetRemediationState mock always returns nil for the *corev1.Node return value. This is acceptable for current unit tests but may cause issues if tests need to verify node-related behavior. Consider documenting this limitation or adding a configurable node return value.

fault-remediation/pkg/reconciler/reconciler.go (1)

276-279: High cyclomatic complexity flagged by nolint.

The // nolint: cyclop // todo comment indicates this function has high complexity. While this is noted and not blocking, consider refactoring handleRemediationEvent in a follow-up to improve maintainability, for example by extracting the log collector, CR creation, and status update flows into separate helper methods.

Would you like me to open an issue to track the refactoring of handleRemediationEvent to reduce complexity?

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b3ff7f and 6806866.

⛔ Files ignored due to path filters (1)
  • commons/go.sum is excluded by !**/*.sum
📒 Files selected for processing (29)
  • .gitignore
  • commons/go.mod
  • commons/pkg/statemanager/statemanager.go
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/annotation.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/remediation.go
  • fault-remediation/pkg/reconciler/remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
💤 Files with no reviewable changes (7)
  • distros/kubernetes/nvsentinel/values.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • fault-remediation/pkg/reconciler/remediation_test.go
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • fault-remediation/pkg/reconciler/annotation.go
  • fault-remediation/pkg/reconciler/remediation.go
✅ Files skipped from review due to trivial changes (2)
  • commons/go.mod
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
🚧 Files skipped from review as they are similar to previous changes (5)
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • commons/pkg/statemanager/statemanager.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/main.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧠 Learnings (14)
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/annotation/annotation_interface.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧬 Code graph analysis (7)
fault-remediation/pkg/annotation/annotation_test.go (2)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (23-25)
  • AnnotationKey (11-11)
  • EquivalenceGroupState (28-34)
fault-remediation/pkg/annotation/annotation.go (1)
  • AnnotationManager (16-18)
fault-remediation/pkg/remediation/remediation_test.go (4)
fault-remediation/pkg/config/config.go (2)
  • Template (47-50)
  • MaintenanceResource (27-44)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_COMPONENT_RESET (44-44)
fault-remediation/pkg/remediation/remediation.go (2)
  • NewRemediationClient (56-114)
  • FaultRemediationClient (43-54)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (12-15)
fault-remediation/pkg/initializer/init.go (2)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewRemediationClient (56-114)
commons/pkg/statemanager/statemanager.go (2)
  • NewStateManager (205-209)
  • StateManager (196-199)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (23-25)
  • AnnotationKey (11-11)
  • EquivalenceGroupState (28-34)
fault-remediation/main.go (3)
fault-remediation/pkg/initializer/init.go (2)
  • InitializationParams (38-43)
  • InitializeAll (49-148)
commons/pkg/auditlogger/roundtripper.go (1)
  • NewAuditingRoundTripper (42-47)
fault-remediation/pkg/reconciler/reconciler.go (1)
  • FaultRemediationReconciler (61-69)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (27-44)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (3)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (12-15)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (15-20)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (9-11)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (16)
.gitignore (1)

143-143: Verify scope: this change appears orthogonal to PR objectives.

The PR objectives describe retry behavior and fault-remediation system refactoring, but this change affects only .gitignore. Confirm that core remediation/retry logic changes are included elsewhere in the PR.

distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)

96-111: Verify the app actually serves /readyz on the health port after this chart change.
This chart now hard-wires readiness to path: "/readyz" and port: "health" (lines 106-107). If the binary doesn't expose /readyz on the health listener (or uses a different scheme/port), pods will flap.

fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)

17-20: Confirm namespace handling for namespaced actions.
If RebootNode is namespaced and the client doesn't set namespace after rendering, omitting metadata.namespace will cause create failures (or default-namespace surprises). Consider adding namespace: "{{.Namespace}}" gated by scope, or ensure the code always injects namespace for namespaced actions.

fault-remediation/pkg/annotation/annotation_interface.go (1)

9-34: LGTM: clear, minimal public contract and JSON shape.
AnnotationKey constant + typed RemediationStateAnnotation / EquivalenceGroupState matches intended annotation payload.

fault-remediation/pkg/annotation/annotation.go (1)

62-105: and

fault-remediation/pkg/crstatus/checker.go (2)

76-90: LGTM: Condition checking logic is correct.

The checkCondition method properly extracts nested status/conditions from the unstructured object and delegates to findConditionStatus and isTerminal. The boolean logic correctly returns true (skip creation) when terminal conditions aren't met.


28-44: LGTM: Clean refactor to controller-runtime client.

The struct and constructor are properly updated to use client.Client from controller-runtime. This aligns with the PR's goal of consolidating on controller-runtime patterns.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

208-217: LGTM: Proper controller-runtime manager setup for tests.

The manager is correctly initialized with the envtest config and metrics disabled (BindAddress: "0"), which is appropriate for test environments. Using envtest aligns with the coding guidelines.


302-329: LGTM: Test remediation client factory properly updated.

The createTestRemediationClient function correctly uses the new remediation.NewRemediationClient with controller-runtime client and returns the interface type. Template configuration is properly structured.

fault-remediation/pkg/reconciler/reconciler_test.go (2)

350-404: LGTM: Tests properly updated for new performRemediation signature.

The tests correctly handle the updated performRemediation signature that returns (string, error). Both success and failure paths are covered with appropriate assertions.


576-613: LGTM: Log collector test properly updated for new signature.

The test correctly uses the new RunLogCollectorJob signature returning (ctrl.Result, error) and properly verifies the conditional invocation based on EnableLogCollector and action type.

fault-remediation/pkg/reconciler/reconciler.go (5)

203-211: Early return on label update failure may leave state inconsistent.

If the label update to RemediatingLabelValue fails, the function returns immediately without attempting CR creation. This could leave the node in the previous state while the event is retried. Consider whether this is the desired behavior, or if the CR creation should still be attempted (with the label error logged but not blocking).

However, given the PR intent to "throw errors to trigger retries," this early return will cause controller-runtime to requeue and retry the entire operation, which is a valid approach for achieving consistency.

Verify that the retry behavior is acceptable for your use case. If label updates are flaky, this could cause repeated retries before CR creation succeeds.


218-245: LGTM: Error handling properly defers CR creation error for state update.

The code correctly captures createMaintenanceResourceError but continues to update the state label before returning. Using errors.Join to combine errors when both the CR creation and label update fail is the right approach. This ensures the caller receives all relevant error information for logging/debugging.


328-356: LGTM: Log collector and remediation flow properly chain errors.

The updated flow correctly:

  1. Returns early if log collector fails or requests requeue
  2. Captures performRemediationErr and uses it to determine nodeRemediatedStatus
  3. Uses errors.Join when both remediation and status update fail
  4. Returns the remediation error after status update attempt

This supports the PR goal of throwing errors to trigger retries.


418-459: Error handling in checkExistingCRStatus now returns errors instead of silently allowing.

The changes on lines 421 and 453-455 now return errors instead of silently allowing CR creation. This is a behavioral change:

  • Line 421: GetRemediationState failure now returns error (previously may have allowed creation)
  • Line 455: RemoveGroupFromState failure now returns (false, "", err) instead of allowing creation

This aligns with the PR intent to trigger retries on errors. However, ensure this doesn't cause infinite retry loops if these operations consistently fail for a node.

Verify that transient failures in annotation operations will eventually succeed on retry, and that permanent failures (e.g., node doesn't exist) are handled appropriately upstream.


66-87: LGTM: Public Config field and constructor updates.

The Config field is now exported (capitalized), and the constructor properly initializes both Config and annotationManager from the provided configuration. This aligns with the broader refactor to expose configuration for testing and external access.

Comment on lines +526 to +527
assert.Equal(t, result.RequeueAfter, tt.requeueTime)


⚠️ Potential issue | 🟡 Minor

Swap assert.Equal argument order (expected, actual).
assert.Equal(t, result.RequeueAfter, tt.requeueTime) should be assert.Equal(t, tt.requeueTime, result.RequeueAfter) for clearer failures.

Proposed diff
-			assert.Equal(t, result.RequeueAfter, tt.requeueTime)
+			assert.Equal(t, tt.requeueTime, result.RequeueAfter)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
assert.Equal(t, result.RequeueAfter, tt.requeueTime)
assert.Equal(t, tt.requeueTime, result.RequeueAfter)
🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation_test.go around lines 526 -
527, The test assertion uses assert.Equal with arguments reversed; change the
call in remediation_test.go from assert.Equal(t, result.RequeueAfter,
tt.requeueTime) to assert.Equal(t, tt.requeueTime, result.RequeueAfter) so the
expected value (tt.requeueTime) is first and the actual (result.RequeueAfter) is
second for clearer failure output.

Comment on lines 43 to 114
type FaultRemediationClient struct {
	client     client.Client
	dryRunMode []string

	// Multi-template support
	remediationConfig config.TomlConfig
	templates         map[string]*template.Template // map from template file name to parsed template
	templateMountPath string

	annotationManager annotation.NodeAnnotationManagerInterface
	statusChecker     *crstatus.CRStatusChecker
}

func NewRemediationClient(
	client client.Client,
	dryRun bool,
	remediationConfig config.TomlConfig,
) (*FaultRemediationClient, error) {
	// Determine template mount path
	templateMountPath := remediationConfig.Template.MountPath
	if templateMountPath == "" {
		return nil, fmt.Errorf("template mount path is not configured")
	}

	// Pre-load and parse all templates
	templates := make(map[string]*template.Template)

	// Load templates for multi-template actions
	for actionName, maintenanceResource := range remediationConfig.RemediationActions {
		if maintenanceResource.TemplateFileName == "" {
			return nil, fmt.Errorf("remediation action %s is missing template file configuration", actionName)
		}

		tmpl, err := loadAndParseTemplate(templateMountPath, maintenanceResource.TemplateFileName, actionName)
		if err != nil {
			return nil, fmt.Errorf("failed to load template for action %s: %w", actionName, err)
		}

		templates[actionName] = tmpl
	}

	// Validate namespace configuration for namespaced resources
	for actionName, maintenanceResource := range remediationConfig.RemediationActions {
		if maintenanceResource.Scope == "Namespaced" && maintenanceResource.Namespace == "" {
			return nil, fmt.Errorf("remediation action %s is namespaced but missing namespace configuration", actionName)
		}
	}

	ctrlRuntimeRemediationClient := &FaultRemediationClient{
		client:            client,
		templates:         templates,
		templateMountPath: templateMountPath,
		remediationConfig: remediationConfig,
	}

	if dryRun {
		ctrlRuntimeRemediationClient.dryRunMode = []string{metav1.DryRunAll}
	} else {
		ctrlRuntimeRemediationClient.dryRunMode = []string{}
	}

	// Initialize annotation manager
	ctrlRuntimeRemediationClient.annotationManager = annotation.NewAnnotationManager(client)

	ctrlRuntimeRemediationClient.statusChecker = crstatus.NewCRStatusChecker(
		client,
		remediationConfig.RemediationActions,
		dryRun,
	)

	return ctrlRuntimeRemediationClient, nil
}

⚠️ Potential issue | 🟡 Minor

Add missing godoc for exported FaultRemediationClient and NewRemediationClient.
Both are exported but lack leading comments (per guidelines).

🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go around lines 43 - 114, Add
proper godoc comments above the exported FaultRemediationClient type and the
NewRemediationClient function: for FaultRemediationClient add a one-line summary
describing its role (e.g., "FaultRemediationClient manages remediation
templates, annotation manager and status checking for remediation actions.") and
for NewRemediationClient add a summary plus brief parameter/return description
(single sentence each) explaining the client, dryRun flag, and remediationConfig
parameters and what the function returns (constructed *FaultRemediationClient or
error). Ensure the comments start immediately above the declarations and follow
Go doc style (begin with the symbol name).

Comment on lines +116 to +139
// loadAndParseTemplate loads and parses a template file
func loadAndParseTemplate(mountPath, fileName, templateName string) (*template.Template, error) {
	templatePath := filepath.Join(mountPath, fileName)

	// Check if the template file exists
	if _, err := os.Stat(templatePath); os.IsNotExist(err) {
		return nil, fmt.Errorf("template file does not exist: %s", templatePath)
	}

	// Read and parse the template
	templateContent, err := os.ReadFile(templatePath)
	if err != nil {
		return nil, fmt.Errorf("error reading template file: %w", err)
	}

	tmpl := template.New(templateName)

	tmpl, err = tmpl.Parse(string(templateContent))
	if err != nil {
		return nil, fmt.Errorf("error parsing template: %w", err)
	}

	return tmpl, nil
}

⚠️ Potential issue | 🟠 Major

Harden template loading: prevent path traversal; fail on missing keys.
filepath.Join(mountPath, fileName) will accept ../... from config; also text/template defaults can silently render <no value>. Validate fileName is a base name (no separators) and consider tmpl.Option("missingkey=error").

🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go around lines 116 - 139,
Validate and harden loadAndParseTemplate: ensure fileName is a plain base name
(no path separators or parent refs) and refuse values containing "/" or "\" or
".." (or compare filepath.Base(fileName) == fileName), then build the path and
verify the resolved absolute path is inside the mountPath root before reading;
also set the template option to fail on missing keys by calling tmpl =
template.New(templateName).Option("missingkey=error") prior to Parse so
templates error on unknown data.
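The hardening suggested above could look roughly like the sketch below. It uses only the standard library; validateTemplateFileName, parseStrict, and render are hypothetical helper names, not functions from the PR. Note that the missingkey=error option affects map lookups only; with struct data such as TemplateData, referencing an unknown field already fails at execution time.

```go
package main

import (
	"bytes"
	"fmt"
	"path/filepath"
	"strings"
	"text/template"
)

// validateTemplateFileName rejects anything that is not a plain base name,
// so a configured value such as "../x.yaml" cannot escape the mount path.
func validateTemplateFileName(fileName string) error {
	if fileName == "" || fileName == "." || fileName == ".." {
		return fmt.Errorf("invalid template file name %q", fileName)
	}
	if strings.ContainsAny(fileName, `/\`) || filepath.Base(fileName) != fileName {
		return fmt.Errorf("template file name %q must not contain path separators", fileName)
	}
	return nil
}

// parseStrict parses content so that a missing map key is an execution
// error instead of rendering "<no value>".
func parseStrict(name, content string) (*template.Template, error) {
	return template.New(name).Option("missingkey=error").Parse(content)
}

// render executes tmpl against data and returns the rendered text.
func render(tmpl *template.Template, data map[string]string) (string, error) {
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	fmt.Println(validateTemplateFileName("../secrets.yaml") != nil)

	tmpl, err := parseStrict("demo", "node: {{.NodeName}}")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	_, err = render(tmpl, map[string]string{})
	fmt.Println(err != nil)
}
```

Checking the file name before joining it with the mount path keeps the validation independent of the filesystem; a resolved-path containment check, as the review also suggests, would be an additional layer on top.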

Comment on lines 150 to 245
func (c *FaultRemediationClient) CreateMaintenanceResource(
	ctx context.Context,
	healthEventData *events.HealthEventData,
) (string, error) {
	healthEvent := healthEventData.HealthEvent
	healthEventID := healthEventData.ID

	// Generate CR name
	crName := fmt.Sprintf("maintenance-%s-%s", healthEvent.NodeName, healthEventID)

	// Skip custom resource creation if dry-run is enabled
	if len(c.dryRunMode) > 0 {
		slog.Info("DRY-RUN: Skipping custom resource creation", "node", healthEvent.NodeName)
		return crName, nil
	}

	recommendedActionName := healthEvent.RecommendedAction.String()

	maintenanceResource, selectedTemplate, actionKey, err :=
		c.selectRemediationActionAndTemplate(recommendedActionName, healthEvent.NodeName)
	if err != nil {
		return "", fmt.Errorf("error selecting remediation action and template: %w", err)
	}

	// Get the node object to extract UID for owner reference
	// This also verifies the node exists before creating CR
	node, err := c.getNodeForOwnerReference(ctx, healthEvent.NodeName)
	if err != nil {
		slog.Warn("Failed to get node for owner reference, skipping CR creation",
			"node", healthEvent.NodeName,
			"error", err)

		return "", fmt.Errorf("failed to get node for owner reference: %w", err)
	}

	slog.Info("Creating maintenance CR",
		"node", healthEvent.NodeName,
		"template", actionKey,
		"nodeUID", node.UID)

	templateData := TemplateData{
		NodeName:              healthEvent.NodeName,
		HealthEventID:         healthEventID,
		RecommendedAction:     healthEvent.RecommendedAction,
		RecommendedActionName: recommendedActionName,

		ApiGroup:  maintenanceResource.ApiGroup,
		Version:   maintenanceResource.Version,
		Kind:      maintenanceResource.Kind,
		Namespace: maintenanceResource.Namespace,
	}

	maintenance, yamlStr, err := renderMaintenanceFromTemplate(selectedTemplate, templateData)
	if err != nil {
		slog.Error("Failed to render maintenance template",
			"template", actionKey,
			"error", err)

		return "", fmt.Errorf("error rendering maintenance template: %w", err)
	}

	slog.Debug("Generated YAML from template",
		"template", actionKey,
		"yaml", yamlStr)

	setNodeOwnerRef(maintenance, node)

	err = c.client.Create(ctx, maintenance)
	if err != nil {
		if apierrors.IsAlreadyExists(err) {
			slog.Info("Maintenance CR already exists for node, treating as success", "CR",
				crName, "node", healthEvent.NodeName)
		} else {
			return "", err
		}
	}

	actualCRName := maintenance.GetName()
	slog.Info("Created Maintenance CR successfully",
		"crName", actualCRName,
		"node", healthEvent.NodeName,
		"template", actionKey)

	group := common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
	if group != "" && c.annotationManager != nil {
		if err = c.annotationManager.UpdateRemediationState(ctx, healthEvent.NodeName,
			group, actualCRName, recommendedActionName); err != nil {
			slog.Warn("Failed to update node annotation", "node", healthEvent.NodeName,
				"error", err)

			return "", err
		}
	}

	return actualCRName, nil
}

⚠️ Potential issue | 🟠 Major

Avoid logging full rendered YAML; also ensure CR name/namespace are deterministic.
Line 211-214 logs yamlStr, which can leak secrets/credentials embedded in templates. Additionally, crName is computed but not enforced on the rendered object; AlreadyExists handling can return an empty/incorrect actualCRName if the template uses generateName or omits metadata.name.

Also applies to: 211-214
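A minimal sketch of one way to address both points, assuming it is acceptable to log a size and digest in place of the rendered content. The helpers summarize and effectiveName are hypothetical and not part of the PR; in the real client the name would be read from and set on the rendered object rather than passed as strings.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"log/slog"
)

// summarize returns the size and SHA-256 digest of rendered content so it
// can be correlated in logs without printing the content itself.
func summarize(rendered string) (int, string) {
	sum := sha256.Sum256([]byte(rendered))
	return len(rendered), hex.EncodeToString(sum[:])
}

// effectiveName prefers the name found on the rendered object and falls
// back to the precomputed name when the template left it empty.
func effectiveName(renderedName, computedName string) string {
	if renderedName != "" {
		return renderedName
	}
	return computedName
}

func main() {
	size, digest := summarize("kind: RebootNode\n")
	slog.Debug("Generated YAML from template", "bytes", size, "sha256", digest)
	slog.Info("Maintenance CR name resolved",
		"crName", effectiveName("", "maintenance-node-1-abc"))
}
```

Whether to fall back to the computed name or always overwrite the rendered one is a design choice; overwriting is stricter and makes the AlreadyExists path fully deterministic.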

Comment on lines +358 to +421
func (c *FaultRemediationClient) launchLogCollectorJob(
	ctx context.Context,
	nodeName string,
	eventUID string,
) (batchv1.Job, ctrl.Result, error) {
	// Read Job manifest
	manifestPath := os.Getenv(LogCollectorManifestPathEnv)
	if manifestPath == "" {
		manifestPath = filepath.Join(c.templateMountPath, "log-collector-job.yaml")
	}

	content, err := os.ReadFile(manifestPath)
	if err != nil {
		metrics.LogCollectorErrors.WithLabelValues("manifest_read_error", nodeName).Inc()
		return batchv1.Job{}, ctrl.Result{}, fmt.Errorf("failed to read log collector manifest: %w", err)
	}

	// Create Job from manifest using strong types
	job := &batchv1.Job{}
	if err = yaml.Unmarshal(content, job); err != nil {
		metrics.LogCollectorErrors.WithLabelValues("manifest_unmarshal_error", nodeName).Inc()
		return batchv1.Job{}, ctrl.Result{}, fmt.Errorf("failed to unmarshal Job manifest: %w", err)
	}

	// Set target node
	job.Spec.Template.Spec.NodeName = nodeName
	labels := map[string]string{
		logCollectorNodeLabel:  nodeName,
		logCollectorEventLabel: eventUID,
	}

	job.Labels = labels
	// Get job if exists otherwise create
	existingJobs := &batchv1.JobList{}

	err = c.client.List(
		ctx,
		existingJobs,
		client.MatchingLabels(labels),
		client.InNamespace(job.GetNamespace()),
	)
	if err != nil {
		return batchv1.Job{}, ctrl.Result{}, err
	}

	// There should not be multiple jobs for same event, in this case return error
	// this will then requeue and wait until the jobs clear
	if len(existingJobs.Items) > 1 {
		return batchv1.Job{},
			ctrl.Result{},
			fmt.Errorf("expecting zero or one log collector job per event per node, found %v", existingJobs.Items)
	}

	if len(existingJobs.Items) == 0 {
		err = c.client.Create(ctx, job)
		if err != nil {
			return batchv1.Job{}, ctrl.Result{}, err
		}
		// if created, requeue to check status later
		return batchv1.Job{}, ctrl.Result{RequeueAfter: 10 * time.Second}, nil
	}

	return existingJobs.Items[0], ctrl.Result{}, nil
}

⚠️ Potential issue | 🟠 Major

Log-collector Job namespace/labels handling is fragile (can fail or break manifests).

  • Line 397-398: client.InNamespace(job.GetNamespace()) with empty namespace can error or list unexpectedly; validate job.Namespace is set (Jobs are namespaced).
  • Line 389: job.Labels = labels overwrites manifest labels; merge instead.
  • Line 408: error formats full existingJobs.Items (very large + may leak env/config); report names/UIDs instead.
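The three points above can be sketched with plain maps and strings, independent of the Kubernetes types. All helper names and label keys below are hypothetical; in the real code the inputs would come from job.Labels, job.Namespace, and the names of existingJobs.Items.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// mergeLabels copies base and then overlays extra, so labels already present
// in the manifest survive and the selector labels win on conflict.
func mergeLabels(base, extra map[string]string) map[string]string {
	merged := make(map[string]string, len(base)+len(extra))
	for k, v := range base {
		merged[k] = v
	}
	for k, v := range extra {
		merged[k] = v
	}
	return merged
}

// requireNamespace fails fast when the manifest did not set a namespace.
func requireNamespace(namespace string) error {
	if namespace == "" {
		return errors.New("log collector Job manifest must set metadata.namespace")
	}
	return nil
}

// tooManyJobsError reports only the count and sorted names of the
// conflicting Jobs instead of formatting the full objects.
func tooManyJobsError(names []string) error {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	return fmt.Errorf("expecting zero or one log collector job per event per node, found %d: %s",
		len(sorted), strings.Join(sorted, ", "))
}

func main() {
	labels := mergeLabels(
		map[string]string{"app": "log-collector"},
		map[string]string{"example/node": "node-1"},
	)
	fmt.Println(len(labels))
	fmt.Println(requireNamespace(""))
	fmt.Println(tooManyJobsError([]string{"job-b", "job-a"}))
}
```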

@ivelichkovich force-pushed the remediationerrors branch 4 times, most recently from 4cea34a to aaf78c2 January 9, 2026 01:03
@ivelichkovich changed the title from "fix(remediation): retry on errors and throw errors to trigger retries" to "fix(remediation): use ctrl-runtime business logic" Jan 9, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 19

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)
fault-remediation/pkg/crstatus/checker.go (2)

15-26: Add package-level documentation.

The package lacks a godoc comment describing its purpose and functionality.

As per coding guidelines, package-level godoc is required for all Go packages. Add documentation explaining that this package provides CR status checking functionality for determining whether remediation CR creation should be skipped.


46-74: Add missing package and function documentation; complete the incomplete comment.

Three issues require attention:

  1. Line 50: The incomplete comment "// note, if" should be removed or completed.
  2. Missing godoc for exported function ShouldSkipCRCreation (required per Go conventions).
  3. Missing package-level godoc for the crstatus package (required per coding guidelines).

The error handling at lines 69-71 is intentional: the function returns bool because the interface it implements specifies that return type. Failing open (logging and returning false when Get fails) is a safe default that lets upstream retry logic proceed.

Proposed improvements
  1. Add package-level documentation at the top of the file (after copyright header).
  2. Add godoc for the exported function.
  3. Remove or complete the incomplete comment at line 50:
-func (c *CRStatusChecker) ShouldSkipCRCreation(ctx context.Context, actionName string, crName string) bool {
+// ShouldSkipCRCreation determines whether creating a new CR should be skipped based on
+// the existence and status of an existing CR for the given action and name.
+func (c *CRStatusChecker) ShouldSkipCRCreation(ctx context.Context, actionName string, crName string) bool {
 	resource, exists := c.remediationActions[actionName]
 	if !exists {
 		slog.Error("No remediation configuration found for action", "action", actionName)
-		return true // note, if
+		return true
 	}
fault-remediation/main.go (1)

103-143: Critical: InitializationParams.config is never set.

The InitializationParams struct expects a config *rest.Config field (as seen in fault-remediation/pkg/initializer/init.go line 38), but this field is never populated before passing params to InitializeAll at line 140. This will cause the initialization to use a nil config, leading to a runtime error when trying to create the Kubernetes client at init.go line 90.

🐛 Proposed fix
 func setupCtrlRuntimeManagement(ctx context.Context, params initializer.InitializationParams) error {
 	slog.Info("Running in controller runtime managed mode")
 
 	cfg := ctrl.GetConfigOrDie()
+	params.config = cfg
 	cfg.Wrap(func(rt http.RoundTripper) http.RoundTripper {
 		return auditlogger.NewAuditingRoundTripper(rt)
 	})

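Note: main.go and InitializationParams live in different packages, so the unexported config field cannot be assigned from main as written. Export the field (rename config to Config and assign params.Config = cfg) or set it through a constructor in the initializer package.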

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

17-52: Potential compile blocker: missing annotation2 / remediation2 imports (aliases used but not imported).
This file references annotation2.AnnotationKey and remediation2.NewRemediationClient(...), but those packages/aliases aren't present in the shown import list.

Proposed fix (imports)
 import (
   "context"
   "log"
   "os"
   "path/filepath"
   "sync"
   "testing"
   "time"

   "github.com/prometheus/client_golang/prometheus"
   dto "github.com/prometheus/client_model/go"
   "github.com/stretchr/testify/assert"
   "github.com/stretchr/testify/require"
   corev1 "k8s.io/api/core/v1"
   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
   "k8s.io/apimachinery/pkg/runtime/schema"
   "k8s.io/client-go/dynamic"
   "k8s.io/client-go/kubernetes"
   "k8s.io/client-go/kubernetes/scheme"
   "k8s.io/client-go/rest"
   ctrl "sigs.k8s.io/controller-runtime"
   "sigs.k8s.io/controller-runtime/pkg/envtest"
   metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"

+  annotation2 "github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation"
   "github.com/nvidia/nvsentinel/commons/pkg/statemanager"
   "github.com/nvidia/nvsentinel/data-models/pkg/model"
   "github.com/nvidia/nvsentinel/data-models/pkg/protos"
   "github.com/nvidia/nvsentinel/fault-remediation/pkg/common"
   "github.com/nvidia/nvsentinel/fault-remediation/pkg/config"
   "github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
   "github.com/nvidia/nvsentinel/fault-remediation/pkg/metrics"
+  remediation2 "github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation"
   "github.com/nvidia/nvsentinel/store-client/pkg/datastore"
   "github.com/nvidia/nvsentinel/store-client/pkg/testutils"
 )

Also applies to: 166-177, 299-327


179-265: Avoid flakiness: wait for manager cache sync before using mgr.GetClient() in tests.
Without a cache sync barrier, reads via the ctrl-runtime client (used by the remediation/annotation managers) can race startup.

Proposed fix (start manager + wait for cache sync)
  mgr, err := ctrl.NewManager(testEnv.Config, ctrl.Options{
    Scheme: scheme.Scheme,
    Metrics: metricsserver.Options{
      BindAddress: "0",
    },
  })
  if err != nil {
    panic(err)
  }
  ctrlRuntimeClient = mgr.GetClient()

+ mgrErrCh := make(chan error, 1)
+ go func() {
+   mgrErrCh <- mgr.Start(testContext)
+ }()
+
+ if ok := mgr.GetCache().WaitForCacheSync(testContext); !ok {
+   log.Fatalf("Failed to sync manager cache")
+ }
+
+ // Optional: fail fast if manager exits early
+ select {
+ case err := <-mgrErrCh:
+   if err != nil {
+     log.Fatalf("Manager exited: %v", err)
+   }
+ default:
+ }

  remediationClient, err := createTestRemediationClient(false)
  if err != nil {
    log.Fatalf("Failed to create remediation client: %v", err)
  }

  ...
- go func() {
-   if err := mgr.Start(testContext); err != nil {
-     log.Fatalf("Failed to start the test environment manager: %v", err)
-   }
- }()

Also applies to: 206-216

fault-remediation/pkg/reconciler/reconciler_test.go (2)

17-38: Compile blocker + mock drift: add missing annotation2 import; align log-collector mock signature; set ActionName in mock state.
The file uses annotation2.* types but doesn't import it; also the new production logic relies on eventId and groupState.ActionName, which the mocks currently drop.

Proposed fix (imports + mock signatures)
 import (
   "context"
   "errors"
   "fmt"
+  annotation2 "github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation"
   "github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
   corev1 "k8s.io/api/core/v1"
   ctrl "sigs.k8s.io/controller-runtime"
   "testing"
   "time"
   ...
 )

 type MockK8sClient struct {
   createMaintenanceResourceFn func(ctx context.Context, healthEventData *events.HealthEventData) (string, error)
-  runLogCollectorJobFn        func(ctx context.Context, nodeName string) (ctrl.Result, error)
+  runLogCollectorJobFn        func(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
   annotationManagerOverride   annotation2.NodeAnnotationManagerInterface
   realStatusChecker           crstatus.CRStatusCheckerInterface
 }

 func (m *MockK8sClient) RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error) {
-  return m.runLogCollectorJobFn(ctx, nodeName)
+  return m.runLogCollectorJobFn(ctx, nodeName, eventId)
 }

 func (m *MockNodeAnnotationManager) GetRemediationState(ctx context.Context, nodeName string) (*annotation2.RemediationStateAnnotation, *corev1.Node, error) {
   ...
   return &annotation2.RemediationStateAnnotation{
     EquivalenceGroups: map[string]annotation2.EquivalenceGroupState{
       "restart": {
         MaintenanceCR: m.existingCR,
         CreatedAt:     time.Now(),
+        ActionName:    protos.RecommendedAction_RESTART_BM.String(),
       },
     },
   }, nil, nil
 }

Also applies to: 40-66, 115-135


461-504: Test comment contradicts expectation (label update failures): update wording or assertion.
The comment says remediation "should still succeed", but the test asserts err != nil. Pick one to match intended behavior.

Proposed fix (comment only)
- // Even with label update errors, remediation should still succeed
+ // Label update errors should be surfaced to trigger a retry (even if CR creation succeeded).
  _, err := r.performRemediation(ctx, healthEventDoc)
  assert.Error(t, err)
fault-remediation/pkg/reconciler/reconciler.go (3)

17-43: Compile blocker: missing annotation2 / remediation2 imports (types referenced).
ReconcilerConfig.RemediationClient remediation2.FaultRemediationClientInterface and annotationManager annotation2.NodeAnnotationManagerInterface require imports that aren't present in the shown import block.

Proposed fix (imports)
 import (
   "context"
   "errors"
   "fmt"
   "log/slog"
   "strings"
   "time"

+  annotation2 "github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation"
   "github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
   "github.com/nvidia/nvsentinel/fault-remediation/pkg/metrics"
+  remediation2 "github.com/nvidia/nvsentinel/fault-remediation/pkg/remediation"
   "k8s.io/client-go/util/workqueue"
   ctrl "sigs.k8s.io/controller-runtime"
   ...
 )

Also applies to: 45-67, 69-85


245-273: Don't use context.Background() for MarkProcessed (can hang shutdown); prefer ctx or a short timeout.
This aligns with the repo's shutdown/cancellation guidance.

Proposed fix (use ctx with timeout)
 func (r *FaultRemediationReconciler) handleCancellationEvent(...)(ctrl.Result, error) {
   ...
-  if err := watcherInstance.MarkProcessed(context.Background(), resumeToken); err != nil {
+  markCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
+  defer cancel()
+  if err := watcherInstance.MarkProcessed(markCtx, resumeToken); err != nil {
     metrics.ProcessingErrors.WithLabelValues("mark_processed_error", nodeName).Inc()
     ...
   }
   ...
 }

 func (r *FaultRemediationReconciler) parseHealthEvent(...)(events.HealthEventDoc, error) {
   ...
-  if markErr := watcherInstance.MarkProcessed(context.Background(), eventWithToken.ResumeToken); markErr != nil {
+  markCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+  defer cancel()
+  if markErr := watcherInstance.MarkProcessed(markCtx, eventWithToken.ResumeToken); markErr != nil {
     ...
   }
   ...
 }

(If you want cancellation to abort marking, use ctx instead of context.Background() in parseHealthEvent too.)

Also applies to: 459-511
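
The effect of bounding MarkProcessed with a timeout can be seen in a small standalone sketch. It assumes a hypothetical marker interface and a simulated slowMarker in place of the real watcher; only the context.WithTimeout pattern is taken from the proposed fix.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// marker is a hypothetical stand-in for the watcher's MarkProcessed method.
type marker interface {
	MarkProcessed(ctx context.Context, token string) error
}

// slowMarker simulates a datastore call that takes `delay` to finish but
// honors context cancellation.
type slowMarker struct{ delay time.Duration }

func (s slowMarker) MarkProcessed(ctx context.Context, token string) error {
	select {
	case <-time.After(s.delay):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// markWithTimeout bounds the call so that a stuck datastore cannot hold up
// shutdown indefinitely.
func markWithTimeout(parent context.Context, m marker, token string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	if err := m.MarkProcessed(ctx, token); err != nil {
		return fmt.Errorf("marking event %q processed: %w", token, err)
	}
	return nil
}

// timedOut reports whether err was caused by the deadline expiring.
func timedOut(err error) bool { return errors.Is(err, context.DeadlineExceeded) }

// demo runs a fast and a slow marker against the same 200ms budget.
func demo() (fastErr, slowErr error) {
	ctx := context.Background()
	fastErr = markWithTimeout(ctx, slowMarker{delay: time.Millisecond}, "t1", 200*time.Millisecond)
	slowErr = markWithTimeout(ctx, slowMarker{delay: 5 * time.Second}, "t2", 200*time.Millisecond)
	return fastErr, slowErr
}

func main() {
	fastErr, slowErr := demo()
	fmt.Println("fast:", fastErr)
	fmt.Println("slow timed out:", timedOut(slowErr))
}
```

Passing the reconciler's ctx as the parent additionally aborts the call on shutdown; passing context.Background() keeps the write alive for up to the timeout even after cancellation.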


399-457: Hardening: guard GetConfig() and annotationManager nils in checkExistingCRStatus.
A nil tomlConfig (or nil annotationManager) will panic. Returning "allow creation" is typically safer in those cases.

Proposed fix (nil guards)
 tomlConfig := r.Config.RemediationClient.GetConfig()
+if tomlConfig == nil {
+  slog.Warn("Remediation config is nil; allowing creation", "node", nodeName, "action", actionName)
+  return true, "", nil
+}
+if r.annotationManager == nil {
+  slog.Warn("Annotation manager is nil; allowing creation", "node", nodeName)
+  return true, "", nil
+}
🤖 Fix all issues with AI agents
In @fault-remediation/main.go:
- Line 111: Replace the TODO comment "//TODO: setup informers for node and job"
with a TODO that references the tracking GitHub issue (e.g., "//
TODO(#<issue-number>): setup informers for node and job") so the work is
traceable; update the exact comment string in fault-remediation/main.go to
include the issue number and a short owner or milestone if applicable.
- Around line 147-151: Ensure cleanup runs even on early returns by registering
the CloseAll defer as soon as a non-nil reconciler exists or by explicitly
closing on error: after calling InitializeAll and receiving (reconciler, err),
if err != nil and reconciler != nil call reconciler.CloseAll(ctx) before
returning; otherwise, if reconciler is non-nil and initialization succeeded,
then immediately defer reconciler.CloseAll(ctx). This guarantees
reconciler.CloseAll is invoked for partially-initialized state while still
deferring cleanup for the successful path.
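
The cleanup ordering described for main.go lines 147-151 can be sketched as follows. The resources type and initializeAll function are hypothetical stand-ins for the reconciler and InitializeAll; the real CloseAll takes a context, which is omitted here for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// resources is a hypothetical stand-in for the reconciler returned by
// InitializeAll.
type resources struct{ closed bool }

// CloseAll releases whatever was opened during initialization.
func (r *resources) CloseAll() { r.closed = true }

// initializeAll simulates an initializer that may return a partially
// initialized value together with an error.
func initializeAll(fail bool) (*resources, error) {
	r := &resources{}
	if fail {
		return r, errors.New("datastore connection failed")
	}
	return r, nil
}

// run closes partially initialized state on the error path and defers
// cleanup on the success path. It returns the resources only so the caller
// can inspect them in this sketch.
func run(fail bool) (*resources, error) {
	r, err := initializeAll(fail)
	if err != nil {
		if r != nil {
			r.CloseAll()
		}
		return r, fmt.Errorf("initializing components: %w", err)
	}
	defer r.CloseAll()
	// ... start the manager and block until shutdown ...
	return r, nil
}

func main() {
	r, err := run(true)
	fmt.Println("error path:", err, "closed:", r.closed)
	r, err = run(false)
	fmt.Println("success path:", err, "closed:", r.closed)
}
```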

In @fault-remediation/pkg/annotation/annotation_interface.go:
- Line 1: Add a package-level godoc comment for the package "annotation" at the
top of annotation_interface.go (and mirror in annotation.go) describing the
package purpose and public API; insert a concise paragraph comment immediately
above the "package annotation" declaration that explains what the annotation
package provides, typical usage, and any important behavior or invariants so it
satisfies Go's package-level documentation guideline.

In @fault-remediation/pkg/annotation/annotation.go:
- Line 40: Replace the bare TODO in annotation.go with a TODO that references a
tracking issue (e.g., "TODO(#1234): split return of node and state" or a full
issue URL) so it complies with guidelines; update the comment near the package
annotation (or the function that returns both node and state) to reference the
created issue ID/URL or remove the TODO if you instead implement the change now.
- Line 1: Add a package-level godoc comment immediately above the "package
annotation" declaration in annotation.go: write a concise, complete-sentence
comment that begins with "Package annotation ..." and describes the package's
purpose, responsibilities, and intended usage (mention exported types or
functions briefly if applicable) following Go documentation conventions.

In @fault-remediation/pkg/crstatus/checker.go:
- Around line 34-44: Add a godoc comment for the exported constructor
NewCRStatusChecker that explains its purpose (constructs and returns a
*CRStatusChecker), describes the parameters (client client.Client,
remediationActions map[string]config.MaintenanceResource, dryRun bool) and
states the return value (*CRStatusChecker); place the comment immediately above
the NewCRStatusChecker function and follow Go doc conventions (start the comment
with "NewCRStatusChecker" and keep it concise).

In @fault-remediation/pkg/crstatus/crstatus_interface.go:
- Around line 9-11: The exported interface CRStatusCheckerInterface and its
method ShouldSkipCRCreation lack godoc and parameter names; add a package-level
or type-level comment for CRStatusCheckerInterface and a method comment for
ShouldSkipCRCreation describing purpose and behavior, and update the method
signature to include parameter names matching the implementation (e.g., ctx
context.Context, actionName string, crName string) so callers and linters have
clear documentation and readable parameters.

In @fault-remediation/pkg/events/health_event.go:
- Around line 1-3: Add a package-level godoc comment immediately above "package
events" in health_event.go that briefly describes this package's purpose, e.g.,
that package events defines event data structures for health/monitoring events
and documents the different serialization tags (JSON, BSON, etc.) used on
those types, so the package has a clear, guideline-compliant description.

In @fault-remediation/pkg/initializer/init.go:
- Around line 90-93: The error returned from kubernetes.NewForConfig in init.go
is returned raw; wrap it with context per guidelines. Replace the direct return
of err after calling kubernetes.NewForConfig(params.config) so that the function
returns fmt.Errorf("creating kube client with provided config: %w", err) (or
similar contextual message), ensuring you import fmt if not already and keep the
kubeClient, err := kubernetes.NewForConfig(params.config) assignment intact.

In @fault-remediation/pkg/metrics/metrics.go:
- Line 30: Replace the bare TODO comment in metrics.go ("//TODO: evaluate and
remove redundant metrics with ctrl-runtime defaults") with a TODO that
references a tracking GitHub issue (e.g., "// TODO(#<issue-number>): evaluate
and remove redundant metrics with ctrl-runtime defaults") so the task is
tracked; update the comment text to include the issue number and, if helpful, a
short owner or ETA tag.

In @fault-remediation/pkg/reconciler/reconciler_e2e_test.go:
- Around line 299-327: The test uses a relative "./templates" path which breaks
when go test runs from a different working directory; in
createTestRemediationClient resolve the Template.MountPath to an absolute path
anchored to the test file by using runtime.Caller to get the test file directory
and filepath.Join to build the templates path, then set
remediationConfig.Template.MountPath to that resolved path before calling
remediation2.NewRemediationClient; update any imports accordingly (runtime,
path/filepath).
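
A minimal sketch of the runtime.Caller approach named in the item above, using only the standard library. The templatesDir helper is a hypothetical name; the directory name "templates" comes from the relative path in the test.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"runtime"
)

// templatesDir returns the path of a "templates" directory located next to
// this source file, independent of the process working directory.
// (Hypothetical helper; builds using -trimpath rewrite the reported file
// path, so this is intended for tests run from a source checkout.)
func templatesDir() (string, error) {
	_, thisFile, _, ok := runtime.Caller(0)
	if !ok {
		return "", errors.New("unable to determine the calling source file")
	}
	return filepath.Join(filepath.Dir(thisFile), "templates"), nil
}

func main() {
	dir, err := templatesDir()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("templates directory:", dir)
}
```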

In @fault-remediation/pkg/reconciler/reconciler.go:
- Around line 193-243: The bug is that performRemediation treats any
post-CR-step error (like a node-label update failure) as overall remediation
failure, which can set FaultRemediated=false even though
CreateMaintenanceResource succeeded; change the error flow in performRemediation
so CR creation success is honored: call
RemediationClient.CreateMaintenanceResource and capture
createMaintenanceResourceError, then attempt UpdateNVSentinelStateNodeLabel; if
the label update fails but createMaintenanceResourceError is nil, log/emit
metrics but return the created crName and nil error (do not join/return the
label error); only return an error (or joined errors) when
CreateMaintenanceResource itself failed (or both creation and label update
failed), referencing the symbols performRemediation, CreateMaintenanceResource,
createMaintenanceResourceError, UpdateNVSentinelStateNodeLabel,
remediationLabelValue and ensure the caller's FaultRemediated logic can rely on
performRemediationErr==nil meaning the CR was created.

In @fault-remediation/pkg/remediation/fault_remediation_client_interface.go:
- Around line 28-34: Add a godoc comment for the exported interface
FaultRemediationClientInterface: write a concise sentence starting with
"FaultRemediationClientInterface ..." that describes the interface's purpose and
behavior, and optionally a brief phrase listing what its key methods do
(CreateMaintenanceResource, RunLogCollectorJob, GetAnnotationManager,
GetStatusChecker, GetConfig). Place the comment immediately above the type
declaration so it satisfies Go export documentation guidelines.
- Line 31: The interface refers to annotation2.NodeAnnotationManagerInterface
but the import alias is missing; add the import alias declaration annotation2
"github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation" to the file's
import block so the GetAnnotationManager() return type resolves correctly (this
aligns with usage in GetAnnotationManager and other files like remediation.go
and reconciler.go).

In @fault-remediation/pkg/remediation/remediation.go:
- Line 462: Replace the bare "// todo" comment next to the nolint directive with
a TODO that references an existing issue or ticket (e.g., change
"//nolint:nestif // todo" to something like "//nolint:nestif // TODO: track ref
#<issue-number> or https://.../issues/<id>"), so update the comment in
remediation.go where the "//nolint:nestif // todo" appears to include the issue
identifier or URL.
- Line 148: Replace the bare "// nolint: cyclop // todo" comment with a TODO
that references a concrete issue/ID and a short rationale so it complies with
coding guidelines; for example change the "nolint: cyclop" comment to include
"TODO: <issue-number> - <brief reason>" (keeping the nolint directive), ensuring
the comment associated with the nolint: cyclop directive documents the issue
tracker ID and why the cyclomatic complexity exemption is needed.
- Line 500: Replace the bare "//nolint:nestif // todo" comment with a TODO that
references the tracking issue or ticket (e.g., "TODO: link to ISSUE-NNNN" or
include the issue URL) so it complies with the Go coding guidelines; update the
nolint line (the "//nolint:nestif" comment adjacent to the remediation
function/block) to include a clear TODO with the issue identifier and optionally
a short rationale so future maintainers can find the related discussion.
- Line 553: Replace the bare "//nolint:nestif // todo" comment with a TODO that
references a real issue ID or URL (for example "//nolint:nestif // TODO: issue
#12345" or "//nolint:nestif // TODO: https://tracker/PROJECT/issues/12345") so
the intent and follow-up are traceable; if no issue exists, create one
describing why the nested-if needs refactoring and reference that new issue in
the TODO comment where the current "//nolint:nestif // todo" appears in
remediation.go.
- Line 1: The package lacks a package-level godoc comment: add a concise package
comment immediately above the "package remediation" declaration that describes
the purpose and responsibilities of the remediation package (what it does, key
abstractions or intended use), so tools like godoc and linters will pick it up;
ensure the comment starts with "Package remediation" to follow Go documentation
conventions.
🟑 Minor comments (7)
fault-remediation/pkg/annotation/annotation_test.go-160-167 (1)

160-167: Assert error from ClearRemediationState.

The test should verify that ClearRemediationState succeeded before checking the annotation was removed.

🔧 Proposed fix
 	err := annotationManager.ClearRemediationState(context.TODO(), nodeName)
+	assert.NoError(t, err)
 
 	err = client.Get(context.TODO(), types.NamespacedName{
fault-remediation/pkg/annotation/annotation_test.go-199-201 (1)

199-201: Add error assertion before using state.

Similar to TestUpdateRemediationState, verify that GetRemediationState succeeded before accessing the result.

🔧 Proposed fix
 	state, _, err := annotationManager.GetRemediationState(context.TODO(), nodeName)
+	assert.NoError(t, err)
 	assert.NotContains(t, state.EquivalenceGroups, removedGroup)
 	assert.Contains(t, state.EquivalenceGroups, notRemovedGroup)
fault-remediation/pkg/annotation/annotation_test.go-140-143 (1)

140-143: Add error assertion before using state.

The test should verify that GetRemediationState succeeded before accessing state.EquivalenceGroups to avoid potential nil pointer dereference.

🔧 Proposed fix
 	state, _, err := annotationManager.GetRemediationState(context.TODO(), nodeName)
+	assert.NoError(t, err)
 	assert.Contains(t, state.EquivalenceGroups, group)
 	assert.Equal(t, crName, state.EquivalenceGroups[group].MaintenanceCR)
 	assert.Equal(t, actionName, state.EquivalenceGroups[group].ActionName)
fault-remediation/pkg/annotation/annotation.go-50-57 (1)

50-57: Unmarshal failure silently returns empty state.

When JSON unmarshal fails (line 51), the error is logged but not returned, causing the caller to receive an empty state instead of being notified of the corruption. This could mask data integrity issues in the annotation.

Consider returning the unmarshal error to the caller so they can handle corrupted annotations appropriately.

🔧 Proposed fix
 	var state RemediationStateAnnotation
 	if err = json.Unmarshal([]byte(annotationValue), &state); err != nil {
-		slog.Error("Failed to unmarshal annotation", "node", nodeName, "error", err)
-		// Return empty state if unmarshal fails
-		return &RemediationStateAnnotation{
-			EquivalenceGroups: make(map[string]EquivalenceGroupState),
-		}, node, nil
+		return nil, node, fmt.Errorf("failed to unmarshal remediation state annotation: %w", err)
 	}
fault-remediation/pkg/annotation/annotation.go-66-73 (1)

66-73: Dead code: state assignment followed by immediate error return.

Lines 69-71 assign an empty state, but line 72 immediately returns the error, so the assigned value is never used. Either remove the assignment and return early, or remove the return to use the empty state as a fallback.

🔧 Proposed fix (option 1: return early)
 	// Get current state
 	state, node, err := m.GetRemediationState(ctx, nodeName)
 	if err != nil {
-		slog.Warn("Failed to get current remediation state", "node", nodeName, "error", err)
-		state = &RemediationStateAnnotation{
-			EquivalenceGroups: make(map[string]EquivalenceGroupState),
-		}
-		return err
+		return fmt.Errorf("failed to get current remediation state: %w", err)
 	}
🔧 Proposed fix (option 2: use fallback and continue)
 	// Get current state
 	state, node, err := m.GetRemediationState(ctx, nodeName)
 	if err != nil {
 		slog.Warn("Failed to get current remediation state", "node", nodeName, "error", err)
 		state = &RemediationStateAnnotation{
 			EquivalenceGroups: make(map[string]EquivalenceGroupState),
 		}
-		return err
 	}
fault-remediation/pkg/reconciler/reconciler.go-275-277 (1)

275-277: Replace // nolint: cyclop // todo with an issue reference (or remove if no longer needed).
As per coding guidelines, TODOs should reference issues.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go-368-370 (1)

368-370: TODOs should be tracked (and β€œignore error” should be justified) to avoid hiding regressions.
Repo guideline: TODOs should reference issues. As per coding guidelines, consider turning these into a small helper that asserts the expected state transitions, or at least link an issue ID.

Also applies to: 430-432, 488-491, 519-521, 571-574

🧹 Nitpick comments (15)
fault-remediation/pkg/events/health_event.go (1)

5-15: Consider enhancing type documentation.

While basic comments exist for the exported types, they could be more descriptive about the intended use cases and the difference between the two types.

Enhanced documentation example
-// HealthEventDoc represents health event data with JSON "_id" tag for document-based storage.
+// HealthEventDoc represents a health event suitable for JSON serialization.
+// The ID field uses the json:"_id" tag for document-based storage systems.
 type HealthEventDoc struct {
 	ID                          string `json:"_id"`
 	model.HealthEventWithStatus `json:",inline"`
 }
 
-// HealthEventData represents health event data with string ID for compatibility
+// HealthEventData represents a health event suitable for BSON serialization.
+// The ID field uses the bson:"_id,omitempty" tag for MongoDB compatibility.
 type HealthEventData struct {
 	ID                          string `bson:"_id,omitempty"`
 	model.HealthEventWithStatus `bson:",inline"`
 }
fault-remediation/pkg/initializer/init.go (1)

96-96: Clarify log message.

The log message "Successfully initialized client" is vague given that multiple clients are initialized (remediationClient, kubeClient). Consider being more specific about which clients were initialized or using a more general message like "Successfully initialized Kubernetes clients".

fault-remediation/pkg/remediation/remediation_test.go (1)

29-97: Consider using standard testing package assertions for simple checks.

Per repository learnings, avoid introducing testify dependency for simple equality/inequality checks. Use t.Error, t.Errorf, t.Fatal for straightforward assertions like nil checks and equality comparisons.

Based on learnings, the repository prefers standard testing package assertions for simple scenarios.

fault-remediation/pkg/remediation/remediation.go (7)

216-224: Wrap error with context before returning.

Line 222 returns the error directly without wrapping it with context. This makes debugging harder as the error loses the context of where it occurred.

🔧 Proposed fix
 	err = c.client.Create(ctx, maintenance)
 	if err != nil {
 		if apierrors.IsAlreadyExists(err) {
 			slog.Info("Maintenance CR already exists for node, treating as success", "CR",
 				crName, "node", healthEvent.NodeName)
 		} else {
-			return "", err
+			return "", fmt.Errorf("failed to create maintenance CR: %w", err)
 		}
 	}

As per coding guidelines, wrap errors with context using fmt.Errorf("context: %w", err).
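
The wrapping guideline can be illustrated with the standard library alone. In this sketch newClient is a hypothetical stand-in for a failing constructor and errBadConfig is an invented sentinel; the point is that %w adds context while errors.Is still finds the original cause.

```go
package main

import (
	"errors"
	"fmt"
)

// errBadConfig is an illustrative sentinel error, not one defined by the PR.
var errBadConfig = errors.New("invalid rest config")

// newClient is a hypothetical constructor that fails for invalid input.
func newClient(valid bool) error {
	if !valid {
		return errBadConfig
	}
	return nil
}

// initClient wraps the low-level error with context using %w so callers
// see where the failure happened and can still match the cause.
func initClient(valid bool) error {
	if err := newClient(valid); err != nil {
		return fmt.Errorf("creating kube client with provided config: %w", err)
	}
	return nil
}

// causedByBadConfig shows that wrapping preserves the error chain.
func causedByBadConfig(err error) bool { return errors.Is(err, errBadConfig) }

func main() {
	err := initClient(false)
	fmt.Println(err)
	fmt.Println("caused by bad config:", causedByBadConfig(err))
}
```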


232-241: Wrap error with context before returning.

Line 239 returns the error from UpdateRemediationState without additional context. Consider wrapping it to indicate the failure occurred during annotation update after CR creation.

🔧 Proposed fix
 	group := common.GetRemediationGroupForAction(healthEvent.RecommendedAction)
 	if group != "" && c.annotationManager != nil {
 		if err = c.annotationManager.UpdateRemediationState(ctx, healthEvent.NodeName,
 			group, actualCRName, recommendedActionName); err != nil {
 			slog.Warn("Failed to update node annotation", "node", healthEvent.NodeName,
 				"error", err)
 
-			return "", err
+			return "", fmt.Errorf("failed to update node annotation after CR creation: %w", err)
 		}
 	}

As per coding guidelines, wrap errors with context using fmt.Errorf("context: %w", err).


392-400: Wrap error with context before returning.

Line 399 returns the error from c.client.List without wrapping it with context. This makes debugging harder.

🔧 Proposed fix
 	err = c.client.List(
 		ctx,
 		existingJobs,
 		client.MatchingLabels(labels),
 		client.InNamespace(job.GetNamespace()),
 	)
 	if err != nil {
-		return batchv1.Job{}, ctrl.Result{}, err
+		return batchv1.Job{}, ctrl.Result{}, fmt.Errorf("failed to list existing log collector jobs: %w", err)
 	}

As per coding guidelines, wrap errors with context using fmt.Errorf("context: %w", err).


410-417: Wrap error with context before returning.

Line 413 returns the error from c.client.Create without wrapping it with context.

🔧 Proposed fix
 	if len(existingJobs.Items) == 0 {
 		err = c.client.Create(ctx, job)
 		if err != nil {
-			return batchv1.Job{}, ctrl.Result{}, err
+			return batchv1.Job{}, ctrl.Result{}, fmt.Errorf("failed to create log collector job: %w", err)
 		}
 		// if created, requeue to check status later
 		return batchv1.Job{}, ctrl.Result{RequeueAfter: 10 * time.Second}, nil
 	}

As per coding guidelines, wrap errors with context using fmt.Errorf("context: %w", err).


467-478: Wrap error with context before returning.

Line 476 returns the error from c.client.Update without wrapping it with context. This error is returned from within the metrics-recording logic for completed jobs.

🔧 Proposed fix
 			err := c.client.Update(ctx, updateJob)
 			if err != nil {
-				return false, err
+				return false, fmt.Errorf("failed to update job annotation for metrics tracking: %w", err)
 			}

As per coding guidelines, wrap errors with context using fmt.Errorf("context: %w", err).


505-516: Wrap error with context before returning.

Line 514 returns the error from c.client.Update without wrapping it with context. This error is returned from within the metrics-recording logic for failed jobs.

🔧 Proposed fix
 			err := c.client.Update(ctx, updateJob)
 			if err != nil {
-				return false, err
+				return false, fmt.Errorf("failed to update job annotation for metrics tracking: %w", err)
 			}

As per coding guidelines, wrap errors with context using fmt.Errorf("context: %w", err).


557-568: Wrap error with context before returning.

Line 565 returns the error from c.client.Update without wrapping it with context. This error is returned from within the metrics-recording logic for timed-out jobs.

🔧 Proposed fix
 			err := c.client.Update(ctx, updateJob)
 			if err != nil {
-				return false, err
+				return false, fmt.Errorf("failed to update job annotation for metrics tracking: %w", err)
 			}

As per coding guidelines, wrap errors with context using fmt.Errorf("context: %w", err).

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

1324-1340: Annotation cleanup may need conflict retry (optional).
If the reconciler updates the same Node concurrently, Nodes().Update(...) can conflict; consider retry.RetryOnConflict around the update even in tests to reduce flakes.
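
The pattern behind retry.RetryOnConflict is a read-modify-write loop that re-reads the object whenever the write is rejected as stale. The sketch below imitates it with the standard library only: store, errConflict, and retryOnConflict are hypothetical stand-ins for the API server, a conflict error, and the client-go helper, and the real helper also applies backoff between attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's conflict error.
var errConflict = errors.New("conflict: object was modified")

// store is a hypothetical in-memory object with optimistic concurrency.
type store struct {
	version     int
	annotations map[string]string
	interfere   int // number of simulated concurrent writes still to happen
}

// get returns the current version and a copy of the annotations.
func (s *store) get() (int, map[string]string) {
	cp := make(map[string]string, len(s.annotations))
	for k, v := range s.annotations {
		cp[k] = v
	}
	return s.version, cp
}

// update succeeds only if the caller read the latest version.
func (s *store) update(version int, ann map[string]string) error {
	if s.interfere > 0 { // simulate another writer getting in first
		s.interfere--
		s.version++
	}
	if version != s.version {
		return errConflict
	}
	s.annotations = ann
	s.version++
	return nil
}

// retryOnConflict reruns fn while it reports a conflict, up to attempts times.
func retryOnConflict(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

// removeAnnotation re-reads the object on every attempt before writing.
func removeAnnotation(s *store, key string, attempts int) error {
	return retryOnConflict(attempts, func() error {
		v, ann := s.get()
		delete(ann, key)
		return s.update(v, ann)
	})
}

func main() {
	s := &store{annotations: map[string]string{"state": "x", "keep": "y"}, interfere: 2}
	fmt.Println("error:", removeAnnotation(s, "state", 5))
	_, ann := s.get()
	fmt.Println("annotations:", ann)
}
```

The essential detail is that the Get happens inside the retried function; retrying only the Update with a stale object would conflict again every time.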

fault-remediation/pkg/reconciler/reconciler_test.go (3)

575-612: Test likely no longer matches production behavior: log collector on NONE.
FaultRemediationReconciler.runLogCollector(...) returns early on RecommendedAction_NONE, so this test's "Start loop behavior" simulation is misleading. Consider removing it or rewriting it to assert log collector is not invoked for NONE.


614-679: Good coverage for log-collector gating/error paths; consider asserting eventId propagation.
Once the mock accepts eventId, these tests can validate it's passed through (regression-proofing the retry/dedup logic).

Also applies to: 681-705, 707-743
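
One way to verify propagation is a test double that records its arguments. The sketch below is deliberately simplified: recordingClient and runLogCollector are hypothetical, and the real mock also takes a context.Context and returns ctrl.Result.

```go
package main

import "fmt"

// recordedCall captures the arguments of one RunLogCollectorJob invocation.
type recordedCall struct {
	nodeName string
	eventID  string
}

// recordingClient is a hypothetical test double that remembers every call.
type recordingClient struct {
	calls []recordedCall
}

// RunLogCollectorJob records its arguments and reports success.
func (c *recordingClient) RunLogCollectorJob(nodeName, eventID string) error {
	c.calls = append(c.calls, recordedCall{nodeName: nodeName, eventID: eventID})
	return nil
}

// runLogCollector stands in for the code under test, which must forward the
// event ID unchanged.
func runLogCollector(c *recordingClient, nodeName, eventID string) error {
	return c.RunLogCollectorJob(nodeName, eventID)
}

func main() {
	c := &recordingClient{}
	if err := runLogCollector(c, "node-1", "event-42"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", c.calls)
}
```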


821-884: Dedup tests don't exercise status-checker path (fields like crSucceeded currently unused).
Consider adding a realStatusChecker mock and asserting ShouldSkipCRCreation(...) is called with the stored ActionName + CR name so the new cross-action semantics are actually verified.

Also applies to: 886-957

fault-remediation/pkg/reconciler/reconciler.go (1)

569-597: Optional: consider buffering out in AdaptEvents to reduce backpressure.
With an unbuffered out, a slow controller can block the goroutine and indirectly block the watcher read loop.
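
A minimal sketch of the buffering idea, assuming string events and a hypothetical adaptEvents signature (the PR's AdaptEvents has different types). The buffer lets the adapter goroutine keep draining its input during short consumer stalls.

```go
package main

import "fmt"

// adaptEvents converts raw events into adapted ones and delivers them on a
// channel with the given buffer size. With buffer > 0 the goroutine can get
// up to `buffer` events ahead of a slow consumer before it blocks.
func adaptEvents(in <-chan string, buffer int) <-chan string {
	out := make(chan string, buffer)
	go func() {
		defer close(out) // signal consumers once the input is exhausted
		for ev := range in {
			out <- "adapted:" + ev
		}
	}()
	return out
}

func main() {
	in := make(chan string)
	out := adaptEvents(in, 8)

	// Producer: plays the role of the watcher read loop.
	go func() {
		defer close(in)
		for _, ev := range []string{"a", "b", "c"} {
			in <- ev
		}
	}()

	// Consumer: plays the role of the controller.
	for ev := range out {
		fmt.Println(ev)
	}
}
```

A buffer only absorbs bursts up to its capacity; a consumer that is persistently slower than the producer will still exert backpressure once the buffer fills.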

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6806866 and 0a80647.

⛔ Files ignored due to path filters (1)
  • commons/go.sum is excluded by !**/*.sum
📒 Files selected for processing (29)
  • .gitignore
  • commons/go.mod
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/annotation.go
  • fault-remediation/pkg/reconciler/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/remediation.go
  • fault-remediation/pkg/reconciler/remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
💤 Files with no reviewable changes (8)
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • fault-remediation/pkg/reconciler/remediation_test.go
  • fault-remediation/pkg/reconciler/annotation_test.go
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-remediation/pkg/reconciler/remediation.go
  • fault-remediation/pkg/reconciler/annotation.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • commons/go.mod
  • .gitignore
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
🧠 Learnings (17)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/events/health_event.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/remediation.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_test.go
🧬 Code graph analysis (10)
fault-remediation/pkg/annotation/annotation_test.go (2)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (23-25)
  • AnnotationKey (11-11)
  • EquivalenceGroupState (28-34)
fault-remediation/pkg/annotation/annotation.go (1)
  • NodeAnnotationManager (16-18)
fault-remediation/pkg/remediation/remediation_test.go (1)
fault-remediation/pkg/remediation/remediation.go (2)
  • NewRemediationClient (55-113)
  • FaultRemediationClient (42-53)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (12-15)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (15-20)
fault-remediation/pkg/crstatus/crstatus_test.go (1)
fault-remediation/pkg/crstatus/checker.go (1)
  • NewCRStatusChecker (34-44)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (7)
health-monitors/kubernetes-object-monitor/pkg/cel/environment.go (1)
  • Environment (30-35)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (41-43)
fault-remediation/pkg/config/config.go (3)
  • TomlConfig (59-71)
  • Template (47-50)
  • MaintenanceResource (27-44)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewRemediationClient (55-113)
commons/pkg/statemanager/statemanager.go (3)
  • NewStateManager (206-210)
  • StateManager (197-200)
  • RemediatingLabelValue (171-171)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventDoc (6-9)
fault-remediation/pkg/metrics/metrics.go (4)
  • TotalEventsReceived (33-38)
  • EventHandlingDuration (62-68)
  • EventsProcessed (39-45)
  • ProcessingErrors (46-52)
fault-remediation/pkg/initializer/init.go (2)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewRemediationClient (55-113)
commons/pkg/statemanager/statemanager.go (2)
  • NewStateManager (206-210)
  • StateManager (197-200)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (27-44)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (23-25)
  • AnnotationKey (11-11)
  • EquivalenceGroupState (28-34)
fault-remediation/pkg/reconciler/reconciler_test.go (2)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventData (12-15)
  • HealthEventDoc (6-9)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • NodeAnnotationManagerInterface (15-20)
  • RemediationStateAnnotation (23-25)
  • EquivalenceGroupState (28-34)
fault-remediation/pkg/reconciler/reconciler.go (5)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • FaultRemediationClientInterface (28-34)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (15-20)
fault-remediation/pkg/metrics/metrics.go (3)
  • EventHandlingDuration (62-68)
  • TotalEventsReceived (33-38)
  • ProcessingErrors (46-52)
commons/pkg/statemanager/statemanager.go (1)
  • StateManager (197-200)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventDoc (6-9)
  • HealthEventData (12-15)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (12)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)

1-20: LGTM! Template structure is correct.

The YAMLlint syntax error is a false positive: Go template syntax (e.g., {{.ApiGroup}}) is not valid YAML until rendered. The template correctly defines a RebootNode CR with appropriate placeholders for dynamic substitution.
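
A small sketch of why a linter sees the template differently from its rendered output. The template body, the `templateData` type, and the `Version`/`NodeName` fields below are assumptions made for illustration, not the contents of rebootnode-template.yaml; only `{{.ApiGroup}}` appears in the review.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// crTemplate is an illustrative template, not the repository's actual file.
// The raw text contains {{...}} actions, which a YAML linter may reject.
const crTemplate = `apiVersion: {{.ApiGroup}}/{{.Version}}
kind: RebootNode
metadata:
  name: reboot-{{.NodeName}}
`

// templateData is a hypothetical set of substitution values.
type templateData struct {
	ApiGroup string
	Version  string
	NodeName string
}

// render parses the template and substitutes the placeholders, producing
// plain YAML text with no template actions left in it.
func render(data templateData) (string, error) {
	tmpl, err := template.New("maintenance").Parse(crTemplate)
	if err != nil {
		return "", fmt.Errorf("parse template: %w", err)
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("render template: %w", err)
	}
	return buf.String(), nil
}

func main() {
	out, err := render(templateData{
		ApiGroup: "janitor.dgxc.nvidia.com",
		Version:  "v1alpha1",
		NodeName: "node-1",
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Linting the rendered output rather than the raw template file is one way to avoid this class of false positive.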

fault-remediation/pkg/crstatus/crstatus_test.go (1)

34-34: Constructor update aligns with controller-runtime migration.

The constructor call correctly reflects the new signature in checker.go, which now accepts a single controller-runtime client instead of separate dynamic client and REST mapper parameters.

distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)

96-111: LGTM! Probe configuration correctly simplified for unconditional controller-runtime usage.

The changes remove conditional logic from liveness and readiness probes, fixing them to use the "health" port and "/readyz" path. This aligns with the broader migration to exclusively use controller-runtime initialization.

fault-remediation/pkg/annotation/annotation_test.go (1)

15-120: Well-structured table-driven test.

The test provides comprehensive coverage for GetRemediationState, including error cases (node not found), missing annotations, invalid JSON, and valid state parsing. Good use of table-driven testing pattern.

fault-remediation/pkg/metrics/metrics.go (1)

29-93: LGTM: Metrics extraction and export.

The extraction of metrics into a dedicated package with exported identifiers is a clean refactoring that improves modularity. The metric definitions (names, labels, help text) are preserved correctly.

fault-remediation/main.go (1)

95-98: The config field in InitializationParams does not need to be exported. The field is only used within the initializer package itself (at init.go:90), and it is not being set from outside the package in main.go. The field is left as a zero value (nil) in the instantiation at main.go:89-93, so exporting it would not be necessary.

Likely an incorrect or invalid review comment.

fault-remediation/pkg/initializer/init.go (1)

81-94: Initialization flow with controller-runtime is correct.

Verification confirms that downstream code properly uses the controller-runtime client for CR operations (annotation management, status checking) through the RemediationClient, and the Kubernetes clientset for node label updates through StateManager. Both clients serve their intended purposes without mixing concerns.

fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)

36-49: LGTM!

The TemplateData struct is well-documented with clear section comments explaining the purpose of each field group.

fault-remediation/pkg/remediation/remediation.go (3)

55-113: LGTM!

The constructor properly validates configuration, pre-loads templates, and returns errors to the caller. Error wrapping follows Go best practices using fmt.Errorf with %w.


115-138: LGTM!

Template loading includes proper error handling with context wrapping using fmt.Errorf with %w, following Go best practices.


338-355: LGTM!

The function properly orchestrates the log collector workflow and returns errors to trigger controller retries, aligning with the PR's goal to "throw errors to trigger retries."

fault-remediation/pkg/reconciler/reconciler.go (1)

169-191: No changes needed. ctrl.Result.IsZero() is available in controller-runtime v0.22.4, and the current approach of returning ctrl.Result{} for empty results is idiomatic and compatible with the pinned version.

slog.Info("Running in controller runtime managed mode")

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
//TODO: setup informers for node and job

🛠️ Refactor suggestion | 🟠 Major

TODO comment should reference an issue.

Per coding guidelines, TODO comments should reference a GitHub issue for tracking.

🤖 Prompt for AI Agents
In @fault-remediation/main.go at line 111, Replace the TODO comment "//TODO:
setup informers for node and job" with a TODO that references the tracking
GitHub issue (e.g., "// TODO(#<issue-number>): setup informers for node and
job") so the work is traceable; update the exact comment string in
fault-remediation/main.go to include the issue number and a short owner or
milestone if applicable.

Comment on lines +1 to +3
package events

import "github.com/nvidia/nvsentinel/data-models/pkg/model"

🛠️ Refactor suggestion | 🟠 Major

Add package-level documentation.

The package lacks a godoc comment describing its purpose.

As per coding guidelines, package-level godoc is required for all Go packages. Add documentation explaining that this package defines event data structures for health events with different serialization tags.

Proposed documentation
+// Package events provides data structures for representing health events
+// with support for different serialization formats (JSON and BSON).
 package events
🤖 Prompt for AI Agents
In @fault-remediation/pkg/events/health_event.go around lines 1 - 3, Add a
package-level godoc comment immediately above "package events" in
health_event.go that briefly describes this package's purpose, e.g., that
package events defines event data structures for health/monitoring events and
documents the different serialization tags (JSON, protobuf, etc.) used on those
types, so the package has a clear, guideline-compliant description.

Comment on lines 299 to 329
// createTestRemediationClient creates a real FaultRemediationClient for e2e tests
func createTestRemediationClient(dryRun bool) (*FaultRemediationClient, error) {

// Create discovery client for RESTMapper
discoveryClient, err := discovery.NewDiscoveryClientForConfig(testRestConfig)
if err != nil {
return nil, err
}

cachedClient := memory.NewMemCacheClient(discoveryClient)
mapper := restmapper.NewDeferredDiscoveryRESTMapper(cachedClient)

templatePath := filepath.Join("templates", "rebootnode-template.yaml")
templateContent, err := os.ReadFile(templatePath)
if err != nil {
return nil, err
}

tmpl, err := template.New("maintenance").Parse(string(templateContent))
if err != nil {
return nil, err
}

// Create remediation config with the test template
func createTestRemediationClient(dryRun bool) (remediation2.FaultRemediationClientInterface, error) {
remediationConfig := config.TomlConfig{
Template: config.Template{
MountPath: "./templates",
FileName: "rebootnode-template.yaml",
},
RemediationActions: map[string]config.MaintenanceResource{
"RESTART_BM": {
ApiGroup: "janitor.dgxc.nvidia.com",
Version: "v1alpha1",
Kind: "RebootNode",
TemplateFileName: "test.yaml",
TemplateFileName: "rebootnode-template.yaml",
CompleteConditionType: "NodeReady",
EquivalenceGroup: "restart",
},
"COMPONENT_RESET": {
ApiGroup: "janitor.dgxc.nvidia.com",
Version: "v1alpha1",
Kind: "RebootNode",
TemplateFileName: "gpu-reset.yaml",
TemplateFileName: "rebootnode-template.yaml",
CompleteConditionType: "NodeReady",
EquivalenceGroup: "restart",
},
},
}

// Create templates map
templates := map[string]*template.Template{
"RESTART_BM": tmpl,
"COMPONENT_RESET": tmpl, // Use same template for testing
}

client := &FaultRemediationClient{
clientset: testDynamic,
kubeClient: testClient,
restMapper: mapper,
remediationConfig: remediationConfig,
templates: templates,
templateMountPath: "/tmp",
annotationManager: NewNodeAnnotationManager(testClient),
statusChecker: crstatus.NewCRStatusChecker(testDynamic, mapper, remediationConfig.RemediationActions, dryRun),
}

if dryRun {
client.dryRunMode = []string{metav1.DryRunAll}
} else {
client.dryRunMode = []string{}
}

return client, nil
return remediation2.NewRemediationClient(ctrlRuntimeClient, dryRun, remediationConfig)
}

⚠️ Potential issue | 🟠 Major

Make template paths robust: "./templates" is sensitive to go test working directory.
This is a common CI-only failure mode; prefer resolving relative to the test file location.

Proposed fix (resolve mount path via runtime.Caller)
+import (
+  ...
+  "runtime"
+  ...
+)

 func createTestRemediationClient(dryRun bool) (remediation2.FaultRemediationClientInterface, error) {
+  _, thisFile, _, _ := runtime.Caller(0)
+  templatesDir := filepath.Join(filepath.Dir(thisFile), "templates")
+
   remediationConfig := config.TomlConfig{
     Template: config.Template{
-      MountPath: "./templates",
+      MountPath: templatesDir,
       FileName:  "rebootnode-template.yaml",
     },
     RemediationActions: map[string]config.MaintenanceResource{
       ...
     },
   }

   return remediation2.NewRemediationClient(ctrlRuntimeClient, dryRun, remediationConfig)
 }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In @fault-remediation/pkg/reconciler/reconciler_e2e_test.go around lines 299 -
327, The test uses a relative "./templates" path which breaks when go test runs
from a different working directory; in createTestRemediationClient resolve the
Template.MountPath to an absolute path anchored to the test file by using
runtime.Caller to get the test file directory and filepath.Join to build the
templates path, then set remediationConfig.Template.MountPath to that resolved
path before calling remediation2.NewRemediationClient; update any imports
accordingly (runtime, path/filepath).

@@ -0,0 +1,582 @@
package remediation

🛠️ Refactor suggestion | 🟠 Major

Add package-level godoc comment.

As per coding guidelines, package-level godoc is required for all Go packages.

📝 Suggested package documentation
+// Package remediation provides functionality for managing fault remediation workflows,
+// including maintenance resource creation, log collection, and remediation state tracking.
 package remediation


🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go at line 1, The package
lacks a package-level godoc comment: add a concise package comment immediately
above the "package remediation" declaration that describes the purpose and
responsibilities of the remediation package (what it does, key abstractions or
intended use), so tools like godoc and linters will pick it up; ensure the
comment starts with "Package remediation" to follow Go documentation
conventions.

return c.statusChecker
}

// nolint: cyclop // todo

🛠️ Refactor suggestion | 🟠 Major

TODO comment should reference an issue.

As per coding guidelines, TODO comments should reference issues in Go code.

🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go at line 148, Replace the
bare "// nolint: cyclop // todo" comment with a TODO that references a concrete
issue/ID and a short rationale so it complies with coding guidelines; for
example change the "nolint: cyclop" comment to include "TODO: <issue-number> -
<brief reason>" (keeping the nolint directive), ensuring the comment associated
with the nolint: cyclop directive documents the issue tracker ID and why the
cyclomatic complexity exemption is needed.

conditions []metav1.Condition,
) (bool, error) {
completeCondition := meta.FindStatusCondition(conditions, string(batchv1.JobComplete))
//nolint:nestif // todo

🛠️ Refactor suggestion | 🟠 Major

TODO comment should reference an issue.

As per coding guidelines, TODO comments should reference issues in Go code.

🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go at line 462, Replace the
bare "// todo" comment next to the nolint directive with a TODO that references
an existing issue or ticket (e.g., change "//nolint:nestif // todo" to something
like "//nolint:nestif // TODO: track ref #<issue-number> or
https://.../issues/<id>"), so update the comment in remediation.go where the
"//nolint:nestif // todo" appears to include the issue identifier or URL.

) (bool, error) {
// check if failed
failedCondition := meta.FindStatusCondition(conditions, string(batchv1.JobFailed))
//nolint:nestif // todo

🛠️ Refactor suggestion | 🟠 Major

TODO comment should reference an issue.

As per coding guidelines, TODO comments should reference issues in Go code.


🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go at line 500, Replace the
bare "//nolint:nestif // todo" comment with a TODO that references the tracking
issue or ticket (e.g., "TODO: link to ISSUE-NNNN" or include the issue URL) so
it complies with the Go coding guidelines; update the nolint line (the
"//nolint:nestif" comment adjacent to the remediation function/block) to include
a clear TODO with the issue identifier and optionally a short rationale so
future maintainers can find the related discussion.

}

// check timeout
//nolint:nestif // todo

🛠️ Refactor suggestion | 🟠 Major

TODO comment should reference an issue.

As per coding guidelines, TODO comments should reference issues in Go code.


🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go at line 553, Replace the
bare "//nolint:nestif // todo" comment with a TODO that references a real issue
ID or URL (for example "//nolint:nestif // TODO: issue #12345" or
"//nolint:nestif // TODO: https://tracker/PROJECT/issues/12345") so the intent
and follow-up are traceable; if no issue exists, create one describing why the
nested-if needs refactoring and reference that new issue in the TODO comment
where the current "//nolint:nestif // todo" appears in remediation.go.

@ivelichkovich force-pushed the remediationerrors branch 4 times, most recently from 59aab7a to 5947ed5 on January 9, 2026 21:27

@coderabbitai bot left a comment


Actionable comments posted: 13

Caution

Some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
fault-remediation/pkg/initializer/init.go (1)

40-45: Unexported config field in exported struct prevents external initialization.

The config field is unexported (lowercase c), but InitializationParams is an exported struct. External callers cannot set this field when constructing the params. Either export the field or provide a constructor function.

🔧 Proposed fix - export the field
 type InitializationParams struct {
-	config             *rest.Config
+	Config             *rest.Config
 	TomlConfigPath     string
 	DryRun             bool
 	EnableLogCollector bool
 }

Then update usages within this file to use params.Config instead of params.config.
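
The other option mentioned above, a constructor function, could look like the following minimal sketch. `NewInitializationParams` is a hypothetical name that does not appear in the PR, and `restConfig` is a stand-in for `rest.Config` from client-go so that the example compiles with the standard library alone.

```go
package main

import (
	"errors"
	"fmt"
)

// restConfig is a stand-in for rest.Config from k8s.io/client-go/rest
// (assumption: used only so this sketch needs no external modules).
type restConfig struct {
	Host string
}

// InitializationParams mirrors the struct under review; config stays unexported.
type InitializationParams struct {
	config             *restConfig
	TomlConfigPath     string
	DryRun             bool
	EnableLogCollector bool
}

// NewInitializationParams is a hypothetical constructor. It lets callers in
// other packages supply the unexported config field and rejects nil early.
func NewInitializationParams(cfg *restConfig, tomlPath string, dryRun, enableLogCollector bool) (InitializationParams, error) {
	if cfg == nil {
		return InitializationParams{}, errors.New("config must not be nil")
	}

	return InitializationParams{
		config:             cfg,
		TomlConfigPath:     tomlPath,
		DryRun:             dryRun,
		EnableLogCollector: enableLogCollector,
	}, nil
}

func main() {
	params, err := NewInitializationParams(&restConfig{Host: "https://example.invalid"}, "/etc/config.toml", true, false)
	if err != nil {
		fmt.Println("error:", err)
		return
	}

	fmt.Println(params.config.Host, params.DryRun)
}
```

Compared with exporting the field, a constructor keeps the field private and gives one place to validate it, at the cost of a slightly larger API surface.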

fault-remediation/main.go (1)

54-54: kubeconfigPath flag is parsed but never used.

The --kubeconfig flag is parsed into kubeconfigPath but the code uses ctrl.GetConfigOrDie() instead, which ignores this flag. Either use kubeconfigPath via clientcmd.BuildConfigFromFlags or remove the unused flag.

🔧 Proposed fix - use the flag
 func setupCtrlRuntimeManagement(ctx context.Context, params initializer.InitializationParams) error {
 	slog.Info("Running in controller runtime managed mode")

-	cfg := ctrl.GetConfigOrDie()
+	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
+	if err != nil {
+		return fmt.Errorf("failed to build config: %w", err)
+	}
 	cfg.Wrap(func(rt http.RoundTripper) http.RoundTripper {

This requires importing "k8s.io/client-go/tools/clientcmd".

Also applies to: 183-183

fault-remediation/pkg/crstatus/checker.go (1)

28-76: Nil-client safety: ShouldSkipCRCreation can panic in tests or miswired setups.
Line 70: if c.client is nil, c.client.Get(...) will panic. Consider a defensive guard that returns "don't skip" (i.e., allows creation) and logs a warning.

Proposed fix
 func (c *CRStatusChecker) ShouldSkipCRCreation(ctx context.Context, actionName string, crName string) bool {
+	if c.client == nil {
+		slog.Warn("CRStatusChecker client is nil, allowing create", "crName", crName, "action", actionName)
+		return false
+	}
fault-remediation/pkg/reconciler/reconciler.go (3)

395-459: Backward-compat: groupState.ActionName may be empty for older annotations.
Lines 443-445: if the stored annotation predates ActionName, storedActionName can be empty, so the status check is called with an empty action name and may behave unpredictably. Consider falling back to the current actionName.

Proposed fix
-	storedActionName := groupState.ActionName
+	storedActionName := groupState.ActionName
+	if storedActionName == "" {
+		storedActionName = actionName
+	}
 	shouldSkip := statusChecker.ShouldSkipCRCreation(ctx, storedActionName, groupState.MaintenanceCR)

463-513: Incorrect error message on parse failure (and TODO/nolint issue refs).
Line 492: returning fmt.Errorf("error updating resume token: %w", err) is misleading (the failure is parsing, not marking processed). Also, if any nolint/TODO tags exist in this file, they should reference an issue per guidelines.

Proposed fix
-		return result, fmt.Errorf("error updating resume token: %w", err)
+		return result, fmt.Errorf("error parsing health event: %w", err)

524-538: Nil safety in CloseAll: r.ds / r.Watcher can be nil.
Given tests construct the reconciler with ds=nil, CloseAll should guard against nil before calling Close.

Proposed fix
 func (r *FaultRemediationReconciler) CloseAll(ctx context.Context) error {
 	var errs []error
 
-	if err := r.ds.Close(ctx); err != nil {
+	if r.ds != nil {
+		if err := r.ds.Close(ctx); err != nil {
 			slog.Error("failed to close datastore", "error", err)
 			errs = append(errs, err)
-	}
+		}
+	}
 
-	if err := r.Watcher.Close(ctx); err != nil {
+	if r.Watcher != nil {
+		if err := r.Watcher.Close(ctx); err != nil {
 			slog.Error("failed to close Watcher", "error", err)
 			errs = append(errs, err)
-	}
+		}
+	}
 
 	return errors.Join(errs...)
 }
fault-remediation/pkg/reconciler/reconciler_test.go (1)

572-610: Test does not reflect actual reconciler behavior for NONE actions with log collector enabled.

The test manually calls RunLogCollectorJob and expects it to be invoked for RecommendedAction_NONE events when EnableLogCollector is true (line 604-606). However, the production runLogCollector method returns early for NONE actions and never reaches the call to RunLogCollectorJob. Update the test to either: (1) remove the expectation that RunLogCollectorJob is called for NONE actions, or (2) adjust production logic if NONE should actually trigger log collection.

🤖 Fix all issues with AI agents
In @fault-remediation/main.go:
- Around line 89-93: The InitializationParams constructed in main builds
TomlConfigPath/DryRun/EnableLogCollector but omits the Config field, so
InitializeAll (which calls kubernetes.NewForConfig(params.Config)) receives nil;
fix by setting the Config field to the controller config (cfg returned from
ctrl.GetConfigOrDie()) when creating the InitializationParams instance or
otherwise pass cfg into setupCtrlRuntimeManagement/InitializeAll; update the
struct literal that creates params to include Config: cfg (or the correct
exported field name) so kubernetes.NewForConfig(params.Config) no longer
dereferences nil.

In @fault-remediation/pkg/annotation/annotation_test.go:
- Around line 200-202: After calling
annotationManager.GetRemediationState(context.TODO(), nodeName) you must assert
the call returned no error before inspecting state; add an assertion like
assert.NoError(t, err) (or require.NoError if you prefer test termination)
immediately after the GetRemediationState call so subsequent checks against
state, removedGroup and notRemovedGroup are valid.
- Around line 140-143: After calling annotationManager.GetRemediationState,
assert the returned error is nil to avoid masking failures, e.g., right after
state, _, err := annotationManager.GetRemediationState(context.TODO(), nodeName)
add assert.NoError(t, err) (or require.NoError(t, err)) before the subsequent
assertions on state. Ensure you use the same test assertion package already in
the file.

In @fault-remediation/pkg/annotation/annotation.go:
- Around line 62-102: UpdateRemediationState: wrap the entire
get-modify-patch read-modify-write cycle in
retry.RetryOnConflict(retry.DefaultBackoff, func() error { ... }) inside
UpdateRemediationState; inside the retry closure call m.GetRemediationState(ctx,
nodeName) on each attempt, modify state.EquivalenceGroups[group] as before,
marshal to JSON, set node.Annotations[AnnotationKey] on a DeepCopy of the
freshly fetched node, then call m.client.Patch(ctx, node,
client.MergeFrom(patchBase)) and return that error from the closure; after the
retry call return any error from RetryOnConflict. Also add the import for
k8s.io/client-go/util/retry and ensure existing log statements remain
outside/after the retry as appropriate.

In @fault-remediation/pkg/crstatus/crstatus_interface.go:
- Around line 9-11: The exported CRStatusCheckerInterface and its method
ShouldSkipCRCreation lack documentation and use unnamed parameters; add a
godoc comment for CRStatusCheckerInterface and a descriptive
comment for ShouldSkipCRCreation, and update the method signature to use named
parameters (e.g., ShouldSkipCRCreation(ctx context.Context, actionName string,
crName string) bool) so callers see parameter intent; ensure comments
describe purpose, parameters, and return value and follow Go doc conventions.

In @fault-remediation/pkg/initializer/init.go:
- Around line 90-92: The error message is misleading: update the fmt.Errorf call
that wraps err (the return nil, fmt.Errorf("error while initializing ctrl
runtime client: %w", err)) to reference the remediation client instead; change
the message to something like "error while initializing remediation client: %w"
so the wrapped error (err) correctly reflects the remediation client
initialization failure.

In @fault-remediation/pkg/metrics/metrics.go:
- Line 30: Update the TODO comment on the redundant metrics line to include the
issue tracker reference: change the bare "//TODO: evaluate and remove redundant
metrics with ctrl-runtime defaults" to include the issue ID (e.g. "// TODO:
evaluate and remove redundant metrics with ctrl-runtime defaults (ISSUE-1234)")
so the comment references the corresponding issue; ensure you replace ISSUE-1234
with the actual issue number and keep the comment format consistent with other
TODOs in the repo.

In @fault-remediation/pkg/reconciler/reconciler_e2e_test.go:
- Around line 368-373: Replace the bare "TODO:" comments in
reconciler_e2e_test.go with tracked-issue references per repo guidelines: find
the calls near UpdateNVSentinelStateNodeLabel(...) and r.performRemediation(...)
(and the other TODO locations around lines shown) and update each TODO to
include an issue ID or URL (e.g., "TODO: ISSUE-1234 - reason") or create the
corresponding issue and reference it; ensure every TODO now includes the issue
identifier and a brief one-line context so reviewers can find the associated
ticket.

In @fault-remediation/pkg/reconciler/reconciler.go:
- Around line 170-193: The runLogCollector function incorrectly wraps a nil
error when RunLogCollectorJob returns a non-empty ctrl.Result but nil err,
causing normal requeue to be treated as an error; change the logic around
r.Config.RemediationClient.RunLogCollectorJob so you only call fmt.Errorf and
return an error when err != nil (log and wrap err via fmt.Errorf("%w", err)),
and otherwise return the returned result with a nil error (i.e., return result,
nil) from runLogCollector.

In @fault-remediation/pkg/remediation/remediation_test.go:
- Around line 377-381: The "Skip creation with dry run" test case fails because
it omits templateDir so RunLogCollectorJob errors while loading templates
instead of exercising dry-run logic; update that test case to include a valid
templateDir (e.g., the same template path used by other cases or a
testdata/templates fixture) so the template loads successfully and the test
reaches the dry-run branch, leaving dryRun: true and expectedError: false
unchanged.

In @fault-remediation/pkg/remediation/remediation.go:
- Around line 149-245: The file has an unreferenced nolint TODO and a
non-deterministic CR name: update the `// nolint: cyclop // todo` comment (and
any other `//nolint:* // todo` occurrences) to include a tracking issue/ID
reference, and make CR naming deterministic in CreateMaintenanceResource by
ensuring the rendered object has a name β€” after calling
renderMaintenanceFromTemplate(…) and before setNodeOwnerRef(…), check if
maintenance.GetName() is empty and if so call maintenance.SetName(crName) so
that Create(ctx, maintenance) will not fail if the template omitted
metadata.name.
- Around line 423-454: The checkLogCollectorStatus currently treats found==true
as an error and wraps nil into an error; update checkLogCollectorStatus so each
call to checkLogCollectorComplete, checkLogCollectorFailed, and
checkLogCollectorTimedOut only returns a wrapped error when err != nil, and if
found==true return success (e.g., return ctrl.Result{}, nil) rather than an
error; specifically change the three blocks that now read "if found || err !=
nil { return ..., fmt.Errorf(...%w, err) }" to first check "if err != nil {
return ctrl.Result{}, fmt.Errorf(...%w, err) }" and then "if found { return
ctrl.Result{}, nil }" so that completed/failed/timed-out jobs are treated as
terminal success by checkLogCollectorStatus.
- Around line 456-579: The annotation check and timestamp usage are unsafe: in
checkLogCollectorComplete, checkLogCollectorFailed and checkLogCollectorTimedOut
you currently gate metrics with job.Annotations != nil &&
job.Annotations[jobMetricsAlreadyCountedAnnotation] != trueStringVal which skips
recording forever when Annotations is nil, and you dereference
job.Status.StartTime / CompletionTime without nil checks risking panics. Fix by
reading the annotation value safely (val, ok :=
job.Annotations[jobMetricsAlreadyCountedAnnotation]; if !ok || val !=
trueStringVal { ... }) and still ensure updateJob.Annotations is initialized
before writing; and compute durations only after guarding StartTime (if
job.Status.StartTime != nil) and CompletionTime (if non-nil use
CompletionTime.Sub(start) else use time.Since(start)), falling back to a safe
zero or timeout-based duration if StartTime is nil; apply these changes in
functions checkLogCollectorComplete, checkLogCollectorFailed, and
checkLogCollectorTimedOut referencing jobMetricsAlreadyCountedAnnotation and
trueStringVal.
🧹 Nitpick comments (12)
fault-remediation/pkg/events/health_event.go (1)

11-15: Clarify the purpose of HealthEventData in the comment.

The comment "for compatibility" is vague. Consider specifying what it's compatible with (e.g., BSON serialization for MongoDB storage).

📝 Proposed improvement
-// HealthEventData represents health event data with string ID for compatibility
+// HealthEventData represents health event data with BSON tags for MongoDB storage compatibility.
 type HealthEventData struct {
 	ID                          string `bson:"_id,omitempty"`
 	model.HealthEventWithStatus `bson:",inline"`
 }
fault-remediation/pkg/annotation/annotation_test.go (1)

104-117: Simplify control flow by removing else after return.

The else block is unnecessary after return in the error case.

♻️ Suggested refactor
 			if tt.expectError {
 				assert.Error(t, err)
 				return
-			} else {
-				assert.NoError(t, err)
-				assert.Equal(t, tt.node.Name, node.Name)
-				assert.Equal(t, len(tt.expectedState.EquivalenceGroups), len(resultState.EquivalenceGroups))
-				for expectedKey, expectedValue := range tt.expectedState.EquivalenceGroups {
-					assert.Equal(t, expectedValue.MaintenanceCR, resultState.EquivalenceGroups[expectedKey].MaintenanceCR)
-					assert.Equal(t, expectedValue.ActionName, resultState.EquivalenceGroups[expectedKey].ActionName)
-					assert.Equal(t, expectedValue.CreatedAt.Unix(), resultState.EquivalenceGroups[expectedKey].CreatedAt.Unix())
-				}
 			}
+			assert.NoError(t, err)
+			assert.Equal(t, tt.node.Name, node.Name)
+			assert.Equal(t, len(tt.expectedState.EquivalenceGroups), len(resultState.EquivalenceGroups))
+			for expectedKey, expectedValue := range tt.expectedState.EquivalenceGroups {
+				assert.Equal(t, expectedValue.MaintenanceCR, resultState.EquivalenceGroups[expectedKey].MaintenanceCR)
+				assert.Equal(t, expectedValue.ActionName, resultState.EquivalenceGroups[expectedKey].ActionName)
+				assert.Equal(t, expectedValue.CreatedAt.Unix(), resultState.EquivalenceGroups[expectedKey].CreatedAt.Unix())
+			}
 		})
fault-remediation/pkg/remediation/remediation_test.go (1)

521-526: Asserting RequeueAfter on error cases may be unreliable.

When tt.expectedError is true, the result struct may not be properly initialized. Consider only asserting RequeueAfter for non-error cases.

♻️ Suggested refactor
 			if tt.expectedError {
 				assert.Error(t, err)
-			} else {
+				return
+			}
 				assert.NoError(t, err)
-			}
 			assert.Equal(t, result.RequeueAfter, tt.requeueTime)
fault-remediation/main.go (1)

111-111: TODO comment should reference an issue.

As per coding guidelines, TODO comments should reference issues in Go code.

fault-remediation/pkg/annotation/annotation.go (2)

1-1: Missing package-level godoc.

As per coding guidelines, package-level godoc is required for all Go packages.

♻️ Proposed fix
+// Package annotation provides utilities for managing remediation state
+// annotations on Kubernetes nodes.
 package annotation

40-40: TODO comment should reference an issue.

As per coding guidelines, TODO comments should reference issues in Go code.

fault-remediation/pkg/annotation/annotation_interface.go (1)

1-1: Missing package-level godoc.

As per coding guidelines, package-level godoc is required for all Go packages. Since this file defines the interface, it would be appropriate to add the package doc here.

♻️ Proposed fix
+// Package annotation provides interfaces and types for managing remediation
+// state annotations on Kubernetes nodes.
 package annotation
fault-remediation/pkg/remediation/remediation.go (3)

1-115: Add package-level godoc + exported API comments (repo guideline).
Line 1: package remediation should have a // Package remediation ... comment, and exported funcs/methods (Line 56, 141, 145, 150, 340, 581) should have doc comments. As per coding guidelines.


49-54: templates map comment mismatches actual key; dryRunMode []string looks like accidental complexity.
Line 49: the comment says "map from template file name", but the key used is actionName (Lines 71-82). Consider fixing the comment and switching dryRunMode to dryRun bool (it's only used as a boolean flag in this file).

Also applies to: 56-114


358-421: Don't overwrite Job labels from manifest; merge instead (may break selectors/ownership).
Lines 384-390: job.Labels = labels discards labels from the YAML manifest (which might be required by selectors/GC/ops tooling). Prefer merging.
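
A minimal sketch of the merge suggested above. `mergeLabels` is a hypothetical helper, and letting the generated labels win on a key conflict is an assumption; the PR may prefer the opposite precedence.

```go
package main

import "fmt"

// mergeLabels returns a new map containing every label from the manifest
// plus the generated labels. On a key conflict the generated value wins
// (assumed precedence). Neither input map is modified.
func mergeLabels(manifest, generated map[string]string) map[string]string {
	merged := make(map[string]string, len(manifest)+len(generated))

	for key, value := range manifest {
		merged[key] = value
	}

	for key, value := range generated {
		merged[key] = value
	}

	return merged
}

func main() {
	manifest := map[string]string{"app": "log-collector"}
	generated := map[string]string{"node": "node-1"}

	// In the code under review this would replace `job.Labels = labels`.
	fmt.Println(mergeLabels(manifest, generated))
}
```

Ranging over a nil map is a no-op in Go, so the helper also works when the manifest defines no labels at all.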

fault-remediation/pkg/reconciler/reconciler_test.go (2)

41-63: Mock signature mismatch: runLogCollectorJobFn ignores eventId (can hide bugs).
Lines 44-55: RunLogCollectorJob(ctx, nodeName, eventId) drops eventId when calling the mock function. Recommend including eventId in runLogCollectorJobFn so tests can assert it's passed through correctly.


818-954: Deduplication tests don't actually exercise status-check behavior.
TestCRBasedDeduplication / TestCrossActionRemediationWithEquivalenceGroups set crSucceeded etc., but the mock provides no CRStatusCheckerInterface, so assertions reduce to "allow create when status checker is nil". Consider adding a simple mock implementing ShouldSkipCRCreation to cover both "skip" and "allow retry" paths.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0a80647 and 5947ed5.

⛔ Files ignored due to path filters (1)
  • commons/go.sum is excluded by !**/*.sum
📒 Files selected for processing (29)
  • .gitignore
  • commons/go.mod
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/annotation.go
  • fault-remediation/pkg/reconciler/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/remediation.go
  • fault-remediation/pkg/reconciler/remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
💤 Files with no reviewable changes (8)
  • distros/kubernetes/nvsentinel/values-full.yaml
  • fault-remediation/pkg/reconciler/remediation_test.go
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-remediation/pkg/reconciler/annotation_test.go
  • fault-remediation/pkg/reconciler/remediation.go
  • fault-remediation/pkg/reconciler/annotation.go
✅ Files skipped from review due to trivial changes (1)
  • commons/go.mod
🚧 Files skipped from review as they are similar to previous changes (3)
  • .gitignore
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
🧠 Learnings (21)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Health monitors should include Helm chart in `distros/kubernetes/nvsentinel/charts/`

Applied to files:

  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧬 Code graph analysis (9)
fault-remediation/pkg/initializer/init.go (2)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewRemediationClient (56-114)
commons/pkg/statemanager/statemanager.go (2)
  • NewStateManager (206-210)
  • StateManager (197-200)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-35)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (3)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (12-15)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (9-11)
fault-remediation/pkg/annotation/annotation_test.go (2)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-35)
fault-remediation/pkg/annotation/annotation.go (1)
  • NodeAnnotationManager (16-18)
fault-remediation/pkg/remediation/remediation_test.go (4)
fault-remediation/pkg/config/config.go (2)
  • Template (47-50)
  • MaintenanceResource (27-44)
data-models/pkg/protos/health_event.pb.go (2)
  • RecommendedAction_RESTART_BM (48-48)
  • RecommendedAction_COMPONENT_RESET (44-44)
fault-remediation/pkg/remediation/remediation.go (2)
  • NewRemediationClient (56-114)
  • FaultRemediationClient (43-54)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (12-15)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (27-44)
fault-remediation/pkg/reconciler/reconciler.go (6)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/metrics/metrics.go (7)
  • EventHandlingDuration (62-68)
  • TotalEventsReceived (33-38)
  • TotalUnsupportedRemediationActions (53-59)
  • ProcessingErrors (46-52)
  • EventsProcessed (39-45)
  • CRStatusSkipped (26-26)
  • CRStatusCreated (25-25)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_NONE (43-43)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventDoc (6-9)
  • HealthEventData (12-15)
data-models/pkg/model/health_event_extentions.go (1)
  • HealthEventWithStatus (53-57)
store-client/pkg/utils/document_utils.go (1)
  • ExtractDocumentID (59-84)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (4)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewRemediationClient (56-114)
commons/pkg/statemanager/statemanager.go (2)
  • NewStateManager (206-210)
  • StateManager (197-200)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • AnnotationKey (12-12)
fault-remediation/pkg/metrics/metrics.go (6)
  • TotalEventsReceived (33-38)
  • EventHandlingDuration (62-68)
  • EventsProcessed (39-45)
  • CRStatusCreated (25-25)
  • CRStatusSkipped (26-26)
  • TotalUnsupportedRemediationActions (53-59)
fault-remediation/pkg/reconciler/reconciler_test.go (3)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventData (12-15)
  • HealthEventDoc (6-9)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • NodeAnnotationManagerInterface (16-21)
  • RemediationStateAnnotation (24-26)
  • EquivalenceGroupState (29-35)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (9-11)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (9)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)

99-99: LGTM! Health and readiness probe configuration unified.

The removal of conditional logic around the ctrlRuntimeEnabled flag simplifies the deployment template and aligns with the controller-runtime consolidation.

Also applies to: 106-107

fault-remediation/pkg/metrics/metrics.go (1)

33-92: LGTM! Metrics properly exported for package-level access.

The exported metric variables follow Go naming conventions and enable proper access from consuming packages. The migration to a dedicated metrics package with exported identifiers is a sound refactoring.

fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (1)

15-20: LGTM! Go template syntax is correct.

The template uses valid Go template placeholders that will be rendered into proper YAML. The static analysis syntax error is a false positive: YAMLlint cannot parse Go template directives.

fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)

29-35: Interface design looks good.

The interface methods are well-defined with appropriate return types. The shift to returning (string, error) from CreateMaintenanceResource and (ctrl.Result, error) from RunLogCollectorJob provides better control flow and error handling.


37-50: TemplateData struct is well-documented.

The struct provides clear field documentation and appropriate separation between node/event data and CRD routing metadata.

fault-remediation/pkg/annotation/annotation.go (1)

50-57: Silent failure on unmarshal error may mask data corruption.

When json.Unmarshal fails, the code logs an error but returns an empty state, which could lead to overwriting valid data. Consider returning an error or using a different annotation key to avoid data loss.
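
One way to avoid the silent fallback is to return the unmarshal error to the caller, so a corrupted annotation is never replaced by an empty state. The following is a minimal sketch, not the code in annotation.go: the `RemediationState` type, its `Groups` field, and the `parseState` helper are hypothetical stand-ins for the real annotation types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RemediationState is a hypothetical stand-in for the annotation payload;
// the real type lives in annotation_interface.go and may differ.
type RemediationState struct {
	Groups map[string]string `json:"groups"`
}

// parseState returns an error for a malformed annotation instead of an empty
// state, so a caller cannot overwrite data it failed to read.
func parseState(raw string) (*RemediationState, error) {
	if raw == "" {
		// A missing annotation is treated as a legitimate empty state.
		return &RemediationState{Groups: map[string]string{}}, nil
	}

	var state RemediationState
	if err := json.Unmarshal([]byte(raw), &state); err != nil {
		return nil, fmt.Errorf("unmarshal remediation state annotation: %w", err)
	}

	if state.Groups == nil {
		state.Groups = map[string]string{}
	}

	return &state, nil
}

func main() {
	if _, err := parseState("{not json"); err != nil {
		fmt.Println("refusing to overwrite annotation:", err)
	}
}
```

With this shape the caller decides whether to abort, requeue, or write to a different key; the helper itself never discards the original value.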

fault-remediation/pkg/annotation/annotation_interface.go (1)

10-35: Interface and type definitions are well-structured.

The interface design is clean with appropriate method signatures. The struct types have proper JSON tags for serialization. The AnnotationKey constant is appropriately documented.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

299-327: Verify template mount path works under go test (likely wrong relative path).
Line 303: MountPath: "./templates" is relative to the package working directory at test runtime; if templates actually live under fault-remediation/pkg/remediation/templates, this will fail in CI. Please verify and, if needed, switch to a path that exists from the package dir (e.g., filepath.Join("..", "remediation", "templates")).

fault-remediation/pkg/reconciler/reconciler_test.go (1)

343-344: No action needed: t.Context() is fully supported.

The repository requires Go 1.25 (with toolchain go1.25.3) across all modules. testing.T.Context() was added in Go 1.24, so it is available and compatible with the project's Go version.

Likely an incorrect or invalid review comment.

@ivelichkovich ivelichkovich force-pushed the remediationerrors branch 2 times, most recently from e9af50e to 81189df Compare January 9, 2026 22:45
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

Caution

Some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)

104-111: Fix extra space in readiness probe path.

Line 106 has two spaces between the colon and the path value (path:  "/readyz"). Use a single space after the colon for standard YAML formatting.

🧹 Proposed fix
           readinessProbe:
             httpGet:
-              path:  "/readyz"
+              path: "/readyz"
               port: "health"
             initialDelaySeconds: 5
fault-remediation/pkg/crstatus/checker.go (1)

47-76: Add validation to ensure cluster-scoped resources have empty namespace.

The validateScope function in fault-remediation/pkg/config/config.go (lines 133-148) only validates that namespaced resources have a non-empty namespace, but does not validate the inverse: that cluster-scoped resources must have an empty namespace. This allows cluster-scoped remediation actions to be configured with a non-empty namespace value, which would cause the client.Get call in ShouldSkipCRCreation (line 68) to fail or behave unexpectedly.

Enhance validateScope to also validate that cluster-scoped resources (where scope == "Cluster") have an empty namespace, or explicitly clear the namespace for cluster-scoped ObjectKey construction.
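
The symmetric check could look roughly like the sketch below. It is not the repository's `validateScope`: the `resource` struct is a hypothetical, trimmed-down stand-in for `config.MaintenanceResource`, the function signature is assumed, and the `"Namespaced"` scope value is an assumption (only `"Cluster"` appears in the comment above).

```go
package main

import "fmt"

// resource is a hypothetical stand-in for config.MaintenanceResource,
// reduced to the two fields relevant to scope validation.
type resource struct {
	Scope     string // "Namespaced" (assumed value) or "Cluster"
	Namespace string
}

// validateScope enforces both directions: namespaced resources need a
// namespace, and cluster-scoped resources must not carry one.
func validateScope(action string, r resource) error {
	switch r.Scope {
	case "Namespaced":
		if r.Namespace == "" {
			return fmt.Errorf("action %q: namespaced resource requires a namespace", action)
		}
	case "Cluster":
		if r.Namespace != "" {
			return fmt.Errorf("action %q: cluster-scoped resource must not set namespace %q", action, r.Namespace)
		}
	default:
		return fmt.Errorf("action %q: unknown scope %q", action, r.Scope)
	}

	return nil
}

func main() {
	err := validateScope("reboot", resource{Scope: "Cluster", Namespace: "default"})
	fmt.Println(err)
}
```

Rejecting the configuration at load time keeps the failure close to its cause; clearing the namespace when building the object key is the more lenient alternative mentioned above.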

🤖 Fix all issues with AI agents
In @fault-remediation/main.go:
- Line 184: The kubeconfigPath flag is defined but never used; either remove the
flag or wire it into config loading. To wire it, replace the direct call to
ctrl.GetConfigOrDie() with logic that uses clientcmd.BuildConfigFromFlags("",
kubeconfigPath) when kubeconfigPath is non-empty and falls back to
ctrl.GetConfigOrDie() otherwise (ensuring you handle and log
BuildConfigFromFlags errors); alternatively, delete the flag declaration
(kubeconfigPath and flag.StringVar) so no unused flag remains.

In @fault-remediation/pkg/annotation/annotation_interface.go:
- Line 1: Add a package-level godoc comment for the annotation package: place a
clear, one-paragraph comment above the "package annotation" declaration (in
either annotation_interface.go or annotation.go) describing the package purpose
and high-level behavior; ensure it follows Go doc conventions (complete
sentence, starts with "Package annotation ...") so godoc tools and linters
recognize it.

In @fault-remediation/pkg/annotation/annotation.go:
- Line 1: Add a package-level godoc comment above the "package annotation"
declaration that briefly describes the purpose and responsibilities of the
annotation package (what it provides, key concepts, and any important usage
notes); ensure the comment is a complete sentence or two starting with "Package
annotation ..." so it follows Go doc conventions and tools like godoc and go vet
will recognize it.

In @fault-remediation/pkg/crstatus/checker.go:
- Around line 34-44: Add a GoDoc comment above the exported constructor
NewCRStatusChecker that briefly describes what CRStatusChecker does and the
purpose of the constructor, and document the parameters (client client.Client,
remediationActions map[string]config.MaintenanceResource, dryRun bool) and the
returned *CRStatusChecker; place the comment in standard godoc form starting
with "NewCRStatusChecker ..." so tools like godoc and linters recognize it.

In @fault-remediation/pkg/metrics/metrics.go:
- Line 30: Replace the standalone TODO comment in metrics.go (package metrics)
with a TODO that references a tracked issue ID or URL (e.g.,
"TODO(#<issue-number>): evaluate and remove redundant metrics with ctrl-runtime
defaults"); if the issue doesn't exist, create one first and put that issue
number or link in the comment so the TODO follows the repository guideline for
Go code.

In @fault-remediation/pkg/reconciler/reconciler_e2e_test.go:
- Around line 368-369: Several TODO comments in reconciler_e2e_test.go lack
issue references; create separate tracker issues for (a) state transition
handling, (b) StateManager error behavior, and (c) potential state machine bug,
then update each TODO to reference the new issue ID. Specifically, for
occurrences around the UpdateNVSentinelStateNodeLabel usage (symbol:
stateManager.UpdateNVSentinelStateNodeLabel) and the labels like
statemanager.DrainSucceededLabelValue where errors are currently ignored, either
handle the returned error properly in the test or replace the ignore with a TODO
that references the created issue (e.g., TODO(issue-123): reason), and do the
same for other TODOs tied to state transitions and state manager behavior in the
test file.

In @fault-remediation/pkg/reconciler/reconciler.go:
- Line 66: The Reconciler's exported field Config is unused outside
initialization and should be made private to reduce API surface: rename the
struct field from Config to config (type ReconcilerConfig) and update all
internal references (e.g., in constructors like NewReconciler/createReconciler
and methods that currently use r.Config) to r.config; ensure any tests or other
package code that relied on the exported field are updated or, if external
access is required, expose only the needed pieces (e.g., keep annotationManager
private and provide specific accessor methods instead of exposing the whole
config).

In @fault-remediation/pkg/remediation/fault_remediation_client_interface.go:
- Line 15: The package declaration "package remediation" is missing a
package-level godoc comment; add a concise package comment immediately above the
package line describing the purpose and responsibility of the remediation
package (e.g., what it provides, key abstractions or behavior) so the package
"remediation" has proper package-level documentation per Go guidelines.

In @fault-remediation/pkg/remediation/templates/log-collector-job.yaml:
- Around line 30-32: The container spec for the "log-collector" container
currently hardcodes image: test:test; change it to accept a template parameter
(e.g., replace the literal with a template/variable like {{
.Values.logCollector.image }} or {{ .Image }} depending on your templating
system) and add an explicit imagePullPolicy (e.g., imagePullPolicy: IfNotPresent
or Always) so image source and pull behavior are configurable at runtime; update
any README/values usage accordingly to ensure the new parameter is documented
and supplied.
- Around line 18-20: The manifest hardcodes namespace: test in the log-collector
job template; replace that literal with a parameterized template variable (e.g.,
use a template placeholder for the namespace in the same file) and ensure
callers that render this template supply the namespace (or provide a sensible
default), or if this YAML is only a test fixture add a clarifying comment above
the namespace field indicating it is intentionally fixed for tests; update any
code that renders or references this template to pass the new variable (the
field to change is namespace in the same resource that has generateName:
log-collector-job-).
🧹 Nitpick comments (8)
.gitignore (1)

143-143: Clarify intent: blanket rule makes specific patterns redundant.

The AI summary indicates that multiple specific JetBrains patterns should be replaced with the blanket .idea/ rule (line 143). However, the code still contains more specific patterns (e.g., .idea/**/mongoSettings.xml, .idea/replstate.xml, .idea/sonarlint/, etc.) on lines 149, 164, 167–168, 171, 186–188, 191, and 197.

Git's hierarchical ignore semantics mean that .idea/ already ignores all contents under .idea/, rendering the specific patterns redundant. Either the summary is incomplete, or these patterns should be removed for cleaner maintenance.

If the intent is to use only the blanket rule, consider removing the now-redundant specific patterns:

♻️ Optional cleanup to remove redundant patterns
# Mongo Explorer plugin
-.idea/**/mongoSettings.xml
-
# File-based project format
 *.iws

...

# Cursive Clojure plugin
-.idea/replstate.xml
-
# SonarLint plugin
-.idea/sonarlint/
-.idea/**/sonarlint/
-
# SonarQube Plugin
-.idea/**/sonarIssues.xml
-
...

# Markdown Navigator plugin
-.idea/**/markdown-navigator.xml
-.idea/**/markdown-navigator-enh.xml
-.idea/**/markdown-navigator/
-
# Cache file creation bug
-.idea/$CACHE_FILE$
-
...

# Azure Toolkit for IntelliJ plugin
-.idea/**/azureSettings.xml
-

Keep only .idea/ (line 143) and the plugin-specific comments for documentation.

Also applies to: 149-149, 164-164, 167-168, 171-171, 186-188, 191-191, 197-197

distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)

96-103: Consider quoting the port value for consistency.

Line 99 uses an unquoted port name health, while line 107 quotes it as "health". For consistency and clarity, consider quoting the port name in the liveness probe as well.

♻️ Proposed fix for consistency
           livenessProbe:
             httpGet:
               path: /healthz
-              port: health
+              port: "health"
             initialDelaySeconds: 15
             periodSeconds: 20
fault-remediation/pkg/annotation/annotation_test.go (1)

6-6: Consider standard testing package for assertions.

Per repository learnings, testify should be reserved for complex scenarios requiring richer diagnostics. While this test does verify multiple fields, consider using standard testing package assertions (t.Error, t.Errorf) for the equality checks.

Based on learnings, avoid introducing testify dependency for simple equality checks across the repository.

Also applies to: 106-106, 109-116

fault-remediation/pkg/initializer/init.go (1)

51-51: TODO comment should reference an issue.

Per coding guidelines, TODO comments should reference specific issues for tracking purposes.

As per coding guidelines, TODO comments should reference issues in Go code.

fault-remediation/main.go (1)

105-105: TODO comment should reference an issue.

Per coding guidelines, TODO comments should reference specific issues for tracking purposes. Also, add a space after //TODO for consistency with Go conventions.

As per coding guidelines, TODO comments should reference issues in Go code.

fault-remediation/pkg/annotation/annotation.go (1)

40-40: TODO comment should reference an issue.

Per coding guidelines, TODO comments should reference specific issues for tracking purposes.

As per coding guidelines, TODO comments should reference issues in Go code.

fault-remediation/pkg/reconciler/reconciler_test.go (1)

53-55: Consider using the eventId parameter in mock.

The RunLogCollectorJob mock ignores the eventId parameter when calling the mock function. This could mask issues in tests where the event ID is expected to be propagated correctly.

♻️ Suggested improvement
-func (m *MockK8sClient) RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error) {
-	return m.runLogCollectorJobFn(ctx, nodeName)
+func (m *MockK8sClient) RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error) {
+	return m.runLogCollectorJobFn(ctx, nodeName, eventId)
 }

And update the function signature:

-	runLogCollectorJobFn        func(ctx context.Context, nodeName string) (ctrl.Result, error)
+	runLogCollectorJobFn        func(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
fault-remediation/pkg/reconciler/reconciler.go (1)

474-482: Consider structured error types for metric labeling.

The error label determination relies on string matching of error messages, which is fragile if error messages change. Consider using structured error types or error codes for more robust metric labeling.

Example approach

Define error types in the eventutil package:

var (
    ErrMarshalFailed   = errors.New("marshal failed")
    ErrUnmarshalFailed = errors.New("unmarshal failed")
)

Then use errors.Is() for classification:

if errors.Is(err, eventutil.ErrMarshalFailed) {
    errorLabel = "marshal_error"
} else if errors.Is(err, eventutil.ErrUnmarshalFailed) {
    errorLabel = "unmarshal_doc_error"
}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5947ed5 and 81189df.

⛔ Files ignored due to path filters (1)
  • commons/go.sum is excluded by !**/*.sum
📒 Files selected for processing (29)
  • .gitignore
  • commons/go.mod
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/annotation.go
  • fault-remediation/pkg/reconciler/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/remediation.go
  • fault-remediation/pkg/reconciler/remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
💤 Files with no reviewable changes (8)
  • distros/kubernetes/nvsentinel/values-full.yaml
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • fault-remediation/pkg/reconciler/remediation_test.go
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-remediation/pkg/reconciler/annotation.go
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • fault-remediation/pkg/reconciler/remediation.go
  • fault-remediation/pkg/reconciler/annotation_test.go
✅ Files skipped from review due to trivial changes (1)
  • commons/go.mod
🚧 Files skipped from review as they are similar to previous changes (4)
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/remediation.go
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/main.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/crstatus/checker.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/annotation/annotation_test.go
🧠 Learnings (20)
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/annotation/annotation_test.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/crstatus/checker.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code

Applied to files:

  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/crstatus/checker.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

  • fault-remediation/main.go
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
🧬 Code graph analysis (7)
fault-remediation/pkg/crstatus/crstatus_test.go (1)
fault-remediation/pkg/crstatus/checker.go (1)
  • NewCRStatusChecker (34-44)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-35)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (3)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (12-15)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (9-11)
fault-remediation/pkg/reconciler/reconciler_test.go (3)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventData (12-15)
  • HealthEventDoc (6-9)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • NodeAnnotationManagerInterface (16-21)
  • RemediationStateAnnotation (24-26)
  • EquivalenceGroupState (29-35)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (9-11)
fault-remediation/pkg/annotation/annotation_test.go (2)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-35)
fault-remediation/pkg/annotation/annotation.go (1)
  • NodeAnnotationManager (16-18)
fault-remediation/main.go (2)
commons/pkg/auditlogger/roundtripper.go (1)
  • NewAuditingRoundTripper (42-47)
fault-remediation/pkg/initializer/init.go (2)
  • InitializationParams (40-45)
  • InitializeAll (52-152)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (27-44)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (28)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml (2)

15-20: Static analysis false positive: YAMLlint doesn't understand Go templates.

The YAMLlint syntax error is expected and can be safely ignored. This file is a Go text/template that will be rendered before YAML parsing, so the {{.ApiGroup}}, {{.Version}}, {{.NodeName}}, and {{.HealthEventID}} placeholders are intentional and correct.
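
To illustrate why the linter trips here, the following is a minimal sketch of rendering such a template with Go's text/template before any YAML parsing happens. The placeholder names come from the comment above; the TemplateData layout, the kind value, and the template body are assumptions made for illustration and are not the contents of rebootnode-template.yaml.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// TemplateData mirrors the placeholder names quoted in the review comment.
// The struct layout is an assumption for illustration.
type TemplateData struct {
	ApiGroup      string
	Version       string
	NodeName      string
	HealthEventID string
}

// exampleTemplate is a hypothetical stand-in for rebootnode-template.yaml.
const exampleTemplate = `apiVersion: {{.ApiGroup}}/{{.Version}}
kind: RebootNode
metadata:
  name: maintenance-{{.NodeName}}-{{.HealthEventID}}
`

// render executes the template and returns plain YAML text.
func render(data TemplateData) (string, error) {
	tmpl, err := template.New("rebootnode").Parse(exampleTemplate)
	if err != nil {
		return "", fmt.Errorf("parse template: %w", err)
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("execute template: %w", err)
	}
	return buf.String(), nil
}

func main() {
	out, err := render(TemplateData{
		ApiGroup:      "example.com",
		Version:       "v1alpha1",
		NodeName:      "test-node-1",
		HealthEventID: "68b000bd-9ce2-46f8-94f1-1e51532255f8",
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Only the rendered output is valid YAML, so linting the raw template file is expected to fail.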


18-18: No validation needed for DNS-1123 compliance.

The resource name maintenance-{{.NodeName}}-{{.HealthEventID}} is already compliant with Kubernetes DNS-1123 naming rules. NodeName is sourced from Kubernetes Node objects, which are pre-validated by the Kubernetes API server to comply with DNS-1123 rules. HealthEventID is generated as a UUID (e.g., 68b000bd-9ce2-46f8-94f1-1e51532255f8), which inherently uses only lowercase alphanumeric characters and hyphens, making it DNS-1123 compliant. The full resource name (example: maintenance-test-node-1-68b000bd-9ce2-46f8-94f1-1e51532255f8) is well under the 253-character limit and matches the DNS-1123 subdomain pattern. No sanitization is required.

Likely an incorrect or invalid review comment.
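
If a defensive check were ever wanted anyway, it could look roughly like the sketch below. The regular expression is the commonly cited DNS-1123 subdomain form and the 253-character limit is the one mentioned above; the helper names are hypothetical and not part of this PR. A check like this would also catch an ID that arrives in uppercase, which the UUID argument above assumes does not happen.

```go
package main

import (
	"fmt"
	"regexp"
)

// dns1123Subdomain matches lowercase alphanumerics, '-' and '.', where each
// dot-separated label starts and ends with an alphanumeric character.
var dns1123Subdomain = regexp.MustCompile(
	`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// maxSubdomainLen is the length limit for a DNS-1123 subdomain.
const maxSubdomainLen = 253

// isDNS1123Subdomain reports whether name fits the pattern and length limit.
func isDNS1123Subdomain(name string) bool {
	return len(name) <= maxSubdomainLen && dns1123Subdomain.MatchString(name)
}

// maintenanceName (hypothetical helper) builds a name in the same shape as
// the template: maintenance-<node>-<event id>.
func maintenanceName(nodeName, healthEventID string) string {
	return fmt.Sprintf("maintenance-%s-%s", nodeName, healthEventID)
}

func main() {
	name := maintenanceName("test-node-1", "68b000bd-9ce2-46f8-94f1-1e51532255f8")
	fmt.Println(name, isDNS1123Subdomain(name))
	fmt.Println(isDNS1123Subdomain("Maintenance_Node"))
}
```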

fault-remediation/pkg/metrics/metrics.go (2)

15-15: LGTM: Package rename improves organization.

Renaming from reconciler to metrics better reflects the package's purpose and aligns with the controller-runtime migration.


33-93: All callers have been properly updated to use exported metric names.

The metric refactoring from unexported to exported (e.g., totalEventsReceived → TotalEventsReceived) is complete throughout the fault-remediation module. All references in reconciler.go, remediation.go, and test files correctly use the new exported names with their respective label values.

fault-remediation/pkg/crstatus/checker.go (1)

28-32: LGTM: Controller-runtime client migration.

The migration from dynamic client to controller-runtime client aligns with modern Kubernetes controller patterns and improves type safety.

fault-remediation/pkg/remediation/templates/log-collector-job.yaml (1)

16-29: Job structure follows best practices.

The Job manifest structure is well-formed with appropriate settings:

  • generateName for unique job instances
  • ttlSecondsAfterFinished for automatic cleanup
  • restartPolicy: Never appropriate for batch jobs
  • Service account specified for proper RBAC

The main concerns are the hardcoded values flagged in other comments.

fault-remediation/pkg/crstatus/crstatus_test.go (1)

33-34: LGTM: Constructor call updated correctly.

The test has been properly updated to match the new NewCRStatusChecker signature, passing nil for the client parameter which is appropriate since this test focuses on the condition checking logic that doesn't require client operations.

fault-remediation/pkg/initializer/init.go (2)

85-99: Client initialization looks good.

The remediation client and Kubernetes client initialization properly uses error wrapping and follows the expected controller-runtime integration pattern.


40-56: All callers have been properly updated for the breaking API changes.

The single caller of InitializeAll in fault-remediation/main.go:141 has been correctly updated to pass all three required parameters: ctx, params, and mgr.GetClient(). The new signature is properly implemented and called.

fault-remediation/main.go (1)

97-167: Controller-runtime integration looks solid.

The setupCtrlRuntimeManagement function properly implements:

  • Auditing round-tripper integration
  • Health and readiness checks
  • Proper shutdown handling with deferred cleanup
  • Context-based lifecycle management
fault-remediation/pkg/annotation/annotation.go (1)

27-166: Implementation looks solid.

The annotation manager methods properly implement:

  • Error wrapping with context using fmt.Errorf
  • Structured logging with appropriate log levels
  • Safe node mutation using DeepCopy before patching
  • Proper cleanup (clearing annotation when no groups remain)
fault-remediation/pkg/annotation/annotation_interface.go (1)

10-35: Interface and type definitions look good.

The annotation interface and types are well-designed with:

  • Clear interface contract for annotation management
  • Appropriate JSON tags for serialization
  • Good documentation for exported items
  • Proper use of time.Time for timestamp tracking
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (2)

29-35: LGTM! Interface updates align with controller-runtime patterns.

The interface method signatures have been properly updated to:

  • Use the new events.HealthEventData type for health event handling
  • Return ctrl.Result from RunLogCollectorJob for controller-runtime integration
  • Adopt new annotation and status checker interfaces

These changes are consistent with the broader refactoring to controller-runtime-based orchestration.


37-50: LGTM! Well-documented template data structure.

The TemplateData struct is properly documented with clear field groupings for node/event data and CRD routing metadata. The structure supports template-driven remediation workflows introduced in this PR.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (6)

167-177: LGTM! Test infrastructure aligned with controller-runtime.

The global test variables are properly declared, including the new ctrlRuntimeClient obtained from the controller-runtime manager. This setup supports the migration to controller-runtime-based testing.


206-215: LGTM! Controller-runtime manager properly configured.

The manager setup correctly:

  • Disables metrics server (BindAddress: "0") for test isolation
  • Obtains the controller-runtime client via mgr.GetClient()
  • Uses the test environment configuration

This aligns with the broader migration to controller-runtime-based architecture.


300-327: LGTM! Test client creation aligned with new remediation API.

The helper function correctly:

  • Returns the public remediation.FaultRemediationClientInterface
  • Uses the new template configuration structure with TemplateFileName
  • Constructs the client with the controller-runtime client

These changes support template-driven remediation workflows.


752-753: LGTM! Metrics properly migrated to the metrics package.

The test correctly validates metric instrumentation using the new metrics package. Counter and histogram metrics are properly accessed and verified with appropriate label values.

Also applies to: 892-900


329-606: LGTM! Comprehensive test coverage for CR-based deduplication.

The test suite properly validates:

  • First event creates annotations and CRs
  • Subsequent events are deduplicated based on CR status
  • Failed CRs allow retry after cleanup
  • Cross-action deduplication within equivalence groups

Test structure follows best practices with descriptive names and clear assertions.

Based on coding guidelines.


378-378: Investigate undefined annotation2 reference in test assertions.

The code uses annotation2.AnnotationKey in 7 assertions (lines 378, 659, 781, 838, 880, 1227, 1335), but annotation2 is not imported in the imports section (lines 17-60) and no variable or constant declaration for it was found in the file. Verify that annotation2 is properly defined or imported; otherwise, the code will fail to compile.

fault-remediation/pkg/reconciler/reconciler_test.go (3)

116-131: LGTM! Mock annotation manager properly implements the interface.

The mock correctly implements the updated GetRemediationState signature with three return values. Returning nil for the node is acceptable for unit tests that don't require the node object.


393-399: LGTM! Tests properly convert between HealthEventData and HealthEventDoc.

The tests correctly create HealthEventData for internal use and wrap it into HealthEventDoc with an ID when calling performRemediation. This pattern aligns with the separation of concerns in the events package.

Also applies to: 449-456, 494-500


637-673: LGTM! Comprehensive log collector test coverage.

The test cases properly validate:

  • Success path with no errors
  • Failure path with error returns
  • Requeue scenarios with ctrl.Result{RequeueAfter: ...}

This ensures the log collector integration works correctly with controller-runtime patterns.

fault-remediation/pkg/reconciler/reconciler.go (5)

92-127: LGTM! Reconcile method properly structured with metrics and error handling.

The method correctly:

  • Instruments timing with EventHandlingDuration
  • Increments event counter
  • Handles parsing errors gracefully
  • Adds safety checks for nil pointers
  • Routes to appropriate handlers (cancellation vs remediation)

171-193: LGTM! Log collector method properly integrated with controller-runtime.

The method correctly:

  • Returns ctrl.Result to support requeue behavior
  • Checks configuration before executing
  • Propagates both result and error from the log collector job
  • Uses structured logging with appropriate context

196-245: LGTM! Remediation method properly handles state transitions and errors.

The method correctly:

  • Uses new events.HealthEventDoc type
  • Updates state labels before and after remediation
  • Combines multiple errors with errors.Join
  • Tracks failures with metrics
  • Ensures state labels reflect the outcome even on CR creation failure

The pattern of continuing to update state labels after CR creation errors is appropriate for maintaining consistent node state.
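
As a rough illustration of that pattern, the sketch below attempts CR creation, always runs the label update afterwards, and reports every failure through errors.Join. The remediate function, its callback parameters, and the sentinel errors are hypothetical stand-ins and do not reproduce the reconciler's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors used only for this illustration.
var (
	errCreateCR    = errors.New("create maintenance CR failed")
	errUpdateLabel = errors.New("update state label failed")
)

// remediate attempts CR creation, then always updates the state label, and
// returns every failure that occurred (nil if both steps succeeded).
func remediate(createCR, updateLabel func() error) error {
	var errs []error
	if err := createCR(); err != nil {
		errs = append(errs, fmt.Errorf("creating CR: %w", err))
	}
	// The label update runs even when CR creation failed, so the node's
	// state label still reflects the outcome.
	if err := updateLabel(); err != nil {
		errs = append(errs, fmt.Errorf("updating state label: %w", err))
	}
	return errors.Join(errs...)
}

func main() {
	err := remediate(
		func() error { return errCreateCR },
		func() error { return errUpdateLabel },
	)
	// Both causes remain discoverable through the joined error.
	fmt.Println(errors.Is(err, errCreateCR), errors.Is(err, errUpdateLabel))
}
```

Because errors.Join returns nil when it is given no non-nil errors, the success path needs no special casing.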


328-330: LGTM! Log collector positioning prevents duplicate jobs.

The log collector is now called only after verifying that a new CR should be created (shouldCreateCR check). This correctly prevents duplicate log-collector jobs when multiple events arrive for the same node, addressing the issue mentioned in the test comments.


395-459: LGTM! CR status checking properly implements deduplication logic.

The method correctly:

  • Retrieves configuration and state using the updated interfaces
  • Checks equivalence groups for existing CRs
  • Uses the status checker to determine if CR creation should be skipped
  • Removes failed/completed CR references from state to allow retries
  • Wraps errors with appropriate context using %w

)

-flag.StringVar(&kubeconfigPath, "kubeconfig-path", "", "path to kubeconfig file")
+flag.StringVar(&kubeconfigPath, "kubeconfig", "", "path to kubeconfig file")

⚠️ Potential issue | πŸ”΄ Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check all references to kubeconfigPath variable
rg -nP '\bkubeconfigPath\b' --type go

Repository: NVIDIA/NVSentinel

Length of output: 3067


🏁 Script executed:

cat -n fault-remediation/main.go | sed -n '50,120p'

Repository: NVIDIA/NVSentinel

Length of output: 2658


Remove unused kubeconfig flag or wire it to config loading.

The kubeconfigPath flag is defined at line 184 but never referenced in the code. Line 100 uses ctrl.GetConfigOrDie(), which respects the KUBECONFIG environment variable and in-cluster configuration automatically, but does not use the kubeconfigPath variable. This creates a misleading flag that users may set without effect. Either remove the flag since ctrl.GetConfigOrDie() handles standard mechanisms, or wire it to use clientcmd.BuildConfigFromFlags() if explicit kubeconfig path support is intended.

🤖 Prompt for AI Agents
In @fault-remediation/main.go at line 184, The kubeconfigPath flag is defined
but never used; either remove the flag or wire it into config loading. To wire
it, replace the direct call to ctrl.GetConfigOrDie() with logic that uses
clientcmd.BuildConfigFromFlags("", kubeconfigPath) when kubeconfigPath is
non-empty and falls back to ctrl.GetConfigOrDie() otherwise (ensuring you handle
and log BuildConfigFromFlags errors); alternatively, delete the flag declaration
(kubeconfigPath and flag.StringVar) so no unused flag remains.

@@ -0,0 +1,35 @@
package annotation

🛠️ Refactor suggestion | 🟠 Major

Package-level godoc missing.

While package-level godoc only needs to appear once per package, ensure one file in the annotation package (either this file or annotation.go) contains package-level documentation.

As per coding guidelines, package-level godoc is required for all Go packages.

🤖 Prompt for AI Agents
In @fault-remediation/pkg/annotation/annotation_interface.go at line 1, Add a
package-level godoc comment for the annotation package: place a clear,
one-paragraph comment above the "package annotation" declaration (in either
annotation_interface.go or annotation.go) describing the package purpose and
high-level behavior; ensure it follows Go doc conventions (complete sentence,
starts with "Package annotation ...") so godoc tools and linters recognize it.

@@ -0,0 +1,166 @@
package annotation

🛠️ Refactor suggestion | 🟠 Major

Add package-level godoc.

As per coding guidelines, package-level godoc is required for all Go packages.

📝 Suggested package documentation
+// Package annotation provides functionality for managing node remediation state
+// through Kubernetes node annotations. It enables tracking of ongoing remediation
+// actions across equivalence groups.
 package annotation
🤖 Prompt for AI Agents
In @fault-remediation/pkg/annotation/annotation.go at line 1, Add a
package-level godoc comment above the "package annotation" declaration that
briefly describes the purpose and responsibilities of the annotation package
(what it provides, key concepts, and any important usage notes); ensure the
comment is a complete sentence or two starting with "Package annotation ..." so
it follows Go doc conventions and tools like godoc and go vet will recognize it.

Comment on lines 34 to 44
 func NewCRStatusChecker(
-	dynamicClient dynamic.Interface,
-	restMapper *restmapper.DeferredDiscoveryRESTMapper,
+	client client.Client,
 	remediationActions map[string]config.MaintenanceResource,
 	dryRun bool,
 ) *CRStatusChecker {
 	return &CRStatusChecker{
-		dynamicClient:      dynamicClient,
-		restMapper:         restMapper,
+		client:             client,
 		remediationActions: remediationActions,
 		dryRun:             dryRun,
 	}
 }

⚠️ Potential issue | 🟑 Minor

Add godoc comment for exported constructor.

As per coding guidelines, function comments are required for all exported Go functions. Add a godoc comment describing the constructor's purpose and parameters.

πŸ“ Suggested godoc
+// NewCRStatusChecker creates a new CRStatusChecker with the provided client,
+// remediation action configuration, and dry-run mode setting.
 func NewCRStatusChecker(
 	client client.Client,
 	remediationActions map[string]config.MaintenanceResource,

Based on coding guidelines: Function comments required for all exported Go functions.

🤖 Prompt for AI Agents
In @fault-remediation/pkg/crstatus/checker.go around lines 34 - 44, Add a GoDoc
comment above the exported constructor NewCRStatusChecker that briefly describes
what CRStatusChecker does and the purpose of the constructor, and document the
parameters (client client.Client, remediationActions
map[string]config.MaintenanceResource, dryRun bool) and the returned
*CRStatusChecker; place the comment in standard godoc form starting with
"NewCRStatusChecker ..." so tools like godoc and linters recognize it.

Comment on lines +368 to +371
// TODO: ignoring error otherwise need to properly walk state transitions
_, _ = stateManager.UpdateNVSentinelStateNodeLabel(ctx, nodeName, statemanager.DrainSucceededLabelValue, false)

⚠️ Potential issue | 🟑 Minor

TODO comments should reference issues.

Multiple TODO comments appear throughout the test file without issue references. As per coding guidelines, TODO comments should reference issues for tracking.

Consider creating issues for:

  1. State transition handling (lines 368, 430, 488, 571)
  2. StateManager error behavior (lines 489, 572)
  3. Potential state machine bug (line 519)

Based on coding guidelines.

Also applies to: 430-431, 488-490, 519-520, 571-573

🤖 Prompt for AI Agents
In @fault-remediation/pkg/reconciler/reconciler_e2e_test.go around lines 368 -
369, Several TODO comments in reconciler_e2e_test.go lack issue references;
create separate tracker issues for (a) state transition handling, (b)
StateManager error behavior, and (c) potential state machine bug, then update
each TODO to reference the new issue ID. Specifically, for occurrences around
the UpdateNVSentinelStateNodeLabel usage (symbol:
stateManager.UpdateNVSentinelStateNodeLabel) and the labels like
statemanager.DrainSucceededLabelValue where errors are currently ignored, either
handle the returned error properly in the test or replace the ignore with a TODO
that references the created issue (e.g., TODO(issue-123): reason), and do the
same for other TODOs tied to state transitions and state manager behavior in the
test file.

healthEventStore datastore.HealthEventStore
config ReconcilerConfig
annotationManager NodeAnnotationManagerInterface
Config ReconcilerConfig

⚠️ Potential issue | 🟑 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if Config field is accessed externally
rg -n "\.Config\." --type go -g '!*_test.go' -g '!fault-remediation/pkg/reconciler/reconciler.go' fault-remediation/

Repository: NVIDIA/NVSentinel

Length of output: 43


🏁 Script executed:

#!/bin/bash
# Read the relevant section of the reconciler.go file to see struct definition and usage
cd fault-remediation && wc -l pkg/reconciler/reconciler.go

Repository: NVIDIA/NVSentinel

Length of output: 93


🏁 Script executed:

#!/bin/bash
# View the struct definition around line 66
cd fault-remediation && sed -n '55,95p' pkg/reconciler/reconciler.go

Repository: NVIDIA/NVSentinel

Length of output: 1606


🏁 Script executed:

#!/bin/bash
# Search for imports of FaultRemediationReconciler from other packages
rg "FaultRemediationReconciler" --type go -g '!*_test.go' -g '!fault-remediation/pkg/reconciler/reconciler.go'

Repository: NVIDIA/NVSentinel

Length of output: 459


🏁 Script executed:

#!/bin/bash
# Check if Config field is accessed in main.go
rg "\.Config" fault-remediation/main.go

Repository: NVIDIA/NVSentinel

Length of output: 43


🏁 Script executed:

#!/bin/bash
# Check if Config field is accessed in init.go
rg "\.Config" fault-remediation/pkg/initializer/init.go

Repository: NVIDIA/NVSentinel

Length of output: 152


🏁 Script executed:

#!/bin/bash
# Check if Watcher field is accessed to understand the pattern of public fields
rg "\.Watcher" fault-remediation/ --type go -g '!*_test.go'

Repository: NVIDIA/NVSentinel

Length of output: 880


Make Config field private to maintain encapsulation.

The Config field is exported but never accessed after initialization. Only config.RemediationClient.GetAnnotationManager() is extracted and stored in the private annotationManager field. Since the configuration is not accessed elsewhere, consider making it private (config instead of Config) to reduce the public API surface and prevent external packages from depending on internal configuration details.

🤖 Prompt for AI Agents
In @fault-remediation/pkg/reconciler/reconciler.go at line 66, The Reconciler's
exported field Config is unused outside initialization and should be made
private to reduce API surface: rename the struct field from Config to config
(type ReconcilerConfig) and update all internal references (e.g., in
constructors like NewReconciler/createReconciler and methods that currently use
r.Config) to r.config; ensure any tests or other package code that relied on the
exported field are updated or, if external access is required, expose only the
needed pieces (e.g., keep annotationManager private and provide specific
accessor methods instead of exposing the whole config).
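
A minimal sketch of the narrower surface this suggests: the configuration stays unexported and callers get a single-purpose accessor. The ReconcilerConfig fields, the constructor name, and the LogCollectorEnabled accessor are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// ReconcilerConfig is a reduced, hypothetical version of the real config.
type ReconcilerConfig struct {
	EnableLogCollector bool
	DryRun             bool
}

// FaultRemediationReconciler keeps its configuration unexported so other
// packages cannot depend on internal configuration details.
type FaultRemediationReconciler struct {
	config ReconcilerConfig
}

// NewFaultRemediationReconciler (hypothetical constructor) stores the config.
func NewFaultRemediationReconciler(cfg ReconcilerConfig) *FaultRemediationReconciler {
	return &FaultRemediationReconciler{config: cfg}
}

// LogCollectorEnabled exposes only the one setting callers need.
func (r *FaultRemediationReconciler) LogCollectorEnabled() bool {
	return r.config.EnableLogCollector
}

func main() {
	r := NewFaultRemediationReconciler(ReconcilerConfig{EnableLogCollector: true})
	fmt.Println(r.LogCollectorEnabled())
}
```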

// limitations under the License.

package reconciler
package remediation

⚠️ Potential issue | 🟑 Minor

Add package-level documentation.

The package declaration lacks a package-level godoc comment. As per coding guidelines, all Go packages require package-level documentation.

πŸ“ Suggested package documentation
+// Package remediation provides interfaces and types for fault remediation operations,
+// including maintenance resource creation, log collection, and node annotation management.
 package remediation

Based on coding guidelines.

🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/fault_remediation_client_interface.go at
line 15, The package declaration "package remediation" is missing a
package-level godoc comment; add a concise package comment immediately above the
package line describing the purpose and responsibility of the remediation
package (e.g., what it provides, key abstractions or behavior) so the package
"remediation" has proper package-level documentation per Go guidelines.

Comment on lines +18 to +20
metadata:
generateName: log-collector-job-
namespace: test

🛠️ Refactor suggestion | 🟠 Major

Hardcoded namespace should be parameterized.

The namespace is hardcoded to test. If this template is used at runtime to create Jobs, the namespace should be parameterized. If this is only a test fixture, consider adding a comment to clarify.

♻️ Suggested parameterization approach

If this is a Go template, replace with a template variable:

 metadata:
   generateName: log-collector-job-
-  namespace: test
+  namespace: {{ .Namespace }}

Or if this is intended only for testing, add a comment:

 metadata:
   generateName: log-collector-job-
+  # Note: This is a test fixture; namespace is intentionally hardcoded
   namespace: test
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
metadata:
generateName: log-collector-job-
namespace: test
metadata:
generateName: log-collector-job-
namespace: {{ .Namespace }}
🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/templates/log-collector-job.yaml around
lines 18 - 20, The manifest hardcodes namespace: test in the log-collector job
template; replace that literal with a parameterized template variable (e.g., use
a template placeholder for the namespace in the same file) and ensure callers
that render this template supply the namespace (or provide a sensible default),
or if this YAML is only a test fixture add a clarifying comment above the
namespace field indicating it is intentionally fixed for tests; update any code
that renders or references this template to pass the new variable (the field to
change is namespace in the same resource that has generateName:
log-collector-job-).

Comment on lines +30 to +32
containers:
- name: log-collector
image: test:test

🛠️ Refactor suggestion | 🟠 Major

Hardcoded image should be parameterized and add imagePullPolicy.

The container image is hardcoded to test:test. If this template is used at runtime, the image should be parameterized. Additionally, consider adding an imagePullPolicy for better control over image pulling behavior.

♻️ Suggested parameterization

If this is a Go template:

       containers:
         - name: log-collector
-          image: test:test
+          image: {{ .Image }}
+          imagePullPolicy: {{ .ImagePullPolicy }}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
containers:
- name: log-collector
image: test:test
containers:
- name: log-collector
image: {{ .Image }}
imagePullPolicy: {{ .ImagePullPolicy }}
🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/templates/log-collector-job.yaml around
lines 30 - 32, The container spec for the "log-collector" container currently
hardcodes image: test:test; change it to accept a template parameter (e.g.,
replace the literal with a template/variable like {{ .Values.logCollector.image
}} or {{ .Image }} depending on your templating system) and add an explicit
imagePullPolicy (e.g., imagePullPolicy: IfNotPresent or Always) so image source
and pull behavior are configurable at runtime; update any README/values usage
accordingly to ensure the new parameter is documented and supplied.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

🤖 Fix all issues with AI agents
In @fault-remediation/pkg/annotation/annotation.go:
- Line 40: The TODO comment "// TODO: maybe split this up so it's not returning
both node and state" is not actionable; replace it with either a linked issue
TODO (e.g., "// TODO(#1234): consider splitting ReturnNodeAndState into two
functions to avoid returning both node and state") or remove it if no issue
exists; locate the comment near the function that returns both node and state
(the function that currently returns node and state in
pkg/annotation/annotation.go) and update the comment to reference the issue
number or change it to a concrete task with owner/ETA, ensuring the TODO format
matches the repository convention (issue number or shortlink).
- Around line 42-60: After unmarshalling into RemediationStateAnnotation, ensure
state.EquivalenceGroups is non-nil to avoid panics: if json.Unmarshal succeeds
but state.EquivalenceGroups == nil, initialize it with
make(map[string]EquivalenceGroupState). Apply the same nil-check/initialization
in the other unmarshalling spot noted (also around lines 72-78). This prevents
UpdateRemediationState or any assignment into EquivalenceGroups from causing a
nil-map panic.
- Around line 63-95: UpdateRemediationState (and likewise RemoveGroupFromState)
currently performs a read-modify-patch that can lose concurrent updates; wrap
the Get→mutate→Patch sequence in a retry.RetryOnConflict loop so on a conflict
you re-read the node/state and retry the mutation, using the
node.DeepCopy()→client.MergeFrom(patchBase) approach inside the retry body;
ensure you return errors from the retry body directly (not wrapped) so
RetryOnConflict can detect conflicts and retry, and only wrap the final returned
error if needed after the retry loop completes.
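
The nil-map guard requested for lines 42-60 could look roughly like the sketch below. It uses only the standard library; the struct fields, JSON keys, and the `parseState` helper are assumptions made for illustration and are not taken from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EquivalenceGroupState and RemediationStateAnnotation are simplified
// stand-ins; field names and JSON keys are assumptions for this sketch.
type EquivalenceGroupState struct {
	MaintenanceCR string `json:"maintenanceCR"`
}

type RemediationStateAnnotation struct {
	EquivalenceGroups map[string]EquivalenceGroupState `json:"equivalenceGroups"`
}

// parseState (hypothetical helper) decodes the annotation value and
// guarantees a non-nil map, so later writes cannot hit a nil-map panic.
func parseState(raw string) (RemediationStateAnnotation, error) {
	var state RemediationStateAnnotation
	if err := json.Unmarshal([]byte(raw), &state); err != nil {
		return state, fmt.Errorf("unmarshal remediation state: %w", err)
	}
	if state.EquivalenceGroups == nil {
		state.EquivalenceGroups = make(map[string]EquivalenceGroupState)
	}
	return state, nil
}

func main() {
	// Both an empty object and an explicit null leave the map nil after Unmarshal.
	for _, raw := range []string{`{}`, `{"equivalenceGroups":null}`} {
		state, err := parseState(raw)
		if err != nil {
			panic(err)
		}
		// Safe: the map was initialized by parseState.
		state.EquivalenceGroups["restart"] = EquivalenceGroupState{MaintenanceCR: "cr-1"}
		fmt.Println(len(state.EquivalenceGroups))
	}
}
```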

In @fault-remediation/pkg/metrics/metrics.go:
- Line 30: The TODO comment in metrics.go ("//TODO: evaluate and remove
redundant metrics with ctrl-runtime defaults") must reference an issue; update
that comment to include the tracker reference (e.g., issue number or URL) and a
short context so it follows the Go comment guidelineβ€”replace the plain TODO with
a TODO that includes the issue id (or link) and the same brief action so
reviewers can find the discussion.

In @fault-remediation/pkg/reconciler/reconciler.go:
- Around line 328-330: The current handling of r.runLogCollector(ctx,
healthEvent, healthEventWithStatus.ID) wraps err regardless of whether err is
nil, producing messages like "<nil>" when result is non-zero but err==nil;
change the logic to first check if err != nil and return result,
fmt.Errorf("error running log collector: %w", err) in that branch, and otherwise
if !result.IsZero() return result, nil so a non-zero result is propagated
without a spurious wrapped nil error. Ensure you update the code block around
the call to r.runLogCollector and use the existing result and err variables.

In @fault-remediation/pkg/remediation/fault_remediation_client_interface.go:
- Around line 29-35: Add GoDoc comments: write a descriptive comment above the
exported type FaultRemediationClientInterface explaining its purpose and
intended use, then add short GoDoc comments for each exported method
(CreateMaintenanceResource, RunLogCollectorJob, GetAnnotationManager,
GetStatusChecker, GetConfig) describing what they do, their key parameters and
return values; ensure each comment begins with the exact method/type name to
satisfy golint/go vet requirements.

In @fault-remediation/pkg/remediation/remediation.go:
- Around line 465-500: checkLogCollectorComplete can panic when computing
duration because job.Status.StartTime (and possibly job.Status.CompletionTime)
can be nil; before calling job.Status.StartTime.Time or
job.Status.CompletionTime.Sub(...) add a nil-check: only compute and call
metrics.LogCollectorJobDuration.WithLabelValues(...).Observe(duration) when both
job.Status.StartTime != nil and job.Status.CompletionTime != nil, otherwise skip
observing duration (you may still increment metrics.LogCollectorJobs count if
desired). Update the code in checkLogCollectorComplete to guard the duration
calculation and observation (referencing job.Status.StartTime,
job.Status.CompletionTime, metrics.LogCollectorJobs,
metrics.LogCollectorJobDuration, and jobMetricsAlreadyCountedAnnotation) to
avoid the nil pointer dereference.

In @fault-remediation/pkg/remediation/templates/rebootnode-template.yaml:
- Around line 15-20: The YAML template contains unquoted Go template
placeholders (e.g., {{.ApiGroup}}, {{.Version}}, metadata.name:
maintenance-{{.NodeName}}-{{.HealthEventID}}, and spec.nodeName: {{.NodeName}})
which make the file fail static YAML linters; fix by wrapping each placeholder
value in quotes so the template is valid YAML while rendered output remains the
same (quote the ApiGroup/Version values and the metadata.name and spec.nodeName
placeholders), and do not add metadata.namespace since RebootNode is
cluster-scoped.
🧹 Nitpick comments (13)
fault-remediation/pkg/events/health_event.go (1)

5-15: json:",inline" is likely redundant/confusing for Go encoding/json. Embedded structs are already inlined by default; the inline option isn't a standard encoding/json directive. Consider dropping the JSON tag (or confirm this is intentional for a specific serializer).
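
To illustrate the point about embedding, the following sketch marshals two hypothetical structs with `encoding/json`, one carrying the `inline` option and one without any tag. The type and field names are assumptions and do not mirror the PR's `HealthEventDoc`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HealthEvent is a hypothetical embedded payload.
type HealthEvent struct {
	NodeName string `json:"nodeName"`
}

// withTag carries the non-standard "inline" option on the embedded struct.
type withTag struct {
	ID          string `json:"id"`
	HealthEvent `json:",inline"`
}

// withoutTag relies on encoding/json's default flattening of embedded structs.
type withoutTag struct {
	ID string `json:"id"`
	HealthEvent
}

// encode marshals v and panics on failure to keep the example short.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	ev := HealthEvent{NodeName: "n1"}
	// Both forms flatten the embedded fields into the parent object.
	fmt.Println(encode(withTag{ID: "e1", HealthEvent: ev}))
	fmt.Println(encode(withoutTag{ID: "e1", HealthEvent: ev}))
}
```

Other serializers may give `inline` a meaning of their own, so the tag is only redundant if `encoding/json` is the sole consumer.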

fault-remediation/pkg/annotation/annotation_test.go (1)

15-20: Test style: prefer require for fatal checks + more descriptive test names. E.g., require.NoError before dereferencing results, and rename to TestGetRemediationState_NodeMissing_ReturnsError-style per guidelines.

Also applies to: 122-207

fault-remediation/pkg/remediation/remediation_test.go (1)

346-541: Consider envtest for deeper integration testing.

While the current fake client approach is appropriate for unit testing, consider adding integration tests using envtest for more realistic controller behavior validation, especially for complex job lifecycle scenarios.

Based on learnings, use envtest for testing Kubernetes controllers instead of fake clients.

fault-remediation/pkg/remediation/remediation.go (5)

33-41: Missing package-level godoc.

As per coding guidelines, package-level godoc is required for all Go packages.

📝 Add package documentation
+// Package remediation provides fault remediation client functionality for creating
+// maintenance resources and orchestrating log collector jobs in Kubernetes clusters.
 package remediation

141-147: Missing godoc for exported functions.

As per coding guidelines, function comments are required for all exported Go functions.

📝 Add function documentation
+// GetAnnotationManager returns the node annotation manager used for remediation state tracking.
 func (c *FaultRemediationClient) GetAnnotationManager() annotation.NodeAnnotationManagerInterface {
 	return c.annotationManager
 }

+// GetStatusChecker returns the CR status checker used for determining CR completion status.
 func (c *FaultRemediationClient) GetStatusChecker() crstatus.CRStatusCheckerInterface {
 	return c.statusChecker
 }

149-245: CreateMaintenanceResource has comprehensive logic with proper error handling.

The function correctly:

  • Handles dry-run mode early
  • Validates node existence before CR creation
  • Handles AlreadyExists error gracefully
  • Updates annotation state after CR creation

The nolint: cyclop comment should reference a tracking issue per coding guidelines.

📝 Add issue reference to TODO comment
-// nolint: cyclop // todo
+// nolint: cyclop // TODO(issue-XXX): refactor to reduce cyclomatic complexity
 func (c *FaultRemediationClient) CreateMaintenanceResource(

358-421: Log collector job creation has potential issues.

  1. Line 408: The error message includes the full job list which could be verbose in logs
  2. The job labels are set after unmarshaling but the job name from the manifest is used, which could cause collisions
📝 Improve error message and consider unique job naming
 	if len(existingJobs.Items) > 1 {
 		return batchv1.Job{},
 			ctrl.Result{},
-			fmt.Errorf("expecting zero or one log collector job per event per node, found %v", existingJobs.Items)
+			fmt.Errorf("expecting zero or one log collector job per event per node, found %d jobs", len(existingJobs.Items))
 	}

590-592: Missing godoc for exported function GetConfig.

📝 Add function documentation
+// GetConfig returns the parsed remediation TOML configuration.
 func (c *FaultRemediationClient) GetConfig() *config.TomlConfig {
 	return &c.remediationConfig
 }
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)

17-54: Import ordering could be improved.

The import at line 23 (sigs.k8s.io/controller-runtime/pkg/client) is placed incorrectly between standard library imports. Consider grouping imports properly.

📝 Fix import grouping
 import (
 	"context"
-
 	"log"
 	"os"
 	"path/filepath"
-	"sigs.k8s.io/controller-runtime/pkg/client"
 	"sync"
 	"testing"
 	"time"

 	"github.com/prometheus/client_golang/prometheus"
 	dto "github.com/prometheus/client_model/go"
 	"github.com/stretchr/testify/assert"
 	"github.com/stretchr/testify/require"
 	corev1 "k8s.io/api/core/v1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/runtime/schema"
 	"k8s.io/client-go/dynamic"
 	"k8s.io/client-go/kubernetes"
 	"k8s.io/client-go/kubernetes/scheme"
 	"k8s.io/client-go/rest"
 	ctrl "sigs.k8s.io/controller-runtime"
+	"sigs.k8s.io/controller-runtime/pkg/client"
 	"sigs.k8s.io/controller-runtime/pkg/envtest"

370-371: TODO comments should reference tracking issues.

Multiple TODO comments exist without issue references. As per coding guidelines, TODO comments should reference issues in Go code.

Consider creating tracking issues and updating these TODOs:

  • Line 370-371: State transition handling
  • Line 432-433, 490-492, 521-522, 573-575: State manager error handling

Also applies to: 432-433, 490-492, 521-522, 573-575, 653-654, 708-709


1124-1125: Using time.Sleep for synchronization is fragile.

At line 1124, time.Sleep(500 * time.Millisecond) is used to wait for event processing. Consider using assert.Eventually for more reliable async waiting.

📝 Replace sleep with Eventually
-	// Allow time for second event to be processed (should be deduplicated)
-	time.Sleep(500 * time.Millisecond)
+	// Wait for second event to be processed (should be deduplicated)
+	time.Sleep(100 * time.Millisecond) // Brief pause for event to be queued
fault-remediation/pkg/reconciler/reconciler_test.go (1)

17-39: Import ordering needs attention.

Imports at lines 21-24 are placed incorrectly between standard library and third-party imports.

📝 Fix import grouping
 import (
 	"context"
 	"errors"
 	"fmt"
+	"testing"
+	"time"
+
+	"github.com/google/uuid"
+	"github.com/stretchr/testify/assert"
+	corev1 "k8s.io/api/core/v1"
+	"k8s.io/utils/ptr"
+	ctrl "sigs.k8s.io/controller-runtime"
+
 	"github.com/nvidia/nvsentinel/fault-remediation/pkg/annotation"
 	"github.com/nvidia/nvsentinel/fault-remediation/pkg/events"
-	corev1 "k8s.io/api/core/v1"
-	ctrl "sigs.k8s.io/controller-runtime"
-	"testing"
-	"time"
-
-	"github.com/google/uuid"
-	"github.com/stretchr/testify/assert"
-	"k8s.io/utils/ptr"
-
 	"github.com/nvidia/nvsentinel/commons/pkg/statemanager"
fault-remediation/pkg/reconciler/reconciler.go (1)

277-356: handleRemediationEvent has proper error propagation.

The function correctly:

  • Uses errors.Join at line 339 to combine errors
  • Propagates requeue results from log collector at line 328-330
  • Updates metrics appropriately

The nolint: cyclop comment should reference a tracking issue.

📝 Add issue reference
-// nolint: cyclop // todo
+// nolint: cyclop // TODO(issue-XXX): refactor to reduce cyclomatic complexity
 func (r *FaultRemediationReconciler) handleRemediationEvent(
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 81189df and 48914e5.

⛔ Files ignored due to path filters (1)
  • commons/go.sum is excluded by !**/*.sum
📒 Files selected for processing (29)
  • .gitignore
  • commons/go.mod
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-remediation/main.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_interface.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/crstatus/checker.go
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/annotation.go
  • fault-remediation/pkg/reconciler/annotation_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/remediation.go
  • fault-remediation/pkg/reconciler/remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/remediation/templates/rebootnode-template.yaml
💤 Files with no reviewable changes (8)
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • fault-remediation/pkg/reconciler/remediation_test.go
  • distros/kubernetes/nvsentinel/values-full.yaml
  • fault-remediation/pkg/reconciler/remediation.go
  • fault-remediation/pkg/reconciler/annotation.go
  • distros/kubernetes/nvsentinel/values.yaml
  • fault-remediation/pkg/reconciler/annotation_test.go
🚧 Files skipped from review as they are similar to previous changes (7)
  • fault-remediation/pkg/crstatus/crstatus_interface.go
  • fault-remediation/main.go
  • fault-remediation/pkg/remediation/templates/log-collector-job.yaml
  • fault-remediation/pkg/annotation/annotation_interface.go
  • commons/go.mod
  • fault-remediation/pkg/crstatus/crstatus_test.go
  • .gitignore
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/crstatus/checker.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧠 Learnings (19)
📚 Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.

Applied to files:

  • fault-remediation/pkg/events/health_event.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/crstatus/checker.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/crstatus/checker.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

  • fault-remediation/pkg/events/health_event.go
  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
📚 Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/remediation/fault_remediation_client_interface.go
  • fault-remediation/pkg/remediation/remediation.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use inline comments for complex logic only in Go code

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/remediation/remediation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/pkg/annotation/annotation.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • fault-remediation/pkg/annotation/annotation_test.go
  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/remediation/remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/metrics/metrics.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2026-01-07T09:54:49.335Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/crstatus/checker.go
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.

Applied to files:

  • fault-remediation/pkg/initializer/init.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧬 Code graph analysis (10)
fault-remediation/pkg/annotation/annotation.go (1)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-35)
fault-remediation/pkg/annotation/annotation_test.go (2)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • RemediationStateAnnotation (24-26)
  • AnnotationKey (12-12)
  • EquivalenceGroupState (29-35)
fault-remediation/pkg/annotation/annotation.go (1)
  • NodeAnnotationManager (16-18)
fault-remediation/pkg/remediation/remediation_test.go (2)
fault-remediation/pkg/config/config.go (1)
  • Template (47-50)
fault-remediation/pkg/remediation/remediation.go (2)
  • NewRemediationClient (56-114)
  • FaultRemediationClient (43-54)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (3)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventData (12-15)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (9-11)
fault-remediation/pkg/remediation/remediation.go (1)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (9-11)
fault-remediation/pkg/reconciler/reconciler_test.go (4)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventData (12-15)
  • HealthEventDoc (6-9)
fault-remediation/pkg/annotation/annotation_interface.go (3)
  • NodeAnnotationManagerInterface (16-21)
  • RemediationStateAnnotation (24-26)
  • EquivalenceGroupState (29-35)
fault-remediation/pkg/crstatus/crstatus_interface.go (1)
  • CRStatusCheckerInterface (9-11)
data-models/pkg/protos/health_event.pb.go (2)
  • RecommendedAction_RESTART_BM (48-48)
  • RecommendedAction_NONE (43-43)
fault-remediation/pkg/reconciler/reconciler.go (5)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • FaultRemediationClientInterface (29-35)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • NodeAnnotationManagerInterface (16-21)
fault-remediation/pkg/metrics/metrics.go (5)
  • EventHandlingDuration (62-68)
  • TotalEventsReceived (33-38)
  • TotalUnsupportedRemediationActions (53-59)
  • ProcessingErrors (46-52)
  • EventsProcessed (39-45)
fault-remediation/pkg/events/health_event.go (2)
  • HealthEventDoc (6-9)
  • HealthEventData (12-15)
data-models/pkg/model/health_event_extentions.go (1)
  • HealthEventWithStatus (53-57)
fault-remediation/pkg/initializer/init.go (3)
fault-remediation/pkg/remediation/remediation.go (1)
  • NewRemediationClient (56-114)
commons/pkg/statemanager/statemanager.go (2)
  • NewStateManager (206-210)
  • StateManager (197-200)
store-client/pkg/datastore/config.go (1)
  • LoadDatastoreConfig (27-44)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (7)
fault-remediation/pkg/reconciler/reconciler.go (2)
  • FaultRemediationReconciler (61-69)
  • ReconcilerConfig (47-56)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (41-43)
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)
  • FaultRemediationClientInterface (29-35)
commons/pkg/statemanager/statemanager.go (3)
  • NewStateManager (206-210)
  • StateManager (197-200)
  • RemediatingLabelValue (171-171)
fault-remediation/pkg/events/health_event.go (1)
  • HealthEventDoc (6-9)
fault-remediation/pkg/annotation/annotation_interface.go (1)
  • AnnotationKey (12-12)
fault-remediation/pkg/metrics/metrics.go (5)
  • TotalEventsReceived (33-38)
  • EventHandlingDuration (62-68)
  • EventsProcessed (39-45)
  • TotalUnsupportedRemediationActions (53-59)
  • ProcessingErrors (46-52)
fault-remediation/pkg/crstatus/checker.go (1)
fault-remediation/pkg/config/config.go (1)
  • MaintenanceResource (27-44)
🪛 YAMLlint (1.37.1)
fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

[error] 15-15: syntax error: expected , but found ''

(syntax)

🔇 Additional comments (32)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)

91-111: Verify probes/ports match the controller-runtime bind addresses. The chart now always probes /healthz and /readyz on the health port; please confirm the binary serves these endpoints on ((.Values.global).healthPort)|default 9440 in all deployments (and that metrics remain on the metrics port).

fault-remediation/pkg/annotation/annotation_test.go (1)

3-13: Tests appropriately use fake client for unit testing a utility manager. The guideline to prefer envtest applies to testing Kubernetes controllers (reconcilers); NodeAnnotationManager is a client wrapper utility. The actual controller behavior is tested with envtest in reconciler_e2e_test.go. These unit tests with fake client are appropriate and follow standard Go testing patterns.

Minor improvement: Consider naming tests following TestFunctionName_Scenario_ExpectedBehavior format (e.g., TestGetRemediationState_NodeNotFound_ReturnsError instead of TestGetRemediationState).

fault-remediation/pkg/metrics/metrics.go (1)

33-93: LGTM! Metrics exported for cross-package usage.

The metric variables have been appropriately exported (capitalized) to support the new controller-runtime-based architecture. The metric definitions, labels, and help text remain unchanged, ensuring backward compatibility.

fault-remediation/pkg/initializer/init.go (2)

52-56: LGTM! Improved dependency injection pattern.

The updated signature accepts a controller-runtime client, enabling better integration with the controller-runtime manager and improving testability through dependency injection.


85-102: LGTM! Clear initialization flow with proper error handling.

The initialization sequence is logical:

  1. Remediation client created first (with the controller-runtime client)
  2. Kubernetes client created from the provided config
  3. StateManager wired with the Kubernetes client

All error paths are properly wrapped with context.

fault-remediation/pkg/remediation/remediation_test.go (3)

29-97: LGTM! Comprehensive test coverage for client initialization.

The table-driven test covers key scenarios: non-existent templates, existing templates, and dry-run mode. The test validates both error paths and successful initialization, including proper dry-run mode propagation.


99-193: LGTM! Thorough validation of template configuration edge cases.

These end-to-end tests effectively validate error handling for missing and empty template file configurations, ensuring robust initialization failure paths.


195-344: LGTM! Well-structured test for CR creation flows.

The test comprehensively covers:

  • Error handling when nodes don't exist
  • Successful CR creation
  • Dry-run behavior validation
  • GVK verification
fault-remediation/pkg/remediation/fault_remediation_client_interface.go (1)

37-50: LGTM! Well-documented template data structure.

The TemplateData struct is clearly documented and logically groups node/event data with CRD routing metadata for template rendering.

fault-remediation/pkg/crstatus/checker.go (3)

28-44: LGTM! Clean migration to controller-runtime client.

The struct and constructor have been appropriately refactored to use the controller-runtime client, simplifying the implementation by removing the need for dynamic client and REST mapper.


46-76: LGTM! Well-documented and robust status checking logic.

The refactored ShouldSkipCRCreation method:

  • Uses controller-runtime's client.Get with ObjectKey for cleaner resource retrieval
  • Includes proper GVK context in error logging
  • Handles dry-run mode appropriately
  • Maintains the same logical behavior with improved clarity

78-113: LGTM! Condition checking logic preserved.

The condition evaluation logic remains intact and correct, with only the parameter name updated for clarity (cr → obj).

fault-remediation/pkg/remediation/remediation.go (6)

1-31: Package structure and imports look good.

The package is well-organized with appropriate imports for Kubernetes API interactions, controller-runtime, and internal dependencies. The use of sigs.k8s.io/yaml for YAML handling is appropriate for Kubernetes resources.


43-54: FaultRemediationClient struct is well-designed.

The struct properly encapsulates all dependencies needed for remediation operations. The use of interfaces (annotation.NodeAnnotationManagerInterface) enables testability.


56-114: Constructor validates configuration appropriately.

The NewRemediationClient function performs proper validation of template configuration and preloads templates at startup, which is good for fail-fast behavior. The dry-run mode handling is correct.


502-545: checkLogCollectorFailed correctly handles nil CompletionTime.

Good defensive coding at lines 530-534 to handle the case where CompletionTime might be nil for failed jobs.


547-588: Timeout parsing silently falls back to default on error.

The warning log at line 558 is good, but consider whether a misconfigured timeout should be a hard error in production.


116-139: Function loadAndParseTemplate is non-exported and does not require godoc per guidelines.

Regarding the path traversal concern: the function is safe because templateMountPath originates from Kubernetes ConfigMap mounts (trusted) and fileName from the TOML configuration file. Both are validated during initialization: config validation at lines 112–131 in config.go already checks that the constructed template path exists via os.Stat() before templates are loaded in NewRemediationClient().

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)

181-267: TestMain setup is comprehensive and uses envtest correctly.

Good use of envtest for testing Kubernetes controllers as per coding guidelines. The manager setup with disabled metrics binding (BindAddress: "0") is appropriate for tests.


301-329: createTestRemediationClient helper is well-structured.

The test helper properly configures the remediation client with test-appropriate settings.


731-939: TestFullReconcilerWithMockedMongoDB_E2E is comprehensive.

This test covers the complete flow including:

  • CR creation and deduplication
  • Event processing and metrics tracking
  • Annotation management
  • Unquarantine event handling

The use of assert.Eventually for async operations is appropriate.

fault-remediation/pkg/reconciler/reconciler_test.go (5)

41-76: MockK8sClient properly implements updated interface.

The mock correctly implements the FaultRemediationClientInterface with the new signatures including ctrl.Result returns.


112-131: MockNodeAnnotationManager updated correctly.

The mock now returns (*annotation.RemediationStateAnnotation, *corev1.Node, error) matching the interface. The second return value (*corev1.Node) is consistently returned as nil which is appropriate for these unit tests.


200-244: TestNewReconciler uses table-driven tests appropriately.

Good use of table-driven tests as per coding guidelines. The test validates both dry-run and non-dry-run scenarios.


572-609: TestRunLogCollectorOnNoneActionWhenEnabled tests NONE action handling.

The test correctly verifies that log collector is invoked for NONE actions when enabled. The manual simulation of the Start loop behavior at lines 604-606 is appropriate for unit testing.


611-676: TestRunLogCollectorJobErrorScenarios covers requeue scenarios.

The test at lines 637-643 tests the requeue result case which aligns with the new ctrl.Result return type.

fault-remediation/pkg/reconciler/reconciler.go (6)

37-45: Imports properly updated for new package structure.

The imports correctly reference the new annotation, events, metrics, and remediation packages.


170-193: runLogCollector properly returns ctrl.Result for requeue support.

The updated signature (ctrl.Result, error) allows the caller to handle requeue scenarios from the log collector job.


195-245: performRemediation error handling improved with errors.Join.

The use of errors.Join at line 237 properly aggregates multiple errors. Note that when createMaintenanceResourceError is non-nil but the label update succeeds, the function falls through to line 241 and still returns that error, which is the correct behavior.


461-513: parseHealthEvent properly wraps errors and handles resume token.

The error handling correctly:

  • Categorizes errors with appropriate labels for metrics
  • Marks events as processed even on parse failure to prevent infinite retries
  • Returns wrapped errors with context

540-569: SetupWithManager correctly integrates with controller-runtime.

The typed channel source and handler setup is correct for controller-runtime integration. The watcher is started before setting up the controller which ensures events are captured.


61-69: The Config field exposure is intentional and necessary for testing.

The field is actively accessed in test files (reconciler_test.go) for test setup and verification, specifically to reach RemediationClient and EnableLogCollector. Exposing Config is justified and follows standard Go testing patterns where tests need direct access to configure and validate behavior. No change needed.

Likely an incorrect or invalid review comment.

	if err != nil {
		return nil, node, fmt.Errorf("failed to get node %s: %w", nodeName, err)
	}
	// TODO: maybe split this up so it's not returning both node and state

⚠️ Potential issue | 🟡 Minor

TODO should reference an issue. Please replace with an issue-linked TODO (or remove if not actionable). Based on learnings / coding guidelines.

🤖 Prompt for AI Agents
In @fault-remediation/pkg/annotation/annotation.go at line 40, The TODO comment
"// TODO: maybe split this up so it's not returning both node and state" is not
actionable; replace it with either a linked issue TODO (e.g., "// TODO(#1234):
consider splitting ReturnNodeAndState into two functions to avoid returning both
node and state") or remove it if no issue exists; locate the comment near the
function that returns both node and state (the function that currently returns
node and state in pkg/annotation/annotation.go) and update the comment to
reference the issue number or change it to a concrete task with owner/ETA,
ensuring the TODO format matches the repository convention (issue number or
shortlink).

Comment on lines +42 to +60
	annotationValue, exists := node.Annotations[AnnotationKey]
	if !exists {
		// No annotation means no active remediations
		return &RemediationStateAnnotation{
			EquivalenceGroups: make(map[string]EquivalenceGroupState),
		}, node, nil
	}

	var state RemediationStateAnnotation
	if err = json.Unmarshal([]byte(annotationValue), &state); err != nil {
		slog.Error("Failed to unmarshal annotation", "node", nodeName, "error", err)
		// Return empty state if unmarshal fails
		return &RemediationStateAnnotation{
			EquivalenceGroups: make(map[string]EquivalenceGroupState),
		}, node, nil
	}

	return &state, node, nil
}

⚠️ Potential issue | 🔴 Critical

Critical: possible nil-map panic when annotation JSON lacks equivalenceGroups. json.Unmarshal can produce state.EquivalenceGroups == nil, and UpdateRemediationState will panic on assignment.

Proposed fix
 	var state RemediationStateAnnotation
 	if err = json.Unmarshal([]byte(annotationValue), &state); err != nil {
 		slog.Error("Failed to unmarshal annotation", "node", nodeName, "error", err)
 		// Return empty state if unmarshal fails
 		return &RemediationStateAnnotation{
 			EquivalenceGroups: make(map[string]EquivalenceGroupState),
 		}, node, nil
 	}
+	if state.EquivalenceGroups == nil {
+		state.EquivalenceGroups = make(map[string]EquivalenceGroupState)
+	}

 	return &state, node, nil

Also applies to: 72-78

🤖 Prompt for AI Agents
In @fault-remediation/pkg/annotation/annotation.go around lines 42 - 60, After
unmarshalling into RemediationStateAnnotation, ensure state.EquivalenceGroups is
non-nil to avoid panics: if json.Unmarshal succeeds but state.EquivalenceGroups
== nil, initialize it with make(map[string]EquivalenceGroupState). Apply the
same nil-check/initialization in the other unmarshalling spot noted (also around
lines 72-78). This prevents UpdateRemediationState or any assignment into
EquivalenceGroups from causing a nil-map panic.
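
A self-contained sketch of the failure mode and the guard, using simplified stand-in types. The struct names and JSON tags below are assumptions for illustration; the real definitions live in annotation_interface.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// groupState and remediationState are simplified stand-ins for
// EquivalenceGroupState and RemediationStateAnnotation. The JSON tags are
// assumed for this sketch.
type groupState struct {
	MaintenanceCR string `json:"maintenanceCR"`
}

type remediationState struct {
	EquivalenceGroups map[string]groupState `json:"equivalenceGroups"`
}

// decodeState unmarshals the annotation value and guarantees a non-nil map.
// json.Unmarshal leaves the map nil when the key is absent or explicitly null,
// and assigning into a nil map panics at runtime.
func decodeState(raw string) (*remediationState, error) {
	var s remediationState
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return nil, err
	}
	if s.EquivalenceGroups == nil {
		s.EquivalenceGroups = make(map[string]groupState)
	}
	return &s, nil
}

func main() {
	// "{}" is valid JSON but carries no equivalenceGroups key.
	s, err := decodeState(`{}`)
	if err != nil {
		panic(err)
	}
	// Safe only because decodeState initialized the map.
	s.EquivalenceGroups["restart"] = groupState{MaintenanceCR: "maintenance-node-1"}
	fmt.Println(len(s.EquivalenceGroups))
}
```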

Comment on lines +63 to +95
func (m *NodeAnnotationManager) UpdateRemediationState(ctx context.Context, nodeName string,
	group string, crName string, actionName string) error {
	// Get current state
	state, node, err := m.GetRemediationState(ctx, nodeName)
	if err != nil {
		slog.Warn("Failed to get current remediation state", "node", nodeName, "error", err)
		return fmt.Errorf("failed to get current remediation state: %w", err)
	}

	// Update state for the group
	state.EquivalenceGroups[group] = EquivalenceGroupState{
		MaintenanceCR: crName,
		CreatedAt:     time.Now().UTC(),
		ActionName:    actionName,
	}

	// Marshal to JSON
	stateJSON, err := json.Marshal(state)
	if err != nil {
		return fmt.Errorf("failed to marshal remediation state: %w", err)
	}

	patchBase := node.DeepCopy()
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}

	node.Annotations[AnnotationKey] = string(stateJSON)

	if err = m.client.Patch(ctx, node, client.MergeFrom(patchBase)); err != nil {
		return fmt.Errorf("failed to patch node annotation: %w", err)
	}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd fault-remediation && git ls-files | grep -E "annotation\.go|annotation" | head -20

Repository: NVIDIA/NVSentinel

Length of output: 162

🏁 Script executed:

find . -name "annotation.go" -type f

Repository: NVIDIA/NVSentinel

Length of output: 109

🏁 Script executed:

wc -l fault-remediation/pkg/annotation/annotation.go

Repository: NVIDIA/NVSentinel

Length of output: 111

🏁 Script executed:

head -200 fault-remediation/pkg/annotation/annotation.go

Repository: NVIDIA/NVSentinel

Length of output: 4812

🏁 Script executed:

grep -n "retry\|RetryOnConflict" fault-remediation/pkg/annotation/annotation.go

Repository: NVIDIA/NVSentinel

Length of output: 43

🏁 Script executed:

grep -rn "RetryOnConflict" fault-remediation --include="*.go" | head -10

Repository: NVIDIA/NVSentinel

Length of output: 43

🏁 Script executed:

grep -n "AnnotationKey" fault-remediation/pkg/annotation/annotation.go

Repository: NVIDIA/NVSentinel

Length of output: 281

🏁 Script executed:

cat -n fault-remediation/pkg/annotation/annotation.go | sed -n '1,30p'

Repository: NVIDIA/NVSentinel

Length of output: 983

🏁 Script executed:

find . -name "*.go" -path "*/annotation/*" -type f

Repository: NVIDIA/NVSentinel

Length of output: 297

🏁 Script executed:

cat fault-remediation/pkg/annotation/annotation_interface.go

Repository: NVIDIA/NVSentinel

Length of output: 1291

🏁 Script executed:

grep -rn "UpdateRemediationState\|RemoveGroupFromState" fault-remediation --include="*.go" | grep -v "annotation.go"

Repository: NVIDIA/NVSentinel

Length of output: 1481

🏁 Script executed:

grep -rn "client.Patch" fault-remediation --include="*.go" -A 1 -B 1 | head -30

Repository: NVIDIA/NVSentinel

Length of output: 954

🏁 Script executed:

grep -n "retry\|Retry" fault-remediation/pkg -r --include="*.go" | head -20

Repository: NVIDIA/NVSentinel

Length of output: 2214

🏁 Script executed:

cat -n fault-remediation/pkg/annotation/annotation_test.go | sed -n '122,160p'

Repository: NVIDIA/NVSentinel

Length of output: 1497

Concurrent update of equivalence groups can cause lost writes. Two callers updating different groups in UpdateRemediationState (or RemoveGroupFromState) will overwrite each other's changes due to the read-modify-write pattern on a single annotation value. The entire state is serialized to JSON and patched as an atomic unit; if two processes fetch, modify, and patch concurrently, the last patch wins regardless of which groups were updated.

Wrap both functions' Get → mutate → Patch sequences in retry.RetryOnConflict so that a conflicting write is detected and the current state is re-read before retrying. Note that a patch built with plain client.MergeFrom carries no resourceVersion, so the API server reports a conflict only when the patch is built with optimistic locking, for example client.MergeFromWithOptions(patchBase, client.MergeFromWithOptimisticLock{}). Return errors unwrapped within the retry block to preserve the retry mechanism.

🤖 Prompt for AI Agents
In @fault-remediation/pkg/annotation/annotation.go around lines 63 - 95,
UpdateRemediationState (and likewise RemoveGroupFromState) currently performs a
read-modify-patch that can lose concurrent updates; wrap the Get→mutate→Patch
sequence in a retry.RetryOnConflict loop so on a conflict you re-read the
node/state and retry the mutation, using the
node.DeepCopy()→client.MergeFrom(patchBase) approach inside the retry body;
ensure you return errors from the retry body directly (not wrapped) so
RetryOnConflict can detect conflicts and retry, and only wrap the final returned
error if needed after the retry loop completes.
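
The lost-write scenario and the retry remedy can be sketched without a cluster. The code below is a standard-library simulation, not the controller-runtime or client-go API: `store`, `errConflict`, `retryOnConflict`, and `setGroup` are hypothetical stand-ins for the API server's resourceVersion check and for retry.RetryOnConflict.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's 409 Conflict response.
var errConflict = errors.New("conflict: object was modified")

// store is an in-memory stand-in for a node whose annotation holds all groups.
// version plays the role of resourceVersion.
type store struct {
	version int
	groups  map[string]string
}

// get returns the current version and a copy of the groups.
func (s *store) get() (int, map[string]string) {
	c := make(map[string]string, len(s.groups))
	for k, v := range s.groups {
		c[k] = v
	}
	return s.version, c
}

// update writes the whole map, but only if the caller read the latest version.
func (s *store) update(version int, groups map[string]string) error {
	if version != s.version {
		return errConflict
	}
	s.groups = groups
	s.version++
	return nil
}

// retryOnConflict re-runs fn while it reports a conflict, up to attempts times.
func retryOnConflict(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

// setGroup performs the read-modify-write inside the retry body, so every
// attempt starts from freshly read state. beforeWrite lets a caller inject a
// concurrent writer between the read and the write.
func setGroup(s *store, group, cr string, beforeWrite func()) error {
	return retryOnConflict(5, func() error {
		version, groups := s.get()
		groups[group] = cr
		if beforeWrite != nil {
			beforeWrite()
		}
		return s.update(version, groups)
	})
}

func main() {
	s := &store{groups: map[string]string{}}
	raced := false
	// A second writer updates a different group between our read and write.
	err := setGroup(s, "restart", "cr-a", func() {
		if !raced {
			raced = true
			_ = setGroup(s, "reset", "cr-b", nil)
		}
	})
	fmt.Println(err, s.groups)
}
```

Without the version check in `update`, the first writer would overwrite the second writer's group, which is the lost update this comment describes.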

)

var (
	//TODO: evaluate and remove redundant metrics with ctrl-runtime defaults

⚠️ Potential issue | 🟡 Minor

TODO comment should reference an issue.

As per coding guidelines and prior learnings, TODO comments in Go code should reference an issue. Please add an issue reference to this TODO.

🤖 Prompt for AI Agents
In @fault-remediation/pkg/metrics/metrics.go at line 30, The TODO comment in
metrics.go ("//TODO: evaluate and remove redundant metrics with ctrl-runtime
defaults") must reference an issue; update that comment to include the tracker
reference (e.g., issue number or URL) and a short context so it follows the Go
comment guideline: replace the plain TODO with a TODO that includes the issue id
(or link) and the same brief action so reviewers can find the discussion.

Comment on lines +328 to 330
	if result, err := r.runLogCollector(ctx, healthEvent, healthEventWithStatus.ID); err != nil || !result.IsZero() {
		return result, fmt.Errorf("error running log collector: %w", err)
	}

⚠️ Potential issue | 🟡 Minor

Error wrapping in runLogCollector result handling fabricates an error when err is nil.

At line 329, if err is nil but result.IsZero() returns false, fmt.Errorf("error running log collector: %w", err) still returns a non-nil error, and its message contains %!w(<nil>). A plain requeue request is therefore reported as a failure.

🐛 Fix error handling for non-zero result without error
 	if result, err := r.runLogCollector(ctx, healthEvent, healthEventWithStatus.ID); err != nil || !result.IsZero() {
-		return result, fmt.Errorf("error running log collector: %w", err)
+		if err != nil {
+			return result, fmt.Errorf("error running log collector: %w", err)
+		}
+		return result, nil
 	}

🤖 Prompt for AI Agents
In @fault-remediation/pkg/reconciler/reconciler.go around lines 328 - 330, The
current handling of r.runLogCollector(ctx, healthEvent,
healthEventWithStatus.ID) wraps err regardless of whether err is nil, producing
messages like "<nil>" when result is non-zero but err==nil; change the logic to
first check if err != nil and return result, fmt.Errorf("error running log
collector: %w", err) in that branch, and otherwise if !result.IsZero() return
result, nil so a non-zero result is propagated without a spurious wrapped nil
error. Ensure you update the code block around the call to r.runLogCollector and
use the existing result and err variables.

Comment on lines 29 to 35
 type FaultRemediationClientInterface interface {
-	CreateMaintenanceResource(ctx context.Context, healthEventData *HealthEventData) (bool, string)
-	RunLogCollectorJob(ctx context.Context, nodeName string) error
-	GetAnnotationManager() NodeAnnotationManagerInterface
-	GetStatusChecker() *crstatus.CRStatusChecker
+	CreateMaintenanceResource(ctx context.Context, healthEventData *events.HealthEventData) (string, error)
+	RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
+	GetAnnotationManager() annotation.NodeAnnotationManagerInterface
+	GetStatusChecker() crstatus.CRStatusCheckerInterface
 	GetConfig() *config.TomlConfig
 }

⚠️ Potential issue | 🟡 Minor

Add godoc for the exported interface.

The FaultRemediationClientInterface and its methods lack documentation. As per coding guidelines and prior learnings, doc comments are required for all exported Go functions and interfaces.

📝 Suggested documentation
+// FaultRemediationClientInterface defines the contract for fault remediation operations,
+// including CR creation, log collection, and state management.
 type FaultRemediationClientInterface interface {
+	// CreateMaintenanceResource creates a maintenance CR for the given health event and returns the CR name.
 	CreateMaintenanceResource(ctx context.Context, healthEventData *events.HealthEventData) (string, error)
+	// RunLogCollectorJob orchestrates log collection for a node and event, returning a reconcile result.
 	RunLogCollectorJob(ctx context.Context, nodeName string, eventId string) (ctrl.Result, error)
+	// GetAnnotationManager returns the annotation manager for node remediation state tracking.
 	GetAnnotationManager() annotation.NodeAnnotationManagerInterface
+	// GetStatusChecker returns the CR status checker for determining CR creation eligibility.
 	GetStatusChecker() crstatus.CRStatusCheckerInterface
+	// GetConfig returns the remediation configuration.
 	GetConfig() *config.TomlConfig
 }

🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/fault_remediation_client_interface.go
around lines 29 - 35, Add GoDoc comments: write a descriptive comment above the
exported type FaultRemediationClientInterface explaining its purpose and
intended use, then add short GoDoc comments for each exported method
(CreateMaintenanceResource, RunLogCollectorJob, GetAnnotationManager,
GetStatusChecker, GetConfig) describing what they do, their key parameters and
return values; ensure each comment begins with the exact method/type name to
satisfy golint/go vet requirements.

Comment on lines +465 to +500
func (c *FaultRemediationClient) checkLogCollectorComplete(
	ctx context.Context,
	nodeName string,
	job batchv1.Job,
	conditions []metav1.Condition,
) (bool, error) {
	completeCondition := meta.FindStatusCondition(conditions, string(batchv1.JobComplete))
	//nolint:nestif // todo
	if completeCondition != nil && completeCondition.Status == metav1.ConditionTrue {
		slog.Info("Log collector job completed successfully", "job", job.Name)
		// Use job's actual duration instead of custom tracking
		// reconciliation can be called multiple times so use annotation to make sure we're not duplicate recording metrics
		if job.Annotations == nil || job.Annotations[jobMetricsAlreadyCountedAnnotation] != trueStringVal {
			updateJob := job.DeepCopy()
			if updateJob.Annotations == nil {
				updateJob.Annotations = map[string]string{}
			}

			updateJob.Annotations[jobMetricsAlreadyCountedAnnotation] = trueStringVal

			err := c.client.Update(ctx, updateJob)
			if err != nil {
				return false, err
			}

			duration := job.Status.CompletionTime.Sub(job.Status.StartTime.Time).Seconds()

			metrics.LogCollectorJobs.WithLabelValues(nodeName, "success").Inc()
			metrics.LogCollectorJobDuration.WithLabelValues(nodeName, "success").Observe(duration)
		}

		return true, nil
	}

	return false, nil
}

⚠️ Potential issue | 🟠 Major

Potential nil pointer dereference in checkLogCollectorComplete.

At line 490, job.Status.StartTime could be nil if the job hasn't started yet, which would cause a panic when calling .Time.

🐛 Add nil check for StartTime
 			updateJob.Annotations[jobMetricsAlreadyCountedAnnotation] = trueStringVal

 			err := c.client.Update(ctx, updateJob)
 			if err != nil {
 				return false, err
 			}

-			duration := job.Status.CompletionTime.Sub(job.Status.StartTime.Time).Seconds()
+			var duration float64
+			if job.Status.StartTime != nil && job.Status.CompletionTime != nil {
+				duration = job.Status.CompletionTime.Sub(job.Status.StartTime.Time).Seconds()
+			}

 			metrics.LogCollectorJobs.WithLabelValues(nodeName, "success").Inc()
 			metrics.LogCollectorJobDuration.WithLabelValues(nodeName, "success").Observe(duration)

🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/remediation.go around lines 465 - 500,
checkLogCollectorComplete can panic when computing duration because
job.Status.StartTime (and possibly job.Status.CompletionTime) can be nil; before
calling job.Status.StartTime.Time or job.Status.CompletionTime.Sub(...) add a
nil-check: only compute and call
metrics.LogCollectorJobDuration.WithLabelValues(...).Observe(duration) when both
job.Status.StartTime != nil and job.Status.CompletionTime != nil, otherwise skip
observing duration (you may still increment metrics.LogCollectorJobs count if
desired). Update the code in checkLogCollectorComplete to guard the duration
calculation and observation (referencing job.Status.StartTime,
job.Status.CompletionTime, metrics.LogCollectorJobs,
metrics.LogCollectorJobDuration, and jobMetricsAlreadyCountedAnnotation) to
avoid the nil pointer dereference.

Comment on lines +15 to +20
apiVersion: {{.ApiGroup}}/{{.Version}}
kind: RebootNode
metadata:
  name: maintenance-{{.NodeName}}-{{.HealthEventID}}
spec:
  nodeName: {{.NodeName}}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "*rebootnode*" -o -name "*reboot*" | head -20

Repository: NVIDIA/NVSentinel

Length of output: 575

🏁 Script executed:

git ls-files | grep -i rebootnode

Repository: NVIDIA/NVSentinel

Length of output: 559

🏁 Script executed:

git ls-files | grep -i template | grep -E "\.yaml|\.yml" | head -20

Repository: NVIDIA/NVSentinel

Length of output: 1532

🏁 Script executed:

cat -n ./fault-remediation/pkg/remediation/templates/rebootnode-template.yaml

Repository: NVIDIA/NVSentinel

Length of output: 962

🏁 Script executed:

cat -n ./janitor/api/v1alpha1/rebootnode_types.go

Repository: NVIDIA/NVSentinel

Length of output: 9092

🏁 Script executed:

cat -n ./distros/kubernetes/nvsentinel/charts/janitor/crds/janitor.dgxc.nvidia.com_rebootnodes.yaml | head -50

Repository: NVIDIA/NVSentinel

Length of output: 2019

🏁 Script executed:

cat -n ./fault-remediation/pkg/reconciler/templates/rebootnode-template.yaml

Repository: NVIDIA/NVSentinel

Length of output: 962

🏁 Script executed:

find . -type f -name "*.go" -path "*/remediation/*" -o -path "*/reconciler/*" | xargs grep -l "template\|Template" | head -10

Repository: NVIDIA/NVSentinel

Length of output: 681

🏁 Script executed:

rg -n "ExecuteTemplate|text/template|template\.Parse" --type go | grep -i remediation | head -20

Repository: NVIDIA/NVSentinel

Length of output: 203

🏁 Script executed:

cat -n ./fault-remediation/pkg/remediation/remediation.go

Repository: NVIDIA/NVSentinel

Length of output: 22499

🏁 Script executed:

cat -n ./fault-remediation/pkg/remediation/remediation_test.go

Repository: NVIDIA/NVSentinel

Length of output: 19207

Quote template placeholders to satisfy YAML static analysis.

Unquoted, the template file is not valid YAML on its own (a value that begins with {{ is read as the start of a flow mapping), so static linters flag it. Quoting the placeholders makes the template file parseable, and the rendered manifest still parses to the same values.

Suggested fix
-apiVersion: {{.ApiGroup}}/{{.Version}}
+apiVersion: "{{.ApiGroup}}/{{.Version}}"
 kind: RebootNode
 metadata:
-  name: maintenance-{{.NodeName}}-{{.HealthEventID}}
+  name: "maintenance-{{.NodeName}}-{{.HealthEventID}}"
 spec:
-  nodeName: {{.NodeName}}
+  nodeName: "{{.NodeName}}"

Note: RebootNode is cluster-scoped, so metadata.namespace is not applicable here.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-apiVersion: {{.ApiGroup}}/{{.Version}}
-kind: RebootNode
-metadata:
-  name: maintenance-{{.NodeName}}-{{.HealthEventID}}
-spec:
-  nodeName: {{.NodeName}}
+apiVersion: "{{.ApiGroup}}/{{.Version}}"
+kind: RebootNode
+metadata:
+  name: "maintenance-{{.NodeName}}-{{.HealthEventID}}"
+spec:
+  nodeName: "{{.NodeName}}"

🧰 Tools
🪛 YAMLlint (1.37.1)

[error] 15-15: syntax error: expected , but found ''

(syntax)

🤖 Prompt for AI Agents
In @fault-remediation/pkg/remediation/templates/rebootnode-template.yaml around
lines 15 - 20, The YAML template contains unquoted Go template placeholders
(e.g., {{.ApiGroup}}, {{.Version}}, metadata.name:
maintenance-{{.NodeName}}-{{.HealthEventID}}, and spec.nodeName: {{.NodeName}})
which make the file fail static YAML linters; fix by wrapping each placeholder
value in quotes so the template is valid YAML while rendered output remains the
same (quote the ApiGroup/Version values and the metadata.name and spec.nodeName
placeholders), and do not add metadata.namespace since RebootNode is
cluster-scoped.
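
To check what the quoted template renders to, a short text/template sketch follows. The `templateData` struct and `render` helper are hypothetical and only mirror the placeholder names in the template; the sample values in main are illustrative, not taken from a real event.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// rebootNodeTemplate is the quoted form of the template from the suggestion
// above, inlined so the sketch is self-contained.
const rebootNodeTemplate = `apiVersion: "{{.ApiGroup}}/{{.Version}}"
kind: RebootNode
metadata:
  name: "maintenance-{{.NodeName}}-{{.HealthEventID}}"
spec:
  nodeName: "{{.NodeName}}"
`

// templateData is a hypothetical subset of the fields the template consumes.
type templateData struct {
	ApiGroup      string
	Version       string
	NodeName      string
	HealthEventID string
}

// render parses the template and executes it against the given data.
func render(data templateData) (string, error) {
	tmpl, err := template.New("rebootnode").Parse(rebootNodeTemplate)
	if err != nil {
		return "", fmt.Errorf("parse template: %w", err)
	}

	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("execute template: %w", err)
	}

	return buf.String(), nil
}

func main() {
	out, err := render(templateData{
		ApiGroup:      "janitor.dgxc.nvidia.com",
		Version:       "v1alpha1",
		NodeName:      "node-1",
		HealthEventID: "evt-42",
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

The rendered text keeps the double quotes, so it is not byte-identical to the unquoted rendering, but a YAML parser yields the same string values for apiVersion, metadata.name, and spec.nodeName.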

3 participants