Skip to content

Conversation

@tanishagoyal2
Copy link
Contributor

@tanishagoyal2 tanishagoyal2 commented Dec 22, 2025

Summary

Type of Change

  • πŸ› Bug fix
  • ✨ New feature
  • πŸ’₯ Breaking change
  • πŸ“š Documentation
  • πŸ”§ Refactoring
  • πŸ”¨ Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • Added a configurable processing strategy (EXECUTE_REMEDIATION default, STORE_ONLY option) exposed via runtime flag and deployment values and included in health event payloads.
  • Behavior Changes

    • STORE_ONLY events are observability-only: persisted/exported but skipped for node-condition updates and automated remediation; EXECUTE_REMEDIATION continues to trigger state changes.
  • Tests

    • Expanded tests and helpers and added end-to-end coverage for both strategies and rollout scenarios.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 22, 2025

πŸ“ Walkthrough

Walkthrough

Adds a ProcessingStrategy enum and processingStrategy field to HealthEvent; threads the strategy through monitors, handlers, exporter, store pipelines, Kubernetes connector, Helm values, and tests to distinguish STORE_ONLY vs EXECUTE_REMEDIATION processing and skip remediation for STORE_ONLY events.

Changes

Cohort / File(s) Summary
Protobuf / Generated
data-models/protobufs/health_event.proto, health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py, .../health_event_pb2.pyi
Add ProcessingStrategy enum (UNSPECIFIED, EXECUTE_REMEDIATION, STORE_ONLY) and processingStrategy field to HealthEvent (field 16); update generated descriptors/typings.
Helm / K8s Args
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml, .../templates/_helpers.tpl
New Helm value processingStrategy (default EXECUTE_REMEDIATION) and inject --processing-strategy arg into the syslog DaemonSet container.
Syslog Monitor Core
health-monitors/syslog-health-monitor/main.go, health-monitors/syslog-health-monitor/pkg/syslog-monitor/..., .../types.go
Add CLI flag parsing/validation; thread pb.ProcessingStrategy into NewSyslogMonitor/NewSyslogMonitorWithFactory; store on SyslogMonitor; include strategy in produced pb.HealthEvent.
Per-check Handlers
health-monitors/syslog-health-monitor/pkg/xid/*, .../sxid/*, .../gpufallen/*
Add processingStrategy field to handler structs, update constructors to accept strategy, set instance field, and populate HealthEvent.ProcessingStrategy; update tests.
Event Exporter
event-exporter/pkg/transformer/cloudevents.go, .../cloudevents_test.go
Add processingStrategy string into CloudEvent payload; tests assert expected value.
Store-client Pipelines & Interface
store-client/pkg/client/pipeline_builder.go, .../mongodb_pipeline_builder.go, .../postgresql_pipeline_builder.go, .../pipeline_builder_test.go
Add BuildProcessableHealthEventInsertsPipeline() to interface and builders; pipelines filter inserts where processingStrategy == EXECUTE_REMEDIATION.
Platform Connector (Kubernetes)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go, .../k8s_platform_connector_test.go
Filter out STORE_ONLY events via filterProcessableEvents; create createK8sEvent; process only processable events for node conditions and Kubernetes events; add tests for strategy behaviors.
Fault Quarantine / Initializer
fault-quarantine/pkg/initializer/init.go, fault-quarantine/pkg/evaluator/rule_evaluator_test.go
Switch pipeline usage to processable-events pipeline; tests updated to include processingStrategy in expected payloads.
Tests & Helpers
tests/helpers/*, tests/event_exporter_test.go, tests/syslog_health_monitor_test.go, tests/fault_quarantine_test.go, tests/data/*
Extend helpers to set/expect processingStrategy (HealthEventTemplate, ValidateCloudEvent); add DaemonSet rollout/arg helpers and syslog setup/teardown helpers; add/update tests and test data to cover STORE_ONLY vs EXECUTE_REMEDIATION.
Misc / CSP Monitor
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go, .../trigger_test.go
Set ProcessingStrategy_EXECUTE_REMEDIATION on mapped maintenance HealthEvents; tests updated.
SQL Field Mapping
store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
Add mapping "processingstrategy" β†’ "processingStrategy" for JSON field translations.

Sequence Diagram

sequenceDiagram
    actor Operator
    participant Helm as Helm/DaemonSet
    participant Syslog as Syslog Monitor
    participant Handler as Handler (XID/SXID/GPU)
    participant Exporter as Event Exporter
    participant Store as Store (change stream)
    participant K8s as K8s Connector
    participant Kube as Kubernetes API

    Operator->>Helm: set processingStrategy value
    Helm->>Syslog: start with CLI arg (strategy)
    Syslog->>Handler: init handler with strategy

    rect rgba(180,220,200,0.12)
      Handler->>Handler: detect issue -> build pb.HealthEvent (includes processingStrategy)
      Handler-->>Syslog: emit HealthEvent
    end

    rect rgba(200,200,255,0.10)
      Syslog->>Exporter: send HealthEvent
      Exporter->>Store: persist/stream HealthEvent (contains processingStrategy)
    end

    rect rgba(255,220,200,0.10)
      Store->>Store: change stream pipeline filters for EXECUTE_REMEDIATION
      alt EXECUTE_REMEDIATION
        Store->>K8s: forward processable events
        K8s->>Kube: create/update node conditions & events
      else STORE_ONLY
        Store->>Store: filtered out β€” no remediation actions
      end
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

"I hopped through fields and flags today,
A tiny enum to show the way.
STORE to keep, EXECUTE to mend,
Events now know which path to send.
β€” Rabbit πŸ‡"

πŸš₯ Pre-merge checks | βœ… 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 38.60% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
βœ… Passed checks (2 passed)
Check name Status Explanation
Description Check βœ… Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check βœ… Passed The title 'feat: add event handling strategy changes in syslog health monitor' is specific and clearly summarizes the main feature addition of processing strategy handling across the codebase.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • πŸ“ Generate docstrings

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

πŸ”§ golangci-lint (2.5.0)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❀️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (10)
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl (1)

103-104: Consider quoting the processing strategy value for consistency.

The processingStrategy value is passed unquoted, while similar string arguments like $kataLabel at line 94 use the quote function. For consistency and to follow YAML best practices, consider quoting this value.

πŸ”Ž Proposed fix
             - "--processing-strategy"
-            - {{ $root.Values.processingStrategy }}
+            - {{ $root.Values.processingStrategy | quote }}
tests/helpers/healthevent.go (2)

153-156: Add godoc comment for the exported method.

Per coding guidelines, exported Go functions should have function comments. Other With* methods in this file lack comments too, but new code should set a better example.

Suggested documentation
+// WithProcessingStrategy sets the processing strategy for the health event.
+// Use protos.ProcessingStrategy_EXECUTE_REMEDIATION (0) or protos.ProcessingStrategy_STORE_ONLY (1).
 func (h *HealthEventTemplate) WithProcessingStrategy(strategy int) *HealthEventTemplate {
 	h.ProcessingStrategy = strategy
 	return h
 }

48-48: Consider using the protobuf type for consistency.

The protobuf-generated Go code uses int32 for ProcessingStrategy. While int works for JSON serialization in tests, using int32 would be more consistent with the generated types.

Suggested change
-	ProcessingStrategy  int                  `json:"processingStrategy,omitempty"`
+	ProcessingStrategy  int32                `json:"processingStrategy,omitempty"`

This would require updating the setter signature as well:

-func (h *HealthEventTemplate) WithProcessingStrategy(strategy int) *HealthEventTemplate {
+func (h *HealthEventTemplate) WithProcessingStrategy(strategy int32) *HealthEventTemplate {
data-models/protobufs/health_event.proto (1)

77-77: Add field-level comment for processingStrategy.

Per coding guidelines for protobuf files: "Include comprehensive comments for all fields in Protocol Buffer messages." The enum declaration has good comments, but the field itself should also have a brief comment explaining its purpose in the context of the HealthEvent message.

Suggested addition
   BehaviourOverrides drainOverrides = 15;
+  // Determines how downstream modules should handle this event (remediation vs observability-only).
   ProcessingStrategy processingStrategy = 16;
 }
health-monitors/syslog-health-monitor/main.go (1)

164-171: Consider enhancing the error message with valid options.

The validation correctly uses the generated pb.ProcessingStrategy_value map. Consider listing valid values in the error message to improve user experience.

Suggested improvement
 	value, ok := pb.ProcessingStrategy_value[*processingStrategyFlag]
 	if !ok {
-		return fmt.Errorf("unexpected processingStrategy value: %q", *processingStrategyFlag)
+		return fmt.Errorf("unexpected processingStrategy value: %q (valid values: EXECUTE_REMEDIATION, STORE_ONLY)", *processingStrategyFlag)
 	}
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (1)

239-242: Consider using the enum constant directly instead of accessing the handler field.

Line 241 accesses handler2.processingStrategy directly for the assertion. While this works, using the expected enum constant (pb.ProcessingStrategy_EXECUTE_REMEDIATION) would be more explicit and resilient to future encapsulation changes.

πŸ”Ž Suggested change
-		assert.Equal(t, handler2.processingStrategy, events.Events[0].ProcessingStrategy)
+		assert.Equal(t, pb.ProcessingStrategy_EXECUTE_REMEDIATION, events.Events[0].ProcessingStrategy)
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (1)

191-195: Brittle entity index assertions may break if entity order changes.

The assertions use hardcoded indices (EntitiesImpacted[3], EntitiesImpacted[4]) to access GPU entities. If the entity ordering in ProcessLine changes, these tests will fail with confusing errors.

Consider iterating over entities by type, similar to the pattern used in gpufallen_handler_test.go (lines 124-136).

πŸ”Ž Suggested approach
-	// Verify GPU entity
-	assert.Equal(t, "GPU", event.EntitiesImpacted[3].EntityType)
-	assert.Equal(t, "1", event.EntitiesImpacted[3].EntityValue)
-	assert.Equal(t, "GPU_UUID", event.EntitiesImpacted[4].EntityType)
-	assert.Equal(t, "GPU-aaaabbbb-cccc-dddd-eeee-ffffffffffff", event.EntitiesImpacted[4].EntityValue)
+	// Verify GPU entities by type rather than assuming order
+	var hasGPU, hasGPUUUID bool
+	for _, entity := range event.EntitiesImpacted {
+		switch entity.EntityType {
+		case "GPU":
+			hasGPU = true
+			assert.Equal(t, "1", entity.EntityValue)
+		case "GPU_UUID":
+			hasGPUUUID = true
+			assert.Equal(t, "GPU-aaaabbbb-cccc-dddd-eeee-ffffffffffff", entity.EntityValue)
+		}
+	}
+	assert.True(t, hasGPU, "Should have GPU entity")
+	assert.True(t, hasGPUUUID, "Should have GPU_UUID entity")
tests/helpers/syslog-health-monitor.go (1)

15-15: Missing package-level godoc comment.

As per coding guidelines, package-level godoc is required for all Go packages.

Proposed fix
+// Package helpers provides test utility functions for NVSentinel integration tests.
 package helpers
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (1)

427-446: Consider adding a test case to validate ProcessingStrategy is stored in handler.

The existing test cases verify handler construction but don't assert that processingStrategy is correctly stored. Consider adding an assertion.

Proposed enhancement
 			} else {
 				assert.NoError(t, err)
 				require.NotNil(t, handler)
 				assert.Equal(t, tc.nodeName, handler.nodeName)
 				assert.Equal(t, tc.agentName, handler.defaultAgentName)
 				assert.Equal(t, tc.componentClass, handler.defaultComponentClass)
 				assert.Equal(t, tc.checkName, handler.checkName)
+				assert.Equal(t, pb.ProcessingStrategy_EXECUTE_REMEDIATION, handler.processingStrategy)
 				assert.NotNil(t, handler.pciToGPUUUID)
 				assert.NotNil(t, handler.parser)
 				assert.NotNil(t, handler.metadataReader)
 			}
tests/helpers/kube.go (1)

2346-2390: Consider clarifying the daemonsetName parameter usage.

The daemonsetName parameter is only used in the error message but not for any actual filtering logic (which relies on podNamePattern). This could be confusing to callers. Consider either using it for filtering or documenting this clearly.

πŸ”Ž Suggested documentation improvement
+// GetDaemonSetPodOnWorkerNode returns a running, ready pod matching the given name pattern on a real worker node.
+// Note: daemonsetName is used for error messages; filtering is done via podNamePattern.
 func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string, podNamePattern string) (*v1.Pod, error) {
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 61f47cb and a216b57.

β›” Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
πŸ“’ Files selected for processing (34)
  • data-models/protobufs/health_event.proto
  • distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl
  • distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
  • event-exporter/pkg/transformer/cloudevents.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • fault-quarantine/pkg/initializer/init.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • health-monitors/syslog-health-monitor/main.go
  • health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
  • health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/sxid/types.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • store-client/pkg/client/pipeline_builder.go
  • store-client/pkg/client/pipeline_builder_test.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • tests/event_exporter_test.go
  • tests/helpers/event_exporter.go
  • tests/helpers/healthevent.go
  • tests/helpers/kube.go
  • tests/helpers/syslog-health-monitor.go
  • tests/syslog_health_monitor_test.go
🧰 Additional context used
πŸ““ Path-based instructions (5)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/helpers/event_exporter.go
  • store-client/pkg/client/pipeline_builder_test.go
  • tests/helpers/healthevent.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
  • event-exporter/pkg/transformer/cloudevents.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • health-monitors/syslog-health-monitor/pkg/sxid/types.go
  • fault-quarantine/pkg/initializer/init.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • store-client/pkg/client/pipeline_builder.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • health-monitors/syslog-health-monitor/main.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
  • tests/helpers/syslog-health-monitor.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
  • tests/event_exporter_test.go
  • tests/helpers/kube.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
  • health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
  • tests/syslog_health_monitor_test.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • store-client/pkg/client/pipeline_builder_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/event_exporter_test.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
  • health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
  • tests/syslog_health_monitor_test.go
data-models/protobufs/**/*.proto

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

  • data-models/protobufs/health_event.proto
**/values.yaml

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

  • distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
**/*.py

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
🧠 Learnings (8)
πŸ“š Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/sxid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/main.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Each health monitor service should implement the gRPC service from `data-models/protobufs/`

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/sxid/types.go
πŸ“š Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

  • fault-quarantine/pkg/initializer/init.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • tests/event_exporter_test.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/event_exporter_test.go
  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `testify/assert` and `testify/require` for assertions in Go tests

Applied to files:

  • tests/event_exporter_test.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • tests/event_exporter_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • tests/syslog_health_monitor_test.go
🧬 Code graph analysis (17)
store-client/pkg/client/pipeline_builder_test.go (3)
store-client/pkg/client/pipeline_builder.go (1)
  • PipelineBuilder (26-47)
store-client/pkg/client/mongodb_pipeline_builder.go (1)
  • NewMongoDBPipelineBuilder (29-31)
store-client/pkg/client/postgresql_pipeline_builder.go (1)
  • NewPostgreSQLPipelineBuilder (29-31)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
event-exporter/pkg/transformer/cloudevents.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
  • ToPipeline (161-163)
  • D (131-133)
  • E (126-128)
  • A (136-138)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
event-exporter/pkg/transformer/cloudevents_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (5)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • ProcessingStrategy_STORE_ONLY (47-47)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (3)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
  • XIDHandler (29-39)
store-client/pkg/client/postgresql_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
  • ToPipeline (161-163)
  • D (131-133)
  • E (126-128)
  • A (136-138)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
health-monitors/syslog-health-monitor/main.go (1)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_value (56-59)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)
data-models/pkg/protos/health_event.pb.go (17)
  • HealthEvent (260-280)
  • HealthEvent (293-293)
  • HealthEvent (308-310)
  • Entity (208-214)
  • Entity (227-227)
  • Entity (242-244)
  • RecommendedAction (89-89)
  • RecommendedAction (139-141)
  • RecommendedAction (143-145)
  • RecommendedAction (152-154)
  • RecommendedAction_CONTACT_SUPPORT (94-94)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • ProcessingStrategy_STORE_ONLY (47-47)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
platform-connectors/pkg/connectors/kubernetes/k8s_connector.go (1)
  • NewK8sConnector (47-58)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (3)
data-models/pkg/protos/health_event.pb.go (6)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
  • ProcessingStrategy_STORE_ONLY (47-47)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
  • NewXIDHandler (33-58)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
  • XIDHandler (29-39)
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go (1)
health-monitors/syslog-health-monitor/pkg/gpufallen/types.go (1)
  • GPUFallenHandler (47-57)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (2)
data-models/pkg/protos/health_event.pb.go (11)
  • HealthEvents (156-162)
  • HealthEvents (175-175)
  • HealthEvents (190-192)
  • HealthEvent (260-280)
  • HealthEvent (293-293)
  • HealthEvent (308-310)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • ProcessingStrategy_STORE_ONLY (47-47)
platform-connectors/pkg/connectors/kubernetes/k8s_connector.go (1)
  • K8sConnector (37-45)
tests/event_exporter_test.go (1)
tests/helpers/event_exporter.go (1)
  • ValidateCloudEvent (221-247)
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (1)
data-models/pkg/protos/health_event.pb.go (2)
  • ProcessingStrategy_STORE_ONLY (47-47)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (1)
data-models/pkg/protos/health_event.pb.go (2)
  • ProcessingStrategy_STORE_ONLY (47-47)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (2)
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (1)
  • NewSXIDHandler (30-42)
data-models/pkg/protos/health_event.pb.go (3)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
  • ProcessingStrategy_STORE_ONLY (47-47)
  • RecommendedAction_CONTACT_SUPPORT (94-94)
tests/syslog_health_monitor_test.go (3)
tests/helpers/syslog-health-monitor.go (3)
  • SetUpSyslogHealthMonitor (34-74)
  • StubJournalHTTPPort (28-28)
  • TearDownSyslogHealthMonitor (77-116)
tests/helpers/kube.go (2)
  • InjectSyslogMessages (1796-1819)
  • PortForwardPod (1590-1634)
tests/helpers/fault_quarantine.go (2)
  • AssertQuarantineState (315-382)
  • QuarantineAssertion (56-60)
πŸ”‡ Additional comments (65)
fault-quarantine/pkg/initializer/init.go (1)

66-66: Verify the pipeline filtering aligns with fault-quarantine requirements.

The switch from BuildAllHealthEventInsertsPipeline() to BuildProcessableHealthEventInsertsPipeline() means the fault-quarantine module will now only observe health events marked with the EXECUTE_REMEDIATION processing strategy. Events marked STORE_ONLY will be excluded from triggering quarantine actions.

Run the following script to confirm the new pipeline builder exists and verify its filtering logic:

#!/bin/bash
# Verify BuildProcessableHealthEventInsertsPipeline implementation and filtering behavior

# Find the pipeline builder implementation
echo "=== Locating BuildProcessableHealthEventInsertsPipeline implementation ==="
ast-grep --pattern $'func ($$$) BuildProcessableHealthEventInsertsPipeline() $$$'

# Find usages to understand the filtering behavior
echo -e "\n=== Finding references to ProcessingStrategy enum ==="
rg -n -C3 --type=go 'ProcessingStrategy|EXECUTE_REMEDIATION|STORE_ONLY' store-client/

# Find the builder interface/struct definition
echo -e "\n=== Locating PipelineBuilder interface ==="
rg -n -C3 --type=go 'type.*PipelineBuilder|BuildProcessableHealthEventInsertsPipeline'
fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)

263-263: LGTM! Default enum value correctly validated.

The test correctly expects processingStrategy to be 0 (EXECUTE_REMEDIATION), which is the protobuf default for the new enum field.

event-exporter/pkg/transformer/cloudevents.go (1)

66-66: LGTM! Consistent enum-to-string conversion.

The implementation correctly uses ProcessingStrategy.String() to include the strategy in CloudEvent data, matching the pattern used for recommendedAction at line 61.

store-client/pkg/client/pipeline_builder_test.go (1)

69-86: LGTM! Well-structured table-driven test.

The test correctly validates the new BuildProcessableHealthEventInsertsPipeline() method for both MongoDB and PostgreSQL builders, following the established patterns in this test file.

health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go (1)

53-53: LGTM! Safe addition of unexported field.

The processingStrategy field is correctly added as an unexported field to store the health event processing strategy for this monitor instance.

health-monitors/syslog-health-monitor/pkg/xid/types.go (1)

20-20: LGTM! Consistent import and field addition.

The protobuf import and processingStrategy field are correctly added to enable strategy propagation through the XID handler to health events.

Also applies to: 34-34

health-monitors/syslog-health-monitor/pkg/gpufallen/types.go (1)

23-23: LGTM! Consistent with other handler implementations.

The protobuf import and processingStrategy field addition follow the same pattern used in the XID and SXID handlers.

Also applies to: 52-52

distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml (1)

51-56: LGTM! Well-documented configuration option.

The processingStrategy configuration is clearly documented with valid values, default behavior, and behavioral differences. The default value EXECUTE_REMEDIATION maintains backward compatibility with existing deployments.

As per coding guidelines, the inline documentation follows Helm best practices for values.yaml files.

data-models/protobufs/health_event.proto (1)

32-38: Good enum design with clear documentation.

The enum is well-designed with EXECUTE_REMEDIATION=0 as the default (zero value), ensuring backward compatibility for existing events that don't set this field. The comments clearly explain the behavioral difference between the two strategies.

store-client/pkg/client/mongodb_pipeline_builder.go (1)

87-100: Verify behavior when processingStrategy field is missing or null.

The filter matches documents where fullDocument.healthevent.processingstrategy equals 0 (EXECUTE_REMEDIATION). In MongoDB, documents missing this field will have null for that path, which does not equal 0. This could cause backward compatibility issues with existing health events that were created before this field was added.

If existing events should be treated as EXECUTE_REMEDIATION (the intended default), consider updating the filter to handle missing fields:

Potential fix if missing fields need to match
 func (b *MongoDBPipelineBuilder) BuildProcessableHealthEventInsertsPipeline() datastore.Pipeline {
 	return datastore.ToPipeline(
 		datastore.D(
 			datastore.E("$match", datastore.D(
 				datastore.E("operationType", datastore.D(
 					datastore.E("$in", datastore.A("insert")),
 				)),
-				datastore.E("fullDocument.healthevent.processingstrategy", int32(protos.ProcessingStrategy_EXECUTE_REMEDIATION)),
+				datastore.E("$or", datastore.A(
+					datastore.D(datastore.E("fullDocument.healthevent.processingstrategy", int32(protos.ProcessingStrategy_EXECUTE_REMEDIATION))),
+					datastore.D(datastore.E("fullDocument.healthevent.processingstrategy", datastore.D(datastore.E("$exists", false)))),
+				)),
 			)),
 		),
 	)
 }

Please verify whether existing health events in the database have this field set, or if all events going forward will include it.

store-client/pkg/client/pipeline_builder.go (1)

35-38: LGTM!

The new interface method is well-documented with clear explanation of:

  • What it does: watches for "processable" health event inserts with processingStrategy=EXECUTE_REMEDIATION
  • Who uses it: fault-quarantine
  • Why: to ignore observability-only events (STORE_ONLY)

This follows the established pattern of the other pipeline builder methods.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

1-6: This file is auto-generated by the protobuf compiler and is correctly checked into the repository. The "NO CHECKED-IN PROTOBUF GENCODE" comment is a standard protobuf compiler-generated header, not a project policy directive. For Python (non-C++), checking in generated protobuf files is a standard practice supported by the protobuf project. The repository properly manages this via the protos-generate Makefile target for regeneration and the protos-lint target for CI validation to ensure generated files stay up to date with their source .proto files.

Likely an incorrect or invalid review comment.

tests/helpers/event_exporter.go (1)

221-246: LGTM!

The new expectedProcessingStrategy parameter integrates cleanly with the existing validation pattern. The assertion at line 245 correctly validates the new field in the CloudEvent payload.

health-monitors/syslog-health-monitor/pkg/sxid/types.go (1)

20-36: LGTM!

The new processingStrategy field is correctly typed and logically positioned among the configuration fields in SXIDHandler.

event-exporter/pkg/transformer/cloudevents_test.go (1)

69-108: LGTM!

The test correctly validates the new processingStrategy field in CloudEvent output. The explicit STORE_ONLY value in the first comprehensive test case provides good coverage, while other focused test cases appropriately default to the zero value (EXECUTE_REMEDIATION).

tests/event_exporter_test.go (1)

85-85: LGTM!

The test correctly expects "EXECUTE_REMEDIATION" as the processing strategy, which aligns with the default behavior for health events.

health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go (2)

29-51: LGTM!

The processingStrategy parameter is correctly threaded through the constructor to the handler struct, following the established pattern for other configuration fields.


225-239: LGTM!

The ProcessingStrategy field is correctly populated from the handler's configuration when creating health events.

health-monitors/syslog-health-monitor/main.go (2)

70-71: LGTM!

The new --processing-strategy flag with sensible default and clear help text follows the existing flag pattern in this file.


173-184: LGTM!

The processingStrategy is correctly converted and passed to NewSyslogMonitor. The overall wiring integrates cleanly with the existing initialization flow.

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (2)

30-42: LGTM! Constructor correctly extended with ProcessingStrategy parameter.

The constructor signature update properly accepts the new processingStrategy parameter and assigns it to the handler struct. This is consistent with the pattern used across other handlers (XID, GPUFallen) in this PR.


109-109: ProcessingStrategy correctly propagated to emitted HealthEvent.

The handler's processingStrategy is properly assigned to the HealthEvent.ProcessingStrategy field, ensuring events carry the configured strategy for downstream processing.

health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (3)

167-175: LGTM! Test updated to validate ProcessingStrategy propagation.

The test correctly passes pb.ProcessingStrategy_STORE_ONLY to the handler constructor, following the table-driven test pattern and using testify/require for assertions as per coding guidelines.


185-185: Good assertion coverage for ProcessingStrategy field.

This assertion ensures the emitted event carries the correct processing strategy, validating the end-to-end propagation from handler construction to event emission.


193-319: Comprehensive XID tracking tests properly updated with ProcessingStrategy.

All test cases in TestXIDTracking consistently use pb.ProcessingStrategy_EXECUTE_REMEDIATION, maintaining test consistency while validating the new parameter integration.

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (3)

1391-1506: Well-structured table-driven test for ProcessingStrategy behavior.

The test covers key scenarios:

  • STORE_ONLY events should not create node conditions or K8s events
  • EXECUTE_REMEDIATION events should create node conditions
  • Mixed strategies properly filter only EXECUTE_REMEDIATION events

This follows the coding guidelines for table-driven tests with descriptive names.


1550-1562: Reasonable approach to count NVSentinel-specific conditions.

The logic excludes standard Kubernetes node conditions (NodeReady, NodeMemoryPressure, etc.) to isolate NVSentinel-added conditions. The logging at line 1560 aids debugging.


1508-1589: Good test isolation with per-case setup.

Each test case creates its own localClientSet, ringBuffer, and connector, preventing state leakage between test runs. The defer close(stopCh) ensures proper cleanup.

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (3)

18-25: Imports correctly updated for ProcessingStrategy testing.

New imports for os, filepath, and the pb alias are properly added to support the temp file creation and protobuf type usage in tests.


143-196: Good integration test with realistic metadata setup.

This test creates a temporary gpu_metadata.json, exercises the full ProcessLine flow with valid NVSwitch topology, and validates the emitted event including the new ProcessingStrategy field.


225-249: Test cases properly updated with ProcessingStrategy parameter.

The TestProcessLine table-driven tests correctly pass pb.ProcessingStrategy_EXECUTE_REMEDIATION and validate it in assertions, following the established pattern. Uses testify/assert and testify/require per coding guidelines.

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (7)

309-317: Test correctly validates ProcessingStrategy initialization.

The test verifies that NewSyslogMonitor properly stores the processingStrategy parameter. The assertion at line 317 confirms the monitor's field is set to the expected value.


329-334: Good coverage testing both constructor variants.

Testing NewSyslogMonitorWithFactory with EXECUTE_REMEDIATION ensures both constructor paths properly handle the new parameter.


391-403: TestJournalProcessingLogic updated with ProcessingStrategy.

The test correctly passes pb.ProcessingStrategy_EXECUTE_REMEDIATION to maintain consistency with the expected handler behavior during journal processing.


494-544: TestJournalStateManagement properly tests state persistence with ProcessingStrategy.

Both monitor instances (sm and sm2) are created with consistent EXECUTE_REMEDIATION strategy, ensuring the state persistence test accurately reflects real-world usage.


574-593: TestBootIDChangeHandling correctly includes ProcessingStrategy.

The boot ID detection test is updated to include the new parameter, maintaining test consistency across the test suite.


624-644: TestRunMultipleChecks validates handler initialization with ProcessingStrategy.

The test confirms that both XID and SXID handlers are properly initialized when multiple checks are configured, with the processing strategy parameter correctly passed through.


665-679: TestGPUFallenOffHandlerInitialization updated for completeness.

The GPU Fallen Off handler initialization test includes the ProcessingStrategy parameter, ensuring all handler types are tested with the new field.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (4)

325-343: LGTM: Clean filtering implementation for STORE_ONLY events.

The function correctly filters out STORE_ONLY health events with appropriate info-level logging for observability. The implementation aligns with the broader PR changes introducing ProcessingStrategy.


345-370: LGTM: Good extraction of K8s event creation logic.

The function consolidates event creation into a reusable helper, improving code organization. All required fields are properly populated from the health event.


388-400: LGTM: Correct integration of filtered events with node condition updates.

The timing instrumentation and error handling are properly applied to the filtered processableEvents set.


402-413: LGTM: Node event creation correctly uses filtered events.

The loop properly iterates over processableEvents and uses the new createK8sEvent helper for event creation.

tests/syslog_health_monitor_test.go (6)

25-31: LGTM: Import organization for centralized helpers.

The imports correctly bring in the test helpers package and required Kubernetes types for DaemonSet management.


36-41: LGTM: Context keys for test state management.

The context keys are well-named and follow a consistent pattern for storing test state across setup, assess, and teardown phases.


163-175: LGTM: Teardown correctly uses centralized helper.

The teardown properly retrieves context values and delegates to TearDownSyslogHealthMonitor for cleanup.


200-208: LGTM: Consistent use of helper constant for port.

Using helpers.StubJournalHTTPPort instead of a local constant improves maintainability.


368-385: LGTM: New test for STORE_ONLY strategy.

The test setup correctly passes true for updatedDaemonSet to configure the STORE_ONLY processing strategy.


411-422: LGTM: Teardown correctly restores DaemonSet.

The teardown passes true for updatedDaemonSet to restore the original DaemonSet configuration after the test.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (2)

33-58: LGTM: Clean addition of processingStrategy parameter.

The new parameter is properly integrated into the constructor and stored in the handler struct. This follows the same pattern used in other handlers (SXID, GPUFallen) as per the PR objectives.


189-189: LGTM: ProcessingStrategy correctly propagated to HealthEvent.

The strategy is properly assigned to the generated health event, enabling downstream filtering in the platform connector.

tests/helpers/syslog-health-monitor.go (3)

27-31: LGTM: Well-defined constants for test configuration.

The constants provide clear, maintainable values for port and resource names used across syslog health monitor tests.


33-74: LGTM: Comprehensive test setup helper.

The function properly:

  • Conditionally updates DaemonSet processing strategy
  • Locates the syslog pod on a worker node
  • Injects test metadata
  • Establishes port-forward
  • Sets node management label

Good use of require for critical setup steps that should fail the test if unsuccessful.


76-116: Consider handling originalDaemonSet nil case when updatedDaemonSet is true.

If SetUpSyslogHealthMonitor returns a nil originalDaemonSet when updatedDaemonSet is true (due to an error during setup that wasn't caught), RestoreDaemonSet could receive an unexpected nil. However, since SetUpSyslogHealthMonitor uses require.NoError, this scenario is unlikely in practice.

The teardown logic is otherwise well-structured with proper error handling and cleanup sequencing.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (5)

88-91: LGTM: Updated constructor call with processing strategy.

The test correctly passes pb.ProcessingStrategy_EXECUTE_REMEDIATION to the updated NewXIDHandler signature.


142-142: Good coverage of both processing strategies in test cases.

The test cases appropriately use different strategies (EXECUTE_REMEDIATION on line 142, STORE_ONLY on line 157) to ensure both paths are validated.

Also applies to: 157-157


194-194: LGTM: ProcessingStrategy assertions added to validation.

The new assertions correctly verify that the ProcessingStrategy field on generated HealthEvent objects matches the strategy configured on the handler.

Also applies to: 231-231


306-308: LGTM: Test handler initialized with processing strategy.

The test correctly initializes the handler with EXECUTE_REMEDIATION strategy for the TestCreateHealthEventFromResponse test cases.


348-348: LGTM: Consistent ProcessingStrategy validation in event creation tests.

Both test cases properly validate that the handler's processing strategy is propagated to the created health events.

Also applies to: 378-378

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

14-17: LGTM - Protobuf stubs correctly reflect ProcessingStrategy additions.

The generated .pyi stub file properly exposes the new ProcessingStrategy enum with EXECUTE_REMEDIATION and STORE_ONLY values, and extends HealthEvent with the processingStrategy field. The type hints follow protobuf Python conventions.

Also applies to: 31-32, 78-78, 104-104, 120-120, 138-138

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (3)

41-57: LGTM - Constructor signatures correctly extended with processingStrategy.

The processingStrategy parameter is properly threaded through both NewSyslogMonitor and NewSyslogMonitorWithFactory, maintaining the constructor chain. Adding as the last parameter is good practice for API evolution.


90-136: LGTM - ProcessingStrategy correctly propagated to all handlers.

The processingStrategy is properly stored in the SyslogMonitor struct and consistently passed to all three handler types (XID, SXID, GPUFallen). This ensures uniform processing behavior across different error check types.


802-814: LGTM - HealthEvent correctly includes ProcessingStrategy field.

The prepareHealthEventWithAction method properly sets ProcessingStrategy: sm.processingStrategy when constructing health events, ensuring the configured strategy is propagated through emitted events.

tests/helpers/kube.go (3)

2210-2249: LGTM - DaemonSet rollout wait logic is correct.

The function properly checks DesiredNumberScheduled, UpdatedNumberScheduled, and NumberReady status fields to determine rollout completion. The logging provides good visibility into rollout progress.


2252-2268: LGTM - Container argument update logic handles both flag formats.

The function correctly handles both --processing-strategy=VALUE and --processing-strategy VALUE argument formats, and properly appends the flag if it doesn't exist. The early return after modification prevents duplicate additions.


2312-2344: LGTM - RestoreDaemonSet properly restores original container configuration.

The function correctly handles nil input, uses retry.RetryOnConflict for safe updates, and waits for rollout completion. Replacing the entire containers slice ensures complete restoration of the original state.

@tanishagoyal2 tanishagoyal2 changed the title 390 event handling in syslog feat: add event handling strategy changes in syslog health monitor Dec 22, 2025
@github-actions
Copy link

Merging this branch will decrease overall coverage

Impacted Packages Coverage Ξ” πŸ€–
github.com/nvidia/nvsentinel/fault-quarantine/pkg/evaluator 41.95% (-0.57%) πŸ‘Ž
github.com/nvidia/nvsentinel/fault-quarantine/pkg/initializer 0.00% (ΓΈ)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor 0.00% (ΓΈ)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/gpufallen 0.00% (ΓΈ)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/sxid 0.00% (ΓΈ)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/syslog-monitor 0.00% (ΓΈ)
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid 0.00% (ΓΈ)
github.com/nvidia/nvsentinel/tests 0.00% (ΓΈ)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ΓΈ)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Ξ” Total Covered Missed πŸ€–
github.com/nvidia/nvsentinel/fault-quarantine/pkg/initializer/init.go 0.00% (ΓΈ) 280 0 280
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/main.go 0.00% (ΓΈ) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go 0.00% (ΓΈ) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/gpufallen/types.go 0.00% (ΓΈ) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go 0.00% (ΓΈ) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/sxid/types.go 0.00% (ΓΈ) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go 0.00% (ΓΈ) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go 0.00% (ΓΈ) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/types.go 0.00% (ΓΈ) 0 0 0
github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go 0.00% (ΓΈ) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/event_exporter.go 0.00% (ΓΈ) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/healthevent.go 0.00% (ΓΈ) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ΓΈ) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/syslog-health-monitor.go 0.00% (ΓΈ) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
  • github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
  • github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
  • github.com/nvidia/nvsentinel/health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • github.com/nvidia/nvsentinel/tests/event_exporter_test.go
  • github.com/nvidia/nvsentinel/tests/syslog_health_monitor_test.go

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
tests/syslog_health_monitor_test.go (1)

51-52: Remove redundant error check.

Line 52 checks require.NoError(t, err, ...) but SetUpSyslogHealthMonitor returns (string, *v1.Pod, chan struct{}) with no error. The err variable is undefined.

πŸ”Ž Proposed fix
 	testNodeName, syslogPod, stopChan := helpers.SetUpSyslogHealthMonitor(ctx, t, client, nil)
-	require.NoError(t, err, "failed to set up syslog health monitor")
tests/helpers/kube.go (1)

2272-2277: Missing error when container is not found.

The loop that searches for the container by name silently does nothing if no match is found. This could mask test configuration errors where the wrong container name is provided.

πŸ”Ž Proposed fix to validate container exists
+	found := false
 	containers := daemonSet.Spec.Template.Spec.Containers
 	for i := range containers {
 		if containers[i].Name == containerName {
 			setArgsOnContainer(t, &containers[i], args)
+			found = true
 			break
 		}
 	}
+
+	if !found {
+		return fmt.Errorf("container %q not found in daemonset %s/%s", containerName, NVSentinelNamespace, daemonsetName)
+	}
 
 	return client.Resources().Update(ctx, daemonSet)

Based on learnings, this issue was previously flagged but remains unresolved.

🧹 Nitpick comments (4)
tests/helpers/kube.go (4)

2251-2255: Add godoc comment for exported function.

As per coding guidelines, exported functions require documentation comments. Add a godoc comment explaining the purpose, parameters, and behavior of this function.

πŸ”Ž Proposed godoc comment
+// UpdateDaemonSetArgs updates the specified container arguments in a DaemonSet and waits for rollout completion.
+// If containerName is empty, updates all containers. Otherwise, updates only the named container.
+// Uses retry.RetryOnConflict for automatic retry handling.
 func UpdateDaemonSetArgs(ctx context.Context, t *testing.T,
 	client klient.Client, daemonsetName string, containerName string,
 	args map[string]string) error {

As per coding guidelines, function comments are required for all exported Go functions.


2288-2289: Consider removing hard-coded sleep.

The 10-second sleep after rollout completion may not be necessary since waitForDaemonSetRollout already ensures pods are ready. If pods need additional time to stabilize, consider using require.Eventually to poll for the specific condition instead of a fixed delay.


2294-2297: Add godoc comment for exported function.

Add documentation for this exported function explaining its purpose, parameters, and behavior.

πŸ”Ž Proposed godoc comment
+// RemoveDaemonSetArgs removes the specified arguments from a DaemonSet container and waits for rollout completion.
+// If containerName is empty, removes from all containers. Otherwise, removes only from the named container.
+// Uses retry.RetryOnConflict for automatic retry handling.
 func RemoveDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string,
 	containerName string, args map[string]string,

As per coding guidelines, function comments are required for all exported Go functions.


2411-2413: Add godoc comment for exported function.

Add documentation for this exported function explaining its purpose, parameters, return values, and the validation it performs.

πŸ”Ž Proposed godoc comment
+// GetDaemonSetPodOnWorkerNode returns a ready, running pod from the specified DaemonSet on a real worker node.
+// It waits for a pod matching podNamePattern that is not being deleted and has all containers ready.
+// Returns an error if no suitable pod is found within the timeout period.
 func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string, podNamePattern string) (*v1.Pod, error) {

As per coding guidelines, function comments are required for all exported Go functions.

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between a216b57 and 776a34b.

πŸ“’ Files selected for processing (3)
  • tests/helpers/kube.go
  • tests/helpers/syslog-health-monitor.go
  • tests/syslog_health_monitor_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/helpers/syslog-health-monitor.go
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/helpers/kube.go
  • tests/syslog_health_monitor_test.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/syslog_health_monitor_test.go
🧠 Learnings (7)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation

Applied to files:

  • tests/helpers/kube.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label

Applied to files:

  • tests/helpers/kube.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

  • tests/syslog_health_monitor_test.go
🧬 Code graph analysis (1)
tests/syslog_health_monitor_test.go (2)
tests/helpers/syslog-health-monitor.go (3)
  • SetUpSyslogHealthMonitor (33-71)
  • StubJournalHTTPPort (27-27)
  • TearDownSyslogHealthMonitor (74-113)
tests/helpers/kube.go (1)
  • PortForwardPod (1590-1634)
πŸ”‡ Additional comments (8)
tests/syslog_health_monitor_test.go (8)

89-89: LGTM: Consistent use of StubJournalHTTPPort.

The test correctly uses the centralized helpers.StubJournalHTTPPort constant for syslog message injection.

Also applies to: 123-123


160-171: LGTM: Teardown properly centralized.

The teardown correctly extracts context values and delegates to helpers.TearDownSyslogHealthMonitor.


185-216: LGTM: Manual setup appropriate for metadata-less test.

This test intentionally skips metadata injection to verify XID detection without metadata, so the manual setup (without using SetUpSyslogHealthMonitor) is appropriate.


296-306: LGTM: Proper use of setup helper.

The setup correctly delegates to helpers.SetUpSyslogHealthMonitor and stores context values.


346-360: LGTM: Proper use of teardown helper.

The teardown correctly extracts context values and delegates to helpers.TearDownSyslogHealthMonitor.


362-380: LGTM: Well-structured STORE_ONLY strategy test.

The new test properly configures the syslog health monitor with STORE_ONLY strategy via setup helper arguments and stores necessary context values for teardown.


382-404: LGTM: Accurate test description and assertions.

The assess description correctly describes the test intent: verifying that XID errors in STORE_ONLY mode do not create node conditions or cordon nodes. The assertions properly validate this behavior.


406-419: LGTM: Proper strategy restoration in teardown.

The teardown correctly restores EXECUTE_REMEDIATION strategy to avoid impacting subsequent tests.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-syslog branch from 776a34b to 0a1dc9a Compare January 7, 2026 09:02
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

πŸ€– Fix all issues with AI agents
In
@platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go:
- Around line 1606-1612: The test currently indexes nvsentinelConditions[0] when
tc.expectNodeConditions is true which can panic if the slice is empty; update
the assertion to first assert the slice has at least one element (e.g.,
assert.Greater(t, len(nvsentinelConditions), 0) or assert.Len/NotEmpty) and only
then assert the Type value against tc.expectedConditionType, referring to the
existing test variables tc.expectNodeConditions and nvsentinelConditions to
locate where to add the length check before accessing index 0.
- Around line 1620-1626: The test accesses events.Items[0] directly which can
panic if the slice is empty; when tc.expectKubernetesEvents is true first assert
that events.Items is non-empty (e.g. use assert.Greater(t, len(events.Items), 0)
or assert.NotEmpty) and only then compare events.Items[0].Type to
tc.expectedEventType, ensuring the check references tc.expectKubernetesEvents,
events.Items and tc.expectedEventType so the test fails gracefully instead of
panicking.

In @tests/data/healthy-event.json:
- Around line 19-20: The processingStrategy for the healthy event is wrong:
update the JSON key "processingStrategy" in the healthy event object (the one
containing "nodeName": "NODE_NAME" and "isHealthy": true / "message": "No health
failures") from 1 to 2 so it uses STORE_ONLY rather than EXECUTE_REMEDIATION;
ensure the numeric value is changed to 2 to reflect observability-only behavior.

In @tests/helpers/kube.go:
- Around line 2334-2338: The loop over containers that calls
removeArgsFromContainer(&containers[i], args) must validate the target container
was found; add a boolean found flag (e.g., found := false) set to true when
containers[i].Name == containerName, break, and after the loop check found and
return an error (or propagate an appropriate error) if false so the caller knows
the container wasn't presentβ€”mirror the behavior in UpdateDaemonSetArgs and
reference the containerName, removeArgsFromContainer, and UpdateDaemonSetArgs
symbols when making the change.
- Around line 2296-2301: The loop over containers that compares
containers[i].Name to containerName and calls setArgsOnContainer(t,
&containers[i], args) misses handling when no match is found; update the code
that iterates over containers (the block referencing containers, containerName,
and setArgsOnContainer) to track whether a container was found (e.g., found :=
false), set found = true when you call setArgsOnContainer, and after the loop
either call t.Fatalf (or return an error from the helper) with a clear message
including containerName so the test fails loudly if no container matched.

In @tests/syslog_health_monitor_test.go:
- Around line 51-52: Remove the redundant error assertion: the call to
helpers.SetUpSyslogHealthMonitor(ctx, t, client, nil) returns (testNodeName,
syslogPod, stopChan) and not an error, so delete the require.NoError(t, err,
"failed to set up syslog health monitor") line (which references an
uninitialized err); ensure any later references to err in this test are handled
or removed as well.
🧹 Nitpick comments (5)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)

1502-1524: Consider consolidating or differentiating duplicate test scenarios.

This test case ("STORE_ONLY non fatal event should not create Kubernetes event") appears to duplicate the scenario at lines 1423-1442. Both test STORE_ONLY with IsFatal: false expecting no node conditions and no Kubernetes events. If you intend to test different CheckName values, consider clarifying this in the test name or description; otherwise, consolidate to reduce redundancy.

data-models/protobufs/health_event.proto (1)

78-78: Add field-level documentation for processingStrategy.

While the enum itself is well-documented, the field lacks a comment explaining its purpose and usage context within the HealthEvent message.

πŸ“ Suggested documentation
   BehaviourOverrides quarantineOverrides = 14;
   BehaviourOverrides drainOverrides = 15;
+  // Directs how downstream modules (platform connectors, event stores) should handle this event.
+  // Defaults to UNSPECIFIED if not explicitly set.
   ProcessingStrategy processingStrategy = 16;

Based on coding guidelines: "Include comprehensive comments for all fields in Protocol Buffer messages."

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)

33-58: ProcessingStrategy correctly threaded through constructor.

The processingStrategy parameter is properly captured and stored in the XIDHandler struct. The initialization follows the same pattern as other configuration fields.

Consider whether processingStrategy should be validated in the constructor to reject ProcessingStrategy_UNSPECIFIED (the proto default value of 0). While validation currently occurs in main.go before constructors are called, adding validation here would provide defensive programming and catch configuration errors at the point of construction, consistent with best practices for enum parameters that must have meaningful values.

tests/helpers/healthevent.go (1)

31-31: Consider using the enum type for type safety.

The ProcessingStrategy field is declared as int rather than protos.ProcessingStrategy. While this may be intentional for JSON marshaling flexibility, using the enum type would provide stronger type safety and prevent invalid values at compile time.

♻️ Suggested refactor for type safety

If JSON marshaling works correctly with the enum type, consider:

 type HealthEventTemplate struct {
 	Version             int                  `json:"version"`
 	Agent               string               `json:"agent"`
 	ComponentClass      string               `json:"componentClass,omitempty"`
 	CheckName           string               `json:"checkName"`
 	IsFatal             bool                 `json:"isFatal"`
 	IsHealthy           bool                 `json:"isHealthy"`
 	Message             string               `json:"message"`
 	RecommendedAction   int                  `json:"recommendedAction,omitempty"`
 	ErrorCode           []string             `json:"errorCode,omitempty"`
 	EntitiesImpacted    []EntityImpacted     `json:"entitiesImpacted,omitempty"`
 	Metadata            map[string]string    `json:"metadata,omitempty"`
 	QuarantineOverrides *QuarantineOverrides `json:"quarantineOverrides,omitempty"`
 	NodeName            string               `json:"nodeName"`
-	ProcessingStrategy  int                  `json:"processingStrategy,omitempty"`
+	ProcessingStrategy  protos.ProcessingStrategy `json:"processingStrategy,omitempty"`
 }

And update the constructor:

 func NewHealthEvent(nodeName string) *HealthEventTemplate {
 	return &HealthEventTemplate{
 		Version:        1,
 		Agent:          "gpu-health-monitor",
 		ComponentClass: "GPU",
 		CheckName:      "GpuXidError",
 		IsFatal:        true,
 		IsHealthy:      false,
 		NodeName:       nodeName,
 		EntitiesImpacted: []EntityImpacted{
 			{
 				EntityType:  "GPU",
 				EntityValue: "0",
 			},
 		},
-		ProcessingStrategy: int(protos.ProcessingStrategy_EXECUTE_REMEDIATION),
+		ProcessingStrategy: protos.ProcessingStrategy_EXECUTE_REMEDIATION,
 	}
 }

And update the fluent setter signature:

-func (h *HealthEventTemplate) WithProcessingStrategy(strategy int) *HealthEventTemplate {
+func (h *HealthEventTemplate) WithProcessingStrategy(strategy protos.ProcessingStrategy) *HealthEventTemplate {
 	h.ProcessingStrategy = strategy
 	return h
 }

Also applies to: 49-49, 76-76

tests/fault_quarantine_test.go (1)

334-346: Teardown only resets SysLogsXIDError - consider resetting GpuPowerWatch as well.

The teardown sends a healthy event for SysLogsXIDError but doesn't explicitly clear the GpuPowerWatch state. While TeardownQuarantineTest may handle general cleanup, explicitly resetting both check types would make the test more self-contained and prevent potential state leakage to subsequent tests.

♻️ Suggested improvement
 feature.Teardown(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
     event := helpers.NewHealthEvent(testCtx.NodeName).
         WithErrorCode("79").
         WithHealthy(true).
         WithAgent(helpers.SYSLOG_HEALTH_MONITOR_AGENT).
         WithCheckName("SysLogsXIDError")
     helpers.SendHealthEvent(ctx, t, event)

+    // Also reset GpuPowerWatch state
+    powerEvent := helpers.NewHealthEvent(testCtx.NodeName).
+        WithErrorCode("DCGM_FR_CLOCK_THROTTLE_POWER").
+        WithHealthy(true).
+        WithCheckName("GpuPowerWatch")
+    helpers.SendHealthEvent(ctx, t, powerEvent)
+
     return helpers.TeardownQuarantineTest(ctx, t, c)
 })
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 776a34b and 0a1dc9a.

β›” Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
πŸ“’ Files selected for processing (35)
  • data-models/protobufs/health_event.proto
  • distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl
  • distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • fault-quarantine/pkg/initializer/init.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • health-monitors/syslog-health-monitor/main.go
  • health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
  • health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/sxid/types.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
  • tests/data/fatal-health-event.json
  • tests/data/healthy-event.json
  • tests/data/managed-by-nvsentinel-configmap.yaml
  • tests/data/unsupported-health-event.json
  • tests/event_exporter_test.go
  • tests/fault_quarantine_test.go
  • tests/helpers/event_exporter.go
  • tests/helpers/healthevent.go
  • tests/helpers/kube.go
  • tests/helpers/syslog-health-monitor.go
  • tests/syslog_health_monitor_test.go
🚧 Files skipped from review as they are similar to previous changes (9)
  • health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
  • distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • tests/helpers/syslog-health-monitor.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
  • fault-quarantine/pkg/initializer/init.go
  • distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
  • tests/event_exporter_test.go
🧰 Additional context used
πŸ““ Path-based instructions (4)
data-models/protobufs/**/*.proto

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

  • data-models/protobufs/health_event.proto
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • health-monitors/syslog-health-monitor/main.go
  • tests/helpers/healthevent.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
  • tests/fault_quarantine_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • health-monitors/syslog-health-monitor/pkg/sxid/types.go
  • tests/helpers/event_exporter.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • tests/helpers/kube.go
  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
  • health-monitors/syslog-health-monitor/pkg/gpufallen/types.go
  • tests/syslog_health_monitor_test.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
  • tests/fault_quarantine_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
  • tests/syslog_health_monitor_test.go
**/*.py

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
🧠 Learnings (14)
πŸ“š Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

  • health-monitors/syslog-health-monitor/main.go
  • health-monitors/syslog-health-monitor/pkg/sxid/types.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go
  • tests/syslog_health_monitor_test.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go
  • health-monitors/syslog-health-monitor/pkg/xid/types.go
πŸ“š Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

  • tests/helpers/healthevent.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/helpers/healthevent.go
  • tests/fault_quarantine_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go
  • tests/fault_quarantine_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
  • health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • tests/fault_quarantine_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/fault_quarantine_test.go
  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go
  • tests/helpers/event_exporter.go
  • health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
πŸ“š Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • tests/data/managed-by-nvsentinel-configmap.yaml
πŸ“š Learning: 2025-12-12T07:41:27.339Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:27.339Z
Learning: In tests/data/health-events-analyzer-config.yaml, the XID74Reg2Bit13Set rule intentionally omits the time window filter; tests should verify only the register bit pattern (bit 13 in REG2) on the incoming XID 74 event and should not rely on historical events or counts of repeats. If adding similar rules elsewhere, apply the same pattern and document that the time window filter is unnecessary for single-event bit checks.

Applied to files:

  • tests/data/managed-by-nvsentinel-configmap.yaml
πŸ“š Learning: 2025-12-12T07:36:26.109Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:26.109Z
Learning: In NVSentinel health-events analyzer config files (e.g., tests/data/health-events-analyzer-config.yaml and similar), XID 74 errors are guaranteed to include a GPU_UUID entry within healthevent.entitiesimpacted. Therefore, when writing XID 74-specific rules that filter for GPU_UUID entities, you can skip null checks (no $ifNull) for GPU_UUID. Ensure tests relying on this assumption are updated accordingly and document the guarantee in the rule descriptions.

Applied to files:

  • tests/data/managed-by-nvsentinel-configmap.yaml
πŸ“š Learning: 2025-12-12T07:38:37.023Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:37.023Z
Learning: In NVSentinel, XID 74 errors always include an NVLINK entry in healthevent.entitiesimpacted, so null-checking with $ifNull is unnecessary when filtering for NVLINK entities in XID 74-specific rules. Apply this rule to YAML test fixtures under tests/ data (e.g., tests/data/health-events-analyzer-config.yaml) and any similar health-event configuration tests. If applying in code, ensure downstream filters rely on the presence of NVLINK in entitiesimpacted for XID 74 only, but continue to guard other fields and XIDs with appropriate null checks.

Applied to files:

  • tests/data/managed-by-nvsentinel-configmap.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Each health monitor service should implement the gRPC service from `data-models/protobufs/`

Applied to files:

  • health-monitors/syslog-health-monitor/pkg/sxid/types.go
πŸ“š Learning: 2025-12-23T05:02:22.108Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: store-client/pkg/client/postgresql_pipeline_builder.go:119-132
Timestamp: 2025-12-23T05:02:22.108Z
Learning: In the NVSentinel codebase, protobuf fields stored in MongoDB should use lowercase field names (e.g., processingstrategy, componentclass, checkname). Ensure pipeline filters and queries that access protobuf fields in the database consistently use lowercase field names in the store-client package, avoiding camelCase mappings for MongoDB reads/writes.

Applied to files:

  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • tests/syslog_health_monitor_test.go
🧬 Code graph analysis (10)
health-monitors/syslog-health-monitor/main.go (1)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_value (58-62)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-18)
data-models/pkg/protos/health_event.pb.go (5)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (75-77)
  • ProcessingStrategy (79-81)
  • ProcessingStrategy (88-90)
  • ProcessingStrategy_EXECUTE_REMEDIATION (47-47)
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (1)
data-models/pkg/protos/health_event.pb.go (2)
  • ProcessingStrategy_STORE_ONLY (48-48)
  • ProcessingStrategy_EXECUTE_REMEDIATION (47-47)
tests/fault_quarantine_test.go (5)
tests/helpers/fault_quarantine.go (3)
  • QuarantineTestContext (51-54)
  • SetupQuarantineTest (107-112)
  • TeardownQuarantineTest (181-242)
tests/helpers/kube.go (3)
  • SetNodeManagedByNVSentinel (1389-1408)
  • WaitForNodeConditionWithCheckName (1662-1706)
  • WaitForNodeEvent (349-385)
tests/helpers/healthevent.go (1)
  • NewHealthEvent (61-78)
tests/helpers/health_events_analyzer.go (1)
  • SYSLOG_HEALTH_MONITOR_AGENT (38-38)
data-models/pkg/protos/health_event.pb.go (2)
  • ProcessingStrategy_STORE_ONLY (48-48)
  • ProcessingStrategy_EXECUTE_REMEDIATION (47-47)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-18)
data-models/pkg/protos/health_event.pb.go (5)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (75-77)
  • ProcessingStrategy (79-81)
  • ProcessingStrategy (88-90)
  • ProcessingStrategy_EXECUTE_REMEDIATION (47-47)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-18)
data-models/pkg/protos/health_event.pb.go (5)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (75-77)
  • ProcessingStrategy (79-81)
  • ProcessingStrategy (88-90)
  • ProcessingStrategy_EXECUTE_REMEDIATION (47-47)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (3)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-18)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (75-77)
  • ProcessingStrategy (79-81)
  • ProcessingStrategy (88-90)
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
  • XIDHandler (29-39)
tests/syslog_health_monitor_test.go (2)
tests/helpers/syslog-health-monitor.go (3)
  • SetUpSyslogHealthMonitor (33-71)
  • StubJournalHTTPPort (27-27)
  • TearDownSyslogHealthMonitor (74-113)
tests/helpers/kube.go (3)
  • InjectSyslogMessages (1820-1843)
  • PortForwardPod (1614-1658)
  • EnsureNodeConditionNotPresent (1797-1818)
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (3)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-18)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (75-77)
  • ProcessingStrategy (79-81)
  • ProcessingStrategy (88-90)
health-monitors/syslog-health-monitor/pkg/sxid/types.go (1)
  • SXIDHandler (29-36)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (75-77)
  • ProcessingStrategy (79-81)
  • ProcessingStrategy (88-90)
πŸ”‡ Additional comments (57)
tests/data/fatal-health-event.json (1)

20-20: LGTM! Processing strategy addition is appropriate for fatal events.

The addition of processingStrategy: 1 (EXECUTE_REMEDIATION) aligns well with the fatal nature of this test event, as fatal health events would typically require remediation actions.

store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go (1)

404-404: LGTM! Field mapping follows established conventions.

The new processingstrategy β†’ processingStrategy mapping correctly handles the case conversion between MongoDB's lowercase bson field names and PostgreSQL's camelCase JSON field names, consistent with the existing pattern and aligns with the project's field naming conventions.

Based on learnings, this ensures consistency with how protobuf fields are stored and queried across different data stores.

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)

1550-1559: LGTM on test setup and isolation.

Good use of local context, clientset, and ring buffer per test case to ensure test isolation. The deferred close(stopCh) ensures proper cleanup.

tests/data/managed-by-nvsentinel-configmap.yaml (1)

43-59: LGTM! The new rule-set structure is correct and properly mirrors the existing GPU health monitor rule.

Using the same taint and cordon settings for both gpu-health-monitor and syslog-health-monitor XID errors ensures consistent node quarantine behavior regardless of the detection source. The syslog-health-monitor component emits HealthEvents with agent == 'syslog-health-monitor' and checkName == 'SysLogsXIDError' as expected (extensively verified in the test suite).

health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (3)

243-246: LGTM: ProcessingStrategy correctly set for maintenance events.

The test expectations properly include ProcessingStrategy: pb.ProcessingStrategy_EXECUTE_REMEDIATION for CSP maintenance events, which aligns with the intended behavior where these events should trigger actual remediation workflows.


272-274: LGTM: Healthy event test updated correctly.

The ProcessingStrategy field is appropriately set to EXECUTE_REMEDIATION for healthy/recovery events, maintaining consistency with the quarantine event behavior.


336-338: LGTM: Unknown action case handles ProcessingStrategy consistently.

Even for events with unknown recommended actions (which default to NONE), the ProcessingStrategy is correctly set to EXECUTE_REMEDIATION, ensuring the event still participates in the remediation workflow.

data-models/protobufs/health_event.proto (1)

32-39: LGTM: Well-documented ProcessingStrategy enum.

The enum definition clearly documents the behavioral distinction between EXECUTE_REMEDIATION (modifies cluster state) and STORE_ONLY (observability-only). The three-value design (including UNSPECIFIED) follows protobuf best practices.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

14-18: LGTM: Generated Python type stubs correctly reflect proto changes.

The auto-generated Python type stub properly includes the new ProcessingStrategy enum and integrates it into the HealthEvent message with appropriate type annotations. The structure is consistent with the protobuf definition.

Also applies to: 32-34, 80-80, 106-106, 122-122, 140-140

health-monitors/syslog-health-monitor/pkg/xid/types.go (1)

20-20: LGTM: ProcessingStrategy field added cleanly to XIDHandler.

The import and struct field addition follow Go conventions. The unexported field appropriately stores the processing strategy configuration for the handler, consistent with other configuration fields like nodeName and checkName.

Also applies to: 34-34

health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)

189-189: LGTM: ProcessingStrategy propagated to HealthEvent.

The handler's processingStrategy is correctly assigned to the generated HealthEvent, completing the data flow from configuration to emitted events.

tests/helpers/healthevent.go (1)

155-158: LGTM: Fluent setter follows existing pattern.

The WithProcessingStrategy method correctly follows the established fluent builder pattern used by other methods in this struct.

health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)

362-364: LGTM: TODO properly references tracking PR.

The hardcoded ProcessingStrategy_EXECUTE_REMEDIATION value is acceptable as an interim implementation, and the TODO comment appropriately references PR #641 for making it configurable via the config file.

tests/data/unsupported-health-event.json (1)

20-20: LGTM: Test data includes ProcessingStrategy field.

The processingStrategy value of 1 corresponds to EXECUTE_REMEDIATION and aligns with the new ProcessingStrategy enum support.

tests/helpers/event_exporter.go (1)

221-226: LGTM: Validation logic correctly extended.

The ValidateCloudEvent function signature and implementation are correctly updated to validate the processingStrategy field in CloudEvent payloads, following the existing validation pattern.

Also applies to: 245-245

health-monitors/syslog-health-monitor/pkg/sxid/types.go (1)

20-20: LGTM: Proper use of enum type for ProcessingStrategy.

The processingStrategy field correctly uses the pb.ProcessingStrategy enum type rather than int, providing type safety for production code. The import alias pb follows Go conventions.

Also applies to: 34-34

health-monitors/syslog-health-monitor/main.go (3)

70-72: LGTM!

The new --processing-strategy flag is well-defined with a sensible default (EXECUTE_REMEDIATION) and clear documentation of allowed values.


164-171: LGTM!

The validation logic correctly uses the protobuf-generated ProcessingStrategy_value map to validate the flag input, provides a clear error message for invalid values, and logs the configured strategy for observability.


173-184: LGTM!

The processingStrategy is correctly passed to NewSyslogMonitor, threading the configuration through to the monitor instance.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

1-51: Generated protobuf code - no manual review needed.

This file is auto-generated by the protocol buffer compiler. The changes correctly reflect the addition of the ProcessingStrategy enum and the new processingStrategy field in HealthEvent. Ensure this file is regenerated from the source .proto file rather than manually edited.

tests/fault_quarantine_test.go (4)

26-32: LGTM!

Import for the protos package is correctly added to access ProcessingStrategy enum values.


234-251: LGTM!

Test setup correctly initializes the quarantine test context and enables NVSentinel management on the node before running assertions.


253-289: LGTM!

Good coverage for STORE_ONLY behavior - the test correctly verifies that events with STORE_ONLY processing strategy do not trigger quarantine actions (no node condition, no cordoning, no annotation).


291-332: LGTM!

Good coverage for EXECUTE_REMEDIATION behavior - the test correctly verifies that events with EXECUTE_REMEDIATION processing strategy trigger the expected quarantine actions (node condition present, cordoned with annotation).

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (2)

30-42: LGTM!

The constructor correctly accepts the new processingStrategy parameter and initializes the handler field. The signature change aligns with the pattern used in other handlers (XID, GPUFallen).


95-110: LGTM!

The ProcessingStrategy is correctly propagated to the HealthEvent payload, ensuring the strategy configured at initialization flows through to emitted events.

health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (7)

309-317: LGTM!

Test correctly verifies that NewSyslogMonitor accepts and stores the processingStrategy parameter. The assertion on line 317 confirms the field is properly set.


329-334: LGTM!

Test correctly verifies NewSyslogMonitorWithFactory with a different strategy (EXECUTE_REMEDIATION), providing coverage for both factory variants.


391-403: LGTM!

TestJournalProcessingLogic updated to include the processingStrategy parameter.


494-507: LGTM!

TestJournalStateManagement updated to include the processingStrategy parameter in both monitor instances.


574-586: LGTM!

TestBootIDChangeHandling updated to include the processingStrategy parameter.


624-637: LGTM!

TestRunMultipleChecks updated to include the processingStrategy parameter.


665-679: LGTM!

TestGPUFallenOffHandlerInitialization updated to include the processingStrategy parameter.

health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (8)

88-91: LGTM!

TestDetermineFatality correctly updated to include the processingStrategy parameter.


141-148: LGTM!

Table-driven test case for NVRM GPU Map Line correctly includes processingStrategy.


156-195: Good test coverage for STORE_ONLY processing strategy.

The "Valid XID Message" test case correctly uses STORE_ONLY and validates that the ProcessingStrategy is propagated to the event. This provides coverage for the non-default strategy path.


200-232: LGTM!

"Valid XID with GPU UUID" test case correctly validates EXECUTE_REMEDIATION propagation.


237-280: LGTM!

Error path test cases correctly updated with processingStrategy parameter.


306-349: LGTM!

TestCreateHealthEventFromResponse correctly validates that ProcessingStrategy is included in the generated health event.


351-379: LGTM!

"XID Event with GPU UUID" test case validates ProcessingStrategy propagation with entity enrichment.


427-446: LGTM!

TestNewXIDHandler table-driven tests correctly updated with processingStrategy parameter.

health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (4)

17-25: LGTM!

New imports for os, filepath, and protobuf package are correctly added to support the new test functionality.


44-56: LGTM!

TestNewSXIDHandler correctly updated to include the processingStrategy parameter.


143-196: Excellent test coverage for SXID handler with valid topology.

This test validates the complete flow: metadata loading, SXID parsing, GPU resolution via NVSwitch link topology, and ProcessingStrategy propagation. The temporary metadata file approach is appropriate for isolated testing.

One minor observation: the test hardcodes entity indices (3, 4) for GPU assertions. If the entity order changes in the handler, these assertions would break. Consider using a helper to find entities by type, or document why these indices are expected.


198-249: LGTM!

TestProcessLine table-driven test correctly updated to include processingStrategy parameter and validate its propagation in the event.

tests/syslog_health_monitor_test.go (1)

362-422: LGTM! Well-structured STORE_ONLY strategy test.

The new test correctly:

  • Configures the syslog monitor with STORE_ONLY strategy during setup
  • Verifies that XID errors do not trigger node conditions or cordoning
  • Restores EXECUTE_REMEDIATION strategy during teardown to avoid affecting subsequent tests
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (3)

41-58: LGTM! Processing strategy parameter correctly threaded.

The processingStrategy parameter is properly added to NewSyslogMonitor and forwarded to NewSyslogMonitorWithFactory.


63-151: LGTM! Processing strategy initialization is correct.

The processingStrategy parameter is:

  • Added to the function signature
  • Stored in the SyslogMonitor struct (line 96)
  • Passed to all handler constructors (lines 110, 120, 130)

This correctly threads the strategy through the monitor initialization.


792-820: LGTM! Processing strategy correctly set on health events.

Line 813 correctly assigns sm.processingStrategy to event.ProcessingStrategy, ensuring the strategy is propagated to all health events generated by the monitor.

tests/helpers/kube.go (6)

387-409: LGTM!

The function correctly uses require.Never to assert that a specified event never appears on the node during the timeout period. The logic and error handling are appropriate for a negative assertion in tests.


2232-2273: LGTM!

The rollout wait logic correctly verifies that all DaemonSet pods are scheduled, updated, and ready before proceeding. The progress logging is helpful for test debugging.


2353-2383: LGTM!

The function correctly handles both --flag=value and --flag value argument styles, including proper bounds checking when inserting values.


2385-2407: LGTM!

The function correctly updates existing arguments in place or appends new ones, handling both value and boolean flag styles appropriately.


2409-2433: LGTM!

The function correctly removes arguments in both --flag=value and --flag value styles, properly handling the deletion of both the flag and its associated value when applicable.


2435-2479: LGTM!

The function correctly waits for a DaemonSet pod to be fully ready on a worker node, with proper validation that the pod is not being deleted, is running, and has all readiness checks passing.

health-monitors/syslog-health-monitor/pkg/gpufallen/types.go (2)

52-52: LGTM!

The processingStrategy field is well-integrated into the struct. The field naming follows Go conventions, and its placement before the mutex fields is appropriate for configuration data that's set during initialization.


22-23: LGTM! The import alias pb for the protos package follows Go conventions and is correctly structured.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-syslog branch from 0a1dc9a to 1df5558 Compare January 7, 2026 10:12
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

πŸ€– Fix all issues with AI agents
In
@platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go:
- Line 1509: Two test cases in k8s_platform_connector_test.go set ErrorCode to a
slice containing an empty string (ErrorCode: []string{""}) which should be an
empty slice; update both instances to use an empty slice literal (ErrorCode:
[]string{}) so they match other tests and semantic intent. Locate the ErrorCode
field in the failing test case structs (the occurrences near the existing DCGM
error-code tests) and replace []string{""} with []string{} in both places.
🧹 Nitpick comments (6)
tests/helpers/kube.go (3)

387-409: Consider failing fast on persistent API errors.

If GetNodeEvents fails repeatedly due to API issues, require.Never will continue checking until timeout, masking the real error. Consider tracking consecutive failures and calling t.Fatalf after a threshold.

♻️ Proposed enhancement
 func EnsureNodeEventNotPresent(ctx context.Context, t *testing.T,
 	c klient.Client, nodeName string, eventType, eventReason string) {
 	t.Helper()
 
+	consecutiveErrors := 0
+	maxConsecutiveErrors := 3
+
 	require.Never(t, func() bool {
 		events, err := GetNodeEvents(ctx, c, nodeName, eventType)
 		if err != nil {
 			t.Logf("failed to get events for node %s: %v", nodeName, err)
+			consecutiveErrors++
+			if consecutiveErrors >= maxConsecutiveErrors {
+				t.Fatalf("failed to get events %d times in a row, likely API issue", consecutiveErrors)
+			}
 			return false
 		}
+		consecutiveErrors = 0
 
 		for _, event := range events.Items {
 			if event.Type == eventType && event.Reason == eventReason {
 				t.Logf("node %s has event %v", nodeName, event)
 				return true
 			}
 		}
 
 		t.Logf("node %s does not have event %v", nodeName, eventType)
 
 		return false
 	}, NeverWaitTimeout, WaitInterval, "node %s should not have event %v", nodeName, eventType)
 }

2317-2318: Clarify or remove the hard-coded sleep.

The 10-second sleep after rollout completion seems arbitrary, since waitForDaemonSetRollout already verifies all pods are updated and ready. If this addresses a specific race condition (e.g., waiting for containers to initialize), document it. Otherwise, consider removing it.


2362-2392: Consider simplifying the nested append.

Line 2384 uses nested append calls which create two allocations. For test helper code this is acceptable, but it could be slightly more efficient and clearer.

♻️ Alternative approach
 	// Match --flag or --flag value style
 	if existingArg == flag {
 		if value != "" {
 			if j+1 < len(container.Args) && !strings.HasPrefix(container.Args[j+1], "-") {
 				container.Args[j+1] = value
 			} else {
-				container.Args = append(container.Args[:j+1], append([]string{value}, container.Args[j+1:]...)...)
+				// Insert value after flag
+				container.Args = append(container.Args, "")
+				copy(container.Args[j+2:], container.Args[j+1:])
+				container.Args[j+1] = value
 			}
 		}
 
 		return true
 	}
tests/helpers/syslog-health-monitor.go (3)

15-15: Add package-level godoc comment.

Per Go conventions and coding guidelines, packages should have a package-level documentation comment explaining their purpose.

πŸ“ Suggested documentation
+// Package helpers provides test utilities for setting up and tearing down
+// test environments for NVSentinel health monitors.
 package helpers

Based on coding guidelines, package-level godoc is required for all Go packages.


32-34: Follow Go godoc conventions for exported function comments.

Function comments for exported functions should start with the function name according to Go documentation conventions.

πŸ“ Proposed fix
-// helper function to set up syslog health monitor and port forward to it
+// SetUpSyslogHealthMonitor sets up the syslog health monitor test environment by
+// optionally updating DaemonSet args, injecting metadata, establishing port-forwarding,
+// and configuring node labels. Returns the test node name, syslog pod, and stop channel.
 func SetUpSyslogHealthMonitor(ctx context.Context, t *testing.T,
 	client klient.Client, args map[string]string) (string, *v1.Pod, chan struct{}) {

Based on coding guidelines, function comments are required for all exported Go functions and should follow standard Go conventions.


73-76: Follow Go godoc conventions for exported function comments.

Function comments for exported functions should start with the function name and clearly describe what the function does.

πŸ“ Proposed fix
-// helper function to roll back syslog health monitor daemonset and stop the port forward
+// TearDownSyslogHealthMonitor tears down the syslog health monitor test environment by
+// stopping port-forwarding, optionally restoring DaemonSet args, restarting the pod,
+// cleaning up metadata, and removing node labels.
 func TearDownSyslogHealthMonitor(ctx context.Context, t *testing.T, client klient.Client,
 	nodeName string, stopChan chan struct{},
 	args map[string]string, podName string) {

Based on coding guidelines, function comments are required for all exported Go functions and should follow standard Go conventions.

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 0a1dc9a and 1df5558.

πŸ“’ Files selected for processing (4)
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • tests/helpers/kube.go
  • tests/helpers/syslog-health-monitor.go
  • tests/syslog_health_monitor_test.go
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • tests/helpers/syslog-health-monitor.go
  • tests/syslog_health_monitor_test.go
  • tests/helpers/kube.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • tests/syslog_health_monitor_test.go
🧠 Learnings (8)
πŸ““ Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:40.921Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
πŸ“š Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label

Applied to files:

  • tests/helpers/kube.go
🧬 Code graph analysis (2)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (4)
data-models/pkg/protos/health_event.pb.go (17)
  • HealthEvent (263-283)
  • HealthEvent (296-296)
  • HealthEvent (311-313)
  • Entity (211-217)
  • Entity (230-230)
  • Entity (245-247)
  • RecommendedAction (92-92)
  • RecommendedAction (142-144)
  • RecommendedAction (146-148)
  • RecommendedAction (155-157)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (75-77)
  • ProcessingStrategy (79-81)
  • ProcessingStrategy (88-90)
  • HealthEvents (159-165)
  • HealthEvents (178-178)
  • HealthEvents (193-195)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (5)
  • HealthEvent (63-141)
  • Entity (55-61)
  • RecommendedAction (20-30)
  • ProcessingStrategy (14-18)
  • HealthEvents (45-53)
platform-connectors/pkg/connectors/kubernetes/k8s_connector.go (1)
  • NewK8sConnector (47-58)
platform-connectors/pkg/pipeline/factory.go (1)
  • Create (32-39)
tests/syslog_health_monitor_test.go (2)
tests/helpers/syslog-health-monitor.go (2)
  • SetUpSyslogHealthMonitor (33-71)
  • StubJournalHTTPPort (27-27)
tests/helpers/kube.go (2)
  • InjectSyslogMessages (1820-1843)
  • PortForwardPod (1614-1658)
πŸ”‡ Additional comments (9)
tests/helpers/kube.go (5)

2232-2273: LGTM!

The rollout verification logic correctly checks that all DaemonSet pods are scheduled, updated, and ready before proceeding.


2323-2360: LGTM!

Container validation is properly implemented, and the function correctly handles the retry-on-conflict pattern.


2394-2416: LGTM!

The function correctly updates existing arguments or appends new ones.


2418-2442: LGTM!

The function correctly removes arguments in both --flag=value and --flag value formats. The break after removal ensures only the first occurrence is removed, which is appropriate for typical container args.


2444-2488: LGTM!

The function properly validates the pod is from the current rollout, not being deleted, running, and fully ready before returning it.

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (3)

1606-1614: Previous concern properly addressed.

The code now uses require.NotEmpty before accessing nvsentinelConditions[0], which correctly prevents the index-out-of-range panic that was flagged in earlier reviews. The test will fail with a clear message if the expected condition is not found.


1622-1630: Previous concern properly addressed.

The code now uses require.NotEmpty before accessing events.Items[0], which correctly prevents the index-out-of-range panic that was flagged in earlier reviews. The test will fail with a clear message if the expected event is not found.


1391-1635: Well-structured test for processing strategy validation.

The test provides comprehensive coverage of the new ProcessingStrategy field:

  • STORE_ONLY events correctly bypass node condition and Kubernetes event creation
  • EXECUTE_REMEDIATION events properly create node conditions (fatal) or Kubernetes events (non-fatal)
  • Mixed strategy batches are handled correctly

The test follows Go best practices with table-driven structure, proper test isolation via dedicated clientsets, and clear assertions using testify (consistent with existing patterns in this file).

tests/syslog_health_monitor_test.go (1)

361-421: Well-structured test for STORE_ONLY strategy.

The new test correctly validates STORE_ONLY behavior by:

  • Setting up the monitor with --processing-strategy: STORE_ONLY
  • Verifying no node conditions are created after XID injection
  • Verifying no node cordoning occurs
  • Restoring EXECUTE_REMEDIATION strategy in teardown for proper cleanup

This follows the established test patterns and properly exercises the new processing strategy feature.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-syslog branch from 1df5558 to e5a8295 Compare January 7, 2026 14:57
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

πŸ€– Fix all issues with AI agents
In @tests/helpers/kube.go:
- Around line 2326-2366: The RemoveDaemonSetArgs function uses
require.NoError(t, err, ...) which fails the test immediately and prevents
returning the error to the caller; change this to propagate the error like
UpdateDaemonSetArgs by replacing the require.NoError call with standard error
handling (if err != nil { t.Logf("failed to remove args from daemonset %s/%s:
%v", NVSentinelNamespace, daemonsetName, err); return err }) so the function
returns the error to callers while preserving the existing logs and rollout wait
behavior; keep the function signature as RemoveDaemonSetArgs(... ) error and
ensure callers handle the returned error.
🧹 Nitpick comments (3)
tests/helpers/kube.go (2)

387-409: Consider improving log message specificity.

Line 405 logs only eventType, but the function also receives eventReason as a parameter. Including both in the log message would provide more context for debugging.

♻️ Proposed improvement
-		t.Logf("node %s does not have event %v", nodeName, eventType)
+		t.Logf("node %s does not have event type=%s reason=%s", nodeName, eventType, eventReason)

2320-2321: Consider replacing fixed sleep with condition-based wait.

The hardcoded 10-second sleep after rollout completion could be replaced with polling for the specific condition you're waiting for (e.g., pod readiness, specific container state). This would make tests more reliable and potentially faster.

tests/syslog_health_monitor_test.go (1)

361-421: LGTM: Well-structured STORE_ONLY strategy test.

The test correctly verifies that STORE_ONLY mode prevents both condition creation and node cordoning when XID errors are injected. The setup/teardown properly configure and restore the processing strategy.

Optional: Make assess description more concise and complete

The description could mention both aspects being verified (condition and cordoning) more concisely:

-	feature.Assess("Inject XID errors and verify no node condition is created when running in STORE_ONLY strategy", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
+	feature.Assess("Verify STORE_ONLY strategy prevents condition creation and cordoning", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 1df5558 and e5a8295.

πŸ“’ Files selected for processing (4)
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • tests/helpers/kube.go
  • tests/helpers/syslog-health-monitor.go
  • tests/syslog_health_monitor_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/helpers/syslog-health-monitor.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/syslog_health_monitor_test.go
  • tests/helpers/kube.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/syslog_health_monitor_test.go
🧠 Learnings (6)
πŸ““ Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/syslog_health_monitor_test.go
🧬 Code graph analysis (1)
tests/syslog_health_monitor_test.go (2)
tests/helpers/syslog-health-monitor.go (3)
  • SetUpSyslogHealthMonitor (33-71)
  • StubJournalHTTPPort (27-27)
  • TearDownSyslogHealthMonitor (74-113)
tests/helpers/kube.go (3)
  • InjectSyslogMessages (1820-1843)
  • PortForwardPod (1614-1658)
  • EnsureNodeConditionNotPresent (1797-1818)
πŸ”‡ Additional comments (6)
tests/helpers/kube.go (3)

2232-2273: LGTM!

The rollout verification logic correctly checks all necessary DaemonSet status fields (DesiredNumberScheduled, UpdatedNumberScheduled, NumberReady) to ensure complete rollout.


2368-2448: LGTM!

The container argument manipulation helpers correctly handle multiple argument styles (--flag=value, --flag value, --flag) with proper slice operations for updates, insertions, and removals.


2450-2494: LGTM!

The function properly validates that the DaemonSet pod is running, ready, and not being deleted, ensuring tests interact with stable pods from the current rollout generation.

tests/syslog_health_monitor_test.go (3)

25-39: LGTM: Clean imports and context keys.

The helper import and updated context key names improve code clarity and maintainability.


42-173: LGTM: Well-refactored setup and teardown.

The test now uses centralized helper functions consistently, improving maintainability. The context-based value passing follows the e2e-framework pattern correctly.


290-359: LGTM: Consistent refactoring pattern.

The SXID detection test follows the same centralized setup/teardown pattern as the XID detection test, ensuring consistency across the test suite.

@tanishagoyal2 tanishagoyal2 force-pushed the 390-event-handling-in-syslog branch from e5a8295 to 0752eb5 Compare January 7, 2026 15:04
@github-actions
Copy link

github-actions bot commented Jan 7, 2026

Merging this branch will increase overall coverage

Impacted Packages Coverage Ξ” πŸ€–
github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes 82.99% (+0.21%) πŸ‘
github.com/nvidia/nvsentinel/tests 0.00% (ΓΈ)
github.com/nvidia/nvsentinel/tests/helpers 0.00% (ΓΈ)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Ξ” Total Covered Missed πŸ€–
github.com/nvidia/nvsentinel/tests/helpers/kube.go 0.00% (ΓΈ) 0 0 0
github.com/nvidia/nvsentinel/tests/helpers/syslog-health-monitor.go 0.00% (ΓΈ) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • github.com/nvidia/nvsentinel/tests/syslog_health_monitor_test.go

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

πŸ€– Fix all issues with AI agents
In @tests/syslog_health_monitor_test.go:
- Around line 405-418: The args map passed to TearDownSyslogHealthMonitor uses a
non-empty value for "--processing-strategy" even though removeArgsFromContainer
(and RemoveDaemonSetArgs) only inspects map keys; change the map entry to
"--processing-strategy": "" (empty string) or add an inline comment next to the
map literal documenting that only keys are used by
removeArgsFromContainer/RemoveDaemonSetArgs so values are ignored, so readers
aren't misled by a non-empty value.
🧹 Nitpick comments (4)
tests/helpers/kube.go (3)

2281-2293: Unused variable originalDaemonSet.

The variable originalDaemonSet is assigned on line 2292 but never used. This appears to be leftover code, possibly from a previous implementation that stored the original state for rollback.

♻️ Remove unused variable
 func UpdateDaemonSetArgs(ctx context.Context, t *testing.T,
 	client klient.Client, daemonsetName string, containerName string,
 	args map[string]string) error {
 	t.Helper()
 
-	var originalDaemonSet *appsv1.DaemonSet
-
 	t.Logf("Updating daemonset %s/%s with args %v", NVSentinelNamespace, daemonsetName, args)
 
 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		daemonSet := &appsv1.DaemonSet{}
 		if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil {
 			return err
 		}
 
-		if originalDaemonSet == nil {
-			originalDaemonSet = daemonSet.DeepCopy()
-		}
-
 		containers := daemonSet.Spec.Template.Spec.Containers

2321-2322: Magic sleep after rollout.

The 10-second hardcoded sleep after rollout completion is a code smell. If pods need additional time after becoming "Ready", consider either:

  1. Adding a specific readiness check for the functionality you're waiting for
  2. Extracting this as a named constant with documentation explaining why it's needed

2452-2496: Parameter daemonsetName is underutilized.

The daemonsetName parameter is only used in the error message on line 2492. The function doesn't verify that the found pod actually belongs to the specified DaemonSet - it relies solely on podNamePattern matching.

This could be intentional if the pattern is specific enough, but consider either:

  1. Adding validation that the pod's owner reference matches the DaemonSet
  2. Removing daemonsetName if it's truly not needed
tests/helpers/syslog-health-monitor.go (1)

73-113: Consider adding error handling for cleanup operations.

The teardown function logs warnings for errors during cleanup (lines 89, 111) but uses require.NoError for RemoveDaemonSetArgs (line 82). This inconsistency could cause test failures during teardown if the arg removal fails, while other cleanup steps are more lenient.

For consistency in teardown, consider either:

  1. Making all cleanup steps warn-only to ensure full cleanup attempts
  2. Or keeping require.NoError only for critical restoration steps
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between e5a8295 and 0752eb5.

πŸ“’ Files selected for processing (4)
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • tests/helpers/kube.go
  • tests/helpers/syslog-health-monitor.go
  • tests/syslog_health_monitor_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/syslog_health_monitor_test.go
  • tests/helpers/syslog-health-monitor.go
  • tests/helpers/kube.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/syslog_health_monitor_test.go
🧠 Learnings (7)
πŸ““ Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/syslog_health_monitor_test.go
πŸ“š Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

  • tests/helpers/kube.go
🧬 Code graph analysis (1)
tests/syslog_health_monitor_test.go (3)
tests/helpers/syslog-health-monitor.go (3)
  • SetUpSyslogHealthMonitor (33-71)
  • StubJournalHTTPPort (27-27)
  • TearDownSyslogHealthMonitor (74-113)
tests/helpers/kube.go (2)
  • InjectSyslogMessages (1820-1843)
  • PortForwardPod (1614-1658)
tests/helpers/fault_quarantine.go (1)
  • AssertQuarantineState (315-382)
πŸ”‡ Additional comments (15)
tests/helpers/kube.go (5)

387-409: LGTM - Clean implementation of negative event assertion.

The function correctly uses require.Never to assert that a specific event type/reason never appears on a node within the timeout period. The error handling on line 393-396 appropriately returns false (meaning "condition not met, keep checking") rather than failing the test.


2232-2273: LGTM - Robust rollout wait implementation.

The function correctly checks all necessary conditions for a complete DaemonSet rollout:

  1. DesiredNumberScheduled > 0
  2. UpdatedNumberScheduled == DesiredNumberScheduled
  3. NumberReady == DesiredNumberScheduled

Good logging at each step for debugging.


2370-2400: Slice insertion logic is correct but complex.

The slice manipulation on lines 2391-2393 correctly inserts a value after the flag when the next argument is another flag (starts with -). The logic handles both --flag=value and --flag value styles appropriately.


2402-2424: LGTM - Clean implementation of arg setting.

The function correctly:

  1. Attempts to update existing args via tryUpdateExistingArg
  2. Falls back to appending new args if not found
  3. Handles both --flag=value and boolean flag styles

2426-2450: LGTM - Correct arg removal implementation.

The function properly handles both --flag=value and --flag value styles when removing arguments. Breaking after modification avoids issues with concurrent slice modification during iteration.

tests/syslog_health_monitor_test.go (7)

36-39: LGTM - Clear context key definitions.

The context keys follow Go naming conventions and provide clear identifiers for passing test state between setup, assess, and teardown phases.


47-58: LGTM - Centralized setup pattern.

The setup correctly uses the new helpers.SetUpSyslogHealthMonitor helper, storing node name, pod name, and stop channel in context for use in assess and teardown phases.


159-170: LGTM - Centralized teardown pattern.

The teardown correctly retrieves context values and delegates to helpers.TearDownSyslogHealthMonitor. Passing nil for args indicates no DaemonSet argument restoration is needed.


195-203: LGTM - Consistent port configuration.

Using helpers.StubJournalHTTPPort for both local and remote ports ensures consistency across tests and centralizes the port configuration.


295-359: LGTM - Consistent setup/teardown pattern.

The SXID detection test follows the same centralized setup/teardown pattern as the XID detection test, maintaining consistency across the test suite.


361-379: Well-structured STORE_ONLY strategy test.

The test correctly:

  1. Follows naming conventions per coding guidelines
  2. Sets up with --processing-strategy=STORE_ONLY argument
  3. Stores context values for teardown

Good use of the centralized setup helper with custom arguments.


381-403: LGTM - Correct STORE_ONLY behavior validation.

The assess phase correctly verifies that when running in STORE_ONLY mode:

  1. XID messages are injected
  2. No node condition is created (EnsureNodeConditionNotPresent)
  3. Node is not quarantined/cordoned (AssertQuarantineState)

This aligns with the expected behavior where STORE_ONLY events should be stored but not trigger remediation.

tests/helpers/syslog-health-monitor.go (3)

26-30: LGTM - Clear constant definitions.

The constants properly define the port and resource names used by the syslog health monitor test helpers, making them easy to reference consistently across tests.


32-71: LGTM - Well-structured setup helper.

The function provides a clean setup sequence:

  1. Optionally updates DaemonSet args for test configuration
  2. Locates the syslog health monitor pod on a worker node
  3. Injects test metadata
  4. Establishes port-forwarding
  5. Configures the node label

Good use of require.NoError for critical setup steps and returning all necessary values for teardown.


91-101: No changes needed - the condition status check logic is correct.

The code at lines 93-100 properly checks for condition.Status == v1.ConditionFalse with reason "SysLogsXIDErrorIsHealthy" to verify the error has been cleared. This is consistent with the codebase pattern:

  • Status=True with reason "IsNotHealthy" indicates an active error
  • Status=False with reason "IsHealthy" indicates the error has been resolved and the condition is no longer active

The check is semantically correct and follows established conventions throughout the test suite.

Comment on lines +405 to 418
feature.Teardown(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
client, err := c.NewClient()
require.NoError(t, err, "failed to create kubernetes client")

nodeName := ctx.Value(keySyslogNodeName).(string)
stopChan := ctx.Value(keyStopChan).(chan struct{})
syslogPod := ctx.Value(keySyslogPodName).(string)

helpers.TearDownSyslogHealthMonitor(ctx, t, client, nodeName, stopChan, map[string]string{
"--processing-strategy": "EXECUTE_REMEDIATION",
}, syslogPod)

return ctx
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟑 Minor

🧩 Analysis chain

🏁 Script executed:

# Find RemoveDaemonSetArgs function and examine its implementation
fd -e go tests/helpers | head -20

Repository: NVIDIA/NVSentinel

Length of output: 418


🏁 Script executed:

# Search for RemoveDaemonSetArgs function definition and usage
rg -n "RemoveDaemonSetArgs" --type go -A 15

Repository: NVIDIA/NVSentinel

Length of output: 2451


🏁 Script executed:

# Also check the kube.go helper file directly
fd -n "kube.go" | xargs wc -l

Repository: NVIDIA/NVSentinel

Length of output: 310


🏁 Script executed:

# Read the RemoveDaemonSetArgs function implementation
sed -n '2327,2380p' tests/helpers/kube.go

Repository: NVIDIA/NVSentinel

Length of output: 1768


🏁 Script executed:

# Search for removeArgsFromContainer function
rg -n "func removeArgsFromContainer" --type go -A 30

Repository: NVIDIA/NVSentinel

Length of output: 1791


🏁 Script executed:

# Find all calls to RemoveDaemonSetArgs
rg -n "RemoveDaemonSetArgs" --type go -B 2 -A 5

Repository: NVIDIA/NVSentinel

Length of output: 1070


Use empty string or document that RemoveDaemonSetArgs only uses map keys.

The removeArgsFromContainer function iterates only over flag names (keys) in the args mapβ€”the values are completely ignored. Passing "--processing-strategy": "EXECUTE_REMEDIATION" works correctly (removes the flag), but the non-empty value is misleading. Clarify intent by passing an empty string ("") or add a comment explaining that only keys matter.

πŸ€– Prompt for AI Agents
In @tests/syslog_health_monitor_test.go around lines 405 - 418, The args map
passed to TearDownSyslogHealthMonitor uses a non-empty value for
"--processing-strategy" even though removeArgsFromContainer (and
RemoveDaemonSetArgs) only inspects map keys; change the map entry to
"--processing-strategy": "" (empty string) or add an inline comment next to the
map literal documenting that only keys are used by
removeArgsFromContainer/RemoveDaemonSetArgs so values are ignored, so readers
aren't misled by a non-empty value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant