-
Notifications
You must be signed in to change notification settings - Fork 33
feat: add event handling strategy changes in syslog health monitor #609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: add event handling strategy changes in syslog health monitor #609
Conversation
Signed-off-by: Tanisha goyal <[email protected]>
π WalkthroughWalkthroughAdds a ProcessingStrategy enum and processingStrategy field to HealthEvent; threads the strategy through monitors, handlers, exporter, store pipelines, Kubernetes connector, Helm values, and tests to distinguish STORE_ONLY vs EXECUTE_REMEDIATION processing and skip remediation for STORE_ONLY events. Changes
Sequence DiagramsequenceDiagram
actor Operator
participant Helm as Helm/DaemonSet
participant Syslog as Syslog Monitor
participant Handler as Handler (XID/SXID/GPU)
participant Exporter as Event Exporter
participant Store as Store (change stream)
participant K8s as K8s Connector
participant Kube as Kubernetes API
Operator->>Helm: set processingStrategy value
Helm->>Syslog: start with CLI arg (strategy)
Syslog->>Handler: init handler with strategy
rect rgba(180,220,200,0.12)
Handler->>Handler: detect issue -> build pb.HealthEvent (includes processingStrategy)
Handler-->>Syslog: emit HealthEvent
end
rect rgba(200,200,255,0.10)
Syslog->>Exporter: send HealthEvent
Exporter->>Store: persist/stream HealthEvent (contains processingStrategy)
end
rect rgba(255,220,200,0.10)
Store->>Store: change stream pipeline filters for EXECUTE_REMEDIATION
alt EXECUTE_REMEDIATION
Store->>K8s: forward processable events
K8s->>Kube: create/update node conditions & events
else STORE_ONLY
Store->>Store: filtered out β no remediation actions
end
end
Estimated code review effortπ― 4 (Complex) | β±οΈ ~45 minutes
π₯ Pre-merge checks | β 2 | β 1β Failed checks (1 warning)
β Passed checks (2 passed)
βοΈ Tip: You can configure your own custom pre-merge checks in the settings. β¨ Finishing touches
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. π§ golangci-lint (2.5.0)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
π§Ή Nitpick comments (10)
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl (1)
103-104: Consider quoting the processing strategy value for consistency.The
processingStrategyvalue is passed unquoted, while similar string arguments like$kataLabelat line 94 use thequotefunction. For consistency and to follow YAML best practices, consider quoting this value.π Proposed fix
- "--processing-strategy" - - {{ $root.Values.processingStrategy }} + - {{ $root.Values.processingStrategy | quote }}tests/helpers/healthevent.go (2)
153-156: Add godoc comment for the exported method.Per coding guidelines, exported Go functions should have function comments. Other
With*methods in this file lack comments too, but new code should set a better example.Suggested documentation
+// WithProcessingStrategy sets the processing strategy for the health event. +// Use protos.ProcessingStrategy_EXECUTE_REMEDIATION (0) or protos.ProcessingStrategy_STORE_ONLY (1). func (h *HealthEventTemplate) WithProcessingStrategy(strategy int) *HealthEventTemplate { h.ProcessingStrategy = strategy return h }
48-48: Consider using the protobuf type for consistency.The protobuf-generated Go code uses
int32forProcessingStrategy. Whileintworks for JSON serialization in tests, usingint32would be more consistent with the generated types.Suggested change
- ProcessingStrategy int `json:"processingStrategy,omitempty"` + ProcessingStrategy int32 `json:"processingStrategy,omitempty"`This would require updating the setter signature as well:
-func (h *HealthEventTemplate) WithProcessingStrategy(strategy int) *HealthEventTemplate { +func (h *HealthEventTemplate) WithProcessingStrategy(strategy int32) *HealthEventTemplate {data-models/protobufs/health_event.proto (1)
77-77: Add field-level comment forprocessingStrategy.Per coding guidelines for protobuf files: "Include comprehensive comments for all fields in Protocol Buffer messages." The enum declaration has good comments, but the field itself should also have a brief comment explaining its purpose in the context of the
HealthEventmessage.Suggested addition
BehaviourOverrides drainOverrides = 15; + // Determines how downstream modules should handle this event (remediation vs observability-only). ProcessingStrategy processingStrategy = 16; }health-monitors/syslog-health-monitor/main.go (1)
164-171: Consider enhancing the error message with valid options.The validation correctly uses the generated
pb.ProcessingStrategy_valuemap. Consider listing valid values in the error message to improve user experience.Suggested improvement
value, ok := pb.ProcessingStrategy_value[*processingStrategyFlag] if !ok { - return fmt.Errorf("unexpected processingStrategy value: %q", *processingStrategyFlag) + return fmt.Errorf("unexpected processingStrategy value: %q (valid values: EXECUTE_REMEDIATION, STORE_ONLY)", *processingStrategyFlag) }health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (1)
239-242: Consider using the enum constant directly instead of accessing the handler field.Line 241 accesses
handler2.processingStrategydirectly for the assertion. While this works, using the expected enum constant (pb.ProcessingStrategy_EXECUTE_REMEDIATION) would be more explicit and resilient to future encapsulation changes.π Suggested change
- assert.Equal(t, handler2.processingStrategy, events.Events[0].ProcessingStrategy) + assert.Equal(t, pb.ProcessingStrategy_EXECUTE_REMEDIATION, events.Events[0].ProcessingStrategy)health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (1)
191-195: Brittle entity index assertions may break if entity order changes.The assertions use hardcoded indices (
EntitiesImpacted[3],EntitiesImpacted[4]) to access GPU entities. If the entity ordering inProcessLinechanges, these tests will fail with confusing errors.Consider iterating over entities by type, similar to the pattern used in
gpufallen_handler_test.go(lines 124-136).π Suggested approach
- // Verify GPU entity - assert.Equal(t, "GPU", event.EntitiesImpacted[3].EntityType) - assert.Equal(t, "1", event.EntitiesImpacted[3].EntityValue) - assert.Equal(t, "GPU_UUID", event.EntitiesImpacted[4].EntityType) - assert.Equal(t, "GPU-aaaabbbb-cccc-dddd-eeee-ffffffffffff", event.EntitiesImpacted[4].EntityValue) + // Verify GPU entities by type rather than assuming order + var hasGPU, hasGPUUUID bool + for _, entity := range event.EntitiesImpacted { + switch entity.EntityType { + case "GPU": + hasGPU = true + assert.Equal(t, "1", entity.EntityValue) + case "GPU_UUID": + hasGPUUUID = true + assert.Equal(t, "GPU-aaaabbbb-cccc-dddd-eeee-ffffffffffff", entity.EntityValue) + } + } + assert.True(t, hasGPU, "Should have GPU entity") + assert.True(t, hasGPUUUID, "Should have GPU_UUID entity")tests/helpers/syslog-health-monitor.go (1)
15-15: Missing package-level godoc comment.As per coding guidelines, package-level godoc is required for all Go packages.
Proposed fix
+// Package helpers provides test utility functions for NVSentinel integration tests. package helpershealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (1)
427-446: Consider adding a test case to validate ProcessingStrategy is stored in handler.The existing test cases verify handler construction but don't assert that
processingStrategyis correctly stored. Consider adding an assertion.Proposed enhancement
} else { assert.NoError(t, err) require.NotNil(t, handler) assert.Equal(t, tc.nodeName, handler.nodeName) assert.Equal(t, tc.agentName, handler.defaultAgentName) assert.Equal(t, tc.componentClass, handler.defaultComponentClass) assert.Equal(t, tc.checkName, handler.checkName) + assert.Equal(t, pb.ProcessingStrategy_EXECUTE_REMEDIATION, handler.processingStrategy) assert.NotNil(t, handler.pciToGPUUUID) assert.NotNil(t, handler.parser) assert.NotNil(t, handler.metadataReader) }tests/helpers/kube.go (1)
2346-2390: Consider clarifying the daemonsetName parameter usage.The
daemonsetNameparameter is only used in the error message but not for any actual filtering logic (which relies onpodNamePattern). This could be confusing to callers. Consider either using it for filtering or documenting this clearly.π Suggested documentation improvement
+// GetDaemonSetPodOnWorkerNode returns a running, ready pod matching the given name pattern on a real worker node. +// Note: daemonsetName is used for error messages; filtering is done via podNamePattern. func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client, daemonsetName string, podNamePattern string) (*v1.Pod, error) {
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (1)
data-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.go
π Files selected for processing (34)
data-models/protobufs/health_event.protodistros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpldistros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yamlevent-exporter/pkg/transformer/cloudevents.goevent-exporter/pkg/transformer/cloudevents_test.gofault-quarantine/pkg/evaluator/rule_evaluator_test.gofault-quarantine/pkg/initializer/init.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyihealth-monitors/syslog-health-monitor/main.gohealth-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.gohealth-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.gohealth-monitors/syslog-health-monitor/pkg/gpufallen/types.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/sxid/types.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/types.gohealth-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.gostore-client/pkg/client/mongodb_pipeline_builder.gostore-client/pkg/client/pipeline_builder.gostore-client/pkg/client/pipeline_builder_test.gostore-client/pkg/client/postgresql_pipeline_builder.gotests/event_exporter_test.gotests/helpers/event_exporter.gotests/helpers/healthevent.gotests/helpers/kube.gotests/helpers/syslog-health-monitor.gotests/syslog_health_monitor_test.go
π§° Additional context used
π Path-based instructions (5)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/helpers/event_exporter.gostore-client/pkg/client/pipeline_builder_test.gotests/helpers/healthevent.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/types.goevent-exporter/pkg/transformer/cloudevents.gostore-client/pkg/client/mongodb_pipeline_builder.gohealth-monitors/syslog-health-monitor/pkg/sxid/types.gofault-quarantine/pkg/initializer/init.gofault-quarantine/pkg/evaluator/rule_evaluator_test.goevent-exporter/pkg/transformer/cloudevents_test.gostore-client/pkg/client/pipeline_builder.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/types.gostore-client/pkg/client/postgresql_pipeline_builder.gohealth-monitors/syslog-health-monitor/main.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.gotests/helpers/syslog-health-monitor.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.gohealth-monitors/syslog-health-monitor/pkg/gpufallen/types.gotests/event_exporter_test.gotests/helpers/kube.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.gohealth-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.gotests/syslog_health_monitor_test.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
store-client/pkg/client/pipeline_builder_test.gofault-quarantine/pkg/evaluator/rule_evaluator_test.goevent-exporter/pkg/transformer/cloudevents_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/event_exporter_test.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.gohealth-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.gotests/syslog_health_monitor_test.go
data-models/protobufs/**/*.proto
π CodeRabbit inference engine (.github/copilot-instructions.md)
data-models/protobufs/**/*.proto: Define Protocol Buffer messages indata-models/protobufs/directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages
Files:
data-models/protobufs/health_event.proto
**/values.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
**/*.py
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code
Files:
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
π§ Learnings (8)
π Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
health-monitors/syslog-health-monitor/pkg/sxid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/main.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.gotests/syslog_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Each health monitor service should implement the gRPC service from `data-models/protobufs/`
Applied to files:
health-monitors/syslog-health-monitor/pkg/sxid/types.go
π Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.
Applied to files:
fault-quarantine/pkg/initializer/init.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gotests/event_exporter_test.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.gotests/syslog_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/event_exporter_test.gotests/syslog_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `testify/assert` and `testify/require` for assertions in Go tests
Applied to files:
tests/event_exporter_test.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
tests/event_exporter_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/syslog_health_monitor_test.go
𧬠Code graph analysis (17)
store-client/pkg/client/pipeline_builder_test.go (3)
store-client/pkg/client/pipeline_builder.go (1)
PipelineBuilder(26-47)store-client/pkg/client/mongodb_pipeline_builder.go (1)
NewMongoDBPipelineBuilder(29-31)store-client/pkg/client/postgresql_pipeline_builder.go (1)
NewPostgreSQLPipelineBuilder(29-31)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
event-exporter/pkg/transformer/cloudevents.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
ToPipeline(161-163)D(131-133)E(126-128)A(136-138)data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
event-exporter/pkg/transformer/cloudevents_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (5)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)ProcessingStrategy_STORE_ONLY(47-47)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (3)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
XIDHandler(29-39)
store-client/pkg/client/postgresql_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
ToPipeline(161-163)D(131-133)E(126-128)A(136-138)data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
health-monitors/syslog-health-monitor/main.go (1)
data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_value(56-59)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)
data-models/pkg/protos/health_event.pb.go (17)
HealthEvent(260-280)HealthEvent(293-293)HealthEvent(308-310)Entity(208-214)Entity(227-227)Entity(242-244)RecommendedAction(89-89)RecommendedAction(139-141)RecommendedAction(143-145)RecommendedAction(152-154)RecommendedAction_CONTACT_SUPPORT(94-94)ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)ProcessingStrategy_STORE_ONLY(47-47)ProcessingStrategy_EXECUTE_REMEDIATION(46-46)platform-connectors/pkg/connectors/kubernetes/k8s_connector.go (1)
NewK8sConnector(47-58)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (3)
data-models/pkg/protos/health_event.pb.go (6)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)ProcessingStrategy_STORE_ONLY(47-47)ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
NewXIDHandler(33-58)health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
XIDHandler(29-39)
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go (1)
health-monitors/syslog-health-monitor/pkg/gpufallen/types.go (1)
GPUFallenHandler(47-57)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (2)
data-models/pkg/protos/health_event.pb.go (11)
HealthEvents(156-162)HealthEvents(175-175)HealthEvents(190-192)HealthEvent(260-280)HealthEvent(293-293)HealthEvent(308-310)ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)ProcessingStrategy_STORE_ONLY(47-47)platform-connectors/pkg/connectors/kubernetes/k8s_connector.go (1)
K8sConnector(37-45)
tests/event_exporter_test.go (1)
tests/helpers/event_exporter.go (1)
ValidateCloudEvent(221-247)
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (1)
data-models/pkg/protos/health_event.pb.go (2)
ProcessingStrategy_STORE_ONLY(47-47)ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (1)
data-models/pkg/protos/health_event.pb.go (2)
ProcessingStrategy_STORE_ONLY(47-47)ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (2)
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (1)
NewSXIDHandler(30-42)data-models/pkg/protos/health_event.pb.go (3)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)ProcessingStrategy_STORE_ONLY(47-47)RecommendedAction_CONTACT_SUPPORT(94-94)
tests/syslog_health_monitor_test.go (3)
tests/helpers/syslog-health-monitor.go (3)
SetUpSyslogHealthMonitor(34-74)StubJournalHTTPPort(28-28)TearDownSyslogHealthMonitor(77-116)tests/helpers/kube.go (2)
InjectSyslogMessages(1796-1819)PortForwardPod(1590-1634)tests/helpers/fault_quarantine.go (2)
AssertQuarantineState(315-382)QuarantineAssertion(56-60)
π Additional comments (65)
fault-quarantine/pkg/initializer/init.go (1)
66-66: Verify the pipeline filtering aligns with fault-quarantine requirements.The switch from
BuildAllHealthEventInsertsPipeline()toBuildProcessableHealthEventInsertsPipeline()means the fault-quarantine module will now only observe health events marked with theEXECUTE_REMEDIATIONprocessing strategy. Events markedSTORE_ONLYwill be excluded from triggering quarantine actions.Run the following script to confirm the new pipeline builder exists and verify its filtering logic:
#!/bin/bash # Verify BuildProcessableHealthEventInsertsPipeline implementation and filtering behavior # Find the pipeline builder implementation echo "=== Locating BuildProcessableHealthEventInsertsPipeline implementation ===" ast-grep --pattern $'func ($$$) BuildProcessableHealthEventInsertsPipeline() $$$' # Find usages to understand the filtering behavior echo -e "\n=== Finding references to ProcessingStrategy enum ===" rg -n -C3 --type=go 'ProcessingStrategy|EXECUTE_REMEDIATION|STORE_ONLY' store-client/ # Find the builder interface/struct definition echo -e "\n=== Locating PipelineBuilder interface ===" rg -n -C3 --type=go 'type.*PipelineBuilder|BuildProcessableHealthEventInsertsPipeline'fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)
263-263: LGTM! Default enum value correctly validated.The test correctly expects
processingStrategyto be0(EXECUTE_REMEDIATION), which is the protobuf default for the new enum field.event-exporter/pkg/transformer/cloudevents.go (1)
66-66: LGTM! Consistent enum-to-string conversion.The implementation correctly uses
ProcessingStrategy.String()to include the strategy in CloudEvent data, matching the pattern used forrecommendedActionat line 61.store-client/pkg/client/pipeline_builder_test.go (1)
69-86: LGTM! Well-structured table-driven test.The test correctly validates the new
BuildProcessableHealthEventInsertsPipeline()method for both MongoDB and PostgreSQL builders, following the established patterns in this test file.health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go (1)
53-53: LGTM! Safe addition of unexported field.The
processingStrategyfield is correctly added as an unexported field to store the health event processing strategy for this monitor instance.health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
20-20: LGTM! Consistent import and field addition.The protobuf import and
processingStrategyfield are correctly added to enable strategy propagation through the XID handler to health events.Also applies to: 34-34
health-monitors/syslog-health-monitor/pkg/gpufallen/types.go (1)
23-23: LGTM! Consistent with other handler implementations.The protobuf import and
processingStrategyfield addition follow the same pattern used in the XID and SXID handlers.Also applies to: 52-52
distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml (1)
51-56: LGTM! Well-documented configuration option.The
processingStrategyconfiguration is clearly documented with valid values, default behavior, and behavioral differences. The default valueEXECUTE_REMEDIATIONmaintains backward compatibility with existing deployments.As per coding guidelines, the inline documentation follows Helm best practices for values.yaml files.
data-models/protobufs/health_event.proto (1)
32-38: Good enum design with clear documentation.The enum is well-designed with
EXECUTE_REMEDIATION=0as the default (zero value), ensuring backward compatibility for existing events that don't set this field. The comments clearly explain the behavioral difference between the two strategies.store-client/pkg/client/mongodb_pipeline_builder.go (1)
87-100: Verify behavior whenprocessingStrategyfield is missing or null.The filter matches documents where
fullDocument.healthevent.processingstrategyequals0(EXECUTE_REMEDIATION). In MongoDB, documents missing this field will havenullfor that path, which does not equal0. This could cause backward compatibility issues with existing health events that were created before this field was added.If existing events should be treated as
EXECUTE_REMEDIATION(the intended default), consider updating the filter to handle missing fields:Potential fix if missing fields need to match
func (b *MongoDBPipelineBuilder) BuildProcessableHealthEventInsertsPipeline() datastore.Pipeline { return datastore.ToPipeline( datastore.D( datastore.E("$match", datastore.D( datastore.E("operationType", datastore.D( datastore.E("$in", datastore.A("insert")), )), - datastore.E("fullDocument.healthevent.processingstrategy", int32(protos.ProcessingStrategy_EXECUTE_REMEDIATION)), + datastore.E("$or", datastore.A( + datastore.D(datastore.E("fullDocument.healthevent.processingstrategy", int32(protos.ProcessingStrategy_EXECUTE_REMEDIATION))), + datastore.D(datastore.E("fullDocument.healthevent.processingstrategy", datastore.D(datastore.E("$exists", false)))), + )), )), ), ) }Please verify whether existing health events in the database have this field set, or if all events going forward will include it.
store-client/pkg/client/pipeline_builder.go (1)
35-38: LGTM!The new interface method is well-documented with clear explanation of:
- What it does: watches for "processable" health event inserts with
processingStrategy=EXECUTE_REMEDIATION- Who uses it: fault-quarantine
- Why: to ignore observability-only events (
STORE_ONLY)This follows the established pattern of the other pipeline builder methods.
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)
1-6: This file is auto-generated by the protobuf compiler and is correctly checked into the repository. The "NO CHECKED-IN PROTOBUF GENCODE" comment is a standard protobuf compiler-generated header, not a project policy directive. For Python (non-C++), checking in generated protobuf files is a standard practice supported by the protobuf project. The repository properly manages this via theprotos-generateMakefile target for regeneration and theprotos-linttarget for CI validation to ensure generated files stay up to date with their source.protofiles.Likely an incorrect or invalid review comment.
tests/helpers/event_exporter.go (1)
221-246: LGTM!The new
expectedProcessingStrategyparameter integrates cleanly with the existing validation pattern. The assertion at line 245 correctly validates the new field in the CloudEvent payload.health-monitors/syslog-health-monitor/pkg/sxid/types.go (1)
20-36: LGTM!The new
processingStrategyfield is correctly typed and logically positioned among the configuration fields inSXIDHandler.event-exporter/pkg/transformer/cloudevents_test.go (1)
69-108: LGTM!The test correctly validates the new
processingStrategyfield in CloudEvent output. The explicitSTORE_ONLYvalue in the first comprehensive test case provides good coverage, while other focused test cases appropriately default to the zero value (EXECUTE_REMEDIATION).tests/event_exporter_test.go (1)
85-85: LGTM!The test correctly expects
"EXECUTE_REMEDIATION"as the processing strategy, which aligns with the default behavior for health events.health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go (2)
29-51: LGTM!The
processingStrategyparameter is correctly threaded through the constructor to the handler struct, following the established pattern for other configuration fields.
225-239: LGTM!The
ProcessingStrategyfield is correctly populated from the handler's configuration when creating health events.health-monitors/syslog-health-monitor/main.go (2)
70-71: LGTM!The new
--processing-strategyflag with sensible default and clear help text follows the existing flag pattern in this file.
173-184: LGTM!The
processingStrategyis correctly converted and passed toNewSyslogMonitor. The overall wiring integrates cleanly with the existing initialization flow.health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (2)
30-42: LGTM! Constructor correctly extended with ProcessingStrategy parameter.The constructor signature update properly accepts the new
processingStrategyparameter and assigns it to the handler struct. This is consistent with the pattern used across other handlers (XID, GPUFallen) in this PR.
109-109: ProcessingStrategy correctly propagated to emitted HealthEvent.The handler's
processingStrategyis properly assigned to theHealthEvent.ProcessingStrategyfield, ensuring events carry the configured strategy for downstream processing.health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go (3)
167-175: LGTM! Test updated to validate ProcessingStrategy propagation.The test correctly passes
pb.ProcessingStrategy_STORE_ONLYto the handler constructor, following the table-driven test pattern and usingtestify/requirefor assertions as per coding guidelines.
185-185: Good assertion coverage for ProcessingStrategy field.This assertion ensures the emitted event carries the correct processing strategy, validating the end-to-end propagation from handler construction to event emission.
193-319: Comprehensive XID tracking tests properly updated with ProcessingStrategy.All test cases in
TestXIDTrackingconsistently usepb.ProcessingStrategy_EXECUTE_REMEDIATION, maintaining test consistency while validating the new parameter integration.platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (3)
1391-1506: Well-structured table-driven test for ProcessingStrategy behavior.The test covers key scenarios:
STORE_ONLYevents should not create node conditions or K8s eventsEXECUTE_REMEDIATIONevents should create node conditions- Mixed strategies properly filter only
EXECUTE_REMEDIATIONeventsThis follows the coding guidelines for table-driven tests with descriptive names.
1550-1562: Reasonable approach to count NVSentinel-specific conditions.The logic excludes standard Kubernetes node conditions (NodeReady, NodeMemoryPressure, etc.) to isolate NVSentinel-added conditions. The logging at line 1560 aids debugging.
1508-1589: Good test isolation with per-case setup.Each test case creates its own
localClientSet,ringBuffer, andconnector, preventing state leakage between test runs. Thedefer close(stopCh)ensures proper cleanup.health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (3)
18-25: Imports correctly updated for ProcessingStrategy testing.New imports for
os,filepath, and thepbalias are properly added to support the temp file creation and protobuf type usage in tests.
143-196: Good integration test with realistic metadata setup.This test creates a temporary
gpu_metadata.json, exercises the fullProcessLineflow with valid NVSwitch topology, and validates the emitted event including the newProcessingStrategyfield.
225-249: Test cases properly updated with ProcessingStrategy parameter.The
TestProcessLinetable-driven tests correctly passpb.ProcessingStrategy_EXECUTE_REMEDIATIONand validate it in assertions, following the established pattern. Usestestify/assertandtestify/requireper coding guidelines.health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (7)
309-317: Test correctly validates ProcessingStrategy initialization.The test verifies that
NewSyslogMonitorproperly stores theprocessingStrategyparameter. The assertion at line 317 confirms the monitor's field is set to the expected value.
329-334: Good coverage testing both constructor variants.Testing
NewSyslogMonitorWithFactorywithEXECUTE_REMEDIATIONensures both constructor paths properly handle the new parameter.
391-403: TestJournalProcessingLogic updated with ProcessingStrategy.The test correctly passes
pb.ProcessingStrategy_EXECUTE_REMEDIATIONto maintain consistency with the expected handler behavior during journal processing.
494-544: TestJournalStateManagement properly tests state persistence with ProcessingStrategy.Both monitor instances (sm and sm2) are created with consistent
EXECUTE_REMEDIATIONstrategy, ensuring the state persistence test accurately reflects real-world usage.
574-593: TestBootIDChangeHandling correctly includes ProcessingStrategy.The boot ID detection test is updated to include the new parameter, maintaining test consistency across the test suite.
624-644: TestRunMultipleChecks validates handler initialization with ProcessingStrategy.The test confirms that both XID and SXID handlers are properly initialized when multiple checks are configured, with the processing strategy parameter correctly passed through.
665-679: TestGPUFallenOffHandlerInitialization updated for completeness.The GPU Fallen Off handler initialization test includes the
ProcessingStrategyparameter, ensuring all handler types are tested with the new field.platform-connectors/pkg/connectors/kubernetes/process_node_events.go (4)
325-343: LGTM: Clean filtering implementation for STORE_ONLY events.The function correctly filters out
STORE_ONLYhealth events with appropriate info-level logging for observability. The implementation aligns with the broader PR changes introducingProcessingStrategy.
345-370: LGTM: Good extraction of K8s event creation logic.The function consolidates event creation into a reusable helper, improving code organization. All required fields are properly populated from the health event.
388-400: LGTM: Correct integration of filtered events with node condition updates.The timing instrumentation and error handling are properly applied to the filtered
processableEventsset.
402-413: LGTM: Node event creation correctly uses filtered events.The loop properly iterates over
processableEventsand uses the newcreateK8sEventhelper for event creation.tests/syslog_health_monitor_test.go (6)
25-31: LGTM: Import organization for centralized helpers.The imports correctly bring in the test helpers package and required Kubernetes types for DaemonSet management.
36-41: LGTM: Context keys for test state management.The context keys are well-named and follow a consistent pattern for storing test state across setup, assess, and teardown phases.
163-175: LGTM: Teardown correctly uses centralized helper.The teardown properly retrieves context values and delegates to
TearDownSyslogHealthMonitorfor cleanup.
200-208: LGTM: Consistent use of helper constant for port.Using
helpers.StubJournalHTTPPortinstead of a local constant improves maintainability.
368-385: LGTM: New test for STORE_ONLY strategy.The test setup correctly passes
trueforupdatedDaemonSetto configure the STORE_ONLY processing strategy.
411-422: LGTM: Teardown correctly restores DaemonSet.The teardown passes
trueforupdatedDaemonSetto restore the original DaemonSet configuration after the test.health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (2)
33-58: LGTM: Clean addition ofprocessingStrategyparameter.The new parameter is properly integrated into the constructor and stored in the handler struct. This follows the same pattern used in other handlers (SXID, GPUFallen) as per the PR objectives.
189-189: LGTM: ProcessingStrategy correctly propagated to HealthEvent.The strategy is properly assigned to the generated health event, enabling downstream filtering in the platform connector.
tests/helpers/syslog-health-monitor.go (3)
27-31: LGTM: Well-defined constants for test configuration.The constants provide clear, maintainable values for port and resource names used across syslog health monitor tests.
33-74: LGTM: Comprehensive test setup helper.The function properly:
- Conditionally updates DaemonSet processing strategy
- Locates the syslog pod on a worker node
- Injects test metadata
- Establishes port-forward
- Sets node management label
Good use of
requirefor critical setup steps that should fail the test if unsuccessful.
76-116: Consider handlingoriginalDaemonSetnil case whenupdatedDaemonSetis true.If
SetUpSyslogHealthMonitorreturns a niloriginalDaemonSetwhenupdatedDaemonSetis true (due to an error during setup that wasn't caught),RestoreDaemonSetcould receive an unexpected nil. However, sinceSetUpSyslogHealthMonitorusesrequire.NoError, this scenario is unlikely in practice.The teardown logic is otherwise well-structured with proper error handling and cleanup sequencing.
health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (5)
88-91: LGTM: Updated constructor call with processing strategy.The test correctly passes
pb.ProcessingStrategy_EXECUTE_REMEDIATIONto the updatedNewXIDHandlersignature.
142-142: Good coverage of both processing strategies in test cases.The test cases appropriately use different strategies (
EXECUTE_REMEDIATIONon line 142,STORE_ONLYon line 157) to ensure both paths are validated.Also applies to: 157-157
194-194: LGTM: ProcessingStrategy assertions added to validation.The new assertions correctly verify that the
ProcessingStrategyfield on generatedHealthEventobjects matches the strategy configured on the handler.Also applies to: 231-231
306-308: LGTM: Test handler initialized with processing strategy.The test correctly initializes the handler with
EXECUTE_REMEDIATIONstrategy for theTestCreateHealthEventFromResponsetest cases.
348-348: LGTM: Consistent ProcessingStrategy validation in event creation tests.Both test cases properly validate that the handler's processing strategy is propagated to the created health events.
Also applies to: 378-378
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
14-17: LGTM - Protobuf stubs correctly reflect ProcessingStrategy additions.The generated
.pyistub file properly exposes the newProcessingStrategyenum withEXECUTE_REMEDIATIONandSTORE_ONLYvalues, and extendsHealthEventwith theprocessingStrategyfield. The type hints follow protobuf Python conventions.Also applies to: 31-32, 78-78, 104-104, 120-120, 138-138
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (3)
41-57: LGTM - Constructor signatures correctly extended with processingStrategy.The
processingStrategyparameter is properly threaded through bothNewSyslogMonitorandNewSyslogMonitorWithFactory, maintaining the constructor chain. Adding as the last parameter is good practice for API evolution.
90-136: LGTM - ProcessingStrategy correctly propagated to all handlers.The
processingStrategyis properly stored in theSyslogMonitorstruct and consistently passed to all three handler types (XID, SXID, GPUFallen). This ensures uniform processing behavior across different error check types.
802-814: LGTM - HealthEvent correctly includes ProcessingStrategy field.The
prepareHealthEventWithActionmethod properly setsProcessingStrategy: sm.processingStrategywhen constructing health events, ensuring the configured strategy is propagated through emitted events.tests/helpers/kube.go (3)
2210-2249: LGTM - DaemonSet rollout wait logic is correct.The function properly checks
DesiredNumberScheduled,UpdatedNumberScheduled, andNumberReadystatus fields to determine rollout completion. The logging provides good visibility into rollout progress.
2252-2268: LGTM - Container argument update logic handles both flag formats.The function correctly handles both
--processing-strategy=VALUEand--processing-strategy VALUEargument formats, and properly appends the flag if it doesn't exist. The early return after modification prevents duplicate additions.
2312-2344: LGTM - RestoreDaemonSet properly restores original container configuration.The function correctly handles nil input, uses
retry.RetryOnConflictfor safe updates, and waits for rollout completion. Replacing the entire containers slice ensures complete restoration of the original state.
Merging this branch will decrease overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
β»οΈ Duplicate comments (2)
tests/syslog_health_monitor_test.go (1)
51-52: Remove redundant error check.Line 52 checks
require.NoError(t, err, ...)butSetUpSyslogHealthMonitorreturns(string, *v1.Pod, chan struct{})with no error. Theerrvariable is undefined.π Proposed fix
testNodeName, syslogPod, stopChan := helpers.SetUpSyslogHealthMonitor(ctx, t, client, nil) - require.NoError(t, err, "failed to set up syslog health monitor")tests/helpers/kube.go (1)
2272-2277: Missing error when container is not found.The loop that searches for the container by name silently does nothing if no match is found. This could mask test configuration errors where the wrong container name is provided.
π Proposed fix to validate container exists
+ found := false containers := daemonSet.Spec.Template.Spec.Containers for i := range containers { if containers[i].Name == containerName { setArgsOnContainer(t, &containers[i], args) + found = true break } } + + if !found { + return fmt.Errorf("container %q not found in daemonset %s/%s", containerName, NVSentinelNamespace, daemonsetName) + } return client.Resources().Update(ctx, daemonSet)Based on learnings, this issue was previously flagged but remains unresolved.
π§Ή Nitpick comments (4)
tests/helpers/kube.go (4)
2251-2255: Add godoc comment for exported function.As per coding guidelines, exported functions require documentation comments. Add a godoc comment explaining the purpose, parameters, and behavior of this function.
π Proposed godoc comment
+// UpdateDaemonSetArgs updates the specified container arguments in a DaemonSet and waits for rollout completion. +// If containerName is empty, updates all containers. Otherwise, updates only the named container. +// Uses retry.RetryOnConflict for automatic retry handling. func UpdateDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client, daemonsetName string, containerName string, args map[string]string) error {As per coding guidelines, function comments are required for all exported Go functions.
2288-2289: Consider removing hard-coded sleep.The 10-second sleep after rollout completion may not be necessary since
waitForDaemonSetRolloutalready ensures pods are ready. If pods need additional time to stabilize, consider usingrequire.Eventuallyto poll for the specific condition instead of a fixed delay.
2294-2297: Add godoc comment for exported function.Add documentation for this exported function explaining its purpose, parameters, and behavior.
π Proposed godoc comment
+// RemoveDaemonSetArgs removes the specified arguments from a DaemonSet container and waits for rollout completion. +// If containerName is empty, removes from all containers. Otherwise, removes only from the named container. +// Uses retry.RetryOnConflict for automatic retry handling. func RemoveDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client, daemonsetName string, containerName string, args map[string]string,As per coding guidelines, function comments are required for all exported Go functions.
2411-2413: Add godoc comment for exported function.Add documentation for this exported function explaining its purpose, parameters, return values, and the validation it performs.
π Proposed godoc comment
+// GetDaemonSetPodOnWorkerNode returns a ready, running pod from the specified DaemonSet on a real worker node. +// It waits for a pod matching podNamePattern that is not being deleted and has all containers ready. +// Returns an error if no suitable pod is found within the timeout period. func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client, daemonsetName string, podNamePattern string) (*v1.Pod, error) {As per coding guidelines, function comments are required for all exported Go functions.
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (3)
tests/helpers/kube.gotests/helpers/syslog-health-monitor.gotests/syslog_health_monitor_test.go
π§ Files skipped from review as they are similar to previous changes (1)
- tests/helpers/syslog-health-monitor.go
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/helpers/kube.gotests/syslog_health_monitor_test.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/syslog_health_monitor_test.go
π§ Learnings (7)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation
Applied to files:
tests/helpers/kube.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label
Applied to files:
tests/helpers/kube.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
tests/syslog_health_monitor_test.go
𧬠Code graph analysis (1)
tests/syslog_health_monitor_test.go (2)
tests/helpers/syslog-health-monitor.go (3)
SetUpSyslogHealthMonitor(33-71)StubJournalHTTPPort(27-27)TearDownSyslogHealthMonitor(74-113)tests/helpers/kube.go (1)
PortForwardPod(1590-1634)
π Additional comments (8)
tests/syslog_health_monitor_test.go (8)
89-89: LGTM: Consistent use of StubJournalHTTPPort.The test correctly uses the centralized
helpers.StubJournalHTTPPortconstant for syslog message injection.Also applies to: 123-123
160-171: LGTM: Teardown properly centralized.The teardown correctly extracts context values and delegates to
helpers.TearDownSyslogHealthMonitor.
185-216: LGTM: Manual setup appropriate for metadata-less test.This test intentionally skips metadata injection to verify XID detection without metadata, so the manual setup (without using
SetUpSyslogHealthMonitor) is appropriate.
296-306: LGTM: Proper use of setup helper.The setup correctly delegates to
helpers.SetUpSyslogHealthMonitorand stores context values.
346-360: LGTM: Proper use of teardown helper.The teardown correctly extracts context values and delegates to
helpers.TearDownSyslogHealthMonitor.
362-380: LGTM: Well-structured STORE_ONLY strategy test.The new test properly configures the syslog health monitor with
STORE_ONLYstrategy via setup helper arguments and stores necessary context values for teardown.
382-404: LGTM: Accurate test description and assertions.The assess description correctly describes the test intent: verifying that XID errors in STORE_ONLY mode do not create node conditions or cordon nodes. The assertions properly validate this behavior.
406-419: LGTM: Proper strategy restoration in teardown.The teardown correctly restores
EXECUTE_REMEDIATIONstrategy to avoid impacting subsequent tests.
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
776a34b to
0a1dc9a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
π€ Fix all issues with AI agents
In
@platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go:
- Around line 1606-1612: The test currently indexes nvsentinelConditions[0] when
tc.expectNodeConditions is true which can panic if the slice is empty; update
the assertion to first assert the slice has at least one element (e.g.,
assert.Greater(t, len(nvsentinelConditions), 0) or assert.Len/NotEmpty) and only
then assert the Type value against tc.expectedConditionType, referring to the
existing test variables tc.expectNodeConditions and nvsentinelConditions to
locate where to add the length check before accessing index 0.
- Around line 1620-1626: The test accesses events.Items[0] directly which can
panic if the slice is empty; when tc.expectKubernetesEvents is true first assert
that events.Items is non-empty (e.g. use assert.Greater(t, len(events.Items), 0)
or assert.NotEmpty) and only then compare events.Items[0].Type to
tc.expectedEventType, ensuring the check references tc.expectKubernetesEvents,
events.Items and tc.expectedEventType so the test fails gracefully instead of
panicking.
In @tests/data/healthy-event.json:
- Around line 19-20: The processingStrategy for the healthy event is wrong:
update the JSON key "processingStrategy" in the healthy event object (the one
containing "nodeName": "NODE_NAME" and "isHealthy": true / "message": "No health
failures") from 1 to 2 so it uses STORE_ONLY rather than EXECUTE_REMEDIATION;
ensure the numeric value is changed to 2 to reflect observability-only behavior.
In @tests/helpers/kube.go:
- Around line 2334-2338: The loop over containers that calls
removeArgsFromContainer(&containers[i], args) must validate the target container
was found; add a boolean found flag (e.g., found := false) set to true when
containers[i].Name == containerName, break, and after the loop check found and
return an error (or propagate an appropriate error) if false so the caller knows
the container wasn't presentβmirror the behavior in UpdateDaemonSetArgs and
reference the containerName, removeArgsFromContainer, and UpdateDaemonSetArgs
symbols when making the change.
- Around line 2296-2301: The loop over containers that compares
containers[i].Name to containerName and calls setArgsOnContainer(t,
&containers[i], args) misses handling when no match is found; update the code
that iterates over containers (the block referencing containers, containerName,
and setArgsOnContainer) to track whether a container was found (e.g., found :=
false), set found = true when you call setArgsOnContainer, and after the loop
either call t.Fatalf (or return an error from the helper) with a clear message
including containerName so the test fails loudly if no container matched.
In @tests/syslog_health_monitor_test.go:
- Around line 51-52: Remove the redundant error assertion: the call to
helpers.SetUpSyslogHealthMonitor(ctx, t, client, nil) returns (testNodeName,
syslogPod, stopChan) and not an error, so delete the require.NoError(t, err,
"failed to set up syslog health monitor") line (which references an
uninitialized err); ensure any later references to err in this test are handled
or removed as well.
π§Ή Nitpick comments (5)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
1502-1524: Consider consolidating or differentiating duplicate test scenarios.This test case ("STORE_ONLY non fatal event should not create Kubernetes event") appears to duplicate the scenario at lines 1423-1442. Both test
STORE_ONLYwithIsFatal: falseexpecting no node conditions and no Kubernetes events. If you intend to test differentCheckNamevalues, consider clarifying this in the test name or description; otherwise, consolidate to reduce redundancy.data-models/protobufs/health_event.proto (1)
78-78: Add field-level documentation for processingStrategy.While the enum itself is well-documented, the field lacks a comment explaining its purpose and usage context within the HealthEvent message.
π Suggested documentation
BehaviourOverrides quarantineOverrides = 14; BehaviourOverrides drainOverrides = 15; + // Directs how downstream modules (platform connectors, event stores) should handle this event. + // Defaults to UNSPECIFIED if not explicitly set. ProcessingStrategy processingStrategy = 16;Based on coding guidelines: "Include comprehensive comments for all fields in Protocol Buffer messages."
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
33-58: ProcessingStrategy correctly threaded through constructor.The
processingStrategyparameter is properly captured and stored in theXIDHandlerstruct. The initialization follows the same pattern as other configuration fields.Consider whether
processingStrategyshould be validated in the constructor to rejectProcessingStrategy_UNSPECIFIED(the proto default value of 0). While validation currently occurs inmain.gobefore constructors are called, adding validation here would provide defensive programming and catch configuration errors at the point of construction, consistent with best practices for enum parameters that must have meaningful values.tests/helpers/healthevent.go (1)
31-31: Consider using the enum type for type safety.The
ProcessingStrategyfield is declared asintrather thanprotos.ProcessingStrategy. While this may be intentional for JSON marshaling flexibility, using the enum type would provide stronger type safety and prevent invalid values at compile time.β»οΈ Suggested refactor for type safety
If JSON marshaling works correctly with the enum type, consider:
type HealthEventTemplate struct { Version int `json:"version"` Agent string `json:"agent"` ComponentClass string `json:"componentClass,omitempty"` CheckName string `json:"checkName"` IsFatal bool `json:"isFatal"` IsHealthy bool `json:"isHealthy"` Message string `json:"message"` RecommendedAction int `json:"recommendedAction,omitempty"` ErrorCode []string `json:"errorCode,omitempty"` EntitiesImpacted []EntityImpacted `json:"entitiesImpacted,omitempty"` Metadata map[string]string `json:"metadata,omitempty"` QuarantineOverrides *QuarantineOverrides `json:"quarantineOverrides,omitempty"` NodeName string `json:"nodeName"` - ProcessingStrategy int `json:"processingStrategy,omitempty"` + ProcessingStrategy protos.ProcessingStrategy `json:"processingStrategy,omitempty"` }And update the constructor:
func NewHealthEvent(nodeName string) *HealthEventTemplate { return &HealthEventTemplate{ Version: 1, Agent: "gpu-health-monitor", ComponentClass: "GPU", CheckName: "GpuXidError", IsFatal: true, IsHealthy: false, NodeName: nodeName, EntitiesImpacted: []EntityImpacted{ { EntityType: "GPU", EntityValue: "0", }, }, - ProcessingStrategy: int(protos.ProcessingStrategy_EXECUTE_REMEDIATION), + ProcessingStrategy: protos.ProcessingStrategy_EXECUTE_REMEDIATION, } }And update the fluent setter signature:
-func (h *HealthEventTemplate) WithProcessingStrategy(strategy int) *HealthEventTemplate { +func (h *HealthEventTemplate) WithProcessingStrategy(strategy protos.ProcessingStrategy) *HealthEventTemplate { h.ProcessingStrategy = strategy return h }Also applies to: 49-49, 76-76
tests/fault_quarantine_test.go (1)
334-346: Teardown only resetsSysLogsXIDError- consider resettingGpuPowerWatchas well.The teardown sends a healthy event for
SysLogsXIDErrorbut doesn't explicitly clear theGpuPowerWatchstate. WhileTeardownQuarantineTestmay handle general cleanup, explicitly resetting both check types would make the test more self-contained and prevent potential state leakage to subsequent tests.β»οΈ Suggested improvement
feature.Teardown(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context { event := helpers.NewHealthEvent(testCtx.NodeName). WithErrorCode("79"). WithHealthy(true). WithAgent(helpers.SYSLOG_HEALTH_MONITOR_AGENT). WithCheckName("SysLogsXIDError") helpers.SendHealthEvent(ctx, t, event) + // Also reset GpuPowerWatch state + powerEvent := helpers.NewHealthEvent(testCtx.NodeName). + WithErrorCode("DCGM_FR_CLOCK_THROTTLE_POWER"). + WithHealthy(true). + WithCheckName("GpuPowerWatch") + helpers.SendHealthEvent(ctx, t, powerEvent) + return helpers.TeardownQuarantineTest(ctx, t, c) })
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (1)
data-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.go
π Files selected for processing (35)
data-models/protobufs/health_event.protodistros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpldistros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yamlfault-quarantine/pkg/evaluator/rule_evaluator_test.gofault-quarantine/pkg/initializer/init.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyihealth-monitors/syslog-health-monitor/main.gohealth-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.gohealth-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.gohealth-monitors/syslog-health-monitor/pkg/gpufallen/types.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/sxid/types.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/types.gohealth-monitors/syslog-health-monitor/pkg/xid/types.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gostore-client/pkg/datastore/providers/postgresql/sql_filter_builder.gotests/data/fatal-health-event.jsontests/data/healthy-event.jsontests/data/managed-by-nvsentinel-configmap.yamltests/data/unsupported-health-event.jsontests/event_exporter_test.gotests/fault_quarantine_test.gotests/helpers/event_exporter.gotests/helpers/healthevent.gotests/helpers/kube.gotests/helpers/syslog-health-monitor.gotests/syslog_health_monitor_test.go
π§ Files skipped from review as they are similar to previous changes (9)
- health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler_test.go
- health-monitors/syslog-health-monitor/pkg/gpufallen/gpufallen_handler.go
- distros/kubernetes/nvsentinel/charts/syslog-health-monitor/templates/_helpers.tpl
- fault-quarantine/pkg/evaluator/rule_evaluator_test.go
- tests/helpers/syslog-health-monitor.go
- health-monitors/syslog-health-monitor/pkg/syslog-monitor/types.go
- fault-quarantine/pkg/initializer/init.go
- distros/kubernetes/nvsentinel/charts/syslog-health-monitor/values.yaml
- tests/event_exporter_test.go
π§° Additional context used
π Path-based instructions (4)
data-models/protobufs/**/*.proto
π CodeRabbit inference engine (.github/copilot-instructions.md)
data-models/protobufs/**/*.proto: Define Protocol Buffer messages indata-models/protobufs/directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages
Files:
data-models/protobufs/health_event.proto
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
health-monitors/syslog-health-monitor/main.gotests/helpers/healthevent.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.gotests/fault_quarantine_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger.gohealth-monitors/syslog-health-monitor/pkg/sxid/types.gotests/helpers/event_exporter.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gotests/helpers/kube.gostore-client/pkg/datastore/providers/postgresql/sql_filter_builder.gohealth-monitors/syslog-health-monitor/pkg/gpufallen/types.gotests/syslog_health_monitor_test.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/types.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.gotests/fault_quarantine_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.gotests/syslog_health_monitor_test.go
**/*.py
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code
Files:
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
π§ Learnings (14)
π Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
health-monitors/syslog-health-monitor/main.gohealth-monitors/syslog-health-monitor/pkg/sxid/types.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler.gotests/syslog_health_monitor_test.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.gohealth-monitors/syslog-health-monitor/pkg/xid/types.go
π Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
tests/helpers/healthevent.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/helpers/healthevent.gotests/fault_quarantine_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.gotests/syslog_health_monitor_test.go
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.gotests/fault_quarantine_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gohealth-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.gohealth-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.gotests/syslog_health_monitor_test.go
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
tests/fault_quarantine_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/fault_quarantine_test.gotests/syslog_health_monitor_test.go
π Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gohealth-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger.gotests/helpers/event_exporter.gohealth-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go
π Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
tests/data/managed-by-nvsentinel-configmap.yaml
π Learning: 2025-12-12T07:41:27.339Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:27.339Z
Learning: In tests/data/health-events-analyzer-config.yaml, the XID74Reg2Bit13Set rule intentionally omits the time window filter; tests should verify only the register bit pattern (bit 13 in REG2) on the incoming XID 74 event and should not rely on historical events or counts of repeats. If adding similar rules elsewhere, apply the same pattern and document that the time window filter is unnecessary for single-event bit checks.
Applied to files:
tests/data/managed-by-nvsentinel-configmap.yaml
π Learning: 2025-12-12T07:36:26.109Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:1730-1901
Timestamp: 2025-12-12T07:36:26.109Z
Learning: In NVSentinel health-events analyzer config files (e.g., tests/data/health-events-analyzer-config.yaml and similar), XID 74 errors are guaranteed to include a GPU_UUID entry within healthevent.entitiesimpacted. Therefore, when writing XID 74-specific rules that filter for GPU_UUID entities, you can skip null checks (no $ifNull) for GPU_UUID. Ensure tests relying on this assumption are updated accordingly and document the guarantee in the rule descriptions.
Applied to files:
tests/data/managed-by-nvsentinel-configmap.yaml
π Learning: 2025-12-12T07:38:37.023Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2025-2187
Timestamp: 2025-12-12T07:38:37.023Z
Learning: In NVSentinel, XID 74 errors always include an NVLINK entry in healthevent.entitiesimpacted, so null-checking with $ifNull is unnecessary when filtering for NVLINK entities in XID 74-specific rules. Apply this rule to YAML test fixtures under tests/ data (e.g., tests/data/health-events-analyzer-config.yaml) and any similar health-event configuration tests. If applying in code, ensure downstream filters rely on the presence of NVLINK in entitiesimpacted for XID 74 only, but continue to guard other fields and XIDs with appropriate null checks.
Applied to files:
tests/data/managed-by-nvsentinel-configmap.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Each health monitor service should implement the gRPC service from `data-models/protobufs/`
Applied to files:
health-monitors/syslog-health-monitor/pkg/sxid/types.go
π Learning: 2025-12-23T05:02:22.108Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: store-client/pkg/client/postgresql_pipeline_builder.go:119-132
Timestamp: 2025-12-23T05:02:22.108Z
Learning: In the NVSentinel codebase, protobuf fields stored in MongoDB should use lowercase field names (e.g., processingstrategy, componentclass, checkname). Ensure pipeline filters and queries that access protobuf fields in the database consistently use lowercase field names in the store-client package, avoiding camelCase mappings for MongoDB reads/writes.
Applied to files:
store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/syslog_health_monitor_test.go
𧬠Code graph analysis (10)
health-monitors/syslog-health-monitor/main.go (1)
data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_value(58-62)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-18)data-models/pkg/protos/health_event.pb.go (5)
ProcessingStrategy(43-43)ProcessingStrategy(75-77)ProcessingStrategy(79-81)ProcessingStrategy(88-90)ProcessingStrategy_EXECUTE_REMEDIATION(47-47)
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (1)
data-models/pkg/protos/health_event.pb.go (2)
ProcessingStrategy_STORE_ONLY(48-48)ProcessingStrategy_EXECUTE_REMEDIATION(47-47)
tests/fault_quarantine_test.go (5)
tests/helpers/fault_quarantine.go (3)
QuarantineTestContext(51-54)SetupQuarantineTest(107-112)TeardownQuarantineTest(181-242)tests/helpers/kube.go (3)
SetNodeManagedByNVSentinel(1389-1408)WaitForNodeConditionWithCheckName(1662-1706)WaitForNodeEvent(349-385)tests/helpers/healthevent.go (1)
NewHealthEvent(61-78)tests/helpers/health_events_analyzer.go (1)
SYSLOG_HEALTH_MONITOR_AGENT(38-38)data-models/pkg/protos/health_event.pb.go (2)
ProcessingStrategy_STORE_ONLY(48-48)ProcessingStrategy_EXECUTE_REMEDIATION(47-47)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-18)data-models/pkg/protos/health_event.pb.go (5)
ProcessingStrategy(43-43)ProcessingStrategy(75-77)ProcessingStrategy(79-81)ProcessingStrategy(88-90)ProcessingStrategy_EXECUTE_REMEDIATION(47-47)
health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-18)data-models/pkg/protos/health_event.pb.go (5)
ProcessingStrategy(43-43)ProcessingStrategy(75-77)ProcessingStrategy(79-81)ProcessingStrategy(88-90)ProcessingStrategy_EXECUTE_REMEDIATION(47-47)
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (3)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-18)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(75-77)ProcessingStrategy(79-81)ProcessingStrategy(88-90)health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
XIDHandler(29-39)
tests/syslog_health_monitor_test.go (2)
tests/helpers/syslog-health-monitor.go (3)
SetUpSyslogHealthMonitor(33-71)StubJournalHTTPPort(27-27)TearDownSyslogHealthMonitor(74-113)tests/helpers/kube.go (3)
InjectSyslogMessages(1820-1843)PortForwardPod(1614-1658)EnsureNodeConditionNotPresent(1797-1818)
health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (3)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-18)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(75-77)ProcessingStrategy(79-81)ProcessingStrategy(88-90)health-monitors/syslog-health-monitor/pkg/sxid/types.go (1)
SXIDHandler(29-36)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(75-77)ProcessingStrategy(79-81)ProcessingStrategy(88-90)
π Additional comments (57)
tests/data/fatal-health-event.json (1)
20-20: LGTM! Processing strategy addition is appropriate for fatal events.The addition of
processingStrategy: 1(EXECUTE_REMEDIATION) aligns well with the fatal nature of this test event, as fatal health events would typically require remediation actions.store-client/pkg/datastore/providers/postgresql/sql_filter_builder.go (1)
404-404: LGTM! Field mapping follows established conventions.The new
processingstrategyβprocessingStrategymapping correctly handles the case conversion between MongoDB's lowercase bson field names and PostgreSQL's camelCase JSON field names, consistent with the existing pattern and aligns with the project's field naming conventions.Based on learnings, this ensures consistency with how protobuf fields are stored and queried across different data stores.
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
1550-1559: LGTM on test setup and isolation.Good use of local context, clientset, and ring buffer per test case to ensure test isolation. The deferred
close(stopCh)ensures proper cleanup.tests/data/managed-by-nvsentinel-configmap.yaml (1)
43-59: LGTM! The new rule-set structure is correct and properly mirrors the existing GPU health monitor rule.Using the same taint and cordon settings for both
gpu-health-monitorandsyslog-health-monitorXID errors ensures consistent node quarantine behavior regardless of the detection source. The syslog-health-monitor component emits HealthEvents withagent == 'syslog-health-monitor'andcheckName == 'SysLogsXIDError'as expected (extensively verified in the test suite).health-monitors/csp-health-monitor/pkg/triggerengine/trigger_test.go (3)
243-246: LGTM: ProcessingStrategy correctly set for maintenance events.The test expectations properly include
ProcessingStrategy: pb.ProcessingStrategy_EXECUTE_REMEDIATIONfor CSP maintenance events, which aligns with the intended behavior where these events should trigger actual remediation workflows.
272-274: LGTM: Healthy event test updated correctly.The ProcessingStrategy field is appropriately set to EXECUTE_REMEDIATION for healthy/recovery events, maintaining consistency with the quarantine event behavior.
336-338: LGTM: Unknown action case handles ProcessingStrategy consistently.Even for events with unknown recommended actions (which default to NONE), the ProcessingStrategy is correctly set to EXECUTE_REMEDIATION, ensuring the event still participates in the remediation workflow.
data-models/protobufs/health_event.proto (1)
32-39: LGTM: Well-documented ProcessingStrategy enum.The enum definition clearly documents the behavioral distinction between
EXECUTE_REMEDIATION(modifies cluster state) andSTORE_ONLY(observability-only). The three-value design (including UNSPECIFIED) follows protobuf best practices.health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
14-18: LGTM: Generated Python type stubs correctly reflect proto changes.The auto-generated Python type stub properly includes the new
ProcessingStrategyenum and integrates it into theHealthEventmessage with appropriate type annotations. The structure is consistent with the protobuf definition.Also applies to: 32-34, 80-80, 106-106, 122-122, 140-140
health-monitors/syslog-health-monitor/pkg/xid/types.go (1)
20-20: LGTM: ProcessingStrategy field added cleanly to XIDHandler.The import and struct field addition follow Go conventions. The unexported field appropriately stores the processing strategy configuration for the handler, consistent with other configuration fields like
nodeNameandcheckName.Also applies to: 34-34
health-monitors/syslog-health-monitor/pkg/xid/xid_handler.go (1)
189-189: LGTM: ProcessingStrategy propagated to HealthEvent.The handler's
processingStrategyis correctly assigned to the generatedHealthEvent, completing the data flow from configuration to emitted events.tests/helpers/healthevent.go (1)
155-158: LGTM: Fluent setter follows existing pattern.The
WithProcessingStrategymethod correctly follows the established fluent builder pattern used by other methods in this struct.health-monitors/csp-health-monitor/pkg/triggerengine/trigger.go (1)
362-364: LGTM: TODO properly references tracking PR.The hardcoded
ProcessingStrategy_EXECUTE_REMEDIATIONvalue is acceptable as an interim implementation, and the TODO comment appropriately references PR #641 for making it configurable via the config file.tests/data/unsupported-health-event.json (1)
20-20: LGTM: Test data includes ProcessingStrategy field.The
processingStrategyvalue of 1 corresponds toEXECUTE_REMEDIATIONand aligns with the new ProcessingStrategy enum support.tests/helpers/event_exporter.go (1)
221-226: LGTM: Validation logic correctly extended.The
ValidateCloudEventfunction signature and implementation are correctly updated to validate theprocessingStrategyfield in CloudEvent payloads, following the existing validation pattern.Also applies to: 245-245
health-monitors/syslog-health-monitor/pkg/sxid/types.go (1)
20-20: LGTM: Proper use of enum type for ProcessingStrategy.The
processingStrategyfield correctly uses thepb.ProcessingStrategyenum type rather thanint, providing type safety for production code. The import aliaspbfollows Go conventions.Also applies to: 34-34
health-monitors/syslog-health-monitor/main.go (3)
70-72: LGTM!The new
--processing-strategyflag is well-defined with a sensible default (EXECUTE_REMEDIATION) and clear documentation of allowed values.
164-171: LGTM!The validation logic correctly uses the protobuf-generated
ProcessingStrategy_valuemap to validate the flag input, provides a clear error message for invalid values, and logs the configured strategy for observability.
173-184: LGTM!The
processingStrategyis correctly passed toNewSyslogMonitor, threading the configuration through to the monitor instance.health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)
1-51: Generated protobuf code - no manual review needed.This file is auto-generated by the protocol buffer compiler. The changes correctly reflect the addition of the
ProcessingStrategyenum and the newprocessingStrategyfield inHealthEvent. Ensure this file is regenerated from the source.protofile rather than manually edited.tests/fault_quarantine_test.go (4)
26-32: LGTM!Import for the
protospackage is correctly added to accessProcessingStrategyenum values.
234-251: LGTM!Test setup correctly initializes the quarantine test context and enables NVSentinel management on the node before running assertions.
253-289: LGTM!Good coverage for
STORE_ONLYbehavior - the test correctly verifies that events withSTORE_ONLYprocessing strategy do not trigger quarantine actions (no node condition, no cordoning, no annotation).
291-332: LGTM!Good coverage for
EXECUTE_REMEDIATIONbehavior - the test correctly verifies that events withEXECUTE_REMEDIATIONprocessing strategy trigger the expected quarantine actions (node condition present, cordoned with annotation).health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler.go (2)
30-42: LGTM!The constructor correctly accepts the new
processingStrategyparameter and initializes the handler field. The signature change aligns with the pattern used in other handlers (XID, GPUFallen).
95-110: LGTM!The
ProcessingStrategyis correctly propagated to theHealthEventpayload, ensuring the strategy configured at initialization flows through to emitted events.health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor_test.go (7)
309-317: LGTM!Test correctly verifies that
NewSyslogMonitoraccepts and stores theprocessingStrategyparameter. The assertion on line 317 confirms the field is properly set.
329-334: LGTM!Test correctly verifies
NewSyslogMonitorWithFactorywith a different strategy (EXECUTE_REMEDIATION), providing coverage for both factory variants.
391-403: LGTM!
TestJournalProcessingLogicupdated to include theprocessingStrategyparameter.
494-507: LGTM!
TestJournalStateManagementupdated to include theprocessingStrategyparameter in both monitor instances.
574-586: LGTM!
TestBootIDChangeHandlingupdated to include theprocessingStrategyparameter.
624-637: LGTM!
TestRunMultipleChecksupdated to include theprocessingStrategyparameter.
665-679: LGTM!
TestGPUFallenOffHandlerInitializationupdated to include theprocessingStrategyparameter.health-monitors/syslog-health-monitor/pkg/xid/xid_handler_test.go (8)
88-91: LGTM!
TestDetermineFatalitycorrectly updated to include theprocessingStrategyparameter.
141-148: LGTM!Table-driven test case for NVRM GPU Map Line correctly includes
processingStrategy.
156-195: Good test coverage forSTORE_ONLYprocessing strategy.The "Valid XID Message" test case correctly uses
STORE_ONLYand validates that theProcessingStrategyis propagated to the event. This provides coverage for the non-default strategy path.
200-232: LGTM!"Valid XID with GPU UUID" test case correctly validates
EXECUTE_REMEDIATIONpropagation.
237-280: LGTM!Error path test cases correctly updated with
processingStrategyparameter.
306-349: LGTM!
TestCreateHealthEventFromResponsecorrectly validates thatProcessingStrategyis included in the generated health event.
351-379: LGTM!"XID Event with GPU UUID" test case validates
ProcessingStrategypropagation with entity enrichment.
427-446: LGTM!
TestNewXIDHandlertable-driven tests correctly updated withprocessingStrategyparameter.health-monitors/syslog-health-monitor/pkg/sxid/sxid_handler_test.go (4)
17-25: LGTM!New imports for
os,filepath, and protobuf package are correctly added to support the new test functionality.
44-56: LGTM!
TestNewSXIDHandlercorrectly updated to include theprocessingStrategyparameter.
143-196: Excellent test coverage for SXID handler with valid topology.This test validates the complete flow: metadata loading, SXID parsing, GPU resolution via NVSwitch link topology, and
ProcessingStrategypropagation. The temporary metadata file approach is appropriate for isolated testing.One minor observation: the test hardcodes entity indices (3, 4) for GPU assertions. If the entity order changes in the handler, these assertions would break. Consider using a helper to find entities by type, or document why these indices are expected.
198-249: LGTM!
TestProcessLinetable-driven test correctly updated to includeprocessingStrategyparameter and validate its propagation in the event.tests/syslog_health_monitor_test.go (1)
362-422: LGTM! Well-structured STORE_ONLY strategy test.The new test correctly:
- Configures the syslog monitor with STORE_ONLY strategy during setup
- Verifies that XID errors do not trigger node conditions or cordoning
- Restores EXECUTE_REMEDIATION strategy during teardown to avoid affecting subsequent tests
health-monitors/syslog-health-monitor/pkg/syslog-monitor/syslogmonitor.go (3)
41-58: LGTM! Processing strategy parameter correctly threaded.The
processingStrategyparameter is properly added toNewSyslogMonitorand forwarded toNewSyslogMonitorWithFactory.
63-151: LGTM! Processing strategy initialization is correct.The
processingStrategyparameter is:
- Added to the function signature
- Stored in the
SyslogMonitorstruct (line 96)- Passed to all handler constructors (lines 110, 120, 130)
This correctly threads the strategy through the monitor initialization.
792-820: LGTM! Processing strategy correctly set on health events.Line 813 correctly assigns
sm.processingStrategytoevent.ProcessingStrategy, ensuring the strategy is propagated to all health events generated by the monitor.tests/helpers/kube.go (6)
387-409: LGTM!The function correctly uses
require.Neverto assert that a specified event never appears on the node during the timeout period. The logic and error handling are appropriate for a negative assertion in tests.
2232-2273: LGTM!The rollout wait logic correctly verifies that all DaemonSet pods are scheduled, updated, and ready before proceeding. The progress logging is helpful for test debugging.
2353-2383: LGTM!The function correctly handles both
--flag=valueand--flag valueargument styles, including proper bounds checking when inserting values.
2385-2407: LGTM!The function correctly updates existing arguments in place or appends new ones, handling both value and boolean flag styles appropriately.
2409-2433: LGTM!The function correctly removes arguments in both
--flag=valueand--flag valuestyles, properly handling the deletion of both the flag and its associated value when applicable.
2435-2479: LGTM!The function correctly waits for a DaemonSet pod to be fully ready on a worker node, with proper validation that the pod is not being deleted, is running, and has all readiness checks passing.
health-monitors/syslog-health-monitor/pkg/gpufallen/types.go (2)
52-52: LGTM!The
processingStrategyfield is well-integrated into the struct. The field naming follows Go conventions, and its placement before the mutex fields is appropriate for configuration data that's set during initialization.
22-23: LGTM! The import aliaspbfor the protos package follows Go conventions and is correctly structured.
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
Show resolved
Hide resolved
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
Show resolved
Hide resolved
0a1dc9a to
1df5558
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
π€ Fix all issues with AI agents
In
@platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go:
- Line 1509: Two test cases in k8s_platform_connector_test.go set ErrorCode to a
slice containing an empty string (ErrorCode: []string{""}) which should be an
empty slice; update both instances to use an empty slice literal (ErrorCode:
[]string{}) so they match other tests and semantic intent. Locate the ErrorCode
field in the failing test case structs (the occurrences near the existing DCGM
error-code tests) and replace []string{""} with []string{} in both places.
π§Ή Nitpick comments (6)
tests/helpers/kube.go (3)
387-409: Consider failing fast on persistent API errors.If
GetNodeEventsfails repeatedly due to API issues,require.Neverwill continue checking until timeout, masking the real error. Consider tracking consecutive failures and callingt.Fatalfafter a threshold.β»οΈ Proposed enhancement
func EnsureNodeEventNotPresent(ctx context.Context, t *testing.T, c klient.Client, nodeName string, eventType, eventReason string) { t.Helper() + consecutiveErrors := 0 + maxConsecutiveErrors := 3 + require.Never(t, func() bool { events, err := GetNodeEvents(ctx, c, nodeName, eventType) if err != nil { t.Logf("failed to get events for node %s: %v", nodeName, err) + consecutiveErrors++ + if consecutiveErrors >= maxConsecutiveErrors { + t.Fatalf("failed to get events %d times in a row, likely API issue", consecutiveErrors) + } return false } + consecutiveErrors = 0 for _, event := range events.Items { if event.Type == eventType && event.Reason == eventReason { t.Logf("node %s has event %v", nodeName, event) return true } } t.Logf("node %s does not have event %v", nodeName, eventType) return false }, NeverWaitTimeout, WaitInterval, "node %s should not have event %v", nodeName, eventType) }
2317-2318: Clarify or remove the hard-coded sleep.The 10-second sleep after rollout completion seems arbitrary, since
waitForDaemonSetRolloutalready verifies all pods are updated and ready. If this addresses a specific race condition (e.g., waiting for containers to initialize), document it. Otherwise, consider removing it.
2362-2392: Consider simplifying the nested append.Line 2384 uses nested
appendcalls which create two allocations. For test helper code this is acceptable, but it could be slightly more efficient and clearer.β»οΈ Alternative approach
// Match --flag or --flag value style if existingArg == flag { if value != "" { if j+1 < len(container.Args) && !strings.HasPrefix(container.Args[j+1], "-") { container.Args[j+1] = value } else { - container.Args = append(container.Args[:j+1], append([]string{value}, container.Args[j+1:]...)...) + // Insert value after flag + container.Args = append(container.Args, "") + copy(container.Args[j+2:], container.Args[j+1:]) + container.Args[j+1] = value } } return true }tests/helpers/syslog-health-monitor.go (3)
15-15: Add package-level godoc comment.Per Go conventions and coding guidelines, packages should have a package-level documentation comment explaining their purpose.
π Suggested documentation
+// Package helpers provides test utilities for setting up and tearing down +// test environments for NVSentinel health monitors. package helpersBased on coding guidelines, package-level godoc is required for all Go packages.
32-34: Follow Go godoc conventions for exported function comments.Function comments for exported functions should start with the function name according to Go documentation conventions.
π Proposed fix
-// helper function to set up syslog health monitor and port forward to it +// SetUpSyslogHealthMonitor sets up the syslog health monitor test environment by +// optionally updating DaemonSet args, injecting metadata, establishing port-forwarding, +// and configuring node labels. Returns the test node name, syslog pod, and stop channel. func SetUpSyslogHealthMonitor(ctx context.Context, t *testing.T, client klient.Client, args map[string]string) (string, *v1.Pod, chan struct{}) {Based on coding guidelines, function comments are required for all exported Go functions and should follow standard Go conventions.
73-76: Follow Go godoc conventions for exported function comments.Function comments for exported functions should start with the function name and clearly describe what the function does.
π Proposed fix
-// helper function to roll back syslog health monitor daemonset and stop the port forward +// TearDownSyslogHealthMonitor tears down the syslog health monitor test environment by +// stopping port-forwarding, optionally restoring DaemonSet args, restarting the pod, +// cleaning up metadata, and removing node labels. func TearDownSyslogHealthMonitor(ctx context.Context, t *testing.T, client klient.Client, nodeName string, stopChan chan struct{}, args map[string]string, podName string) {Based on coding guidelines, function comments are required for all exported Go functions and should follow standard Go conventions.
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (4)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gotests/helpers/kube.gotests/helpers/syslog-health-monitor.gotests/syslog_health_monitor_test.go
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gotests/helpers/syslog-health-monitor.gotests/syslog_health_monitor_test.gotests/helpers/kube.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gotests/syslog_health_monitor_test.go
π§ Learnings (8)
π Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:40.921Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
π Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gotests/syslog_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label
Applied to files:
tests/helpers/kube.go
𧬠Code graph analysis (2)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (4)
data-models/pkg/protos/health_event.pb.go (17)
HealthEvent(263-283)HealthEvent(296-296)HealthEvent(311-313)Entity(211-217)Entity(230-230)Entity(245-247)RecommendedAction(92-92)RecommendedAction(142-144)RecommendedAction(146-148)RecommendedAction(155-157)ProcessingStrategy(43-43)ProcessingStrategy(75-77)ProcessingStrategy(79-81)ProcessingStrategy(88-90)HealthEvents(159-165)HealthEvents(178-178)HealthEvents(193-195)health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (5)
HealthEvent(63-141)Entity(55-61)RecommendedAction(20-30)ProcessingStrategy(14-18)HealthEvents(45-53)platform-connectors/pkg/connectors/kubernetes/k8s_connector.go (1)
NewK8sConnector(47-58)platform-connectors/pkg/pipeline/factory.go (1)
Create(32-39)
tests/syslog_health_monitor_test.go (2)
tests/helpers/syslog-health-monitor.go (2)
SetUpSyslogHealthMonitor(33-71)StubJournalHTTPPort(27-27)tests/helpers/kube.go (2)
InjectSyslogMessages(1820-1843)PortForwardPod(1614-1658)
π Additional comments (9)
tests/helpers/kube.go (5)
2232-2273: LGTM!The rollout verification logic correctly checks that all DaemonSet pods are scheduled, updated, and ready before proceeding.
2323-2360: LGTM!Container validation is properly implemented, and the function correctly handles the retry-on-conflict pattern.
2394-2416: LGTM!The function correctly updates existing arguments or appends new ones.
2418-2442: LGTM!The function correctly removes arguments in both
--flag=valueand--flag valueformats. Thebreakafter removal ensures only the first occurrence is removed, which is appropriate for typical container args.
2444-2488: LGTM!The function properly validates the pod is from the current rollout, not being deleted, running, and fully ready before returning it.
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (3)
1606-1614: Previous concern properly addressed.The code now uses
require.NotEmptybefore accessingnvsentinelConditions[0], which correctly prevents the index-out-of-range panic that was flagged in earlier reviews. The test will fail with a clear message if the expected condition is not found.
1622-1630: Previous concern properly addressed.The code now uses
require.NotEmptybefore accessingevents.Items[0], which correctly prevents the index-out-of-range panic that was flagged in earlier reviews. The test will fail with a clear message if the expected event is not found.
1391-1635: Well-structured test for processing strategy validation.The test provides comprehensive coverage of the new
ProcessingStrategyfield:
- STORE_ONLY events correctly bypass node condition and Kubernetes event creation
- EXECUTE_REMEDIATION events properly create node conditions (fatal) or Kubernetes events (non-fatal)
- Mixed strategy batches are handled correctly
The test follows Go best practices with table-driven structure, proper test isolation via dedicated clientsets, and clear assertions using testify (consistent with existing patterns in this file).
tests/syslog_health_monitor_test.go (1)
361-421: Well-structured test for STORE_ONLY strategy.The new test correctly validates STORE_ONLY behavior by:
- Setting up the monitor with
--processing-strategy: STORE_ONLY- Verifying no node conditions are created after XID injection
- Verifying no node cordoning occurs
- Restoring EXECUTE_REMEDIATION strategy in teardown for proper cleanup
This follows the established test patterns and properly exercises the new processing strategy feature.
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
Show resolved
Hide resolved
1df5558 to
e5a8295
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
π€ Fix all issues with AI agents
In @tests/helpers/kube.go:
- Around line 2326-2366: The RemoveDaemonSetArgs function uses
require.NoError(t, err, ...) which fails the test immediately and prevents
returning the error to the caller; change this to propagate the error like
UpdateDaemonSetArgs by replacing the require.NoError call with standard error
handling (if err != nil { t.Logf("failed to remove args from daemonset %s/%s:
%v", NVSentinelNamespace, daemonsetName, err); return err }) so the function
returns the error to callers while preserving the existing logs and rollout wait
behavior; keep the function signature as RemoveDaemonSetArgs(... ) error and
ensure callers handle the returned error.
π§Ή Nitpick comments (3)
tests/helpers/kube.go (2)
387-409: Consider improving log message specificity.Line 405 logs only
eventType, but the function also receiveseventReasonas a parameter. Including both in the log message would provide more context for debugging.β»οΈ Proposed improvement
- t.Logf("node %s does not have event %v", nodeName, eventType) + t.Logf("node %s does not have event type=%s reason=%s", nodeName, eventType, eventReason)
2320-2321: Consider replacing fixed sleep with condition-based wait.The hardcoded 10-second sleep after rollout completion could be replaced with polling for the specific condition you're waiting for (e.g., pod readiness, specific container state). This would make tests more reliable and potentially faster.
tests/syslog_health_monitor_test.go (1)
361-421: LGTM: Well-structured STORE_ONLY strategy test.The test correctly verifies that STORE_ONLY mode prevents both condition creation and node cordoning when XID errors are injected. The setup/teardown properly configure and restore the processing strategy.
Optional: Make assess description more concise and complete
The description could mention both aspects being verified (condition and cordoning) more concisely:
- feature.Assess("Inject XID errors and verify no node condition is created when running in STORE_ONLY strategy", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context { + feature.Assess("Verify STORE_ONLY strategy prevents condition creation and cordoning", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (4)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gotests/helpers/kube.gotests/helpers/syslog-health-monitor.gotests/syslog_health_monitor_test.go
π§ Files skipped from review as they are similar to previous changes (2)
- tests/helpers/syslog-health-monitor.go
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/syslog_health_monitor_test.gotests/helpers/kube.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/syslog_health_monitor_test.go
π§ Learnings (6)
π Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/syslog_health_monitor_test.go
𧬠Code graph analysis (1)
tests/syslog_health_monitor_test.go (2)
tests/helpers/syslog-health-monitor.go (3)
SetUpSyslogHealthMonitor(33-71)StubJournalHTTPPort(27-27)TearDownSyslogHealthMonitor(74-113)tests/helpers/kube.go (3)
InjectSyslogMessages(1820-1843)PortForwardPod(1614-1658)EnsureNodeConditionNotPresent(1797-1818)
π Additional comments (6)
tests/helpers/kube.go (3)
2232-2273: LGTM!The rollout verification logic correctly checks all necessary DaemonSet status fields (DesiredNumberScheduled, UpdatedNumberScheduled, NumberReady) to ensure complete rollout.
2368-2448: LGTM!The container argument manipulation helpers correctly handle multiple argument styles (
--flag=value,--flag value,--flag) with proper slice operations for updates, insertions, and removals.
2450-2494: LGTM!The function properly validates that the DaemonSet pod is running, ready, and not being deleted, ensuring tests interact with stable pods from the current rollout generation.
tests/syslog_health_monitor_test.go (3)
25-39: LGTM: Clean imports and context keys.The helper import and updated context key names improve code clarity and maintainability.
42-173: LGTM: Well-refactored setup and teardown.The test now uses centralized helper functions consistently, improving maintainability. The context-based value passing follows the e2e-framework pattern correctly.
290-359: LGTM: Consistent refactoring pattern.The SXID detection test follows the same centralized setup/teardown pattern as the XID detection test, ensuring consistency across the test suite.
Signed-off-by: Tanisha goyal <[email protected]>
e5a8295 to
0752eb5
Compare
Merging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
π€ Fix all issues with AI agents
In @tests/syslog_health_monitor_test.go:
- Around line 405-418: The args map passed to TearDownSyslogHealthMonitor uses a
non-empty value for "--processing-strategy" even though removeArgsFromContainer
(and RemoveDaemonSetArgs) only inspects map keys; change the map entry to
"--processing-strategy": "" (empty string) or add an inline comment next to the
map literal documenting that only keys are used by
removeArgsFromContainer/RemoveDaemonSetArgs so values are ignored, so readers
aren't misled by a non-empty value.
π§Ή Nitpick comments (4)
tests/helpers/kube.go (3)
2281-2293: Unused variableoriginalDaemonSet.The variable
originalDaemonSetis assigned on line 2292 but never used. This appears to be leftover code, possibly from a previous implementation that stored the original state for rollback.β»οΈ Remove unused variable
func UpdateDaemonSetArgs(ctx context.Context, t *testing.T, client klient.Client, daemonsetName string, containerName string, args map[string]string) error { t.Helper() - var originalDaemonSet *appsv1.DaemonSet - t.Logf("Updating daemonset %s/%s with args %v", NVSentinelNamespace, daemonsetName, args) err := retry.RetryOnConflict(retry.DefaultRetry, func() error { daemonSet := &appsv1.DaemonSet{} if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil { return err } - if originalDaemonSet == nil { - originalDaemonSet = daemonSet.DeepCopy() - } - containers := daemonSet.Spec.Template.Spec.Containers
2321-2322: Magic sleep after rollout.The 10-second hardcoded sleep after rollout completion is a code smell. If pods need additional time after becoming "Ready", consider either:
- Adding a specific readiness check for the functionality you're waiting for
- Extracting this as a named constant with documentation explaining why it's needed
2452-2496: ParameterdaemonsetNameis underutilized.The
daemonsetNameparameter is only used in the error message on line 2492. The function doesn't verify that the found pod actually belongs to the specified DaemonSet - it relies solely onpodNamePatternmatching.This could be intentional if the pattern is specific enough, but consider either:
- Adding validation that the pod's owner reference matches the DaemonSet
- Removing
daemonsetNameif it's truly not neededtests/helpers/syslog-health-monitor.go (1)
73-113: Consider adding error handling for cleanup operations.The teardown function logs warnings for errors during cleanup (lines 89, 111) but uses
require.NoErrorforRemoveDaemonSetArgs(line 82). This inconsistency could cause test failures during teardown if the arg removal fails, while other cleanup steps are more lenient.For consistency in teardown, consider either:
- Making all cleanup steps warn-only to ensure full cleanup attempts
- Or keeping
require.NoErroronly for critical restoration steps
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (4)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gotests/helpers/kube.gotests/helpers/syslog-health-monitor.gotests/syslog_health_monitor_test.go
π§ Files skipped from review as they are similar to previous changes (1)
- platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/syslog_health_monitor_test.gotests/helpers/syslog-health-monitor.gotests/helpers/kube.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/syslog_health_monitor_test.go
π§ Learnings (7)
π Common learnings
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 609
File: tests/data/healthy-event.json:19-20
Timestamp: 2026-01-07T09:54:49.335Z
Learning: In NVSentinel, healthy events (isHealthy: true) can legitimately use processingStrategy: 1 (EXECUTE_REMEDIATION) when the Fault Quarantine Manager needs to act on them to clear previous fault states or update cluster resources to reflect the healthy status.
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/syslog_health_monitor_test.go
π Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
tests/helpers/kube.go
𧬠Code graph analysis (1)
tests/syslog_health_monitor_test.go (3)
tests/helpers/syslog-health-monitor.go (3)
SetUpSyslogHealthMonitor(33-71)StubJournalHTTPPort(27-27)TearDownSyslogHealthMonitor(74-113)tests/helpers/kube.go (2)
InjectSyslogMessages(1820-1843)PortForwardPod(1614-1658)tests/helpers/fault_quarantine.go (1)
AssertQuarantineState(315-382)
π Additional comments (15)
tests/helpers/kube.go (5)
387-409: LGTM - Clean implementation of negative event assertion.The function correctly uses
require.Neverto assert that a specific event type/reason never appears on a node within the timeout period. The error handling on line 393-396 appropriately returnsfalse(meaning "condition not met, keep checking") rather than failing the test.
2232-2273: LGTM - Robust rollout wait implementation.The function correctly checks all necessary conditions for a complete DaemonSet rollout:
DesiredNumberScheduled > 0UpdatedNumberScheduled == DesiredNumberScheduledNumberReady == DesiredNumberScheduledGood logging at each step for debugging.
2370-2400: Slice insertion logic is correct but complex.The slice manipulation on lines 2391-2393 correctly inserts a value after the flag when the next argument is another flag (starts with
-). The logic handles both--flag=valueand--flag valuestyles appropriately.
2402-2424: LGTM - Clean implementation of arg setting.The function correctly:
- Attempts to update existing args via
tryUpdateExistingArg- Falls back to appending new args if not found
- Handles both
--flag=valueand boolean flag styles
2426-2450: LGTM - Correct arg removal implementation.The function properly handles both
--flag=valueand--flag valuestyles when removing arguments. Breaking after modification avoids issues with concurrent slice modification during iteration.tests/syslog_health_monitor_test.go (7)
36-39: LGTM - Clear context key definitions.The context keys follow Go naming conventions and provide clear identifiers for passing test state between setup, assess, and teardown phases.
47-58: LGTM - Centralized setup pattern.The setup correctly uses the new
helpers.SetUpSyslogHealthMonitorhelper, storing node name, pod name, and stop channel in context for use in assess and teardown phases.
159-170: LGTM - Centralized teardown pattern.The teardown correctly retrieves context values and delegates to
helpers.TearDownSyslogHealthMonitor. Passingnilfor args indicates no DaemonSet argument restoration is needed.
195-203: LGTM - Consistent port configuration.Using
helpers.StubJournalHTTPPortfor both local and remote ports ensures consistency across tests and centralizes the port configuration.
295-359: LGTM - Consistent setup/teardown pattern.The SXID detection test follows the same centralized setup/teardown pattern as the XID detection test, maintaining consistency across the test suite.
361-379: Well-structured STORE_ONLY strategy test.The test correctly:
- Follows naming conventions per coding guidelines
- Sets up with
--processing-strategy=STORE_ONLYargument- Stores context values for teardown
Good use of the centralized setup helper with custom arguments.
381-403: LGTM - Correct STORE_ONLY behavior validation.The assess phase correctly verifies that when running in STORE_ONLY mode:
- XID messages are injected
- No node condition is created (
EnsureNodeConditionNotPresent)- Node is not quarantined/cordoned (
AssertQuarantineState)This aligns with the expected behavior where STORE_ONLY events should be stored but not trigger remediation.
tests/helpers/syslog-health-monitor.go (3)
26-30: LGTM - Clear constant definitions.The constants properly define the port and resource names used by the syslog health monitor test helpers, making them easy to reference consistently across tests.
32-71: LGTM - Well-structured setup helper.The function provides a clean setup sequence:
- Optionally updates DaemonSet args for test configuration
- Locates the syslog health monitor pod on a worker node
- Injects test metadata
- Establishes port-forwarding
- Configures the node label
Good use of
require.NoErrorfor critical setup steps and returning all necessary values for teardown.
91-101: No changes needed - the condition status check logic is correct.The code at lines 93-100 properly checks for
condition.Status == v1.ConditionFalsewith reason"SysLogsXIDErrorIsHealthy"to verify the error has been cleared. This is consistent with the codebase pattern:
Status=Truewith reason"IsNotHealthy"indicates an active errorStatus=Falsewith reason"IsHealthy"indicates the error has been resolved and the condition is no longer activeThe check is semantically correct and follows established conventions throughout the test suite.
| feature.Teardown(func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context { | ||
| client, err := c.NewClient() | ||
| require.NoError(t, err, "failed to create kubernetes client") | ||
|
|
||
| nodeName := ctx.Value(keySyslogNodeName).(string) | ||
| stopChan := ctx.Value(keyStopChan).(chan struct{}) | ||
| syslogPod := ctx.Value(keySyslogPodName).(string) | ||
|
|
||
| helpers.TearDownSyslogHealthMonitor(ctx, t, client, nodeName, stopChan, map[string]string{ | ||
| "--processing-strategy": "EXECUTE_REMEDIATION", | ||
| }, syslogPod) | ||
|
|
||
| return ctx | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
π§© Analysis chain
π Script executed:
# Find RemoveDaemonSetArgs function and examine its implementation
fd -e go tests/helpers | head -20Repository: NVIDIA/NVSentinel
Length of output: 418
π Script executed:
# Search for RemoveDaemonSetArgs function definition and usage
rg -n "RemoveDaemonSetArgs" --type go -A 15Repository: NVIDIA/NVSentinel
Length of output: 2451
π Script executed:
# Also check the kube.go helper file directly
fd -n "kube.go" | xargs wc -lRepository: NVIDIA/NVSentinel
Length of output: 310
π Script executed:
# Read the RemoveDaemonSetArgs function implementation
sed -n '2327,2380p' tests/helpers/kube.goRepository: NVIDIA/NVSentinel
Length of output: 1768
π Script executed:
# Search for removeArgsFromContainer function
rg -n "func removeArgsFromContainer" --type go -A 30Repository: NVIDIA/NVSentinel
Length of output: 1791
π Script executed:
# Find all calls to RemoveDaemonSetArgs
rg -n "RemoveDaemonSetArgs" --type go -B 2 -A 5Repository: NVIDIA/NVSentinel
Length of output: 1070
Use empty string or document that RemoveDaemonSetArgs only uses map keys.
The removeArgsFromContainer function iterates only over flag names (keys) in the args mapβthe values are completely ignored. Passing "--processing-strategy": "EXECUTE_REMEDIATION" works correctly (removes the flag), but the non-empty value is misleading. Clarify intent by passing an empty string ("") or add a comment explaining that only keys matter.
π€ Prompt for AI Agents
In @tests/syslog_health_monitor_test.go around lines 405 - 418, The args map
passed to TearDownSyslogHealthMonitor uses a non-empty value for
"--processing-strategy" even though removeArgsFromContainer (and
RemoveDaemonSetArgs) only inspects map keys; change the map entry to
"--processing-strategy": "" (empty string) or add an inline comment next to the
map literal documenting that only keys are used by
removeArgsFromContainer/RemoveDaemonSetArgs so values are ignored, so readers
aren't misled by a non-empty value.
Summary
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Behavior Changes
Tests
βοΈ Tip: You can customize this high-level summary in your review settings.