-
Notifications
You must be signed in to change notification settings - Fork 33
feat: add event handling strategy changes in gpu health monitor #611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: add event handling strategy changes in gpu health monitor #611
Conversation
Signed-off-by: Tanisha goyal <[email protected]>
π WalkthroughWalkthroughAdds a ProcessingStrategy enum and processingStrategy field to HealthEvent and threads the strategy through CLI, event creation, exporters, storage pipelines, Kubernetes processing, and tests so STORE_ONLY events are stored without triggering remediation. Changes
Sequence DiagramsequenceDiagram
participant CLI as GPU Monitor CLI
participant Init as _init_event_processor
participant PC as PlatformConnectorEventProcessor
participant HE as HealthEvent (protobuf)
participant Exporter as Event Exporter
participant Store as Storage (DB / pipeline)
participant K8s as K8s Connector
CLI->>Init: parse --processing-strategy
Init->>PC: init(..., processing_strategy)
PC->>PC: store self._processing_strategy
PC->>HE: create HealthEvent(processingStrategy=self._processing_strategy)
HE->>Exporter: emit CloudEvent (includes processingStrategy)
Exporter->>Store: persist event (all strategies)
rect rgba(200,230,200,0.6)
note right of Store: Watch pipeline filters EXECUTE_REMEDIATION
Store->>K8s: deliver only EXECUTE_REMEDIATION events
K8s->>K8s: update Node conditions / create Node events
end
rect rgba(230,200,200,0.6)
note right of Store: STORE_ONLY path
Store->>Store: stored for observability (no K8s remediation)
end
Estimated code review effortπ― 4 (Complex) | β±οΈ ~45 minutes
Pre-merge checks and finishing touchesβ Failed checks (1 warning)
β Passed checks (2 passed)
β¨ Finishing touches
π§ͺ Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. π§ golangci-lint (2.5.0)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
π§Ή Nitpick comments (5)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
325-343: Add godoc for the exported function.The function
filterProcessableEventsis exported (starts with uppercase) but lacks a proper godoc comment. Per coding guidelines, function comments are required for all exported Go functions. The comment on line 325 is present but should follow godoc format.Also, consider nil-safety: if
healthEventsorhealthEvents.Eventsis nil, this could panic.π Suggested improvement
-// filterProcessableEvents filters out STORE_ONLY events that should not create node conditions or K8s events. -func filterProcessableEvents(healthEvents *protos.HealthEvents) []*protos.HealthEvent { +// filterProcessableEvents filters out STORE_ONLY events that should not create node conditions or K8s events. +// It returns only events with processing strategy other than STORE_ONLY. +func filterProcessableEvents(healthEvents *protos.HealthEvents) []*protos.HealthEvent { + if healthEvents == nil || len(healthEvents.Events) == 0 { + return nil + } + var processableEvents []*protos.HealthEvent for _, healthEvent := range healthEvents.Events {tests/gpu_health_monitor_test.go (1)
502-510: Teardown clears "Memory" error that was never injected.The setup only injects an Inforom error (field 84), but the teardown
clearCommandsincludes both Inforom and Memory errors. While this is harmless (clearing a non-existent error is a no-op), it adds unnecessary operations. Consider matching the teardown to what was actually injected.π Suggested simplification
clearCommands := []struct { name string fieldID string value string condition string }{ {"Inforom", "84", "1", "GpuInforomWatch"}, - {"Memory", "395", "0", "GpuMemWatch"}, }health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (1)
125-132: Consider logging the strategy name instead of the integer value.Line 132 logs
processing_strategy_value, which is the integer enum value (e.g.,0or1). For better readability in logs, consider logging the original string name or usingProcessingStrategy.Name().π Suggested improvement
- log.info(f"Event handling strategy configured to: {processing_strategy_value}") + log.info(f"Event handling strategy configured to: {processing_strategy}")This logs the human-readable string (e.g.,
EXECUTE_REMEDIATION) instead of the integer value.tests/helpers/kube.go (2)
2306-2307: Hardcoded sleep after rollout completion is fragile.The 10-second sleep after
waitForDaemonSetRolloutadds unnecessary delay. The rollout wait already ensures pods are ready. If additional startup time is genuinely needed, consider polling for a specific readiness indicator instead.π Suggested improvement
Remove the hardcoded sleep or replace it with a condition-based wait if there's a specific startup behavior to await:
t.Logf("Waiting for daemonset %s/%s rollout to complete", NVSentinelNamespace, daemonsetName) waitForDaemonSetRollout(ctx, t, client, daemonsetName) - t.Logf("Waiting 10 seconds for daemonset pods to start") - time.Sleep(10 * time.Second) - return originalDaemonSet, nil
2270-2272: Add godoc comments for exported functions.Per coding guidelines, exported Go functions require function comments. The new exported functions
UpdateDaemonSetProcessingStrategy,RestoreDaemonSet, andGetDaemonSetPodOnWorkerNodeare missing godoc comments.π Suggested improvement
+// UpdateDaemonSetProcessingStrategy updates the specified container in a DaemonSet to use +// STORE_ONLY processing strategy, waits for rollout completion, and returns the original DaemonSet. func UpdateDaemonSetProcessingStrategy(ctx context.Context, t *testing.T, client klient.Client, daemonsetName string, containerName string) (*appsv1.DaemonSet, error) {+// RestoreDaemonSet restores a DaemonSet's containers to their original state and waits for rollout. func RestoreDaemonSet(ctx context.Context, t *testing.T, client klient.Client, originalDaemonSet *appsv1.DaemonSet, daemonsetName string, ) error {+// GetDaemonSetPodOnWorkerNode returns a running, ready pod from the DaemonSet on a real worker node. func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client, daemonsetName string, podNamePattern string) (*v1.Pod, error) {As per coding guidelines, function comments are required for all exported Go functions.
Also applies to: 2312-2314, 2346-2348
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (1)
data-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.go
π Files selected for processing (23)
data-models/protobufs/health_event.protodistros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yamldistros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yamldistros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yamlevent-exporter/pkg/transformer/cloudevents.goevent-exporter/pkg/transformer/cloudevents_test.gofault-quarantine/pkg/evaluator/rule_evaluator_test.gofault-quarantine/pkg/initializer/init.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/cli.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyihealth-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.pyplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.gostore-client/pkg/client/mongodb_pipeline_builder.gostore-client/pkg/client/pipeline_builder.gostore-client/pkg/client/pipeline_builder_test.gostore-client/pkg/client/postgresql_pipeline_builder.gotests/event_exporter_test.gotests/gpu_health_monitor_test.gotests/helpers/healthevent.gotests/helpers/kube.go
π§° Additional context used
π Path-based instructions (7)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
fault-quarantine/pkg/initializer/init.gostore-client/pkg/client/mongodb_pipeline_builder.gostore-client/pkg/client/postgresql_pipeline_builder.gostore-client/pkg/client/pipeline_builder.gostore-client/pkg/client/pipeline_builder_test.gotests/helpers/healthevent.gofault-quarantine/pkg/evaluator/rule_evaluator_test.gotests/event_exporter_test.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.goevent-exporter/pkg/transformer/cloudevents_test.gotests/helpers/kube.goevent-exporter/pkg/transformer/cloudevents.gotests/gpu_health_monitor_test.go
data-models/protobufs/**/*.proto
π CodeRabbit inference engine (.github/copilot-instructions.md)
data-models/protobufs/**/*.proto: Define Protocol Buffer messages indata-models/protobufs/directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages
Files:
data-models/protobufs/health_event.proto
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
store-client/pkg/client/pipeline_builder_test.gofault-quarantine/pkg/evaluator/rule_evaluator_test.gotests/event_exporter_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.goevent-exporter/pkg/transformer/cloudevents_test.gotests/gpu_health_monitor_test.go
**/values.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
**/daemonset*.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
Explain DaemonSet variant selection logic in Helm chart documentation
Files:
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yamldistros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
distros/kubernetes/**/*daemonset*.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes usingnodeAffinitybased on kata.enabled label
Regular node DaemonSets should use/var/logvolume mount for file-based logs
Kata node DaemonSets should use/run/log/journaland/var/log/journalvolume mounts for systemd journal
Files:
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yamldistros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
**/*.py
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code
Files:
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/cli.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
π§ Learnings (6)
π Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.
Applied to files:
fault-quarantine/pkg/initializer/init.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/event_exporter_test.gotests/gpu_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/event_exporter_test.gotests/gpu_health_monitor_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `testify/assert` and `testify/require` for assertions in Go tests
Applied to files:
tests/event_exporter_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
tests/event_exporter_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation
Applied to files:
tests/helpers/kube.go
𧬠Code graph analysis (10)
store-client/pkg/client/mongodb_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
ToPipeline(161-163)D(131-133)E(126-128)A(136-138)data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
store-client/pkg/client/postgresql_pipeline_builder.go (2)
store-client/pkg/datastore/types.go (4)
ToPipeline(161-163)D(131-133)E(126-128)A(136-138)data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
store-client/pkg/client/pipeline_builder_test.go (3)
store-client/pkg/client/pipeline_builder.go (1)
PipelineBuilder(26-47)store-client/pkg/client/mongodb_pipeline_builder.go (1)
NewMongoDBPipelineBuilder(29-31)store-client/pkg/client/postgresql_pipeline_builder.go (1)
NewPostgreSQLPipelineBuilder(29-31)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
tests/event_exporter_test.go (1)
tests/helpers/event_exporter.go (1)
ValidateCloudEvent(221-245)
platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
data-models/pkg/protos/health_event.pb.go (11)
HealthEvents(156-162)HealthEvents(175-175)HealthEvents(190-192)HealthEvent(260-280)HealthEvent(293-293)HealthEvent(308-310)ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)ProcessingStrategy_STORE_ONLY(47-47)
event-exporter/pkg/transformer/cloudevents_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (5)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)ProcessingStrategy_STORE_ONLY(47-47)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
event-exporter/pkg/transformer/cloudevents.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
tests/gpu_health_monitor_test.go (3)
tests/helpers/kube.go (10)
UpdateDaemonSetProcessingStrategy(2271-2310)GetDaemonSetPodOnWorkerNode(2346-2390)NVSentinelNamespace(64-64)SetNodeManagedByNVSentinel(1365-1384)ExecInPod(1539-1574)EnsureNodeConditionNotPresent(1773-1794)RestoreDaemonSet(2312-2344)CheckNodeConditionExists(1259-1276)EventuallyWaitTimeout(61-61)WaitInterval(63-63)tests/helpers/metadata.go (3)
CreateTestMetadata(59-108)InjectMetadata(110-181)DeleteMetadata(183-229)tests/helpers/fault_quarantine.go (2)
AssertQuarantineState(315-382)QuarantineAssertion(56-60)
π Additional comments (44)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)
1391-1506: Well-structured test suite for processing strategy.The test suite properly validates the new
ProcessingStrategyfiltering behavior with good coverage:
- STORE_ONLY events (fatal and non-fatal) correctly don't create node conditions or K8s events
- EXECUTE_REMEDIATION events properly create node conditions
- Mixed strategy scenarios validate selective processing
The table-driven approach and per-test isolation with local clientsets align well with the coding guidelines.
1508-1587: Test implementation looks correct.The test harness correctly:
- Creates isolated environments per test case
- Filters standard K8s node conditions (NodeReady, NodeMemoryPressure, etc.) when counting NVSentinel conditions
- Validates both the presence/absence of node conditions and K8s events
One minor observation: the
descriptionfield is logged at the end but could be incorporated into the assertion failure messages for better debugging context, though the current approach is acceptable.platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
372-416: Implementation correctly integrates filtering.The
processHealthEventsfunction properly:
- Filters events using
filterProcessableEventsfirst- Uses only
processableEventsfor both node condition updates and K8s event creation- Maintains the existing logic for fatal vs non-fatal event handling
One edge case to consider: if
processableEventsis empty (all events are STORE_ONLY), the function returns early without errors, which appears to be the intended behavior.event-exporter/pkg/transformer/cloudevents.go (1)
66-66: Correctly propagates processingStrategy to CloudEvent.The addition of
processingStrategyto the CloudEvent data payload is consistent with how other enum fields (e.g.,recommendedActionon line 61) are handled, using.String()for serialization.fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)
263-263: Test correctly updated for new processingStrategy field.The expected map now includes
processingStrategy: float64(0), which correctly reflects:
- The default enum value
EXECUTE_REMEDIATION = 0when no explicit value is set- The JSON unmarshaling behavior where numbers become
float64ininterface{}store-client/pkg/client/pipeline_builder_test.go (1)
69-86: Test follows established patterns correctly.The new test
TestProcessableHealthEventInsertsPipelineis well-structured and consistent with the existing test patterns in this file:
- Uses table-driven tests for both MongoDB and PostgreSQL implementations
- Properly validates pipeline is non-nil, non-empty, and has exactly 1 stage
- Uses
requirefor critical assertions andassertfor validationsAs per coding guidelines, consider using a more descriptive test name format like
TestPipelineBuilder_ProcessableHealthEventInsertsto align withTestFunctionName_Scenario_ExpectedBehavior.fault-quarantine/pkg/initializer/init.go (1)
66-66: LGTM! Pipeline switch correctly filters for processable events.The change from
BuildAllHealthEventInsertsPipeline()toBuildProcessableHealthEventInsertsPipeline()correctly ensures that fault-quarantine only processes health events withprocessingStrategy=EXECUTE_REMEDIATION, ignoring observability-only (STORE_ONLY) events.distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml (1)
61-62: LGTM! Processing strategy argument properly configured.The
--processing-strategyargument is correctly added with the value sourced from.Values.processingStrategyand properly quoted. This aligns with the PR objectives to enable configurable event handling strategy.distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml (1)
57-62: LGTM! Well-documented configuration option.The
processingStrategyconfiguration is clearly documented with:
- Valid values (EXECUTE_REMEDIATION, STORE_ONLY)
- Default value that maintains backward compatibility
- Clear explanations of each mode's behavior
This follows the coding guidelines for Helm chart documentation.
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml (1)
61-62: LGTM! Consistent with DCGM 3.x template.The
--processing-strategyargument is added consistently with the DCGM 3.x DaemonSet template, ensuring uniform behavior across DCGM versions.store-client/pkg/client/postgresql_pipeline_builder.go (2)
19-19: LGTM! Import added for ProcessingStrategy enum.The import of
github.com/nvidia/nvsentinel/data-models/pkg/protosis necessary to reference theProcessingStrategy_EXECUTE_REMEDIATIONenum value used in the new pipeline.
119-132: LGTM! Pipeline correctly filters for processable events.The new
BuildProcessableHealthEventInsertsPipeline()method:
- Follows the established pipeline pattern from
BuildAllHealthEventInsertsPipeline()- Correctly filters INSERT operations where
processingStrategyequalsEXECUTE_REMEDIATION- Uses the appropriate
int32cast for the protobuf enum valueThis enables PostgreSQL change streams to ignore observability-only events (
STORE_ONLY).event-exporter/pkg/transformer/cloudevents_test.go (2)
69-69: LGTM! Test correctly initializes ProcessingStrategy.The test event properly sets
ProcessingStrategy: pb.ProcessingStrategy_STORE_ONLYto validate that the new field is handled correctly during CloudEvent transformation.
106-108: LGTM! Test validates ProcessingStrategy propagation.The assertion correctly verifies that the
processingStrategyfield appears in the CloudEvent payload with the expected string value "STORE_ONLY", ensuring proper transformation from the protobuf enum.store-client/pkg/client/pipeline_builder.go (1)
35-38: LGTM! Interface extension well-documented.The new
BuildProcessableHealthEventInsertsPipeline()method:
- Is clearly documented with its purpose and use case
- Explains the filtering behavior (processingStrategy=EXECUTE_REMEDIATION)
- References the consumer (fault-quarantine)
- Follows the documentation pattern of other interface methods
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (2)
14-17: LGTM! ProcessingStrategy enum properly defined.The new
ProcessingStrategyenum class withEXECUTE_REMEDIATIONandSTORE_ONLYmembers is correctly structured according to Python protobuf stub conventions.
31-32: LGTM! HealthEvent ProcessingStrategy integration complete.The
processingStrategyfield is properly integrated into theHealthEventmessage with:
- Module-level enum constants (lines 31-32)
- Field in
__slots__(line 78)- Field number constant (line 104)
- Type annotation (line 120)
__init__parameter with correct typing (line 138)This generated stub correctly reflects the protobuf schema changes.
Also applies to: 78-78, 104-104, 120-120, 138-138
tests/event_exporter_test.go (1)
25-26: Import addition looks correct.The import of
"tests/helpers"is properly added and used throughout the test file for helper functions.data-models/protobufs/health_event.proto (2)
32-38: Well-designed enum with appropriate default value.Using
EXECUTE_REMEDIATION = 0as the default is correct designβexisting clients that don't set the field will get the expected remediation behavior. The comments clearly explain each strategy's semantics, which aligns with the coding guidelines for comprehensive field documentation in Protocol Buffer messages.
77-77: Field addition looks correct.The new
processingStrategyfield at position 16 maintains backward compatibility and follows the existing field numbering sequence.store-client/pkg/client/mongodb_pipeline_builder.go (2)
87-100: Pipeline implementation is correct.The method follows existing patterns and correctly filters for
EXECUTE_REMEDIATIONevents using the protobuf enum's integer value. This ensures only events intended for remediation are processed by downstream consumers.
19-19: Import addition is appropriate.The
protosimport is required to referenceProcessingStrategy_EXECUTE_REMEDIATIONconstant.tests/helpers/healthevent.go (2)
48-48: Field addition follows existing patterns.The
ProcessingStrategyfield withomitemptyJSON tag is consistent with other optional fields in the struct. The default zero value maps toEXECUTE_REMEDIATION, which is the expected default behavior.
153-156: Builder method follows established conventions.The
WithProcessingStrategymethod maintains consistency with other builder methods in the file.tests/gpu_health_monitor_test.go (2)
414-463: Test structure and setup are well-organized.The test properly:
- Updates the DaemonSet to use STORE_ONLY strategy
- Waits for rollout completion
- Injects a GPU error to trigger event generation
- Stores context values for teardown
The use of
UpdateDaemonSetProcessingStrategyandGetDaemonSetPodOnWorkerNodehelpers keeps the test readable.
465-481: Test assertions correctly validate STORE_ONLY behavior.The test verifies that with
STORE_ONLYprocessing strategy:
- Node conditions are NOT applied
- Node is NOT cordoned
- No quarantine annotation is present
This validates the core feature that STORE_ONLY events are observed but don't modify cluster state.
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (5)
107-108: Processing strategy parameter correctly added to test initialization.The test now passes
platformconnector_pb2.STORE_ONLYto the processor, enabling verification of the new processing strategy flow.
301-302: Good assertion for processingStrategy propagation.Verifying that
nvlink_failure_event.processingStrategy == platformconnector_pb2.STORE_ONLYconfirms the strategy is correctly propagated through the event pipeline to the generated HealthEvent.
523-524: Test uses EXECUTE_REMEDIATION for connectivity restored scenario.This is appropriateβthe connectivity restored test uses
EXECUTE_REMEDIATIONwhich tests the alternative processing path, providing coverage for both enum values.
549-549: Assertion validates EXECUTE_REMEDIATION propagation.This confirms the test covers the
EXECUTE_REMEDIATIONstrategy path in the restored connectivity scenario.
493-493: Comprehensive assertion for DCGM connectivity failure event.The test verifies all expected fields including the new
processingStrategyfield, ensuring the complete event structure is validated.health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)
1-51: LGTM - Generated protobuf code.This is an auto-generated file from the protobuf compiler. The changes correctly reflect the addition of the
ProcessingStrategyenum and theprocessingStrategyfield (field number 16) to theHealthEventmessage, as indicated by the updated serialized descriptor and adjusted byte ranges.health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (3)
24-24: LGTM - Import alias consistent with existing usage.The alias
platformconnector_pb2forhealth_event_pb2maintains consistency with the existing import pattern inplatform_connector.py.
74-80: LGTM - CLI option well-defined.The new
--processing-strategyoption is appropriately configured with a sensible default (EXECUTE_REMEDIATION) and a clear help string describing the valid values.
28-51: LGTM - Parameter threading through event processor initialization.The
processing_strategyparameter is correctly added to_init_event_processorsignature and passed through toPlatformConnectorEventProcessor.tests/helpers/kube.go (4)
2208-2249: LGTM - DaemonSet rollout wait logic.The rollout status checks correctly verify that all desired pods are scheduled, updated, and ready before considering the rollout complete.
2251-2268: LGTM - Container argument update handles both formats.The helper correctly handles both
--processing-strategy=VALUEand--processing-strategy VALUEargument formats, and appends the argument if not present.
2312-2344: LGTM - RestoreDaemonSet correctly handles nil input.The function appropriately handles the case where
originalDaemonSetis nil by logging and returning early.
2346-2390: LGTM - GetDaemonSetPodOnWorkerNode with proper readiness checks.The function correctly verifies the pod is not being deleted, is in Running phase, and passes all readiness conditions before returning.
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (5)
41-66: LGTM - Processing strategy parameter added to constructor.The
processing_strategyparameter is correctly typed withplatformconnector_pb2.ProcessingStrategyand stored as an instance variable for use in event creation.
106-121: LGTM - Processing strategy propagated in connectivity restored events.The
processingStrategyfield is correctly included in theHealthEventfor DCGM connectivity restoration.
206-223: LGTM - Processing strategy propagated in health failure events.The
processingStrategyfield is correctly included when creating health events for GPU failures.
270-287: LGTM - Processing strategy propagated in health recovery events.The
processingStrategyfield is correctly included when creating health events indicating GPU recovery.
366-381: LGTM - Processing strategy propagated in connectivity failure events.The
processingStrategyfield is correctly included in theHealthEventfor DCGM connectivity failures.
Merging this branch will decrease overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
π§Ή Nitpick comments (5)
tests/helpers/event_exporter.go (1)
221-226: Add function documentation per coding guidelines.The exported function
ValidateCloudEventlacks a godoc comment. As per coding guidelines, "Function comments required for all exported Go functions." Please add documentation explaining the function's purpose and parameters.π Suggested documentation
+// ValidateCloudEvent validates that a CloudEvent has the expected structure and content. +// It checks the CloudEvent spec version, type, source, and validates the embedded HealthEvent +// fields including node name, message, check name, error code, and processing strategy. func ValidateCloudEvent( t *testing.T, event map[string]any, expectedNodeName, expectedMessage, expectedCheckName, expectedErrorCode string, expectedProcessingStrategy string, ) {Based on coding guidelines: "Function comments required for all exported Go functions"
tests/helpers/kube.go (4)
2257-2269: Unused variableoriginalDaemonSet.The
originalDaemonSetvariable is assigned on line 2268 but never used. This appears to be dead code, possibly leftover from a previous implementation that intended to restore the original state.π Proposed fix
err := retry.RetryOnConflict(retry.DefaultRetry, func() error { daemonSet := &appsv1.DaemonSet{} if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil { return err } - if originalDaemonSet == nil { - originalDaemonSet = daemonSet.DeepCopy() - } - containers := daemonSet.Spec.Template.Spec.Containers
2288-2289: Consider removing redundant sleep.The 10-second sleep appears redundant since
waitForDaemonSetRolloutalready waits until all pods are ready (NumberReady == DesiredNumberScheduled). If there's a specific edge case requiring this delay (e.g., waiting for readiness probes to stabilize), consider documenting it; otherwise, this could be removed.
2294-2327: Inconsistent error handling pattern.This function mixes two error handling approaches: it returns an
errorbut also callsrequire.NoErroron line 2319 which will fail the test immediately. This is inconsistent withUpdateDaemonSetArgswhich only returns errors.If the function returns an error, callers should handle it. Using
require.NoErrormakes the return value meaningless since the test fails before returning.π Proposed fix - Option A: Return error consistently (like UpdateDaemonSetArgs)
err := retry.RetryOnConflict(retry.DefaultRetry, func() error { daemonSet := &appsv1.DaemonSet{} if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil { return err } containers := daemonSet.Spec.Template.Spec.Containers for i := range containers { if containers[i].Name == containerName { removeArgsFromContainer(&containers[i], args) break } } return client.Resources().Update(ctx, daemonSet) }) - require.NoError(t, err, "failed to remove args from daemonset %s/%s", NVSentinelNamespace, daemonsetName) + if err != nil { + return fmt.Errorf("failed to remove args from daemonset %s/%s: %w", NVSentinelNamespace, daemonsetName, err) + }
2409-2453: ParameterdaemonsetNameis only used in error message, not for pod selection.The function signature accepts
daemonsetNamebut it's only used in the error message on line 2449. The actual pod selection usespodNamePatternviaGetPodOnWorkerNode. This could be misleading since the function name implies it validates the pod belongs to the specified DaemonSet.Consider either:
- Removing the
daemonsetNameparameter if pattern matching is sufficient- Adding validation that the pod is actually owned by the specified DaemonSet
π Option 1: Remove unused parameter
-func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client, - daemonsetName string, podNamePattern string) (*v1.Pod, error) { +func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client, + podNamePattern string) (*v1.Pod, error) { t.Helper() var resultPod *v1.Pod require.Eventually(t, func() bool { // Get the pod pod, err := GetPodOnWorkerNode(ctx, t, client, NVSentinelNamespace, podNamePattern) if err != nil { t.Logf("Failed to get pod: %v", err) return false } ... }, EventuallyWaitTimeout, WaitInterval, "daemonset pod from current rollout should be running and ready") if resultPod == nil { - return nil, fmt.Errorf("failed to get ready pod for daemonset %s", daemonsetName) + return nil, fmt.Errorf("failed to get ready pod matching pattern %s", podNamePattern) } return resultPod, nil }
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (3)
tests/gpu_health_monitor_test.gotests/helpers/event_exporter.gotests/helpers/kube.go
π§ Files skipped from review as they are similar to previous changes (1)
- tests/gpu_health_monitor_test.go
π§° Additional context used
π Path-based instructions (1)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/helpers/event_exporter.gotests/helpers/kube.go
π§ Learnings (1)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation
Applied to files:
tests/helpers/kube.go
π Additional comments (5)
tests/helpers/event_exporter.go (1)
245-245: LGTM! Validation follows the established pattern.The validation of
processingStrategycorrectly follows the same pattern as other field validations in this function, usingrequire.Equalconsistently.tests/helpers/kube.go (4)
2208-2249: LGTM!The
waitForDaemonSetRolloutfunction correctly polls the DaemonSet status and follows the same patterns as the existingWaitForDeploymentRolloutfunction. The rollout completion checks forDesiredNumberScheduled,UpdatedNumberScheduled, andNumberReadyare appropriate.
2329-2357: LGTM!The
tryUpdateExistingArghelper correctly handles both--flag=valueand--flag valueargument styles. The slice manipulation for inserting a value after a flag is correct.
2359-2381: LGTM!The function correctly sets container arguments, leveraging
tryUpdateExistingArgto handle existing args and appending new ones as needed.
2383-2407: LGTM!The function correctly handles removal of both
--flag=valueand--flag valuestyle arguments, appropriately breaking after modification to avoid issues with slice iteration.
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
8398875 to
232c636
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
π§Ή Nitpick comments (8)
tests/helpers/healthevent.go (1)
48-48: Consider using the protobuf enum type for type safety.The
ProcessingStrategyfield usesintinstead ofprotos.ProcessingStrategy. While this provides flexibility in tests, it loses type safety and may allow invalid values.Consider whether type safety is valuable here:
π Option to use the enum type
If you prefer compile-time type checking:
+import "github.com/nvidia/nvsentinel/data-models/pkg/protos" + type HealthEventTemplate struct { Version int `json:"version"` Agent string `json:"agent"` ComponentClass string `json:"componentClass,omitempty"` CheckName string `json:"checkName"` IsFatal bool `json:"isFatal"` IsHealthy bool `json:"isHealthy"` Message string `json:"message"` RecommendedAction int `json:"recommendedAction,omitempty"` ErrorCode []string `json:"errorCode,omitempty"` EntitiesImpacted []EntityImpacted `json:"entitiesImpacted,omitempty"` Metadata map[string]string `json:"metadata,omitempty"` QuarantineOverrides *QuarantineOverrides `json:"quarantineOverrides,omitempty"` NodeName string `json:"nodeName"` - ProcessingStrategy int `json:"processingStrategy,omitempty"` + ProcessingStrategy protos.ProcessingStrategy `json:"processingStrategy,omitempty"` }Then update the builder:
-func (h *HealthEventTemplate) WithProcessingStrategy(strategy int) *HealthEventTemplate { +func (h *HealthEventTemplate) WithProcessingStrategy(strategy protos.ProcessingStrategy) *HealthEventTemplate { h.ProcessingStrategy = strategy return h }tests/helpers/kube.go (1)
2312-2313: Consider if the 10-second sleep is necessary.After
waitForDaemonSetRolloutcompletes, all pods are confirmed updated and ready. The additional 10-second sleep may be unnecessary unless there's a specific stabilization requirement not covered by the readiness checks.If the sleep is for pod initialization beyond readiness, consider adding a comment explaining why. Otherwise, this delay might be removable:
t.Logf("Waiting for daemonset %s/%s rollout to complete", NVSentinelNamespace, daemonsetName) waitForDaemonSetRollout(ctx, t, client, daemonsetName) - t.Logf("Waiting 10 seconds for daemonset pods to start") - time.Sleep(10 * time.Second) - return nil }docs/postgresql-schema.sql (1)
106-109: Consider adding constraints for data integrity.The
processing_strategycolumn is nullable and has no constraints. Consider adding:
- A
CHECKconstraint to ensure only valid enum values are stored- A
NOT NULLconstraint with a default value if every health event should have a strategyπ Option to add constraints
If you want to enforce valid values at the database level:
-- Metadata created_at TIMESTAMPTZ DEFAULT NOW(), - updated_at TIMESTAMPTZ DEFAULT NOW(), + updated_at TIMESTAMPTZ DEFAULT NOW() NOT NULL, -- Event handling strategy - processing_strategy VARCHAR(50) + processing_strategy VARCHAR(50) NOT NULL DEFAULT 'EXECUTE_REMEDIATION' + CHECK (processing_strategy IN ('EXECUTE_REMEDIATION', 'STORE_ONLY')) );This prevents invalid values and ensures consistency, but reduces flexibility if new enum values are added later without a migration.
tests/gpu_health_monitor_test.go (1)
488-489: Use defined constants instead of hardcoded strings.Lines 488-489 use hardcoded strings
"gpu-health-monitor-dcgm-4.x"and"gpu-health-monitor"instead of the constantsGPUHealthMonitorDaemonSetNameandGPUHealthMonitorContainerNamedefined at lines 42-43.π Suggested fix
- err = helpers.RemoveDaemonSetArgs(ctx, t, client, "gpu-health-monitor-dcgm-4.x", "gpu-health-monitor", map[string]string{ + err = helpers.RemoveDaemonSetArgs(ctx, t, client, GPUHealthMonitorDaemonSetName, GPUHealthMonitorContainerName, map[string]string{ "--processing-strategy": "EXECUTE_REMEDIATION"})tests/platform-connector_test.go (2)
28-32: Remove unused struct fields.
ConfigMapBackupandTestNamespacefields are defined but never used in the test. Consider removing them to keep the code clean.π Suggested fix
type PlatformConnectorTestContext struct { NodeName string - ConfigMapBackup []byte - TestNamespace string }
98-101: Teardown sends healthy event but doesn't verify cleanup of STORE_ONLY events.The teardown only sends a healthy event. Consider verifying that any state from the STORE_ONLY test cases is properly cleaned up, or add a comment explaining why no cleanup verification is needed.
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (1)
125-132: Good error handling for invalid processing strategy.The validation correctly:
- Uses protobuf's
Value()method to validate and convert the string- Catches
ValueErrorfor invalid inputs- Logs all valid options to help users correct their configuration
- Exits with code 1 on invalid input
One minor note: Line 132 logs
processing_strategy_valuewhich is the integer enum value. Consider logging the string name for better readability.π Optional: Log the strategy name for better readability
- log.info(f"Event handling strategy configured to: {processing_strategy_value}") + log.info(f"Event handling strategy configured to: {processing_strategy}")health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (1)
125-135: Consider adding processingStrategy assertion to first test case.The
test_health_event_occurredtest at lines 125-135 verifies event properties but doesn't assert onprocessingStrategy. While other tests cover this, adding an assertion here would ensure complete coverage.π Suggested addition
if event.checkName == "GpuInforomWatch" and event.isHealthy == False: assert event.errorCode[0] == "DCGM_FR_CORRUPT_INFOROM" assert event.entitiesImpacted[0].entityValue == "0" assert event.recommendedAction == platformconnector_pb2.RecommendedAction.COMPONENT_RESET + assert event.processingStrategy == platformconnector_pb2.STORE_ONLY else:
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (17)
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yamldistros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yamldistros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yamldistros/kubernetes/nvsentinel/values-tilt-postgresql.yamldocs/postgresql-schema.sqlfault-quarantine/pkg/evaluator/rule_evaluator_test.gofault-quarantine/pkg/initializer/init.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/cli.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.pytests/event_exporter_test.gotests/fault_quarantine_test.gotests/gpu_health_monitor_test.gotests/helpers/event_exporter.gotests/helpers/healthevent.gotests/helpers/kube.gotests/platform-connector_test.go
π§ Files skipped from review as they are similar to previous changes (4)
- fault-quarantine/pkg/initializer/init.go
- tests/event_exporter_test.go
- distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
- distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
π§° Additional context used
π Path-based instructions (4)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/helpers/event_exporter.gofault-quarantine/pkg/evaluator/rule_evaluator_test.gotests/helpers/healthevent.gotests/gpu_health_monitor_test.gotests/helpers/kube.gotests/fault_quarantine_test.gotests/platform-connector_test.go
**/values.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
fault-quarantine/pkg/evaluator/rule_evaluator_test.gotests/gpu_health_monitor_test.gotests/fault_quarantine_test.gotests/platform-connector_test.go
**/*.py
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code
Files:
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/cli.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
π§ Learnings (8)
π Learning: 2025-12-23T10:34:13.121Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.
Applied to files:
tests/helpers/event_exporter.gofault-quarantine/pkg/evaluator/rule_evaluator_test.gotests/platform-connector_test.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
fault-quarantine/pkg/evaluator/rule_evaluator_test.gotests/gpu_health_monitor_test.gotests/fault_quarantine_test.gotests/platform-connector_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/gpu_health_monitor_test.gotests/fault_quarantine_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/gpu_health_monitor_test.gotests/fault_quarantine_test.gotests/platform-connector_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label
Applied to files:
tests/helpers/kube.go
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
tests/fault_quarantine_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/platform-connector_test.go
π Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.
Applied to files:
tests/platform-connector_test.go
𧬠Code graph analysis (2)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
tests/fault_quarantine_test.go (4)
tests/helpers/fault_quarantine.go (4)
QuarantineTestContext(51-54)SetupQuarantineTest(107-112)AssertQuarantineState(315-382)QuarantineAssertion(56-60)tests/helpers/kube.go (1)
SetNodeManagedByNVSentinel(1389-1408)tests/helpers/healthevent.go (3)
NewHealthEvent(60-76)SendHealthEvent(263-275)SendHealthyEvent(277-287)data-models/pkg/protos/health_event.pb.go (2)
ProcessingStrategy_STORE_ONLY(47-47)ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
π Additional comments (28)
tests/helpers/kube.go (5)
387-409: LGTM!The
EnsureNodeEventNotPresenthelper correctly usesrequire.Neverto assert that a specific event type and reason combination never appears on a node during the test window. The implementation properly queries events and checks both type and reason.
2232-2273: LGTM!The
waitForDaemonSetRolloutfunction correctly implements DaemonSet rollout verification by checking that all desired pods are scheduled, updated, and ready. The logic matcheskubectl rollout statusbehavior and includes helpful progress logging.
2318-2351: LGTM!The
RemoveDaemonSetArgsfunction correctly removes specified arguments from a DaemonSet container and waits for the rollout to complete. Usesretry.RetryOnConflictappropriately without error wrapping.
2353-2431: LGTM!The argument manipulation helpers (
tryUpdateExistingArg,setArgsOnContainer,removeArgsFromContainer) correctly handle various command-line argument formats:
--flag=valuestyle--flag valuestyle (separate entries)--flagstyle (boolean flags)The logic properly preserves argument order and handles edge cases like updating existing args vs adding new ones.
2433-2477: LGTM!The
GetDaemonSetPodOnWorkerNodehelper correctly retrieves a ready DaemonSet pod with proper validation:
- Verifies pod is not being deleted (
DeletionTimestamp == nil)- Confirms pod is in
Runningphase- Checks pod readiness conditions
- Uses
require.Eventuallyfor reliable pollingThis defensive checking improves test stability by ensuring the pod from the current rollout is fully operational.
fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)
263-263: LGTM!The test correctly updates the expected map to include the new
processingStrategyfield with valuefloat64(0), matching the default enum valueProcessingStrategy_EXECUTE_REMEDIATION. Usingfloat64is correct for JSON unmarshaling behavior.distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml (1)
219-222: LGTM!The schema addition matches the canonical source in
docs/postgresql-schema.sql. The file header correctly documents the sync process usingmake update-helm-postgres-schemaandmake validate-postgres-schema.tests/helpers/event_exporter.go (2)
245-245: LGTM!The assertion correctly validates that the
processingStrategyfield in the CloudEvent data matches the expected value. The comparison works properly since the JSON-unmarshaled value will be a string.
221-226: All callers ofValidateCloudEventhave been properly updated. The function call attests/event_exporter_test.go:85correctly passes all 7 required parameters, including the newexpectedProcessingStrategyparameter ("EXECUTE_REMEDIATION"). No outdated calls remain in the codebase.tests/fault_quarantine_test.go (4)
233-250: LGTM!The test setup correctly:
- Uses the existing
SetupQuarantineTesthelper- Sets the node as managed by NVSentinel (required for quarantine logic)
- Follows the established pattern from other tests in this file
252-268: LGTM!The STORE_ONLY assessment correctly validates that events with
ProcessingStrategy_STORE_ONLY:
- Do NOT cause the node to be cordoned
- Do NOT add quarantine annotations
This properly tests the observability-only behavior where events are stored but don't modify cluster state.
270-286: LGTM!The EXECUTE_REMEDIATION assessment correctly validates that events with
ProcessingStrategy_EXECUTE_REMEDIATION:
- DO cause the node to be cordoned
- DO add quarantine annotations
This properly tests the normal remediation behavior where the system takes corrective actions.
288-295: LGTM!The teardown correctly:
- Sends a healthy event to clear the quarantine state from the EXECUTE_REMEDIATION assessment
- Uses
TeardownQuarantineTestto restore the original configuration and clean up test resourcesThis ensures proper test isolation and cleanup.
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml (1)
57-62: LGTM! Clear documentation and sensible default.The
processingStrategyconfiguration option is well-documented with:
- Clear valid values:
EXECUTE_REMEDIATION,STORE_ONLY- Sensible default:
EXECUTE_REMEDIATION(maintains backward compatibility)- Clear behavior description for each mode
Both DaemonSet templates (
daemonset-dcgm-3.x.yamlanddaemonset-dcgm-4.x.yaml) correctly reference this value using{{ .Values.processingStrategy | quote }}.tests/gpu_health_monitor_test.go (3)
34-48: LGTM on constants organization.The constants are well-organized with exported names for reuse. The separation of DCGM-related constants from GPU health monitor constants improves readability.
413-462: Test setup and error injection logic is well-structured.The test correctly:
- Configures the DaemonSet with
STORE_ONLYstrategy- Waits for the pod to be ready
- Injects test metadata and sets the node label
- Injects a DCGM Inforom error to trigger the health monitor
The flow aligns with the PR objective of verifying STORE_ONLY events are stored without triggering remediation.
464-480: LGTM on assess phase.The assertions correctly verify that:
- Node conditions are not applied when using STORE_ONLY strategy
- Node is not cordoned
This validates the expected behavior of the STORE_ONLY processing strategy.
tests/platform-connector_test.go (1)
53-96: Test thoroughly covers both processing strategies.The assess phase correctly validates:
- STORE_ONLY events don't apply node conditions or emit events
- EXECUTE_REMEDIATION events do apply conditions and events
The test uses both fatal (ERRORCODE_79) and non-fatal (ERRORCODE_31) error codes to cover different scenarios.
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (2)
74-80: CLI option for processing strategy is well-designed.The option:
- Has a sensible default (
EXECUTE_REMEDIATION) for backward compatibility- Provides clear help text describing valid values
- Is marked as optional
37-51: Function signature updated correctly with type hint.The
processing_strategyparameter is properly typed withplatformconnector_pb2.ProcessingStrategyand passed through to thePlatformConnectorEventProcessorconstructor. As per coding guidelines, type hints are required for all functions in Python code.health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (3)
107-108: LGTM on test update for processing_strategy parameter.The test correctly passes
platformconnector_pb2.STORE_ONLYas the processing strategy to the event processor constructor.
301-302: Good assertion on processingStrategy propagation.The test verifies that the
processingStrategyfield on the emittedHealthEventmatches the strategy configured in the processor (STORE_ONLY).
523-549: Test correctly verifies EXECUTE_REMEDIATION strategy propagation.This test case uses
EXECUTE_REMEDIATIONstrategy and verifies the restored event has the correctprocessingStrategyfield. Good coverage of both strategy values across different test cases.health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (5)
51-66: Processing strategy parameter and storage are correctly implemented.The constructor:
- Accepts the typed
processing_strategyparameter- Stores it as
self._processing_strategyfollowing Python naming conventions for protected attributesAs per coding guidelines, type hints are required for all functions in Python code, which is satisfied here.
106-121: processingStrategy correctly added to connectivity restored event.The
clear_dcgm_connectivity_failuremethod properly includesprocessingStrategy=self._processing_strategyin the HealthEvent message.
206-223: processingStrategy correctly added to health event for entity failures.The HealthEvent created when entity failures are detected properly includes the processing strategy.
270-287: processingStrategy correctly added to healthy status events.The HealthEvent created for healthy (PASS) status properly includes the processing strategy.
366-381: processingStrategy correctly added to DCGM connectivity failure event.The
dcgm_connectivity_failedmethod properly includesprocessingStrategy=self._processing_strategyin the HealthEvent message. All four HealthEvent creation sites now consistently include the processing strategy.
| t.Logf("Node %s should have condition SysLogsXIDError", testCtx.NodeName) | ||
| helpers.CheckNodeConditionExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy") | ||
|
|
||
| event = helpers.NewHealthEvent(testCtx.NodeName). | ||
| WithErrorCode(helpers.ERRORCODE_31). | ||
| WithMessage("XID error occurred"). | ||
| WithFatal(false). | ||
| WithProcessingStrategy(int(protos.ProcessingStrategy_EXECUTE_REMEDIATION)) | ||
| helpers.SendHealthEvent(ctx, t, event) | ||
|
|
||
| t.Logf("Node %s should have event SysLogsXIDError", testCtx.NodeName) | ||
| helpers.CheckNodeEventExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
π§© Analysis chain
π Script executed:
#!/bin/bash
# Locate and examine the helper functions mentioned in the review
find . -name "*.go" -path "*/helpers/*" | head -20Repository: NVIDIA/NVSentinel
Length of output: 478
π Script executed:
#!/bin/bash
# Search for CheckNodeConditionExists and CheckNodeEventExists definitions
rg -n 'func CheckNodeConditionExists|func CheckNodeEventExists' --type goRepository: NVIDIA/NVSentinel
Length of output: 172
π Script executed:
#!/bin/bash
# Examine the CheckNodeConditionExists function signature and implementation
sed -n '1283,1323p' tests/helpers/kube.goRepository: NVIDIA/NVSentinel
Length of output: 1599
π Script executed:
#!/bin/bash
# Examine the CheckNodeEventExists function signature and implementation
sed -n '1325,1365p' tests/helpers/kube.goRepository: NVIDIA/NVSentinel
Length of output: 1167
Add assertions for helper function return values at lines 83 and 93.
CheckNodeConditionExists returns (*v1.NodeCondition, error) and CheckNodeEventExists returns (bool, *v1.Event). The test calls these helpers but ignores their return values, so it never verifies that the condition or event actually exists. Capture and assert the return values using standard library assertions (e.g., if err != nil { t.Errorf(...) } or if !exists { t.Errorf(...) }).
π€ Prompt for AI Agents
In tests/platform-connector_test.go around lines 82 to 93, the calls to
CheckNodeConditionExists and CheckNodeEventExists ignore their return values;
capture and assert them: assign the returned (*v1.NodeCondition, error) from
CheckNodeConditionExists, check err is nil and the condition is non-nil (fail
the test with t.Fatalf or t.Errorf if not), and assign the returned (bool,
*v1.Event) from CheckNodeEventExists, assert the bool is true and the event
pointer is non-nil (fail the test if not); use standard library assertions (if
err != nil { t.Fatalf(...) } / if !exists { t.Fatalf(...) } / if event == nil {
t.Fatalf(...) }) so the test actually validates existence.
Summary
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Behavioral Changes
Tests
βοΈ Tip: You can customize this high-level summary in your review settings.