-
Notifications
You must be signed in to change notification settings - Fork 33
feat: add event handling strategy changes in health-events-analyzer #610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: add event handling strategy changes in health-events-analyzer #610
Conversation
Signed-off-by: Tanisha goyal <[email protected]>
Signed-off-by: Tanisha goyal <[email protected]>
WalkthroughA new ProcessingStrategy enum (EXECUTE_REMEDIATION, STORE_ONLY) and a processingStrategy field on HealthEvent are added and propagated through pipelines, publisher, platform connector, exporters, deployment config, and tests to control whether events trigger remediation or are stored/observed only. Changes
Sequence Diagram(s)mermaid Estimated code review effortπ― 3 (Moderate) | β±οΈ ~25 minutes Areas requiring extra attention:
Poem
Pre-merge checks and finishing touchesβ Failed checks (1 warning)
β Passed checks (2 passed)
β¨ Finishing touches
π§ͺ Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. π§ golangci-lint (2.5.0)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
π§Ή Nitpick comments (7)
fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)
263-263: LGTM! The test correctly reflects the newprocessingStrategyfield.The addition of
"processingStrategy": float64(0)properly validates that the RoundTrip function now includes the new field with its default value (EXECUTE_REMEDIATION = 0).Optional: Consider adding test coverage for non-default values
While the current test appropriately validates the default behavior, you might consider adding a separate test case that explicitly sets and verifies the
STORE_ONLY(value 1) processingStrategy to ensure complete coverage of the enum values. This would be a nice-to-have enhancement but is not critical for this test's scope.Example:
func TestRoundTrip_WithCustomProcessingStrategy(t *testing.T) { eventTime := timestamppb.New(time.Now()) event := &protos.HealthEvent{ Version: 1, Agent: "test-agent", ProcessingStrategy: protos.ProcessingStrategy_STORE_ONLY, // ... other fields } result, err := RoundTrip(event) if err != nil { t.Fatalf("Failed to roundtrip event: %v", err) } if result["processingStrategy"] != float64(1) { t.Errorf("Expected processingStrategy to be 1, got %v", result["processingStrategy"]) } }tests/helpers/kube.go (2)
2341-2365: Missing container name validation for consistency withSetDeploymentArgs.Unlike
SetDeploymentArgswhich returns an error when a specificcontainerNameis requested but not found,RemoveDeploymentArgssilently proceeds. This inconsistency could mask configuration errors in tests.π Proposed fix
func RemoveDeploymentArgs( ctx context.Context, c klient.Client, deploymentName, namespace, containerName string, args map[string]string, ) error { return retry.RetryOnConflict(retry.DefaultRetry, func() error { deployment := &appsv1.Deployment{} if err := c.Resources().Get(ctx, deploymentName, namespace, deployment); err != nil { return err } if len(deployment.Spec.Template.Spec.Containers) == 0 { return fmt.Errorf("deployment %s/%s has no containers", namespace, deploymentName) } + found := false + for i := range deployment.Spec.Template.Spec.Containers { container := &deployment.Spec.Template.Spec.Containers[i] if containerName != "" && container.Name != containerName { continue } + found = true + removeArgsFromContainer(container, args) } + if containerName != "" && !found { + return fmt.Errorf("container %q not found in deployment %s/%s", containerName, namespace, deploymentName) + } + return c.Resources().Update(ctx, deployment) }) }
2314-2320: Slice insertion logic is correct but could be clearer.The slice manipulation on line 2319 inserts a value after the flag. While functionally correct, the nested
appendpattern is non-obvious.π Consider using slices.Insert for clarity (Go 1.21+)
- // Insert the value after the flag - container.Args = append(container.Args[:j+1], append([]string{value}, container.Args[j+1:]...)...) + // Insert the value after the flag + container.Args = slices.Insert(container.Args, j+1, value)This would require adding
"slices"to imports. If staying on older Go versions, the current approach works fine.platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
372-416: Consider simplifying:nodeConditionsslice is built but not used.The
nodeConditionsslice (lines 375-386) is constructed only to checklen(nodeConditions) > 0, but the actual node conditions are recalculated insideupdateNodeConditionsfromprocessableEvents. This is redundant work.π Simplified approach
func (r *K8sConnector) processHealthEvents(ctx context.Context, healthEvents *protos.HealthEvents) error { processableEvents := filterProcessableEvents(healthEvents) - var nodeConditions []corev1.NodeCondition - - for _, healthEvent := range processableEvents { - if healthEvent.IsHealthy || healthEvent.IsFatal { - nodeConditions = append(nodeConditions, corev1.NodeCondition{ - Type: corev1.NodeConditionType(healthEvent.CheckName), - LastHeartbeatTime: metav1.NewTime(healthEvent.GeneratedTimestamp.AsTime()), - LastTransitionTime: metav1.NewTime(healthEvent.GeneratedTimestamp.AsTime()), - Message: r.fetchHealthEventMessage(healthEvent), - }) - } - } + // Check if any events require node condition updates (IsHealthy or IsFatal) + hasConditionEvents := false + for _, healthEvent := range processableEvents { + if healthEvent.IsHealthy || healthEvent.IsFatal { + hasConditionEvents = true + break + } + } - if len(nodeConditions) > 0 { + if hasConditionEvents { start := time.Now() err := r.updateNodeConditions(ctx, processableEvents) // ... rest unchangedstore-client/pkg/client/mongodb_pipeline_builder.go (1)
102-115: Minor style inconsistency in operationType matching.
BuildProcessableHealthEventInsertsPipeline(line 93-95) uses$inwith an array foroperationType, while this function uses a direct string match (line 108). Both are functionally equivalent for a single value, but using consistent patterns improves maintainability.π Suggested fix for consistency
func (b *MongoDBPipelineBuilder) BuildProcessableNonFatalUnhealthyInsertsPipeline() datastore.Pipeline { return datastore.ToPipeline( datastore.D( datastore.E("$match", datastore.D( - datastore.E("operationType", "insert"), + datastore.E("operationType", datastore.D( + datastore.E("$in", datastore.A("insert")), + )), datastore.E("fullDocument.healthevent.agent", datastore.D(datastore.E("$ne", "health-events-analyzer"))), datastore.E("fullDocument.healthevent.ishealthy", false), datastore.E("fullDocument.healthevent.processingstrategy", int32(protos.ProcessingStrategy_EXECUTE_REMEDIATION)), )), ), ) }health-events-analyzer/pkg/publisher/publisher.go (1)
102-104: Consider documenting the newruleparameter.The function signature now accepts an optional
ruleparameter for per-ruleProcessingStrategyoverrides. Adding a brief godoc comment explaining when to passnilvs a rule reference would improve API clarity. As per coding guidelines, function comments are required for exported Go functions.π Suggested documentation
+// Publish sends a health event to the platform connector with the specified recommended action. +// The event's ProcessingStrategy defaults to the publisher's configured strategy, but can be +// overridden by providing a non-nil rule with a ProcessingStrategy field set. Pass nil for +// rule when no rule-level override is needed (e.g., XID burst detection). func (p *PublisherConfig) Publish(ctx context.Context, event *protos.HealthEvent, recommendedAction protos.RecommendedAction, ruleName string, message string, rule *config.HealthEventsAnalyzerRule) error {health-events-analyzer/main.go (1)
133-140: Consider enhancing the validation error message.The flag validation logic is correct, but the error message could be more helpful by listing the valid values.
π Suggested improvement
value, ok := protos.ProcessingStrategy_value[*processingStrategyFlag] if !ok { - return fmt.Errorf("unexpected processingStrategy value: %q", *processingStrategyFlag) + return fmt.Errorf("unexpected processingStrategy value: %q, must be one of: EXECUTE_REMEDIATION, STORE_ONLY", *processingStrategyFlag) }Alternatively, you could iterate over the map keys to generate the list dynamically, though that may be overkill for a two-value enum.
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (1)
data-models/pkg/protos/health_event.pb.gois excluded by!**/*.pb.go
π Files selected for processing (27)
data-models/protobufs/health_event.protodistros/kubernetes/nvsentinel/charts/health-events-analyzer/templates/deployment.yamldistros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yamlevent-exporter/pkg/transformer/cloudevents.goevent-exporter/pkg/transformer/cloudevents_test.gofault-quarantine/pkg/evaluator/rule_evaluator_test.gofault-quarantine/pkg/initializer/init.gohealth-events-analyzer/main.gohealth-events-analyzer/pkg/config/rules.gohealth-events-analyzer/pkg/publisher/publisher.gohealth-events-analyzer/pkg/reconciler/reconciler.gohealth-events-analyzer/pkg/reconciler/reconciler_test.gohealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyhealth-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyiplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.gostore-client/pkg/client/mongodb_client.gostore-client/pkg/client/mongodb_pipeline_builder.gostore-client/pkg/client/pipeline_builder.gostore-client/pkg/client/pipeline_builder_test.gostore-client/pkg/client/postgresql_pipeline_builder.gostore-client/pkg/datastore/providers/postgresql/sql_filter_builder_test.gotests/event_exporter_test.gotests/health_events_analyzer_test.gotests/helpers/event_exporter.gotests/helpers/healthevent.gotests/helpers/kube.go
π§° Additional context used
π Path-based instructions (5)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
health-events-analyzer/pkg/config/rules.gotests/helpers/healthevent.gofault-quarantine/pkg/initializer/init.gotests/event_exporter_test.gostore-client/pkg/client/mongodb_client.gostore-client/pkg/client/pipeline_builder.gofault-quarantine/pkg/evaluator/rule_evaluator_test.goevent-exporter/pkg/transformer/cloudevents_test.gotests/health_events_analyzer_test.gostore-client/pkg/client/postgresql_pipeline_builder.gostore-client/pkg/datastore/providers/postgresql/sql_filter_builder_test.goplatform-connectors/pkg/connectors/kubernetes/process_node_events.gostore-client/pkg/client/mongodb_pipeline_builder.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.goevent-exporter/pkg/transformer/cloudevents.gohealth-events-analyzer/pkg/reconciler/reconciler_test.gotests/helpers/event_exporter.gohealth-events-analyzer/pkg/publisher/publisher.gotests/helpers/kube.gohealth-events-analyzer/pkg/reconciler/reconciler.gohealth-events-analyzer/main.gostore-client/pkg/client/pipeline_builder_test.go
**/values.yaml
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/values.yaml: Document all values in Helm chartvalues.yamlwith inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable
Files:
distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/event_exporter_test.gofault-quarantine/pkg/evaluator/rule_evaluator_test.goevent-exporter/pkg/transformer/cloudevents_test.gotests/health_events_analyzer_test.gostore-client/pkg/datastore/providers/postgresql/sql_filter_builder_test.goplatform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.gohealth-events-analyzer/pkg/reconciler/reconciler_test.gostore-client/pkg/client/pipeline_builder_test.go
data-models/protobufs/**/*.proto
π CodeRabbit inference engine (.github/copilot-instructions.md)
data-models/protobufs/**/*.proto: Define Protocol Buffer messages indata-models/protobufs/directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages
Files:
data-models/protobufs/health_event.proto
**/*.py
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code
Files:
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
π§ Learnings (8)
π Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.
Applied to files:
distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yamltests/health_events_analyzer_test.gostore-client/pkg/client/postgresql_pipeline_builder.gostore-client/pkg/datastore/providers/postgresql/sql_filter_builder_test.gostore-client/pkg/client/mongodb_pipeline_builder.gohealth-events-analyzer/pkg/reconciler/reconciler.gohealth-events-analyzer/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
tests/event_exporter_test.gotests/helpers/event_exporter.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
tests/event_exporter_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
tests/event_exporter_test.gostore-client/pkg/client/pipeline_builder_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `testify/assert` and `testify/require` for assertions in Go tests
Applied to files:
tests/event_exporter_test.gostore-client/pkg/datastore/providers/postgresql/sql_filter_builder_test.go
π Learning: 2025-12-12T07:41:34.094Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:34.094Z
Learning: The XID74Reg2Bit13Set rule in tests/data/health-events-analyzer-config.yaml intentionally omits the time window filter because it only validates the register bit pattern (bit 13 in REG2) on the received XID 74 event itself, without needing to check historical events or count repeated occurrences.
Applied to files:
tests/health_events_analyzer_test.go
π Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.
Applied to files:
health-events-analyzer/pkg/publisher/publisher.gohealth-events-analyzer/main.go
π Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
health-events-analyzer/pkg/reconciler/reconciler.go
𧬠Code graph analysis (16)
health-events-analyzer/pkg/config/rules.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
tests/event_exporter_test.go (1)
tests/helpers/event_exporter.go (1)
ValidateCloudEvent(257-283)
store-client/pkg/client/mongodb_client.go (1)
store-client/pkg/client/pipeline_builder.go (1)
GetPipelineBuilder(52-65)
store-client/pkg/client/pipeline_builder.go (1)
store-client/pkg/client/mongodb_client.go (1)
BuildProcessableNonFatalUnhealthyInsertsPipeline(294-297)
event-exporter/pkg/transformer/cloudevents_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (5)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)ProcessingStrategy_STORE_ONLY(47-47)
store-client/pkg/datastore/providers/postgresql/sql_filter_builder_test.go (2)
store-client/pkg/datastore/types.go (4)
ToPipeline(161-163)D(131-133)E(126-128)A(136-138)data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
store-client/pkg/client/mongodb_pipeline_builder.go (3)
store-client/pkg/datastore/types.go (4)
ToPipeline(161-163)D(131-133)E(126-128)A(136-138)data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)store-client/pkg/client/mongodb_client.go (1)
BuildProcessableNonFatalUnhealthyInsertsPipeline(294-297)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
data-models/pkg/protos/health_event.pb.go (19)
HealthEvent(260-280)HealthEvent(293-293)HealthEvent(308-310)Entity(208-214)Entity(227-227)Entity(242-244)RecommendedAction(89-89)RecommendedAction(139-141)RecommendedAction(143-145)RecommendedAction(152-154)ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)ProcessingStrategy_STORE_ONLY(47-47)ProcessingStrategy_EXECUTE_REMEDIATION(46-46)HealthEvents(156-162)HealthEvents(175-175)HealthEvents(190-192)
event-exporter/pkg/transformer/cloudevents.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
ProcessingStrategy(14-17)data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
health-events-analyzer/pkg/reconciler/reconciler_test.go (2)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
Publisher(36-38)data-models/pkg/protos/health_event.pb.go (1)
ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
health-events-analyzer/pkg/publisher/publisher.go (2)
data-models/pkg/protos/health_event.pb.go (12)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)HealthEvent(260-280)HealthEvent(293-293)HealthEvent(308-310)RecommendedAction(89-89)RecommendedAction(139-141)RecommendedAction(143-145)RecommendedAction(152-154)ProcessingStrategy_value(56-59)health-events-analyzer/pkg/config/rules.go (1)
HealthEventsAnalyzerRule(23-32)
health-events-analyzer/pkg/reconciler/reconciler.go (2)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
Publisher(36-38)data-models/pkg/protos/health_event.pb.go (8)
HealthEvent(260-280)HealthEvent(293-293)HealthEvent(308-310)RecommendedAction(89-89)RecommendedAction(139-141)RecommendedAction(143-145)RecommendedAction(152-154)ProcessingStrategy_EXECUTE_REMEDIATION(46-46)
health-events-analyzer/main.go (2)
store-client/pkg/client/mongodb_client.go (1)
BuildProcessableNonFatalUnhealthyInsertsPipeline(294-297)health-events-analyzer/pkg/publisher/publisher.go (2)
PublisherConfig(37-40)NewPublisher(94-100)
store-client/pkg/client/pipeline_builder_test.go (4)
store-client/pkg/client/pipeline_builder.go (1)
PipelineBuilder(26-48)store-client/pkg/client/mongodb_pipeline_builder.go (1)
NewMongoDBPipelineBuilder(29-31)store-client/pkg/client/postgresql_pipeline_builder.go (1)
NewPostgreSQLPipelineBuilder(29-31)store-client/pkg/client/mongodb_client.go (1)
BuildProcessableNonFatalUnhealthyInsertsPipeline(294-297)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
data-models/pkg/protos/health_event.pb.go (4)
ProcessingStrategy(43-43)ProcessingStrategy(72-74)ProcessingStrategy(76-78)ProcessingStrategy(85-87)
π Additional comments (40)
fault-quarantine/pkg/initializer/init.go (1)
66-66: Verify the pipeline filtering behavior and backward compatibility.The change from
BuildAllHealthEventInsertsPipeline()toBuildProcessableHealthEventInsertsPipeline()correctly alters which health events trigger remediation actions in fault-quarantine. The new method filters for events withprocessingStrategy=EXECUTE_REMEDIATION, excluding observability-onlySTORE_ONLYevents. Existing events without the strategy field default toEXECUTE_REMEDIATIONfor backward compatibility.data-models/protobufs/health_event.proto (2)
32-38: Well-designed enum with correct default value.Using
EXECUTE_REMEDIATION = 0as the default ensures backward compatibilityβexisting events without this field will behave as before (executing remediation). The comments are comprehensive per the coding guidelines requirement for Protocol Buffer messages.
77-77: Field addition looks correct.Field number 16 follows the sequential pattern from the previous field (drainOverrides = 15). This is a backward-compatible, additive change.
event-exporter/pkg/transformer/cloudevents_test.go (2)
69-69: Good test coverage for the new field.Using
STORE_ONLY(non-default value) in the test is the right approachβit verifies the field is explicitly serialized rather than relying on default behavior.
106-108: Assertion correctly validates string serialization.The test verifies that
ProcessingStrategyis serialized as its string representation ("STORE_ONLY") in the CloudEvent payload, which aligns with the enum's.String()method usage in the transformer.distros/kubernetes/nvsentinel/charts/health-events-analyzer/templates/deployment.yaml (1)
77-77: CLI argument follows existing patterns.The new argument is correctly templated and positioned with other configuration flags. Since
processingStrategyhas a default value invalues.yaml, this will always render with a valid value.tests/helpers/kube.go (1)
2208-2249: Well-structured DaemonSet rollout helper.The implementation correctly checks all three conditions for a complete rollout: desired pods scheduled, pods updated, and pods ready. Good logging for debugging test failures.
distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml (1)
34-39: Configuration is well-documented with valid values and behavior descriptions.The inline comments cover valid values, default behavior, and detailed explanations for each strategy, following the Helm chart documentation guidelines.
Consider verifying that the application validates the
processingStrategyvalue at startup and fails fast with a clear error message if an invalid value is provided:#!/bin/bash # Description: Check if processingStrategy is validated in the main.go or flag parsing code # Search for processingStrategy flag definition and validation rg -n "processing-strategy" --type=go -A5 -B2 # Search for any enum validation or parsing logic ast-grep --pattern $'func $_(strategy string) $_ { $$$ }'event-exporter/pkg/transformer/cloudevents.go (1)
66-66: LGTM! ProcessingStrategy field properly added to CloudEvent payload.The new field is correctly populated using
event.ProcessingStrategy.String()and follows the same pattern as other fields likerecommendedAction. The protobuf enum's String() method will handle the zero-value case appropriately by returning the enum name.tests/event_exporter_test.go (1)
25-26: LGTM! Test updated to validate processingStrategy in CloudEvents.The import addition and updated
ValidateCloudEventcall correctly validate that health events default toEXECUTE_REMEDIATIONstrategy in the CloudEvent payload. The hard-coded string matches the expected default behavior.Based on learnings: Write table-driven tests when testing multiple scenarios in Go.
Also applies to: 85-85
tests/helpers/healthevent.go (1)
48-48: LGTM! ProcessingStrategy support properly added to test helper.The field and builder method follow the existing pattern. Using
inttype is correct for protobuf enums, and theomitemptytag ensures the field defaults toEXECUTE_REMEDIATION(value 0) when not explicitly set.Also applies to: 153-156
store-client/pkg/client/mongodb_client.go (1)
289-297: LGTM! Function renamed to reflect processing strategy filtering.The rename from
BuildNonFatalUnhealthyInsertsPipelinetoBuildProcessableNonFatalUnhealthyInsertsPipelinebetter communicates that the pipeline filters for events withprocessingStrategy=EXECUTE_REMEDIATION. The function is already marked as deprecated, so the breaking change is acceptable for users who should migrate to the newPipelineBuilderinterface.health-events-analyzer/pkg/reconciler/reconciler_test.go (1)
288-288: LGTM! All test publisher instantiations updated with processing strategy.All
NewPublishercalls consistently specifyprotos.ProcessingStrategy_EXECUTE_REMEDIATION, aligning with the updated constructor signature. This ensures tests validate the default remediation behavior.Based on learnings: Use
testify/assertandtestify/requirefor assertions in Go tests.Also applies to: 333-333, 378-378, 403-403, 429-429
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
1391-1589: Excellent test coverage for ProcessingStrategy behavior!This comprehensive test validates the critical distinction between
STORE_ONLY(observability-only) andEXECUTE_REMEDIATION(creates node conditions/events) strategies. Key strengths:
- Clear table-driven test structure with descriptive names
- Properly excludes standard Kubernetes conditions when counting (NodeReady, MemoryPressure, etc.)
- Tests all scenarios: STORE_ONLY fatal/non-fatal, EXECUTE_REMEDIATION, and mixed strategies
- Uses
requirefor setup andassertfor validation per best practicesBased on learnings: Write table-driven tests when testing multiple scenarios in Go, and use
testify/assertandtestify/requirefor assertions in Go tests.health-events-analyzer/pkg/config/rules.go (1)
30-31: LGTM! Optional per-rule processing strategy override properly defined.The field is well-documented and follows TOML conventions. The consuming code in
health-events-analyzer/pkg/publisher/publisher.goproperly validates the string value against theprotos.ProcessingStrategyenum map (lines 116-122) and returns an error for invalid values, then converts to the protobuf enum type.health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)
1-51: Auto-generated protobuf file β no manual review needed.This file is generated by the protocol buffer compiler (as stated on line 2). The offset adjustments for
_PROCESSINGSTRATEGY,_RECOMMENDEDACTION,_HEALTHEVENT, and related descriptors are consistent with the addition of the newProcessingStrategyenum in the source.protofile.platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
325-343: LGTM β filtering logic is clean and well-documented.The function correctly filters out
STORE_ONLYevents and provides informational logging for skipped events, which aids in debugging and observability.tests/helpers/event_exporter.go (2)
220-254: LGTM β helper function follows established patterns.The function correctly mirrors the structure of
FindEventByNodeAndMessageand properly handles type assertions with early continue on failures.
256-283: LGTM β signature extended to validate processingStrategy.The new
expectedProcessingStrategyparameter and corresponding assertion integrate well with the existing validation logic.store-client/pkg/client/pipeline_builder.go (1)
35-43: LGTM β interface extended with well-documented pipeline builders.The new methods clearly express the filtering intent via their names and godoc comments:
BuildProcessableHealthEventInsertsPipelinefor events withprocessingStrategy=EXECUTE_REMEDIATIONBuildProcessableNonFatalUnhealthyInsertsPipelinefor the health-events-analyzer pattern detectionThe naming convention
Processable*effectively communicates that STORE_ONLY events are excluded.tests/health_events_analyzer_test.go (2)
1535-1595: LGTM β test setup correctly configures STORE_ONLY strategy.The setup properly:
- Configures the deployment with
--processing-strategy=STORE_ONLY- Waits for rollout before proceeding
- Injects two bursts of XID 120 errors with appropriate gap timing
1597-1621: Verify empty message expectation in CloudEvent validation.Line 1618 passes an empty string for
expectedMessage:helpers.ValidateCloudEvent(t, receivedEvent, testNodeName, "", "RepeatedXIDErrorOnSameGPU", "120", "STORE_ONLY")Looking at
ValidateCloudEvent(line 280 in helpers), it performsrequire.Equal(t, expectedMessage, healthEvent["message"]). If the actual health event has a non-empty message (which XID error events typically do), this assertion will fail.Please confirm whether the
RepeatedXIDErrorOnSameGPUhealth event is expected to have an empty message, or if this should be updated to match the actual expected message content.store-client/pkg/client/mongodb_pipeline_builder.go (2)
17-21: LGTM!The import addition for
protosis necessary to referenceProcessingStrategy_EXECUTE_REMEDIATIONin the new pipeline methods.
87-100: LGTM!The new pipeline correctly filters for insert operations with
processingStrategy = EXECUTE_REMEDIATION. Theint32cast is appropriate since protobuf enum values are stored as integers in MongoDB documents.health-events-analyzer/pkg/reconciler/reconciler.go (3)
292-293: LGTM!Correctly passes the rule reference to
Publisher.Publish, enabling per-ruleProcessingStrategyoverrides as defined in the rule configuration.
399-408: LGTM!The pipeline correctly filters events to only process those with
processingStrategy = EXECUTE_REMEDIATION, preventingSTORE_ONLYevents from triggering remediation rules. This aligns with the existing pattern of auto-adding critical filters ingetPipelineStages. Based on learnings, this is the correct location for mandatory filters.
471-472: LGTM!Passing
nilfor the rule parameter is appropriate here since XID burst detection is not driven by a configured rule. The publisher will correctly use the module-level defaultProcessingStrategy.store-client/pkg/client/postgresql_pipeline_builder.go (2)
119-132: LGTM!The PostgreSQL implementation correctly mirrors the MongoDB builder, filtering for insert operations with
processingStrategy = EXECUTE_REMEDIATION.
134-147: LGTM!The function correctly includes both
insertandupdateoperation types, which is appropriate for PostgreSQL's trigger-based change detection. This defensive approach aligns with the patterns established in other PostgreSQL pipeline methods in this file (e.g.,BuildNodeQuarantineStatusPipeline).health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (2)
14-17: LGTM!The
ProcessingStrategyenum and module-level constants are correctly defined, matching the protobuf schema changes.Also applies to: 31-32
78-78: LGTM!The
processingStrategyfield is correctly added to theHealthEventclass with proper protobuf stub generation including the slot declaration, field number constant, type annotation, and__init__parameter. The type hint_Optional[_Union[ProcessingStrategy, str]]appropriately allows both enum values and string representations.This
.pyistub file is properly generated from the proto source via the automated build pipeline. The generation is performed using protoc with the--pyi_outparameter, invoked through the Makefile target inhealth-monitors/gpu-health-monitor/Makefile, which is called from the root Makefile'sprotos-generatetarget. The repository correctly enforces regeneration from proto definitions through theprotos-linttarget that verifies generated files are up-to-date.health-events-analyzer/pkg/publisher/publisher.go (3)
37-40: LGTM!The
PublisherConfigstruct correctly includes the newprocessingStrategyfield to store the module-level default strategy.
94-100: LGTM!The constructor correctly accepts and stores the
processingStrategyparameter. Based on learnings, the lack of nil validation forplatformConnectorClientfollows the established pattern where callers are expected to provide valid parameters.
113-123: LGTM!The processing strategy logic correctly implements the default-with-override pattern:
- Sets the module-level default first
- Validates and applies rule-level override only when provided
- Returns a descriptive error for invalid strategy values
The error handling follows coding guidelines with proper context wrapping.
store-client/pkg/client/pipeline_builder_test.go (3)
69-86: LGTM! Test properly updated to reflect pipeline builder changes.The test correctly validates the new
BuildProcessableHealthEventInsertsPipeline()method across both MongoDB and PostgreSQL implementations using a table-driven approach.
88-105: LGTM! New test provides appropriate coverage.The test correctly validates the new
BuildProcessableNonFatalUnhealthyInsertsPipeline()method and follows the established table-driven test pattern.
185-189: LGTM! Backward compatibility test properly updated.The test correctly validates that the refactored
BuildProcessableNonFatalUnhealthyInsertsPipeline()function works through the global helper interface.health-events-analyzer/main.go (3)
93-96: LGTM! Pipeline correctly updated to filter by processing strategy.The change to
BuildProcessableNonFatalUnhealthyInsertsPipeline()ensures the analyzer only processes events withprocessingStrategy=EXECUTE_REMEDIATION, excluding observability-only events.
98-111: LGTM! Processing strategy properly wired through platform connection.The
connectToPlatformfunction signature correctly accepts theprocessingStrategyparameter and passes it to the publisher constructor.
121-122: LGTM! Processing strategy flag properly defined.The
--processing-strategyflag is well-defined with a sensible default ofEXECUTE_REMEDIATIONthat maintains backward compatibility.
store-client/pkg/datastore/providers/postgresql/sql_filter_builder_test.go
Show resolved
Hide resolved
Signed-off-by: Tanisha goyal <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
π§Ή Nitpick comments (1)
tests/helpers/kube.go (1)
2251-2256: Consider clarifying the documentation for boolean flags.The function comment example shows
{"--verbose": ""}but doesn't explicitly state that an empty string value indicates a boolean flag. Consider adding a note like "Empty string values indicate boolean flags (no value)".
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (2)
tests/health_events_analyzer_test.gotests/helpers/kube.go
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/health_events_analyzer_test.gotests/helpers/kube.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/health_events_analyzer_test.go
π§ Learnings (2)
π Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
tests/health_events_analyzer_test.go
π Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.
Applied to files:
tests/health_events_analyzer_test.go
𧬠Code graph analysis (1)
tests/health_events_analyzer_test.go (5)
tests/helpers/health_events_analyzer.go (3)
HealthEventsAnalyzerTestContext(41-45)SetupHealthEventsAnalyzerTest(47-85)TeardownHealthEventsAnalyzer(241-250)tests/helpers/kube.go (4)
SetDeploymentArgs(2255-2288)NVSentinelNamespace(64-64)NeverWaitTimeout(62-62)WaitInterval(63-63)tests/helpers/node_pool.go (2)
AcquireNodeFromPool(43-86)DefaultExpiry(31-31)data-models/pkg/protos/health_event.pb.go (1)
RecommendedAction_RESTART_VM(96-96)tests/helpers/event_exporter.go (3)
GetMockEvents(36-99)FindEventByNodeAndCheckName(221-254)ValidateCloudEvent(257-283)
π Additional comments (7)
tests/health_events_analyzer_test.go (4)
1535-1594: Setup logic is well-structured.The test properly configures the
STORE_ONLYprocessing strategy before acquiring a node and injecting XID events. Error handling and rollout waits are correctly implemented. The burst gap timing (17s) aligns with other tests in this file.
1606-1614: Potential flaky test:NeverWaitTimeoutmay be too short for positive assertion.
helpers.NeverWaitTimeout(10 seconds) is used here to wait for the event to appear in changestream. This timeout name suggests it's intended for negative assertions where you expect something to not happen. For positive assertions like "event should be exported", a longer timeout (e.g.,EventuallyWaitTimeout) would be more robust and reduce test flakiness.Verify that 10 seconds is sufficient for the changestream export to complete reliably, or consider using a longer timeout constant intended for positive assertions.
1618-1618: Verify emptyexpectedMessageis intentional.An empty string is passed for
expectedMessage, which meansValidateCloudEventwill assert thathealthEvent["message"]equals"". Confirm this is the expected behavior for STORE_ONLY strategy events, or if a specific message should be validated.
1623-1647: Teardown correctly captures and handles errors.The previous issue where
RemoveDeploymentArgserror was discarded has been addressed. The teardown now properly captures the error and validates it before proceeding with the deployment rollout wait.tests/helpers/kube.go (3)
2208-2249: LGTM! Well-implemented DaemonSet rollout waiter.The function correctly polls the DaemonSet status until all desired pods are updated and ready, with informative logging at each step.
2290-2336: LGTM! Complex but correct argument manipulation logic.The function properly handles three argument styles (--flag=value, --flag value, --flag) with appropriate bounds checking and slice manipulation.
2368-2393: LGTM! Correct argument removal logic.The function properly handles both --flag=value and --flag value styles with safe slice manipulation and bounds checking.
| return ctx | ||
| }) | ||
|
|
||
| feature.Assess("Check if HealthEventsAnalyzerStoreOnlyStrategy node condition is added", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assess description is misleading.
The description says "Check if ... node condition is added" but the test actually verifies that no node condition is created and validates the exported CloudEvent. Consider updating to match the actual assertion intent.
π Suggested fix
- feature.Assess("Check if HealthEventsAnalyzerStoreOnlyStrategy node condition is added", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
+ feature.Assess("Verify no node condition is created and CloudEvent is exported for STORE_ONLY strategy", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {π Committable suggestion
βΌοΈ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| feature.Assess("Check if HealthEventsAnalyzerStoreOnlyStrategy node condition is added", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context { | |
| feature.Assess("Verify no node condition is created and CloudEvent is exported for STORE_ONLY strategy", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context { |
π€ Prompt for AI Agents
In tests/health_events_analyzer_test.go around line 1597, the Assess description
is misleading: it claims to check that a "node condition is added" while the
test actually asserts that no node condition is created and validates the
exported CloudEvent. Update the Assess description to accurately reflect the
test intent (e.g., "Check that HealthEventsAnalyzerStoreOnlyStrategy does NOT
add a node condition and exported CloudEvent is valid") so the description
matches the assertions and intent of the test.
| func RemoveDeploymentArgs( | ||
| ctx context.Context, c klient.Client, deploymentName, namespace, containerName string, args map[string]string, | ||
| ) error { | ||
| return retry.RetryOnConflict(retry.DefaultRetry, func() error { | ||
| deployment := &appsv1.Deployment{} | ||
| if err := c.Resources().Get(ctx, deploymentName, namespace, deployment); err != nil { | ||
| return err | ||
| } | ||
|
|
||
| if len(deployment.Spec.Template.Spec.Containers) == 0 { | ||
| return fmt.Errorf("deployment %s/%s has no containers", namespace, deploymentName) | ||
| } | ||
|
|
||
| for i := range deployment.Spec.Template.Spec.Containers { | ||
| container := &deployment.Spec.Template.Spec.Containers[i] | ||
|
|
||
| if containerName != "" && container.Name != containerName { | ||
| continue | ||
| } | ||
|
|
||
| removeArgsFromContainer(container, args) | ||
| } | ||
|
|
||
| return c.Resources().Update(ctx, deployment) | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent behavior: missing container existence check.
Unlike SetDeploymentArgs (lines 2282-2284), this function does not verify that the specified container exists when containerName is non-empty. It will silently succeed even if the named container is not found, which could mask configuration errors in tests.
π Proposed fix to add container existence check
return fmt.Errorf("deployment %s/%s has no containers", namespace, deploymentName)
}
+ found := false
+
for i := range deployment.Spec.Template.Spec.Containers {
container := &deployment.Spec.Template.Spec.Containers[i]
if containerName != "" && container.Name != containerName {
continue
}
+ found = true
+
removeArgsFromContainer(container, args)
}
+ if containerName != "" && !found {
+ return fmt.Errorf("container %q not found in deployment %s/%s", containerName, namespace, deploymentName)
+ }
+
return c.Resources().Update(ctx, deployment)
})π€ Prompt for AI Agents
In tests/helpers/kube.go around lines 2341 to 2365, the RemoveDeploymentArgs
function lacks the existence check for a specific container name (unlike
SetDeploymentArgs), so it silently succeeds if the named container isn't
present; modify the function to track whether any container matching
containerName was found (e.g., set a found bool inside the loop when you process
the matching container), and after the loop if containerName != "" and found is
false return a clear error (fmt.Errorf("container %q not found in deployment
%s/%s", containerName, namespace, deploymentName)); keep this change inside the
existing RetryOnConflict closure so Update still only runs when appropriate.
Summary
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Behavior Changes
Tests
βοΈ Tip: You can customize this high-level summary in your review settings.