@tanishagoyal2
Contributor

@tanishagoyal2 tanishagoyal2 commented Dec 22, 2025

Summary

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • Added a processing strategy for health events: EXECUTE_REMEDIATION (default) or STORE_ONLY (observability-only).
    • Processing strategy configurable globally and can be overridden per rule.
    • Strategy is included in exported health events/CloudEvents.
  • Behavior Changes

    • STORE_ONLY events are treated as observability-only (no node condition updates); EXECUTE_REMEDIATION can trigger remediation.
  • Tests

    • Added tests and helpers covering both strategies.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Contributor

coderabbitai bot commented Dec 22, 2025

Walkthrough

A new ProcessingStrategy enum (EXECUTE_REMEDIATION, STORE_ONLY) and a processingStrategy field on HealthEvent are added and propagated through pipelines, publisher, platform connector, exporters, deployment config, and tests to control whether events trigger remediation or are stored/observed only.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Protobuf Definitions: data-models/protobufs/health_event.proto, health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py, health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi | Adds ProcessingStrategy enum (EXECUTE_REMEDIATION, STORE_ONLY) and processingStrategy field to HealthEvent; updates generated Python bindings and type stubs. |
| Publisher Configuration: health-events-analyzer/pkg/publisher/publisher.go, health-events-analyzer/pkg/publisher/publisher_test.go, health-events-analyzer/pkg/config/rules.go | NewPublisher now accepts default processingStrategy; Publish gains a rule *HealthEventsAnalyzerRule parameter and applies per-rule or default processingStrategy (with validation). Adds ProcessingStrategy field to rule config. |
| Health Events Analyzer Core: health-events-analyzer/main.go | Adds --processing-strategy CLI flag (validated), threads the enum value into connect/publisher creation, and switches to processable pipeline builder. |
| Reconciler & Event Publishing: health-events-analyzer/pkg/reconciler/reconciler.go, health-events-analyzer/pkg/reconciler/reconciler_test.go | Passes rule pointer to Publisher.Publish; pipeline stage filters now match processingStrategy=EXECUTE_REMEDIATION for rule evaluation; tests updated to supply publisher strategy. |
| Platform Connector Filtering: platform-connectors/pkg/connectors/kubernetes/process_node_events.go, platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go | Adds filterProcessableEvents to skip STORE_ONLY events; processHealthEvents builds node conditions and Kubernetes events only from processable events; introduces createK8sEvent helper and tests to validate store-only vs execute behavior. |
| Data Store Pipelines: store-client/pkg/client/pipeline_builder.go, store-client/pkg/client/mongodb_client.go, store-client/pkg/client/mongodb_pipeline_builder.go, store-client/pkg/client/postgresql_pipeline_builder.go, store-client/pkg/client/pipeline_builder_test.go, store-client/pkg/datastore/providers/postgresql/sql_filter_builder_test.go | Removes BuildNonFatalUnhealthyInsertsPipeline; introduces BuildProcessableHealthEventInsertsPipeline and BuildProcessableNonFatalUnhealthyInsertsPipeline that filter for processingStrategy=EXECUTE_REMEDIATION; updates builders and tests accordingly. |
| Event Export & Transformation: event-exporter/pkg/transformer/cloudevents.go, event-exporter/pkg/transformer/cloudevents_test.go | Adds processingStrategy to CloudEvent payload (from event.ProcessingStrategy.String()); tests assert presence/value. |
| Kubernetes Deployment Configuration: distros/kubernetes/nvsentinel/charts/health-events-analyzer/deployment.yaml, distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml | Adds processingStrategy Helm value (default EXECUTE_REMEDIATION) and passes it to container via --processing-strategy arg. |
| Fault Quarantine Utilities: fault-quarantine/pkg/initializer/init.go, fault-quarantine/pkg/evaluator/rule_evaluator_test.go | Replace BuildAllHealthEventInsertsPipeline with BuildProcessableHealthEventInsertsPipeline; tests updated to include processingStrategy in expected serialized output. |
| Test Helpers & Integration Tests: tests/helpers/event_exporter.go, tests/helpers/healthevent.go, tests/helpers/kube.go, tests/health_events_analyzer_test.go, tests/event_exporter_test.go | Adds ProcessingStrategy to test HealthEvent template and builder (WithProcessingStrategy), extends ValidateCloudEvent to assert expected processingStrategy, adds helper FindEventByNodeAndCheckName, and kube helpers WaitForDaemonSetRollout, SetDeploymentArgs, RemoveDeploymentArgs. Adds integration test TestHealthEventsAnalyzerStoreOnlyStrategy (note: duplicate insertion observed in summary). |
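
The Publisher Configuration entry above describes applying a per-rule strategy when one is set and otherwise falling back to the publisher default, with validation. A minimal sketch of that resolution step follows. The `resolveStrategy` helper, the local enum type, and the `strategyValues` map are hypothetical stand-ins (the map plays the role of the generated `protos.ProcessingStrategy_value` lookup); the actual code in publisher.go may be structured differently.

```go
package main

import "fmt"

// ProcessingStrategy is a local stand-in for the generated protobuf enum.
type ProcessingStrategy int32

const (
	ExecuteRemediation ProcessingStrategy = 0
	StoreOnly          ProcessingStrategy = 1
)

// strategyValues is an assumed name-to-value table, similar in shape to the
// map that generated Go protobuf code exposes for enums.
var strategyValues = map[string]ProcessingStrategy{
	"EXECUTE_REMEDIATION": ExecuteRemediation,
	"STORE_ONLY":          StoreOnly,
}

// resolveStrategy (hypothetical) returns the rule-level override when it is
// set and valid, and the publisher default when the rule leaves it empty.
func resolveStrategy(def ProcessingStrategy, ruleOverride string) (ProcessingStrategy, error) {
	if ruleOverride == "" {
		return def, nil
	}

	value, ok := strategyValues[ruleOverride]
	if !ok {
		return def, fmt.Errorf("unexpected processingStrategy value: %q", ruleOverride)
	}

	return value, nil
}

func main() {
	strategy, err := resolveStrategy(ExecuteRemediation, "STORE_ONLY")
	fmt.Println(strategy, err)
}
```

Returning an error for an unknown override, rather than silently using the default, keeps a typo in a rule from turning an observability-only rule into one that remediates.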

Sequence Diagram(s)

mermaid
sequenceDiagram
autonumber
participant Analyzer as HealthEventsAnalyzer
participant Publisher as Publisher
participant Store as Datastore (Mongo/Postgres)
participant Platform as PlatformConnector (K8s)
participant Exporter as CloudEvent Exporter

Note over Analyzer,Publisher: Event ingestion pipeline yields HealthEvent with processingStrategy
Analyzer->>Store: Insert HealthEvent (stored regardless of strategy)
Analyzer->>Publisher: Publish(event, rule)
Publisher->>Publisher: determine processingStrategy (rule override or default)
alt processingStrategy == EXECUTE_REMEDIATION
    Publisher->>Platform: send event -> create node conditions / kube events
    Platform->>Platform: update node condition
    Platform->>Exporter: emit CloudEvent (observability)
else processingStrategy == STORE_ONLY
    Publisher->>Exporter: emit CloudEvent only (no node condition)
end
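
The branch in the diagram corresponds to the platform connector dropping STORE_ONLY events before it builds node conditions. A rough sketch of such a filter is shown here, assuming trimmed local types in place of `protos.HealthEvent`; the name `filterProcessable` and the struct fields are illustrative, not taken from the PR.

```go
package main

import "fmt"

// ProcessingStrategy is a local stand-in for the generated protobuf enum.
type ProcessingStrategy int32

const (
	ExecuteRemediation ProcessingStrategy = 0
	StoreOnly          ProcessingStrategy = 1
)

// HealthEvent is a trimmed, hypothetical stand-in for protos.HealthEvent.
type HealthEvent struct {
	NodeName           string
	CheckName          string
	ProcessingStrategy ProcessingStrategy
}

// filterProcessable keeps only the events that are allowed to change
// cluster state; STORE_ONLY events and nil entries are skipped.
func filterProcessable(events []*HealthEvent) []*HealthEvent {
	processable := make([]*HealthEvent, 0, len(events))

	for _, event := range events {
		if event == nil || event.ProcessingStrategy == StoreOnly {
			continue
		}

		processable = append(processable, event)
	}

	return processable
}

func main() {
	events := []*HealthEvent{
		{NodeName: "node-a", CheckName: "GpuXid", ProcessingStrategy: ExecuteRemediation},
		{NodeName: "node-b", CheckName: "GpuXid", ProcessingStrategy: StoreOnly},
	}

	for _, event := range filterProcessable(events) {
		fmt.Println(event.NodeName, event.CheckName)
	}
}
```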

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Areas requiring extra attention:

  • Health Events Analyzer pipeline calls and renamed pipeline builder usages
  • Publisher signature and all Publish() call sites (rule parameter propagation and validation)
  • Store pipeline filters matching protobuf enum values across MongoDB/Postgres builders
  • Platform connector filtering to ensure STORE_ONLY events are skipped for remediation
  • Tests and test helpers (duplicate test insertion noted)

Poem

🐰 A strategy field hops through our flow,
EXECUTE or STORE, let events show,
Remediation strikes or observes the land,
Processing branched by a thoughtful hand!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 36.36% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main purpose of the PR: introducing a processingStrategy feature that enables configurable event handling (EXECUTE_REMEDIATION vs STORE_ONLY) across the health-events-analyzer module and related components. |
✨ Finishing touches
  • πŸ“ Generate docstrings
πŸ§ͺ Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (7)
fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)

263-263: LGTM! The test correctly reflects the new processingStrategy field.

The addition of "processingStrategy": float64(0) properly validates that the RoundTrip function now includes the new field with its default value (EXECUTE_REMEDIATION = 0).
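
The expected value is written as float64(0) rather than an integer because decoding JSON into a generic map produces float64 for every number. The sketch below illustrates that behavior under the assumption that RoundTrip goes through encoding/json; the `roundTrip` helper and the `event` struct are hypothetical stand-ins, not the project's code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event is a hypothetical, trimmed stand-in for the serialized health event.
type event struct {
	ProcessingStrategy int32 `json:"processingStrategy"`
}

// roundTrip marshals v to JSON and decodes the result into a generic map,
// which is where integer fields turn into float64 values.
func roundTrip(v any) (map[string]any, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, fmt.Errorf("marshal: %w", err)
	}

	var out map[string]any
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, fmt.Errorf("unmarshal: %w", err)
	}

	return out, nil
}

func main() {
	out, err := roundTrip(event{ProcessingStrategy: 0})
	if err != nil {
		panic(err)
	}

	// Prints the dynamic type of the decoded number.
	fmt.Printf("%T\n", out["processingStrategy"])
}
```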

Optional: Consider adding test coverage for non-default values

While the current test appropriately validates the default behavior, you might consider adding a separate test case that explicitly sets and verifies the STORE_ONLY (value 1) processingStrategy to ensure complete coverage of the enum values. This would be a nice-to-have enhancement but is not critical for this test's scope.

Example:

func TestRoundTrip_WithCustomProcessingStrategy(t *testing.T) {
	eventTime := timestamppb.New(time.Now())
	event := &protos.HealthEvent{
		Version:            1,
		Agent:              "test-agent",
		GeneratedTimestamp: eventTime, // use eventTime so the variable is not left unused
		ProcessingStrategy: protos.ProcessingStrategy_STORE_ONLY,
		// ... other fields
	}

	result, err := RoundTrip(event)
	if err != nil {
		t.Fatalf("Failed to roundtrip event: %v", err)
	}

	// STORE_ONLY has enum value 1; the round-tripped number is a float64.
	if result["processingStrategy"] != float64(1) {
		t.Errorf("Expected processingStrategy to be 1, got %v", result["processingStrategy"])
	}
}
tests/helpers/kube.go (2)

2341-2365: Missing container name validation for consistency with SetDeploymentArgs.

Unlike SetDeploymentArgs which returns an error when a specific containerName is requested but not found, RemoveDeploymentArgs silently proceeds. This inconsistency could mask configuration errors in tests.

🔎 Proposed fix
 func RemoveDeploymentArgs(
 	ctx context.Context, c klient.Client, deploymentName, namespace, containerName string, args map[string]string,
 ) error {
 	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		deployment := &appsv1.Deployment{}
 		if err := c.Resources().Get(ctx, deploymentName, namespace, deployment); err != nil {
 			return err
 		}
 
 		if len(deployment.Spec.Template.Spec.Containers) == 0 {
 			return fmt.Errorf("deployment %s/%s has no containers", namespace, deploymentName)
 		}
 
+		found := false
+
 		for i := range deployment.Spec.Template.Spec.Containers {
 			container := &deployment.Spec.Template.Spec.Containers[i]
 
 			if containerName != "" && container.Name != containerName {
 				continue
 			}
 
+			found = true
+
 			removeArgsFromContainer(container, args)
 		}
 
+		if containerName != "" && !found {
+			return fmt.Errorf("container %q not found in deployment %s/%s", containerName, namespace, deploymentName)
+		}
+
 		return c.Resources().Update(ctx, deployment)
 	})
 }

2314-2320: Slice insertion logic is correct but could be clearer.

The slice manipulation on line 2319 inserts a value after the flag. While functionally correct, the nested append pattern is non-obvious.

🔎 Consider using slices.Insert for clarity (Go 1.21+)
-						// Insert the value after the flag
-						container.Args = append(container.Args[:j+1], append([]string{value}, container.Args[j+1:]...)...)
+						// Insert the value after the flag
+						container.Args = slices.Insert(container.Args, j+1, value)

This would require adding "slices" to imports. If staying on older Go versions, the current approach works fine.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)

372-416: Consider simplifying: nodeConditions slice is built but not used.

The nodeConditions slice (lines 375-386) is constructed only to check len(nodeConditions) > 0, but the actual node conditions are recalculated inside updateNodeConditions from processableEvents. This is redundant work.

🔎 Simplified approach
 func (r *K8sConnector) processHealthEvents(ctx context.Context, healthEvents *protos.HealthEvents) error {
 	processableEvents := filterProcessableEvents(healthEvents)
 
-	var nodeConditions []corev1.NodeCondition
-
-	for _, healthEvent := range processableEvents {
-		if healthEvent.IsHealthy || healthEvent.IsFatal {
-			nodeConditions = append(nodeConditions, corev1.NodeCondition{
-				Type:               corev1.NodeConditionType(healthEvent.CheckName),
-				LastHeartbeatTime:  metav1.NewTime(healthEvent.GeneratedTimestamp.AsTime()),
-				LastTransitionTime: metav1.NewTime(healthEvent.GeneratedTimestamp.AsTime()),
-				Message:            r.fetchHealthEventMessage(healthEvent),
-			})
-		}
-	}
+	// Check if any events require node condition updates (IsHealthy or IsFatal)
+	hasConditionEvents := false
+	for _, healthEvent := range processableEvents {
+		if healthEvent.IsHealthy || healthEvent.IsFatal {
+			hasConditionEvents = true
+			break
+		}
+	}
 
-	if len(nodeConditions) > 0 {
+	if hasConditionEvents {
 		start := time.Now()
 		err := r.updateNodeConditions(ctx, processableEvents)
 		// ... rest unchanged
store-client/pkg/client/mongodb_pipeline_builder.go (1)

102-115: Minor style inconsistency in operationType matching.

BuildProcessableHealthEventInsertsPipeline (lines 93-95) uses $in with an array for operationType, while this function uses a direct string match (line 108). Both are functionally equivalent for a single value, but using consistent patterns improves maintainability.

🔎 Suggested fix for consistency
 func (b *MongoDBPipelineBuilder) BuildProcessableNonFatalUnhealthyInsertsPipeline() datastore.Pipeline {
 	return datastore.ToPipeline(
 		datastore.D(
 			datastore.E("$match", datastore.D(
-				datastore.E("operationType", "insert"),
+				datastore.E("operationType", datastore.D(
+					datastore.E("$in", datastore.A("insert")),
+				)),
 				datastore.E("fullDocument.healthevent.agent", datastore.D(datastore.E("$ne", "health-events-analyzer"))),
 				datastore.E("fullDocument.healthevent.ishealthy", false),
 				datastore.E("fullDocument.healthevent.processingstrategy", int32(protos.ProcessingStrategy_EXECUTE_REMEDIATION)),
 			)),
 		),
 	)
 }
health-events-analyzer/pkg/publisher/publisher.go (1)

102-104: Consider documenting the new rule parameter.

The function signature now accepts an optional rule parameter for per-rule ProcessingStrategy overrides. Adding a brief godoc comment explaining when to pass nil vs a rule reference would improve API clarity. As per coding guidelines, function comments are required for exported Go functions.

🔎 Suggested documentation
+// Publish sends a health event to the platform connector with the specified recommended action.
+// The event's ProcessingStrategy defaults to the publisher's configured strategy, but can be
+// overridden by providing a non-nil rule with a ProcessingStrategy field set. Pass nil for
+// rule when no rule-level override is needed (e.g., XID burst detection).
 func (p *PublisherConfig) Publish(ctx context.Context, event *protos.HealthEvent,
 	recommendedAction protos.RecommendedAction, ruleName string, message string,
 	rule *config.HealthEventsAnalyzerRule) error {
health-events-analyzer/main.go (1)

133-140: Consider enhancing the validation error message.

The flag validation logic is correct, but the error message could be more helpful by listing the valid values.

🔎 Suggested improvement
 value, ok := protos.ProcessingStrategy_value[*processingStrategyFlag]
 if !ok {
-	return fmt.Errorf("unexpected processingStrategy value: %q", *processingStrategyFlag)
+	return fmt.Errorf("unexpected processingStrategy value: %q, must be one of: EXECUTE_REMEDIATION, STORE_ONLY", *processingStrategyFlag)
 }

Alternatively, you could iterate over the map keys to generate the list dynamically, though that may be overkill for a two-value enum.
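
For reference, the dynamic variant could look roughly like the sketch below. The `strategyValues` map is a local stand-in for `protos.ProcessingStrategy_value`, and `validNames` and `parseStrategy` are hypothetical helper names. Keys are sorted because Go map iteration order is not stable.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// strategyValues is an assumed stand-in for the generated enum lookup map.
var strategyValues = map[string]int32{
	"EXECUTE_REMEDIATION": 0,
	"STORE_ONLY":          1,
}

// validNames returns the map keys in sorted order so that the resulting
// error message is deterministic.
func validNames(values map[string]int32) []string {
	names := make([]string, 0, len(values))
	for name := range values {
		names = append(names, name)
	}

	sort.Strings(names)

	return names
}

// parseStrategy validates a flag value and lists the accepted names when
// the value is not recognized.
func parseStrategy(flagValue string) (int32, error) {
	value, ok := strategyValues[flagValue]
	if !ok {
		return 0, fmt.Errorf("unexpected processingStrategy value: %q, must be one of: %s",
			flagValue, strings.Join(validNames(strategyValues), ", "))
	}

	return value, nil
}

func main() {
	_, err := parseStrategy("BOGUS")
	fmt.Println(err)
}
```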

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61f47cb and 0ef6d11.

⛔ Files ignored due to path filters (1)
  • data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (27)
  • data-models/protobufs/health_event.proto
  • distros/kubernetes/nvsentinel/charts/health-events-analyzer/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml
  • event-exporter/pkg/transformer/cloudevents.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • fault-quarantine/pkg/initializer/init.go
  • health-events-analyzer/main.go
  • health-events-analyzer/pkg/config/rules.go
  • health-events-analyzer/pkg/publisher/publisher.go
  • health-events-analyzer/pkg/reconciler/reconciler.go
  • health-events-analyzer/pkg/reconciler/reconciler_test.go
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • store-client/pkg/client/mongodb_client.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • store-client/pkg/client/pipeline_builder.go
  • store-client/pkg/client/pipeline_builder_test.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder_test.go
  • tests/event_exporter_test.go
  • tests/health_events_analyzer_test.go
  • tests/helpers/event_exporter.go
  • tests/helpers/healthevent.go
  • tests/helpers/kube.go
🧰 Additional context used
πŸ““ Path-based instructions (5)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • health-events-analyzer/pkg/config/rules.go
  • tests/helpers/healthevent.go
  • fault-quarantine/pkg/initializer/init.go
  • tests/event_exporter_test.go
  • store-client/pkg/client/mongodb_client.go
  • store-client/pkg/client/pipeline_builder.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • tests/health_events_analyzer_test.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder_test.go
  • platform-connectors/pkg/connectors/kubernetes/process_node_events.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • event-exporter/pkg/transformer/cloudevents.go
  • health-events-analyzer/pkg/reconciler/reconciler_test.go
  • tests/helpers/event_exporter.go
  • health-events-analyzer/pkg/publisher/publisher.go
  • tests/helpers/kube.go
  • health-events-analyzer/pkg/reconciler/reconciler.go
  • health-events-analyzer/main.go
  • store-client/pkg/client/pipeline_builder_test.go
**/values.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

  • distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/event_exporter_test.go
  • fault-quarantine/pkg/evaluator/rule_evaluator_test.go
  • event-exporter/pkg/transformer/cloudevents_test.go
  • tests/health_events_analyzer_test.go
  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder_test.go
  • platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
  • health-events-analyzer/pkg/reconciler/reconciler_test.go
  • store-client/pkg/client/pipeline_builder_test.go
data-models/protobufs/**/*.proto

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

  • data-models/protobufs/health_event.proto
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

  • health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
🧠 Learnings (8)
πŸ“š Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

  • distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml
  • tests/health_events_analyzer_test.go
  • store-client/pkg/client/postgresql_pipeline_builder.go
  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder_test.go
  • store-client/pkg/client/mongodb_pipeline_builder.go
  • health-events-analyzer/pkg/reconciler/reconciler.go
  • health-events-analyzer/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

  • tests/event_exporter_test.go
  • tests/helpers/event_exporter.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • tests/event_exporter_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • tests/event_exporter_test.go
  • store-client/pkg/client/pipeline_builder_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `testify/assert` and `testify/require` for assertions in Go tests

Applied to files:

  • tests/event_exporter_test.go
  • store-client/pkg/datastore/providers/postgresql/sql_filter_builder_test.go
📚 Learning: 2025-12-12T07:41:34.094Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 545
File: tests/data/health-events-analyzer-config.yaml:2190-2251
Timestamp: 2025-12-12T07:41:34.094Z
Learning: The XID74Reg2Bit13Set rule in tests/data/health-events-analyzer-config.yaml intentionally omits the time window filter because it only validates the register bit pattern (bit 13 in REG2) on the received XID 74 event itself, without needing to check historical events or count repeated occurrences.

Applied to files:

  • tests/health_events_analyzer_test.go
📚 Learning: 2025-11-04T06:31:02.147Z
Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

  • health-events-analyzer/pkg/publisher/publisher.go
  • health-events-analyzer/main.go
📚 Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • health-events-analyzer/pkg/reconciler/reconciler.go
🧬 Code graph analysis (16)
health-events-analyzer/pkg/config/rules.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
tests/helpers/healthevent.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
tests/event_exporter_test.go (1)
tests/helpers/event_exporter.go (1)
  • ValidateCloudEvent (257-283)
store-client/pkg/client/mongodb_client.go (1)
store-client/pkg/client/pipeline_builder.go (1)
  • GetPipelineBuilder (52-65)
store-client/pkg/client/pipeline_builder.go (1)
store-client/pkg/client/mongodb_client.go (1)
  • BuildProcessableNonFatalUnhealthyInsertsPipeline (294-297)
event-exporter/pkg/transformer/cloudevents_test.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (5)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • ProcessingStrategy_STORE_ONLY (47-47)
store-client/pkg/datastore/providers/postgresql/sql_filter_builder_test.go (2)
store-client/pkg/datastore/types.go (4)
  • ToPipeline (161-163)
  • D (131-133)
  • E (126-128)
  • A (136-138)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
store-client/pkg/client/mongodb_pipeline_builder.go (3)
store-client/pkg/datastore/types.go (4)
  • ToPipeline (161-163)
  • D (131-133)
  • E (126-128)
  • A (136-138)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
store-client/pkg/client/mongodb_client.go (1)
  • BuildProcessableNonFatalUnhealthyInsertsPipeline (294-297)
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)
data-models/pkg/protos/health_event.pb.go (19)
  • HealthEvent (260-280)
  • HealthEvent (293-293)
  • HealthEvent (308-310)
  • Entity (208-214)
  • Entity (227-227)
  • Entity (242-244)
  • RecommendedAction (89-89)
  • RecommendedAction (139-141)
  • RecommendedAction (143-145)
  • RecommendedAction (152-154)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • ProcessingStrategy_STORE_ONLY (47-47)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
  • HealthEvents (156-162)
  • HealthEvents (175-175)
  • HealthEvents (190-192)
event-exporter/pkg/transformer/cloudevents.go (2)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
  • ProcessingStrategy (14-17)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
health-events-analyzer/pkg/reconciler/reconciler_test.go (2)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
  • Publisher (36-38)
data-models/pkg/protos/health_event.pb.go (1)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
health-events-analyzer/pkg/publisher/publisher.go (2)
data-models/pkg/protos/health_event.pb.go (12)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
  • HealthEvent (260-280)
  • HealthEvent (293-293)
  • HealthEvent (308-310)
  • RecommendedAction (89-89)
  • RecommendedAction (139-141)
  • RecommendedAction (143-145)
  • RecommendedAction (152-154)
  • ProcessingStrategy_value (56-59)
health-events-analyzer/pkg/config/rules.go (1)
  • HealthEventsAnalyzerRule (23-32)
health-events-analyzer/pkg/reconciler/reconciler.go (2)
health-monitors/kubernetes-object-monitor/pkg/publisher/publisher.go (1)
  • Publisher (36-38)
data-models/pkg/protos/health_event.pb.go (8)
  • HealthEvent (260-280)
  • HealthEvent (293-293)
  • HealthEvent (308-310)
  • RecommendedAction (89-89)
  • RecommendedAction (139-141)
  • RecommendedAction (143-145)
  • RecommendedAction (152-154)
  • ProcessingStrategy_EXECUTE_REMEDIATION (46-46)
health-events-analyzer/main.go (2)
store-client/pkg/client/mongodb_client.go (1)
  • BuildProcessableNonFatalUnhealthyInsertsPipeline (294-297)
health-events-analyzer/pkg/publisher/publisher.go (2)
  • PublisherConfig (37-40)
  • NewPublisher (94-100)
store-client/pkg/client/pipeline_builder_test.go (4)
store-client/pkg/client/pipeline_builder.go (1)
  • PipelineBuilder (26-48)
store-client/pkg/client/mongodb_pipeline_builder.go (1)
  • NewMongoDBPipelineBuilder (29-31)
store-client/pkg/client/postgresql_pipeline_builder.go (1)
  • NewPostgreSQLPipelineBuilder (29-31)
store-client/pkg/client/mongodb_client.go (1)
  • BuildProcessableNonFatalUnhealthyInsertsPipeline (294-297)
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)
data-models/pkg/protos/health_event.pb.go (4)
  • ProcessingStrategy (43-43)
  • ProcessingStrategy (72-74)
  • ProcessingStrategy (76-78)
  • ProcessingStrategy (85-87)
🔇 Additional comments (40)
fault-quarantine/pkg/initializer/init.go (1)

66-66: Verify the pipeline filtering behavior and backward compatibility.

The change from BuildAllHealthEventInsertsPipeline() to BuildProcessableHealthEventInsertsPipeline() correctly alters which health events trigger remediation actions in fault-quarantine. The new method filters for events with processingStrategy=EXECUTE_REMEDIATION, excluding observability-only STORE_ONLY events. Existing events without the strategy field default to EXECUTE_REMEDIATION for backward compatibility.

data-models/protobufs/health_event.proto (2)

32-38: Well-designed enum with correct default value.

Using EXECUTE_REMEDIATION = 0 as the default ensures backward compatibility: existing events without this field will behave as before (executing remediation). The comments are comprehensive per the coding guidelines requirement for Protocol Buffer messages.
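
To illustrate why the zero value matters, here is a minimal sketch using local stand-in types rather than the generated protobuf code. The `HealthEvent` struct, the `shouldRemediate` helper, and the value `1` for STORE_ONLY are assumptions made for illustration; only `EXECUTE_REMEDIATION = 0` is confirmed by the proto definition.

```go
package main

import "fmt"

// ProcessingStrategy is a local stand-in for the generated protobuf enum.
type ProcessingStrategy int32

const (
	// ExecuteRemediation is the zero value, so it is the implicit default.
	ExecuteRemediation ProcessingStrategy = 0
	// StoreOnly is assumed to be 1 for this sketch.
	StoreOnly ProcessingStrategy = 1
)

// HealthEvent is a simplified, hypothetical stand-in for the generated message.
type HealthEvent struct {
	NodeName           string
	ProcessingStrategy ProcessingStrategy
}

// shouldRemediate reports whether an event is allowed to trigger remediation.
func shouldRemediate(e HealthEvent) bool {
	return e.ProcessingStrategy == ExecuteRemediation
}

func main() {
	legacy := HealthEvent{NodeName: "node-a"} // strategy never set by an older producer
	observed := HealthEvent{NodeName: "node-b", ProcessingStrategy: StoreOnly}

	fmt.Println(shouldRemediate(legacy), shouldRemediate(observed)) // prints "true false"
}
```

An event written before the field existed decodes with the zero value, so it is still treated as a remediation candidate.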


77-77: Field addition looks correct.

Field number 16 follows the sequential pattern from the previous field (drainOverrides = 15). This is a backward-compatible, additive change.

event-exporter/pkg/transformer/cloudevents_test.go (2)

69-69: Good test coverage for the new field.

Using STORE_ONLY (non-default value) in the test is the right approach: it verifies the field is explicitly serialized rather than relying on default behavior.


106-108: Assertion correctly validates string serialization.

The test verifies that ProcessingStrategy is serialized as its string representation ("STORE_ONLY") in the CloudEvent payload, which aligns with the enum's .String() method usage in the transformer.

distros/kubernetes/nvsentinel/charts/health-events-analyzer/templates/deployment.yaml (1)

77-77: CLI argument follows existing patterns.

The new argument is correctly templated and positioned with other configuration flags. Since processingStrategy has a default value in values.yaml, this will always render with a valid value.

tests/helpers/kube.go (1)

2208-2249: Well-structured DaemonSet rollout helper.

The implementation correctly checks all three conditions for a complete rollout: desired pods scheduled, pods updated, and pods ready. Good logging for debugging test failures.
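
The three conditions can be sketched as a single predicate. This is an illustration only: `DaemonSetStatus` below is a local struct that mirrors three counters from the Kubernetes `appsv1.DaemonSetStatus` type, and both `rolloutComplete` and the extra "desired > 0" guard are assumptions, not the helper's actual code.

```go
package main

import "fmt"

// DaemonSetStatus mirrors the three counters of appsv1.DaemonSetStatus used here.
type DaemonSetStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberReady            int32
}

// rolloutComplete reports whether every desired pod is both updated and ready.
// Treating "zero desired pods" as incomplete is an assumption of this sketch.
func rolloutComplete(s DaemonSetStatus) bool {
	return s.DesiredNumberScheduled > 0 &&
		s.UpdatedNumberScheduled == s.DesiredNumberScheduled &&
		s.NumberReady == s.DesiredNumberScheduled
}

func main() {
	inProgress := DaemonSetStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 2, NumberReady: 2}
	done := DaemonSetStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3, NumberReady: 3}

	fmt.Println(rolloutComplete(inProgress), rolloutComplete(done)) // prints "false true"
}
```

A polling helper would evaluate this predicate on each interval until it returns true or the timeout expires.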

distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml (1)

34-39: Configuration is well-documented with valid values and behavior descriptions.

The inline comments cover valid values, default behavior, and detailed explanations for each strategy, following the Helm chart documentation guidelines.

Consider verifying that the application validates the processingStrategy value at startup and fails fast with a clear error message if an invalid value is provided:

#!/bin/bash
# Description: Check if processingStrategy is validated in the main.go or flag parsing code

# Search for processingStrategy flag definition and validation
rg -n "processing-strategy" --type=go -A5 -B2

# Search for any enum validation or parsing logic
ast-grep --pattern $'func $_(strategy string) $_ {
  $$$
}'
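
For reference, fail-fast validation of the flag could look roughly like the sketch below. It is not taken from the PR: `parseStrategyFlag`, `strategyByName`, and the local enum are hypothetical, and only the flag name `--processing-strategy` and its default come from the change under review.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// ProcessingStrategy is a local stand-in for the generated protobuf enum.
type ProcessingStrategy int32

const (
	ExecuteRemediation ProcessingStrategy = 0
	StoreOnly          ProcessingStrategy = 1 // value assumed for this sketch
)

// strategyByName plays the role of the name-to-value map that generated code provides.
var strategyByName = map[string]ProcessingStrategy{
	"EXECUTE_REMEDIATION": ExecuteRemediation,
	"STORE_ONLY":          StoreOnly,
}

// parseStrategyFlag parses args and rejects unknown strategy names with a clear error.
func parseStrategyFlag(args []string) (ProcessingStrategy, error) {
	fs := flag.NewFlagSet("health-events-analyzer", flag.ContinueOnError)
	name := fs.String("processing-strategy", "EXECUTE_REMEDIATION", "EXECUTE_REMEDIATION or STORE_ONLY")

	if err := fs.Parse(args); err != nil {
		return 0, fmt.Errorf("parsing flags: %w", err)
	}

	strategy, ok := strategyByName[*name]
	if !ok {
		return 0, fmt.Errorf("invalid --processing-strategy %q: want EXECUTE_REMEDIATION or STORE_ONLY", *name)
	}

	return strategy, nil
}

func main() {
	strategy, err := parseStrategyFlag(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // fail fast instead of running with an unknown strategy
	}

	fmt.Println("processing strategy value:", strategy)
}
```

Validating once at startup keeps a typo in the Helm values from silently falling back to the default.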
event-exporter/pkg/transformer/cloudevents.go (1)

66-66: LGTM! ProcessingStrategy field properly added to CloudEvent payload.

The new field is correctly populated using event.ProcessingStrategy.String() and follows the same pattern as other fields like recommendedAction. The protobuf enum's String() method will handle the zero-value case appropriately by returning the enum name.
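
A rough sketch of the string serialization follows. The `String()` method here is hand-written to imitate what generated enum code provides, and `buildPayload` with its JSON key names is a hypothetical simplification of the transformer, not its real shape.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ProcessingStrategy is a local stand-in for the generated protobuf enum.
type ProcessingStrategy int32

const (
	ExecuteRemediation ProcessingStrategy = 0
	StoreOnly          ProcessingStrategy = 1 // value assumed for this sketch
)

// String imitates the generated String() method, including the zero-value case.
func (s ProcessingStrategy) String() string {
	switch s {
	case ExecuteRemediation:
		return "EXECUTE_REMEDIATION"
	case StoreOnly:
		return "STORE_ONLY"
	default:
		return fmt.Sprintf("ProcessingStrategy(%d)", int32(s))
	}
}

// buildPayload is a hypothetical, reduced version of the CloudEvent data map.
func buildPayload(nodeName string, s ProcessingStrategy) map[string]any {
	return map[string]any{
		"nodeName":           nodeName,
		"processingStrategy": s.String(),
	}
}

func main() {
	out, err := json.Marshal(buildPayload("node-a", StoreOnly))
	if err != nil {
		panic(err)
	}

	fmt.Println(string(out)) // prints {"nodeName":"node-a","processingStrategy":"STORE_ONLY"}
}
```

Because the name is emitted rather than the number, consumers of the exported event do not need the proto definition to interpret the field.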

tests/event_exporter_test.go (1)

25-26: LGTM! Test updated to validate processingStrategy in CloudEvents.

The import addition and updated ValidateCloudEvent call correctly validate that health events default to EXECUTE_REMEDIATION strategy in the CloudEvent payload. The hard-coded string matches the expected default behavior.

Based on learnings: Write table-driven tests when testing multiple scenarios in Go.

Also applies to: 85-85

tests/helpers/healthevent.go (1)

48-48: LGTM! ProcessingStrategy support properly added to test helper.

The field and builder method follow the existing pattern. Using int type is correct for protobuf enums, and the omitempty tag ensures the field defaults to EXECUTE_REMEDIATION (value 0) when not explicitly set.

Also applies to: 153-156

store-client/pkg/client/mongodb_client.go (1)

289-297: LGTM! Function renamed to reflect processing strategy filtering.

The rename from BuildNonFatalUnhealthyInsertsPipeline to BuildProcessableNonFatalUnhealthyInsertsPipeline better communicates that the pipeline filters for events with processingStrategy=EXECUTE_REMEDIATION. The function is already marked as deprecated, so the breaking change is acceptable for users who should migrate to the new PipelineBuilder interface.

health-events-analyzer/pkg/reconciler/reconciler_test.go (1)

288-288: LGTM! All test publisher instantiations updated with processing strategy.

All NewPublisher calls consistently specify protos.ProcessingStrategy_EXECUTE_REMEDIATION, aligning with the updated constructor signature. This ensures tests validate the default remediation behavior.

Based on learnings: Use testify/assert and testify/require for assertions in Go tests.

Also applies to: 333-333, 378-378, 403-403, 429-429

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (1)

1391-1589: Excellent test coverage for ProcessingStrategy behavior!

This comprehensive test validates the critical distinction between STORE_ONLY (observability-only) and EXECUTE_REMEDIATION (creates node conditions/events) strategies. Key strengths:

  • Clear table-driven test structure with descriptive names
  • Properly excludes standard Kubernetes conditions when counting (NodeReady, MemoryPressure, etc.)
  • Tests all scenarios: STORE_ONLY fatal/non-fatal, EXECUTE_REMEDIATION, and mixed strategies
  • Uses require for setup and assert for validation per best practices

Based on learnings: Write table-driven tests when testing multiple scenarios in Go, and use testify/assert and testify/require for assertions in Go tests.

health-events-analyzer/pkg/config/rules.go (1)

30-31: LGTM! Optional per-rule processing strategy override properly defined.

The field is well-documented and follows TOML conventions. The consuming code in health-events-analyzer/pkg/publisher/publisher.go properly validates the string value against the protos.ProcessingStrategy enum map (lines 116-122) and returns an error for invalid values, then converts to the protobuf enum type.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

1-51: Auto-generated protobuf file; no manual review needed.

This file is generated by the protocol buffer compiler (as stated on line 2). The offset adjustments for _PROCESSINGSTRATEGY, _RECOMMENDEDACTION, _HEALTHEVENT, and related descriptors are consistent with the addition of the new ProcessingStrategy enum in the source .proto file.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)

325-343: LGTM: filtering logic is clean and well-documented.

The function correctly filters out STORE_ONLY events and provides informational logging for skipped events, which aids in debugging and observability.
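
The filtering step can be pictured with the following sketch. The types and the `filterProcessable` name are hypothetical stand-ins; the real function in process_node_events.go may differ in signature and log fields.

```go
package main

import (
	"fmt"
	"log/slog"
)

// ProcessingStrategy is a local stand-in for the generated protobuf enum.
type ProcessingStrategy int32

const (
	ExecuteRemediation ProcessingStrategy = 0
	StoreOnly          ProcessingStrategy = 1 // value assumed for this sketch
)

// HealthEvent is a simplified, hypothetical stand-in for the generated message.
type HealthEvent struct {
	NodeName           string
	CheckName          string
	ProcessingStrategy ProcessingStrategy
}

// filterProcessable drops STORE_ONLY events and logs each one it skips.
func filterProcessable(events []HealthEvent) []HealthEvent {
	processable := make([]HealthEvent, 0, len(events))

	for _, e := range events {
		if e.ProcessingStrategy == StoreOnly {
			slog.Info("skipping STORE_ONLY health event", "node", e.NodeName, "check", e.CheckName)
			continue
		}

		processable = append(processable, e)
	}

	return processable
}

func main() {
	events := []HealthEvent{
		{NodeName: "node-a", CheckName: "GpuXid", ProcessingStrategy: StoreOnly},
		{NodeName: "node-b", CheckName: "GpuXid"},
	}

	fmt.Println(len(filterProcessable(events))) // prints 1
}
```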

tests/helpers/event_exporter.go (2)

220-254: LGTM: helper function follows established patterns.

The function correctly mirrors the structure of FindEventByNodeAndMessage and properly handles type assertions with early continue on failures.


256-283: LGTM: signature extended to validate processingStrategy.

The new expectedProcessingStrategy parameter and corresponding assertion integrate well with the existing validation logic.

store-client/pkg/client/pipeline_builder.go (1)

35-43: LGTM: interface extended with well-documented pipeline builders.

The new methods clearly express the filtering intent via their names and godoc comments:

  • BuildProcessableHealthEventInsertsPipeline for events with processingStrategy=EXECUTE_REMEDIATION
  • BuildProcessableNonFatalUnhealthyInsertsPipeline for the health-events-analyzer pattern detection

The naming convention Processable* effectively communicates that STORE_ONLY events are excluded.

tests/health_events_analyzer_test.go (2)

1535-1595: LGTM: test setup correctly configures STORE_ONLY strategy.

The setup properly:

  1. Configures the deployment with --processing-strategy=STORE_ONLY
  2. Waits for rollout before proceeding
  3. Injects two bursts of XID 120 errors with appropriate gap timing

1597-1621: Verify empty message expectation in CloudEvent validation.

Line 1618 passes an empty string for expectedMessage:

helpers.ValidateCloudEvent(t, receivedEvent, testNodeName, "", "RepeatedXIDErrorOnSameGPU", "120", "STORE_ONLY")

Looking at ValidateCloudEvent (line 280 in helpers), it performs require.Equal(t, expectedMessage, healthEvent["message"]). If the actual health event has a non-empty message (which XID error events typically do), this assertion will fail.

Please confirm whether the RepeatedXIDErrorOnSameGPU health event is expected to have an empty message, or if this should be updated to match the actual expected message content.

store-client/pkg/client/mongodb_pipeline_builder.go (2)

17-21: LGTM!

The import addition for protos is necessary to reference ProcessingStrategy_EXECUTE_REMEDIATION in the new pipeline methods.


87-100: LGTM!

The new pipeline correctly filters for insert operations with processingStrategy = EXECUTE_REMEDIATION. The int32 cast is appropriate since protobuf enum values are stored as integers in MongoDB documents.
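
As a rough picture of such a match stage, the sketch below builds the pipeline from plain maps instead of the MongoDB driver types. The field path `fullDocument.healthevent.processingstrategy` and the function name are assumptions for illustration; the actual builder may use different keys.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ProcessingStrategy is a local stand-in for the generated protobuf enum.
type ProcessingStrategy int32

// ExecuteRemediation is the zero value of the enum.
const ExecuteRemediation ProcessingStrategy = 0

// stage is a plain-map substitute for a driver document type.
type stage = map[string]any

// buildProcessableInsertsPipeline matches inserts whose strategy is EXECUTE_REMEDIATION.
// The field path is hypothetical and stands in for wherever the strategy is stored.
func buildProcessableInsertsPipeline() []stage {
	return []stage{
		{
			"$match": stage{
				"operationType": "insert",
				// Cast to int32 because the enum is persisted as an integer.
				"fullDocument.healthevent.processingstrategy": int32(ExecuteRemediation),
			},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(buildProcessableInsertsPipeline(), "", "  ")
	if err != nil {
		panic(err)
	}

	fmt.Println(string(out))
}
```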

health-events-analyzer/pkg/reconciler/reconciler.go (3)

292-293: LGTM!

Correctly passes the rule reference to Publisher.Publish, enabling per-rule ProcessingStrategy overrides as defined in the rule configuration.


399-408: LGTM!

The pipeline correctly filters events to only process those with processingStrategy = EXECUTE_REMEDIATION, preventing STORE_ONLY events from triggering remediation rules. This aligns with the existing pattern of auto-adding critical filters in getPipelineStages. Based on learnings, this is the correct location for mandatory filters.


471-472: LGTM!

Passing nil for the rule parameter is appropriate here since XID burst detection is not driven by a configured rule. The publisher will correctly use the module-level default ProcessingStrategy.

store-client/pkg/client/postgresql_pipeline_builder.go (2)

119-132: LGTM!

The PostgreSQL implementation correctly mirrors the MongoDB builder, filtering for insert operations with processingStrategy = EXECUTE_REMEDIATION.


134-147: LGTM!

The function correctly includes both insert and update operation types, which is appropriate for PostgreSQL's trigger-based change detection. This defensive approach aligns with the patterns established in other PostgreSQL pipeline methods in this file (e.g., BuildNodeQuarantineStatusPipeline).

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (2)

14-17: LGTM!

The ProcessingStrategy enum and module-level constants are correctly defined, matching the protobuf schema changes.

Also applies to: 31-32


78-78: LGTM!

The processingStrategy field is correctly added to the HealthEvent class with proper protobuf stub generation including the slot declaration, field number constant, type annotation, and __init__ parameter. The type hint _Optional[_Union[ProcessingStrategy, str]] appropriately allows both enum values and string representations.

This .pyi stub file is properly generated from the proto source via the automated build pipeline. The generation is performed using protoc with the --pyi_out parameter, invoked through the Makefile target in health-monitors/gpu-health-monitor/Makefile, which is called from the root Makefile's protos-generate target. The repository correctly enforces regeneration from proto definitions through the protos-lint target that verifies generated files are up-to-date.

health-events-analyzer/pkg/publisher/publisher.go (3)

37-40: LGTM!

The PublisherConfig struct correctly includes the new processingStrategy field to store the module-level default strategy.


94-100: LGTM!

The constructor correctly accepts and stores the processingStrategy parameter. Based on learnings, the lack of nil validation for platformConnectorClient follows the established pattern where callers are expected to provide valid parameters.


113-123: LGTM!

The processing strategy logic correctly implements the default-with-override pattern:

  1. Sets the module-level default first
  2. Validates and applies rule-level override only when provided
  3. Returns a descriptive error for invalid strategy values

The error handling follows coding guidelines with proper context wrapping.
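
The three steps above can be sketched as follows. `resolveStrategy` and `strategyByName` are hypothetical names, and the rule override is modeled as a plain string where empty means "not set"; the real publisher works on the rule struct and the generated enum map.

```go
package main

import "fmt"

// ProcessingStrategy is a local stand-in for the generated protobuf enum.
type ProcessingStrategy int32

const (
	ExecuteRemediation ProcessingStrategy = 0
	StoreOnly          ProcessingStrategy = 1 // value assumed for this sketch
)

// strategyByName plays the role of the generated name-to-value map.
var strategyByName = map[string]ProcessingStrategy{
	"EXECUTE_REMEDIATION": ExecuteRemediation,
	"STORE_ONLY":          StoreOnly,
}

// resolveStrategy starts from the module default and applies a rule override if present.
func resolveStrategy(moduleDefault ProcessingStrategy, ruleOverride string) (ProcessingStrategy, error) {
	if ruleOverride == "" {
		return moduleDefault, nil // no rule, or rule without an override
	}

	strategy, ok := strategyByName[ruleOverride]
	if !ok {
		return moduleDefault, fmt.Errorf("invalid processing strategy %q in rule", ruleOverride)
	}

	return strategy, nil
}

func main() {
	fromDefault, _ := resolveStrategy(ExecuteRemediation, "")
	fromRule, _ := resolveStrategy(ExecuteRemediation, "STORE_ONLY")
	_, err := resolveStrategy(ExecuteRemediation, "store-only")

	fmt.Println(fromDefault, fromRule, err != nil) // prints "0 1 true"
}
```

The empty-override branch also covers callers that pass no rule at all, such as the XID burst detection path.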

store-client/pkg/client/pipeline_builder_test.go (3)

69-86: LGTM! Test properly updated to reflect pipeline builder changes.

The test correctly validates the new BuildProcessableHealthEventInsertsPipeline() method across both MongoDB and PostgreSQL implementations using a table-driven approach.


88-105: LGTM! New test provides appropriate coverage.

The test correctly validates the new BuildProcessableNonFatalUnhealthyInsertsPipeline() method and follows the established table-driven test pattern.


185-189: LGTM! Backward compatibility test properly updated.

The test correctly validates that the refactored BuildProcessableNonFatalUnhealthyInsertsPipeline() function works through the global helper interface.

health-events-analyzer/main.go (3)

93-96: LGTM! Pipeline correctly updated to filter by processing strategy.

The change to BuildProcessableNonFatalUnhealthyInsertsPipeline() ensures the analyzer only processes events with processingStrategy=EXECUTE_REMEDIATION, excluding observability-only events.


98-111: LGTM! Processing strategy properly wired through platform connection.

The connectToPlatform function signature correctly accepts the processingStrategy parameter and passes it to the publisher constructor.


121-122: LGTM! Processing strategy flag properly defined.

The --processing-strategy flag is well-defined with a sensible default of EXECUTE_REMEDIATION that maintains backward compatibility.

Signed-off-by: Tanisha goyal <[email protected]>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/helpers/kube.go (1)

2251-2256: Consider clarifying the documentation for boolean flags.

The function comment example shows {"--verbose": ""} but doesn't explicitly state that an empty string value indicates a boolean flag. Consider adding a note like "Empty string values indicate boolean flags (no value)".

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ef6d11 and f181148.

📒 Files selected for processing (2)
  • tests/health_events_analyzer_test.go
  • tests/helpers/kube.go
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/health_events_analyzer_test.go
  • tests/helpers/kube.go
**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/health_events_analyzer_test.go
🧠 Learnings (2)
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

  • tests/health_events_analyzer_test.go
📚 Learning: 2025-11-10T10:25:19.443Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

  • tests/health_events_analyzer_test.go
🧬 Code graph analysis (1)
tests/health_events_analyzer_test.go (5)
tests/helpers/health_events_analyzer.go (3)
  • HealthEventsAnalyzerTestContext (41-45)
  • SetupHealthEventsAnalyzerTest (47-85)
  • TeardownHealthEventsAnalyzer (241-250)
tests/helpers/kube.go (4)
  • SetDeploymentArgs (2255-2288)
  • NVSentinelNamespace (64-64)
  • NeverWaitTimeout (62-62)
  • WaitInterval (63-63)
tests/helpers/node_pool.go (2)
  • AcquireNodeFromPool (43-86)
  • DefaultExpiry (31-31)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_RESTART_VM (96-96)
tests/helpers/event_exporter.go (3)
  • GetMockEvents (36-99)
  • FindEventByNodeAndCheckName (221-254)
  • ValidateCloudEvent (257-283)
🔇 Additional comments (7)
tests/health_events_analyzer_test.go (4)

1535-1594: Setup logic is well-structured.

The test properly configures the STORE_ONLY processing strategy before acquiring a node and injecting XID events. Error handling and rollout waits are correctly implemented. The burst gap timing (17s) aligns with other tests in this file.


1606-1614: Potential flaky test: NeverWaitTimeout may be too short for positive assertion.

helpers.NeverWaitTimeout (10 seconds) is used here to wait for the event to appear in changestream. This timeout name suggests it's intended for negative assertions where you expect something to not happen. For positive assertions like "event should be exported", a longer timeout (e.g., EventuallyWaitTimeout) would be more robust and reduce test flakiness.

Verify that 10 seconds is sufficient for the changestream export to complete reliably, or consider using a longer timeout constant intended for positive assertions.


1618-1618: Verify empty expectedMessage is intentional.

An empty string is passed for expectedMessage, which means ValidateCloudEvent will assert that healthEvent["message"] equals "". Confirm this is the expected behavior for STORE_ONLY strategy events, or if a specific message should be validated.


1623-1647: Teardown correctly captures and handles errors.

The previous issue where RemoveDeploymentArgs error was discarded has been addressed. The teardown now properly captures the error and validates it before proceeding with the deployment rollout wait.

tests/helpers/kube.go (3)

2208-2249: LGTM! Well-implemented DaemonSet rollout waiter.

The function correctly polls the DaemonSet status until all desired pods are updated and ready, with informative logging at each step.


2290-2336: LGTM! Complex but correct argument manipulation logic.

The function properly handles three argument styles (--flag=value, --flag value, --flag) with appropriate bounds checking and slice manipulation.
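
To make the three styles concrete, here is a small sketch of setting one flag in an argument list. `setArg` is a hypothetical helper, not the function under review. It assumes an empty value denotes a boolean flag and treats the next token as the flag's value only when that token does not start with `-`.

```go
package main

import (
	"fmt"
	"strings"
)

// setArg returns a copy of args with flag set to value.
// An empty value means the flag is boolean and carries no value.
func setArg(args []string, flag, value string) []string {
	out := append([]string(nil), args...) // copy so the caller's slice is untouched

	for i, arg := range out {
		switch {
		case strings.HasPrefix(arg, flag+"="): // style: --flag=value
			if value == "" {
				out[i] = flag
			} else {
				out[i] = flag + "=" + value
			}

			return out
		case arg == flag: // style: --flag value, or bare --flag
			hasValue := i+1 < len(out) && !strings.HasPrefix(out[i+1], "-")

			switch {
			case value == "" && hasValue:
				return append(out[:i+1], out[i+2:]...) // drop the stale value
			case value != "" && hasValue:
				out[i+1] = value
			case value != "":
				out[i] = flag + "=" + value
			}

			return out
		}
	}

	// Flag not present yet: append it in the matching style.
	if value == "" {
		return append(out, flag)
	}

	return append(out, flag+"="+value)
}

func main() {
	args := []string{"--log-level=info", "--processing-strategy", "EXECUTE_REMEDIATION", "--verbose"}
	args = setArg(args, "--processing-strategy", "STORE_ONLY")
	args = setArg(args, "--log-level", "debug")

	fmt.Println(strings.Join(args, " "))
}
```

The "next token does not start with a dash" heuristic misreads negative numeric values, which is one reason the bounds checks and style handling in the real helper deserve the careful review noted above.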


2368-2393: LGTM! Correct argument removal logic.

The function properly handles both --flag=value and --flag value styles with safe slice manipulation and bounds checking.

return ctx
})

feature.Assess("Check if HealthEventsAnalyzerStoreOnlyStrategy node condition is added", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {

⚠️ Potential issue | 🟡 Minor

Assess description is misleading.

The description says "Check if ... node condition is added" but the test actually verifies that no node condition is created and validates the exported CloudEvent. Consider updating to match the actual assertion intent.

🔎 Suggested fix
-	feature.Assess("Check if HealthEventsAnalyzerStoreOnlyStrategy node condition is added", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
+	feature.Assess("Verify no node condition is created and CloudEvent is exported for STORE_ONLY strategy", func(ctx context.Context, t *testing.T, c *envconf.Config) context.Context {
🤖 Prompt for AI Agents
In tests/health_events_analyzer_test.go around line 1597, the Assess description
is misleading: it claims to check that a "node condition is added" while the
test actually asserts that no node condition is created and validates the
exported CloudEvent. Update the Assess description to accurately reflect the
test intent (e.g., "Check that HealthEventsAnalyzerStoreOnlyStrategy does NOT
add a node condition and exported CloudEvent is valid") so the description
matches the assertions and intent of the test.

Comment on lines +2341 to +2365
func RemoveDeploymentArgs(
	ctx context.Context, c klient.Client, deploymentName, namespace, containerName string, args map[string]string,
) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		deployment := &appsv1.Deployment{}
		if err := c.Resources().Get(ctx, deploymentName, namespace, deployment); err != nil {
			return err
		}

		if len(deployment.Spec.Template.Spec.Containers) == 0 {
			return fmt.Errorf("deployment %s/%s has no containers", namespace, deploymentName)
		}

		for i := range deployment.Spec.Template.Spec.Containers {
			container := &deployment.Spec.Template.Spec.Containers[i]

			if containerName != "" && container.Name != containerName {
				continue
			}

			removeArgsFromContainer(container, args)
		}

		return c.Resources().Update(ctx, deployment)
	})

⚠️ Potential issue | 🟡 Minor

Inconsistent behavior: missing container existence check.

Unlike SetDeploymentArgs (lines 2282-2284), this function does not verify that the specified container exists when containerName is non-empty. It will silently succeed even if the named container is not found, which could mask configuration errors in tests.

🔎 Proposed fix to add container existence check
 		return fmt.Errorf("deployment %s/%s has no containers", namespace, deploymentName)
 	}
 
+	found := false
+
 	for i := range deployment.Spec.Template.Spec.Containers {
 		container := &deployment.Spec.Template.Spec.Containers[i]
 
 		if containerName != "" && container.Name != containerName {
 			continue
 		}
 
+		found = true
+
 		removeArgsFromContainer(container, args)
 	}
 
+	if containerName != "" && !found {
+		return fmt.Errorf("container %q not found in deployment %s/%s", containerName, namespace, deploymentName)
+	}
+
 	return c.Resources().Update(ctx, deployment)
 })
🤖 Prompt for AI Agents
In tests/helpers/kube.go around lines 2341 to 2365, the RemoveDeploymentArgs
function lacks the existence check for a specific container name (unlike
SetDeploymentArgs), so it silently succeeds if the named container isn't
present; modify the function to track whether any container matching
containerName was found (e.g., set a found bool inside the loop when you process
the matching container), and after the loop if containerName != "" and found is
false return a clear error (fmt.Errorf("container %q not found in deployment
%s/%s", containerName, namespace, deploymentName)); keep this change inside the
existing RetryOnConflict closure so Update still only runs when appropriate.
