feat: add event handling strategy changes in gpu health monitor #611

tanishagoyal2 · 2025-12-22T11:01:28Z

Summary

Type of Change

Component(s) Affected

Testing

Tests pass locally
Manual testing completed
No breaking changes (or documented)

Checklist

Self-review completed
Documentation updated (if needed)
Ready for review

Summary by CodeRabbit

New Features
- Introduced a processing strategy option (EXECUTE_REMEDIATION, STORE_ONLY) with CLI/daemonset configuration and default EXECUTE_REMEDIATION; strategy is carried in exported CloudEvents and monitor messages.
Behavioral Changes
- STORE_ONLY events are persisted/exported but do not cause cluster state changes (no node conditions or Kubernetes events).
Tests
- Expanded tests and e2e coverage for strategy propagation and STORE_ONLY vs EXECUTE_REMEDIATION behavior, plus schema/storage validation.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Signed-off-by: Tanisha goyal <[email protected]>

coderabbitai · 2025-12-22T11:01:39Z

📝 Walkthrough

Walkthrough

Adds a ProcessingStrategy enum and processingStrategy field to HealthEvent and threads the strategy through CLI, event creation, exporters, storage pipelines, Kubernetes processing, and tests so STORE_ONLY events are stored without triggering remediation.

Changes

Cohort / File(s)	Summary
Protobuf & Generated Bindings `data-models/protobufs/health_event.proto`, `health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py`, `health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi`	Add `ProcessingStrategy` enum (EXECUTE_REMEDIATION, STORE_ONLY) and `processingStrategy` field (field 16) on `HealthEvent`; regenerate Python bindings and stubs.
DaemonSet Templates & Chart Values `distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml`, `.../daemonset-dcgm-4.x.yaml`, `distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml`	Add `--processing-strategy` container arg and new `processingStrategy` value (default EXECUTE_REMEDIATION) to chart values.
GPU Health Monitor CLI & Platform Connector (Python) `health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py`, `.../platform_connector/platform_connector.py`	Parse CLI `--processing-strategy`, thread `processing_strategy` into PlatformConnectorEventProcessor, store as `self._processing_strategy`, and embed into all produced `HealthEvent` messages.
Event Exporter (CloudEvents) `event-exporter/pkg/transformer/cloudevents.go`, `event-exporter/pkg/transformer/cloudevents_test.go`	Include `processingStrategy` in CloudEvent healthEvent payload; tests updated to assert propagation.
Kubernetes Connector: Filtering & Events `platform-connectors/pkg/connectors/kubernetes/process_node_events.go`, `platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go`	Add `filterProcessableEvents()` to skip STORE_ONLY events, add `createK8sEvent()` and `writeNodeEvent()` usage, and ensure only EXECUTE_REMEDIATION events update node conditions/create K8s events; tests validate behavior.
Storage Pipeline Builders & Interface `store-client/pkg/client/pipeline_builder.go`, `.../mongodb_pipeline_builder.go`, `.../postgresql_pipeline_builder.go`, `.../pipeline_builder_test.go`	Add `BuildProcessableHealthEventInsertsPipeline()` to interface and MongoDB/Postgres builders to watch inserts where `processingStrategy == EXECUTE_REMEDIATION`; tests added to assert pipeline shape.
Fault-Quarantine & Initializer `fault-quarantine/pkg/evaluator/rule_evaluator_test.go`, `fault-quarantine/pkg/initializer/init.go`	Tests updated to expect processingStrategy field; initializer switched to use `BuildProcessableHealthEventInsertsPipeline()`.
Tests & Helpers (integration/test utilities) `tests/helpers/healthevent.go`, `tests/helpers/event_exporter.go`, `tests/helpers/kube.go`, `tests/event_exporter_test.go`, `tests/gpu_health_monitor_test.go`	Add ProcessingStrategy to HealthEventTemplate and builder, extend ValidateCloudEvent to check processingStrategy, add DaemonSet arg update/rollout helpers, and add a store-only GPU health monitor test path.
Platform Connector Unit Tests (Python) `health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py`	Update tests to pass processing_strategy into processor init and assert `processingStrategy` propagation on created HealthEvents.
Postgres Init / Schema `distro.../values-tilt-postgresql.yaml`, `docs/postgresql-schema.sql`	Add `processing_strategy VARCHAR(50)` column to `health_events` schema and initialization SQL.
Event Exporter / Test Helpers Update `tests/event_exporter_test.go`, `tests/helpers/event_exporter.go`	Update tests and helper `ValidateCloudEvent` signature to require expected processingStrategy and validate it in healthEvent payloads.

Sequence Diagram

sequenceDiagram
    participant CLI as GPU Monitor CLI
    participant Init as _init_event_processor
    participant PC as PlatformConnectorEventProcessor
    participant HE as HealthEvent (protobuf)
    participant Exporter as Event Exporter
    participant Store as Storage (DB / pipeline)
    participant K8s as K8s Connector

    CLI->>Init: parse --processing-strategy
    Init->>PC: init(..., processing_strategy)
    PC->>PC: store self._processing_strategy

    PC->>HE: create HealthEvent(processingStrategy=self._processing_strategy)
    HE->>Exporter: emit CloudEvent (includes processingStrategy)
    Exporter->>Store: persist event (all strategies)

    rect rgba(200,230,200,0.6)
      note right of Store: Watch pipeline filters EXECUTE_REMEDIATION
      Store->>K8s: deliver only EXECUTE_REMEDIATION events
      K8s->>K8s: update Node conditions / create Node events
    end

    rect rgba(230,200,200,0.6)
      note right of Store: STORE_ONLY path
      Store->>Store: stored for observability (no K8s remediation)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

"A rabbit hops with code and glee,
I threaded strategy through each tree.
STORE or heal — the events now say,
I stash the logs or act today.
🐇✨"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 32.56% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: add event handling strategy changes in gpu health monitor' accurately describes the main change—introducing a processing/event handling strategy feature to the GPU health monitor, which is reflected across all 29 modified files.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (5)

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)
325-343: Add godoc for the exported function.

The function filterProcessableEvents is exported (starts with uppercase) but lacks a proper godoc comment. Per coding guidelines, function comments are required for all exported Go functions. The comment on line 325 is present but should follow godoc format.

Also, consider nil-safety: if healthEvents or healthEvents.Events is nil, this could panic.
🔎 Suggested improvement
-// filterProcessableEvents filters out STORE_ONLY events that should not create node conditions or K8s events.
-func filterProcessableEvents(healthEvents *protos.HealthEvents) []*protos.HealthEvent {
+// filterProcessableEvents filters out STORE_ONLY events that should not create node conditions or K8s events.
+// It returns only events with processing strategy other than STORE_ONLY.
+func filterProcessableEvents(healthEvents *protos.HealthEvents) []*protos.HealthEvent {
+	if healthEvents == nil || len(healthEvents.Events) == 0 {
+		return nil
+	}
+
 	var processableEvents []*protos.HealthEvent
 
 	for _, healthEvent := range healthEvents.Events {
tests/gpu_health_monitor_test.go (1)
502-510: Teardown clears "Memory" error that was never injected.

The setup only injects an Inforom error (field 84), but the teardown clearCommands includes both Inforom and Memory errors. While this is harmless (clearing a non-existent error is a no-op), it adds unnecessary operations. Consider matching the teardown to what was actually injected.
🔎 Suggested simplification
 		clearCommands := []struct {
 			name      string
 			fieldID   string
 			value     string
 			condition string
 		}{
 			{"Inforom", "84", "1", "GpuInforomWatch"},
-			{"Memory", "395", "0", "GpuMemWatch"},
 		}
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (1)
125-132: Consider logging the strategy name instead of the integer value.

Line 132 logs processing_strategy_value, which is the integer enum value (e.g., 0 or 1). For better readability in logs, consider logging the original string name or using ProcessingStrategy.Name().
🔎 Suggested improvement
-    log.info(f"Event handling strategy configured to: {processing_strategy_value}")
+    log.info(f"Event handling strategy configured to: {processing_strategy}")
This logs the human-readable string (e.g., EXECUTE_REMEDIATION) instead of the integer value.
tests/helpers/kube.go (2)
2306-2307: Hardcoded sleep after rollout completion is fragile.

The 10-second sleep after waitForDaemonSetRollout adds unnecessary delay. The rollout wait already ensures pods are ready. If additional startup time is genuinely needed, consider polling for a specific readiness indicator instead.
🔎 Suggested improvement

Remove the hardcoded sleep or replace it with a condition-based wait if there's a specific startup behavior to await:
 	t.Logf("Waiting for daemonset %s/%s rollout to complete", NVSentinelNamespace, daemonsetName)
 	waitForDaemonSetRollout(ctx, t, client, daemonsetName)

-	t.Logf("Waiting 10 seconds for daemonset pods to start")
-	time.Sleep(10 * time.Second)
-
 	return originalDaemonSet, nil
2270-2272: Add godoc comments for exported functions.

Per coding guidelines, exported Go functions require function comments. The new exported functions UpdateDaemonSetProcessingStrategy, RestoreDaemonSet, and GetDaemonSetPodOnWorkerNode are missing godoc comments.
🔎 Suggested improvement
+// UpdateDaemonSetProcessingStrategy updates the specified container in a DaemonSet to use
+// STORE_ONLY processing strategy, waits for rollout completion, and returns the original DaemonSet.
 func UpdateDaemonSetProcessingStrategy(ctx context.Context, t *testing.T,
 	client klient.Client, daemonsetName string, containerName string) (*appsv1.DaemonSet, error) {
+// RestoreDaemonSet restores a DaemonSet's containers to their original state and waits for rollout.
 func RestoreDaemonSet(ctx context.Context, t *testing.T, client klient.Client,
 	originalDaemonSet *appsv1.DaemonSet, daemonsetName string,
 ) error {
+// GetDaemonSetPodOnWorkerNode returns a running, ready pod from the DaemonSet on a real worker node.
 func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
 	daemonsetName string, podNamePattern string) (*v1.Pod, error) {
As per coding guidelines, function comments are required for all exported Go functions.

Also applies to: 2312-2314, 2346-2348

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61f47cb and 8cd1107.

⛔ Files ignored due to path filters (1)

data-models/pkg/protos/health_event.pb.go is excluded by !**/*.pb.go

📒 Files selected for processing (23)

data-models/protobufs/health_event.proto
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
event-exporter/pkg/transformer/cloudevents.go
event-exporter/pkg/transformer/cloudevents_test.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
fault-quarantine/pkg/initializer/init.go
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
store-client/pkg/client/mongodb_pipeline_builder.go
store-client/pkg/client/pipeline_builder.go
store-client/pkg/client/pipeline_builder_test.go
store-client/pkg/client/postgresql_pipeline_builder.go
tests/event_exporter_test.go
tests/gpu_health_monitor_test.go
tests/helpers/healthevent.go
tests/helpers/kube.go

🧰 Additional context used

📓 Path-based instructions (7)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

fault-quarantine/pkg/initializer/init.go
store-client/pkg/client/mongodb_pipeline_builder.go
store-client/pkg/client/postgresql_pipeline_builder.go
store-client/pkg/client/pipeline_builder.go
store-client/pkg/client/pipeline_builder_test.go
tests/helpers/healthevent.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
tests/event_exporter_test.go
platform-connectors/pkg/connectors/kubernetes/process_node_events.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
event-exporter/pkg/transformer/cloudevents_test.go
tests/helpers/kube.go
event-exporter/pkg/transformer/cloudevents.go
tests/gpu_health_monitor_test.go

data-models/protobufs/**/*.proto

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

data-models/protobufs/**/*.proto: Define Protocol Buffer messages in data-models/protobufs/ directory
Use semantic versioning for breaking changes in Protocol Buffer messages
Include comprehensive comments for all fields in Protocol Buffer messages

Files:

data-models/protobufs/health_event.proto

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

store-client/pkg/client/pipeline_builder_test.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
tests/event_exporter_test.go
platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go
event-exporter/pkg/transformer/cloudevents_test.go
tests/gpu_health_monitor_test.go

**/values.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml

**/daemonset*.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Explain DaemonSet variant selection logic in Helm chart documentation

Files:

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml

distros/kubernetes/**/*daemonset*.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes using nodeAffinity based on kata.enabled label
Regular node DaemonSets should use /var/log volume mount for file-based logs
Kata node DaemonSets should use /run/log/journal and /var/log/journal volume mounts for systemd journal

Files:

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml

**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py

🧠 Learnings (6)

📚 Learning: 2025-11-10T10:25:19.443Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 248
File: distros/kubernetes/nvsentinel/charts/health-events-analyzer/values.yaml:87-124
Timestamp: 2025-11-10T10:25:19.443Z
Learning: In the NVSentinel health-events-analyzer, the agent filter to exclude analyzer-published events (`"healthevent.agent": {"$ne": "health-events-analyzer"}`) is automatically added as the first stage in getPipelineStages() function in health-events-analyzer/pkg/reconciler/reconciler.go, not in individual rule configurations in values.yaml.

Applied to files:

fault-quarantine/pkg/initializer/init.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

tests/event_exporter_test.go
tests/gpu_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

tests/event_exporter_test.go
tests/gpu_health_monitor_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `testify/assert` and `testify/require` for assertions in Go tests

Applied to files:

tests/event_exporter_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods

Applied to files:

tests/event_exporter_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation

Applied to files:

tests/helpers/kube.go

🧬 Code graph analysis (10)

store-client/pkg/client/mongodb_pipeline_builder.go (2)

store-client/pkg/datastore/types.go (4)

ToPipeline (161-163)

D (131-133)

E (126-128)

A (136-138)

data-models/pkg/protos/health_event.pb.go (1)

ProcessingStrategy_EXECUTE_REMEDIATION (46-46)

store-client/pkg/client/postgresql_pipeline_builder.go (2)

store-client/pkg/datastore/types.go (4)

ToPipeline (161-163)

D (131-133)

E (126-128)

A (136-138)

data-models/pkg/protos/health_event.pb.go (1)

ProcessingStrategy_EXECUTE_REMEDIATION (46-46)

store-client/pkg/client/pipeline_builder_test.go (3)

store-client/pkg/client/pipeline_builder.go (1)

PipelineBuilder (26-47)

store-client/pkg/client/mongodb_pipeline_builder.go (1)

NewMongoDBPipelineBuilder (29-31)

store-client/pkg/client/postgresql_pipeline_builder.go (1)

NewPostgreSQLPipelineBuilder (29-31)

tests/helpers/healthevent.go (2)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-17)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

tests/event_exporter_test.go (1)

tests/helpers/event_exporter.go (1)

ValidateCloudEvent (221-245)

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)

data-models/pkg/protos/health_event.pb.go (11)

HealthEvents (156-162)

HealthEvents (175-175)

HealthEvents (190-192)

HealthEvent (260-280)

HealthEvent (293-293)

HealthEvent (308-310)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

ProcessingStrategy_STORE_ONLY (47-47)

event-exporter/pkg/transformer/cloudevents_test.go (2)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-17)

data-models/pkg/protos/health_event.pb.go (5)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

ProcessingStrategy_STORE_ONLY (47-47)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

event-exporter/pkg/transformer/cloudevents.go (2)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-17)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

tests/gpu_health_monitor_test.go (3)

tests/helpers/kube.go (10)

UpdateDaemonSetProcessingStrategy (2271-2310)

GetDaemonSetPodOnWorkerNode (2346-2390)

NVSentinelNamespace (64-64)

SetNodeManagedByNVSentinel (1365-1384)

ExecInPod (1539-1574)

EnsureNodeConditionNotPresent (1773-1794)

RestoreDaemonSet (2312-2344)

CheckNodeConditionExists (1259-1276)

EventuallyWaitTimeout (61-61)

WaitInterval (63-63)

tests/helpers/metadata.go (3)

CreateTestMetadata (59-108)

InjectMetadata (110-181)

DeleteMetadata (183-229)

tests/helpers/fault_quarantine.go (2)

AssertQuarantineState (315-382)

QuarantineAssertion (56-60)

🔇 Additional comments (44)

platform-connectors/pkg/connectors/kubernetes/k8s_platform_connector_test.go (2)

1391-1506: Well-structured test suite for processing strategy.

The test suite properly validates the new ProcessingStrategy filtering behavior with good coverage:

STORE_ONLY events (fatal and non-fatal) correctly don't create node conditions or K8s events

EXECUTE_REMEDIATION events properly create node conditions

Mixed strategy scenarios validate selective processing

The table-driven approach and per-test isolation with local clientsets align well with the coding guidelines.

1508-1587: Test implementation looks correct.

The test harness correctly:

Creates isolated environments per test case

Filters standard K8s node conditions (NodeReady, NodeMemoryPressure, etc.) when counting NVSentinel conditions

Validates both the presence/absence of node conditions and K8s events

One minor observation: the description field is logged at the end but could be incorporated into the assertion failure messages for better debugging context, though the current approach is acceptable.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go (1)

372-416: Implementation correctly integrates filtering.

The processHealthEvents function properly:

Filters events using filterProcessableEvents first

Uses only processableEvents for both node condition updates and K8s event creation

Maintains the existing logic for fatal vs non-fatal event handling

One edge case to consider: if processableEvents is empty (all events are STORE_ONLY), the function returns early without errors, which appears to be the intended behavior.

event-exporter/pkg/transformer/cloudevents.go (1)

66-66: Correctly propagates processingStrategy to CloudEvent.

The addition of processingStrategy to the CloudEvent data payload is consistent with how other enum fields (e.g., recommendedAction on line 61) are handled, using .String() for serialization.

fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)

263-263: Test correctly updated for new processingStrategy field.

The expected map now includes processingStrategy: float64(0), which correctly reflects:

The default enum value EXECUTE_REMEDIATION = 0 when no explicit value is set

The JSON unmarshaling behavior where numbers become float64 in interface{}

store-client/pkg/client/pipeline_builder_test.go (1)

69-86: Test follows established patterns correctly.

The new test TestProcessableHealthEventInsertsPipeline is well-structured and consistent with the existing test patterns in this file:

Uses table-driven tests for both MongoDB and PostgreSQL implementations

Properly validates pipeline is non-nil, non-empty, and has exactly 1 stage

Uses require for critical assertions and assert for validations

As per coding guidelines, consider using a more descriptive test name format like TestPipelineBuilder_ProcessableHealthEventInserts to align with TestFunctionName_Scenario_ExpectedBehavior.

fault-quarantine/pkg/initializer/init.go (1)

66-66: LGTM! Pipeline switch correctly filters for processable events.

The change from BuildAllHealthEventInsertsPipeline() to BuildProcessableHealthEventInsertsPipeline() correctly ensures that fault-quarantine only processes health events with processingStrategy=EXECUTE_REMEDIATION, ignoring observability-only (STORE_ONLY) events.

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml (1)

61-62: LGTM! Processing strategy argument properly configured.

The --processing-strategy argument is correctly added with the value sourced from .Values.processingStrategy and properly quoted. This aligns with the PR objectives to enable configurable event handling strategy.

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml (1)

57-62: LGTM! Well-documented configuration option.

The processingStrategy configuration is clearly documented with:

Valid values (EXECUTE_REMEDIATION, STORE_ONLY)

Default value that maintains backward compatibility

Clear explanations of each mode's behavior

This follows the coding guidelines for Helm chart documentation.

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml (1)

61-62: LGTM! Consistent with DCGM 3.x template.

The --processing-strategy argument is added consistently with the DCGM 3.x DaemonSet template, ensuring uniform behavior across DCGM versions.

store-client/pkg/client/postgresql_pipeline_builder.go (2)

19-19: LGTM! Import added for ProcessingStrategy enum.

The import of github.com/nvidia/nvsentinel/data-models/pkg/protos is necessary to reference the ProcessingStrategy_EXECUTE_REMEDIATION enum value used in the new pipeline.

119-132: LGTM! Pipeline correctly filters for processable events.

The new BuildProcessableHealthEventInsertsPipeline() method:

Follows the established pipeline pattern from BuildAllHealthEventInsertsPipeline()

Correctly filters INSERT operations where processingStrategy equals EXECUTE_REMEDIATION

Uses the appropriate int32 cast for the protobuf enum value

This enables PostgreSQL change streams to ignore observability-only events (STORE_ONLY).

event-exporter/pkg/transformer/cloudevents_test.go (2)

69-69: LGTM! Test correctly initializes ProcessingStrategy.

The test event properly sets ProcessingStrategy: pb.ProcessingStrategy_STORE_ONLY to validate that the new field is handled correctly during CloudEvent transformation.

106-108: LGTM! Test validates ProcessingStrategy propagation.

The assertion correctly verifies that the processingStrategy field appears in the CloudEvent payload with the expected string value "STORE_ONLY", ensuring proper transformation from the protobuf enum.

store-client/pkg/client/pipeline_builder.go (1)

35-38: LGTM! Interface extension well-documented.

The new BuildProcessableHealthEventInsertsPipeline() method:

Is clearly documented with its purpose and use case

Explains the filtering behavior (processingStrategy=EXECUTE_REMEDIATION)

References the consumer (fault-quarantine)

Follows the documentation pattern of other interface methods

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (2)

14-17: LGTM! ProcessingStrategy enum properly defined.

The new ProcessingStrategy enum class with EXECUTE_REMEDIATION and STORE_ONLY members is correctly structured according to Python protobuf stub conventions.

31-32: LGTM! HealthEvent ProcessingStrategy integration complete.

The processingStrategy field is properly integrated into the HealthEvent message with:

Module-level enum constants (lines 31-32)

Field in __slots__ (line 78)

Field number constant (line 104)

Type annotation (line 120)

__init__ parameter with correct typing (line 138)

This generated stub correctly reflects the protobuf schema changes.

Also applies to: 78-78, 104-104, 120-120, 138-138

tests/event_exporter_test.go (1)

25-26: Import addition looks correct.

The import of "tests/helpers" is properly added and used throughout the test file for helper functions.

data-models/protobufs/health_event.proto (2)

32-38: Well-designed enum with appropriate default value.

Using EXECUTE_REMEDIATION = 0 as the default is correct design—existing clients that don't set the field will get the expected remediation behavior. The comments clearly explain each strategy's semantics, which aligns with the coding guidelines for comprehensive field documentation in Protocol Buffer messages.

77-77: Field addition looks correct.

The new processingStrategy field at position 16 maintains backward compatibility and follows the existing field numbering sequence.

store-client/pkg/client/mongodb_pipeline_builder.go (2)

87-100: Pipeline implementation is correct.

The method follows existing patterns and correctly filters for EXECUTE_REMEDIATION events using the protobuf enum's integer value. This ensures only events intended for remediation are processed by downstream consumers.

19-19: Import addition is appropriate.

The protos import is required to reference ProcessingStrategy_EXECUTE_REMEDIATION constant.

tests/helpers/healthevent.go (2)

48-48: Field addition follows existing patterns.

The ProcessingStrategy field with omitempty JSON tag is consistent with other optional fields in the struct. The default zero value maps to EXECUTE_REMEDIATION, which is the expected default behavior.

153-156: Builder method follows established conventions.

The WithProcessingStrategy method maintains consistency with other builder methods in the file.

tests/gpu_health_monitor_test.go (2)

414-463: Test structure and setup are well-organized.

The test properly:

Updates the DaemonSet to use STORE_ONLY strategy

Waits for rollout completion

Injects a GPU error to trigger event generation

Stores context values for teardown

The use of UpdateDaemonSetProcessingStrategy and GetDaemonSetPodOnWorkerNode helpers keeps the test readable.

465-481: Test assertions correctly validate STORE_ONLY behavior.

The test verifies that with STORE_ONLY processing strategy:

Node conditions are NOT applied

Node is NOT cordoned

No quarantine annotation is present

This validates the core feature that STORE_ONLY events are observed but don't modify cluster state.

health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (5)

107-108: Processing strategy parameter correctly added to test initialization.

The test now passes platformconnector_pb2.STORE_ONLY to the processor, enabling verification of the new processing strategy flow.

301-302: Good assertion for processingStrategy propagation.

Verifying that nvlink_failure_event.processingStrategy == platformconnector_pb2.STORE_ONLY confirms the strategy is correctly propagated through the event pipeline to the generated HealthEvent.

523-524: Test uses EXECUTE_REMEDIATION for connectivity restored scenario.

This is appropriate—the connectivity restored test uses EXECUTE_REMEDIATION which tests the alternative processing path, providing coverage for both enum values.

549-549: Assertion validates EXECUTE_REMEDIATION propagation.

This confirms the test covers the EXECUTE_REMEDIATION strategy path in the restored connectivity scenario.

493-493: Comprehensive assertion for DCGM connectivity failure event.

The test verifies all expected fields including the new processingStrategy field, ensuring the complete event structure is validated.

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.py (1)

1-51: LGTM - Generated protobuf code.

This is an auto-generated file from the protobuf compiler. The changes correctly reflect the addition of the ProcessingStrategy enum and the processingStrategy field (field number 16) to the HealthEvent message, as indicated by the updated serialized descriptor and adjusted byte ranges.

health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (3)

24-24: LGTM - Import alias consistent with existing usage.

The alias platformconnector_pb2 for health_event_pb2 maintains consistency with the existing import pattern in platform_connector.py.

74-80: LGTM - CLI option well-defined.

The new --processing-strategy option is appropriately configured with a sensible default (EXECUTE_REMEDIATION) and a clear help string describing the valid values.

28-51: LGTM - Parameter threading through event processor initialization.

The processing_strategy parameter is correctly added to _init_event_processor signature and passed through to PlatformConnectorEventProcessor.

tests/helpers/kube.go (4)

2208-2249: LGTM - DaemonSet rollout wait logic.

The rollout status checks correctly verify that all desired pods are scheduled, updated, and ready before considering the rollout complete.

2251-2268: LGTM - Container argument update handles both formats.

The helper correctly handles both --processing-strategy=VALUE and --processing-strategy VALUE argument formats, and appends the argument if not present.

2312-2344: LGTM - RestoreDaemonSet correctly handles nil input.

The function appropriately handles the case where originalDaemonSet is nil by logging and returning early.

2346-2390: LGTM - GetDaemonSetPodOnWorkerNode with proper readiness checks.

The function correctly verifies the pod is not being deleted, is in Running phase, and passes all readiness conditions before returning.

health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (5)

41-66: LGTM - Processing strategy parameter added to constructor.

The processing_strategy parameter is correctly typed with platformconnector_pb2.ProcessingStrategy and stored as an instance variable for use in event creation.

106-121: LGTM - Processing strategy propagated in connectivity restored events.

The processingStrategy field is correctly included in the HealthEvent for DCGM connectivity restoration.

206-223: LGTM - Processing strategy propagated in health failure events.

The processingStrategy field is correctly included when creating health events for GPU failures.

270-287: LGTM - Processing strategy propagated in health recovery events.

The processingStrategy field is correctly included when creating health events indicating GPU recovery.

366-381: LGTM - Processing strategy propagated in connectivity failure events.

The processingStrategy field is correctly included in the HealthEvent for DCGM connectivity failures.

platform-connectors/pkg/connectors/kubernetes/process_node_events.go

tests/event_exporter_test.go

tests/gpu_health_monitor_test.go

github-actions · 2025-12-22T11:20:58Z

Merging this branch will decrease overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/nvidia/nvsentinel/fault-quarantine/pkg/evaluator	41.95% (-0.57%)	👎
github.com/nvidia/nvsentinel/fault-quarantine/pkg/initializer	0.00% (ø)
github.com/nvidia/nvsentinel/tests	0.00% (ø)
github.com/nvidia/nvsentinel/tests/helpers	0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Missed
github.com/nvidia/nvsentinel/fault-quarantine/pkg/initializer/init.go	0.00% (ø)	280	280
github.com/nvidia/nvsentinel/tests/helpers/healthevent.go	0.00% (ø)	0	0
github.com/nvidia/nvsentinel/tests/helpers/kube.go	0.00% (ø)	0	0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

github.com/nvidia/nvsentinel/fault-quarantine/pkg/evaluator/rule_evaluator_test.go
github.com/nvidia/nvsentinel/tests/event_exporter_test.go
github.com/nvidia/nvsentinel/tests/gpu_health_monitor_test.go

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (5)

tests/helpers/event_exporter.go (1)
221-226: Add function documentation per coding guidelines.

The exported function ValidateCloudEvent lacks a godoc comment. As per coding guidelines, "Function comments required for all exported Go functions." Please add documentation explaining the function's purpose and parameters.
🔎 Suggested documentation
+// ValidateCloudEvent validates that a CloudEvent has the expected structure and content.
+// It checks the CloudEvent spec version, type, source, and validates the embedded HealthEvent
+// fields including node name, message, check name, error code, and processing strategy.
 func ValidateCloudEvent(
 	t *testing.T,
 	event map[string]any,
 	expectedNodeName, expectedMessage, expectedCheckName, expectedErrorCode string,
 	expectedProcessingStrategy string,
 ) {
Based on coding guidelines: "Function comments required for all exported Go functions"
tests/helpers/kube.go (4)
2257-2269: Unused variable originalDaemonSet.

The originalDaemonSet variable is assigned on line 2268 but never used. This appears to be dead code, possibly leftover from a previous implementation that intended to restore the original state.
🔎 Proposed fix
 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		daemonSet := &appsv1.DaemonSet{}
 		if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil {
 			return err
 		}
 
-		if originalDaemonSet == nil {
-			originalDaemonSet = daemonSet.DeepCopy()
-		}
-
 		containers := daemonSet.Spec.Template.Spec.Containers
2288-2289: Consider removing redundant sleep.

The 10-second sleep appears redundant since waitForDaemonSetRollout already waits until all pods are ready (NumberReady == DesiredNumberScheduled). If there's a specific edge case requiring this delay (e.g., waiting for readiness probes to stabilize), consider documenting it; otherwise, this could be removed.

2294-2327: Inconsistent error handling pattern.

This function mixes two error handling approaches: it returns an error but also calls require.NoError on line 2319 which will fail the test immediately. This is inconsistent with UpdateDaemonSetArgs which only returns errors.

If the function returns an error, callers should handle it. Using require.NoError makes the return value meaningless since the test fails before returning.
🔎 Proposed fix - Option A: Return error consistently (like UpdateDaemonSetArgs)
 	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		daemonSet := &appsv1.DaemonSet{}
 		if err := client.Resources().Get(ctx, daemonsetName, NVSentinelNamespace, daemonSet); err != nil {
 			return err
 		}
 
 		containers := daemonSet.Spec.Template.Spec.Containers
 
 		for i := range containers {
 			if containers[i].Name == containerName {
 				removeArgsFromContainer(&containers[i], args)
 				break
 			}
 		}
 
 		return client.Resources().Update(ctx, daemonSet)
 	})
-	require.NoError(t, err, "failed to remove args from daemonset %s/%s", NVSentinelNamespace, daemonsetName)
+	if err != nil {
+		return fmt.Errorf("failed to remove args from daemonset %s/%s: %w", NVSentinelNamespace, daemonsetName, err)
+	}
2409-2453: Parameter daemonsetName is only used in error message, not for pod selection.

The function signature accepts daemonsetName but it's only used in the error message on line 2449. The actual pod selection uses podNamePattern via GetPodOnWorkerNode. This could be misleading since the function name implies it validates the pod belongs to the specified DaemonSet.

Consider either:

Removing the daemonsetName parameter if pattern matching is sufficient

Adding validation that the pod is actually owned by the specified DaemonSet
🔎 Option 1: Remove unused parameter
-func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
-	daemonsetName string, podNamePattern string) (*v1.Pod, error) {
+func GetDaemonSetPodOnWorkerNode(ctx context.Context, t *testing.T, client klient.Client,
+	podNamePattern string) (*v1.Pod, error) {
 	t.Helper()
 
 	var resultPod *v1.Pod
 
 	require.Eventually(t, func() bool {
 		// Get the pod
 		pod, err := GetPodOnWorkerNode(ctx, t, client, NVSentinelNamespace, podNamePattern)
 		if err != nil {
 			t.Logf("Failed to get pod: %v", err)
 			return false
 		}
         ...
 	}, EventuallyWaitTimeout, WaitInterval, "daemonset pod from current rollout should be running and ready")
 
 	if resultPod == nil {
-		return nil, fmt.Errorf("failed to get ready pod for daemonset %s", daemonsetName)
+		return nil, fmt.Errorf("failed to get ready pod matching pattern %s", podNamePattern)
 	}
 
 	return resultPod, nil
 }

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8cd1107 and 8398875.

📒 Files selected for processing (3)

tests/gpu_health_monitor_test.go
tests/helpers/event_exporter.go
tests/helpers/kube.go

🚧 Files skipped from review as they are similar to previous changes (1)

tests/gpu_health_monitor_test.go

🧰 Additional context used

📓 Path-based instructions (1)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

tests/helpers/event_exporter.go
tests/helpers/kube.go

🧠 Learnings (1)

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation

Applied to files:

tests/helpers/kube.go

🔇 Additional comments (5)

tests/helpers/event_exporter.go (1)

245-245: LGTM! Validation follows the established pattern.

The validation of processingStrategy correctly follows the same pattern as other field validations in this function, using require.Equal consistently.

tests/helpers/kube.go (4)

2208-2249: LGTM!

The waitForDaemonSetRollout function correctly polls the DaemonSet status and follows the same patterns as the existing WaitForDeploymentRollout function. The rollout completion checks for DesiredNumberScheduled, UpdatedNumberScheduled, and NumberReady are appropriate.

2329-2357: LGTM!

The tryUpdateExistingArg helper correctly handles both --flag=value and --flag value argument styles. The slice manipulation for inserting a value after a flag is correct.

2359-2381: LGTM!

The function correctly sets container arguments, leveraging tryUpdateExistingArg to handle existing args and appending new ones as needed.

2383-2407: LGTM!

The function correctly handles removal of both --flag=value and --flag value style arguments, appropriately breaking after modification to avoid issues with slice iteration.

Signed-off-by: Tanisha goyal <[email protected]>

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (8)

tests/helpers/healthevent.go (1)
48-48: Consider using the protobuf enum type for type safety.

The ProcessingStrategy field uses int instead of protos.ProcessingStrategy. While this provides flexibility in tests, it loses type safety and may allow invalid values.

Consider whether type safety is valuable here:
🔎 Option to use the enum type

If you prefer compile-time type checking:
+import "github.com/nvidia/nvsentinel/data-models/pkg/protos"
+
 type HealthEventTemplate struct {
 	Version             int                  `json:"version"`
 	Agent               string               `json:"agent"`
 	ComponentClass      string               `json:"componentClass,omitempty"`
 	CheckName           string               `json:"checkName"`
 	IsFatal             bool                 `json:"isFatal"`
 	IsHealthy           bool                 `json:"isHealthy"`
 	Message             string               `json:"message"`
 	RecommendedAction   int                  `json:"recommendedAction,omitempty"`
 	ErrorCode           []string             `json:"errorCode,omitempty"`
 	EntitiesImpacted    []EntityImpacted     `json:"entitiesImpacted,omitempty"`
 	Metadata            map[string]string    `json:"metadata,omitempty"`
 	QuarantineOverrides *QuarantineOverrides `json:"quarantineOverrides,omitempty"`
 	NodeName            string               `json:"nodeName"`
-	ProcessingStrategy  int                  `json:"processingStrategy,omitempty"`
+	ProcessingStrategy  protos.ProcessingStrategy `json:"processingStrategy,omitempty"`
 }
Then update the builder:
-func (h *HealthEventTemplate) WithProcessingStrategy(strategy int) *HealthEventTemplate {
+func (h *HealthEventTemplate) WithProcessingStrategy(strategy protos.ProcessingStrategy) *HealthEventTemplate {
 	h.ProcessingStrategy = strategy
 	return h
 }
tests/helpers/kube.go (1)
2312-2313: Consider if the 10-second sleep is necessary.

After waitForDaemonSetRollout completes, all pods are confirmed updated and ready. The additional 10-second sleep may be unnecessary unless there's a specific stabilization requirement not covered by the readiness checks.

If the sleep is for pod initialization beyond readiness, consider adding a comment explaining why. Otherwise, this delay might be removable:
 	t.Logf("Waiting for daemonset %s/%s rollout to complete", NVSentinelNamespace, daemonsetName)
 	waitForDaemonSetRollout(ctx, t, client, daemonsetName)
 
-	t.Logf("Waiting 10 seconds for daemonset pods to start")
-	time.Sleep(10 * time.Second)
-
 	return nil
 }
docs/postgresql-schema.sql (1)
106-109: Consider adding constraints for data integrity.

The processing_strategy column is nullable and has no constraints. Consider adding:

A CHECK constraint to ensure only valid enum values are stored

A NOT NULL constraint with a default value if every health event should have a strategy
🔎 Option to add constraints

If you want to enforce valid values at the database level:
     -- Metadata
     created_at TIMESTAMPTZ DEFAULT NOW(),
-    updated_at TIMESTAMPTZ DEFAULT NOW(),
+    updated_at TIMESTAMPTZ DEFAULT NOW() NOT NULL,
 
     -- Event handling strategy
-    processing_strategy VARCHAR(50)
+    processing_strategy VARCHAR(50) NOT NULL DEFAULT 'EXECUTE_REMEDIATION' 
+        CHECK (processing_strategy IN ('EXECUTE_REMEDIATION', 'STORE_ONLY'))
 );
This prevents invalid values and ensures consistency, but reduces flexibility if new enum values are added later without a migration.
tests/gpu_health_monitor_test.go (1)
488-489: Use defined constants instead of hardcoded strings.

Lines 488-489 use hardcoded strings "gpu-health-monitor-dcgm-4.x" and "gpu-health-monitor" instead of the constants GPUHealthMonitorDaemonSetName and GPUHealthMonitorContainerName defined at lines 42-43.
🔎 Suggested fix
-		err = helpers.RemoveDaemonSetArgs(ctx, t, client, "gpu-health-monitor-dcgm-4.x", "gpu-health-monitor", map[string]string{
+		err = helpers.RemoveDaemonSetArgs(ctx, t, client, GPUHealthMonitorDaemonSetName, GPUHealthMonitorContainerName, map[string]string{
 			"--processing-strategy": "EXECUTE_REMEDIATION"})
tests/platform-connector_test.go (2)
28-32: Remove unused struct fields.

ConfigMapBackup and TestNamespace fields are defined but never used in the test. Consider removing them to keep the code clean.
🔎 Suggested fix
 type PlatformConnectorTestContext struct {
 	NodeName        string
-	ConfigMapBackup []byte
-	TestNamespace   string
 }
98-101: Teardown sends healthy event but doesn't verify cleanup of STORE_ONLY events.

The teardown only sends a healthy event. Consider verifying that any state from the STORE_ONLY test cases is properly cleaned up, or add a comment explaining why no cleanup verification is needed.
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (1)
125-132: Good error handling for invalid processing strategy.

The validation correctly:

Uses protobuf's Value() method to validate and convert the string

Catches ValueError for invalid inputs

Logs all valid options to help users correct their configuration

Exits with code 1 on invalid input

One minor note: Line 132 logs processing_strategy_value which is the integer enum value. Consider logging the string name for better readability.
🔎 Optional: Log the strategy name for better readability
-    log.info(f"Event handling strategy configured to: {processing_strategy_value}")
+    log.info(f"Event handling strategy configured to: {processing_strategy}")
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (1)
125-135: Consider adding processingStrategy assertion to first test case.

The test_health_event_occurred test at lines 125-135 verifies event properties but doesn't assert on processingStrategy. While other tests cover this, adding an assertion here would ensure complete coverage.
🔎 Suggested addition
             if event.checkName == "GpuInforomWatch" and event.isHealthy == False:
                 assert event.errorCode[0] == "DCGM_FR_CORRUPT_INFOROM"
                 assert event.entitiesImpacted[0].entityValue == "0"
                 assert event.recommendedAction == platformconnector_pb2.RecommendedAction.COMPONENT_RESET
+                assert event.processingStrategy == platformconnector_pb2.STORE_ONLY
             else:

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8398875 and 232c636.

📒 Files selected for processing (17)

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml
distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml
docs/postgresql-schema.sql
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
fault-quarantine/pkg/initializer/init.go
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py
health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
tests/event_exporter_test.go
tests/fault_quarantine_test.go
tests/gpu_health_monitor_test.go
tests/helpers/event_exporter.go
tests/helpers/healthevent.go
tests/helpers/kube.go
tests/platform-connector_test.go

🚧 Files skipped from review as they are similar to previous changes (4)

fault-quarantine/pkg/initializer/init.go
tests/event_exporter_test.go
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-3.x.yaml
distros/kubernetes/nvsentinel/charts/gpu-health-monitor/templates/daemonset-dcgm-4.x.yaml

🧰 Additional context used

📓 Path-based instructions (4)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

tests/helpers/event_exporter.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
tests/helpers/healthevent.go
tests/gpu_health_monitor_test.go
tests/helpers/kube.go
tests/fault_quarantine_test.go
tests/platform-connector_test.go

**/values.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/values.yaml: Document all values in Helm chart values.yaml with inline comments
Include examples for non-obvious configurations in Helm chart documentation
Note truthy value requirements in Helm chart documentation where applicable

Files:

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

fault-quarantine/pkg/evaluator/rule_evaluator_test.go
tests/gpu_health_monitor_test.go
tests/fault_quarantine_test.go
tests/platform-connector_test.go

**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Use Poetry for dependency management in Python code
Follow PEP 8 style guide for Python code
Use Black for formatting Python code
Type hints required for all functions in Python code
Use dataclasses for structured data in Python code

Files:

health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py
health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py

🧠 Learnings (8)

📚 Learning: 2025-12-23T10:34:13.121Z

Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 610
File: platform-connectors/pkg/connectors/kubernetes/process_node_events.go:346-370
Timestamp: 2025-12-23T10:34:13.121Z
Learning: In platform-connectors/pkg/connectors/kubernetes/process_node_events.go, the Event.Type field is intentionally set to healthEvent.CheckName rather than the standard Kubernetes "Normal" or "Warning" values. This is a deliberate design choice for NVSentinel to apply node events with the checkname as the type.

Applied to files:

tests/helpers/event_exporter.go
fault-quarantine/pkg/evaluator/rule_evaluator_test.go
tests/platform-connector_test.go
health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

fault-quarantine/pkg/evaluator/rule_evaluator_test.go
tests/gpu_health_monitor_test.go
tests/fault_quarantine_test.go
tests/platform-connector_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

tests/gpu_health_monitor_test.go
tests/fault_quarantine_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

tests/gpu_health_monitor_test.go
tests/fault_quarantine_test.go
tests/platform-connector_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label

Applied to files:

tests/helpers/kube.go

📚 Learning: 2025-11-06T16:18:09.952Z

Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

tests/fault_quarantine_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

tests/platform-connector_test.go

📚 Learning: 2025-11-04T06:31:02.147Z

Learnt from: Gyan172004
Repo: NVIDIA/NVSentinel PR: 223
File: platform-connectors/pkg/nodemetadata/processor.go:0-0
Timestamp: 2025-11-04T06:31:02.147Z
Learning: In platform-connectors/pkg/nodemetadata/processor.go, the NewProcessor function does not perform a nil check on the config parameter because the caller is expected to guarantee a non-nil config is provided.

Applied to files:

tests/platform-connector_test.go

🧬 Code graph analysis (2)

tests/helpers/healthevent.go (2)

health-monitors/gpu-health-monitor/gpu_health_monitor/protos/health_event_pb2.pyi (1)

ProcessingStrategy (14-17)

data-models/pkg/protos/health_event.pb.go (4)

ProcessingStrategy (43-43)

ProcessingStrategy (72-74)

ProcessingStrategy (76-78)

ProcessingStrategy (85-87)

tests/fault_quarantine_test.go (4)

tests/helpers/fault_quarantine.go (4)

QuarantineTestContext (51-54)

SetupQuarantineTest (107-112)

AssertQuarantineState (315-382)

QuarantineAssertion (56-60)

tests/helpers/kube.go (1)

SetNodeManagedByNVSentinel (1389-1408)

tests/helpers/healthevent.go (3)

NewHealthEvent (60-76)

SendHealthEvent (263-275)

SendHealthyEvent (277-287)

data-models/pkg/protos/health_event.pb.go (2)

ProcessingStrategy_STORE_ONLY (47-47)

ProcessingStrategy_EXECUTE_REMEDIATION (46-46)

🔇 Additional comments (28)

tests/helpers/kube.go (5)

387-409: LGTM!

The EnsureNodeEventNotPresent helper correctly uses require.Never to assert that a specific event type and reason combination never appears on a node during the test window. The implementation properly queries events and checks both type and reason.

2232-2273: LGTM!

The waitForDaemonSetRollout function correctly implements DaemonSet rollout verification by checking that all desired pods are scheduled, updated, and ready. The logic matches kubectl rollout status behavior and includes helpful progress logging.

2318-2351: LGTM!

The RemoveDaemonSetArgs function correctly removes specified arguments from a DaemonSet container and waits for the rollout to complete. Uses retry.RetryOnConflict appropriately without error wrapping.

2353-2431: LGTM!

The argument manipulation helpers (tryUpdateExistingArg, setArgsOnContainer, removeArgsFromContainer) correctly handle various command-line argument formats:

--flag=value style

--flag value style (separate entries)

--flag style (boolean flags)

The logic properly preserves argument order and handles edge cases like updating existing args vs adding new ones.

2433-2477: LGTM!

The GetDaemonSetPodOnWorkerNode helper correctly retrieves a ready DaemonSet pod with proper validation:

Verifies pod is not being deleted (DeletionTimestamp == nil)

Confirms pod is in Running phase

Checks pod readiness conditions

Uses require.Eventually for reliable polling

This defensive checking improves test stability by ensuring the pod from the current rollout is fully operational.

fault-quarantine/pkg/evaluator/rule_evaluator_test.go (1)

263-263: LGTM!

The test correctly updates the expected map to include the new processingStrategy field with value float64(0), matching the default enum value ProcessingStrategy_EXECUTE_REMEDIATION. Using float64 is correct for JSON unmarshaling behavior.

distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml (1)

219-222: LGTM!

The schema addition matches the canonical source in docs/postgresql-schema.sql. The file header correctly documents the sync process using make update-helm-postgres-schema and make validate-postgres-schema.

tests/helpers/event_exporter.go (2)

245-245: LGTM!

The assertion correctly validates that the processingStrategy field in the CloudEvent data matches the expected value. The comparison works properly since the JSON-unmarshaled value will be a string.

221-226: All callers of ValidateCloudEvent have been properly updated. The function call at tests/event_exporter_test.go:85 correctly passes all 7 required parameters, including the new expectedProcessingStrategy parameter ("EXECUTE_REMEDIATION"). No outdated calls remain in the codebase.

tests/fault_quarantine_test.go (4)

233-250: LGTM!

The test setup correctly:

Uses the existing SetupQuarantineTest helper

Sets the node as managed by NVSentinel (required for quarantine logic)

Follows the established pattern from other tests in this file

252-268: LGTM!

The STORE_ONLY assessment correctly validates that events with ProcessingStrategy_STORE_ONLY:

Do NOT cause the node to be cordoned

Do NOT add quarantine annotations

This properly tests the observability-only behavior where events are stored but don't modify cluster state.

270-286: LGTM!

The EXECUTE_REMEDIATION assessment correctly validates that events with ProcessingStrategy_EXECUTE_REMEDIATION:

DO cause the node to be cordoned

DO add quarantine annotations

This properly tests the normal remediation behavior where the system takes corrective actions.

288-295: LGTM!

The teardown correctly:

Sends a healthy event to clear the quarantine state from the EXECUTE_REMEDIATION assessment

Uses TeardownQuarantineTest to restore the original configuration and clean up test resources

This ensures proper test isolation and cleanup.

distros/kubernetes/nvsentinel/charts/gpu-health-monitor/values.yaml (1)

57-62: LGTM! Clear documentation and sensible default.

The processingStrategy configuration option is well-documented with:

Clear valid values: EXECUTE_REMEDIATION, STORE_ONLY

Sensible default: EXECUTE_REMEDIATION (maintains backward compatibility)

Clear behavior description for each mode

Both DaemonSet templates (daemonset-dcgm-3.x.yaml and daemonset-dcgm-4.x.yaml) correctly reference this value using {{ .Values.processingStrategy | quote }}.

tests/gpu_health_monitor_test.go (3)

34-48: LGTM on constants organization.

The constants are well-organized with exported names for reuse. The separation of DCGM-related constants from GPU health monitor constants improves readability.

413-462: Test setup and error injection logic is well-structured.

The test correctly:

Configures the DaemonSet with STORE_ONLY strategy

Waits for the pod to be ready

Injects test metadata and sets the node label

Injects a DCGM Inforom error to trigger the health monitor

The flow aligns with the PR objective of verifying STORE_ONLY events are stored without triggering remediation.

464-480: LGTM on assess phase.

The assertions correctly verify that:

Node conditions are not applied when using STORE_ONLY strategy

Node is not cordoned

This validates the expected behavior of the STORE_ONLY processing strategy.

tests/platform-connector_test.go (1)

53-96: Test thoroughly covers both processing strategies.

The assess phase correctly validates:

STORE_ONLY events don't apply node conditions or emit events

EXECUTE_REMEDIATION events do apply conditions and events

The test uses both fatal (ERRORCODE_79) and non-fatal (ERRORCODE_31) error codes to cover different scenarios.

health-monitors/gpu-health-monitor/gpu_health_monitor/cli.py (2)

74-80: CLI option for processing strategy is well-designed.

The option:

Has a sensible default (EXECUTE_REMEDIATION) for backward compatibility

Provides clear help text describing valid values

Is marked as optional

37-51: Function signature updated correctly with type hint.

The processing_strategy parameter is properly typed with platformconnector_pb2.ProcessingStrategy and passed through to the PlatformConnectorEventProcessor constructor. As per coding guidelines, type hints are required for all functions in Python code.

health-monitors/gpu-health-monitor/gpu_health_monitor/tests/test_platform_connector/test_platform_connector.py (3)

107-108: LGTM on test update for processing_strategy parameter.

The test correctly passes platformconnector_pb2.STORE_ONLY as the processing strategy to the event processor constructor.

301-302: Good assertion on processingStrategy propagation.

The test verifies that the processingStrategy field on the emitted HealthEvent matches the strategy configured in the processor (STORE_ONLY).

523-549: Test correctly verifies EXECUTE_REMEDIATION strategy propagation.

This test case uses EXECUTE_REMEDIATION strategy and verifies the restored event has the correct processingStrategy field. Good coverage of both strategy values across different test cases.

health-monitors/gpu-health-monitor/gpu_health_monitor/platform_connector/platform_connector.py (5)

51-66: Processing strategy parameter and storage are correctly implemented.

The constructor:

Accepts the typed processing_strategy parameter

Stores it as self._processing_strategy following Python naming conventions for protected attributes

As per coding guidelines, type hints are required for all functions in Python code, which is satisfied here.

106-121: processingStrategy correctly added to connectivity restored event.

The clear_dcgm_connectivity_failure method properly includes processingStrategy=self._processing_strategy in the HealthEvent message.

206-223: processingStrategy correctly added to health event for entity failures.

The HealthEvent created when entity failures are detected properly includes the processing strategy.

270-287: processingStrategy correctly added to healthy status events.

The HealthEvent created for healthy (PASS) status properly includes the processing strategy.

366-381: processingStrategy correctly added to DCGM connectivity failure event.

The dcgm_connectivity_failed method properly includes processingStrategy=self._processing_strategy in the HealthEvent message. All four HealthEvent creation sites now consistently include the processing strategy.

tests/gpu_health_monitor_test.go

coderabbitai · 2025-12-25T12:22:15Z

tests/platform-connector_test.go

+		t.Logf("Node %s should have condition SysLogsXIDError", testCtx.NodeName)
+		helpers.CheckNodeConditionExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy")
+
+		event = helpers.NewHealthEvent(testCtx.NodeName).
+			WithErrorCode(helpers.ERRORCODE_31).
+			WithMessage("XID error occurred").
+			WithFatal(false).
+			WithProcessingStrategy(int(protos.ProcessingStrategy_EXECUTE_REMEDIATION))
+		helpers.SendHealthEvent(ctx, t, event)
+
+		t.Logf("Node %s should have event SysLogsXIDError", testCtx.NodeName)
+		helpers.CheckNodeEventExists(ctx, client, testCtx.NodeName, "SysLogsXIDError", "SysLogsXIDErrorIsNotHealthy")


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Locate and examine the helper functions mentioned in the review find . -name "*.go" -path "*/helpers/*" | head -20

Repository: NVIDIA/NVSentinel

Length of output: 478

🏁 Script executed:

#!/bin/bash # Search for CheckNodeConditionExists and CheckNodeEventExists definitions rg -n 'func CheckNodeConditionExists|func CheckNodeEventExists' --type go

Repository: NVIDIA/NVSentinel

Length of output: 172

🏁 Script executed:

#!/bin/bash # Examine the CheckNodeConditionExists function signature and implementation sed -n '1283,1323p' tests/helpers/kube.go

Repository: NVIDIA/NVSentinel

Length of output: 1599

🏁 Script executed:

#!/bin/bash # Examine the CheckNodeEventExists function signature and implementation sed -n '1325,1365p' tests/helpers/kube.go

Repository: NVIDIA/NVSentinel

Length of output: 1167

Add assertions for helper function return values at lines 83 and 93.

CheckNodeConditionExists returns (*v1.NodeCondition, error) and CheckNodeEventExists returns (bool, *v1.Event). The test calls these helpers but ignores their return values, so it never verifies that the condition or event actually exists. Capture and assert the return values using standard library assertions (e.g., if err != nil { t.Errorf(...) } or if !exists { t.Errorf(...) }).

🤖 Prompt for AI Agents

In tests/platform-connector_test.go around lines 82 to 93, the calls to CheckNodeConditionExists and CheckNodeEventExists ignore their return values; capture and assert them: assign the returned (*v1.NodeCondition, error) from CheckNodeConditionExists, check err is nil and the condition is non-nil (fail the test with t.Fatalf or t.Errorf if not), and assign the returned (bool, *v1.Event) from CheckNodeEventExists, assert the bool is true and the event pointer is non-nil (fail the test if not); use standard library assertions (if err != nil { t.Fatalf(...) } / if !exists { t.Fatalf(...) } / if event == nil { t.Fatalf(...) }) so the test actually validates existence.

feat: add processing strategy field in health events

a44b5a6

Signed-off-by: Tanisha goyal <[email protected]>

coderabbitai bot reviewed Dec 22, 2025

View reviewed changes

platform-connectors/pkg/connectors/kubernetes/process_node_events.go Show resolved Hide resolved

tests/event_exporter_test.go Show resolved Hide resolved

tests/gpu_health_monitor_test.go Outdated Show resolved Hide resolved

coderabbitai bot reviewed Dec 23, 2025

View reviewed changes

tanishagoyal2 added 3 commits December 23, 2025 21:54

feat: add e2e tests

69a01e2

Signed-off-by: Tanisha goyal <[email protected]>

feat: add event handling strategy in gpu health monitor

f375e3c

Signed-off-by: Tanisha goyal <[email protected]>

feat: update tests

232c636

Signed-off-by: Tanisha goyal <[email protected]>

tanishagoyal2 force-pushed the 390-event-handling-in-gpu-monitor branch from 8398875 to 232c636 Compare December 25, 2025 12:17

coderabbitai bot reviewed Dec 25, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add event handling strategy changes in gpu health monitor #611

feat: add event handling strategy changes in gpu health monitor #611

Uh oh!

tanishagoyal2 commented Dec 22, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Dec 22, 2025 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Dec 22, 2025

Changed files (no unit tests)

Changed unit test files

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot Dec 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

feat: add event handling strategy changes in gpu health monitor #611

Are you sure you want to change the base?

feat: add event handling strategy changes in gpu health monitor #611

Uh oh!

Conversation

tanishagoyal2 commented Dec 22, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Type of Change

Component(s) Affected

Testing

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Dec 22, 2025

Merging this branch will decrease overall coverage

Changed files (no unit tests)

Changed unit test files

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Dec 25, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tanishagoyal2 commented Dec 22, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 22, 2025 •

edited

Loading