feature: update metadata-collector to track pod GPU device allocation #663

natherz97 · 2026-01-06T23:36:12Z

Summary

Related design doc for GPU reset: https://github.com/NVIDIA/NVSentinel/blob/main/docs/designs/020-nvsentinel-gpu-reset.md.

This PR adds support to the metadata-collector to add a device annotation to all pods running on the given node which have been allocated a GPU device. The annotation is named dgxc.nvidia.com/devices with keys as the device resource name, either nvidia.com/gpu or nvidia.com/pgpu, and values a list of GPU UUIDs for the corresponding devices. In a follow-up PR, this pod annotation will be consumed by the node-drainer-module to allow partial node drains where only pods needing a GPU reset for a given GPU UUID will be drained and other pods running on the node will not be drained.

Example annotation:

   "annotations": {
     "dgxc.nvidia.com/devices": "{\"devices\":{\"nvidia.com/gpu\":[\"GPU-455d8f70-2051-db6c-0430-ffc457bff834\"]}},
     ...
   }

High-level steps:

List all pods from the Kubelet's HTTPS /pods endpoint. This allows us to determine which pods need the devices
annotation added, updated, or removed. The /pods Kubelet endpoint is equivalent to the following K8s API request: /api/v1/pods?fieldSelector=spec.nodeName=$NODE_NAME. Using the local Kubelet endpoint will significantly reduce load on the control plane from all metadata-collector daemonset pods with the tradeoff that the endpoint has to be periodically scraped rather than watched. Since we have to periodically scrape the PodResourcesLister endpoint in the next step, we will be consistent in how we're discovering both the pod list for this node and the device to pod allocation by listing from a local endpoint on the node.
Discover the pod to device mapping using the Kubelet's gRPC PodResourcesLister endpoint. This local gRPC service is
exposed on a local Unix socket. This endpoint exposes devices allocated to pods through either resource requests through device plugins or resource claims through DRA. Starting in v1.35, we can rely on the ResourceHealthStatus feature to expose this same device mapping through container statuses and at that point we'll be able to remove this component
from the metadata-collector.
Compare each pod object to the current set of allocated devices and determine if any pods need a device annotation
updated, added, or removed. We will issue a patch request which updates the dgxc.nvidia.com/devices annotation to the
desired value or remove it. To prevent unnecessary writes, we will only update this annotation if it is modified.
This comes with the tradeoff that consumers of this annotation (node-drainer-module) will not be able to detect stale
annotations outside of the annotation not existing for a pod that is requesting GPUs.

Testing

Metadata-collector logs for pod annotation added:

{"time":"2026-01-06T22:39:43.715069754Z","level":"INFO","msg":"Updating devices annotation for pod","module":"metadata-collector","version":"metadata-collector-nherz6","podKey":"default/nvidia-smi","annotation":"{\"devices\":{\"nvidia.com/gpu\":[\"GPU-1ae89870-0153-0725-e573-4c3e854b7b26\",\"GPU-b8ab7e46-8480-9c59-df72-ea3ba333e5e6\",\"GPU-db60f780-8f9c-0269-4d0a-960fb0f6b8e3\"]}}"}
{"time":"2026-01-06T22:39:43.733297915Z","level":"INFO","msg":"Device mapper pod annotation updates","module":"metadata-collector","version":"metadata-collector-nherz6","podCount":1}

Checking pod object:

% kubectl get pod nvidia-smi -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    dgxc.nvidia.com/devices: '{"devices":{"nvidia.com/gpu":["GPU-1ae89870-0153-0725-e573-4c3e854b7b26","GPU-b8ab7e46-8480-9c59-df72-ea3ba333e5e6","GPU-db60f780-8f9c-0269-4d0a-960fb0f6b8e3"]}}'
...

Metadata-collector logs for pod annotation removed:

{"time":"2026-01-06T22:42:13.707839166Z","level":"INFO","msg":"Updating devices annotation for pod","module":"metadata-collector","version":"metadata-collector-nherz6","podKey":"default/nvidia-smi2","annotation":"{\"devices\":{\"nvidia.com/gpu\":[\"GPU-71db6c25-f636-d21a-42db-b120699914de\",\"GPU-b8ab7e46-8480-9c59-df72-ea3ba333e5e6\",\"GPU-db60f780-8f9c-0269-4d0a-960fb0f6b8e3\"]}}"}
{"time":"2026-01-06T22:42:13.7604773Z","level":"INFO","msg":"Removing devices annotation for pod since it no longer has devices allocated","module":"metadata-collector","version":"metadata-collector-nherz6","podKey":"default/nvidia-smi"}
{"time":"2026-01-06T22:42:13.774572016Z","level":"INFO","msg":"Device mapper pod annotation updates","module":"metadata-collector","version":"metadata-collector-nherz6","podCount":2}

Checking pod objects:

% kubectl get pod nvidia-smi2 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    dgxc.nvidia.com/devices: '{"devices":{"nvidia.com/gpu":["GPU-71db6c25-f636-d21a-42db-b120699914de","GPU-b8ab7e46-8480-9c59-df72-ea3ba333e5e6","GPU-db60f780-8f9c-0269-4d0a-960fb0f6b8e3"]}}'
...

% kubectl get pod nvidia-smi -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
...

Metadata-collector graceful pod termination:

{"time":"2026-01-06T22:45:13.675209965Z","level":"INFO","msg":"Device mapper pod annotation updates","module":"metadata-collector","version":"metadata-collector-nherz6","podCount":0}
{"time":"2026-01-06T22:45:43.704563264Z","level":"INFO","msg":"Device mapper pod annotation updates","module":"metadata-collector","version":"metadata-collector-nherz6","podCount":0}
{"time":"2026-01-06T22:45:48.448663531Z","level":"INFO","msg":"Pod device mapper completed successfully","module":"metadata-collector","version":"metadata-collector-nherz6"}

Additional tests:

confirmed that multiple pods can have their annotations added/removed in a single reconcile loop
confirmed that pods will a terminal status will have their device annotation removed and these devices will be allocated to any new pods
confirmed that existing annotations will be persisted with the devices annotation is added/removed and that the devices annotation can be added/removed when there are no existing annotations

CSP testing:

OCI
AWS
GCP
Azure

Type of Change

Component(s) Affected

Testing

Tests pass locally
Manual testing completed
No breaking changes (or documented)

Checklist

Self-review completed
Documentation updated (if needed)
Ready for review

Summary by CodeRabbit

New Features
- Pods are automatically annotated with their allocated GPU device information; a background mapper periodically syncs node device allocations into pod annotations.
Infrastructure
- Kubernetes chart adds ServiceAccount and RBAC, mounts host pod-resources, and tightens DaemonSet container/security settings.
Documentation
- Docs updated to describe continuous metadata collection and pod-to-GPU annotation exposure; removed pause/init-container pattern.
Tests
- Added unit tests for the mapper and both kubelet client implementations.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2026-01-06T23:36:20Z

📝 Walkthrough

Walkthrough

Adds a pod device annotation data model, kubelet gRPC and HTTPS clients, a PodDeviceMapper that periodically patches pods with GPU allocations, Helm RBAC/ServiceAccount/DaemonSet templates, refactors metadata-collector into separate collector and mapper phases, and includes tests and docs updates.

Changes

Cohort / File(s)	Summary
Data model `data-models/pkg/model/pod_device_annotation.go`	New constant `PodDeviceAnnotationName` (`"dgxc.nvidia.com/devices"`), `EntityTypeToResourceNames` map (`"GPU_UUID"` → `["nvidia.com/gpu","nvidia.com/pgpu"]`), and `DeviceAnnotation` struct (`Devices map[string][]string`).
Helm RBAC & ServiceAccount `distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml`, `.../clusterrolebinding.yaml`, `.../serviceaccount.yaml`	Adds ClusterRole (pods:patch, nodes/proxy:get), ClusterRoleBinding, and ServiceAccount templates for metadata-collector.
DaemonSet chart `distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml`	Replaces pause/init-container pattern with direct container; adds `serviceAccountName`, `runAsGroup: 0`, `privileged: true`, and mounts hostPath `/var/lib/kubelet/pod-resources`.
Chart values `distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml`	Removes `pauseImage` configuration and related pause/init-container settings.
Go module `metadata-collector/go.mod`	Adds direct gRPC and Kubernetes module dependencies and updates indirect dependency graph.
Orchestration / main `metadata-collector/main.go`	Introduces `runCollector` (NVML metadata write) and `runMapper` (ticker-driven pod-device mapper), adds `defaultPodDeviceMonitorPeriod` (30s), and updates logging/error handling.
Kubelet gRPC client + tests `metadata-collector/pkg/mapper/grpcclient.go`, `.../grpcclient_test.go`	New `KubeletGRPClient` interface and UDS gRPC implementation; `ListPodResources()` with retry, normalization to `map[string]*model.DeviceAnnotation`; mock server tests for success, error, and retry cases.
Kubelet HTTPS client + tests `metadata-collector/pkg/mapper/httpsclient.go`, `.../httpsclient_test.go`	New `KubeletHTTPSClient` interface and HTTPS implementation to GET `/pods` with token handling, TLS skip-verify, retry and JSON parsing; extensive unit tests covering success and error paths.
Pod device mapper + tests `metadata-collector/pkg/mapper/mapper.go`, `.../mapper_test.go`	New `PodDeviceMapper` and `NewPodDeviceMapper`; `UpdatePodDevicesAnnotations()` computes `dgxc.nvidia.com/devices` annotation patches (MergePatch / JSONPatch) and applies them with retry logic; comprehensive unit tests for add/update/remove/no-op and retry semantics.
Docs `docs/configuration/metadata-collector.md`, `docs/metadata-collector.md`	Documents pod-to-GPU mapping annotation, removes pauseImage docs, and updates flow from init-container to continuous collector+mapper model.

Sequence Diagram(s)

sequenceDiagram
    actor Ticker as Ticker (30s)
    participant Mapper as PodDeviceMapper
    participant HTTPS as Kubelet HTTPS Client
    participant gRPC as Kubelet gRPC Client
    participant Kubelet as Kubelet
    participant K8sAPI as Kubernetes API

    Ticker ->> Mapper: UpdatePodDevicesAnnotations()

    Mapper ->> HTTPS: ListPods()
    HTTPS ->> Kubelet: HTTPS GET /pods (token)
    Kubelet -->> HTTPS: PodList (JSON)
    HTTPS -->> Mapper: []Pod

    Mapper ->> gRPC: ListPodResources()
    gRPC ->> Kubelet: gRPC over UDS ListPodResources
    Kubelet -->> gRPC: PodResources
    gRPC -->> Mapper: map[pod] DeviceAnnotation

    Mapper ->> Mapper: compute desired annotations
    loop per pod with change
        Mapper ->> K8sAPI: Patch pod annotation (MergePatch / JSONPatch)
        alt Success
            K8sAPI -->> Mapper: 200 OK
        else Retryable error
            K8sAPI -->> Mapper: Error
            Mapper ->> K8sAPI: Retry Patch
            K8sAPI -->> Mapper: 200 OK
        end
    end
    Mapper -->> Ticker: return update count

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰
I hopped the kubelet socket, listened for pods,
gRPC and HTTPS shared their little nods,
I stitched GPU UUIDs into pod notes bright,
Retried when storms came, kept annotations right,
A joyful hop—mapping done by night!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title accurately captures the main change: adding GPU device allocation tracking to the metadata-collector via pod annotations, which is the core feature introduced across all modified files.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (1)

45-50: Separate DaemonSets for kata and regular nodes, and verify privileged mode necessity.

The current single DaemonSet violates the coding guideline requiring separate DaemonSets for kata vs regular nodes using nodeAffinity based on the kata.enabled label. The syslog-health-monitor chart provides the correct pattern: separate daemonset-regular.yaml and daemonset-kata.yaml files with different volume mounts per variant. Implement kata vs regular node separation using this pattern instead of the current nodeSelector approach.

Additionally, the privileged: true security context poses unnecessary risk. The syslog-health-monitor reference implementation accesses similar kubelet resources using granular Linux capabilities (SYSLOG, SYS_ADMIN) instead of privileged mode. Verify whether pod-resources socket access truly requires full privileged mode or if granular capabilities suffice, and replace privileged: true with a minimal capability set if possible.

🤖 Fix all issues with AI Agents

In @metadata-collector/pkg/mapper/grpcclient.go:
- Around line 152-166: The deduplication loop in the devicesPerPod handling can
panic when a resource's devices slice is empty; update the inner loop over
deviceAnnotation.Devices (keyed by resourceName) to skip slices of length 0 or 1
by adding a guard like `if len(devices) <= 1 { continue }` before running the
in-place dedupe that uses j and assigns `devices[:j+1]`, so empty or
single-element slices are left unchanged and the slice expression cannot panic.

In @metadata-collector/pkg/mapper/httpsclient.go:
- Around line 55-62: The TLS config on the http.Transport sets
InsecureSkipVerify but omits a minimum TLS version; update the tls.Config used
in transport (the tls.Config literal in the http.Transport assigned to variable
transport) to include MinVersion set to a safe constant (e.g., tls.VersionTLS12
or tls.VersionTLS13) so the client refuses older protocol versions while
retaining InsecureSkipVerify for localhost use.

In @metadata-collector/pkg/mapper/mapper_test.go:
- Around line 327-332: The call to newTestDeviceMapper assigns an error to err
that is never checked; update the test to verify this by adding an assertion
(e.g., require.NoError or assert.NoError) immediately after calling
newTestDeviceMapper (before using mapper), referencing the newTestDeviceMapper
result so the test fails fast if setup failed.
- Around line 286-291: The call to newTestDeviceMapper in mapper_test.go may
return an error that is currently ignored; after "mapper, err :=
newTestDeviceMapper(testCase, []*corev1.Pod{existingTestPod})" add an explicit
test failure on error (e.g., assert.NoError(t, err) or t.Fatalf with the error)
before using mapper so the test stops instead of continuing with a nil mapper.

In @metadata-collector/pkg/mapper/mapper.go:
- Around line 49-76: Add a godoc comment for the exported constructor
NewPodDeviceMapper explaining its purpose and parameters; fix the typo gpcClient
-> grpcClient (references: NewKubeletGRPClient -> NewKubeletGRPCClient and
variable grpcClient) and update all uses (kubeletGRPCClient); wrap returned
errors from NewKubeletHTTPSClient, rest.InClusterConfig, NewKubeletGRPCClient,
and kubernetes.NewForConfig with contextual messages (e.g., using
fmt.Errorf("failed to ...: %w", err)) so callers get actionable context while
preserving the original error.
- Around line 193-209: In patchPodWithRetry, the log and error messages
incorrectly reference "node" instead of "pod"; update the slog.Warn call and the
returned fmt.Errorf message inside the patchPodWithRetry function to use "pod"
(and keep podName variable) so messages correctly say "patching pod annotation"
and "failed to patch pod <name>: %w" respectively; ensure you only change the
strings in slog.Warn and fmt.Errorf and not the variable names or logic.

🧹 Nitpick comments (9)

metadata-collector/pkg/mapper/mapper_test.go (2)
127-127: Incorrect argument order in assert.Equal.

The assert.Equal function expects arguments in the order (expected, actual), but here and throughout the file, the order is reversed as (actual, expected). This won't affect test correctness but will produce confusing error messages on failure.
🔎 Proposed fix
-	assert.Equal(t, numUpdates, 0)
+	assert.Equal(t, 0, numUpdates)
Similar fixes should be applied to other assert.Equal calls in this file (lines 143, 163, 167, 179, 192, 196, 219, 223, 246, 250, 267, 271, 312, 350, 398, 402, 406).
17-33: Consider using standard library assertions for simpler checks.

Based on learnings, this repository prefers avoiding testify for simple equality/inequality checks. While testify is acceptable for complex scenarios requiring richer diagnostics, consider using standard testing package assertions (e.g., t.Errorf) for straightforward checks to maintain consistency across the codebase.
metadata-collector/main.go (1)
79-93: Ticker resource leak and delayed first update.

Two issues:

The ticker is never stopped, causing a minor resource leak when context is cancelled.

The first pod device annotation update only happens after 30 seconds. Consider performing an immediate update before entering the ticker loop.
🔎 Proposed fix
 	ticker := time.NewTicker(defaultPodDeviceMonitorPeriod)
+	defer ticker.Stop()
+
+	// Perform an immediate update before entering the periodic loop
+	numUpdates, err := podDeviceMapper.UpdatePodDevicesAnnotations()
+	if err != nil {
+		return fmt.Errorf("could not update mapper pod devices annotations: %w", err)
+	}
+	slog.Info("Device mapper pod annotation updates", "podCount", numUpdates)

 	for {
 		select {
 		case <-ctx.Done():
 			return nil
 		case <-ticker.C:
data-models/pkg/model/pod_device_annotation.go (1)
15-32: LGTM - Clean data model definition.

The model definitions are well-structured. The EntityTypeToResourceNames being a var rather than const is a Go necessity for map types. Consider adding a package-level godoc comment to document the purpose of this file, as per coding guidelines.

Add a package-level comment:
// Package model provides data structures for GPU device annotations on Kubernetes pods.
package model
metadata-collector/pkg/mapper/httpsclient.go (1)
153-155: Consider limiting retries for authentication errors.

The retryAllErrors function retries all errors including HTTP 401/403, which typically indicate persistent authorization issues that won't resolve with retries. While the impact is bounded by DefaultRetry limits, avoiding unnecessary retries for auth errors would improve failure diagnosis.
🔎 Proposed fix (if you want to distinguish retriable errors)
func isRetriableHTTPError(err error) bool {
	// Don't retry if we got a response - the error is in the status code
	// and would need to be extracted from the error message or a custom error type
	return true // For now, keep existing behavior
}
Alternatively, consider using a custom error type that carries the status code for more granular retry decisions.
metadata-collector/pkg/mapper/grpcclient.go (2)
54-54: Global state modification with resolver.SetDefaultScheme.

resolver.SetDefaultScheme("passthrough") modifies global gRPC resolver state, which could affect other gRPC clients in the same process. Consider using the scheme prefix directly in the target address instead:
🔎 Proposed fix
-	resolver.SetDefaultScheme("passthrough")
-
 	connection, err := grpc.NewClient(
-		podResourcesKubeletSocket,
+		"passthrough:///"+podResourcesKubeletSocket,
 		grpc.WithTransportCredentials(insecure.NewCredentials()),
This avoids global state mutation and makes the intent explicit per connection.
183-185: Consider logging or returning error from Close().

The Close() method ignores the error returned by connection.Close(). While this is common for cleanup methods, logging the error could aid debugging connection issues.
🔎 Option 1: Log the error
 func (client *kubeletGRPClient) Close() {
-	client.connection.Close()
+	if err := client.connection.Close(); err != nil {
+		slog.Warn("Failed to close gRPC connection", "error", err)
+	}
 }
🔎 Option 2: Return the error
-func (client *kubeletGRPClient) Close() {
-	client.connection.Close()
+func (client *kubeletGRPClient) Close() error {
+	return client.connection.Close()
 }
Note: This would require updating the interface definition.
metadata-collector/pkg/mapper/mapper.go (2)
15-15: Missing package-level godoc comment.

As per coding guidelines, package-level godoc is required for all Go packages. Add a doc comment before the package declaration.
🔎 Proposed fix
+// Package mapper provides functionality for annotating Kubernetes pods with
+// their allocated GPU device information by querying Kubelet endpoints.
 package mapper
37-39: Missing godoc comment for exported interface.

As per coding guidelines, exported interfaces should have documentation comments explaining their purpose and method contracts.
🔎 Proposed fix
+// PodDeviceMapper defines the interface for updating pod annotations
+// with allocated GPU device information.
 type PodDeviceMapper interface {
+	// UpdatePodDevicesAnnotations discovers pods and their GPU allocations,
+	// then patches the dgxc.nvidia.com/devices annotation accordingly.
+	// Returns the number of pods modified and any error encountered.
 	UpdatePodDevicesAnnotations() (int, error)
 }

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd4466e and 3705a73.

⛔ Files ignored due to path filters (1)

metadata-collector/go.sum is excluded by !**/*.sum

📒 Files selected for processing (14)

data-models/pkg/model/pod_device_annotation.go
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
metadata-collector/go.mod
metadata-collector/main.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/pkg/mapper/mapper_test.go

💤 Files with no reviewable changes (1)

distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml

🧰 Additional context used

📓 Path-based instructions (5)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper.go
data-models/pkg/model/pod_device_annotation.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/main.go

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient_test.go

**/go.mod

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

metadata-collector/go.mod

**/daemonset*.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Explain DaemonSet variant selection logic in Helm chart documentation

Files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

distros/kubernetes/**/*daemonset*.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes using nodeAffinity based on kata.enabled label
Regular node DaemonSets should use /var/log volume mount for file-based logs
Kata node DaemonSets should use /run/log/journal and /var/log/journal volume mounts for systemd journal

Files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

🧠 Learnings (14)

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/go.mod

📚 Learning: 2026-01-06T21:31:31.115Z

Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:31.115Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-12-22T16:48:13.460Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-12-22T16:16:31.660Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Kata node DaemonSets should use `/run/log/journal` and `/var/log/journal` volume mounts for systemd journal

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-06T10:59:26.687Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 281
File: distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml:20-0
Timestamp: 2025-11-06T10:59:26.687Z
Learning: The pause container image registry.k8s.io/pause:3.9+ requires eBPF and cgroup v2 support (kernel 4.15+, CONFIG_BPF, CONFIG_CGROUP_BPF, /sys/fs/bpf mounted). Older pause:3.0 images from gcr.io/google-containers/pause work on systems without full BPF/cgroup v2 support. When suggesting pause image updates, verify the target environment has necessary kernel features to avoid BPF_PROG_ATTACH errors.

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

🧬 Code graph analysis (6)

metadata-collector/pkg/mapper/mapper_test.go (2)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

PodDeviceAnnotationName (18-18)

platform-connectors/pkg/pipeline/factory.go (1)

Create (32-39)

metadata-collector/pkg/mapper/grpcclient_test.go (1)

data-models/pkg/model/pod_device_annotation.go (1)

DeviceAnnotation (30-32)

metadata-collector/pkg/mapper/grpcclient.go (1)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

EntityTypeToResourceNames (22-27)

metadata-collector/pkg/mapper/httpsclient_test.go (1)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

PodDeviceAnnotationName (18-18)

metadata-collector/pkg/mapper/mapper.go (3)

metadata-collector/pkg/mapper/httpsclient.go (2)

KubeletHTTPSClient (39-41)

NewKubeletHTTPSClient (54-70)

metadata-collector/pkg/mapper/grpcclient.go (2)

KubeletGRPClient (37-39)

NewKubeletGRPClient (48-75)

data-models/pkg/model/pod_device_annotation.go (1)

PodDeviceAnnotationName (18-18)

metadata-collector/main.go (1)

metadata-collector/pkg/mapper/mapper.go (1)

NewPodDeviceMapper (49-76)

🪛 ast-grep (0.40.3)

metadata-collector/pkg/mapper/httpsclient.go

[warning] 55-57: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{
InsecureSkipVerify: true, //nolint:gosec
}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)

🪛 YAMLlint (1.37.1)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

🔇 Additional comments (19)

metadata-collector/go.mod (3)

7-17: Direct dependencies align with PR objectives.

The addition of google.golang.org/grpc for PodResourcesLister communication and k8s.io packages (api, apimachinery, client-go, kubelet) for Kubernetes interactions are appropriate and follow the coding guideline to use client-go for Kubernetes API interactions. All Kubernetes dependencies are consistently versioned at v0.35.0, which is good practice.

27-37: This dependency structure is expected and standard. When Go modules use semantic import versioning and explicitly import subpackages (as shown by the metadata-collector/pkg/mapper imports), those subpackages appear as separate entries in go.mod. The parent module's internal dependencies on its subpackages further explains why they're listed. No action needed.

12-16: No action required. Go 1.25.0 is compatible with k8s.io/client-go v0.35.0, which requires Go 1.23 as a minimum. The subpackages listed for go-openapi/swag in the indirect dependencies are valid Go module behavior when transitive dependencies import from submodules. The dependency structure is correct.

Likely an incorrect or invalid review comment.

metadata-collector/pkg/mapper/httpsclient_test.go (2)

468-480: LGTM - Well-structured test helper.

The newTestKubeletHTTPSClient helper properly configures the mock and returns a testable client instance. The use of io.NopCloser for response body handling is appropriate for test scenarios.

534-553: Good retry behavior test.

The test properly validates retry logic by configuring the mock to return a failure on the first call and success on the second, verifying that transient failures are handled correctly.

metadata-collector/main.go (1)

86-91: Consider whether all update errors should be fatal.

Currently, any error from UpdatePodDevicesAnnotations causes the mapper to exit. For a DaemonSet, this triggers a restart, but frequent transient errors could cause restart loops. Consider logging errors and continuing for certain error types, or implementing a circuit breaker pattern.

metadata-collector/pkg/mapper/grpcclient_test.go (2)

159-180: LGTM - Comprehensive test for device aggregation.

The test validates that devices are properly aggregated per pod with:

Deduplication of device IDs

Sorting for deterministic ordering

Filtering of unsupported resources

Correct namespace/name key format

The expected output correctly reflects the sorted, unique device lists.

182-193: Good coverage of retry behavior.

The error tests properly validate both scenarios:

errReturnCount=10 exceeds retry limit, ensuring error propagation

errReturnCount=1 allows retry to succeed on second attempt

metadata-collector/pkg/mapper/httpsclient.go (1)

72-96: Excellent security documentation.

The detailed comment thoroughly documents:

Why InsecureSkipVerify is used (localhost SAN limitation)

AuthN/AuthZ requirements and token usage

RBAC requirements (nodes/proxy, HostNetwork)

Equivalent CLI command for debugging

This level of documentation is exemplary for security-sensitive code.

metadata-collector/pkg/mapper/grpcclient.go (1)

77-112: Good documentation for the ListPodResources function.

The documentation clearly explains:

The socket requirement for the DaemonSet

Output structure with example JSON

Behavior guarantees (unique, sorted device lists)

Edge case handling (same device in multiple pods)

metadata-collector/pkg/mapper/mapper.go (2)

78-191: Well-documented implementation with clear logic.

The block comment thoroughly explains the design rationale, tradeoffs, and the three-step process. The implementation correctly handles annotation add, update, and remove cases with proper error wrapping.

211-228: LGTM!

The retry classification covers the appropriate transient error cases: conflicts, timeouts, rate limiting, and service unavailability.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml (1)

15-20: LGTM! ServiceAccount correctly configured.

The ServiceAccount manifest is properly structured with appropriate Helm template placeholders for name and labels.

Note: The YAMLlint syntax error on line 20 is a false positive—YAMLlint doesn't recognize Helm template syntax.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml (2)

15-20: LGTM! ClusterRole structure is correct.

The ClusterRole is properly defined with appropriate Helm template placeholders.

Note: The YAMLlint syntax error on line 20 is a false positive—YAMLlint doesn't recognize Helm template syntax.

21-33: The get verb on nodes/proxy appears to be unnecessary and is not actually used by the code.

The implementation accesses the Kubelet directly at https://localhost:10250/pods rather than through the Kubernetes API proxy (/api/v1/nodes/{node}/proxy/pods). The Kubelet validates the request using its own TokenReview and SubjectAccessReview mechanisms, not RBAC permissions on the nodes/proxy resource.

The only RBAC permission actually required is patch on pods for the pod annotation updates at mapper.go:195. There is no get or list operation on pods via the Kubernetes API server. Remove the nodes/proxy rule from the ClusterRole unless this is a planned future enhancement.

Likely an incorrect or invalid review comment.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml (1)

15-28: LGTM! ClusterRoleBinding correctly configured.

The ClusterRoleBinding properly binds the ClusterRole to the ServiceAccount with correct name references and namespace handling.

Note: The YAMLlint syntax error on line 20 is a false positive—YAMLlint doesn't recognize Helm template syntax.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (3)

38-38: LGTM! ServiceAccount correctly referenced.

The DaemonSet now properly references the ServiceAccount, enabling RBAC-controlled access to the Kubernetes API.

69-71: LGTM! Pod resources volume correctly configured.

The volume mount for /var/lib/kubelet/pod-resources is correctly configured to access the kubelet's PodResourcesLister gRPC socket, which is essential for discovering pod-to-device mappings as described in the PR objectives.

Also applies to: 81-84

85-101: Consider kata-specific configuration and update documentation.

Based on learnings, separate DaemonSets should be created for kata vs. regular nodes using nodeAffinity based on the kata.enabled label. Additionally, DaemonSet variant selection logic should be explained in Helm chart documentation.

Since this metadata-collector appears to interact with kubelet and host resources, verify whether:

It requires separate kata-specific configuration (e.g., different volume mounts for systemd journal)

It should be excluded from kata nodes or configured differently

Based on learnings, as per coding guidelines for DaemonSet configuration and documentation.

metadata-collector/pkg/mapper/grpcclient.go

metadata-collector/pkg/mapper/mapper_test.go

metadata-collector/pkg/mapper/mapper.go

github-actions · 2026-01-06T23:48:27Z

Merging this branch will increase overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/nvidia/nvsentinel/data-models/pkg/model	0.00% (ø)
github.com/nvidia/nvsentinel/metadata-collector	0.00% (ø)
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper	46.88% (+46.88%)	🌟

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/nvidia/nvsentinel/data-models/pkg/model/pod_device_annotation.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/metadata-collector/main.go	0.00% (ø)	165 (+83)	0	165 (+83)
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/grpcclient.go	46.92% (+46.92%)	130 (+130)	61 (+61)	69 (+69)	🌟
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/httpsclient.go	50.51% (+50.51%)	99 (+99)	50 (+50)	49 (+49)	🌟
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/mapper.go	44.92% (+44.92%)	187 (+187)	84 (+84)	103 (+103)	🌟

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/grpcclient_test.go
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/httpsclient_test.go
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/mapper_test.go

github-actions · 2026-01-06T23:54:59Z

🛡️ CodeQL Analysis

🚨 Found 1 security alert(s)

🔗 View details

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (1)

85-101: Create separate DaemonSets for kata and regular nodes with appropriate log volume mounts.

The coding guidelines require DaemonSets in distros/kubernetes/**/*daemonset*.yaml to create separate variants for kata vs regular nodes. The syslog-health-monitor chart in this repository demonstrates the correct pattern: separate daemonset-kata.yaml and daemonset-regular.yaml files with a shared template that configures different volume mounts based on node type.

metadata-collector should follow the same pattern:

Create separate DaemonSet variants for kata and regular nodes

Use nodeSelector with nvsentinel.dgxc.nvidia.com/kata.enabled label to target the appropriate node type

Regular nodes: Mount /var/log for file-based logs

Kata nodes: Mount /run/systemd/journal and /var/log/journal for systemd journal access

Refer to the syslog-health-monitor templates for the implementation pattern.

🤖 Fix all issues with AI Agents

In @data-models/pkg/model/pod_device_annotation.go:
- Line 15: Add a package-level godoc comment for the package declaration in
pod_device_annotation.go: write a one- or two-sentence comment immediately above
"package model" that succinctly describes the purpose of the model package (what
domain objects it contains and its role), following Go godoc style (starts with
"Package model ...") so the package complies with the coding guideline requiring
package-level documentation.

In @metadata-collector/main.go:
- Around line 79-93: The ticker created as ticker :=
time.NewTicker(defaultPodDeviceMonitorPeriod) is never stopped and
UpdatePodDevicesAnnotations errors currently abort the loop; fix by deferring
ticker.Stop() or stopping the ticker when ctx.Done() is observed and ensure the
select exit path calls ticker.Stop() before returning, and change the error
handling for podDeviceMapper.UpdatePodDevicesAnnotations() so it logs the error
(use slog.Error) and continues the loop (optionally add simple retry/backoff
logic or skip until the next tick) instead of returning the error; keep
references to ticker, ctx.Done(), podDeviceMapper.UpdatePodDevicesAnnotations,
and the slog.Info call when numUpdates succeeds.

In @metadata-collector/pkg/mapper/grpcclient.go:
- Around line 37-39: Add a godoc comment above the exported interface
KubeletGRPClient that briefly describes its purpose (e.g., a gRPC client
abstraction for querying kubelet pod device/resource annotations) and include a
short sentence describing the ListPodResources method and its return values (map
of pod/device annotations and error). Place the comment immediately above the
type declaration and use the `KubeletGRPClient` identifier as the first word to
satisfy godoc conventions.

In @metadata-collector/pkg/mapper/httpsclient.go:
- Around line 54-62: In NewKubeletHTTPSClient, the TLSClientConfig currently
only sets InsecureSkipVerify; add a minimum TLS version to harden against
downgrades by setting TLSClientConfig.MinVersion = tls.VersionTLS12 (or
tls.VersionTLS13 if you require TLS 1.3) within the tls.Config passed to the
http.Transport so the kubelet client enforces at least TLS 1.2 while keeping
InsecureSkipVerify for localhost use.

🧹 Nitpick comments (8)

data-models/pkg/model/pod_device_annotation.go (1)
21-28: Public variable is mutable and could lead to unexpected behavior.

EntityTypeToResourceNames is a public variable holding a map, which means external code can modify it. This could cause race conditions or unexpected behavior if multiple components access it concurrently or if callers accidentally mutate it.
🔎 Recommended approaches

Option 1: Make it a function that returns a new map (immutable)
-var (
-	EntityTypeToResourceNames = map[string][]string{
+func GetEntityTypeToResourceNames() map[string][]string {
+	return map[string][]string{
 		"GPU_UUID": {
 			"nvidia.com/gpu",
 			"nvidia.com/pgpu",
 		},
 	}
-	}
-)
+}
Option 2: Keep as variable but document it should not be modified
 var (
+	// EntityTypeToResourceNames maps entity types to their corresponding Kubernetes resource names.
+	// This variable should be treated as read-only; modifying it may cause unexpected behavior.
 	EntityTypeToResourceNames = map[string][]string{
metadata-collector/pkg/mapper/grpcclient.go (3)
154-168: Deduplication logic can be made safer and clearer.

The past review comment correctly identified that devices[:j+1] at line 166 could panic if the slice is empty. While the current code flow (line 145 checks len(device.GetDeviceIds()) > 0) should prevent empty slices, the logic is fragile and non-obvious.

The deduplication algorithm also has unnecessary complexity for the common case of single-element or already-unique slices.
🔎 Proposed fix for safety and clarity
 	// ensure that we have a consistent ordering of devices
 	for _, deviceAnnotation := range devicesPerPod {
 		for resourceName, devices := range deviceAnnotation.Devices {
+			if len(devices) == 0 {
+				delete(deviceAnnotation.Devices, resourceName)
+				continue
+			}
+			
 			sort.Strings(devices)
+			
+			if len(devices) == 1 {
+				continue
+			}
 
 			j := 0
 			for i := 1; i < len(devices); i++ {
 				if devices[i] != devices[j] {
 					j++
 					devices[j] = devices[i]
 				}
 			}
 
 			deviceAnnotation.Devices[resourceName] = devices[:j+1]
 		}
 	}
54-54: Global state modification may affect other gRPC clients.

resolver.SetDefaultScheme("passthrough") modifies global gRPC state, which could impact other gRPC clients in the same process. This is acceptable if metadata-collector is the only component using gRPC, but worth documenting or considering per-connection configuration.

Consider adding a comment explaining this is safe in the metadata-collector context, or investigate if the passthrough scheme can be specified per-connection rather than globally.

185-187: Close method should follow best practices for cleanup.

The Close method should return an error to allow callers to handle connection closure failures. Additionally, it should check if the connection is nil before attempting to close it.
🔎 Proposed fix
-func (client *kubeletGRPClient) Close() {
-	client.connection.Close()
+func (client *kubeletGRPClient) Close() error {
+	if client.connection != nil {
+		return client.connection.Close()
+	}
+	return nil
 }
Note: This will require updating the interface definition to include Close() error.
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml (1)
1-28: LGTM! The ClusterRoleBinding structure is correct.

The ClusterRoleBinding properly references the ClusterRole and ServiceAccount using Helm templating. The static analysis warning about line 20 is a false positive—the {{- syntax is valid Helm template notation.
Minor: Add newline at end of file

Per POSIX conventions and common YAML style guides, add a trailing newline:
   - kind: ServiceAccount
     name: {{ include "metadata-collector.fullname" . }}
     namespace: {{ .Release.Namespace }}
+
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml (1)
22-33: LGTM! ClusterRole permissions correctly scoped for the feature.

The two policy rules are appropriately minimal:

patch on pods enables annotation updates with GPU device mappings

get on nodes/proxy allows accessing the kubelet endpoints on localhost

These align with the PR objectives for discovering and annotating pod GPU allocations.
Minor: Add newline at end of file

Per POSIX conventions and common YAML style guides, add a trailing newline:
     resources:
       - nodes/proxy
     verbs:
       - get
+
metadata-collector/pkg/mapper/mapper_test.go (1)
286-309: Refactor to eliminate duplicate clientset and pod creation.

Lines 289-292 recreate a clientset and pod that were already created in newTestDeviceMapper (line 286), only to replace the mapper's client at line 309. This duplication wastes setup work.
Refactoring options

Option 1: Pass nil for test pods to newTestDeviceMapper and do all setup here:
-	mapper, err := newTestDeviceMapper(testCase, []*corev1.Pod{existingTestPod})
+	mapper, err := newTestDeviceMapper(testCase, nil)
 	assert.NoError(t, err)

 	ctx := context.TODO()
 	clientSet := fake.NewSimpleClientset()
 	_, err = clientSet.CoreV1().Pods(existingTestPod.GetNamespace()).Create(ctx, existingTestPod, metav1.CreateOptions{})
 	assert.NoError(t, err)
+	mapper.kubernetesClient = clientSet

-	callCount := 0
 	clientSet.Fake.PrependReactor("patch", "pods", func(action ktesting.Action) (handled bool,
 		ret runtime.Object, err error) {
 		...
 	})
-	mapper.kubernetesClient = clientSet
Option 2: Extend newTestDeviceMapper to accept an optional clientset parameter for tests that need custom reactors.

Apply similar refactoring to TestUpdatePodDevicesAnnotationsWithNonRetryablePatchError (lines 328-348).
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (1)

15-28: Consider documenting the DaemonSet selection logic.

The coding guidelines recommend explaining DaemonSet variant selection logic in Helm chart documentation. Consider adding comments or documentation that describes:

When this DaemonSet will be deployed (nodeSelector requirements)

How the runtimeClassName affects deployment

Any configuration options that control DaemonSet behavior

Based on coding guidelines learnings.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3705a73 and 14257ec.

⛔ Files ignored due to path filters (1)

metadata-collector/go.sum is excluded by !**/*.sum

📒 Files selected for processing (14)

data-models/pkg/model/pod_device_annotation.go
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
metadata-collector/go.mod
metadata-collector/main.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/pkg/mapper/mapper_test.go

💤 Files with no reviewable changes (1)

distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml

🚧 Files skipped from review as they are similar to previous changes (2)

metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/mapper.go

🧰 Additional context used

📓 Path-based instructions (5)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

metadata-collector/main.go
data-models/pkg/model/pod_device_annotation.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/httpsclient.go

**/go.mod

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

metadata-collector/go.mod

**/daemonset*.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Explain DaemonSet variant selection logic in Helm chart documentation

Files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

distros/kubernetes/**/*daemonset*.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes using nodeAffinity based on kata.enabled label
Regular node DaemonSets should use /var/log volume mount for file-based logs
Kata node DaemonSets should use /run/log/journal and /var/log/journal volume mounts for systemd journal

Files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper_test.go

🧠 Learnings (16)

📚 Learning: 2026-01-06T21:31:31.115Z

Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:31.115Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

metadata-collector/go.mod
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/httpsclient.go

📚 Learning: 2025-12-22T16:48:13.460Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-12-22T16:16:31.660Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

metadata-collector/go.mod
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Kata node DaemonSets should use `/run/log/journal` and `/var/log/journal` volume mounts for systemd journal

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Regular node DaemonSets should use `/var/log` volume mount for file-based logs

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-06T10:59:26.687Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 281
File: distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml:20-0
Timestamp: 2025-11-06T10:59:26.687Z
Learning: The pause container image registry.k8s.io/pause:3.9+ requires eBPF and cgroup v2 support (kernel 4.15+, CONFIG_BPF, CONFIG_CGROUP_BPF, /sys/fs/bpf mounted). Older pause:3.0 images from gcr.io/google-containers/pause work on systems without full BPF/cgroup v2 support. When suggesting pause image updates, verify the target environment has necessary kernel features to avoid BPF_PROG_ATTACH errors.

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

metadata-collector/pkg/mapper/mapper_test.go

🧬 Code graph analysis (4)

metadata-collector/main.go (1)

metadata-collector/pkg/mapper/mapper.go (1)

NewPodDeviceMapper (49-76)

metadata-collector/pkg/mapper/grpcclient.go (1)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

EntityTypeToResourceNames (22-27)

metadata-collector/pkg/mapper/httpsclient_test.go (1)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

PodDeviceAnnotationName (18-18)

metadata-collector/pkg/mapper/mapper_test.go (1)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

PodDeviceAnnotationName (18-18)

🪛 ast-grep (0.40.3)

metadata-collector/pkg/mapper/httpsclient.go

[warning] 55-57: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{
InsecureSkipVerify: true, //nolint:gosec
}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)

🪛 YAMLlint (1.37.1)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

🔇 Additional comments (7)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml (1)

15-20: LGTM! The static analysis hint is a false positive.

The YAMLlint error flagged at line 20 is a false positive. The {{- include ... | nindent 4 }} syntax is valid Helm templating, where:

{{- trims preceding whitespace

nindent 4 indents the included content by 4 spaces

YAMLlint cannot properly parse Helm template directives and incorrectly interprets them as malformed YAML. The template structure is correct and follows standard Helm patterns.
metadata-collector/go.mod (1)
12-16: Update dependency versions to use stable releases instead of pre-release versions.

The review comment's concern is valid. While google.golang.org/grpc v1.77.0 is current and free from known vulnerabilities, the k8s.io packages (api, apimachinery, client-go, kubelet) are specified at v0.35.0, which does not have a stable release yet—only release candidates (v0.35.0-rc.0 and v0.35.0-rc.1). Using unreleased versions in a module creates maintainability and compatibility risks. Specify the latest stable Kubernetes release (v1.31.x or v1.32.x) or wait for v0.35.0 stable release before adding these as direct dependencies.
⛔ Skipped due to learnings
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:31.115Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
metadata-collector/pkg/mapper/mapper_test.go (1)

89-428: Test coverage is comprehensive and well-structured.

The test suite thoroughly covers the PodDeviceMapper functionality:

Error handling for kubelet client failures

Annotation lifecycle (add, update, remove, no-change, no-annotation)

Retry behavior on conflicts

Non-retryable error handling

Multiple pods with varied device allocations

Past review comments regarding missing error checks have been properly addressed throughout.

metadata-collector/pkg/mapper/httpsclient_test.go (1)

32-559: LGTM! Comprehensive test coverage for kubelet HTTPS client.

The test suite covers all critical paths:

Successful pod listing with device annotation extraction

Error handling for token loading, request creation, network failures, invalid responses, and JSON unmarshalling

Retry behavior on transient failures

The large JSON constant provides realistic test data, and the mock implementation cleanly isolates HTTP interactions.

metadata-collector/pkg/mapper/httpsclient.go (1)

97-151: LGTM! ListPods implementation is robust and well-documented.

The implementation properly handles:

Token refresh on each request (supports rotation via projected volumes)

Context-aware HTTP requests with proper timeouts

Retry logic with error wrapping for diagnostics

Resource cleanup with deferred body close

The extensive comments (lines 72-96) clearly justify the security trade-offs for localhost communication.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (2)

38-38: LGTM: Service account binding added.

The serviceAccountName binding correctly enables RBAC permissions for the metadata-collector to patch pod annotations via the Kubernetes API.

45-84: LGTM: Container security context and pod-resources volume mount are appropriate.

The privileged security context (runAsUser: 0, runAsGroup: 0, privileged: true) combined with the pod-gpu-resources volume mount at /var/lib/kubelet/pod-resources correctly enables access to the Kubelet PodResourcesLister gRPC endpoint for discovering GPU device allocations. The readOnly flag appropriately restricts write access.

metadata-collector/pkg/mapper/httpsclient.go

github-actions · 2026-01-07T00:16:00Z

🛡️ CodeQL Analysis

🚨 Found 1 security alert(s)

🔗 View details

github-actions · 2026-01-07T00:35:07Z

Merging this branch will increase overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/nvidia/nvsentinel/data-models/pkg/model	0.00% (ø)
github.com/nvidia/nvsentinel/metadata-collector	0.00% (ø)
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper	47.02% (+47.02%)	🌟

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/nvidia/nvsentinel/data-models/pkg/model/pod_device_annotation.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/metadata-collector/main.go	0.00% (ø)	165 (+83)	0	165 (+83)
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/grpcclient.go	47.37% (+47.37%)	133 (+133)	63 (+63)	70 (+70)	🌟
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/httpsclient.go	50.51% (+50.51%)	99 (+99)	50 (+50)	49 (+49)	🌟
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/mapper.go	44.92% (+44.92%)	187 (+187)	84 (+84)	103 (+103)	🌟

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/grpcclient_test.go
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/httpsclient_test.go
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/mapper_test.go

coderabbitai

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (1)

45-50: Implement kata vs regular node separation with proper volume mounts per coding guidelines.

This DaemonSet must be split into separate variants for kata and regular nodes using nodeAffinity based on the nvsentinel.dgxc.nvidia.com/kata.enabled label. Regular node DaemonSets should mount /var/log for file-based logs, while Kata node DaemonSets should mount /run/log/journal and /var/log/journal for systemd journal. The single nodeSelector approach is insufficient—explicit nodeAffinity filtering is required.

Additionally, reconsider whether privileged: true is necessary for accessing the kubelet pod-resources socket, or whether specific capabilities would provide adequate access with better security boundaries.

🤖 Fix all issues with AI agents

In @docs/configuration/metadata-collector.md:
- Line 5: Add a new "Configuration" subsection to the metadata-collector docs
that documents the pod-to-GPU mapping annotation (dgxc.nvidia.com/devices):
specify a boolean flag to enable/disable pod device annotation (e.g.,
enablePodDeviceAnnotation), configuration knobs for scrape/poll intervals (e.g.,
podResourcesPollInterval) and any refresh backoff, the configurable Kubelet
PodResourcesLister socket path/name if supported (e.g.,
kubeletPodResourcesSocket), the exact annotation key (dgxc.nvidia.com/devices)
and example key/value formats (single GPU, multiple GPUs, UUIDs vs indexes), and
any RBAC/serviceAccount requirements needed to update pod annotations (which
role, verbs, and serviceAccount name) so operators know how to enable and secure
the feature.

In @docs/metadata-collector.md:
- Line 9: Replace instances of the compound adjective "pod to GPU" used to
modify nouns (e.g., "pod to GPU mapping") with the hyphenated form "pod-to-GPU"
so the phrase reads "pod-to-GPU mapping" (and any similar uses like "pod-to-GPU
annotation"); update all occurrences mentioned in the review (the three places
where "pod to GPU" is used as a modifier) to use the hyphenated form for correct
compound-adjective hyphenation.
- Line 28: Fix the grammar in the sentence "The Metadata Collector runs as an
regular container on each GPU node:" by replacing "an regular" with "a regular"
so it reads "The Metadata Collector runs as a regular container on each GPU
node:"; update the single occurrence of that sentence in the
docs/metadata-collector.md content.

In @metadata-collector/go.mod:
- Around line 47-48: Consolidate the duplicate YAML imports by using a single
canonical import path and version: replace the go.yaml.in/yaml/v3 v3.0.4 entry
with gopkg.in/yaml.v3 v3.0.4 (or update the existing gopkg.in/yaml.v3 v3.0.1
entry to v3.0.4) and remove the redundant alternative path so only one
gopkg.in/yaml.v3 entry remains at the chosen version.

In @metadata-collector/main.go:
- Around line 79-93: The ticker created with
time.NewTicker(defaultPodDeviceMonitorPeriod) is never stopped and
UpdatePodDevicesAnnotations errors currently abort the loop; fix by calling
ticker.Stop() via a defer immediately after creating ticker to avoid leaking its
goroutine, and change the error handling for
podDeviceMapper.UpdatePodDevicesAnnotations() to log the error (using slog.Error
or slog.Warn with context) and continue the loop instead of returning fmt.Errorf
so transient failures don’t terminate the periodic monitor; keep the ctx.Done()
case unchanged to allow graceful shutdown.

In @metadata-collector/pkg/mapper/grpcclient.go:
- Around line 48-75: Add a godoc comment above the exported constructor
NewKubeletGRPClient describing what the function does, its parameters and return
values, and any important behavior (e.g., it verifies the kubelet pod resources
socket, establishes a gRPC connection over a unix domain socket using insecure
credentials and a custom dialer, and returns a KubeletGRPClient or an error).
Mention side-effects (like calling resolver.SetDefaultScheme) and the meaning of
errors returned to make usage clear.
- Around line 153-168: The deduplication loop over deviceAnnotation.Devices can
panic for empty slices; update the loop in grpcclient.go (the block iterating
devicesPerPod and resourceName) to skip processing when len(devices) == 0 (i.e.,
check and continue before calling sort.Strings and the dedupe logic) so you
never attempt devices[:j+1] on an empty slice; keep the existing sort and
in-place dedupe for non-empty slices and assign back to
deviceAnnotation.Devices[resourceName] only after the guard.

In @metadata-collector/pkg/mapper/httpsclient.go:
- Around line 39-41: Add a godoc comment for the exported interface
KubeletHTTPSClient explaining its purpose (a client for communicating with the
kubelet over HTTPS to retrieve pod information) and document the ListPods method
(returns the current list of corev1.Pod objects from the kubelet or an error).
Place the comment immediately above the "type KubeletHTTPSClient interface"
declaration and follow Go doc style (start with "KubeletHTTPSClient ..." and
mention the behavior/return values of ListPods).
- Around line 55-62: The TLS config on the http.Transport currently sets
InsecureSkipVerify but omits a minimum TLS version; update the TLSClientConfig
in the transport construction to set MinVersion to a safe minimum (e.g.,
tls.VersionTLS12) to prevent protocol downgrades. Locate the transport variable
and its TLSClientConfig block and add MinVersion: tls.VersionTLS12 (or
tls.VersionTLS13 if you require only TLS1.3) alongside InsecureSkipVerify to
harden the connection.
- Around line 54-70: Add a Go doc comment immediately above the exported
constructor NewKubeletHTTPSClient that explains what the function returns (a
KubeletHTTPSClient configured to talk to the kubelet over HTTPS), summarizes key
behavior (uses an http.Transport with InsecureSkipVerify=true, timeouts, and
sets bearerTokenPath and listPodsURI), and notes any important context or side
effects (takes a context.Context, may read the bearer token from
bearerTokenPath). Reference the constructor name NewKubeletHTTPSClient and the
returned type KubeletHTTPSClient in the comment.

In @metadata-collector/pkg/mapper/mapper.go:
- Around line 38-40: Add a godoc comment above the exported interface
PodDeviceMapper describing its responsibility (e.g., mapping device information
into Pod annotations) and document its method UpdatePodDevicesAnnotations to
explain what it does, what the return values represent (the int and error), and
any side effects or expectations; place the comment immediately above the type
PodDeviceMapper and use standard Go doc style starting with "PodDeviceMapper
..." and "UpdatePodDevicesAnnotations ..." to satisfy linting/godoc
requirements.

🧹 Nitpick comments (7)

metadata-collector/pkg/mapper/mapper_test.go (2)
127-127: Swap assert.Equal arguments to follow (expected, actual) convention.

The testify/assert.Equal function expects arguments in the order (expected, actual). Throughout these tests, the arguments are reversed (e.g., assert.Equal(t, numUpdates, 0) should be assert.Equal(t, 0, numUpdates)). This affects error messages when assertions fail.
♻️ Example fix for line 127
-	assert.Equal(t, numUpdates, 0)
+	assert.Equal(t, 0, numUpdates)
Apply similar fixes throughout the file.
Also applies to: 143-143, 163-163, 192-192, 219-219, 246-246, 267-267, 313-313, 352-352, 400-400

300-300: Remove debug fmt.Println statements.

These appear to be leftover debug statements. Use t.Log if logging is needed for test diagnostics, or remove them entirely.
♻️ Proposed fix
 		switch callCount {
 		case 1:
-			fmt.Println("error 1")
 			return true, nil, errors.NewConflict(
Also applies to: 342-342
metadata-collector/pkg/mapper/httpsclient_test.go (1)
453-466: Consider validating request in RoundTrip for more robust mocking.

The mock's RoundTrip method ignores the incoming request (m.Called() without arguments). While acceptable for current tests focused on response handling, future tests requiring request validation would need updates.
♻️ Optional enhancement for request validation
 func (m *MockHTTPRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
-	args := m.Called()
+	args := m.Called(req)
 
 	var resp *http.Response
 	if args.Get(0) != nil {
 		resp = args.Get(0).(*http.Response)
 	}
 	return resp, args.Error(1)
 }
This would require updating mock setup to use mock.AnythingOfType("*http.Request").
metadata-collector/pkg/mapper/httpsclient.go (1)
97-155: Consider more selective retry logic.

The ListPods implementation is well-structured with proper error handling and token rotation support. However, retryAllErrors (line 153-155) retries all errors indiscriminately, which may lead to unnecessary retries for non-transient failures (e.g., 401 Unauthorized).

Consider implementing selective retry logic similar to the pattern in mapper.go that only retries transient errors like timeouts, rate limiting, and server errors.
♻️ Proposed refinement
-func retryAllErrors(_ error) bool {
-	return true
+func retryAllErrors(err error) bool {
+	// Retry on network timeouts and 5xx server errors
+	if errors.IsTimeout(err) || errors.IsServerTimeout(err) {
+		return true
+	}
+	// Check for temporary network errors in the error chain
+	// (http.Client wraps net errors)
+	return false
 }
metadata-collector/pkg/mapper/grpcclient.go (1)
185-187: Consider logging errors from Close.

The Close method ignores any error returned by connection.Close(). While this may be acceptable during cleanup, consider logging the error to aid debugging of connection issues.
♻️ Proposed refinement
 func (client *kubeletGRPClient) Close() {
-	client.connection.Close()
+	if err := client.connection.Close(); err != nil {
+		slog.Warn("Error closing gRPC connection", "error", err)
+	}
 }
metadata-collector/pkg/mapper/grpcclient_test.go (1)
159-193: Consider using standard testing package for simple assertions.

Based on learnings, avoid introducing testify for simple equality/inequality checks. The complex map comparison at line 179 justifies testify usage, but simple assertions like assert.NoError, assert.Error, and assert.NotNil could use the standard testing package (t.Error, t.Fatal, etc.).
♻️ Example refinement for simple checks
 func TestListPodResourcesError(t *testing.T) {
 	kubeletGRPCClient := newTestKubeletGRPCClient(errors.New("ListPodResources error"), 10, podDevices)
 	_, err := kubeletGRPCClient.ListPodResources()
-	assert.Error(t, err)
+	if err == nil {
+		t.Fatal("expected error, got nil")
+	}
 }
Based on learnings.
metadata-collector/pkg/mapper/mapper.go (1)
197-214: Include namespace in log and error messages for clarity.

The log message at line 204 and error message at line 209 should include the namespace along with the pod name to uniquely identify the pod being patched, as pod names are only unique within a namespace.
♻️ Proposed refinement
 		if err != nil && isRetryableError(err) {
 			slog.Warn("Retryable error patching pod annotation. Retrying...",
 				"pod", podName,
+				"namespace", namespace,
 				"error", err)
 		}

 		if err != nil {
-			return fmt.Errorf("failed to patch pod %s: %w", podName, err)
+			return fmt.Errorf("failed to patch pod %s/%s: %w", namespace, podName, err)
 		}

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14257ec and 171432f.

⛔ Files ignored due to path filters (1)

metadata-collector/go.sum is excluded by !**/*.sum

📒 Files selected for processing (16)

data-models/pkg/model/pod_device_annotation.go
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
docs/configuration/metadata-collector.md
docs/metadata-collector.md
metadata-collector/go.mod
metadata-collector/main.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/pkg/mapper/mapper_test.go

💤 Files with no reviewable changes (1)

distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml

🚧 Files skipped from review as they are similar to previous changes (1)

data-models/pkg/model/pod_device_annotation.go

🧰 Additional context used

📓 Path-based instructions (5)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/main.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient_test.go

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient_test.go

**/go.mod

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

metadata-collector/go.mod

**/daemonset*.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Explain DaemonSet variant selection logic in Helm chart documentation

Files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

distros/kubernetes/**/*daemonset*.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes using nodeAffinity based on kata.enabled label
Regular node DaemonSets should use /var/log volume mount for file-based logs
Kata node DaemonSets should use /run/log/journal and /var/log/journal volume mounts for systemd journal

Files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

🧠 Learnings (17)

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/go.mod

📚 Learning: 2025-11-07T04:16:43.079Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

docs/metadata-collector.md

📚 Learning: 2025-11-06T10:59:26.687Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 281
File: distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml:20-0
Timestamp: 2025-11-06T10:59:26.687Z
Learning: The pause container image registry.k8s.io/pause:3.9+ requires eBPF and cgroup v2 support (kernel 4.15+, CONFIG_BPF, CONFIG_CGROUP_BPF, /sys/fs/bpf mounted). Older pause:3.0 images from gcr.io/google-containers/pause work on systems without full BPF/cgroup v2 support. When suggesting pause image updates, verify the target environment has necessary kernel features to avoid BPF_PROG_ATTACH errors.

Applied to files:

docs/metadata-collector.md
docs/configuration/metadata-collector.md
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

metadata-collector/main.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

metadata-collector/pkg/mapper/grpcclient.go

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient_test.go

📚 Learning: 2026-01-06T21:31:36.113Z

Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:36.113Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-12-22T16:48:13.460Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-12-22T16:16:31.660Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

metadata-collector/go.mod
metadata-collector/pkg/mapper/httpsclient_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Kata node DaemonSets should use `/run/log/journal` and `/var/log/journal` volume mounts for systemd journal

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

🧬 Code graph analysis (5)

metadata-collector/main.go (1)

metadata-collector/pkg/mapper/mapper.go (1)

NewPodDeviceMapper (50-77)

metadata-collector/pkg/mapper/grpcclient.go (1)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

EntityTypeToResourceNames (22-27)

metadata-collector/pkg/mapper/mapper_test.go (1)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

PodDeviceAnnotationName (18-18)

metadata-collector/pkg/mapper/grpcclient_test.go (1)

data-models/pkg/model/pod_device_annotation.go (1)

DeviceAnnotation (30-32)

metadata-collector/pkg/mapper/httpsclient_test.go (1)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

PodDeviceAnnotationName (18-18)

🪛 ast-grep (0.40.4)

metadata-collector/pkg/mapper/httpsclient.go

[warning] 55-57: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{
InsecureSkipVerify: true, //nolint:gosec
}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)

🪛 LanguageTool

docs/metadata-collector.md

[grammar] ~9-~9: Use a hyphen to join words.
Context: ...adata Collector will also expose the pod to GPU mapping as an annotation on each ...

(QB_NEW_EN_HYPHEN)

[grammar] ~9-~9: Use a hyphen to join words.
Context: ...ta Collector will also expose the pod to GPU mapping as an annotation on each pod...

(QB_NEW_EN_HYPHEN)

[grammar] ~22-~22: Use a hyphen to join words.
Context: ...y, the node drainer module needs the pod to GPU mapping to map a set of pods to t...

(QB_NEW_EN_HYPHEN)

[grammar] ~22-~22: Use a hyphen to join words.
Context: ...the node drainer module needs the pod to GPU mapping to map a set of pods to the ...

(QB_NEW_EN_HYPHEN)

[grammar] ~101-~101: Use a hyphen to join words.
Context: ...th monitors. ### Kubernetes API The pod to GPU mapping is exposed on pods object...

(QB_NEW_EN_HYPHEN)

[grammar] ~101-~101: Use a hyphen to join words.
Context: ...monitors. ### Kubernetes API The pod to GPU mapping is exposed on pods objects a...

(QB_NEW_EN_HYPHEN)

🪛 YAMLlint (1.37.1)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

🔇 Additional comments (21)

docs/configuration/metadata-collector.md (1)

21-21: Clarify container lifecycle: is this still an init container?

Line 21 refers to the "metadata-collector init container," but the PR description and AI summary suggest a shift from "init-container pattern to a regular container lifecycle." Verify and update the documentation to reflect whether metadata-collector is now a regular (long-running) container, sidecar, or still an init container. This affects deployment expectations and resource management.

metadata-collector/go.mod (3)

7-17: Direct dependencies align with PR feature and coding guidelines.

The additions of google.golang.org/grpc, k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go are appropriate and follow the guideline to use client-go for Kubernetes API interactions. The semantic versioning at v0.35.0 is correct and represents the latest stable release of the Kubernetes libraries; v0.36.0 is only available as alpha/rc builds.

27-37: These subpackages are legitimately pulled in by k8s.io/kube-openapi and are correctly marked as indirect dependencies.

Lines 27–37 list subpackages of go-openapi/swag as indirect dependencies, which are legitimately required by k8s.io/kube-openapi (itself a dependency of your k8s.io packages). The go-openapi/swag module uses semantic import versioning and breaks its code into subpackages with individual go.mod entries; this is expected behavior and the result of standard go mod tidy operations. No code in this module directly imports these subpackages, and they are correctly marked as indirect. No action required.

12-16: Dependencies are correctly specified for PodResourcesLister gRPC feature.

Verification confirms that k8s.io/kubelet v0.35.0 provides the PodResourcesLister API (available in pkg/apis/podresources/v1alpha1), and the codebase actively uses v1.PodResourcesListerClient with these exact versions. The gRPC v1.77.0 and k8s.io/kubelet v0.35.0 combination is compatible and requires no changes.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml (1)

15-20: LGTM - ServiceAccount template is correctly structured.

The template correctly uses Helm templating for name and labels. The YAMLlint syntax error is a false positive due to Helm template directives.

Consider adding a trailing newline at the end of the file for POSIX compliance.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml (1)

21-33: LGTM - ClusterRole permissions follow least-privilege principle.

The permissions are appropriately scoped:

pods/patch - required to update pod annotations with GPU device mappings

nodes/proxy/get - required to access the Kubelet HTTPS /pods endpoint

Consider adding a trailing newline at the end of the file.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml (1)

15-28: LGTM - ClusterRoleBinding correctly links ServiceAccount to ClusterRole.

The binding correctly references the ClusterRole and ServiceAccount using consistent Helm templating, with the ServiceAccount namespace properly set to the release namespace.

Consider adding a trailing newline at the end of the file.

metadata-collector/pkg/mapper/mapper_test.go (3)

359-359: Use &model.DeviceAnnotation{} instead of &model.DeviceAnnotation{}.

Minor style nit: when creating composite literal pointers, you can simplify &model.DeviceAnnotation{...} - but the current syntax is correct. No change needed.

Also applies to: 370-370, 414-414

89-112: LGTM - Test helper newTestDeviceMapper is well-structured.

The helper correctly sets up mocks and creates test pods in the fake clientset, with proper error handling and propagation.

274-314: LGTM - Retry logic test correctly validates conflict handling.

The test properly simulates a conflict error on first patch attempt and verifies the retry succeeds. The past review comment about missing error check has been addressed with assert.NoError(t, err) on line 287.

metadata-collector/pkg/mapper/httpsclient_test.go (2)

482-502: LGTM - TestListPods validates successful pod listing and annotation extraction.

The test properly verifies that pods are listed and device annotations can be extracted and deserialized correctly.

534-553: LGTM - Retry test correctly validates transient failure recovery.

The test properly simulates a 500 error followed by success, verifying the client's retry logic works as expected.

metadata-collector/main.go (2)

56-70: Verify intended behavior: runMapper blocks indefinitely until signal.

The current flow calls runMapper which runs an infinite loop until context cancellation. This means "Pod device mapper completed successfully" at line 70 will only be logged on graceful shutdown (SIGTERM/SIGINT). Ensure this is the intended behavior.

96-139: LGTM - runCollector is well-structured with proper resource cleanup.

The function correctly:

Initializes and defers NVML shutdown

Handles errors with appropriate context wrapping

Uses structured logging throughout

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (3)

38-38: LGTM: ServiceAccount reference added correctly.

The serviceAccountName field properly references the metadata-collector ServiceAccount created in the new RBAC templates, enabling the DaemonSet to interact with the Kubernetes API.

69-71: LGTM: Pod resources volume mount configured correctly.

The volume mount for /var/lib/kubelet/pod-resources is properly configured with readOnly: true, which is appropriate since the metadata-collector only needs to read pod resource allocations from the kubelet socket.

Also applies to: 81-84

85-101: No changes needed—this DaemonSet correctly handles its use case.

The metadata-collector collects GPU metadata and pod GPU resource information, not system logs. It does not require separate kata and regular variants or the log volume mounts specified in the learnings. Those requirements apply to services like syslog-health-monitor that access logs differently on kata vs. regular nodes. The current single-DaemonSet design with optional runtimeClassName is correct for metadata-collector's functionality.

Likely an incorrect or invalid review comment.

metadata-collector/pkg/mapper/grpcclient_test.go (1)

34-157: LGTM: Well-structured test infrastructure.

The test fixtures and mock implementation are comprehensive, covering multiple scenarios including:

Same GPU allocated multiple times to one container

Same GPU allocated to different containers

Different pods with different devices

Unsupported devices

Empty pods

The error injection mechanism for retry testing is well-designed.

metadata-collector/pkg/mapper/mapper.go (2)

122-195: LGTM: Well-implemented annotation synchronization logic.

The UpdatePodDevicesAnnotations implementation is comprehensive and well-documented:

Correctly uses MergePatch for adding/updating annotations (to handle missing annotations object)

Uses JSONPatch for removal operations

Includes proper error handling and logging

Acknowledges cyclomatic complexity with nolint directive

The detailed comment block (lines 79-120) provides excellent context for the implementation approach.

216-233: LGTM: Comprehensive retry error handling.

The isRetryableError function correctly identifies transient errors that should be retried:

Resource version conflicts

Server timeouts and rate limiting

Network timeouts and service unavailability

This aligns well with Kubernetes API best practices for retry logic.
metadata-collector/pkg/mapper/grpcclient.go (1)
37-39: Add godoc comment for exported interface.

As per coding guidelines, all exported interfaces require documentation comments. Add a godoc comment explaining the purpose of KubeletGRPClient and its ListPodResources method.
📝 Proposed documentation
+// KubeletGRPClient provides access to the Kubelet's PodResourcesLister gRPC service
+// for discovering device allocations to pods on the local node.
 type KubeletGRPClient interface {
+	// ListPodResources returns a mapping from "namespace/name" to device allocations
+	// for all pods on the node with supported GPU resources.
 	ListPodResources() (map[string]*model.DeviceAnnotation, error)
 }
As per coding guidelines, "Function comments required for all exported Go functions".
⛔ Skipped due to learnings
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

docs/configuration/metadata-collector.md

docs/metadata-collector.md

metadata-collector/main.go

github-actions · 2026-01-07T22:31:40Z

Merging this branch will increase overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/nvidia/nvsentinel/data-models/pkg/model	0.00% (ø)
github.com/nvidia/nvsentinel/metadata-collector	0.00% (ø)
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper	46.88% (+46.88%)	🌟

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/nvidia/nvsentinel/data-models/pkg/model/pod_device_annotation.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/metadata-collector/main.go	0.00% (ø)	165 (+83)	0	165 (+83)
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/grpcclient.go	47.37% (+47.37%)	133 (+133)	63 (+63)	70 (+70)	🌟
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/httpsclient.go	50.51% (+50.51%)	99 (+99)	50 (+50)	49 (+49)	🌟
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/mapper.go	44.78% (+44.78%)	201 (+201)	90 (+90)	111 (+111)	🌟

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/grpcclient_test.go
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/httpsclient_test.go
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/mapper_test.go

github-actions · 2026-01-07T22:38:07Z

🛡️ CodeQL Analysis

🚨 Found 1 security alert(s)

🔗 View details

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (2)

45-101: Add kata vs regular node DaemonSet separation.

Per coding guidelines, this chart requires separate DaemonSets for kata and regular nodes using nodeAffinity with the nvsentinel.dgxc.nvidia.com/kata.enabled label. The existing daemonset-kata.yaml and daemonset-regular.yaml pattern in the syslog-health-monitor chart provides a reference implementation for this separation.

For metadata-collector, create:

daemonset-regular.yaml targeting nodes where kata.enabled: "false"

daemonset-kata.yaml targeting nodes where kata.enabled: "true"

If metadata-collector needs to collect logs, regular nodes should mount /var/log and kata nodes should mount /run/log/journal and /var/log/journal.

48-50: Privileged mode should be justified or replaced with minimal required capabilities.

The privileged: true setting grants broad host access, which appears excessive for GPU metadata collection. Consider whether specific Linux capabilities (similar to syslog-health-monitor's approach with SYSLOG and SYS_ADMIN) are sufficient for:

NVML GPU metadata access via /sys

Kubelet pod-resources gRPC access

ServiceAccount-based kubelet HTTPS access

If privileged access is unavoidable, document the security rationale.

🤖 Fix all issues with AI agents

In
@distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml:
- Around line 22-27: The ClusterRole entry granting apiGroups: "" resources:
["pods"] verbs: ["patch"] is overly broad; confirm whether the
metadata-collector actually needs cluster-wide pod patch rights for GPU metadata
collection and, if so, either restrict scope by adding resourceNames or
namespace/label-based constraints (or limit to specific subresources) or replace
patch with a less-privileged verb; if the broad permission is intentional, add
an inline YAML comment in the ClusterRole explaining the threat model/use case
and link to any security justification document so reviewers understand why
pods:patch is required.

In @metadata-collector/main.go:
- Around line 86-93: The loop handling ticker.C is returning on any error from
podDeviceMapper.UpdatePodDevicesAnnotations(), which will kill the periodic
monitoring; instead, change the error handling so that when
podDeviceMapper.UpdatePodDevicesAnnotations() returns an error you log it (using
slog.Error or similar with context such as "failed to update pod device
annotations" and the error value) and continue the loop to wait for the next
tick; keep the successful path that logs numUpdates with slog.Info. Ensure you
reference the ticker.C select case and the
podDeviceMapper.UpdatePodDevicesAnnotations() call when making the change.

In @metadata-collector/pkg/mapper/grpcclient_test.go:
- Line 179: The assert.Equal call has its arguments reversed which will produce
confusing failure messages; swap the arguments so the expected value
(expectedDevicesPerPod) is passed before the actual value (devicesPerPod) in the
assert.Equal(t, expected, actual) invocation that currently reads
assert.Equal(t, devicesPerPod, expectedDevicesPerPod).

In @metadata-collector/pkg/mapper/mapper.go:
- Around line 38-40: Add a proper godoc comment above the exported interface
PodDeviceMapper describing its purpose and behavior; mention what
implementations are expected to do and what the returned int and error from
UpdatePodDevicesAnnotations represent (e.g., number of pods updated and error
conditions). Ensure the comment follows godoc style (starts with
"PodDeviceMapper ...") and also document the UpdatePodDevicesAnnotations method
semantics briefly within the interface comment or as a line comment above the
method.
- Around line 50-77: Add a godoc comment immediately above the exported function
NewPodDeviceMapper that begins with the function name and briefly describes what
it does, its ctx parameter (context.Context), what it returns (a PodDeviceMapper
or an error), and any important behaviors (e.g., it creates Kubelet HTTPS/gRPC
clients and an in-cluster Kubernetes client and wraps them in a
podDeviceMapper); ensure the comment is a proper single-line summary followed by
any short details on failure conditions to satisfy linting for exported symbols
like NewPodDeviceMapper and to clarify the relationship to podDeviceMapper.

🧹 Nitpick comments (7)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (1)

45-101: Document DaemonSet variant selection logic.

Per coding guidelines, the Helm chart documentation should explain DaemonSet variant selection logic. Since this file is daemonset.yaml (not a variant-specific file), the documentation should clarify how administrators select between deployment modes and whether kata support is planned.

Based on coding guidelines.
data-models/pkg/model/pod_device_annotation.go (1)
17-32: Add documentation for exported entities.

The exported constant PodDeviceAnnotationName, variable EntityTypeToResourceNames, and struct DeviceAnnotation lack godoc comments. Per coding guidelines, exported functions and types should be documented.

Additionally, EntityTypeToResourceNames is declared as var but appears to be a read-only lookup table. Consider whether this should be documented as immutable or protected against accidental modification.
📝 Proposed documentation
+// PodDeviceAnnotationName is the annotation key used to store GPU device allocations on pods.
 const (
 	PodDeviceAnnotationName = "dgxc.nvidia.com/devices"
 )

+// EntityTypeToResourceNames maps entity types (e.g., GPU_UUID) to their corresponding
+// Kubernetes resource names for device allocation tracking.
 var (
 	EntityTypeToResourceNames = map[string][]string{
 		"GPU_UUID": {
 			"nvidia.com/gpu",
 			"nvidia.com/pgpu",
 		},
 	}
 )

+// DeviceAnnotation represents the structure of GPU device allocations stored
+// as a pod annotation. The Devices map keys are resource names (e.g., "nvidia.com/gpu")
+// and values are lists of allocated device IDs.
 type DeviceAnnotation struct {
 	Devices map[string][]string `json:"devices"`
 }
Based on coding guidelines, "Function comments required for all exported Go functions".
metadata-collector/pkg/mapper/httpsclient_test.go (1)
453-466: Consider simplifying mock struct.

The MockHTTPRoundTripper embeds http.RoundTripper but never uses the embedded interface. This could be simplified to only embed mock.Mock.
♻️ Simplified mock
 type MockHTTPRoundTripper struct {
-	http.RoundTripper
 	mock.Mock
 }
metadata-collector/pkg/mapper/grpcclient.go (1)
185-187: Ignored error from connection.Close().

The error returned by client.connection.Close() is discarded. While Close errors are often benign, logging or returning the error would aid debugging connection issues.
♻️ Proposed change
-func (client *kubeletGRPClient) Close() {
-	client.connection.Close()
+func (client *kubeletGRPClient) Close() error {
+	return client.connection.Close()
 }
Note: This would require updating the interface if Close() is part of it.
metadata-collector/pkg/mapper/mapper_test.go (1)
300-300: Remove debug print statements from tests.

Lines 300 and 342 contain fmt.Println("error 1") which appear to be debug statements. These should be removed or converted to t.Log() if logging is needed during test execution.
♻️ Proposed fix
 		switch callCount {
 		case 1:
-			fmt.Println("error 1")
 			return true, nil, errors.NewConflict(
Apply the same fix at line 342.
Also applies to: 342-342
metadata-collector/pkg/mapper/grpcclient_test.go (2)

22-22: Consider using standard testing package for simple assertions.

Based on learnings, the repository prefers avoiding testify for simple equality/inequality checks, using standard testing package assertions (t.Error, t.Errorf, t.Fatal) instead. Reserve testify for complex scenarios requiring richer diagnostics.

159-159: Test names could follow the convention TestFunctionName_Scenario_ExpectedBehavior.

Per coding guidelines, consider renaming tests to be more descriptive, e.g.:

TestListPodResources_ValidDevices_ReturnsAggregatedAnnotations

TestListPodResources_PersistentError_ReturnsError

TestListPodResources_TransientError_RetriesSuccessfully

Also applies to: 182-182, 188-188

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 171432f and 0c645fe.

⛔ Files ignored due to path filters (1)

metadata-collector/go.sum is excluded by !**/*.sum

📒 Files selected for processing (16)

data-models/pkg/model/pod_device_annotation.go
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
docs/configuration/metadata-collector.md
docs/metadata-collector.md
metadata-collector/go.mod
metadata-collector/main.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/pkg/mapper/mapper_test.go

💤 Files with no reviewable changes (1)

distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml

🧰 Additional context used

📓 Path-based instructions (5)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

metadata-collector/pkg/mapper/httpsclient_test.go
data-models/pkg/model/pod_device_annotation.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/main.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/grpcclient.go

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/mapper_test.go

**/go.mod

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

metadata-collector/go.mod

**/daemonset*.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Explain DaemonSet variant selection logic in Helm chart documentation

Files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

distros/kubernetes/**/*daemonset*.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes using nodeAffinity based on kata.enabled label
Regular node DaemonSets should use /var/log volume mount for file-based logs
Kata node DaemonSets should use /run/log/journal and /var/log/journal volume mounts for systemd journal

Files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

🧠 Learnings (18)

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/go.mod

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/mapper_test.go

📚 Learning: 2026-01-06T21:31:36.113Z

Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:36.113Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

metadata-collector/go.mod
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/grpcclient.go

📚 Learning: 2025-12-22T16:48:13.460Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-12-22T16:16:31.660Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

data-models/pkg/model/pod_device_annotation.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Include comprehensive comments for all fields in Protocol Buffer messages

Applied to files:

data-models/pkg/model/pod_device_annotation.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

data-models/pkg/model/pod_device_annotation.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/pkg/mapper/grpcclient.go

📚 Learning: 2025-11-06T10:59:26.687Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 281
File: distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml:20-0
Timestamp: 2025-11-06T10:59:26.687Z
Learning: The pause container image registry.k8s.io/pause:3.9+ requires eBPF and cgroup v2 support (kernel 4.15+, CONFIG_BPF, CONFIG_CGROUP_BPF, /sys/fs/bpf mounted). Older pause:3.0 images from gcr.io/google-containers/pause work on systems without full BPF/cgroup v2 support. When suggesting pause image updates, verify the target environment has necessary kernel features to avoid BPF_PROG_ATTACH errors.

Applied to files:

docs/configuration/metadata-collector.md
docs/metadata-collector.md
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

metadata-collector/main.go

📚 Learning: 2025-11-07T04:16:43.079Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.

Applied to files:

docs/metadata-collector.md

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Kata node DaemonSets should use `/run/log/journal` and `/var/log/journal` volume mounts for systemd journal

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

🧬 Code graph analysis (6)

metadata-collector/pkg/mapper/httpsclient_test.go (1)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

PodDeviceAnnotationName (18-18)

metadata-collector/pkg/mapper/mapper.go (3)

metadata-collector/pkg/mapper/httpsclient.go (2)

KubeletHTTPSClient (39-41)

NewKubeletHTTPSClient (54-70)

metadata-collector/pkg/mapper/grpcclient.go (2)

KubeletGRPClient (37-39)

NewKubeletGRPClient (48-75)

data-models/pkg/model/pod_device_annotation.go (1)

PodDeviceAnnotationName (18-18)

metadata-collector/main.go (1)

metadata-collector/pkg/mapper/mapper.go (1)

NewPodDeviceMapper (50-77)

metadata-collector/pkg/mapper/grpcclient_test.go (1)

data-models/pkg/model/pod_device_annotation.go (1)

DeviceAnnotation (30-32)

metadata-collector/pkg/mapper/mapper_test.go (2)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

PodDeviceAnnotationName (18-18)

platform-connectors/pkg/pipeline/factory.go (1)

Create (32-39)

metadata-collector/pkg/mapper/grpcclient.go (1)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

EntityTypeToResourceNames (22-27)

🪛 ast-grep (0.40.4)

metadata-collector/pkg/mapper/httpsclient.go

[warning] 55-57: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{
InsecureSkipVerify: true, //nolint:gosec
}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)

🪛 YAMLlint (1.37.1)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

🔇 Additional comments (17)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml (1)

1-20: LGTM!

The ServiceAccount template is correctly structured with proper Helm templating. The yamllint syntax error on line 20 is a false positive—the {{- syntax is valid Helm template syntax.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml (1)

28-33: LGTM!

The get permission on nodes/proxy is appropriate for accessing kubelet HTTPS endpoints through the API server proxy, which aligns with the PR's stated approach of using local kubelet endpoints.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml (1)

1-28: LGTM!

The ClusterRoleBinding correctly binds the ClusterRole to the ServiceAccount with proper Helm templating. The yamllint syntax error on line 20 is a false positive.

docs/metadata-collector.md (1)

38-44: Documentation looks comprehensive.

The new GPU-to-pod mapping annotation section clearly explains the workflow using Kubelet endpoints. The step-by-step breakdown (HTTPS /pods endpoint, gRPC PodResourcesLister, annotation updates, loop-based reconciliation) aligns with the implementation.

Consider specifying the default reconciliation interval in line 44 ("fixed threshold") for operational clarity, but this is optional.

metadata-collector/pkg/mapper/httpsclient_test.go (2)

482-502: Test coverage is thorough.

The test validates both the pod list parsing and the extraction of device annotations from pod metadata. The assertion on expectedDeviceAnnotation ensures the JSON unmarshalling logic works correctly with the model.DeviceAnnotation type.

534-553: Retry behavior test is well-structured.

This test correctly validates the retry logic by configuring the mock to return a 500 error on the first call and succeed on the second. The .Once() chaining ensures proper call ordering.

metadata-collector/go.mod (1)

12-16: Kubernetes and gRPC dependencies appropriately added.

The new direct dependencies on google.golang.org/grpc, k8s.io/api, k8s.io/apimachinery, k8s.io/client-go, and k8s.io/kubelet align with the Kubelet HTTPS and gRPC client implementations. Version pinning at v0.35.0 for all k8s.io modules ensures compatibility.

metadata-collector/pkg/mapper/httpsclient.go (1)

72-96: Excellent security rationale documentation.

The detailed comment block explaining the justification for InsecureSkipVerify (localhost SAN limitations) and the authentication/authorization flow (bearer token, TokenReview, SubjectAccessReview) is well-written and provides valuable context for future maintainers.

metadata-collector/pkg/mapper/grpcclient.go (2)

77-112: Comprehensive documentation for ListPodResources.

The detailed comment block with example output clearly explains the function's behavior, including deduplication, sorting, and the possibility of shared devices across pods. This is excellent for maintainability.

54-54: Global resolver state modification may affect other gRPC clients.

resolver.SetDefaultScheme("passthrough") modifies global state, which could affect other gRPC clients in the same process that expect the default DNS resolver. Consider using a connection-specific option (e.g., prefixing the target with "passthrough:///") or documenting this side effect.

metadata-collector/pkg/mapper/mapper_test.go (3)

89-112: Well-designed test helper.

The newTestDeviceMapper helper cleanly encapsulates mock setup and fake clientset creation, making individual tests concise and focused. Error propagation from pod creation is properly handled.

274-314: Retryable patch error test validates retry behavior.

The test correctly uses a PrependReactor to simulate a conflict error on the first patch attempt, then succeeds on retry. The assertion that numUpdates == 1 confirms the retry succeeded.

355-409: Multi-pod scenario provides good integration coverage.

This test validates the mapper handles multiple pods with different device allocations correctly, including pods with multiple resource types. The assertions verify both pods receive correct annotations.

metadata-collector/main.go (1)

97-140: Well-structured collector implementation.

The function properly handles NVML lifecycle with defer, wraps errors with context, and uses structured logging throughout.

metadata-collector/pkg/mapper/grpcclient_test.go (1)

98-149: Well-designed mock with error injection support.

The mock cleanly embeds PodResourcesListerClient, supports configurable error counts for retry testing, and correctly constructs pod resources from the fixture data.

metadata-collector/pkg/mapper/mapper.go (2)

79-120: Comprehensive implementation with excellent documentation.

The block comment thoroughly explains the algorithm and design decisions. The three-phase approach (list pods, get device allocations, reconcile annotations) is well-structured.

One consideration: early returns on errors (lines 149, 180, 187) mean a single pod failure stops processing remaining pods. This is acceptable for now but consider logging and continuing for transient per-pod errors in future iterations.

Also applies to: 122-195

197-233: Solid retry logic with appropriate error classification.

The isRetryableError function correctly handles transient failures (conflicts, timeouts, rate limiting, service unavailability), and the retry mechanism properly logs warnings before retrying. The error wrapping with %w preserves the original error for the k8s error type checks.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

metadata-collector/main.go

metadata-collector/pkg/mapper/httpsclient.go

metadata-collector/pkg/mapper/mapper.go

coderabbitai

Actionable comments posted: 9

🤖 Fix all issues with AI agents

In @metadata-collector/go.mod:
- Around line 47-48: The go.mod imports duplicate YAML module paths
(go.yaml.in/yaml/v3 v3.0.4 and gopkg.in/yaml.v3 v3.0.1); consolidate to a single
import path by keeping gopkg.in/yaml.v3 at the latest version (v3.0.4), remove
the go.yaml.in/yaml/v3 entry (and/or update the gopkg.in/yaml.v3 entry to
v3.0.4), then run go mod tidy to reconcile dependencies so only gopkg.in/yaml.v3
v3.0.4 remains.

In @metadata-collector/pkg/mapper/grpcclient_test.go:
- Line 179: The assertion in grpcclient_test.go uses reversed arguments; change
the assert.Equal call from assert.Equal(t, devicesPerPod, expectedDevicesPerPod)
to assert.Equal(t, expectedDevicesPerPod, devicesPerPod) so the testify
convention assert.Equal(t, expected, actual) is followed and failure messages
are correct.

In @metadata-collector/pkg/mapper/grpcclient.go:
- Around line 37-39: The exported interface KubeletGRPClient and its method
ListPodResources lack godoc; add a package-level comment line immediately above
the interface that begins with "KubeletGRPClient ..." describing its
responsibility (e.g., a gRPC client for querying kubelet pod device annotations)
and include a brief sentence for the ListPodResources method (or ensure the
interface comment documents the method's behavior), using standard Go doc style
that starts with the exact symbol names so tools like godoc and linters
recognize them.
- Around line 48-75: Add a Go doc comment for the exported constructor
NewKubeletGRPClient that briefly describes what the function does (creates and
returns a Kubelet gRPC client connected to the kubelet pod-resources Unix
socket), any important behavior (returns error if socket is missing or
connection fails, uses insecure credentials and a unix dialer), and its
parameters/return values (context.Context parameter and returns
KubeletGRPClient, error); place the comment immediately above the
NewKubeletGRPClient function declaration.
- Around line 153-168: The deduplication loop over devicesPerPod can panic when
a resource's devices slice is empty; update the loop in the devicesPerPod
iteration (the code touching deviceAnnotation.Devices, resourceName, and the
local devices slice) to skip processing when len(devices) == 0 before calling
sort.Strings and the in-place dedupe that uses devices[:j+1]. Ensure the guard
is placed right after obtaining devices :=
deviceAnnotation.Devices[resourceName] (or before sort.Strings) so empty slices
are left intact and no slicing beyond bounds occurs; leave behavior unchanged
for non-empty slices and still deduplicate in-place for those cases.

In @metadata-collector/pkg/mapper/httpsclient.go:
- Around line 55-62: Update the TLS config on the transport to set a minimum TLS
version to prevent downgrades: in the TLSClientConfig used when creating
transport (the variable named transport and its TLSClientConfig field where
InsecureSkipVerify is set), add a MinVersion field (e.g., tls.VersionTLS12) so
the client requires at least TLS 1.2 while preserving InsecureSkipVerify for
localhost kubelet communication.
- Around line 39-41: Add a godoc comment for the exported interface
KubeletHTTPSClient and its ListPods method: write a brief sentence above the
KubeletHTTPSClient type that describes its purpose (e.g., client used to call
the kubelet over HTTPS to retrieve pod information) and include a short comment
for ListPods describing that it returns the list of corev1.Pod for the node or
an error. Ensure the comments use proper Go doc style (starting with
"KubeletHTTPSClient ..." and "ListPods ...") immediately above the interface and
method declaration.
- Around line 54-70: Add a Go doc comment immediately above the exported
constructor NewKubeletHTTPSClient that briefly states what the function does,
its input and returned values, and any notable behavior (e.g., it creates a
KubeletHTTPSClient using an http.Transport configured with InsecureSkipVerify,
timeouts, and initialized listPodsURI and bearerTokenPath). Ensure the comment
starts with the function name "NewKubeletHTTPSClient" and describes the context
parameter and the returned (KubeletHTTPSClient, error) so it satisfies
godoc/coding guidelines.

In @metadata-collector/pkg/mapper/mapper.go:
- Line 15: Add a package-level godoc comment at the top of the mapper package:
insert a concise comment immediately above "package mapper" that describes the
package's responsibility (e.g., mapping raw metadata to canonical/internal
representations, key types it handles, and intended consumers). Keep it short,
present-tense, and follow Go doc conventions so tools like godoc and linters
accept it.

🧹 Nitpick comments (7)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml (1)

15-20: LGTM with minor formatting suggestion.

The ServiceAccount definition is correct. The static analysis hint about line 20 is a false positive—YAMLlint doesn't understand Helm templating syntax, and {{- include ... | nindent 4 }} is valid.

Consider adding a trailing newline at end of file for POSIX compliance.
metadata-collector/pkg/mapper/mapper_test.go (2)
300-300: Remove debug print statement.

fmt.Println("error 1") appears to be leftover debug output. Remove it to keep test output clean.
🧹 Proposed fix
 		case 1:
-			fmt.Println("error 1")
 			return true, nil, errors.NewConflict(
342-342: Remove debug print statement.

Same issue—fmt.Println("error 1") should be removed.
🧹 Proposed fix
 		case 1:
-			fmt.Println("error 1")
 			return true, nil, errors.NewUnauthorized("unauthorized")
data-models/pkg/model/pod_device_annotation.go (1)
17-32: Add godoc comments for exported symbols.

As per coding guidelines, exported constants, variables, and types should have documentation comments explaining their purpose.
📝 Proposed documentation
+// PodDeviceAnnotationName is the annotation key used to store GPU device allocations on pods.
 const (
 	PodDeviceAnnotationName = "dgxc.nvidia.com/devices"
 )

+// EntityTypeToResourceNames maps entity types to their corresponding Kubernetes resource names.
 var (
 	EntityTypeToResourceNames = map[string][]string{
 		"GPU_UUID": {
 			"nvidia.com/gpu",
 			"nvidia.com/pgpu",
 		},
 	}
 )

+// DeviceAnnotation represents the GPU device allocation annotation value.
+// Devices maps resource names (e.g., "nvidia.com/gpu") to lists of GPU UUIDs.
 type DeviceAnnotation struct {
 	Devices map[string][]string `json:"devices"`
 }
metadata-collector/main.go (1)
73-95: Consider running initial update immediately.

The current implementation waits for the first ticker tick (30 seconds) before performing the first annotation update. Consider triggering an immediate update on startup.
🔧 Proposed change for immediate first update
 func runMapper(ctx context.Context) error {
 	podDeviceMapper, err := mapper.NewPodDeviceMapper(ctx)
 	if err != nil {
 		return fmt.Errorf("could not create mapper: %w", err)
 	}

 	ticker := time.NewTicker(defaultPodDeviceMonitorPeriod)
 	defer ticker.Stop()

+	// Perform initial update immediately
+	if numUpdates, err := podDeviceMapper.UpdatePodDevicesAnnotations(); err != nil {
+		slog.Error("Initial pod device annotation update failed", "error", err)
+	} else {
+		slog.Info("Device mapper pod annotation updates", "podCount", numUpdates)
+	}
+
 	for {
 		select {
 		case <-ctx.Done():
 			return nil
 		case <-ticker.C:
metadata-collector/pkg/mapper/mapper.go (2)
184-191: Consider handling NotFound errors to continue processing remaining pods.

If a pod is deleted between ListPods() and Patch(), the patch will fail with a NotFound error. Since isRetryableError doesn't handle NotFound, this causes an early return, skipping remaining pods. While the next reconciliation iteration will retry, gracefully skipping deleted pods would improve resilience.
♻️ Proposed fix
 		if len(patchBytes) > 0 {
 			err = mapper.patchPodWithRetry(pod.GetName(), pod.GetNamespace(), patchType, patchBytes)
 			if err != nil {
+				if errors.IsNotFound(err) {
+					slog.Debug("Pod no longer exists, skipping annotation update", "podKey", podKey)
+					continue
+				}
 				return numPodDevicesAnnotationsModified, fmt.Errorf("error patching device annotation: %w", err)
 			}

 			numPodDevicesAnnotationsModified++
 		}
Note: The wrapped error from patchPodWithRetry uses %w, so errors.IsNotFound() will correctly unwrap and detect the underlying error.

197-214: Include namespace in log and error messages for easier debugging.

Adding the namespace helps identify the exact pod when troubleshooting issues across multiple namespaces.
♻️ Proposed improvement
 func (mapper *podDeviceMapper) patchPodWithRetry(podName string, namespace string, patchType types.PatchType,
 	patch []byte) error {
 	return retry.OnError(retry.DefaultRetry, isRetryableError, func() error {
 		_, err := mapper.kubernetesClient.CoreV1().Pods(namespace).Patch(mapper.ctx, podName, patchType, patch,
 			metav1.PatchOptions{})
 		if err != nil && isRetryableError(err) {
 			slog.Warn("Retryable error patching pod annotation. Retrying...",
 				"pod", podName,
+				"namespace", namespace,
 				"error", err)
 		}

 		if err != nil {
-			return fmt.Errorf("failed to patch pod %s: %w", podName, err)
+			return fmt.Errorf("failed to patch pod %s/%s: %w", namespace, podName, err)
 		}

 		return nil
 	})
 }

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c645fe and 473923e.

⛔ Files ignored due to path filters (1)

metadata-collector/go.sum is excluded by !**/*.sum

📒 Files selected for processing (16)

data-models/pkg/model/pod_device_annotation.go
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
docs/configuration/metadata-collector.md
docs/metadata-collector.md
metadata-collector/go.mod
metadata-collector/main.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/pkg/mapper/mapper_test.go

💤 Files with no reviewable changes (1)

distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml

🚧 Files skipped from review as they are similar to previous changes (2)

docs/metadata-collector.md
docs/configuration/metadata-collector.md

🧰 Additional context used

📓 Path-based instructions (5)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

data-models/pkg/model/pod_device_annotation.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/main.go
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper_test.go

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper_test.go

**/go.mod

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

metadata-collector/go.mod

**/daemonset*.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Explain DaemonSet variant selection logic in Helm chart documentation

Files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

distros/kubernetes/**/*daemonset*.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes using nodeAffinity based on kata.enabled label
Regular node DaemonSets should use /var/log volume mount for file-based logs
Kata node DaemonSets should use /run/log/journal and /var/log/journal volume mounts for systemd journal

Files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

🧠 Learnings (20)

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

data-models/pkg/model/pod_device_annotation.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Include comprehensive comments for all fields in Protocol Buffer messages

Applied to files:

data-models/pkg/model/pod_device_annotation.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

data-models/pkg/model/pod_device_annotation.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/grpcclient.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

metadata-collector/pkg/mapper/httpsclient.go

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

metadata-collector/main.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/go.mod

📚 Learning: 2026-01-06T21:31:36.113Z

Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:36.113Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-12-22T16:48:13.460Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-12-22T16:16:31.660Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Kata node DaemonSets should use `/run/log/journal` and `/var/log/journal` volume mounts for systemd journal

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Regular node DaemonSets should use `/var/log` volume mount for file-based logs

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-06T10:59:26.687Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 281
File: distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml:20-0
Timestamp: 2025-11-06T10:59:26.687Z
Learning: The pause container image registry.k8s.io/pause:3.9+ requires eBPF and cgroup v2 support (kernel 4.15+, CONFIG_BPF, CONFIG_CGROUP_BPF, /sys/fs/bpf mounted). Older pause:3.0 images from gcr.io/google-containers/pause work on systems without full BPF/cgroup v2 support. When suggesting pause image updates, verify the target environment has necessary kernel features to avoid BPF_PROG_ATTACH errors.

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

🧬 Code graph analysis (6)

metadata-collector/pkg/mapper/grpcclient_test.go (1)

data-models/pkg/model/pod_device_annotation.go (1)

DeviceAnnotation (30-32)

metadata-collector/pkg/mapper/mapper.go (3)

metadata-collector/pkg/mapper/httpsclient.go (2)

KubeletHTTPSClient (39-41)

NewKubeletHTTPSClient (54-70)

metadata-collector/pkg/mapper/grpcclient.go (2)

KubeletGRPClient (37-39)

NewKubeletGRPClient (48-75)

data-models/pkg/model/pod_device_annotation.go (1)

PodDeviceAnnotationName (18-18)

metadata-collector/pkg/mapper/grpcclient.go (1)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

EntityTypeToResourceNames (22-27)

metadata-collector/main.go (1)

metadata-collector/pkg/mapper/mapper.go (1)

NewPodDeviceMapper (50-77)

metadata-collector/pkg/mapper/httpsclient_test.go (1)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

PodDeviceAnnotationName (18-18)

metadata-collector/pkg/mapper/mapper_test.go (2)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

PodDeviceAnnotationName (18-18)

platform-connectors/pkg/pipeline/factory.go (1)

Create (32-39)

🪛 ast-grep (0.40.4)

metadata-collector/pkg/mapper/httpsclient.go

[warning] 55-57: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{
InsecureSkipVerify: true, //nolint:gosec
}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)

🪛 YAMLlint (1.37.1)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

🔇 Additional comments (25)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (2)

38-38: LGTM: ServiceAccount binding.

Correctly references the ServiceAccount created by the new serviceaccount.yaml template using the same fullname helper.

45-71: Changes from initContainers to containers with privileged access.

The container now runs as a long-lived process (matching the new two-phase lifecycle in main.go) with:

runAsGroup: 0 and privileged: true required for accessing kubelet pod-resources socket

New pod-gpu-resources volume mount for the PodResourcesLister gRPC endpoint

The privileged requirement is justified given the need to access /var/lib/kubelet/pod-resources.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml (1)

28-33: LGTM: nodes/proxy get permission.

The nodes/proxy GET permission is required to access the Kubelet's HTTPS /pods endpoint for listing node-local pods without control-plane API server load. This aligns with the PR objective of using local Kubelet endpoints.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml (1)

15-28: LGTM: ClusterRoleBinding correctly wires RBAC.

The binding correctly references the ClusterRole and ServiceAccount using consistent fullname templating. The namespace is properly set to {{ .Release.Namespace }}.

The static analysis hint is a false positive—YAMLlint doesn't understand Helm templating. Consider adding a trailing newline for POSIX compliance.

metadata-collector/pkg/mapper/mapper_test.go (3)

17-33: Testify usage note.

The tests use testify/assert, which is appropriate here given the moderate complexity and number of assertions. Based on learnings, avoid testify only for simple equality checks where standard testing package suffices—these tests have enough assertions to benefit from the cleaner syntax.

89-112: LGTM: Test helper setup.

The newTestDeviceMapper helper correctly assembles mocks and creates test pods in the fake clientset. Error handling is properly implemented with context-wrapped errors.

355-409: LGTM: Multi-pod test coverage.

Good coverage for multiple pods with different device allocations. The test validates that each pod receives the correct annotation independently.

metadata-collector/main.go (2)

79-80: LGTM: Ticker cleanup.

Good - defer ticker.Stop() properly prevents the resource leak that was flagged in a previous review.

97-140: LGTM: Collector phase.

The runCollector function properly encapsulates the NVML initialization, metadata collection, and file writing with appropriate error handling and logging. The deferred NVML shutdown ensures cleanup even on errors.

metadata-collector/pkg/mapper/httpsclient_test.go (7)

32-451: LGTM: Comprehensive test fixture.

The podJson constant provides realistic Kubernetes pod data including annotations, which appropriately exercises the ListPods parsing and annotation extraction logic.

453-466: LGTM: Mock implementation is appropriate.

The MockHTTPRoundTripper correctly implements http.RoundTripper for testing purposes using testify/mock.

468-480: LGTM: Test helper is well-designed.

The newTestKubeletHTTPSClient helper cleanly sets up the client with mocked HTTP transport for unit testing.

482-502: LGTM: Test validates core functionality.

TestListPods correctly verifies pod list parsing and device annotation extraction with proper assertion argument order.

504-532: LGTM: Error handling tests are thorough.

The error path tests appropriately cover token loading, request creation, network failures, and invalid response codes.

534-553: LGTM: Retry logic is properly tested.

TestListPodsWithRetry correctly validates that transient failures are retried and eventual success is achieved.

555-559: LGTM: JSON unmarshal error is tested.

TestListPodsWithUnmarshalError appropriately verifies error handling for malformed JSON responses.

metadata-collector/pkg/mapper/grpcclient_test.go (4)

29-96: LGTM: Test fixtures cover complex scenarios.

The podDevices fixture appropriately tests edge cases including duplicate allocations, cross-container sharing, and unsupported device filtering.

98-149: LGTM: Mock server implementation is well-structured.

The MockPodResourcesServer cleanly simulates gRPC responses with configurable error injection for testing retry logic.

151-157: LGTM: Test helper correctly wires mock.

The newTestKubeletGRPCClient helper appropriately constructs the client with the mocked gRPC server.

182-193: LGTM: Error and retry tests validate behavior.

Both error tests appropriately verify error propagation and retry logic with configurable failure counts.

metadata-collector/pkg/mapper/httpsclient.go (1)

72-151: LGTM: Excellent documentation and implementation.

The comprehensive comment explaining TLS and authentication considerations is exemplary. The ListPods implementation correctly handles token rotation, request construction, and response processing.

metadata-collector/pkg/mapper/grpcclient.go (1)

173-187: LGTM: Helper functions are correct.

The isSupportedResourceName function correctly filters resources against the entity-resource mappings, and the Close method properly cleans up the gRPC connection.

metadata-collector/pkg/mapper/mapper.go (3)

33-36: LGTM!

The JSON Patch path correctly escapes the / as ~1 per RFC 6901, and the merge patch template properly uses %q to produce valid JSON with escaped string values.

42-48: LGTM!

The struct properly encapsulates dependencies as interfaces, enabling testability and following dependency injection patterns.

216-233: LGTM!

The retryable error detection covers the appropriate transient failures: conflicts, timeouts, rate limiting, and service unavailability. The inline comments clarify the intent for each category.

metadata-collector/go.mod

metadata-collector/pkg/mapper/grpcclient_test.go

metadata-collector/pkg/mapper/grpcclient.go

metadata-collector/pkg/mapper/httpsclient.go

metadata-collector/pkg/mapper/mapper.go

github-actions · 2026-01-07T23:01:34Z

Merging this branch will increase overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/nvidia/nvsentinel/data-models/pkg/model	0.00% (ø)
github.com/nvidia/nvsentinel/metadata-collector	0.00% (ø)
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper	46.88% (+46.88%)	🌟

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/nvidia/nvsentinel/data-models/pkg/model/pod_device_annotation.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/metadata-collector/main.go	0.00% (ø)	170 (+88)	0	170 (+88)
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/grpcclient.go	47.37% (+47.37%)	133 (+133)	63 (+63)	70 (+70)	🌟
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/httpsclient.go	50.51% (+50.51%)	99 (+99)	50 (+50)	49 (+49)	🌟
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/mapper.go	44.78% (+44.78%)	201 (+201)	90 (+90)	111 (+111)	🌟

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/grpcclient_test.go
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/httpsclient_test.go
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/mapper_test.go

github-actions · 2026-01-07T23:11:46Z

🛡️ CodeQL Analysis

🚨 Found 1 security alert(s)

🔗 View details

metadata-collector/pkg/mapper/grpcclient.go

metadata-collector/pkg/mapper/mapper.go

…h pod Signed-off-by: Nathan Herz <[email protected]>

coderabbitai

Actionable comments posted: 12

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🤖 Fix all issues with AI agents

In @data-models/pkg/model/pod_device_annotation.go:
- Line 15: Add a package-level godoc comment describing the responsibility and
purpose of the package, placed immediately above the "package model" declaration
in pod_device_annotation.go; the comment should briefly explain what the model
package contains (e.g., data model types and annotations used across the
project, their intended usage and any important invariants) so that go doc tools
and readers understand the package at a glance.
- Around line 30-32: Add a clear Go doc comment for the exported struct
DeviceAnnotation explaining its purpose and the shape of its Devices field;
describe that DeviceAnnotation represents pod device annotations and that
Devices is a map from device plugin names (string) to a list of device IDs
(slice of strings), so tools and godoc users understand its semantics and
expected JSON encoding.
- Around line 21-28: Add a Go doc comment for the exported variable
EntityTypeToResourceNames that succinctly describes its purpose, how keys and
values are used (e.g., mapping entity types like "GPU_UUID" to Kubernetes
resource names such as "nvidia.com/gpu" and "nvidia.com/pgpu"), and any
important usage details or invariants (e.g., consumers should treat it as
read-only). Place the comment immediately above the variable declaration so it
is included in generated documentation and linters do not flag the missing
comment.
- Around line 17-19: Add a doc comment above the exported constant
PodDeviceAnnotationName describing that it is the Kubernetes annotation key used
to record/list GPU/device assignments for a Pod (value:
"dgxc.nvidia.com/devices"), when and by which components it is set/consumed, and
the expected format of the annotation value; place this comment immediately
above the PodDeviceAnnotationName declaration to satisfy exported-symbol
documentation guidelines.

In @metadata-collector/pkg/mapper/grpcclient_test.go:
- Around line 111-117: Add a Go doc comment for the exported constructor
NewMockPodResourcesServer that briefly describes what the function creates,
documents each parameter (err, errReturnCount, podDevices) and the returned
*MockPodResourcesServer value, and follows Go doc comment conventions (starts
with the function name and is complete sentence). Ensure the comment explains
the meaning of err and errReturnCount and the structure/expected contents of
podDevices so callers understand usage and semantics.
- Around line 98-105: Add a doc comment above the exported
MockPodResourcesServer type describing its purpose as a test/mock implementation
of the PodResourcesListerClient, and briefly document its fields: err (simulated
error to return), errReturnCount (how many times to return the error), and
podDevices (mapping pod -> container -> slice of deviceInfo). Mention that it
embeds v1.PodResourcesListerClient to avoid implementing all interface methods
and that the type is intended for unit tests.

In @metadata-collector/pkg/mapper/grpcclient.go:
- Around line 37-39: The interface KubeletGRPClient must declare a Close() error
method so callers can release the gRPC connection; add Close() to the
KubeletGRPClient interface and implement it on kubeletGRPClient to call the
existing connection cleanup (e.g., invoke the struct’s closeConn() or
c.conn.Close() and return any error). Then update any callers that obtain a
KubeletGRPClient to call/ defer client.Close() to avoid leaking the gRPC
connection.
- Line 54: The code calls resolver.SetDefaultScheme("passthrough"), which
mutates global gRPC resolver state; remove that global call
(resolver.SetDefaultScheme) and instead specify the passthrough scheme directly
on the dial target used by your gRPC client (e.g., prefix the target passed to
grpc.Dial/grpc.DialContext with the explicit "passthrough:///" scheme or
otherwise construct the target string with the passthrough scheme in the
function that creates the client connection). Ensure no other code relies on
resolver.SetDefaultScheme so the change confines behavior to this client only.
- Around line 119-126: The build fails because retryAllErrors is undefined;
define a predicate function named retryAllErrors(error) bool and use it in the
retry.OnError call (or replace the reference with an existing predicate) so the
code compiles; specifically, add a function accessible in this package (e.g.,
retryAllErrors that returns true if you intend to retry every error, or
implement a more specific predicate that checks error types/messages) and keep
the retry.OnError block that calls client.podResourcesClient.List(client.ctx,
&v1.ListPodResourcesRequest{}) unchanged.

In @metadata-collector/pkg/mapper/httpsclient_test.go:
- Around line 453-456: Add a doc comment for the exported mock type
MockHTTPRoundTripper describing that it is a test-only mock implementation of
http.RoundTripper used to stub HTTP requests in unit tests; place the comment
immediately above the type declaration for MockHTTPRoundTripper so it satisfies
exported-type documentation guidelines and briefly mention that it embeds
http.RoundTripper and mock.Mock to allow custom response behavior in tests.

In @metadata-collector/pkg/mapper/httpsclient.go:
- Around line 39-41: Add a doc comment for the exported interface
KubeletHTTPSClient that succinctly explains its purpose and usage (e.g., that it
provides HTTPS access to the kubelet API to list pods and is used by the
metadata collector). Place the comment immediately above the type declaration
for KubeletHTTPSClient and keep it in sentence form starting with the interface
name to satisfy Go documentation conventions.
- Around line 54-70: The exported constructor NewKubeletHTTPSClient is missing a
GoDoc comment; add a descriptive comment immediately above the function that
explains what the constructor does, its parameter (ctx context.Context), the
return values (KubeletHTTPSClient, error), and key behavior (creates an HTTP
transport with TLS InsecureSkipVerify, timeout settings, and initializes
bearerTokenPath and listPodsURI). Keep the comment concise and follow GoDoc
style (complete sentence starting with NewKubeletHTTPSClient).

🟡 Minor comments (9)

docs/metadata-collector.md-104-105 (1)
104-105: Expand the Kubernetes API subsection with annotation details.

The current documentation is too brief. Add the annotation key, JSON structure, and explain how external components consume it. The annotation dgxc.nvidia.com/devices exposes pod-to-GPU mapping as a JSON object with device resource names (e.g., nvidia.com/gpu, nvidia.com/pgpu) mapped to lists of GPU UUIDs. External components like the node-drainer module query the Kubernetes API to retrieve this annotation and determine which pods use specific GPUs.
📝 Proposed expansion for Kubernetes API subsection
### Kubernetes API
The pod-to-GPU mapping is exposed as the `dgxc.nvidia.com/devices` annotation on pod objects.
External components (e.g., node-drainer module) can query the Kubernetes API to retrieve this annotation
and determine which pods are using specific GPUs. The annotation value is a JSON object mapping device
resource names to UUID lists:

```json
{"devices": {"nvidia.com/gpu": ["GPU-455d8f70-2051-db6c-0430-ffc457bff834"]}}
</details>

</blockquote></details>
<details>
<summary>docs/metadata-collector.md-71-72 (1)</summary><blockquote>

`71-72`: **Add annotation key and JSON format example to Pod Information subsection.**

The "Pod Information" subsection lacks crucial details for developers. Include the annotation key and JSON structure so users understand how to consume the pod-to-GPU mapping data.

<details>
<summary>📝 Proposed enhancement for Pod Information subsection</summary>

```markdown
### Pod Information
- GPU UUIDs allocated to each pod
- Exposed as annotation `dgxc.nvidia.com/devices` on pod objects in JSON format:
  ```json
  {"devices": {"nvidia.com/gpu": ["GPU-<UUID>", ...]}}
</details>

</blockquote></details>
<details>
<summary>docs/metadata-collector.md-36-41 (1)</summary><blockquote>

`36-41`: **Document the annotation key, format, and reconciliation interval.**

The workflow steps are clear, but the section is missing important implementation details that users and operators need:

1. **Annotation key**: `dgxc.nvidia.com/devices`
2. **Annotation format**: `{"devices": {"nvidia.com/gpu": ["GPU-<UUID>", "GPU-<UUID>"], "nvidia.com/pgpu": [...]}}` (supports both GPU and pGPU resource names)
3. **Reconciliation interval**: The metadata collector runs on a fixed 30-second reconciliation interval to keep pod-to-GPU mappings current
4. **Kubelet endpoints**: These are local kubelet endpoints—Unix socket for the PodResourcesLister gRPC service and HTTPS for the /pods endpoint—rather than control-plane API calls, which reduces control-plane load

<details>
<summary>📝 Documentation enhancement example</summary>

Consider expanding the workflow section with:

```markdown
### GPU-to-pod mapping annotation

1. To discover all pods running on the given node, this component calls the local Kubelet HTTPS `/pods` endpoint.
2. To discover the GPU devices allocated to each pod, this component leverages the local Kubelet PodResourcesLister gRPC service (via Unix socket).
3. For each pod with a change in GPU device allocation, the component updates the `dgxc.nvidia.com/devices` annotation with a JSON mapping:
   ```json
   {"devices": {"nvidia.com/gpu": ["GPU-<UUID>", "GPU-<UUID>"]}}
where the resource name can be nvidia.com/gpu or nvidia.com/pgpu.
4. The metadata collector runs this reconciliation in a loop at a fixed 30-second interval to keep mappings current.

Note: Using local Kubelet endpoints reduces control-plane load compared to querying the Kubernetes API for every pod.
</details>

</blockquote></details>
<details>
<summary>metadata-collector/pkg/mapper/mapper_test.go-300-300 (1)</summary><blockquote>

`300-300`: **Remove debugging print statements.**

Lines 300 and 342 contain `fmt.Println` calls that should be removed from production test code.



<details>
<summary>🧹 Proposed fix</summary>

```diff
 		case 1:
-			fmt.Println("error 1")
 			return true, nil, errors.NewConflict(
And:
 		case 1:
-			fmt.Println("error 1")
 			return true, nil, errors.NewUnauthorized("unauthorized")
Also applies to: 342-342
metadata-collector/pkg/mapper/grpcclient.go-15-15 (1)
15-15: Add package-level documentation.

The package is missing package-level godoc. Per coding guidelines, all Go packages require package-level documentation.
📝 Suggested package documentation
+// Package mapper provides components for discovering and annotating pods with their allocated GPU devices.
+// It includes clients for the Kubelet HTTPS and gRPC endpoints to gather pod and device allocation data.
 package mapper
Based on coding guidelines requiring package-level godoc for all Go packages.
metadata-collector/pkg/mapper/grpcclient.go-48-75 (1)
48-75: Add godoc for exported constructor.

The exported function NewKubeletGRPClient is missing documentation. Per coding guidelines, all exported Go functions require godoc comments.
📝 Suggested documentation
+// NewKubeletGRPClient creates a new gRPC client for the Kubelet pod-resources endpoint.
+// It connects to the local Unix socket at /var/lib/kubelet/pod-resources/kubelet.sock.
+// Returns an error if the socket doesn't exist or the connection fails.
 func NewKubeletGRPClient(ctx context.Context) (KubeletGRPClient, error) {
Based on coding guidelines requiring function comments for all exported Go functions.
metadata-collector/pkg/mapper/mapper.go-39-41 (1)
39-41: Add godoc for exported interface.

The exported interface PodDeviceMapper is missing documentation. Per coding guidelines, all exported entities require godoc comments.
📝 Suggested documentation
+// PodDeviceMapper manages the synchronization of GPU device allocations to pod annotations.
+// It discovers device assignments via Kubelet endpoints and updates pod annotations accordingly.
 type PodDeviceMapper interface {
 	UpdatePodDevicesAnnotations() (int, error)
 }
Based on coding guidelines requiring function comments for all exported Go functions.
metadata-collector/pkg/mapper/grpcclient.go-37-39 (1)
37-39: Add godoc for exported interface.

The exported interface KubeletGRPClient is missing documentation. Per coding guidelines, all exported entities require godoc comments.
📝 Suggested documentation
+// KubeletGRPClient provides access to the Kubelet's pod resources via gRPC.
+// It connects to the local Unix socket to query device allocations for pods.
 type KubeletGRPClient interface {
 	ListPodResources() (map[string]*model.DeviceAnnotation, error)
 }
Based on coding guidelines requiring function comments for all exported Go functions.
metadata-collector/pkg/mapper/mapper.go-51-78 (1)
51-78: Add godoc for exported constructor.

The exported function NewPodDeviceMapper is missing documentation. Per coding guidelines, all exported Go functions require godoc comments.
📝 Suggested documentation
+// NewPodDeviceMapper creates a new PodDeviceMapper that synchronizes GPU device allocations
+// to pod annotations. It initializes clients for Kubelet HTTPS, Kubelet gRPC, and the
+// Kubernetes API server. Returns an error if any client initialization fails.
 func NewPodDeviceMapper(ctx context.Context) (PodDeviceMapper, error) {
Based on coding guidelines requiring function comments for all exported Go functions.

🧹 Nitpick comments (6)

metadata-collector/pkg/mapper/mapper_test.go (1)

114-409: Consider consolidating into table-driven tests.

The current test functions each test a single scenario. Per coding guidelines, table-driven tests are preferred when testing multiple scenarios. Consider refactoring the UpdatePodDevicesAnnotations tests into a table-driven format to reduce duplication and improve maintainability.

For example, tests like TestUpdatePodDevicesAnnotationsWithAnnotationAdded, WithAnnotationUpdated, WithAnnotationNotChanged, and WithAnnotationRemoved could be consolidated into a single table-driven test with cases for each scenario.

Based on coding guidelines, table-driven tests should be used when testing multiple scenarios in Go.
metadata-collector/pkg/mapper/httpsclient.go (2)
120-131: Consider more selective retry logic.

The retryAllErrors function retries on all errors unconditionally. Some errors like authentication failures (401/403) or malformed requests (400) are non-transient and should not be retried.

Consider implementing more selective retry logic that only retries on network errors and 5xx status codes.
♻️ Proposed selective retry logic
-	err = retry.OnError(retry.DefaultRetry, retryAllErrors, func() error {
+	err = retry.OnError(retry.DefaultRetry, isRetryableError, func() error {
 		resp, err = client.httpRoundTripper.RoundTrip(req) //nolint:bodyclose // response body is closed outside OnError
 		if err != nil {
 			return fmt.Errorf("got an error making HTTP request to /pods endpoint: %w", err)
 		}
 
 		if resp.StatusCode != http.StatusOK {
 			return fmt.Errorf("got a non-200 response code from /pods endpoint: %d", resp.StatusCode)
 		}
 
 		return nil
 	})
And replace retryAllErrors with:
-func retryAllErrors(_ error) bool {
-	return true
+func isRetryableError(err error) bool {
+	// Retry on network errors or server errors (5xx)
+	// Don't retry on client errors (4xx) like auth failures
+	return true // For now, keep current behavior but add TODO
+	// TODO: Implement selective retry for transient errors only
 }
55-62: Consider setting TLS MinVersion for security hardening.

The TLS configuration lacks a MinVersion setting. While the connection is to localhost and the security model is well-justified in the comments, explicitly setting MinVersion: tls.VersionTLS12 would follow security best practices and prevent any issues with TLS version negotiation.
🔒 Proposed TLS configuration enhancement
 	transport := &http.Transport{
 		TLSClientConfig: &tls.Config{
 			InsecureSkipVerify: true, //nolint:gosec
+			MinVersion:         tls.VersionTLS12,
 		},
 		TLSHandshakeTimeout:   30 * time.Second,
metadata-collector/pkg/mapper/grpcclient.go (1)
177-187: Consider optimizing resource name lookup.

The nested loop approach works but could be more efficient. With the current small set of resource names (2), the performance impact is negligible, but this could be optimized to O(1) lookup.
♻️ Optional optimization using a flat set

Initialize a package-level set once:
+var supportedResourceNames map[string]bool
+
+func init() {
+	supportedResourceNames = make(map[string]bool)
+	for _, resourceNames := range model.EntityTypeToResourceNames {
+		for _, resourceName := range resourceNames {
+			supportedResourceNames[resourceName] = true
+		}
+	}
+}
+
 func isSupportedResourceName(resourceName string) bool {
-	for _, resourceNamesForEntityType := range model.EntityTypeToResourceNames {
-		for _, currentResourceName := range resourceNamesForEntityType {
-			if resourceName == currentResourceName {
-				return true
-			}
-		}
-	}
-
-	return false
+	return supportedResourceNames[resourceName]
 }
metadata-collector/pkg/mapper/mapper.go (2)
167-189: Remove redundant variable declaration.

Line 168 declares deviceAnnotationJSON but it's immediately reassigned on line 170 using :=. The declaration on line 168 is unnecessary.
♻️ Remove redundant declaration
 	if podHasGPUDevices {
-		var deviceAnnotationJSON []byte
-
 		deviceAnnotationJSON, err := json.Marshal(deviceAnnotation)
 		if err != nil {
 			return "", nil, fmt.Errorf("error marshalling device annotation: %w", err)
 		}
215-219: Redundant error check.

Line 215 checks isRetryableError(err) again, but retry.OnError already uses isRetryableError as its predicate. This check is redundant and the warning will only be logged for errors that will be retried anyway.
♻️ Simplify error handling
 	return retry.OnError(retry.DefaultRetry, isRetryableError, func() error {
 		_, err := mapper.kubernetesClient.CoreV1().Pods(namespace).Patch(mapper.ctx, podName, patchType, patch,
 			metav1.PatchOptions{})
-		if err != nil && isRetryableError(err) {
-			slog.Warn("Retryable error patching pod annotation. Retrying...",
-				"pod", podName,
-				"error", err)
-		}
 
 		if err != nil {
 			return fmt.Errorf("failed to patch pod %s: %w", podName, err)
 		}
If you want to keep the logging, you can add it after the retry loop completes or rely on structured logging elsewhere.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 473923e and b73d463.

⛔ Files ignored due to path filters (1)

metadata-collector/go.sum is excluded by !**/*.sum

📒 Files selected for processing (16)

data-models/pkg/model/pod_device_annotation.go
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
docs/configuration/metadata-collector.md
docs/metadata-collector.md
metadata-collector/go.mod
metadata-collector/main.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/pkg/mapper/mapper_test.go

💤 Files with no reviewable changes (1)

distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml

🚧 Files skipped from review as they are similar to previous changes (1)

docs/configuration/metadata-collector.md

🧰 Additional context used

📓 Path-based instructions (5)

**/*.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

data-models/pkg/model/pod_device_annotation.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/main.go
metadata-collector/pkg/mapper/httpsclient_test.go

**/*_test.go

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient_test.go

**/go.mod

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

metadata-collector/go.mod

**/daemonset*.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Explain DaemonSet variant selection logic in Helm chart documentation

Files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

distros/kubernetes/**/*daemonset*.yaml

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes using nodeAffinity based on kata.enabled label
Regular node DaemonSets should use /var/log volume mount for file-based logs
Kata node DaemonSets should use /run/log/journal and /var/log/journal volume mounts for systemd journal

Files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

🧠 Learnings (20)

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages

Applied to files:

data-models/pkg/model/pod_device_annotation.go
metadata-collector/pkg/mapper/mapper.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

data-models/pkg/model/pod_device_annotation.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/mapper.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

metadata-collector/pkg/mapper/httpsclient.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

metadata-collector/pkg/mapper/mapper_test.go

📚 Learning: 2025-12-22T16:16:24.320Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.

Applied to files:

metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient_test.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/go.mod
metadata-collector/pkg/mapper/httpsclient_test.go

📚 Learning: 2026-01-06T21:31:36.113Z

Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:36.113Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-12-22T16:48:13.460Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-12-22T16:16:31.660Z

Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

metadata-collector/go.mod

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

metadata-collector/main.go

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Kata node DaemonSets should use `/run/log/journal` and `/var/log/journal` volume mounts for systemd journal

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-24T22:20:48.152Z

Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Regular node DaemonSets should use `/var/log` volume mount for file-based logs

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml

📚 Learning: 2025-11-06T10:59:26.687Z

Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 281
File: distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml:20-0
Timestamp: 2025-11-06T10:59:26.687Z
Learning: The pause container image registry.k8s.io/pause:3.9+ requires eBPF and cgroup v2 support (kernel 4.15+, CONFIG_BPF, CONFIG_CGROUP_BPF, /sys/fs/bpf mounted). Older pause:3.0 images from gcr.io/google-containers/pause work on systems without full BPF/cgroup v2 support. When suggesting pause image updates, verify the target environment has necessary kernel features to avoid BPF_PROG_ATTACH errors.

Applied to files:

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
docs/metadata-collector.md

🧬 Code graph analysis (6)

metadata-collector/pkg/mapper/grpcclient.go (1)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

EntityTypeToResourceNames (22-27)

metadata-collector/pkg/mapper/mapper_test.go (1)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

PodDeviceAnnotationName (18-18)

metadata-collector/pkg/mapper/grpcclient_test.go (1)

data-models/pkg/model/pod_device_annotation.go (1)

DeviceAnnotation (30-32)

metadata-collector/pkg/mapper/mapper.go (3)

metadata-collector/pkg/mapper/httpsclient.go (2)

KubeletHTTPSClient (39-41)

NewKubeletHTTPSClient (54-70)

metadata-collector/pkg/mapper/grpcclient.go (2)

KubeletGRPClient (37-39)

NewKubeletGRPClient (48-75)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

PodDeviceAnnotationName (18-18)

metadata-collector/main.go (1)

metadata-collector/pkg/mapper/mapper.go (1)

NewPodDeviceMapper (51-78)

metadata-collector/pkg/mapper/httpsclient_test.go (1)

data-models/pkg/model/pod_device_annotation.go (2)

DeviceAnnotation (30-32)

PodDeviceAnnotationName (18-18)

🪛 ast-grep (0.40.4)

metadata-collector/pkg/mapper/httpsclient.go

[warning] 55-57: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{
InsecureSkipVerify: true, //nolint:gosec
}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)

🪛 YAMLlint (1.37.1)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml

[error] 20-20: syntax error: expected the node content, but found '-'

(syntax)

🔇 Additional comments (25)

metadata-collector/go.mod (4)

12-16: Well-aligned Kubernetes and gRPC dependencies for PodResourcesLister feature.

The addition of direct dependencies on k8s.io/api, k8s.io/apimachinery, k8s.io/client-go, and k8s.io/kubelet (all v0.35.0) correctly supports the PR objective of querying Kubelet endpoints for pod-to-device mappings via gRPC. Versions are consistent, recent, and appropriate for semantic versioning. This aligns with repository learnings on using client-go for Kubernetes interactions.

47-48: Verify non-standard YAML import paths.

Lines 47–48 introduce go.yaml.in/yaml/v2 and go.yaml.in/yaml/v3 as indirect dependencies. This is an unusual import path; standard Kubernetes/Go projects typically use gopkg.in/yaml.v2 and gopkg.in/yaml.v3. Verify whether these are intentional transitive pulls from k8s.io modules or reflect a dependency resolution artifact that could be simplified.

27-37: Granular go-openapi/swag submodule breakdown warrants verification.

Lines 27–37 list fine-grained go-openapi/swag submodules (cmdutils, conv, fileutils, jsonname, jsonutils, loading, mangling, netutils, stringutils, typeutils, yamlutils) rather than a single github.com/go-openapi/swag entry. This granular structure suggests either a recent upstream change or a side effect of go mod tidy. Verify these submodule dependencies are intentional and necessary; if not, consider a single parent module import to simplify maintenance.

1-67: go.mod and go.sum are consistent; no dependency conflicts or issues detected.

The verification confirms that go mod verify passes successfully and there are no duplicate or conflicting dependency versions. The Kubernetes module dependencies and their transitive indirect dependencies (including YAML packages and granular go-openapi/swag imports) are properly tracked in go.sum. Semantic versioning is correctly applied, with local module dependencies properly managed via the replace directive.

docs/metadata-collector.md (2)

9-9: Clear and well-integrated overview of the new feature.

The addition nicely contextualizes the pod-to-GPU mapping annotation exposure as a complement to local topology persistence.

20-22: Good addition of the node-drainer use case context.

This clarifies the motivation for the pod-to-GPU mapping and aligns well with the PR objectives.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (3)

38-38: LGTM: ServiceAccount binding added.

The ServiceAccount reference enables the RBAC permissions defined in the ClusterRole.

48-50: Verify privileged access scope and security implications.

The container now runs with privileged: true and mounts /var/lib/kubelet/pod-resources to access the Kubelet PodResourcesLister gRPC endpoint. While necessary for discovering GPU device allocations, this grants broad host access.

Ensure the privileged mode is required for your specific use case and that alternative approaches (e.g., using only specific capabilities) were considered. The privileged flag bypasses most container isolation mechanisms.

Also applies to: 69-71, 81-84

45-101: Verify kata node handling and log volume mount requirements.

Based on coding guidelines, DaemonSets in this path should include:

Separate variants for kata vs regular nodes using nodeAffinity based on the kata.enabled label

/var/log volume mount for regular nodes (file-based logs)

/run/log/journal and /var/log/journal volume mounts for kata nodes (systemd journal)

Documentation explaining DaemonSet variant selection logic

The current DaemonSet does not implement these separations. Confirm whether kata node support is required for the metadata-collector component.

Based on learnings, separate DaemonSets should be created for kata vs regular nodes using nodeAffinity, and log volume mounts should differ between the two variants.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml (1)

21-33: LGTM: Appropriate RBAC permissions.

The ClusterRole grants minimal permissions required for the metadata-collector:

patch on pods to add/update/remove device annotations

get on nodes/proxy to access kubelet HTTPS endpoints

This follows the principle of least privilege.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml (1)

21-28: LGTM: ClusterRoleBinding correctly configured.

The binding properly connects the ClusterRole to the ServiceAccount in the release namespace, completing the RBAC setup.

metadata-collector/pkg/mapper/mapper_test.go (2)

17-33: LGTM: Appropriate test dependencies.

The imports include testify/assert for complex test assertions, which is acceptable given the scenarios involve mocks, multiple assertions, and detailed validation of pod annotations and retry behavior.

35-112: Well-structured test infrastructure.

The mock implementations and test helper function provide clean, reusable test setup. The newTestDeviceMapper properly initializes all dependencies with mocks and prepopulates the fake clientset.

metadata-collector/pkg/mapper/httpsclient_test.go (2)

482-502: Excellent test coverage for the happy path.

The test comprehensively validates pod parsing and device annotation extraction. The assertion structure clearly verifies both the pod count and the correct decoding of device annotations.

534-553: Well-designed retry test with mock expectations.

The test effectively validates retry behavior by configuring the mock to fail once then succeed, confirming that the retry logic works as expected.

metadata-collector/main.go (2)

73-95: Consider non-fatal error handling for transient mapper failures.

The mapper loop exits immediately on any error from UpdatePodDevicesAnnotations(), causing the entire pod to terminate. This means transient failures (network issues, temporary API unavailability) will restart the container.

Consider whether errors should be logged and the loop should continue, reserving exit only for unrecoverable errors. This would align with the periodic scraping pattern described in the PR objectives.

Do you want the pod to restart on any mapper error, or should transient errors be logged while continuing the periodic updates?

56-71: Well-structured two-phase lifecycle.

The separation of metadata collection (one-time) and pod device mapping (continuous) is clear and follows good practices for error handling and logging.

metadata-collector/pkg/mapper/grpcclient_test.go (2)

159-180: Comprehensive test validates device aggregation and deduplication.

The test effectively validates that:

Devices are correctly aggregated per pod across containers

Duplicate device IDs are deduplicated

Unsupported resource types are filtered out

The expected device mappings match the consolidated view

188-193: Good retry behavior validation.

The test confirms that a single retry attempt succeeds, validating the retry logic handles transient errors correctly.

metadata-collector/pkg/mapper/httpsclient.go (2)

72-96: Excellent security justification documentation.

The extensive comment clearly explains the rationale for InsecureSkipVerify and the authentication/authorization model. This level of detail is valuable for future maintainers and security auditors.

97-108: Well-reasoned token handling for rotation support.

The decision to read the token file on every request correctly handles service account token rotation. The comment clearly explains the rationale.

metadata-collector/pkg/mapper/grpcclient.go (1)

159-175: LGTM: Efficient in-place deduplication.

The implementation correctly sorts device lists and removes duplicates using an efficient two-pointer algorithm. This ensures deterministic ordering for downstream comparisons.

metadata-collector/pkg/mapper/mapper.go (3)

123-153: LGTM: Clear implementation with good observability.

The method correctly lists pods and devices, compares them, and applies patches as needed. The counter return value provides useful observability into how many annotations were modified. Early return on error is appropriate since the operation will be retried in the next reconciliation cycle.

182-188: Excellent explanation of patch type choice.

The comment clearly explains why MergePatchType is necessary for add/update operations instead of JSONPatchType. This is valuable context for future maintainers.

229-246: LGTM: Comprehensive retry predicate.

The function correctly identifies common retryable Kubernetes API errors including conflicts, timeouts, rate limiting, and service unavailability. The error checks use the standard errors.Is* helpers which properly unwrap error chains.

data-models/pkg/model/pod_device_annotation.go

metadata-collector/pkg/mapper/grpcclient_test.go

metadata-collector/pkg/mapper/grpcclient.go

metadata-collector/pkg/mapper/httpsclient_test.go

metadata-collector/pkg/mapper/httpsclient.go

github-actions · 2026-01-08T17:56:04Z

Merging this branch will increase overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/nvidia/nvsentinel/data-models/pkg/model	0.00% (ø)
github.com/nvidia/nvsentinel/metadata-collector	0.00% (ø)
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper	46.34% (+46.34%)	🌟

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/nvidia/nvsentinel/data-models/pkg/model/pod_device_annotation.go	0.00% (ø)	0	0	0
github.com/nvidia/nvsentinel/metadata-collector/main.go	0.00% (ø)	170 (+88)	0	170 (+88)
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/grpcclient.go	47.45% (+47.45%)	137 (+137)	65 (+65)	72 (+72)	🌟
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/httpsclient.go	50.51% (+50.51%)	99 (+99)	50 (+50)	49 (+49)	🌟
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/mapper.go	43.72% (+43.72%)	215 (+215)	94 (+94)	121 (+121)	🌟

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/grpcclient_test.go
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/httpsclient_test.go
github.com/nvidia/nvsentinel/metadata-collector/pkg/mapper/mapper_test.go

github-actions · 2026-01-08T18:00:56Z

🛡️ CodeQL Analysis

🚨 Found 1 security alert(s)

🔗 View details

github-actions · 2026-01-08T19:12:23Z

🛡️ CodeQL Analysis

🚨 Found 1 security alert(s)

🔗 View details

natherz97 changed the title ~~feature: update metadata-collector to track the GPUs allocated to each pod~~ feature: update metadata-collector to track the pod GPU device allocation Jan 6, 2026

natherz97 changed the title ~~feature: update metadata-collector to track the pod GPU device allocation~~ feature: update metadata-collector to track pod GPU device allocation Jan 6, 2026

coderabbitai bot reviewed Jan 6, 2026

View reviewed changes

metadata-collector/pkg/mapper/grpcclient.go Show resolved Hide resolved

metadata-collector/pkg/mapper/mapper_test.go Show resolved Hide resolved

metadata-collector/pkg/mapper/mapper_test.go Show resolved Hide resolved

metadata-collector/pkg/mapper/mapper.go Outdated Show resolved Hide resolved

natherz97 force-pushed the metadata-collector-annotation branch from 3705a73 to 14257ec Compare January 6, 2026 23:57

coderabbitai bot reviewed Jan 7, 2026

View reviewed changes

metadata-collector/pkg/mapper/httpsclient.go Show resolved Hide resolved

natherz97 force-pushed the metadata-collector-annotation branch from 14257ec to 171432f Compare January 7, 2026 22:19

coderabbitai bot reviewed Jan 7, 2026

View reviewed changes

docs/configuration/metadata-collector.md Outdated Show resolved Hide resolved

docs/metadata-collector.md Outdated Show resolved Hide resolved

docs/metadata-collector.md Outdated Show resolved Hide resolved

metadata-collector/main.go Show resolved Hide resolved

natherz97 force-pushed the metadata-collector-annotation branch from 171432f to 0c645fe Compare January 7, 2026 22:42

coderabbitai bot reviewed Jan 7, 2026

View reviewed changes

natherz97 force-pushed the metadata-collector-annotation branch from 0c645fe to 473923e Compare January 7, 2026 22:51

coderabbitai bot reviewed Jan 7, 2026

View reviewed changes

XRFXLP reviewed Jan 8, 2026

View reviewed changes

metadata-collector/pkg/mapper/grpcclient.go Outdated Show resolved Hide resolved

metadata-collector/pkg/mapper/mapper.go Outdated Show resolved Hide resolved

feature: update metadata-collector to track the GPUs allocated to eac…

b73d463

…h pod Signed-off-by: Nathan Herz <[email protected]>

natherz97 force-pushed the metadata-collector-annotation branch from 473923e to b73d463 Compare January 8, 2026 17:43

coderabbitai bot reviewed Jan 8, 2026

View reviewed changes

XRFXLP approved these changes Jan 8, 2026

View reviewed changes

Merge branch 'main' into metadata-collector-annotation

2affec8

lalitadithya enabled auto-merge (squash) January 8, 2026 18:54

lalitadithya merged commit 4038245 into NVIDIA:main Jan 8, 2026
54 checks passed

feature: update metadata-collector to track pod GPU device allocation #663

feature: update metadata-collector to track pod GPU device allocation #663

Uh oh!

Conversation

natherz97 commented Jan 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

Type of Change

Component(s) Affected

Testing

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Jan 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Jan 6, 2026

Merging this branch will increase overall coverage

Changed files (no unit tests)

Changed unit test files

Uh oh!

github-actions bot commented Jan 6, 2026

🛡️ CodeQL Analysis

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions bot commented Jan 7, 2026

🛡️ CodeQL Analysis

Uh oh!

github-actions bot commented Jan 7, 2026

Merging this branch will increase overall coverage

Changed files (no unit tests)

Changed unit test files

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Jan 7, 2026

Merging this branch will increase overall coverage

Changed files (no unit tests)

Changed unit test files

Uh oh!

github-actions bot commented Jan 7, 2026

🛡️ CodeQL Analysis

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

natherz97 commented Jan 6, 2026 •

edited

Loading

coderabbitai bot commented Jan 6, 2026 •

edited

Loading