feature: update metadata-collector to track pod GPU device allocation #663
Conversation
📝 Walkthrough

Adds a pod device annotation data model, kubelet gRPC and HTTPS clients, a PodDeviceMapper that periodically patches pods with GPU allocations, Helm RBAC/ServiceAccount/DaemonSet templates, refactors metadata-collector into separate collector and mapper phases, and includes tests and docs updates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Ticker as Ticker (30s)
    participant Mapper as PodDeviceMapper
    participant HTTPS as Kubelet HTTPS Client
    participant gRPC as Kubelet gRPC Client
    participant Kubelet as Kubelet
    participant K8sAPI as Kubernetes API

    Ticker ->> Mapper: UpdatePodDevicesAnnotations()
    Mapper ->> HTTPS: ListPods()
    HTTPS ->> Kubelet: HTTPS GET /pods (token)
    Kubelet -->> HTTPS: PodList (JSON)
    HTTPS -->> Mapper: []Pod
    Mapper ->> gRPC: ListPodResources()
    gRPC ->> Kubelet: gRPC over UDS ListPodResources
    Kubelet -->> gRPC: PodResources
    gRPC -->> Mapper: map[pod] DeviceAnnotation
    Mapper ->> Mapper: compute desired annotations
    loop per pod with change
        Mapper ->> K8sAPI: Patch pod annotation (MergePatch / JSONPatch)
        alt Success
            K8sAPI -->> Mapper: 200 OK
        else Retryable error
            K8sAPI -->> Mapper: Error
            Mapper ->> K8sAPI: Retry Patch
            K8sAPI -->> Mapper: 200 OK
        end
    end
    Mapper -->> Ticker: return update count
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
⚠️ Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

```
level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"
```
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (1)
45-50: Separate DaemonSets for kata and regular nodes, and verify privileged mode necessity.

The current single DaemonSet violates the coding guideline requiring separate DaemonSets for kata vs regular nodes using `nodeAffinity` based on the `kata.enabled` label. The `syslog-health-monitor` chart provides the correct pattern: separate `daemonset-regular.yaml` and `daemonset-kata.yaml` files with different volume mounts per variant. Implement kata vs regular node separation using this pattern instead of the current `nodeSelector` approach.

Additionally, the `privileged: true` security context poses unnecessary risk. The `syslog-health-monitor` reference implementation accesses similar kubelet resources using granular Linux capabilities (`SYSLOG`, `SYS_ADMIN`) instead of privileged mode. Verify whether pod-resources socket access truly requires full privileged mode or if granular capabilities suffice, and replace `privileged: true` with a minimal capability set if possible.
🤖 Fix all issues with AI Agents
In @metadata-collector/pkg/mapper/grpcclient.go:
- Around line 152-166: The deduplication loop in the devicesPerPod handling can
panic when a resource's devices slice is empty; update the inner loop over
deviceAnnotation.Devices (keyed by resourceName) to skip slices of length 0 or 1
by adding a guard like `if len(devices) <= 1 { continue }` before running the
in-place dedupe that uses j and assigns `devices[:j+1]`, so empty or
single-element slices are left unchanged and the slice expression cannot panic.
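A minimal standalone sketch of the guarded dedupe this instruction describes; the helper name `dedupeSorted` is hypothetical, and the slice is assumed to be pre-sorted as in the PR's flow:

```go
package mapper

// dedupeSorted removes adjacent duplicates from a sorted slice in place.
// The len <= 1 guard leaves empty and single-element slices unchanged,
// so the devices[:j+1] re-slice can never panic.
func dedupeSorted(devices []string) []string {
	if len(devices) <= 1 {
		return devices
	}

	j := 0
	for i := 1; i < len(devices); i++ {
		if devices[i] != devices[j] {
			j++
			devices[j] = devices[i]
		}
	}

	return devices[:j+1]
}
```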
In @metadata-collector/pkg/mapper/httpsclient.go:
- Around line 55-62: The TLS config on the http.Transport sets
InsecureSkipVerify but omits a minimum TLS version; update the tls.Config used
in transport (the tls.Config literal in the http.Transport assigned to variable
transport) to include MinVersion set to a safe constant (e.g., tls.VersionTLS12
or tls.VersionTLS13) so the client refuses older protocol versions while
retaining InsecureSkipVerify for localhost use.
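A sketch of the suggested `tls.Config` hardening; the constructor name `newKubeletTransport` is illustrative, not the PR's actual helper:

```go
package mapper

import (
	"crypto/tls"
	"net/http"
)

// newKubeletTransport keeps InsecureSkipVerify for the localhost kubelet
// (whose serving cert lacks a localhost SAN) while refusing any protocol
// version below TLS 1.2.
func newKubeletTransport() *http.Transport {
	return &http.Transport{
		TLSClientConfig: &tls.Config{
			InsecureSkipVerify: true, //nolint:gosec // localhost kubelet only
			MinVersion:         tls.VersionTLS12,
		},
	}
}
```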
In @metadata-collector/pkg/mapper/mapper_test.go:
- Around line 327-332: The call to newTestDeviceMapper assigns an error to err
that is never checked; update the test to verify this by adding an assertion
(e.g., require.NoError or assert.NoError) immediately after calling
newTestDeviceMapper (before using mapper), referencing the newTestDeviceMapper
result so the test fails fast if setup failed.
- Around line 286-291: The call to newTestDeviceMapper in mapper_test.go may
return an error that is currently ignored; after "mapper, err :=
newTestDeviceMapper(testCase, []*corev1.Pod{existingTestPod})" add an explicit
test failure on error (e.g., assert.NoError(t, err) or t.Fatalf with the error)
before using mapper so the test stops instead of continuing with a nil mapper.
In @metadata-collector/pkg/mapper/mapper.go:
- Around line 49-76: Add a godoc comment for the exported constructor
NewPodDeviceMapper explaining its purpose and parameters; fix the typo gpcClient
-> grpcClient (references: NewKubeletGRPClient -> NewKubeletGRPCClient and
variable grpcClient) and update all uses (kubeletGRPCClient); wrap returned
errors from NewKubeletHTTPSClient, rest.InClusterConfig, NewKubeletGRPCClient,
and kubernetes.NewForConfig with contextual messages (e.g., using
fmt.Errorf("failed to ...: %w", err)) so callers get actionable context while
preserving the original error.
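A sketch of the wrapping pattern this instruction asks for, reduced to the two client-go calls whose signatures are public; the struct and its field are placeholders, not the PR's actual types:

```go
package mapper

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// mapperDeps is a placeholder holder for the mapper's dependencies.
type mapperDeps struct {
	kubernetesClient kubernetes.Interface
}

// newMapperDeps wraps each failure with what was being constructed;
// %w preserves the original error for errors.Is/errors.As callers.
func newMapperDeps() (*mapperDeps, error) {
	config, err := rest.InClusterConfig()
	if err != nil {
		return nil, fmt.Errorf("failed to load in-cluster config: %w", err)
	}

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		return nil, fmt.Errorf("failed to create kubernetes clientset: %w", err)
	}

	return &mapperDeps{kubernetesClient: clientset}, nil
}
```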
- Around line 193-209: In patchPodWithRetry, the log and error messages
incorrectly reference "node" instead of "pod"; update the slog.Warn call and the
returned fmt.Errorf message inside the patchPodWithRetry function to use "pod"
(and keep podName variable) so messages correctly say "patching pod annotation"
and "failed to patch pod <name>: %w" respectively; ensure you only change the
strings in slog.Warn and fmt.Errorf and not the variable names or logic.
🧹 Nitpick comments (9)
metadata-collector/pkg/mapper/mapper_test.go (2)
127-127: Incorrect argument order in `assert.Equal`.

The `assert.Equal` function expects arguments in the order `(expected, actual)`, but here and throughout the file, the order is reversed as `(actual, expected)`. This won't affect test correctness but will produce confusing error messages on failure.

🔎 Proposed fix

```diff
-assert.Equal(t, numUpdates, 0)
+assert.Equal(t, 0, numUpdates)
```

Similar fixes should be applied to other `assert.Equal` calls in this file (lines 143, 163, 167, 179, 192, 196, 219, 223, 246, 250, 267, 271, 312, 350, 398, 402, 406).
17-33: Consider using standard library assertions for simpler checks.

Based on learnings, this repository prefers avoiding testify for simple equality/inequality checks. While testify is acceptable for complex scenarios requiring richer diagnostics, consider using standard testing package assertions (e.g., `t.Errorf`) for straightforward checks to maintain consistency across the codebase.

metadata-collector/main.go (1)
79-93: Ticker resource leak and delayed first update.

Two issues:

- The ticker is never stopped, causing a minor resource leak when context is cancelled.
- The first pod device annotation update only happens after 30 seconds. Consider performing an immediate update before entering the ticker loop.

🔎 Proposed fix

```diff
 ticker := time.NewTicker(defaultPodDeviceMonitorPeriod)
+defer ticker.Stop()
+
+// Perform an immediate update before entering the periodic loop
+numUpdates, err := podDeviceMapper.UpdatePodDevicesAnnotations()
+if err != nil {
+    return fmt.Errorf("could not update mapper pod devices annotations: %w", err)
+}
+slog.Info("Device mapper pod annotation updates", "podCount", numUpdates)

 for {
     select {
     case <-ctx.Done():
         return nil
     case <-ticker.C:
```

data-models/pkg/model/pod_device_annotation.go (1)
15-32: LGTM - Clean data model definition.

The model definitions are well-structured. The `EntityTypeToResourceNames` being a `var` rather than `const` is a Go necessity for map types. Consider adding a package-level godoc comment to document the purpose of this file, as per coding guidelines.

Add a package-level comment:

```go
// Package model provides data structures for GPU device annotations on Kubernetes pods.
package model
```

metadata-collector/pkg/mapper/httpsclient.go (1)
153-155: Consider limiting retries for authentication errors.

The `retryAllErrors` function retries all errors including HTTP 401/403, which typically indicate persistent authorization issues that won't resolve with retries. While the impact is bounded by `DefaultRetry` limits, avoiding unnecessary retries for auth errors would improve failure diagnosis.

🔎 Proposed fix (if you want to distinguish retriable errors)

```go
func isRetriableHTTPError(err error) bool {
    // Don't retry if we got a response - the error is in the status code
    // and would need to be extracted from the error message or a custom error type
    return true // For now, keep existing behavior
}
```

Alternatively, consider using a custom error type that carries the status code for more granular retry decisions, as sketched below.
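One way the custom-error-type idea could look; `httpStatusError` and the retry predicate below are hypothetical, not code from the PR:

```go
package mapper

import (
	"errors"
	"fmt"
	"net/http"
)

// httpStatusError carries the kubelet response status so retry logic
// can distinguish permanent auth failures from transient ones.
type httpStatusError struct {
	status int
}

func (e *httpStatusError) Error() string {
	return fmt.Sprintf("kubelet returned HTTP %d", e.status)
}

// isRetriableHTTPError treats 401/403 as permanent and retries
// everything else (network errors, 5xx responses).
func isRetriableHTTPError(err error) bool {
	var statusErr *httpStatusError
	if errors.As(err, &statusErr) {
		return statusErr.status != http.StatusUnauthorized &&
			statusErr.status != http.StatusForbidden
	}

	return true
}
```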
metadata-collector/pkg/mapper/grpcclient.go (2)
54-54: Global state modification with `resolver.SetDefaultScheme`.

`resolver.SetDefaultScheme("passthrough")` modifies global gRPC resolver state, which could affect other gRPC clients in the same process. Consider using the scheme prefix directly in the target address instead:

🔎 Proposed fix

```diff
-resolver.SetDefaultScheme("passthrough")
-
 connection, err := grpc.NewClient(
-    podResourcesKubeletSocket,
+    "passthrough:///"+podResourcesKubeletSocket,
     grpc.WithTransportCredentials(insecure.NewCredentials()),
```

This avoids global state mutation and makes the intent explicit per connection.
183-185: Consider logging or returning error from `Close()`.

The `Close()` method ignores the error returned by `connection.Close()`. While this is common for cleanup methods, logging the error could aid debugging connection issues.

🔎 Option 1: Log the error

```diff
 func (client *kubeletGRPClient) Close() {
-    client.connection.Close()
+    if err := client.connection.Close(); err != nil {
+        slog.Warn("Failed to close gRPC connection", "error", err)
+    }
 }
```

🔎 Option 2: Return the error

```diff
-func (client *kubeletGRPClient) Close() {
-    client.connection.Close()
+func (client *kubeletGRPClient) Close() error {
+    return client.connection.Close()
 }
```

Note: This would require updating the interface definition.
metadata-collector/pkg/mapper/mapper.go (2)
15-15: Missing package-level godoc comment.

As per coding guidelines, package-level godoc is required for all Go packages. Add a doc comment before the package declaration.

🔎 Proposed fix

```diff
+// Package mapper provides functionality for annotating Kubernetes pods with
+// their allocated GPU device information by querying Kubelet endpoints.
 package mapper
```
37-39: Missing godoc comment for exported interface.

As per coding guidelines, exported interfaces should have documentation comments explaining their purpose and method contracts.

🔎 Proposed fix

```diff
+// PodDeviceMapper defines the interface for updating pod annotations
+// with allocated GPU device information.
 type PodDeviceMapper interface {
+    // UpdatePodDevicesAnnotations discovers pods and their GPU allocations,
+    // then patches the dgxc.nvidia.com/devices annotation accordingly.
+    // Returns the number of pods modified and any error encountered.
     UpdatePodDevicesAnnotations() (int, error)
 }
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`metadata-collector/go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (14)
- data-models/pkg/model/pod_device_annotation.go
- distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
- distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
- distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
- distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
- distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
- metadata-collector/go.mod
- metadata-collector/main.go
- metadata-collector/pkg/mapper/grpcclient.go
- metadata-collector/pkg/mapper/grpcclient_test.go
- metadata-collector/pkg/mapper/httpsclient.go
- metadata-collector/pkg/mapper/httpsclient_test.go
- metadata-collector/pkg/mapper/mapper.go
- metadata-collector/pkg/mapper/mapper_test.go
💤 Files with no reviewable changes (1)
- distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
🧰 Additional context used
📓 Path-based instructions (5)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
- Follow standard Go conventions with gofmt and golint
- Use structured logging via `log/slog` in Go code
- Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
- Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
- Use meaningful variable names such as `synced` over `ok` for cache sync checks
- Use `client-go` for Kubernetes API interactions in Go code
- Prefer informers over direct API calls for watching Kubernetes resources
- Implement proper shutdown handling with context cancellation in Go code
- Package-level godoc required for all Go packages
- Function comments required for all exported Go functions
- Use inline comments for complex logic only in Go code
- TODO comments should reference issues in Go code
- Extract informer event handler setup into helper methods
- Use separate informers for different Kubernetes resource types
Files:
- metadata-collector/pkg/mapper/mapper_test.go
- metadata-collector/pkg/mapper/grpcclient_test.go
- metadata-collector/pkg/mapper/grpcclient.go
- metadata-collector/pkg/mapper/httpsclient_test.go
- metadata-collector/pkg/mapper/mapper.go
- data-models/pkg/model/pod_device_annotation.go
- metadata-collector/pkg/mapper/httpsclient.go
- metadata-collector/main.go
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
- Use `envtest` for testing Kubernetes controllers instead of fake clients
- Use `testify/assert` and `testify/require` for assertions in Go tests
- Write table-driven tests when testing multiple scenarios in Go
- Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
- metadata-collector/pkg/mapper/mapper_test.go
- metadata-collector/pkg/mapper/grpcclient_test.go
- metadata-collector/pkg/mapper/httpsclient_test.go
**/go.mod
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use `go.mod` for each service as a separate Go module with semantic import versioning
Files:
metadata-collector/go.mod
**/daemonset*.yaml
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Explain DaemonSet variant selection logic in Helm chart documentation
Files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
distros/kubernetes/**/*daemonset*.yaml
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
- Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on the `kata.enabled` label
- Regular node DaemonSets should use a `/var/log` volume mount for file-based logs
- Kata node DaemonSets should use `/run/log/journal` and `/var/log/journal` volume mounts for the systemd journal
Files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
🧠 Learnings (14)
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
- metadata-collector/pkg/mapper/mapper_test.go
- metadata-collector/pkg/mapper/grpcclient_test.go
- metadata-collector/pkg/mapper/httpsclient_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
- metadata-collector/pkg/mapper/grpcclient.go
- metadata-collector/pkg/mapper/httpsclient.go
- metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
- metadata-collector/pkg/mapper/httpsclient_test.go
- metadata-collector/go.mod
📚 Learning: 2026-01-06T21:31:31.115Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:31.115Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Kata node DaemonSets should use `/run/log/journal` and `/var/log/journal` volume mounts for systemd journal
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-06T10:59:26.687Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 281
File: distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml:20-0
Timestamp: 2025-11-06T10:59:26.687Z
Learning: The pause container image registry.k8s.io/pause:3.9+ requires eBPF and cgroup v2 support (kernel 4.15+, CONFIG_BPF, CONFIG_CGROUP_BPF, /sys/fs/bpf mounted). Older pause:3.0 images from gcr.io/google-containers/pause work on systems without full BPF/cgroup v2 support. When suggesting pause image updates, verify the target environment has necessary kernel features to avoid BPF_PROG_ATTACH errors.
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
🧬 Code graph analysis (6)
metadata-collector/pkg/mapper/mapper_test.go (2)
- data-models/pkg/model/pod_device_annotation.go (2): `DeviceAnnotation` (30-32), `PodDeviceAnnotationName` (18-18)
- platform-connectors/pkg/pipeline/factory.go (1): `Create` (32-39)

metadata-collector/pkg/mapper/grpcclient_test.go (1)
- data-models/pkg/model/pod_device_annotation.go (1): `DeviceAnnotation` (30-32)

metadata-collector/pkg/mapper/grpcclient.go (1)
- data-models/pkg/model/pod_device_annotation.go (2): `DeviceAnnotation` (30-32), `EntityTypeToResourceNames` (22-27)

metadata-collector/pkg/mapper/httpsclient_test.go (1)
- data-models/pkg/model/pod_device_annotation.go (2): `DeviceAnnotation` (30-32), `PodDeviceAnnotationName` (18-18)

metadata-collector/pkg/mapper/mapper.go (3)
- metadata-collector/pkg/mapper/httpsclient.go (2): `KubeletHTTPSClient` (39-41), `NewKubeletHTTPSClient` (54-70)
- metadata-collector/pkg/mapper/grpcclient.go (2): `KubeletGRPClient` (37-39), `NewKubeletGRPClient` (48-75)
- data-models/pkg/model/pod_device_annotation.go (1): `PodDeviceAnnotationName` (18-18)

metadata-collector/main.go (1)
- metadata-collector/pkg/mapper/mapper.go (1): `NewPodDeviceMapper` (49-76)
🪛 ast-grep (0.40.3)
metadata-collector/pkg/mapper/httpsclient.go
[warning] 55-57: `MinVersion` is missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported and insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. Add `MinVersion: tls.VersionTLS13` to the TLS configuration to bump the minimum version to TLS 1.3.

Context:

```go
tls.Config{
    InsecureSkipVerify: true, //nolint:gosec
}
```

Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm. [OWASP A03:2017]: Sensitive Data Exposure. [OWASP A02:2021]: Cryptographic Failures. References: https://owasp.org/Top10/A02_2021-Cryptographic_Failures (missing-ssl-minversion-go)
🪛 YAMLlint (1.37.1)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
🔇 Additional comments (19)
metadata-collector/go.mod (3)
7-17: Direct dependencies align with PR objectives.

The addition of google.golang.org/grpc for PodResourcesLister communication and k8s.io packages (api, apimachinery, client-go, kubelet) for Kubernetes interactions are appropriate and follow the coding guideline to use client-go for Kubernetes API interactions. All Kubernetes dependencies are consistently versioned at v0.35.0, which is good practice.
27-37: This dependency structure is expected and standard. When Go modules use semantic import versioning and explicitly import subpackages (as shown by the metadata-collector/pkg/mapper imports), those subpackages appear as separate entries in go.mod. The parent module's internal dependencies on its subpackages further explains why they're listed. No action needed.
12-16: No action required. Go 1.25.0 is compatible with k8s.io/client-go v0.35.0, which requires Go 1.23 as a minimum. The subpackages listed for go-openapi/swag in the indirect dependencies are valid Go module behavior when transitive dependencies import from submodules. The dependency structure is correct.

Likely an incorrect or invalid review comment.
metadata-collector/pkg/mapper/httpsclient_test.go (2)
468-480: LGTM - Well-structured test helper.

The `newTestKubeletHTTPSClient` helper properly configures the mock and returns a testable client instance. The use of `io.NopCloser` for response body handling is appropriate for test scenarios.
534-553: Good retry behavior test.

The test properly validates retry logic by configuring the mock to return a failure on the first call and success on the second, verifying that transient failures are handled correctly.
metadata-collector/main.go (1)
86-91: Consider whether all update errors should be fatal.

Currently, any error from `UpdatePodDevicesAnnotations` causes the mapper to exit. For a DaemonSet, this triggers a restart, but frequent transient errors could cause restart loops. Consider logging errors and continuing for certain error types, or implementing a circuit breaker pattern.

metadata-collector/pkg/mapper/grpcclient_test.go (2)
159-180: LGTM - Comprehensive test for device aggregation.

The test validates that devices are properly aggregated per pod with:
- Deduplication of device IDs
- Sorting for deterministic ordering
- Filtering of unsupported resources
- Correct namespace/name key format
The expected output correctly reflects the sorted, unique device lists.
182-193: Good coverage of retry behavior.

The error tests properly validate both scenarios:

- `errReturnCount=10` exceeds the retry limit, ensuring error propagation
- `errReturnCount=1` allows the retry to succeed on the second attempt

metadata-collector/pkg/mapper/httpsclient.go (1)
72-96: Excellent security documentation.

The detailed comment thoroughly documents:

- Why `InsecureSkipVerify` is used (localhost SAN limitation)
- AuthN/AuthZ requirements and token usage
- RBAC requirements (nodes/proxy, HostNetwork)
- Equivalent CLI command for debugging
This level of documentation is exemplary for security-sensitive code.
metadata-collector/pkg/mapper/grpcclient.go (1)
77-112: Good documentation for the ListPodResources function.

The documentation clearly explains:
- The socket requirement for the DaemonSet
- Output structure with example JSON
- Behavior guarantees (unique, sorted device lists)
- Edge case handling (same device in multiple pods)
metadata-collector/pkg/mapper/mapper.go (2)
78-191: Well-documented implementation with clear logic.

The block comment thoroughly explains the design rationale, tradeoffs, and the three-step process. The implementation correctly handles annotation add, update, and remove cases with proper error wrapping.
211-228: LGTM!

The retry classification covers the appropriate transient error cases: conflicts, timeouts, rate limiting, and service unavailability.
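Those four cases map directly onto apimachinery's error predicates; a sketch, with the helper name assumed rather than taken from the PR:

```go
package mapper

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// isRetryablePatchError reports whether a failed pod patch is worth
// retrying: conflicts, timeouts, rate limiting, and unavailability.
func isRetryablePatchError(err error) bool {
	return apierrors.IsConflict(err) ||
		apierrors.IsTimeout(err) ||
		apierrors.IsServerTimeout(err) ||
		apierrors.IsTooManyRequests(err) ||
		apierrors.IsServiceUnavailable(err)
}
```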
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml (1)
15-20: LGTM! ServiceAccount correctly configured.

The ServiceAccount manifest is properly structured with appropriate Helm template placeholders for name and labels.
Note: The YAMLlint syntax error on line 20 is a false positive—YAMLlint doesn't recognize Helm template syntax.
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml (2)
15-20: LGTM! ClusterRole structure is correct.

The ClusterRole is properly defined with appropriate Helm template placeholders.
Note: The YAMLlint syntax error on line 20 is a false positive—YAMLlint doesn't recognize Helm template syntax.
21-33: The `get` verb on `nodes/proxy` appears to be unnecessary and is not actually used by the code.

The implementation accesses the Kubelet directly at `https://localhost:10250/pods` rather than through the Kubernetes API proxy (`/api/v1/nodes/{node}/proxy/pods`). The Kubelet validates the request using its own TokenReview and SubjectAccessReview mechanisms, not RBAC permissions on the `nodes/proxy` resource.

The only RBAC permission actually required is `patch` on `pods` for the pod annotation updates at mapper.go:195. There is no `get` or `list` operation on pods via the Kubernetes API server. Remove the `nodes/proxy` rule from the ClusterRole unless this is a planned future enhancement.

Likely an incorrect or invalid review comment.
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml (1)
15-28: LGTM! ClusterRoleBinding correctly configured.

The ClusterRoleBinding properly binds the ClusterRole to the ServiceAccount with correct name references and namespace handling.
Note: The YAMLlint syntax error on line 20 is a false positive—YAMLlint doesn't recognize Helm template syntax.
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (3)
38-38: LGTM! ServiceAccount correctly referenced.

The DaemonSet now properly references the ServiceAccount, enabling RBAC-controlled access to the Kubernetes API.
69-71: LGTM! Pod resources volume correctly configured.

The volume mount for `/var/lib/kubelet/pod-resources` is correctly configured to access the kubelet's PodResourcesLister gRPC socket, which is essential for discovering pod-to-device mappings as described in the PR objectives.

Also applies to: 81-84
85-101: Consider kata-specific configuration and update documentation.

Based on learnings, separate DaemonSets should be created for kata vs. regular nodes using `nodeAffinity` based on the `kata.enabled` label. Additionally, DaemonSet variant selection logic should be explained in Helm chart documentation.

Since this metadata-collector appears to interact with the kubelet and host resources, verify whether:
- It requires separate kata-specific configuration (e.g., different volume mounts for systemd journal)
- It should be excluded from kata nodes or configured differently
Based on learnings, as per coding guidelines for DaemonSet configuration and documentation.
Merging this branch will increase overall coverage
Coverage by file: Changed files (no unit tests)

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files
🛡️ CodeQL Analysis: 🚨 Found 1 security alert(s)
Force-pushed from 3705a73 to 14257ec (Compare)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (1)
85-101: Create separate DaemonSets for kata and regular nodes with appropriate log volume mounts.

The coding guidelines require DaemonSets in distros/kubernetes/**/*daemonset*.yaml to create separate variants for kata vs regular nodes. The syslog-health-monitor chart in this repository demonstrates the correct pattern: separate `daemonset-kata.yaml` and `daemonset-regular.yaml` files with a shared template that configures different volume mounts based on node type.

metadata-collector should follow the same pattern:

- Create separate DaemonSet variants for kata and regular nodes
- Use nodeSelector with the `nvsentinel.dgxc.nvidia.com/kata.enabled` label to target the appropriate node type
- Regular nodes: Mount `/var/log` for file-based logs
- Kata nodes: Mount `/run/systemd/journal` and `/var/log/journal` for systemd journal access

Refer to the syslog-health-monitor templates for the implementation pattern.
🤖 Fix all issues with AI Agents
In @data-models/pkg/model/pod_device_annotation.go:
- Line 15: Add a package-level godoc comment for the package declaration in
pod_device_annotation.go: write a one- or two-sentence comment immediately above
"package model" that succinctly describes the purpose of the model package (what
domain objects it contains and its role), following Go godoc style (starts with
"Package model ...") so the package complies with the coding guideline requiring
package-level documentation.
In @metadata-collector/main.go:
- Around line 79-93: The ticker created as ticker :=
time.NewTicker(defaultPodDeviceMonitorPeriod) is never stopped and
UpdatePodDevicesAnnotations errors currently abort the loop; fix by deferring
ticker.Stop() or stopping the ticker when ctx.Done() is observed and ensure the
select exit path calls ticker.Stop() before returning, and change the error
handling for podDeviceMapper.UpdatePodDevicesAnnotations() so it logs the error
(use slog.Error) and continues the loop (optionally add simple retry/backoff
logic or skip until the next tick) instead of returning the error; keep
references to ticker, ctx.Done(), podDeviceMapper.UpdatePodDevicesAnnotations,
and the slog.Info call when numUpdates succeeds.
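A self-contained sketch of the loop shape this instruction describes; the `update` callback stands in for `podDeviceMapper.UpdatePodDevicesAnnotations`:

```go
package main

import (
	"context"
	"log/slog"
	"time"
)

// runMapperLoop stops the ticker on exit and logs failed updates instead
// of aborting, so one bad tick does not restart the DaemonSet pod.
func runMapperLoop(ctx context.Context, period time.Duration, update func() (int, error)) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			numUpdates, err := update()
			if err != nil {
				slog.Error("could not update pod devices annotations", "error", err)
				continue // skip until the next tick
			}

			slog.Info("Device mapper pod annotation updates", "podCount", numUpdates)
		}
	}
}
```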
In @metadata-collector/pkg/mapper/grpcclient.go:
- Around line 37-39: Add a godoc comment above the exported interface
KubeletGRPClient that briefly describes its purpose (e.g., a gRPC client
abstraction for querying kubelet pod device/resource annotations) and include a
short sentence describing the ListPodResources method and its return values (map
of pod/device annotations and error). Place the comment immediately above the
type declaration and use the `KubeletGRPClient` identifier as the first word to
satisfy godoc conventions.
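A sketch of the requested godoc shape; `DeviceAnnotation` is redeclared here as a placeholder for the data-models type so the snippet stands alone:

```go
package mapper

// DeviceAnnotation is a placeholder mirroring the data-models type:
// device IDs grouped by resource name (e.g., "nvidia.com/gpu").
type DeviceAnnotation struct {
	Devices map[string][]string
}

// KubeletGRPClient is a gRPC client abstraction for querying the kubelet
// PodResourcesLister endpoint over its unix domain socket.
type KubeletGRPClient interface {
	// ListPodResources returns the allocated device annotation for each
	// pod, keyed by "namespace/name", or an error if the query fails.
	ListPodResources() (map[string]DeviceAnnotation, error)
}
```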
In @metadata-collector/pkg/mapper/httpsclient.go:
- Around line 54-62: In NewKubeletHTTPSClient, the TLSClientConfig currently
only sets InsecureSkipVerify; add a minimum TLS version to harden against
downgrades by setting TLSClientConfig.MinVersion = tls.VersionTLS12 (or
tls.VersionTLS13 if you require TLS 1.3) within the tls.Config passed to the
http.Transport so the kubelet client enforces at least TLS 1.2 while keeping
InsecureSkipVerify for localhost use.
🧹 Nitpick comments (8)
data-models/pkg/model/pod_device_annotation.go (1)
21-28: Public variable is mutable and could lead to unexpected behavior.

`EntityTypeToResourceNames` is a public variable holding a map, which means external code can modify it. This could cause race conditions or unexpected behavior if multiple components access it concurrently or if callers accidentally mutate it.

🔎 Recommended approaches

Option 1: Make it a function that returns a new map (immutable)

```diff
-var (
-    EntityTypeToResourceNames = map[string][]string{
+func GetEntityTypeToResourceNames() map[string][]string {
+    return map[string][]string{
         "GPU_UUID": {
             "nvidia.com/gpu",
             "nvidia.com/pgpu",
         },
     }
-)
+}
```

Option 2: Keep as variable but document it should not be modified

```diff
 var (
+    // EntityTypeToResourceNames maps entity types to their corresponding Kubernetes resource names.
+    // This variable should be treated as read-only; modifying it may cause unexpected behavior.
     EntityTypeToResourceNames = map[string][]string{
```

metadata-collector/pkg/mapper/grpcclient.go (3)
154-168: Deduplication logic can be made safer and clearer.

The past review comment correctly identified that `devices[:j+1]` at line 166 could panic if the slice is empty. While the current code flow (line 145 checks `len(device.GetDeviceIds()) > 0`) should prevent empty slices, the logic is fragile and non-obvious.

The deduplication algorithm also has unnecessary complexity for the common case of single-element or already-unique slices.

🔎 Proposed fix for safety and clarity

```diff
 // ensure that we have a consistent ordering of devices
 for _, deviceAnnotation := range devicesPerPod {
     for resourceName, devices := range deviceAnnotation.Devices {
+        if len(devices) == 0 {
+            delete(deviceAnnotation.Devices, resourceName)
+            continue
+        }
+        sort.Strings(devices)
+
+        if len(devices) == 1 {
+            continue
+        }
         j := 0
         for i := 1; i < len(devices); i++ {
             if devices[i] != devices[j] {
                 j++
                 devices[j] = devices[i]
             }
         }
         deviceAnnotation.Devices[resourceName] = devices[:j+1]
     }
 }
```
54-54: Global state modification may affect other gRPC clients.

`resolver.SetDefaultScheme("passthrough")` modifies global gRPC state, which could impact other gRPC clients in the same process. This is acceptable if metadata-collector is the only component using gRPC, but worth documenting or considering per-connection configuration.

Consider adding a comment explaining this is safe in the metadata-collector context, or investigate if the passthrough scheme can be specified per-connection rather than globally.
185-187: Close method should follow best practices for cleanup.

The `Close` method should return an error to allow callers to handle connection closure failures. Additionally, it should check if the connection is nil before attempting to close it.

🔎 Proposed fix

```diff
-func (client *kubeletGRPClient) Close() {
-    client.connection.Close()
+func (client *kubeletGRPClient) Close() error {
+    if client.connection != nil {
+        return client.connection.Close()
+    }
+    return nil
 }
```

Note: This will require updating the interface definition to include `Close() error`.

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml (1)
1-28: LGTM! The ClusterRoleBinding structure is correct.

The ClusterRoleBinding properly references the ClusterRole and ServiceAccount using Helm templating. The static analysis warning about line 20 is a false positive: the `{{-` syntax is valid Helm template notation.

Minor: Add newline at end of file

Per POSIX conventions and common YAML style guides, add a trailing newline:

```diff
 - kind: ServiceAccount
   name: {{ include "metadata-collector.fullname" . }}
   namespace: {{ .Release.Namespace }}
+
```

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml (1)
22-33: LGTM! ClusterRole permissions correctly scoped for the feature.

The two policy rules are appropriately minimal:

- `patch` on `pods` enables annotation updates with GPU device mappings
- `get` on `nodes/proxy` allows accessing the kubelet endpoints on localhost

These align with the PR objectives for discovering and annotating pod GPU allocations.

Minor: Add newline at end of file

Per POSIX conventions and common YAML style guides, add a trailing newline:

```diff
   resources:
     - nodes/proxy
   verbs:
     - get
+
```

metadata-collector/pkg/mapper/mapper_test.go (2)
286-309: Refactor to eliminate duplicate clientset and pod creation.

Lines 289-292 recreate a clientset and pod that were already created in `newTestDeviceMapper` (line 286), only to replace the mapper's client at line 309. This duplication wastes setup work.

Refactoring options

Option 1: Pass `nil` for test pods to `newTestDeviceMapper` and do all setup here:

```diff
-mapper, err := newTestDeviceMapper(testCase, []*corev1.Pod{existingTestPod})
+mapper, err := newTestDeviceMapper(testCase, nil)
 assert.NoError(t, err)

 ctx := context.TODO()
 clientSet := fake.NewSimpleClientset()
 _, err = clientSet.CoreV1().Pods(existingTestPod.GetNamespace()).Create(ctx, existingTestPod, metav1.CreateOptions{})
 assert.NoError(t, err)
+mapper.kubernetesClient = clientSet

-callCount := 0
 clientSet.Fake.PrependReactor("patch", "pods", func(action ktesting.Action) (handled bool, ret runtime.Object, err error) {
     ...
 })
-mapper.kubernetesClient = clientSet
```

Option 2: Extend `newTestDeviceMapper` to accept an optional clientset parameter for tests that need custom reactors.

Apply similar refactoring to `TestUpdatePodDevicesAnnotationsWithNonRetryablePatchError` (lines 328-348).

distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (1)
15-28: Consider documenting the DaemonSet selection logic.

The coding guidelines recommend explaining DaemonSet variant selection logic in Helm chart documentation. Consider adding comments or documentation that describes:
- When this DaemonSet will be deployed (nodeSelector requirements)
- How the runtimeClassName affects deployment
- Any configuration options that control DaemonSet behavior
Based on coding guidelines learnings.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`metadata-collector/go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (14)
- data-models/pkg/model/pod_device_annotation.go
- distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
- distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
- distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
- distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
- distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
- metadata-collector/go.mod
- metadata-collector/main.go
- metadata-collector/pkg/mapper/grpcclient.go
- metadata-collector/pkg/mapper/grpcclient_test.go
- metadata-collector/pkg/mapper/httpsclient.go
- metadata-collector/pkg/mapper/httpsclient_test.go
- metadata-collector/pkg/mapper/mapper.go
- metadata-collector/pkg/mapper/mapper_test.go
💤 Files with no reviewable changes (1)
- distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- metadata-collector/pkg/mapper/grpcclient_test.go
- metadata-collector/pkg/mapper/mapper.go
🧰 Additional context used
📓 Path-based instructions (5)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
- Follow standard Go conventions with gofmt and golint
- Use structured logging via `log/slog` in Go code
- Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
- Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
- Use meaningful variable names such as `synced` over `ok` for cache sync checks
- Use `client-go` for Kubernetes API interactions in Go code
- Prefer informers over direct API calls for watching Kubernetes resources
- Implement proper shutdown handling with context cancellation in Go code
- Package-level godoc required for all Go packages
- Function comments required for all exported Go functions
- Use inline comments for complex logic only in Go code
- TODO comments should reference issues in Go code
- Extract informer event handler setup into helper methods
- Use separate informers for different Kubernetes resource types
Files:
- metadata-collector/main.go
- data-models/pkg/model/pod_device_annotation.go
- metadata-collector/pkg/mapper/grpcclient.go
- metadata-collector/pkg/mapper/httpsclient_test.go
- metadata-collector/pkg/mapper/mapper_test.go
- metadata-collector/pkg/mapper/httpsclient.go
**/go.mod
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use `go.mod` for each service as a separate Go module with semantic import versioning
Files:
metadata-collector/go.mod
**/daemonset*.yaml
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Explain DaemonSet variant selection logic in Helm chart documentation
Files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
distros/kubernetes/**/*daemonset*.yaml
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
- Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on the `kata.enabled` label
- Regular node DaemonSets should use a `/var/log` volume mount for file-based logs
- Kata node DaemonSets should use `/run/log/journal` and `/var/log/journal` volume mounts for the systemd journal
Files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
- Use `envtest` for testing Kubernetes controllers instead of fake clients
- Use `testify/assert` and `testify/require` for assertions in Go tests
- Write table-driven tests when testing multiple scenarios in Go
- Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Files:
- metadata-collector/pkg/mapper/httpsclient_test.go
- metadata-collector/pkg/mapper/mapper_test.go
🧠 Learnings (16)
📚 Learning: 2026-01-06T21:31:31.115Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:31.115Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
- metadata-collector/go.mod
- metadata-collector/pkg/mapper/grpcclient.go
- metadata-collector/pkg/mapper/httpsclient.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
- metadata-collector/go.mod
- metadata-collector/pkg/mapper/httpsclient_test.go
- metadata-collector/pkg/mapper/mapper_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Kata node DaemonSets should use `/run/log/journal` and `/var/log/journal` volume mounts for systemd journal
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Regular node DaemonSets should use `/var/log` volume mount for file-based logs
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-06T10:59:26.687Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 281
File: distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml:20-0
Timestamp: 2025-11-06T10:59:26.687Z
Learning: The pause container image registry.k8s.io/pause:3.9+ requires eBPF and cgroup v2 support (kernel 4.15+, CONFIG_BPF, CONFIG_CGROUP_BPF, /sys/fs/bpf mounted). Older pause:3.0 images from gcr.io/google-containers/pause work on systems without full BPF/cgroup v2 support. When suggesting pause image updates, verify the target environment has necessary kernel features to avoid BPF_PROG_ATTACH errors.
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
- metadata-collector/pkg/mapper/httpsclient_test.go
- metadata-collector/pkg/mapper/mapper_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
metadata-collector/pkg/mapper/mapper_test.go
🧬 Code graph analysis (4)
metadata-collector/main.go (1)
metadata-collector/pkg/mapper/mapper.go (1)
`NewPodDeviceMapper` (49-76)
metadata-collector/pkg/mapper/grpcclient.go (1)
data-models/pkg/model/pod_device_annotation.go (2)
`DeviceAnnotation` (30-32), `EntityTypeToResourceNames` (22-27)
metadata-collector/pkg/mapper/httpsclient_test.go (1)
data-models/pkg/model/pod_device_annotation.go (2)
`DeviceAnnotation` (30-32), `PodDeviceAnnotationName` (18-18)
metadata-collector/pkg/mapper/mapper_test.go (1)
data-models/pkg/model/pod_device_annotation.go (2)
`DeviceAnnotation` (30-32), `PodDeviceAnnotationName` (18-18)
🪛 ast-grep (0.40.3)
metadata-collector/pkg/mapper/httpsclient.go
[warning] 55-57: `MinVersion` is missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported and insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. Add `MinVersion: tls.VersionTLS13` to the TLS configuration to bump the minimum version to TLS 1.3.

Context:

```go
tls.Config{
    InsecureSkipVerify: true, //nolint:gosec
}
```

Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm. [OWASP A03:2017]: Sensitive Data Exposure. [OWASP A02:2021]: Cryptographic Failures. References: https://owasp.org/Top10/A02_2021-Cryptographic_Failures (missing-ssl-minversion-go)
🪛 YAMLlint (1.37.1)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
🔇 Additional comments (7)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml (1)
15-20: LGTM! The static analysis hint is a false positive.

The YAMLlint error flagged at line 20 is a false positive. The `{{- include ... | nindent 4 }}` syntax is valid Helm templating, where:

- `{{-` trims preceding whitespace
- `nindent 4` indents the included content by 4 spaces

YAMLlint cannot properly parse Helm template directives and incorrectly interprets them as malformed YAML. The template structure is correct and follows standard Helm patterns.
metadata-collector/go.mod (1)
12-16: Update dependency versions to use stable releases instead of pre-release versions.

The review comment's concern is valid. While `google.golang.org/grpc v1.77.0` is current and free from known vulnerabilities, the k8s.io packages (`api`, `apimachinery`, `client-go`, `kubelet`) are specified at v0.35.0, which does not have a stable release yet, only release candidates (v0.35.0-rc.0 and v0.35.0-rc.1). Using unreleased versions in a module creates maintainability and compatibility risks. Specify the latest stable Kubernetes release (v1.31.x or v1.32.x) or wait for the v0.35.0 stable release before adding these as direct dependencies.

⛔ Skipped due to learnings

- Learnt from: jtschelling (NVIDIA/NVSentinel PR 490, janitor-provider/go.mod:70-70, 2026-01-06): In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.
- Learnt from: CR (.github/copilot-instructions.md): Applies to **/*.go: Use `client-go` for Kubernetes API interactions in Go code.

metadata-collector/pkg/mapper/mapper_test.go (1)
89-428: Test coverage is comprehensive and well-structured.

The test suite thoroughly covers the PodDeviceMapper functionality:
- Error handling for kubelet client failures
- Annotation lifecycle (add, update, remove, no-change, no-annotation)
- Retry behavior on conflicts
- Non-retryable error handling
- Multiple pods with varied device allocations
Past review comments regarding missing error checks have been properly addressed throughout.
metadata-collector/pkg/mapper/httpsclient_test.go (1)
32-559: LGTM! Comprehensive test coverage for kubelet HTTPS client.

The test suite covers all critical paths:
- Successful pod listing with device annotation extraction
- Error handling for token loading, request creation, network failures, invalid responses, and JSON unmarshalling
- Retry behavior on transient failures
The large JSON constant provides realistic test data, and the mock implementation cleanly isolates HTTP interactions.
metadata-collector/pkg/mapper/httpsclient.go (1)
97-151: LGTM! ListPods implementation is robust and well-documented.

The implementation properly handles:
- Token refresh on each request (supports rotation via projected volumes)
- Context-aware HTTP requests with proper timeouts
- Retry logic with error wrapping for diagnostics
- Resource cleanup with deferred body close
The extensive comments (lines 72-96) clearly justify the security trade-offs for localhost communication.
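To make the token-rotation point concrete, a minimal sketch of a per-request bearer-token load is shown below; the helper name newAuthenticatedRequest and its tokenPath/url parameters are illustrative stand-ins, not the PR's actual API.

```go
package mapper

import (
	"context"
	"fmt"
	"net/http"
	"os"
)

// newAuthenticatedRequest re-reads the service-account token on every call,
// so requests keep authenticating after the projected token is rotated.
func newAuthenticatedRequest(ctx context.Context, tokenPath, url string) (*http.Request, error) {
	token, err := os.ReadFile(tokenPath)
	if err != nil {
		return nil, fmt.Errorf("loading bearer token: %w", err)
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, fmt.Errorf("creating request: %w", err)
	}

	req.Header.Set("Authorization", "Bearer "+string(token))

	return req, nil
}
```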
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (2)
38-38: LGTM: Service account binding added.

The serviceAccountName binding correctly enables RBAC permissions for the metadata-collector to patch pod annotations via the Kubernetes API.
45-84: LGTM: Container security context and pod-resources volume mount are appropriate.

The privileged security context (runAsUser: 0, runAsGroup: 0, privileged: true) combined with the pod-gpu-resources volume mount at /var/lib/kubelet/pod-resources correctly enables access to the Kubelet PodResourcesLister gRPC endpoint for discovering GPU device allocations. The readOnly flag appropriately restricts write access.
🛡️ CodeQL Analysis: 🚨 Found 1 security alert(s)

Merging this branch will increase overall coverage.

Coverage by file
Changed files (no unit tests): [coverage table omitted]

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files: [coverage table omitted]
Branch force-pushed from 14257ec to 171432f (Compare)
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (1)
45-50: Implement kata vs regular node separation with proper volume mounts per coding guidelines.

This DaemonSet must be split into separate variants for kata and regular nodes using nodeAffinity based on the nvsentinel.dgxc.nvidia.com/kata.enabled label. Regular node DaemonSets should mount /var/log for file-based logs, while Kata node DaemonSets should mount /run/log/journal and /var/log/journal for the systemd journal. The single nodeSelector approach is insufficient; explicit nodeAffinity filtering is required.

Additionally, reconsider whether privileged: true is necessary for accessing the kubelet pod-resources socket, or whether specific capabilities would provide adequate access with better security boundaries.
🤖 Fix all issues with AI agents
In @docs/configuration/metadata-collector.md:
- Line 5: Add a new "Configuration" subsection to the metadata-collector docs
that documents the pod-to-GPU mapping annotation (dgxc.nvidia.com/devices):
specify a boolean flag to enable/disable pod device annotation (e.g.,
enablePodDeviceAnnotation), configuration knobs for scrape/poll intervals (e.g.,
podResourcesPollInterval) and any refresh backoff, the configurable Kubelet
PodResourcesLister socket path/name if supported (e.g.,
kubeletPodResourcesSocket), the exact annotation key (dgxc.nvidia.com/devices)
and example key/value formats (single GPU, multiple GPUs, UUIDs vs indexes), and
any RBAC/serviceAccount requirements needed to update pod annotations (which
role, verbs, and serviceAccount name) so operators know how to enable and secure
the feature.
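For reference when documenting the example key/value formats mentioned above, a minimal sketch of the annotation value's JSON shape is shown below, assuming the PR's DeviceAnnotation model; the GPU UUID is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeviceAnnotation mirrors the PR's data model for the
// dgxc.nvidia.com/devices annotation value.
type DeviceAnnotation struct {
	Devices map[string][]string `json:"devices"`
}

func main() {
	ann := DeviceAnnotation{
		Devices: map[string][]string{
			// Hypothetical UUID for illustration only.
			"nvidia.com/gpu": {"GPU-11111111-2222-3333-4444-555555555555"},
		},
	}

	out, err := json.Marshal(ann)
	if err != nil {
		panic(err)
	}

	// Prints: {"devices":{"nvidia.com/gpu":["GPU-11111111-2222-3333-4444-555555555555"]}}
	fmt.Println(string(out))
}
```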
In @docs/metadata-collector.md:
- Line 9: Replace instances of the compound adjective "pod to GPU" used to
modify nouns (e.g., "pod to GPU mapping") with the hyphenated form "pod-to-GPU"
so the phrase reads "pod-to-GPU mapping" (and any similar uses like "pod-to-GPU
annotation"); update all occurrences mentioned in the review (the three places
where "pod to GPU" is used as a modifier) to use the hyphenated form for correct
compound-adjective hyphenation.
- Line 28: Fix the grammar in the sentence "The Metadata Collector runs as an
regular container on each GPU node:" by replacing "an regular" with "a regular"
so it reads "The Metadata Collector runs as a regular container on each GPU
node:"; update the single occurrence of that sentence in the
docs/metadata-collector.md content.
In @metadata-collector/go.mod:
- Around line 47-48: Consolidate the duplicate YAML imports by using a single
canonical import path and version: replace the go.yaml.in/yaml/v3 v3.0.4 entry
with gopkg.in/yaml.v3 v3.0.4 (or update the existing gopkg.in/yaml.v3 v3.0.1
entry to v3.0.4) and remove the redundant alternative path so only one
gopkg.in/yaml.v3 entry remains at the chosen version.
In @metadata-collector/main.go:
- Around line 79-93: The ticker created with
time.NewTicker(defaultPodDeviceMonitorPeriod) is never stopped and
UpdatePodDevicesAnnotations errors currently abort the loop; fix by calling
ticker.Stop() via a defer immediately after creating ticker to avoid leaking its
goroutine, and change the error handling for
podDeviceMapper.UpdatePodDevicesAnnotations() to log the error (using slog.Error
or slog.Warn with context) and continue the loop instead of returning fmt.Errorf
so transient failures don’t terminate the periodic monitor; keep the ctx.Done()
case unchanged to allow graceful shutdown.
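A minimal sketch of the suggested loop shape follows; runPodDeviceMonitor and the update callback are illustrative stand-ins for the PR's runMapper and podDeviceMapper.UpdatePodDevicesAnnotations.

```go
package main

import (
	"context"
	"log/slog"
	"time"
)

// runPodDeviceMonitor stops its ticker on return and treats update errors
// as non-fatal, logging them and waiting for the next tick.
func runPodDeviceMonitor(ctx context.Context, period time.Duration, update func() (int, error)) {
	ticker := time.NewTicker(period)
	defer ticker.Stop() // release the ticker's resources when the loop exits

	for {
		select {
		case <-ctx.Done():
			return // graceful shutdown via context cancellation
		case <-ticker.C:
			numUpdates, err := update()
			if err != nil {
				// Log and continue so a transient failure does not kill the monitor.
				slog.Error("failed to update pod device annotations", "error", err)
				continue
			}
			slog.Info("updated pod device annotations", "numUpdates", numUpdates)
		}
	}
}
```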
In @metadata-collector/pkg/mapper/grpcclient.go:
- Around line 48-75: Add a godoc comment above the exported constructor
NewKubeletGRPClient describing what the function does, its parameters and return
values, and any important behavior (e.g., it verifies the kubelet pod resources
socket, establishes a gRPC connection over a unix domain socket using insecure
credentials and a custom dialer, and returns a KubeletGRPClient or an error).
Mention side-effects (like calling resolver.SetDefaultScheme) and the meaning of
errors returned to make usage clear.
- Around line 153-168: The deduplication loop over deviceAnnotation.Devices can
panic for empty slices; update the loop in grpcclient.go (the block iterating
devicesPerPod and resourceName) to skip processing when len(devices) == 0 (i.e.,
check and continue before calling sort.Strings and the dedupe logic) so you
never attempt devices[:j+1] on an empty slice; keep the existing sort and
in-place dedupe for non-empty slices and assign back to
deviceAnnotation.Devices[resourceName] only after the guard.
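A minimal sketch of the guarded sort-and-dedupe pass described above; the function name dedupeDevices is illustrative, and the map shape matches the PR's DeviceAnnotation.Devices.

```go
package mapper

import "sort"

// dedupeDevices sorts and de-duplicates each device list in place.
// Empty slices are skipped so devices[:j+1] is never evaluated on them.
func dedupeDevices(devicesPerResource map[string][]string) {
	for resourceName, devices := range devicesPerResource {
		if len(devices) == 0 {
			continue
		}

		sort.Strings(devices)

		// In-place dedupe of the sorted slice.
		j := 0
		for i := 1; i < len(devices); i++ {
			if devices[i] != devices[j] {
				j++
				devices[j] = devices[i]
			}
		}

		devicesPerResource[resourceName] = devices[:j+1]
	}
}
```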
In @metadata-collector/pkg/mapper/httpsclient.go:
- Around line 39-41: Add a godoc comment for the exported interface
KubeletHTTPSClient explaining its purpose (a client for communicating with the
kubelet over HTTPS to retrieve pod information) and document the ListPods method
(returns the current list of corev1.Pod objects from the kubelet or an error).
Place the comment immediately above the "type KubeletHTTPSClient interface"
declaration and follow Go doc style (start with "KubeletHTTPSClient ..." and
mention the behavior/return values of ListPods).
- Around line 55-62: The TLS config on the http.Transport currently sets
InsecureSkipVerify but omits a minimum TLS version; update the TLSClientConfig
in the transport construction to set MinVersion to a safe minimum (e.g.,
tls.VersionTLS12) to prevent protocol downgrades. Locate the transport variable
and its TLSClientConfig block and add MinVersion: tls.VersionTLS12 (or
tls.VersionTLS13 if you require only TLS1.3) alongside InsecureSkipVerify to
harden the connection.
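A minimal sketch of the hardened transport under the assumptions above; the constructor name newKubeletTransport is illustrative.

```go
package mapper

import (
	"crypto/tls"
	"net/http"
)

// newKubeletTransport keeps InsecureSkipVerify for the kubelet's
// self-signed localhost certificate but pins a minimum TLS version
// so the connection cannot be downgraded below TLS 1.2.
func newKubeletTransport() *http.Transport {
	return &http.Transport{
		TLSClientConfig: &tls.Config{
			InsecureSkipVerify: true, //nolint:gosec // localhost kubelet endpoint
			MinVersion:         tls.VersionTLS12,
		},
	}
}
```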
- Around line 54-70: Add a Go doc comment immediately above the exported
constructor NewKubeletHTTPSClient that explains what the function returns (a
KubeletHTTPSClient configured to talk to the kubelet over HTTPS), summarizes key
behavior (uses an http.Transport with InsecureSkipVerify=true, timeouts, and
sets bearerTokenPath and listPodsURI), and notes any important context or side
effects (takes a context.Context, may read the bearer token from
bearerTokenPath). Reference the constructor name NewKubeletHTTPSClient and the
returned type KubeletHTTPSClient in the comment.
In @metadata-collector/pkg/mapper/mapper.go:
- Around line 38-40: Add a godoc comment above the exported interface
PodDeviceMapper describing its responsibility (e.g., mapping device information
into Pod annotations) and document its method UpdatePodDevicesAnnotations to
explain what it does, what the return values represent (the int and error), and
any side effects or expectations; place the comment immediately above the type
PodDeviceMapper and use standard Go doc style starting with "PodDeviceMapper
..." and "UpdatePodDevicesAnnotations ..." to satisfy linting/godoc
requirements.
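A minimal godoc sketch for the interface, assuming the signature shown elsewhere in this review; the wording is illustrative only.

```go
package mapper

// PodDeviceMapper reconciles GPU device allocations reported by the kubelet
// with the dgxc.nvidia.com/devices annotation on pods scheduled to this node.
type PodDeviceMapper interface {
	// UpdatePodDevicesAnnotations patches pod annotations to match current
	// device allocations, returning the number of pods updated and an error
	// if listing pods or patching any pod fails.
	UpdatePodDevicesAnnotations() (int, error)
}
```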
🧹 Nitpick comments (7)
metadata-collector/pkg/mapper/mapper_test.go (2)
127-127: Swap assert.Equal arguments to follow the (expected, actual) convention.

The testify assert.Equal function expects arguments in the order (expected, actual). Throughout these tests, the arguments are reversed (e.g., assert.Equal(t, numUpdates, 0) should be assert.Equal(t, 0, numUpdates)). This affects error messages when assertions fail.

♻️ Example fix for line 127

- assert.Equal(t, numUpdates, 0)
+ assert.Equal(t, 0, numUpdates)

Apply similar fixes throughout the file.
Also applies to: 143-143, 163-163, 192-192, 219-219, 246-246, 267-267, 313-313, 352-352, 400-400
300-300: Remove debug fmt.Println statements.

These appear to be leftover debug statements. Use t.Log if logging is needed for test diagnostics, or remove them entirely.

♻️ Proposed fix

  switch callCount {
  case 1:
-     fmt.Println("error 1")
      return true, nil, errors.NewConflict(

Also applies to: 342-342
metadata-collector/pkg/mapper/httpsclient_test.go (1)
453-466: Consider validating the request in RoundTrip for more robust mocking.

The mock's RoundTrip method ignores the incoming request (m.Called() without arguments). While acceptable for current tests focused on response handling, future tests requiring request validation would need updates.

♻️ Optional enhancement for request validation

  func (m *MockHTTPRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
-     args := m.Called()
+     args := m.Called(req)
      var resp *http.Response
      if args.Get(0) != nil {
          resp = args.Get(0).(*http.Response)
      }
      return resp, args.Error(1)
  }

This would require updating the mock setup to use mock.AnythingOfType("*http.Request").

metadata-collector/pkg/mapper/httpsclient.go (1)
97-155: Consider more selective retry logic.

The ListPods implementation is well-structured with proper error handling and token rotation support. However, retryAllErrors (lines 153-155) retries all errors indiscriminately, which may lead to unnecessary retries for non-transient failures (e.g., 401 Unauthorized).

Consider implementing selective retry logic similar to the pattern in mapper.go that only retries transient errors like timeouts, rate limiting, and server errors.

♻️ Proposed refinement

-func retryAllErrors(_ error) bool {
-	return true
+func retryAllErrors(err error) bool {
+	// Retry on network timeouts and 5xx server errors
+	if errors.IsTimeout(err) || errors.IsServerTimeout(err) {
+		return true
+	}
+	// Check for temporary network errors in the error chain
+	// (http.Client wraps net errors)
+	return false
 }
185-187: Consider logging errors from Close.

The Close method ignores any error returned by connection.Close(). While this may be acceptable during cleanup, consider logging the error to aid debugging of connection issues.

♻️ Proposed refinement

 func (client *kubeletGRPClient) Close() {
-	client.connection.Close()
+	if err := client.connection.Close(); err != nil {
+		slog.Warn("Error closing gRPC connection", "error", err)
+	}
 }

metadata-collector/pkg/mapper/grpcclient_test.go (1)
159-193: Consider using the standard testing package for simple assertions.

Based on learnings, avoid introducing testify for simple equality/inequality checks. The complex map comparison at line 179 justifies testify usage, but simple assertions like assert.NoError, assert.Error, and assert.NotNil could use the standard testing package (t.Error, t.Fatal, etc.).

♻️ Example refinement for simple checks

 func TestListPodResourcesError(t *testing.T) {
 	kubeletGRPCClient := newTestKubeletGRPCClient(errors.New("ListPodResources error"), 10, podDevices)
 	_, err := kubeletGRPCClient.ListPodResources()
-	assert.Error(t, err)
+	if err == nil {
+		t.Fatal("expected error, got nil")
+	}
 }

Based on learnings.
metadata-collector/pkg/mapper/mapper.go (1)
197-214: Include namespace in log and error messages for clarity.

The log message at line 204 and error message at line 209 should include the namespace along with the pod name to uniquely identify the pod being patched, as pod names are only unique within a namespace.

♻️ Proposed refinement

 if err != nil && isRetryableError(err) {
 	slog.Warn("Retryable error patching pod annotation. Retrying...",
 		"pod", podName,
+		"namespace", namespace,
 		"error", err)
 }
 if err != nil {
-	return fmt.Errorf("failed to patch pod %s: %w", podName, err)
+	return fmt.Errorf("failed to patch pod %s/%s: %w", namespace, podName, err)
 }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
metadata-collector/go.sum is excluded by !**/*.sum
📒 Files selected for processing (16)
data-models/pkg/model/pod_device_annotation.go
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
docs/configuration/metadata-collector.md
docs/metadata-collector.md
metadata-collector/go.mod
metadata-collector/main.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/pkg/mapper/mapper_test.go
💤 Files with no reviewable changes (1)
- distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- data-models/pkg/model/pod_device_annotation.go
🧰 Additional context used
📓 Path-based instructions (5)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/main.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient_test.go
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using the format: TestFunctionName_Scenario_ExpectedBehavior
Files:
metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient_test.go
**/go.mod
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use go.mod for each service as a separate Go module with semantic import versioning
Files:
metadata-collector/go.mod
**/daemonset*.yaml
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Explain DaemonSet variant selection logic in Helm chart documentation
Files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
distros/kubernetes/**/*daemonset*.yaml
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes using nodeAffinity based on the kata.enabled label
Regular node DaemonSets should use the /var/log volume mount for file-based logs
Kata node DaemonSets should use /run/log/journal and /var/log/journal volume mounts for the systemd journal
Files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
🧠 Learnings (17)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/go.mod
📚 Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
docs/metadata-collector.md
📚 Learning: 2025-11-06T10:59:26.687Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 281
File: distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml:20-0
Timestamp: 2025-11-06T10:59:26.687Z
Learning: The pause container image registry.k8s.io/pause:3.9+ requires eBPF and cgroup v2 support (kernel 4.15+, CONFIG_BPF, CONFIG_CGROUP_BPF, /sys/fs/bpf mounted). Older pause:3.0 images from gcr.io/google-containers/pause work on systems without full BPF/cgroup v2 support. When suggesting pause image updates, verify the target environment has necessary kernel features to avoid BPF_PROG_ATTACH errors.
Applied to files:
docs/metadata-collector.md
docs/configuration/metadata-collector.md
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
metadata-collector/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
metadata-collector/pkg/mapper/grpcclient.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient_test.go
📚 Learning: 2026-01-06T21:31:36.113Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:36.113Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
metadata-collector/go.mod
metadata-collector/pkg/mapper/httpsclient_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Kata node DaemonSets should use `/run/log/journal` and `/var/log/journal` volume mounts for systemd journal
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
🧬 Code graph analysis (5)
metadata-collector/main.go (1)
metadata-collector/pkg/mapper/mapper.go (1)
NewPodDeviceMapper(50-77)
metadata-collector/pkg/mapper/grpcclient.go (1)
data-models/pkg/model/pod_device_annotation.go (2)
DeviceAnnotation (30-32), EntityTypeToResourceNames (22-27)
metadata-collector/pkg/mapper/mapper_test.go (1)
data-models/pkg/model/pod_device_annotation.go (2)
DeviceAnnotation (30-32), PodDeviceAnnotationName (18-18)
metadata-collector/pkg/mapper/grpcclient_test.go (1)
data-models/pkg/model/pod_device_annotation.go (1)
DeviceAnnotation(30-32)
metadata-collector/pkg/mapper/httpsclient_test.go (1)
data-models/pkg/model/pod_device_annotation.go (2)
DeviceAnnotation (30-32), PodDeviceAnnotationName (18-18)
🪛 ast-grep (0.40.4)
metadata-collector/pkg/mapper/httpsclient.go
[warning] 55-57: MinVersion is missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported and insecure browsers (such as Internet Explorer 10) may it be necessary to enable TLS 1.0. Add MinVersion: tls.VersionTLS13 to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{
InsecureSkipVerify: true, //nolint:gosec
}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures
(missing-ssl-minversion-go)
🪛 LanguageTool
docs/metadata-collector.md
[grammar] ~9-~9: Use a hyphen to join words.
Context: ...adata Collector will also expose the pod to GPU mapping as an annotation on each ...
(QB_NEW_EN_HYPHEN)
[grammar] ~9-~9: Use a hyphen to join words.
Context: ...ta Collector will also expose the pod to GPU mapping as an annotation on each pod...
(QB_NEW_EN_HYPHEN)
[grammar] ~22-~22: Use a hyphen to join words.
Context: ...y, the node drainer module needs the pod to GPU mapping to map a set of pods to t...
(QB_NEW_EN_HYPHEN)
[grammar] ~22-~22: Use a hyphen to join words.
Context: ...the node drainer module needs the pod to GPU mapping to map a set of pods to the ...
(QB_NEW_EN_HYPHEN)
[grammar] ~101-~101: Use a hyphen to join words.
Context: ...th monitors. ### Kubernetes API The pod to GPU mapping is exposed on pods object...
(QB_NEW_EN_HYPHEN)
[grammar] ~101-~101: Use a hyphen to join words.
Context: ...monitors. ### Kubernetes API The pod to GPU mapping is exposed on pods objects a...
(QB_NEW_EN_HYPHEN)
🪛 YAMLlint (1.37.1)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
🔇 Additional comments (21)
docs/configuration/metadata-collector.md (1)
21-21: Clarify container lifecycle: is this still an init container?

Line 21 refers to the "metadata-collector init container," but the PR description and AI summary suggest a shift from the init-container pattern to a regular container lifecycle. Verify and update the documentation to reflect whether metadata-collector is now a regular (long-running) container, sidecar, or still an init container. This affects deployment expectations and resource management.
metadata-collector/go.mod (3)
7-17: Direct dependencies align with the PR feature and coding guidelines.

The additions of google.golang.org/grpc, k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go are appropriate and follow the guideline to use client-go for Kubernetes API interactions. The semantic versioning at v0.35.0 is correct and represents the latest stable release of the Kubernetes libraries; v0.36.0 is only available as alpha/rc builds.
27-37: These subpackages are legitimately pulled in by k8s.io/kube-openapi and are correctly marked as indirect dependencies.

Lines 27–37 list subpackages of go-openapi/swag as indirect dependencies, which are legitimately required by k8s.io/kube-openapi (itself a dependency of your k8s.io packages). The go-openapi/swag module uses semantic import versioning and breaks its code into subpackages with individual go.mod entries; this is expected behavior and the result of standard go mod tidy operations. No code in this module directly imports these subpackages, and they are correctly marked as indirect. No action required.
12-16: Dependencies are correctly specified for the PodResourcesLister gRPC feature.

Verification confirms that k8s.io/kubelet v0.35.0 provides the PodResourcesLister API (available in pkg/apis/podresources/v1alpha1), and the codebase actively uses v1.PodResourcesListerClient with these exact versions. The gRPC v1.77.0 and k8s.io/kubelet v0.35.0 combination is compatible and requires no changes.
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml (1)
15-20: LGTM - ServiceAccount template is correctly structured.

The template correctly uses Helm templating for name and labels. The YAMLlint syntax error is a false positive due to Helm template directives.
Consider adding a trailing newline at the end of the file for POSIX compliance.
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml (1)
21-33: LGTM - ClusterRole permissions follow the least-privilege principle.

The permissions are appropriately scoped:

- pods/patch - required to update pod annotations with GPU device mappings
- nodes/proxy/get - required to access the Kubelet HTTPS /pods endpoint

Consider adding a trailing newline at the end of the file.
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml (1)
15-28: LGTM - ClusterRoleBinding correctly links ServiceAccount to ClusterRole.

The binding correctly references the ClusterRole and ServiceAccount using consistent Helm templating, with the ServiceAccount namespace properly set to the release namespace.
Consider adding a trailing newline at the end of the file.
metadata-collector/pkg/mapper/mapper_test.go (3)
359-359: Minor style nit on composite literal pointers.

When creating composite literal pointers, you could simplify &model.DeviceAnnotation{...}, but the current syntax is correct. No change needed.

Also applies to: 370-370, 414-414
89-112: LGTM - Test helper newTestDeviceMapper is well-structured.

The helper correctly sets up mocks and creates test pods in the fake clientset, with proper error handling and propagation.
274-314: LGTM - Retry logic test correctly validates conflict handling.

The test properly simulates a conflict error on the first patch attempt and verifies the retry succeeds. The past review comment about a missing error check has been addressed with assert.NoError(t, err) on line 287.

metadata-collector/pkg/mapper/httpsclient_test.go (2)
482-502: LGTM - TestListPods validates successful pod listing and annotation extraction.

The test properly verifies that pods are listed and device annotations can be extracted and deserialized correctly.

534-553: LGTM - Retry test correctly validates transient failure recovery.

The test properly simulates a 500 error followed by success, verifying the client's retry logic works as expected.
metadata-collector/main.go (2)
56-70: Verify intended behavior: runMapper blocks indefinitely until signal.

The current flow calls runMapper, which runs an infinite loop until context cancellation. This means "Pod device mapper completed successfully" at line 70 will only be logged on graceful shutdown (SIGTERM/SIGINT). Ensure this is the intended behavior.
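A minimal sketch of the shutdown flow under discussion, assuming signal.NotifyContext; runMapper here is a placeholder that simply blocks on the context.

```go
package main

import (
	"context"
	"log/slog"
	"os/signal"
	"syscall"
)

func main() {
	// The context is cancelled on SIGTERM or SIGINT, unblocking runMapper.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	if err := runMapper(ctx); err != nil {
		slog.Error("pod device mapper failed", "error", err)
		return
	}

	// Only reached after graceful shutdown, matching the behavior noted above.
	slog.Info("Pod device mapper completed successfully")
}

// runMapper stands in for the PR's periodic loop; it blocks until cancellation.
func runMapper(ctx context.Context) error {
	<-ctx.Done()
	return nil
}
```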
96-139: LGTM - runCollector is well-structured with proper resource cleanup.

The function correctly:
- Initializes and defers NVML shutdown
- Handles errors with appropriate context wrapping
- Uses structured logging throughout
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (3)
38-38: LGTM: ServiceAccount reference added correctly.

The serviceAccountName field properly references the metadata-collector ServiceAccount created in the new RBAC templates, enabling the DaemonSet to interact with the Kubernetes API.
69-71: LGTM: Pod resources volume mount configured correctly.

The volume mount for /var/lib/kubelet/pod-resources is properly configured with readOnly: true, which is appropriate since the metadata-collector only needs to read pod resource allocations from the kubelet socket.

Also applies to: 81-84
85-101: No changes needed; this DaemonSet correctly handles its use case.

The metadata-collector collects GPU metadata and pod GPU resource information, not system logs. It does not require separate kata and regular variants or the log volume mounts specified in the learnings. Those requirements apply to services like syslog-health-monitor that access logs differently on kata vs. regular nodes. The current single-DaemonSet design with optional runtimeClassName is correct for metadata-collector's functionality.
Likely an incorrect or invalid review comment.
metadata-collector/pkg/mapper/grpcclient_test.go (1)
34-157: LGTM: Well-structured test infrastructure.

The test fixtures and mock implementation are comprehensive, covering multiple scenarios including:
- Same GPU allocated multiple times to one container
- Same GPU allocated to different containers
- Different pods with different devices
- Unsupported devices
- Empty pods
The error injection mechanism for retry testing is well-designed.
metadata-collector/pkg/mapper/mapper.go (2)
122-195: LGTM: Well-implemented annotation synchronization logic.

The UpdatePodDevicesAnnotations implementation is comprehensive and well-documented:
- Correctly uses MergePatch for adding/updating annotations (to handle missing annotations object)
- Uses JSONPatch for removal operations
- Includes proper error handling and logging
- Acknowledges cyclomatic complexity with nolint directive
The detailed comment block (lines 79-120) provides excellent context for the implementation approach.
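To illustrate the two patch shapes, a minimal client-go sketch is shown below; the helper names are hypothetical, while the annotation key and patch types follow the description above.

```go
package mapper

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// setAnnotation uses a merge patch, which creates metadata.annotations
// if the pod does not have an annotations object yet.
func setAnnotation(ctx context.Context, client kubernetes.Interface, namespace, pod, value string) error {
	payload := fmt.Sprintf(`{"metadata":{"annotations":{"dgxc.nvidia.com/devices":%q}}}`, value)
	_, err := client.CoreV1().Pods(namespace).Patch(
		ctx, pod, types.MergePatchType, []byte(payload), metav1.PatchOptions{})
	return err
}

// removeAnnotation uses a JSON patch; the "/" in the key is escaped as "~1".
func removeAnnotation(ctx context.Context, client kubernetes.Interface, namespace, pod string) error {
	payload := []byte(`[{"op":"remove","path":"/metadata/annotations/dgxc.nvidia.com~1devices"}]`)
	_, err := client.CoreV1().Pods(namespace).Patch(
		ctx, pod, types.JSONPatchType, payload, metav1.PatchOptions{})
	return err
}
```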
216-233: LGTM: Comprehensive retry error handling.

The isRetryableError function correctly identifies transient errors that should be retried:
- Resource version conflicts
- Server timeouts and rate limiting
- Network timeouts and service unavailability
This aligns well with Kubernetes API best practices for retry logic.
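A minimal sketch of such a predicate using the k8s.io/apimachinery error helpers; this approximates the checks listed above and may differ from the PR's exact implementation.

```go
package mapper

import apierrors "k8s.io/apimachinery/pkg/api/errors"

// isRetryableError reports whether a patch failure is transient and
// worth retrying, per the categories listed above.
func isRetryableError(err error) bool {
	return apierrors.IsConflict(err) || // resource version conflicts
		apierrors.IsServerTimeout(err) || // server-side timeouts
		apierrors.IsTooManyRequests(err) || // rate limiting
		apierrors.IsTimeout(err) || // request timeouts
		apierrors.IsServiceUnavailable(err) // transient unavailability
}
```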
metadata-collector/pkg/mapper/grpcclient.go (1)
37-39: Add godoc comment for exported interface.

As per coding guidelines, all exported interfaces require documentation comments. Add a godoc comment explaining the purpose of KubeletGRPClient and its ListPodResources method.

📝 Proposed documentation

+// KubeletGRPClient provides access to the Kubelet's PodResourcesLister gRPC service
+// for discovering device allocations to pods on the local node.
 type KubeletGRPClient interface {
+	// ListPodResources returns a mapping from "namespace/name" to device allocations
+	// for all pods on the node with supported GPU resources.
 	ListPodResources() (map[string]*model.DeviceAnnotation, error)
 }

As per coding guidelines, "Function comments required for all exported Go functions".
⛔ Skipped due to learnings
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Merging this branch will increase overall coverage.

Coverage by file
Changed files (no unit tests): [coverage table omitted]

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files: [coverage table omitted]
🛡️ CodeQL Analysis: 🚨 Found 1 security alert(s)
Branch force-pushed from 171432f to 0c645fe (Compare)
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (2)
45-101: Add kata vs regular node DaemonSet separation.

Per coding guidelines, this chart requires separate DaemonSets for kata and regular nodes using nodeAffinity with the nvsentinel.dgxc.nvidia.com/kata.enabled label. The existing daemonset-kata.yaml and daemonset-regular.yaml pattern in the syslog-health-monitor chart provides a reference implementation for this separation.

For metadata-collector, create:

- daemonset-regular.yaml targeting nodes where kata.enabled: "false"
- daemonset-kata.yaml targeting nodes where kata.enabled: "true"

If metadata-collector needs to collect logs, regular nodes should mount /var/log and kata nodes should mount /run/log/journal and /var/log/journal.
48-50: Privileged mode should be justified or replaced with minimal required capabilities.

The privileged: true setting grants broad host access, which appears excessive for GPU metadata collection. Consider whether specific Linux capabilities (similar to syslog-health-monitor's approach with SYSLOG and SYS_ADMIN) are sufficient for:

- NVML GPU metadata access via /sys
- ServiceAccount-based kubelet HTTPS access
If privileged access is unavoidable, document the security rationale.
🤖 Fix all issues with AI agents
In
@distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml:
- Around line 22-27: The ClusterRole entry granting apiGroups: "" resources:
["pods"] verbs: ["patch"] is overly broad; confirm whether the
metadata-collector actually needs cluster-wide pod patch rights for GPU metadata
collection and, if so, either restrict scope by adding resourceNames or
namespace/label-based constraints (or limit to specific subresources) or replace
patch with a less-privileged verb; if the broad permission is intentional, add
an inline YAML comment in the ClusterRole explaining the threat model/use case
and link to any security justification document so reviewers understand why
pods:patch is required.
In @metadata-collector/main.go:
- Around line 86-93: The loop handling ticker.C is returning on any error from
podDeviceMapper.UpdatePodDevicesAnnotations(), which will kill the periodic
monitoring; instead, change the error handling so that when
podDeviceMapper.UpdatePodDevicesAnnotations() returns an error you log it (using
slog.Error or similar with context such as "failed to update pod device
annotations" and the error value) and continue the loop to wait for the next
tick; keep the successful path that logs numUpdates with slog.Info. Ensure you
reference the ticker.C select case and the
podDeviceMapper.UpdatePodDevicesAnnotations() call when making the change.
In @metadata-collector/pkg/mapper/grpcclient_test.go:
- Line 179: The assert.Equal call has its arguments reversed which will produce
confusing failure messages; swap the arguments so the expected value
(expectedDevicesPerPod) is passed before the actual value (devicesPerPod) in the
assert.Equal(t, expected, actual) invocation that currently reads
assert.Equal(t, devicesPerPod, expectedDevicesPerPod).
In @metadata-collector/pkg/mapper/mapper.go:
- Around line 38-40: Add a proper godoc comment above the exported interface
PodDeviceMapper describing its purpose and behavior; mention what
implementations are expected to do and what the returned int and error from
UpdatePodDevicesAnnotations represent (e.g., number of pods updated and error
conditions). Ensure the comment follows godoc style (starts with
"PodDeviceMapper ...") and also document the UpdatePodDevicesAnnotations method
semantics briefly within the interface comment or as a line comment above the
method.
- Around line 50-77: Add a godoc comment immediately above the exported function
NewPodDeviceMapper that begins with the function name and briefly describes what
it does, its ctx parameter (context.Context), what it returns (a PodDeviceMapper
or an error), and any important behaviors (e.g., it creates Kubelet HTTPS/gRPC
clients and an in-cluster Kubernetes client and wraps them in a
podDeviceMapper); ensure the comment is a proper single-line summary followed by
any short details on failure conditions to satisfy linting for exported symbols
like NewPodDeviceMapper and to clarify the relationship to podDeviceMapper.
🧹 Nitpick comments (7)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (1)
45-101: Document DaemonSet variant selection logic.

Per coding guidelines, the Helm chart documentation should explain DaemonSet variant selection logic. Since this file is daemonset.yaml (not a variant-specific file), the documentation should clarify how administrators select between deployment modes and whether kata support is planned.

Based on coding guidelines.
data-models/pkg/model/pod_device_annotation.go (1)
17-32: Add documentation for exported entities.

The exported constant PodDeviceAnnotationName, variable EntityTypeToResourceNames, and struct DeviceAnnotation lack godoc comments. Per coding guidelines, exported functions and types should be documented.

Additionally, EntityTypeToResourceNames is declared as var but appears to be a read-only lookup table. Consider whether this should be documented as immutable or protected against accidental modification.

📝 Proposed documentation

+// PodDeviceAnnotationName is the annotation key used to store GPU device allocations on pods.
 const (
 	PodDeviceAnnotationName = "dgxc.nvidia.com/devices"
 )

+// EntityTypeToResourceNames maps entity types (e.g., GPU_UUID) to their corresponding
+// Kubernetes resource names for device allocation tracking.
 var (
 	EntityTypeToResourceNames = map[string][]string{
 		"GPU_UUID": {
 			"nvidia.com/gpu",
 			"nvidia.com/pgpu",
 		},
 	}
 )

+// DeviceAnnotation represents the structure of GPU device allocations stored
+// as a pod annotation. The Devices map keys are resource names (e.g., "nvidia.com/gpu")
+// and values are lists of allocated device IDs.
 type DeviceAnnotation struct {
 	Devices map[string][]string `json:"devices"`
 }

Based on coding guidelines, "Function comments required for all exported Go functions".
metadata-collector/pkg/mapper/httpsclient_test.go (1)
453-466: Consider simplifying the mock struct.

The MockHTTPRoundTripper embeds http.RoundTripper but never uses the embedded interface. This could be simplified to only embed mock.Mock.

♻️ Simplified mock

 type MockHTTPRoundTripper struct {
-	http.RoundTripper
 	mock.Mock
 }

metadata-collector/pkg/mapper/grpcclient.go (1)
185-187: Ignored error from connection.Close().

The error returned by client.connection.Close() is discarded. While Close errors are often benign, logging or returning the error would aid debugging connection issues.

♻️ Proposed change

-func (client *kubeletGRPClient) Close() {
-	client.connection.Close()
+func (client *kubeletGRPClient) Close() error {
+	return client.connection.Close()
 }

Note: This would require updating the interface if Close() is part of it.

metadata-collector/pkg/mapper/mapper_test.go (1)
300-300: Remove debug print statements from tests.

Lines 300 and 342 contain fmt.Println("error 1"), which appears to be a leftover debug statement. These should be removed or converted to t.Log() if logging is needed during test execution.

♻️ Proposed fix

  switch callCount {
  case 1:
-     fmt.Println("error 1")
      return true, nil, errors.NewConflict(

Apply the same fix at line 342.
Also applies to: 342-342
metadata-collector/pkg/mapper/grpcclient_test.go (2)
22-22: Consider using the standard testing package for simple assertions.

Based on learnings, the repository prefers avoiding testify for simple equality/inequality checks, using standard testing package assertions (t.Error, t.Errorf, t.Fatal) instead. Reserve testify for complex scenarios requiring richer diagnostics.
159-159: Test names could follow the convention TestFunctionName_Scenario_ExpectedBehavior.

Per coding guidelines, consider renaming tests to be more descriptive, e.g.:

- TestListPodResources_ValidDevices_ReturnsAggregatedAnnotations
- TestListPodResources_PersistentError_ReturnsError
- TestListPodResources_TransientError_RetriesSuccessfully

Also applies to: 182-182, 188-188
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
metadata-collector/go.sum is excluded by !**/*.sum
📒 Files selected for processing (16)
data-models/pkg/model/pod_device_annotation.go
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
docs/configuration/metadata-collector.md
docs/metadata-collector.md
metadata-collector/go.mod
metadata-collector/main.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/pkg/mapper/mapper_test.go
💤 Files with no reviewable changes (1)
- distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
🧰 Additional context used
📓 Path-based instructions (5)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
metadata-collector/pkg/mapper/httpsclient_test.go
data-models/pkg/model/pod_device_annotation.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/main.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/grpcclient.go
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using the format: TestFunctionName_Scenario_ExpectedBehavior
Files:
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/mapper_test.go
**/go.mod
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use go.mod for each service as a separate Go module with semantic import versioning
Files:
metadata-collector/go.mod
**/daemonset*.yaml
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Explain DaemonSet variant selection logic in Helm chart documentation
Files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
distros/kubernetes/**/*daemonset*.yaml
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes using nodeAffinity based on the kata.enabled label
Regular node DaemonSets should use the /var/log volume mount for file-based logs
Kata node DaemonSets should use /run/log/journal and /var/log/journal volume mounts for the systemd journal
Files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
🧠 Learnings (18)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/go.mod
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/mapper_test.go
📚 Learning: 2026-01-06T21:31:36.113Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:36.113Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
metadata-collector/go.mod
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/grpcclient.go
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages
Applied to files:
data-models/pkg/model/pod_device_annotation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Include comprehensive comments for all fields in Protocol Buffer messages
Applied to files:
data-models/pkg/model/pod_device_annotation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
data-models/pkg/model/pod_device_annotation.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/pkg/mapper/grpcclient.go
📚 Learning: 2025-11-06T10:59:26.687Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 281
File: distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml:20-0
Timestamp: 2025-11-06T10:59:26.687Z
Learning: The pause container image registry.k8s.io/pause:3.9+ requires eBPF and cgroup v2 support (kernel 4.15+, CONFIG_BPF, CONFIG_CGROUP_BPF, /sys/fs/bpf mounted). Older pause:3.0 images from gcr.io/google-containers/pause work on systems without full BPF/cgroup v2 support. When suggesting pause image updates, verify the target environment has necessary kernel features to avoid BPF_PROG_ATTACH errors.
Applied to files:
docs/configuration/metadata-collector.md
docs/metadata-collector.md
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
metadata-collector/main.go
📚 Learning: 2025-11-07T04:16:43.079Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 288
File: health-monitors/syslog-health-monitor/pkg/metadata/reader.go:45-49
Timestamp: 2025-11-07T04:16:43.079Z
Learning: In the syslog-health-monitor, the metadata Reader uses lazy loading (via sync.Once and ensureLoaded) because the metadata file is generated by metadata-collector which requires GPU drivers to be installed. XID/SXID errors only occur after GPU drivers are installed, so the metadata file may not exist when the health monitor initializes. Therefore, NewReader should not eagerly load or validate file existence in the constructor.
Applied to files:
docs/metadata-collector.md
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Kata node DaemonSets should use `/run/log/journal` and `/var/log/journal` volume mounts for systemd journal
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
🧬 Code graph analysis (6)
metadata-collector/pkg/mapper/httpsclient_test.go (1)
  data-models/pkg/model/pod_device_annotation.go (2)
    DeviceAnnotation (30-32), PodDeviceAnnotationName (18-18)
metadata-collector/pkg/mapper/mapper.go (3)
  metadata-collector/pkg/mapper/httpsclient.go (2)
    KubeletHTTPSClient (39-41), NewKubeletHTTPSClient (54-70)
  metadata-collector/pkg/mapper/grpcclient.go (2)
    KubeletGRPClient (37-39), NewKubeletGRPClient (48-75)
  data-models/pkg/model/pod_device_annotation.go (1)
    PodDeviceAnnotationName (18-18)
metadata-collector/main.go (1)
  metadata-collector/pkg/mapper/mapper.go (1)
    NewPodDeviceMapper (50-77)
metadata-collector/pkg/mapper/grpcclient_test.go (1)
  data-models/pkg/model/pod_device_annotation.go (1)
    DeviceAnnotation (30-32)
metadata-collector/pkg/mapper/mapper_test.go (2)
  data-models/pkg/model/pod_device_annotation.go (2)
    DeviceAnnotation (30-32), PodDeviceAnnotationName (18-18)
  platform-connectors/pkg/pipeline/factory.go (1)
    Create (32-39)
metadata-collector/pkg/mapper/grpcclient.go (1)
  data-models/pkg/model/pod_device_annotation.go (2)
    DeviceAnnotation (30-32), EntityTypeToResourceNames (22-27)
🪛 ast-grep (0.40.4)
metadata-collector/pkg/mapper/httpsclient.go
[warning] 55-57: `MinVersion` is missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General-purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported and insecure browsers (such as Internet Explorer 10) may it be necessary to enable TLS 1.0. Add `MinVersion: tls.VersionTLS13` to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{
InsecureSkipVerify: true, //nolint:gosec
}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures
(missing-ssl-minversion-go)
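For reference, a minimal sketch of the transport shape the linter is asking for, assuming the client keeps `InsecureSkipVerify` for the localhost kubelet connection (the function name here is illustrative, not the module's actual code):

```go
package kubeletclient

import (
	"crypto/tls"
	"net/http"
	"time"
)

// newKubeletTransport builds an HTTP transport that skips certificate
// verification for the localhost kubelet endpoint but still refuses to
// negotiate anything older than TLS 1.2.
func newKubeletTransport() *http.Transport {
	return &http.Transport{
		TLSClientConfig: &tls.Config{
			InsecureSkipVerify: true, //nolint:gosec // localhost kubelet; cert lacks a localhost SAN
			MinVersion:         tls.VersionTLS12,
		},
		TLSHandshakeTimeout: 30 * time.Second,
	}
}
```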
🪛 YAMLlint (1.37.1)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
🔇 Additional comments (17)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml (1)
1-20: LGTM! The ServiceAccount template is correctly structured with proper Helm templating. The yamllint syntax error on line 20 is a false positive; the `{{-` syntax is valid Helm template syntax.
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml (1)
28-33: LGTM! The get permission on nodes/proxy is appropriate for accessing kubelet HTTPS endpoints through the API server proxy, which aligns with the PR's stated approach of using local kubelet endpoints.
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml (1)
1-28: LGTM! The ClusterRoleBinding correctly binds the ClusterRole to the ServiceAccount with proper Helm templating. The yamllint syntax error on line 20 is a false positive.
docs/metadata-collector.md (1)
38-44: Documentation looks comprehensive. The new GPU-to-pod mapping annotation section clearly explains the workflow using Kubelet endpoints. The step-by-step breakdown (HTTPS /pods endpoint, gRPC PodResourcesLister, annotation updates, loop-based reconciliation) aligns with the implementation.
Consider specifying the default reconciliation interval in line 44 ("fixed threshold") for operational clarity, but this is optional.
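For readers of the docs comment above, the reconciliation being described boils down to a fixed-interval loop with context-based shutdown. A minimal sketch, assuming the 30-second period mentioned elsewhere in this review and an injected update function (not the module's exact code):

```go
package main

import (
	"context"
	"log/slog"
	"time"
)

// runMapperLoop re-applies pod device annotations on a fixed period until
// the context is cancelled. updateFn stands in for the real reconcile call.
func runMapperLoop(ctx context.Context, updateFn func() (int, error)) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop() // avoid leaking the ticker on shutdown

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if n, err := updateFn(); err != nil {
				slog.Error("pod device annotation update failed", "error", err)
			} else {
				slog.Info("pod device annotations updated", "podCount", n)
			}
		}
	}
}
```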
metadata-collector/pkg/mapper/httpsclient_test.go (2)
482-502: Test coverage is thorough. The test validates both the pod list parsing and the extraction of device annotations from pod metadata. The assertion on `expectedDeviceAnnotation` ensures the JSON unmarshalling logic works correctly with the `model.DeviceAnnotation` type.
534-553: Retry behavior test is well-structured. This test correctly validates the retry logic by configuring the mock to return a 500 error on the first call and succeed on the second. The `.Once()` chaining ensures proper call ordering.
metadata-collector/go.mod (1)
12-16: Kubernetes and gRPC dependencies appropriately added. The new direct dependencies on `google.golang.org/grpc`, `k8s.io/api`, `k8s.io/apimachinery`, `k8s.io/client-go`, and `k8s.io/kubelet` align with the Kubelet HTTPS and gRPC client implementations. Version pinning at v0.35.0 for all k8s.io modules ensures compatibility.
metadata-collector/pkg/mapper/httpsclient.go (1)
72-96: Excellent security rationale documentation. The detailed comment block explaining the justification for `InsecureSkipVerify` (localhost SAN limitations) and the authentication/authorization flow (bearer token, TokenReview, SubjectAccessReview) is well-written and provides valuable context for future maintainers.
metadata-collector/pkg/mapper/grpcclient.go (2)
77-112: Comprehensive documentation for ListPodResources. The detailed comment block with example output clearly explains the function's behavior, including deduplication, sorting, and the possibility of shared devices across pods. This is excellent for maintainability.
54-54: Global resolver state modification may affect other gRPC clients.
`resolver.SetDefaultScheme("passthrough")` modifies global state, which could affect other gRPC clients in the same process that expect the default DNS resolver. Consider using a connection-specific option (e.g., prefixing the target with `"passthrough:///"`) or documenting this side effect.
metadata-collector/pkg/mapper/mapper_test.go (3)
89-112: Well-designed test helper. The `newTestDeviceMapper` helper cleanly encapsulates mock setup and fake clientset creation, making individual tests concise and focused. Error propagation from pod creation is properly handled.
274-314: Retryable patch error test validates retry behavior. The test correctly uses a `PrependReactor` to simulate a conflict error on the first patch attempt, then succeeds on retry. The assertion that `numUpdates == 1` confirms the retry succeeded.
355-409: Multi-pod scenario provides good integration coverage. This test validates that the mapper handles multiple pods with different device allocations correctly, including pods with multiple resource types. The assertions verify that both pods receive correct annotations.
metadata-collector/main.go (1)
97-140: Well-structured collector implementation. The function properly handles NVML lifecycle with defer, wraps errors with context, and uses structured logging throughout.
metadata-collector/pkg/mapper/grpcclient_test.go (1)
98-149: Well-designed mock with error injection support. The mock cleanly embeds `PodResourcesListerClient`, supports configurable error counts for retry testing, and correctly constructs pod resources from the fixture data.
metadata-collector/pkg/mapper/mapper.go (2)
79-120: Comprehensive implementation with excellent documentation. The block comment thoroughly explains the algorithm and design decisions. The three-phase approach (list pods, get device allocations, reconcile annotations) is well-structured.
One consideration: early returns on errors (lines 149, 180, 187) mean a single pod failure stops processing remaining pods. This is acceptable for now but consider logging and continuing for transient per-pod errors in future iterations.
Also applies to: 122-195
197-233: Solid retry logic with appropriate error classification. The `isRetryableError` function correctly handles transient failures (conflicts, timeouts, rate limiting, service unavailability), and the retry mechanism properly logs warnings before retrying. The error wrapping with `%w` preserves the original error for the k8s error type checks.
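For reference, a minimal sketch of that retryable-error predicate, assuming the `k8s.io/apimachinery/pkg/api/errors` helpers; this mirrors the classification described above, not the file verbatim:

```go
package mapper

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// isRetryableError reports whether a patch failure is transient and worth
// retrying: write conflicts, timeouts, rate limiting, or an unavailable API.
func isRetryableError(err error) bool {
	return apierrors.IsConflict(err) ||
		apierrors.IsTimeout(err) ||
		apierrors.IsServerTimeout(err) ||
		apierrors.IsTooManyRequests(err) ||
		apierrors.IsServiceUnavailable(err)
}
```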
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
0c645fe to 473923e
Actionable comments posted: 9
🤖 Fix all issues with AI agents
In @metadata-collector/go.mod:
- Around line 47-48: The go.mod imports duplicate YAML module paths
(go.yaml.in/yaml/v3 v3.0.4 and gopkg.in/yaml.v3 v3.0.1); consolidate to a single
import path by keeping gopkg.in/yaml.v3 at the latest version (v3.0.4), remove
the go.yaml.in/yaml/v3 entry (and/or update the gopkg.in/yaml.v3 entry to
v3.0.4), then run go mod tidy to reconcile dependencies so only gopkg.in/yaml.v3
v3.0.4 remains.
In @metadata-collector/pkg/mapper/grpcclient_test.go:
- Line 179: The assertion in grpcclient_test.go uses reversed arguments; change
the assert.Equal call from assert.Equal(t, devicesPerPod, expectedDevicesPerPod)
to assert.Equal(t, expectedDevicesPerPod, devicesPerPod) so the testify
convention assert.Equal(t, expected, actual) is followed and failure messages
are correct.
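A contrived example of the expected-first convention the fix above refers to (illustrative names only):

```go
package mapper_test

import (
	"testing"

	"github.com/stretchr/testify/assert"
)

func TestDevicesPerPod_ArgumentOrder(t *testing.T) {
	expected := map[string]int{"pod-a": 2}
	actual := map[string]int{"pod-a": 2}
	// testify convention: expected first, actual second, so a mismatch
	// is reported as "expected: ..." / "actual: ..." the right way round.
	assert.Equal(t, expected, actual)
}
```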
In @metadata-collector/pkg/mapper/grpcclient.go:
- Around line 37-39: The exported interface KubeletGRPClient and its method
ListPodResources lack godoc; add a package-level comment line immediately above
the interface that begins with "KubeletGRPClient ..." describing its
responsibility (e.g., a gRPC client for querying kubelet pod device annotations)
and include a brief sentence for the ListPodResources method (or ensure the
interface comment documents the method's behavior), using standard Go doc style
that starts with the exact symbol names so tools like godoc and linters
recognize them.
- Around line 48-75: Add a Go doc comment for the exported constructor
NewKubeletGRPClient that briefly describes what the function does (creates and
returns a Kubelet gRPC client connected to the kubelet pod-resources Unix
socket), any important behavior (returns error if socket is missing or
connection fails, uses insecure credentials and a unix dialer), and its
parameters/return values (context.Context parameter and returns
KubeletGRPClient, error); place the comment immediately above the
NewKubeletGRPClient function declaration.
- Around line 153-168: The deduplication loop over devicesPerPod can panic when
a resource's devices slice is empty; update the loop in the devicesPerPod
iteration (the code touching deviceAnnotation.Devices, resourceName, and the
local devices slice) to skip processing when len(devices) == 0 before calling
sort.Strings and the in-place dedupe that uses devices[:j+1]. Ensure the guard
is placed right after obtaining devices :=
deviceAnnotation.Devices[resourceName] (or before sort.Strings) so empty slices
are left intact and no slicing beyond bounds occurs; leave behavior unchanged
for non-empty slices and still deduplicate in-place for those cases.
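A sketch of the guarded in-place dedupe that the last instruction describes; the variable names are taken from the instruction itself and may not match the file exactly:

```go
package mapper

import "sort"

// dedupeDevices sorts and de-duplicates each resource's device list in place,
// skipping empty slices so the devices[:j+1] reslice can never go out of bounds.
func dedupeDevices(devicesByResource map[string][]string) {
	for resourceName, devices := range devicesByResource {
		if len(devices) == 0 {
			continue // nothing to sort or dedupe; leave the slice intact
		}

		sort.Strings(devices)

		j := 0
		for i := 1; i < len(devices); i++ {
			if devices[i] != devices[j] {
				j++
				devices[j] = devices[i]
			}
		}

		devicesByResource[resourceName] = devices[:j+1]
	}
}
```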
In @metadata-collector/pkg/mapper/httpsclient.go:
- Around line 55-62: Update the TLS config on the transport to set a minimum TLS
version to prevent downgrades: in the TLSClientConfig used when creating
transport (the variable named transport and its TLSClientConfig field where
InsecureSkipVerify is set), add a MinVersion field (e.g., tls.VersionTLS12) so
the client requires at least TLS 1.2 while preserving InsecureSkipVerify for
localhost kubelet communication.
- Around line 39-41: Add a godoc comment for the exported interface
KubeletHTTPSClient and its ListPods method: write a brief sentence above the
KubeletHTTPSClient type that describes its purpose (e.g., client used to call
the kubelet over HTTPS to retrieve pod information) and include a short comment
for ListPods describing that it returns the list of corev1.Pod for the node or
an error. Ensure the comments use proper Go doc style (starting with
"KubeletHTTPSClient ..." and "ListPods ...") immediately above the interface and
method declaration.
- Around line 54-70: Add a Go doc comment immediately above the exported
constructor NewKubeletHTTPSClient that briefly states what the function does,
its input and returned values, and any notable behavior (e.g., it creates a
KubeletHTTPSClient using an http.Transport configured with InsecureSkipVerify,
timeouts, and initialized listPodsURI and bearerTokenPath). Ensure the comment
starts with the function name "NewKubeletHTTPSClient" and describes the context
parameter and the returned (KubeletHTTPSClient, error) so it satisfies
godoc/coding guidelines.
In @metadata-collector/pkg/mapper/mapper.go:
- Line 15: Add a package-level godoc comment at the top of the mapper package:
insert a concise comment immediately above "package mapper" that describes the
package's responsibility (e.g., mapping raw metadata to canonical/internal
representations, key types it handles, and intended consumers). Keep it short,
present-tense, and follow Go doc conventions so tools like godoc and linters
accept it.
🧹 Nitpick comments (7)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml (1)
15-20: LGTM with minor formatting suggestion. The ServiceAccount definition is correct. The static analysis hint about line 20 is a false positive; YAMLlint doesn't understand Helm templating syntax, and `{{- include ... | nindent 4 }}` is valid. Consider adding a trailing newline at end of file for POSIX compliance.
metadata-collector/pkg/mapper/mapper_test.go (2)
300-300: Remove debug print statement. `fmt.Println("error 1")` appears to be leftover debug output. Remove it to keep test output clean.
🧹 Proposed fix

```diff
 case 1:
-	fmt.Println("error 1")
 	return true, nil, errors.NewConflict(
```

342-342: Remove debug print statement. Same issue; `fmt.Println("error 1")` should be removed.
🧹 Proposed fix

```diff
 case 1:
-	fmt.Println("error 1")
 	return true, nil, errors.NewUnauthorized("unauthorized")
```

data-models/pkg/model/pod_device_annotation.go (1)
17-32: Add godoc comments for exported symbols. As per coding guidelines, exported constants, variables, and types should have documentation comments explaining their purpose.
📝 Proposed documentation

```diff
+// PodDeviceAnnotationName is the annotation key used to store GPU device allocations on pods.
 const (
 	PodDeviceAnnotationName = "dgxc.nvidia.com/devices"
 )

+// EntityTypeToResourceNames maps entity types to their corresponding Kubernetes resource names.
 var (
 	EntityTypeToResourceNames = map[string][]string{
 		"GPU_UUID": {
 			"nvidia.com/gpu",
 			"nvidia.com/pgpu",
 		},
 	}
 )

+// DeviceAnnotation represents the GPU device allocation annotation value.
+// Devices maps resource names (e.g., "nvidia.com/gpu") to lists of GPU UUIDs.
 type DeviceAnnotation struct {
 	Devices map[string][]string `json:"devices"`
 }
```

metadata-collector/main.go (1)
73-95: Consider running initial update immediately. The current implementation waits for the first ticker tick (30 seconds) before performing the first annotation update. Consider triggering an immediate update on startup.
🔧 Proposed change for immediate first update

```diff
 func runMapper(ctx context.Context) error {
 	podDeviceMapper, err := mapper.NewPodDeviceMapper(ctx)
 	if err != nil {
 		return fmt.Errorf("could not create mapper: %w", err)
 	}

 	ticker := time.NewTicker(defaultPodDeviceMonitorPeriod)
 	defer ticker.Stop()

+	// Perform initial update immediately
+	if numUpdates, err := podDeviceMapper.UpdatePodDevicesAnnotations(); err != nil {
+		slog.Error("Initial pod device annotation update failed", "error", err)
+	} else {
+		slog.Info("Device mapper pod annotation updates", "podCount", numUpdates)
+	}
+
 	for {
 		select {
 		case <-ctx.Done():
 			return nil
 		case <-ticker.C:
```
184-191: Consider handling NotFound errors to continue processing remaining pods. If a pod is deleted between `ListPods()` and `Patch()`, the patch will fail with a NotFound error. Since `isRetryableError` doesn't handle NotFound, this causes an early return, skipping remaining pods. While the next reconciliation iteration will retry, gracefully skipping deleted pods would improve resilience.
♻️ Proposed fix

```diff
 if len(patchBytes) > 0 {
 	err = mapper.patchPodWithRetry(pod.GetName(), pod.GetNamespace(), patchType, patchBytes)
 	if err != nil {
+		if errors.IsNotFound(err) {
+			slog.Debug("Pod no longer exists, skipping annotation update", "podKey", podKey)
+			continue
+		}
 		return numPodDevicesAnnotationsModified, fmt.Errorf("error patching device annotation: %w", err)
 	}

 	numPodDevicesAnnotationsModified++
 }
```

Note: The wrapped error from `patchPodWithRetry` uses `%w`, so `errors.IsNotFound()` will correctly unwrap and detect the underlying error.
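A small illustration of why the `%w` wrapping matters here; the `NewNotFound` call and pod name are placeholders for the example:

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	notFound := apierrors.NewNotFound(schema.GroupResource{Resource: "pods"}, "demo-pod")
	wrapped := fmt.Errorf("failed to patch pod %s: %w", "demo-pod", notFound)

	// IsNotFound walks the %w chain via errors.As, so the check still
	// succeeds on the wrapped error returned by patchPodWithRetry.
	fmt.Println(apierrors.IsNotFound(wrapped)) // true
}
```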
197-214: Include namespace in log and error messages for easier debugging. Adding the namespace helps identify the exact pod when troubleshooting issues across multiple namespaces.
♻️ Proposed improvement

```diff
 func (mapper *podDeviceMapper) patchPodWithRetry(podName string, namespace string, patchType types.PatchType, patch []byte) error {
 	return retry.OnError(retry.DefaultRetry, isRetryableError, func() error {
 		_, err := mapper.kubernetesClient.CoreV1().Pods(namespace).Patch(mapper.ctx, podName, patchType, patch, metav1.PatchOptions{})
 		if err != nil && isRetryableError(err) {
 			slog.Warn("Retryable error patching pod annotation. Retrying...",
 				"pod", podName,
+				"namespace", namespace,
 				"error", err)
 		}
 		if err != nil {
-			return fmt.Errorf("failed to patch pod %s: %w", podName, err)
+			return fmt.Errorf("failed to patch pod %s/%s: %w", namespace, podName, err)
 		}
 		return nil
 	})
 }
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
metadata-collector/go.sum is excluded by !**/*.sum
📒 Files selected for processing (16)
data-models/pkg/model/pod_device_annotation.go
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
docs/configuration/metadata-collector.md
docs/metadata-collector.md
metadata-collector/go.mod
metadata-collector/main.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/httpsclient_test.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/pkg/mapper/mapper_test.go
💤 Files with no reviewable changes (1)
- distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- docs/metadata-collector.md
- docs/configuration/metadata-collector.md
🧰 Additional context used
📓 Path-based instructions (5)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
data-models/pkg/model/pod_device_annotation.go, metadata-collector/pkg/mapper/httpsclient.go, metadata-collector/pkg/mapper/grpcclient_test.go, metadata-collector/pkg/mapper/mapper.go, metadata-collector/pkg/mapper/grpcclient.go, metadata-collector/main.go, metadata-collector/pkg/mapper/httpsclient_test.go, metadata-collector/pkg/mapper/mapper_test.go
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using the format `TestFunctionName_Scenario_ExpectedBehavior`
Files:
metadata-collector/pkg/mapper/grpcclient_test.go, metadata-collector/pkg/mapper/httpsclient_test.go, metadata-collector/pkg/mapper/mapper_test.go
**/go.mod
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use `go.mod` for each service as a separate Go module with semantic import versioning
Files:
metadata-collector/go.mod
**/daemonset*.yaml
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Explain DaemonSet variant selection logic in Helm chart documentation
Files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
distros/kubernetes/**/*daemonset*.yaml
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on the kata.enabled label
Regular node DaemonSets should use a `/var/log` volume mount for file-based logs
Kata node DaemonSets should use `/run/log/journal` and `/var/log/journal` volume mounts for the systemd journal
Files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
🧠 Learnings (20)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages
Applied to files:
data-models/pkg/model/pod_device_annotation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to data-models/protobufs/**/*.proto : Include comprehensive comments for all fields in Protocol Buffer messages
Applied to files:
data-models/pkg/model/pod_device_annotation.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
data-models/pkg/model/pod_device_annotation.go, metadata-collector/pkg/mapper/httpsclient.go, metadata-collector/pkg/mapper/grpcclient.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
metadata-collector/pkg/mapper/httpsclient.go, metadata-collector/pkg/mapper/grpcclient.go, metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
metadata-collector/pkg/mapper/httpsclient.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
metadata-collector/pkg/mapper/grpcclient_test.go, metadata-collector/pkg/mapper/httpsclient_test.go, metadata-collector/pkg/mapper/mapper_test.go
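For a straightforward comparison of the kind this learning covers, the standard-library form looks like this (illustrative names only, not code from the PR):

```go
package mapper_test

import "testing"

func TestAnnotationName_Simple(t *testing.T) {
	got := "dgxc.nvidia.com/devices"
	want := "dgxc.nvidia.com/devices"
	// Plain testing assertions are enough for simple equality checks;
	// no third-party matcher library is needed.
	if got != want {
		t.Errorf("annotation name = %q, want %q", got, want)
	}
}
```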
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
metadata-collector/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
metadata-collector/pkg/mapper/httpsclient_test.go, metadata-collector/go.mod
📚 Learning: 2026-01-06T21:31:36.113Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:36.113Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Kata node DaemonSets should use `/run/log/journal` and `/var/log/journal` volume mounts for systemd journal
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Regular node DaemonSets should use `/var/log` volume mount for file-based logs
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-06T10:59:26.687Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 281
File: distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml:20-0
Timestamp: 2025-11-06T10:59:26.687Z
Learning: The pause container image registry.k8s.io/pause:3.9+ requires eBPF and cgroup v2 support (kernel 4.15+, CONFIG_BPF, CONFIG_CGROUP_BPF, /sys/fs/bpf mounted). Older pause:3.0 images from gcr.io/google-containers/pause work on systems without full BPF/cgroup v2 support. When suggesting pause image updates, verify the target environment has necessary kernel features to avoid BPF_PROG_ATTACH errors.
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
🧬 Code graph analysis (6)
metadata-collector/pkg/mapper/grpcclient_test.go (1)
  data-models/pkg/model/pod_device_annotation.go (1)
    DeviceAnnotation (30-32)
metadata-collector/pkg/mapper/mapper.go (3)
  metadata-collector/pkg/mapper/httpsclient.go (2)
    KubeletHTTPSClient (39-41), NewKubeletHTTPSClient (54-70)
  metadata-collector/pkg/mapper/grpcclient.go (2)
    KubeletGRPClient (37-39), NewKubeletGRPClient (48-75)
  data-models/pkg/model/pod_device_annotation.go (1)
    PodDeviceAnnotationName (18-18)
metadata-collector/pkg/mapper/grpcclient.go (1)
  data-models/pkg/model/pod_device_annotation.go (2)
    DeviceAnnotation (30-32), EntityTypeToResourceNames (22-27)
metadata-collector/main.go (1)
  metadata-collector/pkg/mapper/mapper.go (1)
    NewPodDeviceMapper (50-77)
metadata-collector/pkg/mapper/httpsclient_test.go (1)
  data-models/pkg/model/pod_device_annotation.go (2)
    DeviceAnnotation (30-32), PodDeviceAnnotationName (18-18)
metadata-collector/pkg/mapper/mapper_test.go (2)
  data-models/pkg/model/pod_device_annotation.go (2)
    DeviceAnnotation (30-32), PodDeviceAnnotationName (18-18)
  platform-connectors/pkg/pipeline/factory.go (1)
    Create (32-39)
🪛 ast-grep (0.40.4)
metadata-collector/pkg/mapper/httpsclient.go
[warning] 55-57: `MinVersion` is missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General-purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported and insecure browsers (such as Internet Explorer 10) may it be necessary to enable TLS 1.0. Add `MinVersion: tls.VersionTLS13` to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{
InsecureSkipVerify: true, //nolint:gosec
}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures
(missing-ssl-minversion-go)
🪛 YAMLlint (1.37.1)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
🔇 Additional comments (25)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (2)
38-38: LGTM: ServiceAccount binding. Correctly references the ServiceAccount created by the new serviceaccount.yaml template using the same fullname helper.
45-71: Changes from initContainers to containers with privileged access. The container now runs as a long-lived process (matching the new two-phase lifecycle in main.go) with:
- runAsGroup: 0 and privileged: true, required for accessing the kubelet pod-resources socket
- a new pod-gpu-resources volume mount for the PodResourcesLister gRPC endpoint
The privileged requirement is justified given the need to access /var/lib/kubelet/pod-resources.
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml (1)
28-33: LGTM: nodes/proxy get permission. The nodes/proxy GET permission is required to access the Kubelet's HTTPS /pods endpoint for listing node-local pods without control-plane API server load. This aligns with the PR objective of using local Kubelet endpoints.
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml (1)
15-28: LGTM: ClusterRoleBinding correctly wires RBAC. The binding correctly references the ClusterRole and ServiceAccount using consistent fullname templating. The namespace is properly set to {{ .Release.Namespace }}. The static analysis hint is a false positive; YAMLlint doesn't understand Helm templating. Consider adding a trailing newline for POSIX compliance.
metadata-collector/pkg/mapper/mapper_test.go (3)
17-33: Testify usage note. The tests use testify/assert, which is appropriate here given the moderate complexity and number of assertions. Based on learnings, avoid testify only for simple equality checks where the standard testing package suffices; these tests have enough assertions to benefit from the cleaner syntax.
89-112: LGTM: Test helper setup. The newTestDeviceMapper helper correctly assembles mocks and creates test pods in the fake clientset. Error handling is properly implemented with context-wrapped errors.
355-409: LGTM: Multi-pod test coverage. Good coverage for multiple pods with different device allocations. The test validates that each pod receives the correct annotation independently.
metadata-collector/main.go (2)
79-80: LGTM: Ticker cleanup. Good: defer ticker.Stop() properly prevents the resource leak that was flagged in a previous review.
97-140: LGTM: Collector phase. The runCollector function properly encapsulates the NVML initialization, metadata collection, and file writing with appropriate error handling and logging. The deferred NVML shutdown ensures cleanup even on errors.
metadata-collector/pkg/mapper/httpsclient_test.go (7)
32-451: LGTM: Comprehensive test fixture. The podJson constant provides realistic Kubernetes pod data including annotations, which appropriately exercises the ListPods parsing and annotation extraction logic.
453-466: LGTM: Mock implementation is appropriate. The MockHTTPRoundTripper correctly implements http.RoundTripper for testing purposes using testify/mock.
468-480: LGTM: Test helper is well-designed. The newTestKubeletHTTPSClient helper cleanly sets up the client with mocked HTTP transport for unit testing.
482-502: LGTM: Test validates core functionality. TestListPods correctly verifies pod list parsing and device annotation extraction with proper assertion argument order.
504-532: LGTM: Error handling tests are thorough. The error path tests appropriately cover token loading, request creation, network failures, and invalid response codes.
534-553: LGTM: Retry logic is properly tested. TestListPodsWithRetry correctly validates that transient failures are retried and eventual success is achieved.
555-559: LGTM: JSON unmarshal error is tested. TestListPodsWithUnmarshalError appropriately verifies error handling for malformed JSON responses.
metadata-collector/pkg/mapper/grpcclient_test.go (4)
29-96: LGTM: Test fixtures cover complex scenarios. The podDevices fixture appropriately tests edge cases including duplicate allocations, cross-container sharing, and unsupported device filtering.
98-149: LGTM: Mock server implementation is well-structured. The MockPodResourcesServer cleanly simulates gRPC responses with configurable error injection for testing retry logic.
151-157: LGTM: Test helper correctly wires mock. The newTestKubeletGRPCClient helper appropriately constructs the client with the mocked gRPC server.
182-193: LGTM: Error and retry tests validate behavior. Both error tests appropriately verify error propagation and retry logic with configurable failure counts.
metadata-collector/pkg/mapper/httpsclient.go (1)
72-151: LGTM: Excellent documentation and implementation. The comprehensive comment explaining TLS and authentication considerations is exemplary. The ListPods implementation correctly handles token rotation, request construction, and response processing.
metadata-collector/pkg/mapper/grpcclient.go (1)
173-187: LGTM: Helper functions are correct. The isSupportedResourceName function correctly filters resources against the entity-resource mappings, and the Close method properly cleans up the gRPC connection.
metadata-collector/pkg/mapper/mapper.go (3)
33-36: LGTM! The JSON Patch path correctly escapes the `/` as `~1` per RFC 6901, and the merge patch template properly uses `%q` to produce valid JSON with escaped string values.
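A compact illustration of both points, using the annotation key from this PR; the helper name is made up for the example:

```go
package main

import "fmt"

// escapeJSONPointer applies RFC 6901 escaping: "~" becomes "~0" and "/"
// becomes "~1", so a key like "dgxc.nvidia.com/devices" can be used as a
// single JSON Patch path segment.
func escapeJSONPointer(segment string) string {
	out := ""
	for _, r := range segment {
		switch r {
		case '~':
			out += "~0"
		case '/':
			out += "~1"
		default:
			out += string(r)
		}
	}
	return out
}

func main() {
	key := "dgxc.nvidia.com/devices"
	fmt.Println("/metadata/annotations/" + escapeJSONPointer(key))
	// %q quotes and escapes the value so the merge patch stays valid JSON.
	fmt.Printf(`{"metadata":{"annotations":{%q:%q}}}`+"\n", key, `{"devices":{}}`)
}
```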
42-48: LGTM! The struct properly encapsulates dependencies as interfaces, enabling testability and following dependency injection patterns.
216-233: LGTM! The retryable error detection covers the appropriate transient failures: conflicts, timeouts, rate limiting, and service unavailability. The inline comments clarify the intent for each category.
Merging this branch will increase overall coverage
Coverage by file
Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Changed unit test files
🛡️ CodeQL Analysis: 🚨 Found 1 security alert(s)
…h pod Signed-off-by: Nathan Herz <[email protected]>
473923e to b73d463
Actionable comments posted: 12
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
🤖 Fix all issues with AI agents
In @data-models/pkg/model/pod_device_annotation.go:
- Line 15: Add a package-level godoc comment describing the responsibility and
purpose of the package, placed immediately above the "package model" declaration
in pod_device_annotation.go; the comment should briefly explain what the model
package contains (e.g., data model types and annotations used across the
project, their intended usage and any important invariants) so that go doc tools
and readers understand the package at a glance.
- Around line 30-32: Add a clear Go doc comment for the exported struct
DeviceAnnotation explaining its purpose and the shape of its Devices field;
describe that DeviceAnnotation represents pod device annotations and that
Devices is a map from device plugin names (string) to a list of device IDs
(slice of strings), so tools and godoc users understand its semantics and
expected JSON encoding.
- Around line 21-28: Add a Go doc comment for the exported variable
EntityTypeToResourceNames that succinctly describes its purpose, how keys and
values are used (e.g., mapping entity types like "GPU_UUID" to Kubernetes
resource names such as "nvidia.com/gpu" and "nvidia.com/pgpu"), and any
important usage details or invariants (e.g., consumers should treat it as
read-only). Place the comment immediately above the variable declaration so it
is included in generated documentation and linters do not flag the missing
comment.
- Around line 17-19: Add a doc comment above the exported constant
PodDeviceAnnotationName describing that it is the Kubernetes annotation key used
to record/list GPU/device assignments for a Pod (value:
"dgxc.nvidia.com/devices"), when and by which components it is set/consumed, and
the expected format of the annotation value; place this comment immediately
above the PodDeviceAnnotationName declaration to satisfy exported-symbol
documentation guidelines.
In @metadata-collector/pkg/mapper/grpcclient_test.go:
- Around line 111-117: Add a Go doc comment for the exported constructor
NewMockPodResourcesServer that briefly describes what the function creates,
documents each parameter (err, errReturnCount, podDevices) and the returned
*MockPodResourcesServer value, and follows Go doc comment conventions (starts
with the function name and is complete sentence). Ensure the comment explains
the meaning of err and errReturnCount and the structure/expected contents of
podDevices so callers understand usage and semantics.
- Around line 98-105: Add a doc comment above the exported
MockPodResourcesServer type describing its purpose as a test/mock implementation
of the PodResourcesListerClient, and briefly document its fields: err (simulated
error to return), errReturnCount (how many times to return the error), and
podDevices (mapping pod -> container -> slice of deviceInfo). Mention that it
embeds v1.PodResourcesListerClient to avoid implementing all interface methods
and that the type is intended for unit tests.
In @metadata-collector/pkg/mapper/grpcclient.go:
- Around line 37-39: The interface KubeletGRPClient must declare a Close() error
method so callers can release the gRPC connection; add Close() to the
KubeletGRPClient interface and implement it on kubeletGRPClient to call the
existing connection cleanup (e.g., invoke the struct’s closeConn() or
c.conn.Close() and return any error). Then update any callers that obtain a
KubeletGRPClient to call or defer client.Close() to avoid leaking the gRPC
connection.
- Line 54: The code calls resolver.SetDefaultScheme("passthrough"), which
mutates global gRPC resolver state; remove that global call
(resolver.SetDefaultScheme) and instead specify the passthrough scheme directly
on the dial target used by your gRPC client (e.g., prefix the target passed to
grpc.Dial/grpc.DialContext with the explicit "passthrough:///" scheme or
otherwise construct the target string with the passthrough scheme in the
function that creates the client connection). Ensure no other code relies on
resolver.SetDefaultScheme so the change confines behavior to this client only.
- Around line 119-126: The build fails because retryAllErrors is undefined;
define a predicate function named retryAllErrors(error) bool and use it in the
retry.OnError call (or replace the reference with an existing predicate) so the
code compiles; specifically, add a function accessible in this package (e.g.,
retryAllErrors that returns true if you intend to retry every error, or
implement a more specific predicate that checks error types/messages) and keep
the retry.OnError block that calls client.podResourcesClient.List(client.ctx,
&v1.ListPodResourcesRequest{}) unchanged.
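On the resolver item above: a sketch of the connection-scoped alternative, assuming the standard kubelet pod-resources socket path and a recent grpc-go (with `grpc.NewClient`); this keeps the passthrough scheme in the target string instead of mutating the process-wide resolver default:

```go
package kubeletclient

import (
	"context"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

const socketPath = "/var/lib/kubelet/pod-resources/kubelet.sock"

// dialPodResources scopes the passthrough scheme to this one connection by
// embedding it in the target rather than calling resolver.SetDefaultScheme.
func dialPodResources() (*grpc.ClientConn, error) {
	return grpc.NewClient(
		"passthrough:///"+socketPath,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, "unix", addr)
		}),
	)
}
```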
In @metadata-collector/pkg/mapper/httpsclient_test.go:
- Around line 453-456: Add a doc comment for the exported mock type
MockHTTPRoundTripper describing that it is a test-only mock implementation of
http.RoundTripper used to stub HTTP requests in unit tests; place the comment
immediately above the type declaration for MockHTTPRoundTripper so it satisfies
exported-type documentation guidelines and briefly mention that it embeds
http.RoundTripper and mock.Mock to allow custom response behavior in tests.
In @metadata-collector/pkg/mapper/httpsclient.go:
- Around line 39-41: Add a doc comment for the exported interface
KubeletHTTPSClient that succinctly explains its purpose and usage (e.g., that it
provides HTTPS access to the kubelet API to list pods and is used by the
metadata collector). Place the comment immediately above the type declaration
for KubeletHTTPSClient and keep it in sentence form starting with the interface
name to satisfy Go documentation conventions.
- Around line 54-70: The exported constructor NewKubeletHTTPSClient is missing a
GoDoc comment; add a descriptive comment immediately above the function that
explains what the constructor does, its parameter (ctx context.Context), the
return values (KubeletHTTPSClient, error), and key behavior (creates an HTTP
transport with TLS InsecureSkipVerify, timeout settings, and initializes
bearerTokenPath and listPodsURI). Keep the comment concise and follow GoDoc
style (complete sentence starting with NewKubeletHTTPSClient).
🟡 Minor comments (9)
docs/metadata-collector.md-104-105 (1)
104-105: Expand the Kubernetes API subsection with annotation details. The current documentation is too brief. Add the annotation key, JSON structure, and explain how external components consume it. The annotation `dgxc.nvidia.com/devices` exposes the pod-to-GPU mapping as a JSON object with device resource names (e.g., `nvidia.com/gpu`, `nvidia.com/pgpu`) mapped to lists of GPU UUIDs. External components like the node-drainer module query the Kubernetes API to retrieve this annotation and determine which pods use specific GPUs.
📝 Proposed expansion for Kubernetes API subsection:

### Kubernetes API

The pod-to-GPU mapping is exposed as the `dgxc.nvidia.com/devices` annotation on pod objects. External components (e.g., the node-drainer module) can query the Kubernetes API to retrieve this annotation and determine which pods are using specific GPUs. The annotation value is a JSON object mapping device resource names to UUID lists:

```json
{"devices": {"nvidia.com/gpu": ["GPU-455d8f70-2051-db6c-0430-ffc457bff834"]}}
```

docs/metadata-collector.md-71-72 (1)
71-72: Add annotation key and JSON format example to Pod Information subsection. The "Pod Information" subsection lacks crucial details for developers. Include the annotation key and JSON structure so users understand how to consume the pod-to-GPU mapping data.
📝 Proposed enhancement for Pod Information subsection:

### Pod Information

- GPU UUIDs allocated to each pod
- Exposed as annotation `dgxc.nvidia.com/devices` on pod objects in JSON format:

```json
{"devices": {"nvidia.com/gpu": ["GPU-<UUID>", ...]}}
```

docs/metadata-collector.md-36-41 (1)
36-41: Document the annotation key, format, and reconciliation interval. The workflow steps are clear, but the section is missing important implementation details that users and operators need:
1. Annotation key: `dgxc.nvidia.com/devices`
2. Annotation format: `{"devices": {"nvidia.com/gpu": ["GPU-<UUID>", "GPU-<UUID>"], "nvidia.com/pgpu": [...]}}` (supports both GPU and pGPU resource names)
3. Reconciliation interval: the metadata collector runs on a fixed 30-second reconciliation interval to keep pod-to-GPU mappings current
4. Kubelet endpoints: these are local kubelet endpoints (a Unix socket for the PodResourcesLister gRPC service and HTTPS for the /pods endpoint) rather than control-plane API calls, which reduces control-plane load
📝 Documentation enhancement example. Consider expanding the workflow section with:

### GPU-to-pod mapping annotation

1. To discover all pods running on the given node, this component calls the local Kubelet HTTPS `/pods` endpoint.
2. To discover the GPU devices allocated to each pod, this component leverages the local Kubelet PodResourcesLister gRPC service (via Unix socket).
3. For each pod with a change in GPU device allocation, the component updates the `dgxc.nvidia.com/devices` annotation with a JSON mapping:

```json
{"devices": {"nvidia.com/gpu": ["GPU-<UUID>", "GPU-<UUID>"]}}
```

where the resource name can be `nvidia.com/gpu` or `nvidia.com/pgpu`.
4. The metadata collector runs this reconciliation in a loop at a fixed 30-second interval to keep mappings current.

Note: Using local Kubelet endpoints reduces control-plane load compared to querying the Kubernetes API for every pod.
metadata-collector/pkg/mapper/mapper_test.go-300-300 (1)
300-300: Remove debugging print statements. Lines 300 and 342 contain `fmt.Println` calls that should be removed from production test code.
🧹 Proposed fix

```diff
 case 1:
-	fmt.Println("error 1")
 	return true, nil, errors.NewConflict(
```

And:

```diff
 case 1:
-	fmt.Println("error 1")
 	return true, nil, errors.NewUnauthorized("unauthorized")
```

Also applies to: 342-342
metadata-collector/pkg/mapper/grpcclient.go-15-15 (1)
15-15: Add package-level documentation. The package is missing package-level godoc. Per coding guidelines, all Go packages require package-level documentation.
📝 Suggested package documentation
+// Package mapper provides components for discovering and annotating pods with their allocated GPU devices. +// It includes clients for the Kubelet HTTPS and gRPC endpoints to gather pod and device allocation data. package mapperBased on coding guidelines requiring package-level godoc for all Go packages.
metadata-collector/pkg/mapper/grpcclient.go-48-75 (1)
48-75: Add godoc for exported constructor.The exported function
NewKubeletGRPClientis missing documentation. Per coding guidelines, all exported Go functions require godoc comments.📝 Suggested documentation
+// NewKubeletGRPClient creates a new gRPC client for the Kubelet pod-resources endpoint. +// It connects to the local Unix socket at /var/lib/kubelet/pod-resources/kubelet.sock. +// Returns an error if the socket doesn't exist or the connection fails. func NewKubeletGRPClient(ctx context.Context) (KubeletGRPClient, error) {Based on coding guidelines requiring function comments for all exported Go functions.
metadata-collector/pkg/mapper/mapper.go-39-41 (1)
39-41: Add godoc for exported interface.The exported interface
PodDeviceMapperis missing documentation. Per coding guidelines, all exported entities require godoc comments.📝 Suggested documentation
+// PodDeviceMapper manages the synchronization of GPU device allocations to pod annotations. +// It discovers device assignments via Kubelet endpoints and updates pod annotations accordingly. type PodDeviceMapper interface { UpdatePodDevicesAnnotations() (int, error) }Based on coding guidelines requiring function comments for all exported Go functions.
metadata-collector/pkg/mapper/grpcclient.go-37-39 (1)
37-39: Add godoc for exported interface.The exported interface
KubeletGRPClientis missing documentation. Per coding guidelines, all exported entities require godoc comments.📝 Suggested documentation
+// KubeletGRPClient provides access to the Kubelet's pod resources via gRPC. +// It connects to the local Unix socket to query device allocations for pods. type KubeletGRPClient interface { ListPodResources() (map[string]*model.DeviceAnnotation, error) }Based on coding guidelines requiring function comments for all exported Go functions.
metadata-collector/pkg/mapper/mapper.go-51-78 (1)
51-78: Add godoc for exported constructor.The exported function
NewPodDeviceMapperis missing documentation. Per coding guidelines, all exported Go functions require godoc comments.📝 Suggested documentation
+// NewPodDeviceMapper creates a new PodDeviceMapper that synchronizes GPU device allocations +// to pod annotations. It initializes clients for Kubelet HTTPS, Kubelet gRPC, and the +// Kubernetes API server. Returns an error if any client initialization fails. func NewPodDeviceMapper(ctx context.Context) (PodDeviceMapper, error) {Based on coding guidelines requiring function comments for all exported Go functions.
🧹 Nitpick comments (6)
metadata-collector/pkg/mapper/mapper_test.go (1)
114-409: Consider consolidating into table-driven tests. The current test functions each test a single scenario. Per coding guidelines, table-driven tests are preferred when testing multiple scenarios. Consider refactoring the UpdatePodDevicesAnnotations tests into a table-driven format to reduce duplication and improve maintainability.
For example, tests like `TestUpdatePodDevicesAnnotationsWithAnnotationAdded`, `WithAnnotationUpdated`, `WithAnnotationNotChanged`, and `WithAnnotationRemoved` could be consolidated into a single table-driven test with cases for each scenario. Based on coding guidelines, table-driven tests should be used when testing multiple scenarios in Go.
metadata-collector/pkg/mapper/httpsclient.go (2)
120-131: Consider more selective retry logic. The `retryAllErrors` function retries on all errors unconditionally. Some errors like authentication failures (401/403) or malformed requests (400) are non-transient and should not be retried.
♻️ Proposed selective retry logic
- err = retry.OnError(retry.DefaultRetry, retryAllErrors, func() error { + err = retry.OnError(retry.DefaultRetry, isRetryableError, func() error { resp, err = client.httpRoundTripper.RoundTrip(req) //nolint:bodyclose // response body is closed outside OnError if err != nil { return fmt.Errorf("got an error making HTTP request to /pods endpoint: %w", err) } if resp.StatusCode != http.StatusOK { return fmt.Errorf("got a non-200 response code from /pods endpoint: %d", resp.StatusCode) } return nil })And replace
retryAllErrorswith:-func retryAllErrors(_ error) bool { - return true +func isRetryableError(err error) bool { + // Retry on network errors or server errors (5xx) + // Don't retry on client errors (4xx) like auth failures + return true // For now, keep current behavior but add TODO + // TODO: Implement selective retry for transient errors only }
55-62: Consider setting TLS MinVersion for security hardening. The TLS configuration lacks a MinVersion setting. While the connection is to localhost and the security model is well-justified in the comments, explicitly setting `MinVersion: tls.VersionTLS12` would follow security best practices and prevent any issues with TLS version negotiation.
🔒 Proposed TLS configuration enhancement
transport := &http.Transport{ TLSClientConfig: &tls.Config{ InsecureSkipVerify: true, //nolint:gosec + MinVersion: tls.VersionTLS12, }, TLSHandshakeTimeout: 30 * time.Second,metadata-collector/pkg/mapper/grpcclient.go (1)
177-187: Consider optimizing resource name lookup. The nested loop approach works but could be more efficient. With the current small set of resource names (2), the performance impact is negligible, but this could be optimized to O(1) lookup.
♻️ Optional optimization using a flat set
Initialize a package-level set once:
+var supportedResourceNames map[string]bool + +func init() { + supportedResourceNames = make(map[string]bool) + for _, resourceNames := range model.EntityTypeToResourceNames { + for _, resourceName := range resourceNames { + supportedResourceNames[resourceName] = true + } + } +} + func isSupportedResourceName(resourceName string) bool { - for _, resourceNamesForEntityType := range model.EntityTypeToResourceNames { - for _, currentResourceName := range resourceNamesForEntityType { - if resourceName == currentResourceName { - return true - } - } - } - - return false + return supportedResourceNames[resourceName] }metadata-collector/pkg/mapper/mapper.go (2)
167-189: Remove redundant variable declaration. Line 168 declares `deviceAnnotationJSON` but it's immediately reassigned on line 170 using `:=`. The declaration on line 168 is unnecessary.
♻️ Remove redundant declaration
if podHasGPUDevices { - var deviceAnnotationJSON []byte - deviceAnnotationJSON, err := json.Marshal(deviceAnnotation) if err != nil { return "", nil, fmt.Errorf("error marshalling device annotation: %w", err) }
215-219: Redundant error check.
Line 215 checks `isRetryableError(err)` again, but `retry.OnError` already uses `isRetryableError` as its predicate. This check is redundant, and the warning will only be logged for errors that will be retried anyway.
♻️ Simplify error handling
```diff
 return retry.OnError(retry.DefaultRetry, isRetryableError, func() error {
 	_, err := mapper.kubernetesClient.CoreV1().Pods(namespace).Patch(mapper.ctx, podName, patchType, patch, metav1.PatchOptions{})
-	if err != nil && isRetryableError(err) {
-		slog.Warn("Retryable error patching pod annotation. Retrying...",
-			"pod", podName,
-			"error", err)
-	}
 	if err != nil {
 		return fmt.Errorf("failed to patch pod %s: %w", podName, err)
 	}
```

If you want to keep the logging, you can add it after the retry loop completes or rely on structured logging elsewhere.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`metadata-collector/go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (16)
- data-models/pkg/model/pod_device_annotation.go
- distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
- distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
- distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
- distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
- distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
- docs/configuration/metadata-collector.md
- docs/metadata-collector.md
- metadata-collector/go.mod
- metadata-collector/main.go
- metadata-collector/pkg/mapper/grpcclient.go
- metadata-collector/pkg/mapper/grpcclient_test.go
- metadata-collector/pkg/mapper/httpsclient.go
- metadata-collector/pkg/mapper/httpsclient_test.go
- metadata-collector/pkg/mapper/mapper.go
- metadata-collector/pkg/mapper/mapper_test.go
💤 Files with no reviewable changes (1)
- distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/configuration/metadata-collector.md
🧰 Additional context used
📓 Path-based instructions (5)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via `log/slog` in Go code
Wrap errors with context using `fmt.Errorf("context: %w", err)` in Go code
Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as `synced` over `ok` for cache sync checks
Use `client-go` for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
data-models/pkg/model/pod_device_annotation.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/mapper.go
metadata-collector/main.go
metadata-collector/pkg/mapper/httpsclient_test.go
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Use `envtest` for testing Kubernetes controllers instead of fake clients
Use `testify/assert` and `testify/require` for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using the format `TestFunctionName_Scenario_ExpectedBehavior`
Files:
metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient_test.go
**/go.mod
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use `go.mod` for each service as a separate Go module with semantic import versioning
Files:
metadata-collector/go.mod
**/daemonset*.yaml
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Explain DaemonSet variant selection logic in Helm chart documentation
Files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
distros/kubernetes/**/*daemonset*.yaml
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
distros/kubernetes/**/*daemonset*.yaml: Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on the `kata.enabled` label
Regular node DaemonSets should use a `/var/log` volume mount for file-based logs
Kata node DaemonSets should use `/run/log/journal` and `/var/log/journal` volume mounts for systemd journal
Files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
🧠 Learnings (20)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages
Applied to files:
data-models/pkg/model/pod_device_annotation.go
metadata-collector/pkg/mapper/mapper.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
data-models/pkg/model/pod_device_annotation.go
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/pkg/mapper/mapper.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
metadata-collector/pkg/mapper/grpcclient.go
metadata-collector/pkg/mapper/httpsclient.go
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
metadata-collector/pkg/mapper/httpsclient.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
metadata-collector/pkg/mapper/mapper_test.go
📚 Learning: 2025-12-22T16:16:24.320Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:24.320Z
Learning: In Go tests across the repository, avoid introducing the testify dependency for simple equality/inequality checks. Use the standard testing package assertions (t.Error, t.Errorf, t.Fatal, etc.) for straightforward checks. Reserve third-party assertion libraries for complex scenarios that require richer diagnostics or expressive matchers.
Applied to files:
metadata-collector/pkg/mapper/mapper_test.go
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/pkg/mapper/httpsclient_test.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
metadata-collector/pkg/mapper/grpcclient_test.go
metadata-collector/go.mod
metadata-collector/pkg/mapper/httpsclient_test.go
📚 Learning: 2026-01-06T21:31:36.113Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 490
File: janitor-provider/go.mod:70-70
Timestamp: 2026-01-06T21:31:36.113Z
Learning: In janitor-provider/go.mod, the dependency github.com/golang-jwt/jwt/v4 v4.5.1 is a transitive dependency from github.com/nebius/gosdk and cannot be directly upgraded without a replace directive or upstream fix in nebius/gosdk.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-12-22T16:48:13.460Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/hack/overlays/client/versioned/typed/device/v1alpha1/gpu.go:15-27
Timestamp: 2025-12-22T16:48:13.460Z
Learning: In the client-go module, files under `hack/overlays/` use the overlay pattern: they are copied into generated code directories and are not compiled standalone. These overlay files may reference types (e.g., `GPUExpansion`) that are generated by Kubernetes code-gen tools and only exist in the final destination. The build excludes overlays via grep patterns like `grep -vE '/hack/overlays/|/examples/'`. Do not flag missing type references in overlay files as compilation errors.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-12-22T16:16:31.660Z
Learnt from: pteranodan
Repo: NVIDIA/NVSentinel PR: 607
File: client-go/nvgrpc/config_test.go:61-80
Timestamp: 2025-12-22T16:16:31.660Z
Learning: In the NVIDIA/NVSentinel repository, prefer not to introduce a dependency on `stretchr/testify` for simple comparison assertions in Go tests. Use standard `testing` package assertions (t.Error, t.Errorf, etc.) for straightforward checks.
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
metadata-collector/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
metadata-collector/main.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Kata node DaemonSets should use `/run/log/journal` and `/var/log/journal` volume mounts for systemd journal
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Separate DaemonSets should be created for kata vs regular nodes using `nodeAffinity` based on kata.enabled label
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/daemonset*.yaml : Explain DaemonSet variant selection logic in Helm chart documentation
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to distros/kubernetes/**/*daemonset*.yaml : Regular node DaemonSets should use `/var/log` volume mount for file-based logs
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
📚 Learning: 2025-11-06T10:59:26.687Z
Learnt from: XRFXLP
Repo: NVIDIA/NVSentinel PR: 281
File: distros/kubernetes/nvsentinel/charts/metadata-collector/values.yaml:20-0
Timestamp: 2025-11-06T10:59:26.687Z
Learning: The pause container image registry.k8s.io/pause:3.9+ requires eBPF and cgroup v2 support (kernel 4.15+, CONFIG_BPF, CONFIG_CGROUP_BPF, /sys/fs/bpf mounted). Older pause:3.0 images from gcr.io/google-containers/pause work on systems without full BPF/cgroup v2 support. When suggesting pause image updates, verify the target environment has necessary kernel features to avoid BPF_PROG_ATTACH errors.
Applied to files:
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml
docs/metadata-collector.md
🧬 Code graph analysis (6)
metadata-collector/pkg/mapper/grpcclient.go (1)
data-models/pkg/model/pod_device_annotation.go (2)
`DeviceAnnotation` (30-32), `EntityTypeToResourceNames` (22-27)
metadata-collector/pkg/mapper/mapper_test.go (1)
data-models/pkg/model/pod_device_annotation.go (2)
`DeviceAnnotation` (30-32), `PodDeviceAnnotationName` (18-18)
metadata-collector/pkg/mapper/grpcclient_test.go (1)
data-models/pkg/model/pod_device_annotation.go (1)
`DeviceAnnotation` (30-32)
metadata-collector/pkg/mapper/mapper.go (3)
metadata-collector/pkg/mapper/httpsclient.go (2)
`KubeletHTTPSClient` (39-41), `NewKubeletHTTPSClient` (54-70)
metadata-collector/pkg/mapper/grpcclient.go (2)
`KubeletGRPClient` (37-39), `NewKubeletGRPClient` (48-75)
data-models/pkg/model/pod_device_annotation.go (2)
`DeviceAnnotation` (30-32), `PodDeviceAnnotationName` (18-18)
metadata-collector/main.go (1)
metadata-collector/pkg/mapper/mapper.go (1)
`NewPodDeviceMapper` (51-78)
metadata-collector/pkg/mapper/httpsclient_test.go (1)
data-models/pkg/model/pod_device_annotation.go (2)
`DeviceAnnotation` (30-32), `PodDeviceAnnotationName` (18-18)
🪛 ast-grep (0.40.4)
metadata-collector/pkg/mapper/httpsclient.go
[warning] 55-57: `MinVersion` is missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General-purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported and insecure browsers (such as Internet Explorer 10) may it be necessary to enable TLS 1.0 to provide support. Add `MinVersion: tls.VersionTLS13` to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{
InsecureSkipVerify: true, //nolint:gosec
}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures
(missing-ssl-minversion-go)
🪛 YAMLlint (1.37.1)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/serviceaccount.yaml
[error] 20-20: syntax error: expected the node content, but found '-'
(syntax)
🔇 Additional comments (25)
metadata-collector/go.mod (4)
12-16: Well-aligned Kubernetes and gRPC dependencies for the PodResourcesLister feature.
The addition of direct dependencies on k8s.io/api, k8s.io/apimachinery, k8s.io/client-go, and k8s.io/kubelet (all v0.35.0) correctly supports the PR objective of querying Kubelet endpoints for pod-to-device mappings via gRPC. Versions are consistent, recent, and appropriate for semantic versioning. This aligns with repository learnings on using client-go for Kubernetes interactions.
47-48: Verify non-standard YAML import paths.
Lines 47–48 introduce `go.yaml.in/yaml/v2` and `go.yaml.in/yaml/v3` as indirect dependencies. This is an unusual import path; standard Kubernetes/Go projects typically use `gopkg.in/yaml.v2` and `gopkg.in/yaml.v3`. Verify whether these are intentional transitive pulls from k8s.io modules or reflect a dependency resolution artifact that could be simplified.
27-37: Granular go-openapi/swag submodule breakdown warrants verification.
Lines 27–37 list fine-grained go-openapi/swag submodules (cmdutils, conv, fileutils, jsonname, jsonutils, loading, mangling, netutils, stringutils, typeutils, yamlutils) rather than a single `github.com/go-openapi/swag` entry. This granular structure suggests either a recent upstream change or a side effect of `go mod tidy`. Verify these submodule dependencies are intentional and necessary; if not, consider a single parent module import to simplify maintenance.
1-67: go.mod and go.sum are consistent; no dependency conflicts or issues detected.
The verification confirms that `go mod verify` passes successfully and there are no duplicate or conflicting dependency versions. The Kubernetes module dependencies and their transitive indirect dependencies (including YAML packages and granular go-openapi/swag imports) are properly tracked in go.sum. Semantic versioning is correctly applied, with local module dependencies properly managed via the `replace` directive.
docs/metadata-collector.md (2)
9-9: Clear and well-integrated overview of the new feature.
The addition nicely contextualizes the pod-to-GPU mapping annotation exposure as a complement to local topology persistence.
20-22: Good addition of the node-drainer use case context.
This clarifies the motivation for the pod-to-GPU mapping and aligns well with the PR objectives.
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/daemonset.yaml (3)
38-38: LGTM: ServiceAccount binding added.
The ServiceAccount reference enables the RBAC permissions defined in the ClusterRole.
48-50: Verify privileged access scope and security implications.
The container now runs with `privileged: true` and mounts `/var/lib/kubelet/pod-resources` to access the Kubelet PodResourcesLister gRPC endpoint. While necessary for discovering GPU device allocations, this grants broad host access.
Ensure privileged mode is required for your specific use case and that alternative approaches (e.g., using only specific capabilities) were considered. The privileged flag bypasses most container isolation mechanisms.
Also applies to: 69-71, 81-84
45-101: Verify kata node handling and log volume mount requirements.
Based on coding guidelines, DaemonSets in this path should include:
- Separate variants for kata vs regular nodes using `nodeAffinity` based on the `kata.enabled` label
- A `/var/log` volume mount for regular nodes (file-based logs)
- `/run/log/journal` and `/var/log/journal` volume mounts for kata nodes (systemd journal)
- Documentation explaining the DaemonSet variant selection logic
The current DaemonSet does not implement these separations. Confirm whether kata node support is required for the metadata-collector component.
Based on learnings, separate DaemonSets should be created for kata vs regular nodes using nodeAffinity, and log volume mounts should differ between the two variants.
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrole.yaml (1)
21-33: LGTM: Appropriate RBAC permissions.
The ClusterRole grants minimal permissions required for the metadata-collector:
- `patch` on `pods` to add/update/remove device annotations
- `get` on `nodes/proxy` to access kubelet HTTPS endpoints

This follows the principle of least privilege.
distros/kubernetes/nvsentinel/charts/metadata-collector/templates/clusterrolebinding.yaml (1)
21-28: LGTM: ClusterRoleBinding correctly configured.
The binding properly connects the ClusterRole to the ServiceAccount in the release namespace, completing the RBAC setup.
metadata-collector/pkg/mapper/mapper_test.go (2)
17-33: LGTM: Appropriate test dependencies.
The imports include testify/assert for complex test assertions, which is acceptable given that the scenarios involve mocks, multiple assertions, and detailed validation of pod annotations and retry behavior.
35-112: Well-structured test infrastructure.
The mock implementations and test helper function provide clean, reusable test setup. The `newTestDeviceMapper` helper properly initializes all dependencies with mocks and prepopulates the fake clientset.
metadata-collector/pkg/mapper/httpsclient_test.go (2)
482-502: Excellent test coverage for the happy path.
The test comprehensively validates pod parsing and device annotation extraction. The assertion structure clearly verifies both the pod count and the correct decoding of device annotations.
534-553: Well-designed retry test with mock expectations.
The test effectively validates retry behavior by configuring the mock to fail once and then succeed, confirming that the retry logic works as expected.
metadata-collector/main.go (2)
73-95: Consider non-fatal error handling for transient mapper failures.
The mapper loop exits immediately on any error from `UpdatePodDevicesAnnotations()`, causing the entire pod to terminate. This means transient failures (network issues, temporary API unavailability) will restart the container.
Consider whether errors should be logged and the loop should continue, reserving exit only for unrecoverable errors. This would align with the periodic scraping pattern described in the PR objectives.
Do you want the pod to restart on any mapper error, or should transient errors be logged while continuing the periodic updates?
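For illustration, a tolerant variant could look like the sketch below. It assumes the 30-second ticker from the walkthrough and an `UpdatePodDevicesAnnotations() (int, error)` signature; the `podDeviceUpdater` interface exists only to keep the sketch self-contained:

```go
package mapper

import (
	"context"
	"log/slog"
	"time"
)

// podDeviceUpdater abstracts the mapper for this sketch; the PR's
// PodDeviceMapper is assumed to satisfy it.
type podDeviceUpdater interface {
	UpdatePodDevicesAnnotations() (int, error)
}

// runMapperLoop keeps updating on every tick and treats a failed cycle as
// retryable: it logs and waits for the next tick instead of exiting.
func runMapperLoop(ctx context.Context, pdm podDeviceUpdater, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if _, err := pdm.UpdatePodDevicesAnnotations(); err != nil {
				slog.Warn("pod device annotation update failed; will retry on next tick", "error", err)
			}
		}
	}
}
```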
56-71: Well-structured two-phase lifecycle.
The separation of metadata collection (one-time) and pod device mapping (continuous) is clear and follows good practices for error handling and logging.
metadata-collector/pkg/mapper/grpcclient_test.go (2)
159-180: Comprehensive test validates device aggregation and deduplication.
The test effectively validates that:
- Devices are correctly aggregated per pod across containers
- Duplicate device IDs are deduplicated
- Unsupported resource types are filtered out
- The expected device mappings match the consolidated view
188-193: Good retry behavior validation.
The test confirms that a single retry attempt succeeds, validating that the retry logic handles transient errors correctly.
metadata-collector/pkg/mapper/httpsclient.go (2)
72-96: Excellent security justification documentation.
The extensive comment clearly explains the rationale for InsecureSkipVerify and the authentication/authorization model. This level of detail is valuable for future maintainers and security auditors.
97-108: Well-reasoned token handling for rotation support.
The decision to read the token file on every request correctly handles service account token rotation. The comment clearly explains the rationale.
metadata-collector/pkg/mapper/grpcclient.go (1)
159-175: LGTM: Efficient in-place deduplication.
The implementation correctly sorts device lists and removes duplicates using an efficient two-pointer algorithm. This ensures deterministic ordering for downstream comparisons.
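For comparison, the same deterministic-dedup effect is available from the Go 1.21+ standard library; this is a sketch of the idea, not the PR's implementation:

```go
package mapper

import "slices"

// dedupeDeviceIDs sorts the IDs in place and drops adjacent duplicates,
// yielding a deterministic list for downstream annotation comparison.
func dedupeDeviceIDs(ids []string) []string {
	slices.Sort(ids)
	return slices.Compact(ids)
}
```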
metadata-collector/pkg/mapper/mapper.go (3)
123-153: LGTM: Clear implementation with good observability.
The method correctly lists pods and devices, compares them, and applies patches as needed. The counter return value provides useful observability into how many annotations were modified. Early return on error is appropriate, since the operation will be retried in the next reconciliation cycle.
182-188: Excellent explanation of patch type choice.
The comment clearly explains why `MergePatchType` is necessary for add/update operations instead of `JSONPatchType`. This is valuable context for future maintainers.
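A sketch of the two patch shapes in play, assuming the `dgxc.nvidia.com/devices` annotation name from this PR (note the `~1` escaping of `/` required in JSON Patch paths per RFC 6901):

```go
package mapper

import "fmt"

const annotationKey = "dgxc.nvidia.com/devices"

// mergePatchFor returns a merge-patch body that sets the annotation to
// value; it works whether or not the annotations map or key already exists.
func mergePatchFor(value string) []byte {
	return []byte(fmt.Sprintf(`{"metadata":{"annotations":{%q:%q}}}`, annotationKey, value))
}

// removePatch returns a JSON Patch that deletes the annotation; the "/" in
// the key must be escaped as "~1" in the JSON Pointer path.
func removePatch() []byte {
	return []byte(`[{"op":"remove","path":"/metadata/annotations/dgxc.nvidia.com~1devices"}]`)
}
```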
229-246: LGTM: Comprehensive retry predicate.
The function correctly identifies common retryable Kubernetes API errors, including conflicts, timeouts, rate limiting, and service unavailability. The error checks use the standard `errors.Is*` helpers, which properly unwrap error chains.
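Such a predicate, built on the `k8s.io/apimachinery` helpers, could look roughly like this (an approximation of the behavior described, not the PR's exact code):

```go
package mapper

import apierrors "k8s.io/apimachinery/pkg/api/errors"

// isRetryableError reports whether a Kubernetes API error is worth
// retrying: conflicts, timeouts, rate limiting, or service unavailability.
func isRetryableError(err error) bool {
	return apierrors.IsConflict(err) ||
		apierrors.IsTimeout(err) ||
		apierrors.IsServerTimeout(err) ||
		apierrors.IsTooManyRequests(err) ||
		apierrors.IsServiceUnavailable(err)
}
```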
Merging this branch will increase overall coverage
Coverage by file
Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Changed unit test files
🛡️ CodeQL Analysis
🚨 Found 1 security alert(s)
Summary
Related design doc for GPU reset: https://github.com/NVIDIA/NVSentinel/blob/main/docs/designs/020-nvsentinel-gpu-reset.md.
This PR adds support to the metadata-collector for adding a device annotation to all pods on the given node that have been allocated a GPU device. The annotation is named dgxc.nvidia.com/devices, with keys being the device resource name (either nvidia.com/gpu or nvidia.com/pgpu) and values being lists of GPU UUIDs for the corresponding devices. In a follow-up PR, this pod annotation will be consumed by the node-drainer-module to allow partial node drains, where only pods needing a GPU reset for a given GPU UUID are drained while other pods on the node keep running.
Example annotation:
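A hypothetical illustration of the annotation's shape, following the format described above (the UUIDs are made up):

```yaml
metadata:
  annotations:
    dgxc.nvidia.com/devices: '{"nvidia.com/gpu":["GPU-7f4b9c2e-1a2b-4c3d-8e9f-000000000001","GPU-7f4b9c2e-1a2b-4c3d-8e9f-000000000002"]}'
```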
High-level steps:
List all pods from the Kubelet's HTTPS /pods endpoint. This allows us to determine which pods need the devices
annotation added, updated, or removed. The /pods Kubelet endpoint is equivalent to the following K8s API request: /api/v1/pods?fieldSelector=spec.nodeName=$NODE_NAME. Using the local Kubelet endpoint will significantly reduce load on the control plane from all metadata-collector daemonset pods with the tradeoff that the endpoint has to be periodically scraped rather than watched. Since we have to periodically scrape the PodResourcesLister endpoint in the next step, we will be consistent in how we're discovering both the pod list for this node and the device to pod allocation by listing from a local endpoint on the node.
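A minimal sketch of such a kubelet /pods call, assuming the default kubelet port (10250) and the standard in-cluster service-account token path; this is illustrative, not the PR's client:

```go
package mapper

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

const tokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"

// listPodsFromKubelet calls the node-local kubelet /pods endpoint with the
// pod's service-account token; the response is a v1.PodList encoded as JSON.
func listPodsFromKubelet() ([]byte, error) {
	// Re-read the token on every call so service-account token rotation
	// is picked up.
	token, err := os.ReadFile(tokenPath)
	if err != nil {
		return nil, fmt.Errorf("reading service account token: %w", err)
	}

	client := &http.Client{
		Timeout: 30 * time.Second,
		Transport: &http.Transport{
			// The kubelet typically serves a self-signed certificate; the
			// request is still authenticated via the bearer token.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, //nolint:gosec
		},
	}

	req, err := http.NewRequest(http.MethodGet, "https://localhost:10250/pods", nil)
	if err != nil {
		return nil, fmt.Errorf("building /pods request: %w", err)
	}
	req.Header.Set("Authorization", "Bearer "+string(token))

	resp, err := client.Do(req)
	if err != nil {
		return nil, fmt.Errorf("calling kubelet /pods: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("kubelet /pods returned status %d", resp.StatusCode)
	}

	return io.ReadAll(resp.Body)
}
```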
Discover the pod to device mapping using the Kubelet's gRPC PodResourcesLister endpoint. This local gRPC service is
exposed on a local Unix socket. This endpoint exposes devices allocated to pods through either resource requests through device plugins or resource claims through DRA. Starting in v1.35, we can rely on the ResourceHealthStatus feature to expose this same device mapping through container statuses and at that point we'll be able to remove this component
from the metadata-collector.
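A sketch of that gRPC call using the published `k8s.io/kubelet` API types and the socket path mounted by the DaemonSet; details such as the timeout are illustrative:

```go
package mapper

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

const podResourcesSocket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

// listPodResources dials the kubelet's PodResourcesLister service over its
// Unix socket and lists the current pod-to-device allocations.
func listPodResources(ctx context.Context) (*podresourcesv1.ListPodResourcesResponse, error) {
	conn, err := grpc.NewClient(podResourcesSocket,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, fmt.Errorf("dialing pod-resources socket: %w", err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)

	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	return client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
}
```

Each `PodResources` entry in the response carries per-container `Devices` with a `ResourceName` and `DeviceIds`, which is the raw material the mapper aggregates into the per-pod annotation.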
Compare each pod object to the current set of allocated devices and determine if any pods need a device annotation
updated, added, or removed. We will issue a patch request which updates the dgxc.nvidia.com/devices annotation to the
desired value or remove it. To prevent unnecessary writes, we will only update this annotation if it is modified.
This comes with the tradeoff that consumers of this annotation (node-drainer-module) will not be able to detect stale
annotations outside of the annotation not existing for a pod that is requesting GPUs.
Testing
Checking pod object:
Checking pod objects:
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Infrastructure
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.