Design: implement control grpc to the device plugin #1003
SchSeba wants to merge 1 commit into k8snetworkplumbingwg:master from
Conversation
Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Thanks for your PR.
To skip the vendor CIs, maintainers can use one of:
Walkthrough: Adds a comprehensive design document for a new gRPC-based control interface for the SR-IOV Network Device Plugin. Defines a streaming connection from config-daemon to device plugin enabling lifecycle management of SR-IOV VF resource publication through a ControlService with an EnableDevices RPC.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ Passed checks (3 passed)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 0
🧹 Nitpick comments (2)
doc/design/device-plugin-grpc-control.md (2)
223-231: Add language specifier to CLI flags code block (line 225). The code block documenting CLI flags (starting line 225) is missing a language specifier. Add `` ```bash `` to satisfy markdown linting standards and improve syntax highlighting.
233-250: Minor: Fix compound adjective and clarify feature-gate strategy. Two items:

- Line 244: Change "Main entry point changes" to "Main entry-point changes" (compound adjective modifying "changes").
- The design mentions a "feature gate" for enabling this behavior (lines 318-322, 327-328) but does not specify the gate name or its default state. For clarity, define the feature gate name (e.g., `SRIOVDevicePluginControlGate` or similar) and document whether it defaults to `enabled: false` to maintain backward compatibility during rollout.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
doc/design/device-plugin-grpc-control.md
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ykulazhenkov
Repo: k8snetworkplumbingwg/sriov-network-operator PR: 981
File: doc/design/block-device-plugin-until-configured.md:109-151
Timestamp: 2025-12-12T12:23:23.347Z
Learning: In the SR-IOV Network Operator's "Block Device Plugin Until Configured" design (doc/design/block-device-plugin-until-configured.md), the init container wait mechanism intentionally has no timeout. If sriov-config-daemon fails or is not running, the sriov-device-plugin should remain blocked and not start. This is expected fail-safe behavior to prevent announcing misconfigured SR-IOV resources.
📚 Learning: 2025-12-12T12:23:23.347Z
Applied to files:
doc/design/device-plugin-grpc-control.md
🪛 LanguageTool
doc/design/device-plugin-grpc-control.md
[uncategorized] ~244-~244: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ... - Ability to start/stop all servers on demand - Generation tracking 3. **Main ent...
(EN_COMPOUND_ADJECTIVE_INTERNAL)
🪛 markdownlint-cli2 (0.18.1)
doc/design/device-plugin-grpc-control.md
225-225: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: test
- GitHub Check: build
- GitHub Check: test-coverage
- GitHub Check: Golangci-lint
- GitHub Check: Analyze (go)
🔇 Additional comments (3)
doc/design/device-plugin-grpc-control.md (3)
45-61: Acknowledge the intentional fail-safe behavior of the init-container approach.The comparison dismisses PR #981's init-container approach as having "limitations," but the retrieved learning from that PR indicates its lack of timeout is intentional fail-safe behavior: if sriov-config-daemon fails or doesn't run, the device plugin should remain blocked and never announce misconfigured SR-IOV resources. This is a security/correctness feature, not a limitation. Consider revising the comparison to acknowledge this safety benefit while still arguing for the gRPC approach's advantages (lower latency, generation awareness, bidirectional status).
96-112: Clarify timeout and health-check behavior for controlled mode.The design states the device plugin "waits for a connection from the config-daemon" but does not specify a timeout. If config-daemon crashes or fails to connect, the device plugin will remain blocked indefinitely. This differs from the init-container approach's documented fail-safe (waiting indefinitely is intentional). Please clarify:
- Is indefinite waiting intended as a fail-safe (like the init-container approach), or should there be a timeout?
- What constitutes "connection close"? Are keepalive probes defined? How is liveness verified?
- How should operators handle a permanently failed config-daemon in controlled mode?
These behaviors are critical for operational safety and should be explicitly documented.
Also applies to: 270-275
337-361: Test plan is comprehensive and well-structured.The test plan appropriately covers unit, integration, and E2E scenarios, including backward compatibility, reconnection, and daemon crash recovery. Ensure integration tests validate that kubelet ListAndWatch is not initiated until after EnableDevices is called, as this is a core correctness requirement preventing the race condition.
Pull Request Test Coverage Report for Build 20637879903

💛 - Coveralls
```proto
// the device plugin's resource announcement lifecycle.
service ControlService {
  // EnableDevices establishes a long-lived connection that signals the device
  // plugin to start exposing devices to kubelet. The connection acts as a
```
I think it's optionally loading the new configuration, then starting to expose devices to kubelet.
```proto
service ControlService {
  // EnableDevices establishes a long-lived connection that signals the device
  // plugin to start exposing devices to kubelet. The connection acts as a
  // health check - when it closes, the device plugin stops advertising resources.
```
You mean the device plugin will re-advertise zero devices.
```proto
  // health check - when it closes, the device plugin stops advertising resources.
  // The client sends the initial request with the generation, and the server
  // streams status updates back.
  rpc EnableDevices(EnableDevicesRequest) returns (stream DevicePluginStatus);
```
What will ensure that, when EnableDevicesRequest is sent, the config map file is actually updated in the device plugin pod?
```proto
  State state = 1;

  // Number of resource pools currently being served
  int32 resource_pool_count = 2;
```
What will we do with this information? Same for the field below.
```proto
  INITIALIZING = 1;
  SERVING = 2;
  ERROR = 3;
  STOPPING = 4;
```

```proto
enum State {
  UNKNOWN = 0;
  INITIALIZING = 1;
```
When will we be in this state?
Apart from SERVING and ERROR, I don't understand when the others would be transmitted. Even for ERROR, if we are going to close the streaming connection (and probably exit), we could use the regular gRPC error IMO.
| **Resource overhead** | Additional container image | Minimal (in-process server) |
| **Kubernetes API load** | Annotation updates | None |
| **Implementation scope** | Operator only | Both operator and device plugin |
| **Complexity** | Moderate | Higher initial, simpler operation |
The complexity of the init container is low, I'd say. It's like ~300 lines of actual code, and a lot of it is boilerplate (logs, create client, etc.).
> that enables the sriov-config-daemon to establish a long-lived connection and control
> when the device plugin publishes SR-IOV VF resources to kubelet. The config-daemon
> passes the `SriovNetworkNodeState` generation to the device plugin, which stores it
> and only starts exposing devices when connected. On disconnect, the device plugin
A note: the latter part was never done by the device plugin (the probe function has a TODO to implement it).
> 3. Kubernetes scheduler schedules pods to this node
> 4. Pods start with partially configured VFs
> 5. sriov-config-daemon applies VF configuration (unbind/bind VF driver)
> 6. Pods that were not evicted lose VF connectivity
Do you mean daemonset pods that consume VFs?
Even today we don't handle that upon SR-IOV reconfiguration, regardless of externally managed, right?
Having a switch to delay device plugin start indeed solves the initial config, but not reconfiguration.
This, however, is a separate problem which I believe we can solve by draining DS pods as well (only those with sriov-prefixed resources, IMO).
> functional, this approach has several limitations:
>
> **Complexity:**
> - Requires adding an init container to the device plugin DaemonSet
This is not a limitation; it's an implementation detail.
Note: the proposed implementation of the "Init Container Approach" uses the sriov-config-daemon binary (and image), so no extra image pull is needed.
> **Complexity:**
> - Requires adding an init container to the device plugin DaemonSet
> - Uses Kubernetes API for coordination (pod annotation watching)
This is not a limitation; it's an implementation detail IMO.
If using the k8s API for coordination is a limitation, then we sure took a few wrong turns with the operator as well :)
The point here is that any config-daemon needs to talk to the device-plugin instance that is running on the same node. Using the API server is suboptimal in this case (it limits scalability and can introduce bugs, as a config-daemon might target the wrong device-plugin), and a direct wire, like gRPC over a socket file, might be a better solution.
Some decisions we made with the operator are not that good, after all.
> **Complexity:**
> - Requires adding an init container to the device plugin DaemonSet
> - Uses Kubernetes API for coordination (pod annotation watching)
> - Requires periodic polling by config-daemon to detect device plugin restarts
I believe the current approach would be heavier on the config-daemon, no?
Note: the proposed implementation of the "Init Container Approach" uses periodic checks, but these checks rely on the client's cache, meaning no extra calls to the k8s API are made.
> - **Connection timeout:** Config-daemon retries connection with exponential backoff
> - **Unexpected disconnect:** Device plugin stops servers and waits for reconnection
> - **Config-daemon crash:** Device plugin detects closed connection and cleans up
Same as unexpected disconnect.
| **Additional containers** | Yes (init container) | No |
| **Communication method** | Pod annotations via K8s API | Direct gRPC over Unix socket |
| **Latency** | Higher (API server roundtrip) | Lower (direct IPC) |
| **Generation awareness** | No | Yes |
What is the benefit of generation awareness?
AFAIU it's to handle cases where you send the same EnableDevicesRequest in the same connection, to prevent the device plugin from reloading the config.
This limitation is non-existent in the init container approach.
Also, TBH, I'm not sure that with the current flow the config-daemon can send EnableDevicesRequest with the same generation twice in a row in the same connection.
> - Device-level enable/disable
> - Configuration hot-reload
> - Detailed device status queries
We should list some downsides relative to the init container approach as well, IMO:
- relatively complex logic introduced in both config-daemon and device plugin
- risk of API incompatibility which we need to take care of (e.g. config-daemon at version X of the gRPC API and device plugin at version X+1, or vice versa)
- the device plugin in its current form is pretty stable; the init container approach introduces 0 changes to the device plugin

Generally we are aiming to switch to DRA, so we need to ask ourselves if the above is really worth it.
(Also a note: in DRA we also access the k8s API :))
> #### Integration Tests
>
> * End-to-end flow with config-daemon and device plugin
> * Verify devices appear in kubelet only after EnableDevices
Is this an integration test or an e2e test?