Design: implement control grpc to the device plugin #1003
SchSeba wants to merge 1 commit into k8snetworkplumbingwg:master from
Conversation
Signed-off-by: Sebastian Sch <sebassch@gmail.com>
Thanks for your PR.
To skip the vendor CIs, maintainers can use one of:
Walkthrough: Adds a comprehensive design document for a new gRPC-based control interface for the SR-IOV Network Device Plugin. Defines a streaming connection from config-daemon to device plugin enabling lifecycle management of SR-IOV VF resource publication through a ControlService with an EnableDevices RPC.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ Passed checks (3 passed)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 0
🧹 Nitpick comments (2)
doc/design/device-plugin-grpc-control.md (2)
223-231: Add language specifier to CLI flags code block (line 225). The code block documenting CLI flags (starting line 225) is missing a language specifier. Add `` ```bash `` to satisfy markdown linting standards and improve syntax highlighting.
233-250: Minor: Fix compound adjective and clarify feature-gate strategy. Two items:

- Line 244: Change "Main entry point changes" to "Main entry-point changes" (compound adjective modifying "changes").
- The design mentions a "feature gate" for enabling this behavior (lines 318-322, 327-328) but does not specify the gate name or its default state. For clarity, define the feature gate name (e.g., `SRIOVDevicePluginControlGate` or similar) and document whether it defaults to `enabled: false` to maintain backward compatibility during rollout.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
doc/design/device-plugin-grpc-control.md
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ykulazhenkov
Repo: k8snetworkplumbingwg/sriov-network-operator PR: 981
File: doc/design/block-device-plugin-until-configured.md:109-151
Timestamp: 2025-12-12T12:23:23.347Z
Learning: In the SR-IOV Network Operator's "Block Device Plugin Until Configured" design (doc/design/block-device-plugin-until-configured.md), the init container wait mechanism intentionally has no timeout. If sriov-config-daemon fails or is not running, the sriov-device-plugin should remain blocked and not start. This is expected fail-safe behavior to prevent announcing misconfigured SR-IOV resources.
📚 Learning: 2025-12-12T12:23:23.347Z
Applied to files:
doc/design/device-plugin-grpc-control.md
🪛 LanguageTool
doc/design/device-plugin-grpc-control.md
[uncategorized] ~244-~244: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ... - Ability to start/stop all servers on demand - Generation tracking 3. **Main ent...
(EN_COMPOUND_ADJECTIVE_INTERNAL)
🪛 markdownlint-cli2 (0.18.1)
doc/design/device-plugin-grpc-control.md
225-225: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: test
- GitHub Check: build
- GitHub Check: test-coverage
- GitHub Check: Golangci-lint
- GitHub Check: Analyze (go)
🔇 Additional comments (3)
doc/design/device-plugin-grpc-control.md (3)
45-61: Acknowledge the intentional fail-safe behavior of the init-container approach.The comparison dismisses PR #981's init-container approach as having "limitations," but the retrieved learning from that PR indicates its lack of timeout is intentional fail-safe behavior: if sriov-config-daemon fails or doesn't run, the device plugin should remain blocked and never announce misconfigured SR-IOV resources. This is a security/correctness feature, not a limitation. Consider revising the comparison to acknowledge this safety benefit while still arguing for the gRPC approach's advantages (lower latency, generation awareness, bidirectional status).
96-112: Clarify timeout and health-check behavior for controlled mode.The design states the device plugin "waits for a connection from the config-daemon" but does not specify a timeout. If config-daemon crashes or fails to connect, the device plugin will remain blocked indefinitely. This differs from the init-container approach's documented fail-safe (waiting indefinitely is intentional). Please clarify:
- Is indefinite waiting intended as a fail-safe (like the init-container approach), or should there be a timeout?
- What constitutes "connection close"? Are keepalive probes defined? How is liveness verified?
- How should operators handle a permanently failed config-daemon in controlled mode?
These behaviors are critical for operational safety and should be explicitly documented.
Also applies to: 270-275
337-361: Test plan is comprehensive and well-structured.The test plan appropriately covers unit, integration, and E2E scenarios, including backward compatibility, reconnection, and daemon crash recovery. Ensure integration tests validate that kubelet ListAndWatch is not initiated until after EnableDevices is called, as this is a core correctness requirement preventing the race condition.
Pull Request Test Coverage Report for Build 20637879903

💛 - Coveralls
```proto
// the device plugin's resource announcement lifecycle.
service ControlService {
  // EnableDevices establishes a long-lived connection that signals the device
  // plugin to start exposing devices to kubelet. The connection acts as a
```
I think it's optionally loading the new configuration, then starting to expose devices to kubelet.
```proto
service ControlService {
  // EnableDevices establishes a long-lived connection that signals the device
  // plugin to start exposing devices to kubelet. The connection acts as a
  // health check - when it closes, the device plugin stops advertising resources.
```
You mean the device plugin will re-advertise zero devices.
```proto
  // health check - when it closes, the device plugin stops advertising resources.
  // The client sends the initial request with the generation, and the server
  // streams status updates back.
  rpc EnableDevices(EnableDevicesRequest) returns (stream DevicePluginStatus);
```
What will ensure that, when EnableDevicesRequest is sent, the config map file is actually updated in the device plugin pod?
```proto
  State state = 1;

  // Number of resource pools currently being served
  int32 resource_pool_count = 2;
```
What will we do with this information? Same for the field below.
```proto
  INITIALIZING = 1;
  SERVING = 2;
  ERROR = 3;
  STOPPING = 4;
```

```proto
enum State {
  UNKNOWN = 0;
  INITIALIZING = 1;
```
When will we be in this state?
Apart from SERVING and ERROR, I don't understand when the others would be transmitted. Even for ERROR, if we are going to close the streaming connection (and probably exit), we could use the regular gRPC error IMO.
| **Resource overhead** | Additional container image | Minimal (in-process server) |
| **Kubernetes API load** | Annotation updates | None |
| **Implementation scope** | Operator only | Both operator and device plugin |
| **Complexity** | Moderate | Higher initial, simpler operation |
The complexity of the init container is low, I'd say. It's like ~300 lines of actual code, and a lot of it is boilerplate (logs, create client, etc.).
> that enables the sriov-config-daemon to establish a long-lived connection and control
> when the device plugin publishes SR-IOV VF resources to kubelet. The config-daemon
> passes the `SriovNetworkNodeState` generation to the device plugin, which stores it
> and only starts exposing devices when connected. On disconnect, the device plugin
A note: the latter part was never done by the device plugin (the probe function has a TODO to implement it).
> 3. Kubernetes scheduler schedules pods to this node
> 4. Pods start with partially configured VFs
> 5. sriov-config-daemon applies VF configuration (unbind/bind VF driver)
> 6. Pods that were not evicted lose VF connectivity
Do you mean daemonset pods that consume VFs?
Even today we don't handle that upon SR-IOV reconfiguration, regardless of externally managed, right?
Having a switch to delay device plugin start indeed solves the initial config, but not reconfiguration.
This, however, is a separate problem which I believe we can solve by draining DS pods as well (only those with sriov-prefixed resources, IMO).
> functional, this approach has several limitations:
>
> **Complexity:**
> - Requires adding an init container to the device plugin DaemonSet
This is not a limitation; it's an implementation detail.
Note: the proposed implementation of the "Init Container Approach" uses the sriov-config-daemon binary (and image), so no extra image pull is needed.
> **Complexity:**
> - Requires adding an init container to the device plugin DaemonSet
> - Uses Kubernetes API for coordination (pod annotation watching)
This is not a limitation; it's an implementation detail IMO.
If using the k8s API for coordination is a limitation, then we sure took a few wrong turns with the operator as well :)
The point here is that any config-daemon needs to talk to the device-plugin instance that is running on the same node. Using the API server is suboptimal in this case (it limits scalability and can introduce bugs, as a config-daemon might target the wrong device-plugin), and a direct wire, like gRPC over a socket file, might be a better solution.
Some decisions we made with the operator are not that good, after all.
> **Complexity:**
> - Requires adding an init container to the device plugin DaemonSet
> - Uses Kubernetes API for coordination (pod annotation watching)
> - Requires periodic polling by config-daemon to detect device plugin restarts
I believe the current approach would be heavier on the config-daemon, no?
Note: the proposed implementation of the "Init Container Approach" uses periodic checks, but these checks rely on the client's cache, meaning no extra calls to the k8s API are made.
> - **Connection timeout:** Config-daemon retries connection with exponential backoff
> - **Unexpected disconnect:** Device plugin stops servers and waits for reconnection
> - **Config-daemon crash:** Device plugin detects closed connection and cleans up
Same as unexpected disconnect.
| **Additional containers** | Yes (init container) | No |
| **Communication method** | Pod annotations via K8s API | Direct gRPC over Unix socket |
| **Latency** | Higher (API server roundtrip) | Lower (direct IPC) |
| **Generation awareness** | No | Yes |
What is the benefit of generation awareness?
AFAIU it's to handle cases where you send the same EnableDevicesRequest in the same connection, to prevent the device plugin from reloading the config.
This limitation is non-existent in the init container approach.
Also, TBH, I'm not sure that with the current flow the config-daemon can send EnableDevicesRequest with the same generation twice in a row in the same connection.
> - Device-level enable/disable
> - Configuration hot-reload
> - Detailed device status queries
We should list some downsides relative to the init container approach as well, IMO:
- relatively complex logic introduced in both config-daemon and device plugin
- risk of API incompatibility which we need to take care of (e.g. config-daemon at version X of the gRPC API and device plugin at version X+1, or vice versa)
- the device plugin in its current form is pretty stable; the init container approach introduces 0 changes to the device plugin

Generally we are aiming to switch to DRA, so we need to ask ourselves if the above is really worth it.
(Also a note: in DRA we also access the k8s API :))
> #### Integration Tests
>
> * End-to-end flow with config-daemon and device plugin
> * Verify devices appear in kubelet only after EnableDevices
Is this an integration test or an e2e test?