
Design: implement control grpc to the device plugin #1003

Open
SchSeba wants to merge 1 commit into k8snetworkplumbingwg:master from SchSeba:device_plugin_design

Conversation


@SchSeba SchSeba commented Jan 1, 2026

Summary by CodeRabbit

  • Documentation
    • Added comprehensive design documentation for a new gRPC-based control interface for the SR-IOV Network Device Plugin, including architecture specifications, workflow descriptions, API definitions, implementation notes, upgrade pathways, and test strategies.


Signed-off-by: Sebastian Sch <sebassch@gmail.com>

github-actions bot commented Jan 1, 2026

Thanks for your PR.
To run vendor CIs, maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendor CIs, maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.

Best regards.

@github-actions github-actions bot added the docs label Jan 1, 2026

coderabbitai bot commented Jan 1, 2026

Walkthrough

Adds a comprehensive design document for a new gRPC-based control interface for the SR-IOV Network Device Plugin. Defines a streaming connection from config-daemon to device plugin enabling lifecycle management of SR-IOV VF resource publication through a ControlService with EnableDevices RPC.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Design Documentation**<br>`doc/design/device-plugin-grpc-control.md` | New design document detailing the gRPC streaming control interface architecture, the ControlService API with an EnableDevices RPC carrying node_state_generation and DevicePluginStatus updates, startup behavior for controlled mode, the config-daemon workflow, backward compatibility, error handling, the upgrade path via feature gate, and comprehensive test plans |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title 'Design: implement control grpc to the device plugin' directly summarizes the main change: adding a comprehensive gRPC-based control interface design for the SR-IOV Network Device Plugin. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
doc/design/device-plugin-grpc-control.md (2)

223-231: Add language specifier to CLI flags code block (Line 225).

The code block documenting CLI flags (starting line 225) is missing a language specifier. Add a `bash` language tag to satisfy markdown linting standards and improve syntax highlighting.


233-250: Minor: Fix compound adjective and clarify feature gate strategy.

Two items:

  1. Line 244: Change "Main entry point changes" to "Main entry-point changes" (compound adjective modifying "changes").

  2. The design mentions a "feature gate" for enabling this behavior (lines 318-322, 327-328) but does not specify the gate name or its default state. For clarity, define the feature gate name (e.g., SRIOVDevicePluginControlGate or similar) and document whether it defaults to enabled: false to maintain backward compatibility during rollout.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8627a03 and 15b1df7.

📒 Files selected for processing (1)
  • doc/design/device-plugin-grpc-control.md
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ykulazhenkov
Repo: k8snetworkplumbingwg/sriov-network-operator PR: 981
File: doc/design/block-device-plugin-until-configured.md:109-151
Timestamp: 2025-12-12T12:23:23.347Z
Learning: In the SR-IOV Network Operator's "Block Device Plugin Until Configured" design (doc/design/block-device-plugin-until-configured.md), the init container wait mechanism intentionally has no timeout. If sriov-config-daemon fails or is not running, the sriov-device-plugin should remain blocked and not start. This is expected fail-safe behavior to prevent announcing misconfigured SR-IOV resources.

Applied to files:

  • doc/design/device-plugin-grpc-control.md
🪛 LanguageTool
doc/design/device-plugin-grpc-control.md

[uncategorized] ~244-~244: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ... - Ability to start/stop all servers on demand - Generation tracking 3. **Main ent...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

🪛 markdownlint-cli2 (0.18.1)
doc/design/device-plugin-grpc-control.md

225-225: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: test
  • GitHub Check: build
  • GitHub Check: test-coverage
  • GitHub Check: Golangci-lint
  • GitHub Check: Analyze (go)
🔇 Additional comments (3)
doc/design/device-plugin-grpc-control.md (3)

45-61: Acknowledge the intentional fail-safe behavior of the init-container approach.

The comparison dismisses PR #981's init-container approach as having "limitations," but the retrieved learning from that PR indicates its lack of timeout is intentional fail-safe behavior: if sriov-config-daemon fails or doesn't run, the device plugin should remain blocked and never announce misconfigured SR-IOV resources. This is a security/correctness feature, not a limitation. Consider revising the comparison to acknowledge this safety benefit while still arguing for the gRPC approach's advantages (lower latency, generation awareness, bidirectional status).


96-112: Clarify timeout and health-check behavior for controlled mode.

The design states the device plugin "waits for a connection from the config-daemon" but does not specify a timeout. If config-daemon crashes or fails to connect, the device plugin will remain blocked indefinitely. This differs from the init-container approach's documented fail-safe (waiting indefinitely is intentional). Please clarify:

  1. Is indefinite waiting intended as a fail-safe (like the init-container approach), or should there be a timeout?
  2. What constitutes "connection close"? Are keepalive probes defined? How is liveness verified?
  3. How should operators handle a permanently failed config-daemon in controlled mode?

These behaviors are critical for operational safety and should be explicitly documented.

Also applies to: 270-275


337-361: Test plan is comprehensive and well-structured.

The test plan appropriately covers unit, integration, and E2E scenarios, including backward compatibility, reconnection, and daemon crash recovery. Ensure integration tests validate that kubelet ListAndWatch is not initiated until after EnableDevices is called, as this is a core correctness requirement preventing the race condition.

@coveralls

Pull Request Test Coverage Report for Build 20637879903

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.04%) to 63.254%

Totals Coverage Status

  • Change from base Build 20490897281: 0.04%
  • Covered Lines: 9125
  • Relevant Lines: 14426

💛 - Coveralls

// the device plugin's resource announcement lifecycle.
service ControlService {
// EnableDevices establishes a long-lived connection that signals the device
// plugin to start exposing devices to kubelet. The connection acts as a
Collaborator

I think it's optionally loading the new configuration and then starting to expose devices to kubelet.

service ControlService {
// EnableDevices establishes a long-lived connection that signals the device
// plugin to start exposing devices to kubelet. The connection acts as a
// health check - when it closes, the device plugin stops advertising resources.
Collaborator

You mean the device plugin will re-advertise zero devices.

// health check - when it closes, the device plugin stops advertising resources.
// The client sends the initial request with the generation, and the server
// streams status updates back.
rpc EnableDevices(EnableDevicesRequest) returns (stream DevicePluginStatus);
Collaborator

What will ensure that when the EnableDevicesRequest is sent, the ConfigMap file is actually updated in the device plugin pod?

State state = 1;

// Number of resource pools currently being served
int32 resource_pool_count = 2;
Collaborator

What will we do with this information?

Same for the field below.

INITIALIZING = 1;
SERVING = 2;
ERROR = 3;
STOPPING = 4;
Collaborator

No STOP state?


enum State {
UNKNOWN = 0;
INITIALIZING = 1;
Collaborator

When will we be in this state?

Collaborator

Apart from SERVING and ERROR, I don't understand when the other states would be transmitted.

Even for ERROR, if we are going to close the streaming connection (and probably exit), we could use a regular gRPC error IMO.

| **Resource overhead** | Additional container image | Minimal (in-process server) |
| **Kubernetes API load** | Annotation updates | None |
| **Implementation scope** | Operator only | Both operator and device plugin |
| **Complexity** | Moderate | Higher initial, simpler operation |
Collaborator

The complexity for the init container is low, I'd say. It's ~300 lines of actual code, and a lot of it is boilerplate (logs, client creation, etc.).

that enables the sriov-config-daemon to establish a long-lived connection and control
when the device plugin publishes SR-IOV VF resources to kubelet. The config-daemon
passes the `SriovNetworkNodeState` generation to the device plugin, which stores it
and only starts exposing devices when connected. On disconnect, the device plugin
Collaborator

A note: the latter part was never done by the device plugin (the probe function has a TODO to implement it).

3. Kubernetes scheduler schedules pods to this node
4. Pods start with partially configured VFs
5. sriov-config-daemon applies VF configuration (unbind/bind VF driver)
6. Pods that were not evicted lose VF connectivity
Collaborator

Do you mean DaemonSet pods that consume VFs?

Even today we don't handle this upon SR-IOV reconfiguration, regardless of externally managed, right?
Having a switch to delay device plugin startup indeed solves the initial config, but not reconfiguration.
This, however, is a separate problem which I believe we can solve by draining DaemonSet pods as well (only those with sriov-prefixed resources, IMO).

functional, this approach has several limitations:

**Complexity:**
- Requires adding an init container to the device plugin DaemonSet
Collaborator

This is not a limitation; it's an implementation detail.

Collaborator

Note: the proposed implementation of the "Init Container Approach" uses the sriov-config-daemon binary (and image), so no extra image pull is needed.


**Complexity:**
- Requires adding an init container to the device plugin DaemonSet
- Uses Kubernetes API for coordination (pod annotation watching)
Collaborator

This is not a limitation; it's an implementation detail IMO.

If using the k8s API for coordination is a limitation, then we sure took a few wrong turns with the operator as well :)

Member

The point here is that any config-daemon needs to talk to the device-plugin instance running on the same node. Using the API server is suboptimal in this case (it limits scalability and can introduce bugs, as a config-daemon might target the wrong device-plugin), and a direct wire, like gRPC over a socket file, might be a better solution.

Some decisions we made with the operator are not that good, after all.

**Complexity:**
- Requires adding an init container to the device plugin DaemonSet
- Uses Kubernetes API for coordination (pod annotation watching)
- Requires periodic polling by config-daemon to detect device plugin restarts
Collaborator

I believe the current approach would be heavier on the config-daemon, no?

Collaborator

Note: the proposed implementation of the "Init Container Approach" uses periodic checks, but these checks rely on the client's cache, meaning no extra calls to the k8s API are made.



- **Connection timeout:** Config-daemon retries connection with exponential backoff
- **Unexpected disconnect:** Device plugin stops servers and waits for reconnection
- **Config-daemon crash:** Device plugin detects closed connection and cleans up
Collaborator

Same as unexpected disconnect.

| **Additional containers** | Yes (init container) | No |
| **Communication method** | Pod annotations via K8s API | Direct gRPC over Unix socket |
| **Latency** | Higher (API server roundtrip) | Lower (direct IPC) |
| **Generation awareness** | No | Yes |
Collaborator

What is the benefit of generation awareness?

AFAIU it's to handle cases where you send the same EnableDevicesRequest in the same connection, to prevent the device plugin from reloading its config.

This limitation is nonexistent in the init container approach.

Also, TBH, I'm not sure that with the current flow the config-daemon can send an EnableDevicesRequest with the same generation twice in a row in the same connection.

- Device-level enable/disable
- Configuration hot-reload
- Detailed device status queries

Collaborator

We should list some downsides relative to the init container approach as well, IMO:

  1. Relatively complex logic introduced in both the config-daemon and the device plugin.
  2. Risk of API incompatibility which we need to take care of (e.g., config-daemon at version X of the gRPC API and device plugin at version X+1, or vice versa).
  3. The device plugin in its current form is pretty stable; the init container approach introduces zero changes to the device plugin.

Generally we are aiming to switch to DRA, so we need to ask ourselves if the above is really worth it.
(Also a note: in DRA we also access the k8s API :) )

#### Integration Tests

* End-to-end flow with config-daemon and device plugin
* Verify devices appear in kubelet only after EnableDevices
Collaborator

Is this an integration test or an e2e test?
