doc/design/device-plugin-grpc-control.md (362 additions, 0 deletions)
---
title: Device Plugin gRPC Control Interface
authors:
- TBD
reviewers:
- TBD
creation-date: 01-01-2026
last-updated: 01-01-2026
---

# Device Plugin gRPC Control Interface

## Summary

This design proposes implementing a gRPC server in the SR-IOV Network Device Plugin
that enables the sriov-config-daemon to establish a long-lived connection and control
when the device plugin publishes SR-IOV VF resources to kubelet. The config-daemon
passes the `SriovNetworkNodeState` generation to the device plugin, which stores it
and only starts exposing devices when connected. On disconnect, the device plugin
stops reporting devices and cleans them from kubelet.

> **Reviewer comment:** a note: the latter part was never done by the device plugin (the probe function has a TODO to implement it).

## Motivation

### Problem Description

When using policies with `externallyManaged: true`, VFs are created by an external
script or application instead of the sriov-config-daemon. The SR-IOV Network Operator
creates the sriov-device-plugin configuration and enables its DaemonSet pod at the
same time it starts applying SR-IOV state to the node.

This creates a race condition where the sriov-device-plugin may start and announce
resources to kubelet before VFs are fully configured by the sriov-config-daemon.

#### Race Condition Sequence

After node provisioning or host reboot:

1. External script creates VFs (partially configured, e.g., no RDMA GUID assigned)
2. sriov-device-plugin pod starts, discovers VFs and announces them to kubelet
3. Kubernetes scheduler schedules pods to this node
4. Pods start with partially configured VFs
5. sriov-config-daemon applies VF configuration (unbind/bind VF driver)
6. Pods that were not evicted lose VF connectivity

> **Reviewer comment:** do you mean DaemonSet pods that consume VFs? Even today we don't handle this upon SR-IOV reconfiguration, regardless of externally managed, right? Having a switch to delay device plugin startup indeed solves the initial config, but not reconfiguration. This, however, is a separate problem, which I believe we can solve by draining DS pods as well (only those with sriov-prefixed resources, IMO).


#### Existing Proposal: Init Container Approach (PR #981)

PR #981 proposes adding an init container to the device plugin DaemonSet that blocks
until the config-daemon signals configuration completion via pod annotations. While
functional, this approach has several limitations:

**Complexity:**
- Requires adding an init container to the device plugin DaemonSet

> **Reviewer comment:** this is not a limitation, it's an implementation detail.
>
> **Reviewer comment:** note: the proposed implementation of the "Init Container Approach" uses the sriov-config-daemon binary (and image), so no extra image pull is needed.

- Uses Kubernetes API for coordination (pod annotation watching)

> **Reviewer comment:** this is not a limitation, it's an implementation detail IMO. If using the k8s API for coordination is a limitation, then we sure took a few wrong turns with the operator as well :)
>
> **Reviewer comment (member):** the point here is that a config-daemon needs to talk to the device-plugin instance that is running on the same node. Using the API server is suboptimal in this case (it limits scalability and can introduce bugs, as a config-daemon might target the wrong device-plugin); a direct wire, like gRPC over a socket file, might be a better solution. Some decisions we made with the operator are not that good, after all.
Some decision we made with the operator are not that good, after all

- Requires periodic polling by config-daemon to detect device plugin restarts

> **Reviewer comment:** I believe the current approach would be heavier on the config-daemon, no?
>
> **Reviewer comment:** note: the proposed implementation of the "Init Container Approach" uses periodic checks, but these checks rely on the client's cache, meaning no extra calls to the k8s API are made.

- Adds latency due to annotation-based signaling through the API server

> **Reviewer comment:** what do you mean by this? I think the main limitations of this approach are:
> 1. another entity creating a watch from each node to the API server (although limited to a specific namespace and pod, and relatively short-lived)
> 2. detection of a device plugin restart may be delayed by up to the polling interval


**Limitations:**
- Cannot dynamically pause device reporting without pod restart
- Device plugin has no awareness of which configuration generation it should serve
- Disconnect/reconnect scenarios require pod restarts

> **Reviewer comment:** this is a limitation of the current implementation, not of the init container approach.

- No direct synchronization between daemon and device plugin processes

### Use Cases

* As a user, I want pods to only receive SR-IOV VFs after they are fully configured
* As a user, I want the config-daemon to have direct control over when the device
plugin starts and stops advertising resources
* As a user, I want the device plugin to automatically stop advertising resources
if the config-daemon is not connected or crashes
* As a user, I want the device plugin to be aware of the configuration generation
it is serving

### Goals

* Implement a gRPC server in the device plugin for config-daemon communication
* Enable the config-daemon to control device plugin resource announcement lifecycle
* Provide generation-based synchronization between operator and device plugin
* Automatically clean up resources when the connection is lost
* Maintain backward compatibility when the feature is not used

### Non-Goals

* Changing the drain behavior for DaemonSet pods
* Modifying how externally managed VFs are created
* Changing the VF configuration process itself
* Replacing the existing kubelet device plugin gRPC interface

## Proposal

Add a new gRPC server in the sriov-network-device-plugin that listens on a Unix
domain socket. The sriov-config-daemon establishes a long-lived streaming connection
to this server and signals when the device plugin should start exposing devices.

### Workflow Description

#### Device Plugin Startup (Controlled Mode)

When the device plugin starts with the control interface enabled:

1. Device plugin starts and initializes the gRPC control server
2. Device plugin waits for a connection from the config-daemon
3. When the config-daemon connects and sends an `EnableDevices` RPC with the
   `SriovNetworkNodeState` generation, the device plugin:
   - Stores the generation for reference
   - Starts the resource servers and registers with kubelet
   - Begins the ListAndWatch loop to expose devices
4. The connection remains open as a health check mechanism
5. If the connection is lost (daemon restart, crash, etc.), the device plugin:
   - Stops all resource servers
   - Cleans up kubelet registrations
   - Waits for a new connection

> **Reviewer comment:** what ensures the generation matches the config file? Maybe push the desired config through this API.
>
> **Reviewer comment:** also, IMO, the proposed service could be a non-streaming API; then use the standard gRPC health server for watching when the service goes down. See https://github.com/grpc/grpc/blob/master/doc/health-checking.md
>
> **Reviewer comment:** regarding "maybe push the desired config through this API": I agree with Adrian. If we are considering adding a control socket or "externally managed mode" 😄 to the SR-IOV device plugin, it might be beneficial to design it so that the device plugin's configuration can be dynamically updated through this interface.

#### Config Daemon Behavior

When the config-daemon completes VF configuration:

1. After successfully applying SR-IOV configuration, connect to the device plugin
gRPC server
2. Call `EnableDevices` RPC with the current `SriovNetworkNodeState` generation
3. Maintain the connection for the lifetime of the configuration
4. On reconfiguration that requires device plugin restart:
- Close the existing connection (triggers device cleanup)
- Wait for configuration to complete
- Establish a new connection with the new generation

> **Reviewer comment:** missing: the on-connection-lost flow.

#### Sequence Diagram

```mermaid
sequenceDiagram
participant ES as External Script
participant DP as Device Plugin
participant GS as gRPC Server (in DP)
participant CD as Config Daemon
participant K as Kubelet

ES->>ES: Create VFs
DP->>GS: Start gRPC Control Server
DP->>DP: Discover Devices (but don't register)
Note over DP: Waiting for control connection

CD->>CD: Apply VF Configuration
CD->>GS: Connect + EnableDevices(generation=5)
GS->>DP: Signal to start
DP->>DP: Store generation=5
DP->>K: Register & ListAndWatch
K->>K: Update Node Capacity

Note over CD,GS: Long-lived connection maintained

CD->>CD: New configuration needed
CD->>GS: Disconnect
GS->>DP: Connection lost signal
DP->>K: Stop ListAndWatch, cleanup
K->>K: Remove SR-IOV resources

CD->>CD: Apply new VF Configuration
CD->>GS: Connect + EnableDevices(generation=6)
GS->>DP: Signal to start
DP->>DP: Store generation=6
DP->>K: Register & ListAndWatch
```

### API Extensions

#### gRPC Service Definition

```protobuf
syntax = "proto3";

package sriovdp.control.v1;

option go_package = "github.com/k8snetworkplumbingwg/sriov-network-device-plugin/pkg/control/v1";

// ControlService provides an interface for the config-daemon to control
// the device plugin's resource announcement lifecycle.
service ControlService {
  // EnableDevices establishes a long-lived connection that signals the device
  // plugin to start exposing devices to kubelet. The connection acts as a
  // health check - when it closes, the device plugin stops advertising resources.
  // The client sends the initial request with the generation, and the server
  // streams status updates back.
  //
  // [reviewer] I think it's "optionally loading the new configuration, then
  // start exposing devices to kubelet".
  // [reviewer] You mean the device plugin will re-advertise zero devices.
  // [reviewer] What will ensure that, when EnableDevicesRequest is sent, the
  // config map file is actually updated in the device plugin pod?
  rpc EnableDevices(EnableDevicesRequest) returns (stream DevicePluginStatus);
}

// EnableDevicesRequest is sent by the config-daemon to enable device advertising.
message EnableDevicesRequest {
  // The generation of the SriovNetworkNodeState that triggered this enable.
  // This allows the device plugin to track which configuration it is serving.
  int64 node_state_generation = 1;

  // Optional: node name for logging/debugging purposes
  // [reviewer] Why is this needed, even if optional?
  string node_name = 2;
}

// DevicePluginStatus is streamed back to the config-daemon to provide
// visibility into the device plugin state.
// [reviewer] Any rules on when to stream this? I.e., for each
// EnableDevicesRequest is exactly one DevicePluginStatus msg expected?
message DevicePluginStatus {
  // Current state of the device plugin
  State state = 1;

  // Number of resource pools currently being served
  // [reviewer] What will we do with this information? Same for the fields below.
  int32 resource_pool_count = 2;

  // Total number of devices being advertised
  int32 device_count = 3;

  // The generation currently being served
  int64 serving_generation = 4;

  // Optional error message if state is ERROR
  string error_message = 5;

  enum State {
    UNKNOWN = 0;
    // [reviewer] When will we be in this state? Apart from SERVING and ERROR,
    // I don't understand when the others would be transmitted. Even for ERROR,
    // if we are going to close the streaming connection (and probably exit),
    // we could use the regular gRPC error IMO.
    INITIALIZING = 1;
    SERVING = 2;
    ERROR = 3;
    // [reviewer] No STOP?
    STOPPING = 4;
  }
}
```

#### Device Plugin CLI Flags

```
--control-socket Path to the Unix domain socket for the control interface
(default: /var/lib/sriov/sriovdp-control.sock)
--controlled-mode Enable controlled mode where device plugin waits for
config-daemon connection before advertising devices
(default: false)
```
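
Assuming the standard library `flag` package, the flags above could be wired roughly as follows. This is an illustrative sketch: `parseControlFlags` is a hypothetical helper, not existing device plugin code; only the flag names and defaults come from the design.

```go
package main

import (
	"flag"
	"fmt"
)

// parseControlFlags parses the proposed control-interface flags, with the
// defaults stated in the design (hypothetical helper for illustration).
func parseControlFlags(args []string) (socket string, controlled bool, err error) {
	fs := flag.NewFlagSet("sriovdp", flag.ContinueOnError)
	fs.StringVar(&socket, "control-socket", "/var/lib/sriov/sriovdp-control.sock",
		"Path to the Unix domain socket for the control interface")
	fs.BoolVar(&controlled, "controlled-mode", false,
		"Wait for config-daemon connection before advertising devices")
	err = fs.Parse(args)
	return socket, controlled, err
}

func main() {
	sock, controlled, err := parseControlFlags([]string{"--controlled-mode=true"})
	if err != nil {
		panic(err)
	}
	fmt.Println(sock, controlled)
}
```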

### Implementation Details/Notes/Constraints

#### Device Plugin Changes

1. **New `pkg/control` package:**
- gRPC server implementation
- Connection state management
- Integration with resource manager

2. **ResourceManager modifications:**
- Support for delayed server startup
- Ability to start/stop all servers on demand
- Generation tracking

3. **Main entry point changes:**
- Parse new CLI flags
- Conditionally start control server
- Block on control connection in controlled mode
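
The delayed startup and stop-on-disconnect behavior could be guarded by a small controller around the resource manager. This is an illustrative sketch only: `Controller`, `Enable`, and `Disable` are hypothetical names, not existing device plugin APIs, and config reload on a generation change while serving is deliberately out of scope here.

```go
package main

import (
	"fmt"
	"sync"
)

// Controller gates resource-server startup on the control connection state.
type Controller struct {
	mu         sync.Mutex
	serving    bool
	generation int64
}

// Enable is called when the config-daemon connects and sends EnableDevices:
// store the generation and, if not already serving, start the resource servers.
func (c *Controller) Enable(gen int64, startServers func()) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.generation = gen
	if !c.serving {
		startServers()
		c.serving = true
	}
}

// Disable is called when the control connection is lost: stop the resource
// servers so the SR-IOV resources are cleaned up from kubelet.
func (c *Controller) Disable(stopServers func()) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.serving {
		stopServers()
		c.serving = false
	}
}

func main() {
	c := &Controller{}
	c.Enable(5, func() { fmt.Println("resource servers started") })
	c.Disable(func() { fmt.Println("resource servers stopped") })
	c.Enable(6, func() { fmt.Println("resource servers restarted") })
	fmt.Println("serving generation:", c.generation)
}
```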

#### Config Daemon Changes

1. **New gRPC client:**
- Connect to device plugin after configuration
- Maintain long-lived connection
- Handle reconnection on errors

2. **Integration with reconcile loop:**
- Call `EnableDevices` after successful apply
- Close connection when reconfiguration needed
- Pass generation from `SriovNetworkNodeState`

#### Backward Compatibility

When `--controlled-mode=false` (default), the device plugin behaves exactly as
before - starting resource servers immediately on startup. This ensures existing
deployments continue to work without changes.

#### Error Handling

- **Connection timeout:** Config-daemon retries connection with exponential backoff
- **Unexpected disconnect:** Device plugin stops servers and waits for reconnection
- **Config-daemon crash:** Device plugin detects closed connection and cleans up

> **Reviewer comment:** same as unexpected disconnect.

- **Device plugin restart:** Config-daemon detects disconnect and reconnects

### Comparison with Init Container Approach (PR #981)

| Aspect | Init Container (PR #981) | gRPC Control Interface |
|--------|--------------------------|------------------------|
| **Additional containers** | Yes (init container) | No |
| **Communication method** | Pod annotations via K8s API | Direct gRPC over Unix socket |
| **Latency** | Higher (API server roundtrip) | Lower (direct IPC) |
| **Generation awareness** | No | Yes |
| **Dynamic pause/resume** | Requires pod restart | Connection close/reconnect |
| **Failure detection** | Polling required | Immediate (connection closed) |
| **Resource overhead** | Additional container image | Minimal (in-process server) |
| **Kubernetes API load** | Annotation updates | None |
| **Implementation scope** | Operator only | Both operator and device plugin |
| **Complexity** | Moderate | Higher initial, simpler operation |

> **Reviewer comment:** what is the benefit of generation awareness? AFAIU it is to handle the case where the same EnableDevicesRequest is sent twice on the same connection, to prevent the device plugin from reloading its config; this limitation does not exist in the init container approach. Also, TBH, I am not sure that with the current flow the config-daemon can send EnableDevicesRequest with the same generation twice in a row on the same connection.
> **Reviewer comment:** the complexity of the init container is low, I'd say. It is roughly 300 lines of actual code, and a lot of it is boilerplate (logs, creating the client, etc.).


#### Why gRPC is Better

1. **Direct Communication:** No intermediate Kubernetes API calls reduce latency
and eliminate potential API server bottlenecks.

2. **Generation Tracking:** The device plugin knows exactly which configuration
generation it's serving, enabling better debugging and state verification.

3. **Immediate Failure Detection:** gRPC connection termination is detected
immediately, unlike polling-based annotation watching.

4. **No Additional Containers:** Reduces pod startup time and resource usage.

5. **Bidirectional Status:** The device plugin can stream status updates back
to the config-daemon, providing visibility into device advertisement state.

6. **Cleaner Lifecycle Management:** Connection-based lifecycle is more intuitive
than annotation-based coordination.

7. **Future Extensibility:** The gRPC interface can be extended with additional
RPCs for features like:
- Device-level enable/disable
- Configuration hot-reload
- Detailed device status queries

> **Reviewer comment:** we should list some downsides relative to the init container approach as well, IMO:
> 1. relatively complex logic introduced in both the config-daemon and the device plugin
> 2. risk of API incompatibility that we need to take care of (e.g., config-daemon at version X of the gRPC API and device plugin at version X+1, or vice versa)
> 3. the device plugin in its current form is pretty stable; the init container approach introduces zero changes to the device plugin
>
> Generally, we are aiming to switch to DRA, so we need to ask ourselves whether the above is really worth it. (Also a note: in DRA we also access the k8s API :) )

### Upgrade & Downgrade Considerations

**Upgrade:**
- New device plugin image with control interface
- Config-daemon updated to use gRPC client
- Feature gate enables the new behavior
- Existing pods continue working; new configuration uses control interface

**Downgrade:**
- Disable feature gate
- Device plugin falls back to immediate startup
- Config-daemon stops attempting gRPC connections
- Reverts to original behavior (with potential timing issues)

**Mixed Version Scenarios:**
- Old device plugin + New config-daemon: Config-daemon connection fails, logs warning,
device plugin works in legacy mode
- New device plugin + Old config-daemon: Device plugin waits indefinitely in
controlled mode (should not enable controlled mode without updated daemon)

### Test Plan

#### Unit Tests

* gRPC server starts and accepts connections
* Device plugin waits for connection in controlled mode
* Device plugin starts servers after EnableDevices RPC
* Device plugin stops servers on connection close
* Generation is correctly stored and reported
* Backward compatibility when controlled mode is disabled

#### Integration Tests

* End-to-end flow with config-daemon and device plugin
* Verify devices appear in kubelet only after EnableDevices

> **Reviewer comment:** is this an integration test or an e2e test?

* Verify devices are removed when connection closes
* Test reconnection scenarios
* Test config-daemon crash recovery

#### E2E Tests

* Deploy with feature gate enabled
* Verify pod scheduling waits for configuration completion
* Verify externally managed VF scenario works correctly
* Test node reboot scenarios
* Test daemon restart scenarios