feat: Add PF VFIO support#163
Conversation
WalkthroughPlugin now distinguishes VF vs PF at runtime (via new IsVirtualFunction), adds VF-specific handlers for add/del flows, delays GUID loading until device-type determination, and updates documentation to reflect vfio-pci mode applies to devices bound to vfio-pci (VF or PF). Changes
Sequence Diagram(s)sequenceDiagram
participant K8s as K8s/CNI (caller)
participant CNI as ib-sriov-cni
participant Sysfs as /sys/bus/pci (utils)
participant IPAM as IPAM
participant NetNS as netns
participant RDMA as RDMAManager
K8s->>CNI: CMD ADD (args)
CNI->>Sysfs: IsVirtualFunction(pciAddr)
Sysfs-->>CNI: isVF (true/false)
alt isVF == true
CNI->>CNI: set netConf.IsVFDevice=true
CNI->>CNI: load VF-specific device info
CNI->>IPAM: request IP (IPAM)
IPAM-->>CNI: IPAM result
CNI->>NetNS: configure VF in pod netns (setup, move RDMA if needed)
CNI->>RDMA: optionally move RDMA device to netns
CNI-->>K8s: result
else isVF == false
CNI->>CNI: set netConf.IsVFDevice=false
CNI->>CNI: validate vfioPciMode / PF path
CNI->>NetNS: configure PF passthrough (no VF IPAM steps)
CNI-->>K8s: result
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (6)
pkg/utils/utils_test.go (1)
84-109: Good coverage for VF/PF/VFIO and non-existent-device casesThese tests nicely exercise the different physfn layouts (VF, PF, VFIO VF, and missing entry), matching the semantics of
IsVirtualFunction. Just ensure the test fixtures (overriddenSysBusPci/NetDirectory) actually include all of the referenced PCI addresses so the tests don’t become host‑dependent.README.md (1)
210-210: Doc update aligns with vfio-pci detection behaviorThe expanded
vfioPciModedescription (covering both VF and PF devices and auto-detection from driver binding) matcheshandleVfioPciDetectionsemantics. You might optionally clarify that for PF passthrough this effectively results in a “no-op” network configuration (CNI only records the attachment) so readers don’t expect IPAM or link setup on PFs.pkg/utils/utils.go (1)
266-283: VF detection viaphysfnsymlink looks correct; note “missing device” semanticsThe implementation matches standard SR‑IOV layout: presence of
/sys/bus/pci/devices/<addr>/physfnimplies VF, absence implies “not a VF”. Returning(false, nil)onos.IsNotExistmakes the function safe to use purely as a classifier, but any caller that needs to validate thatDeviceIDactually exists in sysfs must perform a separate existence check (or rely on a later step likeLoadDeviceInfo) rather than assumingfalsemeans “known PF”.cmd/ib-sriov-cni/main.go (3)
148-170: VF-only device info loading and GUID handling are consistent with PF passthrough designUsing
IsVirtualFunctionto setnetConf.IsVFDevice, then:
- Populating GUID from runtime/CNI args,
- Requiring GUID only when
IBKubernetesEnabled && IsVFDevice,- And calling
config.LoadDeviceInfoonly for VFsfits the new model where PFs are passed through untouched and don’t need VF metadata (PF already has its own GUID). This cleanly separates VF vs PF responsibilities.
327-360: PF passthrough path incmdAddis minimal and appropriateInitialising a basic
current.Resultwith the container interface and then:
- Short‑circuiting PF devices (
!netConf.IsVFDevice) to “cache NetConf and return success”, and- Delegating VFs to
handleVFAddmatches the PF passthrough goal: for PFs, CNI only records the attachment, leaving direct assignment and networking to KubeVirt/VM tools. This is a clear separation and doesn’t interfere with existing VF flows.
376-406: VF-specific cleanup helper keeps DEL logic focused
handleVFCleanupencapsulates the VF teardown nicely:
- Skips
ReleaseVFwhenVfioPciModeis true (no host netdev to release).- Restores the RDMA device back to the default namespace when
RdmaIsolationis enabled, with a clear error if that fails.- Always calls
ResetVFConfigto restore VF state.This is a good refactor from the inlined DEL logic and makes VF cleanup easier to reason about and test.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
README.md(1 hunks)cmd/ib-sriov-cni/main.go(5 hunks)pkg/types/types.go(1 hunks)pkg/utils/utils.go(1 hunks)pkg/utils/utils_test.go(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
pkg/utils/utils_test.go (1)
pkg/utils/utils.go (1)
IsVirtualFunction(267-283)
cmd/ib-sriov-cni/main.go (5)
pkg/utils/utils.go (1)
IsVirtualFunction(267-283)pkg/config/config.go (2)
LoadDeviceInfo(39-67)DefaultCNIDir(17-17)pkg/types/types.go (2)
NetConf(16-19)Manager(81-86)pkg/types/mocks/Manager.go (1)
Manager(12-14)pkg/utils/rdma.go (1)
MoveRdmaDevFromNs(58-73)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (go)
🔇 Additional comments (4)
cmd/ib-sriov-cni/main.go (3)
138-141: EarlydeviceIDpresence validation is a good safeguardChecking
netConf.DeviceIDand failing fast if it’s empty prevents ambiguous behavior later in VF/PF detection and VFIO handling. This is a solid tightening of input validation.
264-325:handleVFAddstructure and error handling look soundThe split into
handleVFAddsimplifiescmdAddand keeps VF-specific logic together:
doVFConfigis invoked before setting up deferred cleanup, so failures during initial VF config don’t trigger partial cleanups.- The deferred block correctly checks
retErrto decide when toReleaseVFand (for RDMA isolation) return the RDMA device to the default namespace.- IPAM is skipped for VFIO devices, and when IPAM is used, you correctly:
- Call
runIPAMPlugin,- Defer
ExecDelon outer failure,- Repoint IPs to interface index 0,
ConfigureIfaceinside the target netns,- And update the result via
*result = *newResult.This aligns with CNI patterns and keeps VFIO vs non‑VFIO behavior clear.
456-458: Early return incmdDelskips VF-specific cleanup logicThe code shown at lines 456-458 returns early when
!netConf.IsVFDevice, preventing execution ofhandleVFCleanupfor all cached configurations.However, verification of the underlying root cause—that
IsVFDeviceis markedjson:"-"in the struct definition and thatLoadConfFromCachedoes not recompute VF/PF classification—cannot be completed due to repository access limitations. Additionally, the proposed fix depends on the existence and correct behavior ofutils.IsVirtualFunction(), which could not be confirmed.The early return logic itself is clearly present in the code snippet provided. To definitively validate this issue and the suggested fix, direct inspection of:
- The
IbSriovNetConfstruct definition and its json tags- The
LoadConfFromCacheimplementation- The
utils.IsVirtualFunction()function signature and implementation
is required.pkg/types/types.go (1)
34-35: Runtime-only flag is fine, but don't rely on it being persistedUsing
IsVFDeviceas a runtime flag withjson:"-"is appropriate and avoids leaking it into user config. However, any code that runs on cachedNetConf(e.g., the DEL path) must recompute VF vs PF fromDeviceIDinstead of assumingIsVFDevicewas serialized and is available.
|
@adrianchiris @SchSeba PTAL |
cmd/ib-sriov-cni/main.go
Outdated
| } | ||
| defer netns.Close() | ||
|
|
||
| if !netConf.IsVFDevice { |
There was a problem hiding this comment.
netconf is loaded from cache. currently this field is not persisted.
There was a problem hiding this comment.
please also think of a case where CNI Add was called with a binary that did not contain this change, then the CNI plugin was updated and CMD Del command was called with this code.
in that case even if we persisted the value it would default to false in this case.
perhaps its worth to check if the device is PF or VF during runtime here as well
There was a problem hiding this comment.
make sense.
I will update my code
| netConf.GUID = getGUIDFromConf(netConf) | ||
|
|
||
| // Ensure GUID was provided if ib-kubernetes integration is enabled | ||
| // Note: PF devices already have their own GUID, so only check for VF devices |
There was a problem hiding this comment.
how does ib-kubernetes works with PF ?
it needs to know the PF GUID in order to assign it to a pkey
There was a problem hiding this comment.
ib-kubernetes will get the pf guid information via k8s node annotation.
e.g.
ib-kubernetes.io/pf-guids: '{"0":"c4:70:bd:03:00:eb:e7:a0","1":"c4:70:bd:03:00:eb:ed:18","2":"c4:70:bd:03:00:84:0c:c8","3":"c4:70:bd:03:00:eb:e0:cc"}'
There was a problem hiding this comment.
i see, so guid to pkey assignment happens sort of "out-of-band"
this bit is not part of the upstream ib-kubernenets[1] right ?
cmd/ib-sriov-cni/main.go
Outdated
| "infiniband SRIOV-CNI failed, Unexpected error. GUID must be provided by ib-kubernetes") | ||
| } | ||
|
|
||
| // Only load VF device info for VF devices (PF devices don't need this) |
There was a problem hiding this comment.
nit: PF device that is bound to vfio dont need this
8dba049 to
a73731a
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
cmd/ib-sriov-cni/main.go (1)
411-478: Consider locking for PF device cleanup for consistency.The runtime VF/PF detection (line 460) correctly addresses the past review concern about binary version mismatches. However, there's a locking asymmetry:
- cmdAdd: Lock acquired for both PF and VF (line 335)
- cmdDel: Lock acquired only for VF (line 471, after early return for PF at line 467)
While PF devices don't require VF-specific cleanup, they still perform cache file operations (line 429). For consistency and defensive coding, consider acquiring the lock before the VF/PF check to serialize all CNI Del operations, even for PF devices.
Proposed lock placement adjustment
defer netns.Close() + // Lock CNI operation to serialize the operation + lock, err := lockCNIExecution() + if err != nil { + return err + } + defer unlockCNIExecution(lock) + // Detect if device is VF or PF at runtime during Del isVF, err := utils.IsVirtualFunction(netConf.DeviceID) if err != nil { return fmt.Errorf("failed to determine if device %s is VF or PF: %v", netConf.DeviceID, err) } // PF devices don't need VF cleanup if !isVF { return nil } - // Lock CNI operation to serialize the operation - lock, err := lockCNIExecution() - if err != nil { - return err - } - defer unlockCNIExecution(lock) - return handleVFCleanup(sm, netConf, args, netns)
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
README.md(1 hunks)cmd/ib-sriov-cni/main.go(5 hunks)pkg/types/types.go(1 hunks)pkg/utils/utils.go(1 hunks)pkg/utils/utils_test.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- pkg/utils/utils_test.go
- pkg/types/types.go
- pkg/utils/utils.go
- README.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (go)
🔇 Additional comments (6)
cmd/ib-sriov-cni/main.go (6)
62-92: LGTM! Effective serialization of CNI operations.The file-based locking approach correctly prevents concurrent CNI invocations from causing RDMA resource naming conflicts. The signal handler ensures cleanup on termination.
98-116: LGTM! Proper VFIO detection and validation.The dual approach (explicit validation + auto-detection) ensures correct vfioPciMode configuration.
264-325: LGTM! Well-structured VF add handler.The function cleanly encapsulates VF configuration with proper error handling and cleanup. Line 316 correctly updates the result pointer to propagate IPAM results to the caller.
379-409: LGTM! Well-documented VF cleanup sequence.The cleanup ordering (ReleaseVF → RDMA namespace move → ResetVFConfig) is correctly documented and addresses kernel behavior around RDMA device recreation.
327-366: PF validation correctly enforces vfioPciMode requirement.The PF path validates that vfioPciMode is enabled (lines 349-351), addressing the previous concern. The result returned for PF devices contains the interface entry and sandbox path, which is spec-compliant without requiring IPAM data. Locking is applied consistently for both VF and PF paths.
138-170: The VF/PF detection flow and conditional logic are correctly implemented.The code properly determines device type via
IsVirtualFunction(which checks for thephysfnsysfs symlink), and the conditional checks for GUID and device info loading are appropriate: PF devices have their own GUID and don't require VF-specific device information, while VF devices need external GUID configuration when ib-kubernetes integration is enabled.
|
@adrianchiris PTAL my latest PR when you have time. |
a73731a to
30fae8f
Compare
|
@adrianchiris @SchSeba Can you guys review my latest changes when you have time? |
adrianchiris
left a comment
There was a problem hiding this comment.
some general questions, otherwise LGTM
| netConf.GUID = getGUIDFromConf(netConf) | ||
|
|
||
| // Ensure GUID was provided if ib-kubernetes integration is enabled | ||
| // Note: PF devices already have their own GUID, so only check for VF devices |
There was a problem hiding this comment.
i see, so guid to pkey assignment happens sort of "out-of-band"
this bit is not part of the upstream ib-kubernenets[1] right ?
|
10x for addressing my comments. LGTM |
|
@adrianchiris thanks for your time reviewing my PR. |
|
@adrianchiris @SchSeba please let me know if there pending task on my side. |
|
@SchSeba can we merge this one ? |
SchSeba
left a comment
There was a problem hiding this comment.
LGTM
can you just add some info in the readme that for PF we ONLY support VFIO this is not to move PF infiniband to the pod
(if I understand the PR right ;P )
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request effectively adds support for Physical Function (PF) passthrough, complementing the existing Virtual Function (VF) capabilities. However, it introduces several path traversal vulnerabilities by failing to sanitize user-controlled inputs from the CNI configuration (DeviceID, ContainerID, IfName) before using them to construct file system paths. This leads to a high-severity arbitrary file write vulnerability and a medium-severity information leak vulnerability, both requiring strict input validation at the source. On a positive note, the code is well-structured with clear logic separation for PF and VF handling in cmdAdd and cmdDel flows, and refactoring into functions like handleVFAdd and handleVFCleanup improves readability and maintainability. The IsVirtualFunction utility is correctly implemented and tested, and documentation has been updated. There are also minor suggestions to improve error message formatting for consistency.
| return fmt.Errorf("PF device %s requires vfioPciMode to be enabled", netConf.DeviceID) | ||
| } | ||
| // PF device - just cache config and return success | ||
| if err = utils.SaveNetConf(args.ContainerID, config.DefaultCNIDir, args.IfName, netConf); err != nil { |
There was a problem hiding this comment.
The call to utils.SaveNetConf uses args.ContainerID and args.IfName to construct a cache file path. These values come from the container runtime and are not sanitized. A malicious or compromised runtime could provide values with path traversal characters (../), causing the CNI to write its cache file to an arbitrary location on the host filesystem. This could overwrite critical files, leading to a denial of service or other security impacts.
Remediation: Sanitize args.ContainerID and args.IfName in main.go before passing them to utils.SaveNetConf. Ensure they do not contain any path-related characters.
| func IsVirtualFunction(pciAddr string) (bool, error) { | ||
| physfnPath := filepath.Join(SysBusPci, pciAddr, "physfn") |
There was a problem hiding this comment.
The new function IsVirtualFunction is vulnerable to path traversal. It accepts a pciAddr from the user-controlled CNI configuration (DeviceID) and uses it in filepath.Join without sanitization. This allows an attacker to use ../ sequences to check for the existence of arbitrary files on the host, leading to an information leak. This vulnerability pattern is repeated in several other utility functions in this file (e.g., GetPfName, GetVFLinkNames), indicating a systemic issue of trusting external input.
Remediation: All inputs from the CNI configuration that are used in file paths (DeviceID, IfName, ContainerID) must be validated against a strict allow-list or format (e.g., a regex for PCI addresses) immediately after being read. This validation should occur at the source, in cmd/ib-sriov-cni/main.go, to protect all downstream utility functions.
30fae8f to
6c2fe03
Compare
|
@SchSeba thanks for the review. |
This change enables ib-sriov-cni to support Physical Function (PF) passthrough in addition to Virtual Functions (VF), particularly for KubeVirt GPU workloads requiring InfiniBand RDMA. Closes k8snetworkplumbingwg#159 Signed-off-by: Zhen(Winson) Wang <zhewang@nvidia.com>
6c2fe03 to
bc7ebe5
Compare
|
@SchSeba can we merge this PR? |
|
@adrianchiris can we merge this PR? |
a121e7e
into
k8snetworkplumbingwg:master
|
done ! thx for working on it @winsopc :) |
This change enables ib-sriov-cni to support Physical Function (PF) passthrough in addition to Virtual Functions (VF), particularly for KubeVirt GPU workloads requiring InfiniBand RDMA.
Closes #159
Summary by CodeRabbit
New Features
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.