VEP-183: NetworkDevicesWithDRA API design (#185)
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of the listed files.
/cc @alaypatel07
cc @SchSeba @aojea @alaypatel07 @EdDev @orelmisan @LionelJouin, thanks
> To support custom MAC addresses for DRA-based SR-IOV networks, KubeVirt annotates the virt-launcher pod with requested MAC addresses. The MAC address is taken from the existing `spec.domain.devices.interfaces[].macAddress` field:
>
> ```
> kubevirt.io/dra-networks: '[{"claimName":"sriov","requestName":"vf","mac":"de:ad:00:00:be:ef"}]'
> ```
An alternative is to use the ResourceClaim and ResourceClaimTemplate opaque data to pass this information, so you don't need to embed it into an annotation; this information is cascaded to the driver via the kubelet NodePrepareResources hook:
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: sriov-network-claim-template
  namespace: default
spec:
  spec:
    devices:
      requests:
      - name: sriov-nic-request
        exactly:
          deviceClassName: sriov.network.example.com
      config:
      - opaque:
          driver: sriov.network.example.com
          parameters:
            interface:
              name: "enp0s1"
              mtu: 4321
              hardwareAddr: "00:11:22:33:44:55"
```

This way you can also create a schema and add validation; see how I do it in dranet:
https://github.com/kubernetes-sigs/dranet/blob/main/pkg/apis/types.go
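As a sketch of the schema-plus-validation idea, assuming hypothetical type and function names (modeled loosely on dranet's `pkg/apis/types.go`, not the actual SR-IOV driver API), the opaque parameters could be decoded and validated like this:

```go
// Hypothetical typed parameters for the opaque config shown above.
// All names here are illustrative assumptions, not the real driver API.
package main

import (
	"encoding/json"
	"fmt"
	"net"
)

// InterfaceConfig mirrors the `parameters.interface` block.
type InterfaceConfig struct {
	Name         string `json:"name,omitempty"`
	MTU          int    `json:"mtu,omitempty"`
	HardwareAddr string `json:"hardwareAddr,omitempty"`
}

// Validate rejects malformed values before the driver applies them.
func (c *InterfaceConfig) Validate() error {
	if c.HardwareAddr != "" {
		if _, err := net.ParseMAC(c.HardwareAddr); err != nil {
			return fmt.Errorf("invalid hardwareAddr %q: %w", c.HardwareAddr, err)
		}
	}
	if c.MTU != 0 && (c.MTU < 68 || c.MTU > 65535) {
		return fmt.Errorf("mtu %d out of range", c.MTU)
	}
	return nil
}

// DecodeParameters parses the raw opaque parameters JSON from the claim.
func DecodeParameters(raw []byte) (*InterfaceConfig, error) {
	var wrapper struct {
		Interface InterfaceConfig `json:"interface"`
	}
	if err := json.Unmarshal(raw, &wrapper); err != nil {
		return nil, err
	}
	if err := wrapper.Interface.Validate(); err != nil {
		return nil, err
	}
	return &wrapper.Interface, nil
}
```

This is roughly what an opaque-config schema buys: the driver can fail fast on a bad MAC or MTU instead of discovering it at device-preparation time.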
Thanks, we discussed this approach.
It is a ResourceClaimTemplate, which means k8s is the one that creates ResourceClaims based on it, and it is a singleton, while each virt-launcher pod needs a different MAC; hence it is not scalable.
The MAC approach allows minimal maintenance by KubeVirt, and a per-pod custom MAC.
It solves the specific case where the ResourceClaim/ResourceClaimTemplate isn't created by KubeVirt, so it is good to use the MAC annotation approach also for the generic case where KubeVirt itself creates the ResourceClaim ("auto" mode), so there will be one solution for all the cases.
Agree, here we can't have this one if we want to re-use the ResourceClaimTemplate for multiple pods/VMs.
Then we can also continue to use the KubeMacPool system to assign the MAC address (we may just need to adjust the annotation).
Another option is to have a mutating webhook on the ResourceClaim that injects this opaque MAC address to get unique MAC addresses, but again, I don't know how much work that is.
There will be an e2e test that injects a MAC and depends on the current value being handled.
This is what we already do for the current Multus + DP approach; the same tests will be duplicated to use DRA as well.
Multus is fine I think, as it's a full implementation. DRA is a Kubernetes feature that has multiple implementations (DRANET, SR-IOV DRA Driver, CNI DRA Driver...), so having conformance tests could help determine which DRA driver is KubeVirt networking compliant.
Until now, we thought that any driver which has a pciAddress attribute or an mdevUUID attribute can be compatible with KubeVirt. But this use case shows that advanced features in KubeVirt might require more from a driver than just providing those attributes, in which case conformance tests will be helpful. It could potentially be tracked as part of graduating VEP-10 to GA.
If the ResourceClaimTemplate object could be embedded in the pod spec, would this help avoid having this custom annotation?
I remember very early on the plan was to embed the template in the pod spec, but because DRA was not stable and the pod spec is a v1 GA object, it was decided to move the template into its own API.
Assuming you mean that k8s will create a ResourceClaim if a ResourceClaimTemplate is embedded in the pod, right?
This is a precondition for this idea.
But I think that we need to consider all the cases:

- Existing ResourceClaim
- Existing ResourceClaimTemplate
- KubeVirt "auto" mode, where KubeVirt is the one that creates the ResourceClaim

And also consider this: let's say a ResourceClaimTemplate exists; we copy it to the pod and add the MAC there, but now upon migration we need to copy the pod's ResourceClaimTemplate, because the original might have changed and we can't depend on it.
While this is a nice idea, given the above, we should start simple IMHO, and later consider adjusting according to needs.
Moreover, I think it means embedding a whole CRD instead of an annotation, with the cons above.
We can make the annotation standard, and it is pretty straightforward and simpler.
Note, I didn't think deeper about the other cases; I took one edge example to show why it makes things harder.
Another con: once we don't have a MAC we don't need it, but we can't have two mechanisms, so it bloats the non-MAC path.
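For concreteness, the MAC-annotation approach defended in this thread could be consumed with a small parser like the sketch below. It assumes the annotation format shown earlier in the diff; the type and function names are illustrative, not KubeVirt's actual code:

```go
// Illustrative parser for the kubevirt.io/dra-networks annotation value
// (an assumption based on the example in this PR, not KubeVirt's code).
package main

import (
	"encoding/json"
	"fmt"
)

// DRANetworkEntry is one element of the annotation's JSON array.
type DRANetworkEntry struct {
	ClaimName   string `json:"claimName"`
	RequestName string `json:"requestName"`
	MAC         string `json:"mac,omitempty"`
}

// ParseDRANetworks decodes the annotation into per-claim/request entries.
func ParseDRANetworks(value string) ([]DRANetworkEntry, error) {
	var entries []DRANetworkEntry
	if err := json.Unmarshal([]byte(value), &entries); err != nil {
		return nil, fmt.Errorf("invalid kubevirt.io/dra-networks annotation: %w", err)
	}
	return entries, nil
}
```

Note that `mac` is optional here, matching the point above that the annotation is only populated when a custom MAC is actually requested.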
> ## Goals
>
> - Align on the API changes needed to consume DRA-enabled SR-IOV network devices in KubeVirt
Please consider a different verb, here perhaps "Introduce"?
> - Align on the API changes needed to consume DRA-enabled SR-IOV network devices in KubeVirt
> - Align on how KubeVirt will consume SR-IOV devices via external DRA drivers
> - Enable DRA SR-IOV use cases available to containers to work seamlessly with KubeVirt VMIs
Suggested change:

```diff
- - Enable DRA SR-IOV use cases available to containers to work seamlessly with KubeVirt VMIs
+ - Seamlessly support container based SR-IOV use cases in KubeVirt VMIs
```
> ## Non Goals
>
> - Replace existing Multus-based SR-IOV network integration (remains fully supported)
> - Deploy DRA SR-IOV driver (handled by sriov-network-operator)
Consider omitting "(handled by sriov-network-operator)", as it is not part of KubeVirt, nor is it a prerequisite. You can add a note that recommends it somewhere further down.
It is a non-goal that emphasizes the fact that deployment is not handled here, because deployment side by side with the DP is tricky, yet to be determined, and will be solved by the sriov-operator.
> - Deploy DRA SR-IOV driver (handled by sriov-network-operator)
> - Support coexistence of DRA SR-IOV and device-plugin SR-IOV
> - Live migration of VMs with DRA network devices
> - Align on what drivers KubeVirt will support in tree
I can drop it, as it is mentioned in different ways already.
It is related to the fact that we don't yet support DP and DRA drivers side by side at the same time, so we will just deploy one at a time until the sriov-operator solves it, and then we will be able to deploy them together.
The reason is that the drivers need to consider each other when allocating the same devices, and this is what is not solved yet.
EDIT - dropped
> The VMI must also include the resource claim in `spec.resourceClaims[]` (consistent with GPU and HostDevice DRA usage).
We already say in the beginning:

> This VEP builds upon the core DRA infrastructure defined in VEP #10 (https://github.com/kubevirt/enhancements/pull/11) to add support for network devices, specifically SR-IOV NICs.

I prefer the DRY (don't repeat yourself) rule.
> ### Custom MAC Address Support
>
> To support custom MAC addresses for DRA-based SR-IOV networks, KubeVirt annotates the virt-launcher pod with requested MAC addresses. The MAC address is taken from the existing `spec.domain.devices.interfaces[].macAddress` field:
- The present tense here is confusing.
- Will the annotation also be populated when a custom MAC address isn't specified?
- Fixed.
- No, there is no need to.
> **Virt-Launcher:**
> - For SR-IOV networks with DRA, virt-launcher uses `vmi.status.deviceStatus` to generate the domain XML instead of KubeVirt's downwardAPI file as in the case of device-plugins
> - The `CreateDRAHostDevices()` function generates hostdev XML by:
>   - Filtering interfaces with SRIOV binding that reference networks with resourceClaim source
>   - Looking up the corresponding device status entry by network name
> - Consider additional use cases if any
> - Work with Kubernetes community on standardizing device information injection
> - Live migration support for DRA network devices
>   - Live migration will use CDI/NRI to inject device information as files into each pod (mappings of request/claim to PCI addresses)
Not clear to me, but maybe it's on me to learn.
It is not part of this VEP, just meant to give a highlight.
At a high level, in the long term k8s will create a downward API: each pod will have a file with the device info; CDI/NRI is an implementation detail.
kubernetes/enhancements#5606
Until that happens, we might create this mechanism in the SR-IOV driver (a PoC that we will continue according to needs):
k8snetworkplumbingwg/dra-driver-sriov#59 (PoC, closed for now, but it shows how it works: CDI prepares the pod manifest, NRI populates data, and then the container creates the file)
k8snetworkplumbingwg/dra-driver-sriov#62 (WIP)
A nice diagram wrt this: kubernetes/enhancements#5606 (comment)
Also #155 (the VEP 10 amendment) already talks about using the downward API.
> - Looking up the corresponding device status entry by network name
> - Extracting the PCI address from device status attributes
> - Generating standard libvirt hostdev XML
> - **Note:** If the ResourceClaim/ResourceClaimTemplate has `count != 1`, KubeVirt will consume the first device from the allocated devices
Just a detail: to allocate multiple devices via a ResourceClaim, `count` could be used or `allocationMode` set to `all`. Also, I am not sure it is guaranteed that the list of allocated devices will be returned in a deterministic order.
Suggested change:

```diff
- - **Note:** If the ResourceClaim/ResourceClaimTemplate has `count != 1`, KubeVirt will consume the first device from the allocated devices
+ - **Note:** If the ResourceClaim/ResourceClaimTemplate is allocating more than one device for the request, KubeVirt will consume the first device from the allocated devices
```
Thanks, will change.
Please note that this logic (take the first in the list) won't be added as part of this VEP, since it depends on the downward API, which isn't part of this VEP (a new VEP will be created afterwards).
EDIT - done
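To illustrate the "take the first allocated device" behavior discussed above, here is a minimal sketch with trimmed stand-in types; the real data lives in `resourceclaim.status.allocation.devices.results`, and the helper name is hypothetical:

```go
// Sketch of "consume the first allocated device" with stand-in types;
// not KubeVirt code.
package main

import "fmt"

// AllocatedDevice is a trimmed stand-in for one entry of
// resourceclaim.status.allocation.devices.results.
type AllocatedDevice struct {
	Request string // device request the allocation satisfied
	Driver  string
	Pool    string
	Device  string
}

// FirstDeviceForRequest picks the first device allocated for requestName.
// Per the reviewer's caveat above, the order of results may not be
// deterministic, so "first" is best-effort.
func FirstDeviceForRequest(results []AllocatedDevice, requestName string) (AllocatedDevice, error) {
	for _, r := range results {
		if r.Request == requestName {
			return r, nil
		}
	}
	return AllocatedDevice{}, fmt.Errorf("no allocation found for request %q", requestName)
}
```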
> // ResourceClaimNetworkSource represents a network resource requested
> // via a Kubernetes ResourceClaim.
> type ResourceClaimNetworkSource struct {
Any reason why ClaimRequest cannot be used here? https://github.com/kubevirt/kubevirt/blob/main/staging/src/kubevirt.io/api/core/v1/schema.go#L635
Yes, because for network DRA, requestName must exist, as was discussed in the meeting, because we support only one per (network, interface) tuple atm.
Always having requestName makes sure that it will support a ResourceClaim[Template] with either one device type or more; however, for GPU and HostDevices it isn't required.
Yet there is a bug there:
kubevirt/kubevirt#16319 (comment)
Once I reach it, I will open an issue (with a better explanation) about it so we can discuss and fix it.
Created an issue for this:
kubevirt/kubevirt#16771
Commented on the issue.

> Yet there is a bug there

I am not sure if this is a bug; there is an intentional API decision in the GPU struct that each device is a single element in the list. Having an empty requestName gives flexibility, but I can't think of any use case for this flexibility.
If requestName isn't supplied, it fails to find it using the current logic, without failing earlier with a deterministic assertion, hence at least that should be fixed IMO.
Will answer verbosely on kubevirt/kubevirt#16771 with the options, at least as I see them, thanks.
> // ResourceClaimNetworkSource represents a network resource requested
> // via a Kubernetes ResourceClaim.
> type ResourceClaimNetworkSource struct {
>     // ClaimName references the name of a ResourceClaim in the
If ClaimName directly references the name of the ResourceClaim in the namespace, what is the point of having the resource claim mentioned in spec.resourceClaims[]?
Because spec.resourceClaims[] is the list that already exists for all DRA types, and in this struct ResourceClaimNetworkSource we have claim + request, there can be multiple ResourceClaimNetworkSource entries with different requests that point to the same spec.resourceClaims[] entry.
This was the design selected based on @aojea's idea, after co-op with Sebastian and Edy and the people in the DRA meeting, given the current DRA API.
EDIT - maybe the comment is wrong and that's what you mean: it points to a spec.resourceClaims entry, not directly to a CR. Fixed the comment, thanks.
Btw this also has a bug that I will open an issue about: @SchSeba found that, generally speaking, we don't render the pod correctly in some more advanced cases.
Here is more info; it is out of the scope of this VEP:
kubevirt/kubevirt@164297c
I will open an issue about it.
> EDIT - maybe the comment is wrong and that's what you mean: it points to a spec.resourceClaims entry, not directly to a CR. Fixed the comment, thanks.

+1, this is what I meant.
Yes, please open a bug; I think we can fix it as part of the VEP-10 beta work.

> @SchSeba found that, generally speaking, we don't render the pod correctly in some more advanced cases. Here is more info; it is out of the scope of this VEP: kubevirt/kubevirt@164297c. I will open an issue about it.
Updated the comment.
Created here, self-assigned because I already have a WIP direction:
kubevirt/kubevirt#16769
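Pulling this thread together, a plausible completed form of the struct, folding in the comment fix discussed here (ClaimName points at a `spec.resourceClaims[]` entry, and RequestName is required for networks), might look like the sketch below; the final API in the VEP may differ:

```go
// Sketch of the struct under discussion; field semantics taken from this
// thread, not from merged KubeVirt code.
package main

// ResourceClaimNetworkSource represents a network resource requested
// via a Kubernetes ResourceClaim.
type ResourceClaimNetworkSource struct {
	// ClaimName references the name of an entry in the VMI's
	// spec.resourceClaims[] list (not a ResourceClaim object directly).
	ClaimName string `json:"claimName"`
	// RequestName selects the device request within the referenced claim.
	// Unlike GPU/HostDevice DRA, it is required for networks, so multiple
	// networks can point at different requests of the same claim entry.
	RequestName string `json:"requestName"`
}
```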
Thanks all for the review.
Addressed / answered all comments, thank you.
Document support for DRA-provisioned network devices, specifically SR-IOV NICs.

Key additions:
- ResourceClaimNetworkSource API for specifying DRA networks in spec.networks
- SR-IOV integration details: device allocation via DRA, configuration via the existing KubeVirt networks API.
- Custom MAC address support through kubevirt.io/dra-networks.
- NetworkDevicesWithDRA feature gate (Alpha).
- Example VMI YAML with DRA SR-IOV network configuration.

Network DRA maintains mutual exclusivity with traditional Multus-based SR-IOV per VM. The existing Multus SR-IOV API remains fully supported and unchanged.

Assisted-by: claude-4.5-sonnet
Signed-off-by: Or Shoval <oshoval@redhat.com>
Fixed the comment according to the #185 (comment) thread.
> 1. The network admitter validates that exactly one network source type (pod, multus, or resourceClaim) is specified
> 2. Virt-controller adds the resource claim to the virt-launcher pod spec via `WithNetworksDRA()` render option
> 3. The DRA controller populates `vmi.status.deviceStatus` with the PCI address from the ResourceSlice
If I'm not mistaken, you must query the ResourceClaim in order to find the ResourceSlice element name(s) that were allocated to this pod.
Field ref: resourceclaim.status.allocation.devices.results.device
The DRA controller (#10) already populates the data from the ResourceClaim to vmi.status.deviceStatus, so we just need to read it.
We show the flow here; it is not an addition by this VEP.
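Step 1 of the flow quoted above (exactly one network source type) can be sketched like this, with trimmed stand-in types; the real admitter validates the full KubeVirt API:

```go
// Sketch of the "exactly one network source" admitter check with
// stand-in types; not KubeVirt's actual webhook code.
package main

import "fmt"

// NetworkSource is a trimmed stand-in for the VMI network source union.
type NetworkSource struct {
	Pod           *struct{}
	Multus        *struct{}
	ResourceClaim *struct{ ClaimName, RequestName string }
}

// ValidateSingleSource enforces that exactly one source type is set.
func ValidateSingleSource(name string, src NetworkSource) error {
	count := 0
	if src.Pod != nil {
		count++
	}
	if src.Multus != nil {
		count++
	}
	if src.ResourceClaim != nil {
		count++
	}
	if count != 1 {
		return fmt.Errorf("network %q must specify exactly one source type, found %d", name, count)
	}
	return nil
}
```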
> ```yaml
> devices:
>   requests:
>   - name: sriov-nic-request
>     exactly:
> ```
Does this select exactly 1 resource?
> ```yaml
> spec:
>   domain:
>     devices:
>       interfaces:
> ```
I recommend adding one more SR-IOV interface in order to demonstrate what gets multiplied (ifaces, networks, resourceClaims?) and what remains singular (deviceClass, resourceClaimTemplate).
Well, we don't support it correctly yet because of kubevirt/kubevirt#16769, so I prefer not to; once we do, I will amend it.
As far as I know we need this; once we support the downward API, maybe we will be able to simplify it (and to relax some of the webhook limitations), right @SchSeba?
```yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: vf-test2
  name: single-vf
spec:
  spec:
    devices:
      requests:
      - name: vf1
        exactly:
          deviceClassName: sriovnetwork.k8snetworkplumbingwg.io
      - name: vf2
        exactly:
          deviceClassName: sriovnetwork.k8snetworkplumbingwg.io
      config:
      - requests: ["vf1", "vf2"]
        opaque:
          driver: sriovnetwork.k8snetworkplumbingwg.io
          parameters:
            apiVersion: sriovnetwork.k8snetworkplumbingwg.io/v1alpha1
            kind: VfConfig
            netAttachDefName: vf-test
            driver: vfio-pci
            addVhostMount: true
---
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: testvmi
  namespace: vf-test2
spec:
  runStrategy: Always
  template:
    metadata:
      labels:
        kubevirt.io/domain: testvmi
    spec:
      architecture: amd64
      domain:
        cpu:
          cores: 1
          maxSockets: 4
          model: host-model
          sockets: 1
          threads: 1
        devices:
          autoattachGraphicsDevice: false
          disks:
          - disk:
              bus: virtio
            name: disk0
          - disk:
              bus: virtio
            name: cloudinitdisk
          interfaces:
          - masquerade: {}
            name: default
          - sriov: {}
            name: red
            macAddress: "de:ad:00:00:be:ef"
          - sriov: {}
            name: blue
          rng: {}
        features:
          acpi:
            enabled: true
        firmware:
          uuid: d7873366-c2b0-4a56-bfc2-1e1bee0a88db
        machine:
          type: q35
        memory:
          guest: 1Gi
        resources:
          requests:
            memory: 1Gi
      evictionStrategy: None
      networks:
      - name: default
        pod: {}
      - name: red
        resourceClaim:
          claimName: sriov
          requestName: vf1
      - name: blue
        resourceClaim:
          claimName: sriov
          requestName: vf2
      terminationGracePeriodSeconds: 0
      resourceClaims:
      - name: sriov
        resourceClaimTemplateName: single-vf
      volumes:
      - containerDisk:
          image: quay.io/kubevirt/fedora-with-test-tooling-container-disk:devel
          imagePullPolicy: IfNotPresent
        name: disk0
      - cloudInitNoCloud:
          networkData: |
            ethernets:
              eth0:
                addresses:
                - fd10:0:2::2/120
                dhcp4: true
                gateway6: fd10:0:2::1
                match: {}
                nameservers:
                  addresses:
                  - 10.96.0.10
                  search:
                  - default.svc.cluster.local
                  - svc.cluster.local
                  - cluster.local
            version: 2
        name: cloudinitdisk
```
cc @aojea @alaypatel07 @oshoval I've taken the initiative to prototype VEP-183 and successfully tested the DRA SR-IOV integration with real hardware. I skipped a few points such as the admission webhook and assigning custom MAC addresses, since the purpose was fast prototyping and figuring out blockers. You can take a look at the implementation at kubevirt/kubevirt#16799. I'm happy to help you finish this proposal and implement it, since we have a few people waiting for this feature. Let me know if you need any help from my side.

Test Environment:
- Following is the resourceClaimTemplate I used
- Following is the VMI spec I used for testing
- Following is the pod created
- Following is the resourceClaim
Hi @bgchun-fs. Btw, in addition
Please see
> Webhook validations ensure:
> 1. Networks with `resourceClaim` source have corresponding `sriov` binding interfaces
> 2. Each network must reference a unique `claimName` + `requestName` combination. No two DRA entities (networks, hostDevices, or GPUs) can share the same tuple, as each interface+network pair must map to exactly one device allocation
For now we will enforce it just for network versus network, not across GPU/HostDevice.
If it is also needed across entities, better in a separate PR.
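Validation 2 above, restricted per this reply to network-versus-network enforcement, can be sketched as follows with hypothetical types:

```go
// Sketch of the unique claimName+requestName check, network-vs-network
// only; types and names are illustrative, not KubeVirt webhook code.
package main

import "fmt"

// ClaimRef is the claimName+requestName tuple a network points at.
type ClaimRef struct {
	ClaimName   string
	RequestName string
}

// NetworkClaim pairs a network name with its claim reference.
type NetworkClaim struct {
	Name string
	Ref  ClaimRef
}

// ValidateUniqueClaimRequests rejects two networks that share the same
// claimName+requestName tuple.
func ValidateUniqueClaimRequests(networks []NetworkClaim) error {
	seen := map[ClaimRef]string{}
	for _, n := range networks {
		if other, dup := seen[n.Ref]; dup {
			return fmt.Errorf("networks %q and %q share claim %q request %q",
				other, n.Name, n.Ref.ClaimName, n.Ref.RequestName)
		}
		seen[n.Ref] = n.Name
	}
	return nil
}
```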
VEP Metadata

- Tracking issue: #183
- Upstream issue: kubevirt/kubevirt#15995
- SIG label: /sig network

What this PR does

Document support for DRA-provisioned network devices, specifically SR-IOV NICs.

Key additions:
- ResourceClaimNetworkSource API for specifying DRA networks in spec.networks
- SR-IOV integration details: device allocation via DRA, configuration via the existing KubeVirt networks API
- Custom MAC address support through kubevirt.io/dra-networks
- NetworkDevicesWithDRA feature gate (Alpha)
- Example VMI YAML with DRA SR-IOV network configuration

Network DRA maintains mutual exclusivity with traditional Multus-based SR-IOV per VM. The existing Multus SR-IOV API remains fully supported and unchanged.

Thanks @SchSeba for the co-op, and to all who contributed.

Special notes for your reviewer