|
| 1 | +# VEP #109: Implement vGPU Enabled Live Migration |
| 2 | + |
| 3 | +## Release Signoff Checklist |
| 4 | + |
| 5 | +Items marked with (R) are required *prior to targeting to a milestone / release*. |
| 6 | + |
| 7 | +- [X] (R) Enhancement issue created, which links to VEP dir in [kubevirt/enhancements] (not the initial VEP PR) |
| 8 | +- [ ] (R) Target version is explicitly mentioned and approved |
| 9 | +- [X] (R) Graduation criteria filled |
| 10 | + |
| 11 | +## Overview |
| 12 | + |
| 13 | +This is a proposal to allow live migrations in KubeVirt to work for VMs with a single NVIDIA vGPU, exposed by mdev, between two nodes in the same cluster with identical GPUs and GPU drivers. |
| 14 | + |
| 15 | +## Motivation |
| 16 | + |
| 17 | +GPU usage is growing as more companies run AI workloads, and those companies are now requesting live migration support for GPU-enabled VMs. |
| 18 | + |
| 19 | +## Goals |
| 20 | + |
| 21 | +* Address a common live migration problem where the target needs to update the destination Libvirt XML. In the case of mdevs, it needs to update the mdev UUID in the XML. |
| 22 | +* Support single-vGPU live migrations both for clusters that use the NVIDIA GPU Operator and for clusters that use KubeVirt’s generic device plugin for mdev. |
| 23 | +* Support single-vGPU live migrations with minimal data loss when dirty rates are high. |
| 24 | + |
| 25 | +## Non Goals |
| 26 | + |
| 27 | +* Changing the live migration workflow for VMs without a vGPU. |
| 28 | +* Supporting live migration for passthrough GPUs or SR-IOV vGPUs. |
| 29 | +* Supporting cross-cluster live migrations. |
| 30 | +* Supporting live migrations for VMs with multiple vGPUs. |
| 31 | + |
| 32 | +## Definition of Users |
| 33 | + |
| 34 | +* **KubeVirt Administrators:** Users who have cluster wide privileges to trigger APIs to manage a cluster. |
| 35 | +* **KubeVirt Owner:** VM workload owners who want high availability for their VMs. |
| 36 | + |
| 37 | +## User Stories |
| 38 | + |
| 39 | +As a KubeVirt admin/owner, I want to be able to live migrate my VMs that have an NVIDIA vGPU. |
| 40 | + |
| 41 | +## Repos |
| 42 | + |
| 43 | +https://github.com/kubevirt/kubevirt |
| 44 | + |
| 45 | +## Design |
| 46 | + |
| 47 | +For Alpha, GPU driver versions on all worker nodes must be identical. The migration will not be successful if there is a version mismatch, so users must ensure this. This will be addressed and updated during Beta. |
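The identical-driver requirement above could be checked by an operator before a migration is triggered. The following is a minimal sketch in Go, assuming each node advertises its vGPU driver version through a node label. The label key `example.com/vgpu-driver-version` and the function `checkDriverVersions` are hypothetical and are not part of KubeVirt or the NVIDIA GPU Operator.

```go
package main

import "fmt"

// driverVersionLabel is a HYPOTHETICAL node label key used only for this sketch.
const driverVersionLabel = "example.com/vgpu-driver-version"

// checkDriverVersions returns an error unless both nodes carry the label
// and the two label values are identical.
func checkDriverVersions(sourceLabels, targetLabels map[string]string) error {
	src, ok := sourceLabels[driverVersionLabel]
	if !ok {
		return fmt.Errorf("source node has no %s label", driverVersionLabel)
	}
	dst, ok := targetLabels[driverVersionLabel]
	if !ok {
		return fmt.Errorf("target node has no %s label", driverVersionLabel)
	}
	if src != dst {
		return fmt.Errorf("GPU driver mismatch: source %q, target %q", src, dst)
	}
	return nil
}

func main() {
	source := map[string]string{driverVersionLabel: "1.0.0"}
	target := map[string]string{driverVersionLabel: "1.0.1"}
	if err := checkDriverVersions(source, target); err != nil {
		fmt.Println("migration should not be started:", err)
	}
}
```

An exact string comparison is used on purpose: for Alpha the proposal requires identical versions, so no compatibility range is assumed.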
| 48 | + |
| 49 | +[VEP 141](https://github.com/kubevirt/enhancements/issues/141) introduces a KubeVirt feature gate, `TargetSideMigrationHooks`, to register and write QEMU hooks for the target `virt-launcher`. We will use this new infrastructure to mutate the domain XML with the updated mdev UUID, which is the one assigned to the target `virt-launcher` by `gpu.CreateHostDevices()` in `manager.go`. vGPU live migration will only be available when the `TargetSideMigrationHooks` feature gate is enabled. |
| 50 | + |
| 51 | +Once the destination XML contains the correct fields, the live migration can begin. Libvirt/QEMU already support vGPU live migration for mdev (since Libvirt 8.6.0 and QEMU 8.1.0) and will perform the actual migration, so KubeVirt needs no further work to migrate the vGPU itself. However, some migration settings at the Libvirt/QEMU level, such as the migration method or the downtime limit, may still need to be configured. |
| 52 | + |
| 53 | +### Example |
| 54 | +XML snippet before hook: |
| 55 | +``` |
| 56 | +<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on' ramfb='on'> |
| 57 | +  <source> |
| 58 | +    <address uuid='bb4a98d8-60c1-40c6-b39b-866b1e82bd8c'/> |
| 59 | +  </source> |
| 60 | +  <alias name='ua-gpu-gpu1'/> |
| 61 | +  <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/> |
| 62 | +</hostdev> |
| 63 | +``` |
| 64 | + |
| 65 | +XML snippet after hook (address uuid updated): |
| 66 | +``` |
| 67 | +<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='on' ramfb='on'> |
| 68 | +  <source> |
| 69 | +    <address uuid='05b59010-d19c-47d2-9477-33b4579edc90'/> |
| 70 | +  </source> |
| 71 | +  <alias name='ua-gpu-gpu1'/> |
| 72 | +  <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/> |
| 73 | +</hostdev> |
| 74 | +``` |
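The rewrite shown in the two snippets above can be sketched in Go as follows. This is an illustration only, not the hook implementation from VEP 141: the function `replaceMdevUUID` is hypothetical, and it assumes a single mdev `<hostdev>` whose `<source>` holds an `<address uuid='…'/>` element. It patches the UUID with a regular expression instead of re-serializing the document, so the rest of the domain XML stays byte-for-byte unchanged.

```go
package main

import (
	"fmt"
	"regexp"
)

// mdevUUIDRe finds the first <hostdev ... type='mdev' ...> element and
// captures the UUID inside its <source><address uuid='...'/> child.
var mdevUUIDRe = regexp.MustCompile(
	`(?s)<hostdev\b[^>]*\btype=['"]mdev['"][^>]*>.*?<source>\s*<address\s+uuid=['"]([0-9a-fA-F-]{36})['"]`)

// replaceMdevUUID (hypothetical helper) swaps the UUID of the first mdev
// hostdev for newUUID and leaves every other byte of the XML as it was.
func replaceMdevUUID(domainXML, newUUID string) (string, error) {
	loc := mdevUUIDRe.FindStringSubmatchIndex(domainXML)
	if loc == nil {
		return "", fmt.Errorf("no mdev hostdev with a source address uuid found")
	}
	// loc[2]:loc[3] spans capture group 1, the old UUID.
	return domainXML[:loc[2]] + newUUID + domainXML[loc[3]:], nil
}

func main() {
	before := `<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci'>
  <source>
    <address uuid='bb4a98d8-60c1-40c6-b39b-866b1e82bd8c'/>
  </source>
  <alias name='ua-gpu-gpu1'/>
</hostdev>`

	after, err := replaceMdevUUID(before, "05b59010-d19c-47d2-9477-33b4579edc90")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(after)
}
```

A real hook would likely also validate the new UUID and would need a different strategy once multiple vGPUs are supported, which is a non-goal of this proposal.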
| 75 | + |
| 76 | +**Failed migrations:** Cleanup will be performed by existing code and by code introduced in PR [16212](https://github.com/kubevirt/kubevirt/pull/16212). |
| 77 | + |
| 78 | +## API Examples |
| 79 | + |
| 80 | +N/A |
| 81 | + |
| 82 | +## Alternatives |
| 83 | + |
| 84 | +Instead of relying on a QEMU hook, a Libvirt API could be introduced to allow KubeVirt to update the destination XML at the start of migration via callbacks. However, previous discussions asking for this API haven’t made progress. |
| 85 | + |
| 86 | +## Scalability |
| 87 | + |
| 88 | +The Unix socket used will be `/var/run/kubevirt/migration-hook-socket`, introduced in PR [16212](https://github.com/kubevirt/kubevirt/pull/16212). A target `virt-launcher` pod will have at most one such socket open at a time, so it should be possible to live migrate a large number of VMs concurrently without significant performance issues. KubeVirt also imposes its own limits on the number of concurrent live migrations, both per node and cluster-wide. |
| 89 | + |
| 90 | +## Update/Rollback Compatibility |
| 91 | + |
| 92 | +* Requires the `TargetSideMigrationHooks` feature gate from PR [16212](https://github.com/kubevirt/kubevirt/pull/16212) to be enabled. |
| 93 | +* Will be safe during upgrades as long as the newer node's mdev UUIDs don't change unexpectedly. |
| 94 | + |
| 95 | +## Functional Testing Approach |
| 96 | + |
| 97 | +* Unit tests: Verify that the VM is able to live migrate with the vGPU given the proper conditions. |
| 98 | +* [Optional] Also verify that this works with the NVIDIA GPU Operator. |
| 99 | + |
| 100 | +## Implementation History |
| 101 | + |
| 102 | +N/A |
| 103 | + |
| 104 | +## Graduation Requirements |
| 105 | + |
| 106 | +### Alpha |
| 107 | + |
| 108 | +* Implement basic functionality and testing. |
| 109 | +* Limitations |
| 110 | + * Users must ensure all worker nodes have identical GPU driver versions, since KubeVirt will not take this into account when scheduling the migration. |
| 111 | + * KubeVirt is unable to estimate the maximum period for the migration. It will use a hard limit equal to the existing calculated values (which ignore GPU information). |
| 112 | +* Figure out how to handle any data loss during the migration. |
| 113 | + |
| 114 | +### Beta |
| 115 | + |
| 116 | +* No longer require users to ensure all worker nodes have identical GPU driver versions. KubeVirt will take the driver version into account when scheduling the migration. |
| 117 | +* Find a way to estimate the maximum period for the migration. |
| 118 | +* Needs [VEP 141](https://github.com/kubevirt/enhancements/issues/141) to be in Beta. |
| 119 | + |
| 120 | +### GA |
| 121 | + |
| 122 | +* Needs [VEP 141](https://github.com/kubevirt/enhancements/issues/141) to be in GA. |