vsock: Enable live migrations (snapshot-restore) #936
```diff
@@ -129,6 +129,18 @@ In this configuration:
 - The host must have vsock support (e.g., `vsock_loopback` kernel module loaded)
 - For testing, you can load the module with: `modprobe vsock_loopback`
+
+## Live migration
+
+This device implementation advertises support for live migration by offering the VHOST_USER_PROTOCOL_F_DEVICE_STATE protocol feature. This doesn't work with QEMU yet, however, because QEMU marks its vsock frontend as "unmigratable". The feature does work with CrosVm and potentially other virtual machine managers.
+
+The device itself doesn't save or restore any state during a live migration. It relies instead on the frontend to save the vrings' state and the negotiated features. It also expects the frontend to "kick" the queues that have pending buffers in them, since the driver probably kicked those queues before the migration and won't do so again.
+
+The state saving flow is trivial: as mentioned, the device saves no state.
+
+The state loading flow is more involved, because the virtio-vsock spec mandates that the device send a VIRTIO_VSOCK_EVENT_TRANSPORT_RESET event to the driver. During a restore the backend is started no differently than during a regular boot. When the frontend sends the VHOST_USER_SET_DEVICE_STATE_FD command with the LOAD direction, the backend doesn't load anything, but it takes note that a transport reset event needs to be sent to the driver via the event vring when possible. To make sure this event is sent only when the queue is ready, the backend waits for the event queue to be kicked before sending it. While these kicks usually come from the driver, this particular one is actually sent by the vhost-user frontend. The implementation depends on the frontend kicking all queues with pending buffers after a restore, because the driver is unlikely to do so, having probably done it before the snapshot was taken.
```
**Member:** This is the hack, IMO. The kick means: "hey, I'm the driver and there is a new available buffer for you in the virtqueue", while here we are using it for something completely different, which is not even in the spec. If this is what we want from the frontend (and I don't understand why we want this), we should put it into the spec and not guess here that we will receive that event.
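The transport reset event discussed above is tiny on the wire. As a sketch (not code from this PR; the helper name is illustrative), the virtio spec's `struct virtio_vsock_event` holds a single little-endian 32-bit event id, and VIRTIO_VSOCK_EVENT_TRANSPORT_RESET is id 0:

```rust
// Sketch of the payload a backend writes into an event-queue buffer.
// `struct virtio_vsock_event` in the virtio spec is one le32 event id;
// VIRTIO_VSOCK_EVENT_TRANSPORT_RESET is defined as 0.
const VIRTIO_VSOCK_EVENT_TRANSPORT_RESET: u32 = 0;

// Illustrative helper: serialize the event id little-endian, matching the
// le32 field in the spec.
fn encode_transport_reset_event() -> [u8; 4] {
    VIRTIO_VSOCK_EVENT_TRANSPORT_RESET.to_le_bytes()
}

fn main() {
    assert_eq!(encode_transport_reset_event(), [0, 0, 0, 0]);
}
```

Writing these four bytes into a driver-provided event-queue buffer and signaling the used ring is all the "reset event" amounts to; the hard part, as this thread discusses, is knowing when the event queue is ready to accept it.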
```diff
+In response to the transport reset event the driver drops any existing connections and reads the configuration space again. To prevent the driver from dropping any new connections established after the restore, the backend doesn't forward any packets from outside the VM to the driver until it has read the configuration space. In fact, because the backend doesn't know at start time whether this is a restore or a clean boot, it always waits until the driver has read the configuration space before it starts forwarding packets between the outside world and the driver.
+
 ## Usage

 Run the vhost-device-vsock device with unix domain socket backend:
```
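The "hold packets until the config space has been read" gating described above can be pictured with a minimal sketch (hypothetical names, not the PR's actual types): forwarding starts inactive and is enabled only once the driver reads up to the last byte of the config space.

```rust
// Minimal sketch (hypothetical names) of the activation gate: packets from
// outside the VM are forwarded only after the driver has read the last byte
// of the device config space, on a clean boot or after a transport reset.
struct ForwardingGate {
    active: bool,
}

impl ForwardingGate {
    fn new() -> Self {
        ForwardingGate { active: false }
    }

    // Would be called from get_config(); a read ending exactly at config_len
    // means the driver consumed the final config byte.
    fn on_config_read(&mut self, offset: usize, size: usize, config_len: usize) {
        if offset + size == config_len {
            self.active = true;
        }
    }

    fn may_forward(&self) -> bool {
        self.active
    }
}

fn main() {
    let mut gate = ForwardingGate::new();
    assert!(!gate.may_forward()); // boot or restore: hold packets back
    gate.on_config_read(0, 4, 8); // partial read does not activate
    assert!(!gate.may_forward());
    gate.on_config_read(4, 4, 8); // last byte read: start forwarding
    assert!(gate.may_forward());
}
```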
**Member:** In the commit there is something to be fixed:
**Member:** So to recap, this is unclear to me: why would the driver send a kick to the event vring?
**Contributor (author):** Oops! Too many d-words :). Fixed it.

I can't reply to the other comment on this same topic for some reason, so I'll reply here too:

The driver probably doesn't (necessarily) push more buffers or send a kick after a restore; the frontend does (or should, at least CrosVm does). I'll admit I haven't looked at QEMU's source in depth, but I'd imagine that if the rx queue has buffers in it, a kick is sent to the device after restore to activate that queue; otherwise the device won't be able to send data to the driver until the driver pushes more buffers, which it has no reason to do since the ones it already pushed are just sitting there. The backend could avoid waiting for the kick if it stored the queue state in its saved state, but there is still the question of when to send the buffer: immediately after loading its state and replying to the "check" call? Wouldn't that be too soon, for example if it doesn't yet have the call fd?
**Contributor (author):** I went looking at the QEMU source code after all, or to be precise I asked Gemini to look for me. It seems QEMU does not in fact send that kick like crosvm does; instead it depends on the driver to kick the queues when responding to the transport reset event, or on the device implementation to store the state of the queue. The virtio spec doesn't mention that the driver should kick the queues on transport reset, though. In any case this approach seems to work well with the vhost backend, but the vhost-user backend is marked by QEMU as unmigratable. Other vhost-user devices, like vhost-user-blk, appear to have "kick right away" logic similar to what crosvm does for vhost-user-vsock, so maybe it would be a good idea to add it to vhost-user-vsock too.
**Member:** Sorry, why should a frontend inject a kick? This seems more like a hack to me. A kick should mean: "there is something new in the avail ring, please process it". What new stuff is in the vring after that kick?

The device IMO should know that it is starting with a driver that is already initialized, and should not wait for a kick but start processing the queue right after starting. That said, for a TX queue it might make sense, but we are not using it in that sense at all in this case, since no new buffers will be there. Also we are not processing anything from the guest, so I'm still really confused why a kick is needed if the device doesn't need to process anything from the guest; this seems more like a frontend -> backend notification. Again, this seems purely a hack, and should not be enabled by default IMHO.

What about the VHOST_USER_SET_VRING_ENABLE event? (maybe we need to extend the
```diff
@@ -332,7 +332,7 @@ impl VsockThreadBackend {
         if dst_cid != VSOCK_HOST_CID {
             let cid_map = self.cid_map.read().unwrap();
             if cid_map.contains_key(&dst_cid) {
-                let (sibling_raw_pkts_queue, sibling_groups_set, sibling_event_fd) =
+                let (sibling_raw_pkts_queue_opt, sibling_groups_set, sibling_event_fd) =
```
**Member:** Why do we need this change? Can it be split into another commit? I hope these things are mentioned in the commit description.
**Contributor (author):** This is to prevent packets from sibling VMs from being delivered to the guest driver before a potential transport reset event is processed (just like it's done for packets from the host). If any of these packets were allowed through, a connection could be established that the transport reset would destroy later, and the sibling VM wouldn't know about this. I don't know if just dropping the packet is the best course of action, but that's what the code already does for packets addressed to unknown CIDs, and I just wanted to treat this case as if the CID is not reachable yet. In fact, my first option was to delay adding the whole tuple to

I can split it into its own commit, but it doesn't make a lot of sense on its own. Let me know if you still prefer it separately.
```diff
                     cid_map.get(&dst_cid).unwrap();

                 if self
```
```diff
@@ -345,11 +345,18 @@ impl VsockThreadBackend {
                     return Ok(());
                 }

-                sibling_raw_pkts_queue
-                    .write()
-                    .unwrap()
-                    .push_back(RawVsockPacket::from_vsock_packet(pkt)?);
-                let _ = sibling_event_fd.write(1);
+                match sibling_raw_pkts_queue_opt {
+                    Some(queue) => {
+                        queue
+                            .write()
+                            .unwrap()
+                            .push_back(RawVsockPacket::from_vsock_packet(pkt)?);
+                        let _ = sibling_event_fd.write(1);
+                    }
+                    None => {
+                        info!("vsock: dropping packet for cid: {dst_cid:?} due to inactive device");
+                    }
+                }
             } else {
                 warn!("vsock: dropping packet for unknown cid: {dst_cid:?}");
             }
```
```diff
@@ -525,6 +532,7 @@ mod tests {
     #[cfg(feature = "backend_vsock")]
     use crate::vhu_vsock::VsockProxyInfo;
     use crate::vhu_vsock::{BackendType, VhostUserVsockBackend, VsockConfig, VSOCK_OP_RW};
+    use vhost_user_backend::VhostUserBackend;

     const DATA_LEN: usize = 16;
     const CONN_TX_BUF_SIZE: u32 = 64 * 1024;
```
```diff
@@ -698,11 +706,28 @@ mod tests {
         // SAFETY: Safe as hdr_raw and data_raw are guaranteed to be valid.
         let mut packet = unsafe { VsockPacket::new(hdr_raw, Some(data_raw)).unwrap() };

         packet.set_type(VSOCK_TYPE_STREAM);
         packet.set_src_cid(CID);
         packet.set_dst_cid(SIBLING_CID);
         packet.set_dst_port(SIBLING_LISTENING_PORT);
         packet.set_op(VSOCK_OP_RW);
         packet.set_len(DATA_LEN as u32);
         packet
             .data_slice()
             .unwrap()
             .copy_from(&[0x01u8, 0x12u8, 0x23u8, 0x34u8]);

         // The packet will be dropped silently because the thread won't activate until the config
         // is read.
         vtp.send_pkt(&packet).unwrap();
         assert_eq!(
             vtp.recv_raw_pkt(&mut packet).unwrap_err().to_string(),
             Error::EmptyRawPktsQueue.to_string()
         );

         sibling_backend.get_config(0, 8);
         sibling2_backend.get_config(0, 8);

         packet.set_type(VSOCK_TYPE_STREAM);
         packet.set_src_cid(CID);
         packet.set_dst_cid(SIBLING_CID);
```
```diff
@@ -2,14 +2,18 @@

 use std::{
     collections::{HashMap, HashSet},
+    fs::File,
     io::Result as IoResult,
     path::PathBuf,
     sync::{Arc, Mutex, RwLock},
 };

 use log::warn;
 use thiserror::Error as ThisError;
-use vhost::vhost_user::message::{VhostUserProtocolFeatures, VhostUserVirtioFeatures};
+use vhost::vhost_user::message::{
+    VhostTransferStateDirection, VhostTransferStatePhase, VhostUserProtocolFeatures,
+    VhostUserVirtioFeatures,
+};
 use vhost_user_backend::{VhostUserBackend, VringRwLock};
 use virtio_bindings::bindings::{
     virtio_config::{VIRTIO_F_NOTIFY_ON_EMPTY, VIRTIO_F_VERSION_1},
```
```diff
@@ -24,8 +28,14 @@ use vmm_sys_util::{

 use crate::{thread_backend::RawPktsQ, vhu_vsock_thread::*};

-pub(crate) type CidMap =
-    HashMap<u64, (Arc<RwLock<RawPktsQ>>, Arc<RwLock<HashSet<String>>>, EventFd)>;
+pub(crate) type CidMap = HashMap<
+    u64,
+    (
+        Option<Arc<RwLock<RawPktsQ>>>,
+        Arc<RwLock<HashSet<String>>>,
+        EventFd,
+    ),
+>;

 const NUM_QUEUES: usize = 3;
```
@@ -72,8 +82,11 @@ pub(crate) const VSOCK_FLAGS_SHUTDOWN_RCV: u32 = 1; | |
| /// data | ||
| pub(crate) const VSOCK_FLAGS_SHUTDOWN_SEND: u32 = 2; | ||
|
|
||
| /// Vsock events - `VSOCK_EVENT_TRANSPORT_RESET`: Communication has been interrupted | ||
| pub(crate) const VSOCK_EVENT_TRANSPORT_RESET: u32 = 0; | ||
|
|
||
| // Queue mask to select vrings. | ||
| const QUEUE_MASK: u64 = 0b11; | ||
| const QUEUE_MASK: u64 = 0b111; | ||
|
**Member:** I'm wondering if this should be controlled by a CLI param. IIRC, in QEMU the event queue is handled by the VMM. Did you test this with QEMU?
**Contributor (author):** I didn't test with QEMU, only with CrosVm. I was very surprised to learn that QEMU takes ownership of the third queue. Luckily no cmdline parameter is needed; instead the device always activates after the config is read. This unfortunately deprives us of the ability to avoid a particular race condition (see the details in the commit message), but it should be very unlikely to trigger anyway.

I have now tested that the device works well with QEMU. However, when I tried taking a snapshot and restoring it, it failed with an error saying that the device doesn't support it, even though it advertises the DEVICE_STATE protocol feature. I suspect the issue is in the QEMU frontend, not in the backend.
**Member:** IIRC this was already done that way for vhost-vsock, so when we introduced vhost-user-vsock we shared the same code. The VMM knows perfectly well when a snapshot/migration is starting, so why is it strange that QEMU does it?

Hmm, so why not have a cmdline parameter to enable this and avoid the race entirely?

Yeah, the frontend needs to enable that in some way. Annoying...
**Contributor (author):** Because the queue is called "event", not "transport_reset". If other events are added to the virtio-vsock spec, then having QEMU handle that queue will be a problem. It also seems to me that it doesn't match the intent of vhost-user, where reading and writing the queues is the responsibility of the backend, not the frontend; but that's just my (probably very uninformed) opinion.

Because the probability of hitting that race is extremely low, possibly zero. I don't know for sure that it's even valid to attempt to take a snapshot of an uninitialized driver. On the other hand, having an extra flag that changes the behavior based on what the VMM does is a significant burden for users. So far we know what QEMU and CrosVm do, but what about other VMMs? I doubt this behavior is documented anywhere in an accessible manner, so it's possible that the only way to know for sure is to look at the VMM's source code. The user could also just try it with and without the flag and see what happens, but for VMMs that share the event queue with the backend both combinations will work most of the time. So between a very unlikely race and some usability issues I chose the former, but I'm fine either way. Just let me know your preference.
**Member:** As I mentioned, this comes from the vhost (in-kernel) device implementation.

I think we need to find another way. For example, can we discover whether the event queue is offloaded to the device by the frontend (e.g. check if set_vring_* etc. was called for it) and do this only if it's set up?
```diff
 pub(crate) type Result<T> = std::result::Result<T, Error>;
```
```diff
@@ -141,6 +154,8 @@ pub(crate) enum Error {
     EmptyRawPktsQueue,
     #[error("CID already in use by another vsock device")]
     CidAlreadyInUse,
+    #[error("Failed to write to event virtqueue")]
+    EventQueueWrite,
 }

 impl std::convert::From<Error> for std::io::Error {
```
```diff
@@ -261,6 +276,7 @@ pub(crate) struct VhostUserVsockBackend {
     queues_per_thread: Vec<u64>,
     exit_consumer: EventConsumer,
     exit_notifier: EventNotifier,
+    transport_reset_pending: Arc<Mutex<bool>>,
 }

 impl VhostUserVsockBackend {
```

```diff
@@ -286,6 +302,7 @@ impl VhostUserVsockBackend {
             queues_per_thread,
             exit_consumer,
             exit_notifier,
+            transport_reset_pending: Arc::new(Mutex::new(false)),
         })
     }
 }
```
```diff
@@ -310,7 +327,9 @@ impl VhostUserBackend for VhostUserVsockBackend {
     }

     fn protocol_features(&self) -> VhostUserProtocolFeatures {
-        VhostUserProtocolFeatures::MQ | VhostUserProtocolFeatures::CONFIG
+        VhostUserProtocolFeatures::MQ
+            | VhostUserProtocolFeatures::CONFIG
+            | VhostUserProtocolFeatures::DEVICE_STATE
     }

     fn set_event_idx(&self, enabled: bool) {
```
```diff
@@ -335,6 +354,7 @@ impl VhostUserBackend for VhostUserVsockBackend {
     ) -> IoResult<()> {
         let vring_rx = &vrings[0];
         let vring_tx = &vrings[1];
+        let vring_evt = &vrings[2];

         if evset != EventSet::IN {
             return Err(Error::HandleEventNotEpollIn.into());
```

```diff
@@ -349,7 +369,11 @@ impl VhostUserBackend for VhostUserVsockBackend {
                 thread.process_tx(vring_tx, evt_idx)?;
             }
             EVT_QUEUE_EVENT => {
-                warn!("Received an unexpected EVT_QUEUE_EVENT");
+                let reset_pending = &mut *self.transport_reset_pending.lock().unwrap();
+                if *reset_pending {
+                    thread.reset_transport(vring_evt, evt_idx)?;
+                    *reset_pending = false;
+                }
```
**Member** (on lines +372 to +376): Just to be clear, this should not be here IMO, but in some other place, because here we should handle events coming from the driver (or backends like unix sockets, etc.).
**Contributor (author):** This is the only place where the backend implementation is given access to the vrings. The spec clearly says that "the back-end must start a ring upon receiving a kick", so without this kick the vring will not be ready and any attempt to write to it will simply fail with the NoReady error.
**Member:** I see, but maybe we need to find something else, or put this behavior into the spec, because we are really implementing something custom for crosvm. (I'll try to find something better when I'm back.) Also, why not do these checks in any case, for every event? What is confusing me is that we are using a kick to do something else. If we need a notification mechanism between frontend and backend, we should add a new message; we should not reuse the kick, which should only be used by the driver to notify the device.
```diff
             }
             BACKEND_EVENT => {
                 thread.process_backend_evt(evset);
```

```diff
@@ -389,6 +413,15 @@
             return Vec::new();
         }

+        if offset + size == buf.len() {
```
**Member:** Instead of doing this, can we add a

**Contributor (author):** When would that
**Member:** Also another question: why should a driver read the config space after a live migration? For the driver it should be transparent, no?

**Member:** Ah, okay, here we are acking that the driver reset it. I need to reread it again; all of those assumptions are not clear to me and also not in the VIRTIO spec at all. Should we extend it?
```diff
+            // The last byte of the config is read when the driver is initializing or after it has
+            // processed a transport reset event. Either way, no transport reset will be pending
+            // after this. Activate all threads once it's known a reset event is not pending.
+            for thread in self.threads.iter() {
+                thread.lock().unwrap().activate();
+            }
+        }

         buf[offset..offset + size].to_vec()
     }
```
```diff
@@ -401,6 +434,23 @@
         let notifier = self.exit_notifier.try_clone().ok()?;
         Some((consumer, notifier))
     }

+    fn set_device_state_fd(
+        &self,
+        direction: VhostTransferStateDirection,
+        _phase: VhostTransferStatePhase,
+        _file: File,
+    ) -> std::result::Result<Option<File>, std::io::Error> {
+        if let VhostTransferStateDirection::LOAD = direction {
+            *self.transport_reset_pending.lock().unwrap() = true;
+        }
+        Ok(None)
+    }
+
+    fn check_device_state(&self) -> std::result::Result<(), std::io::Error> {
+        // We had nothing to read/write to the fd, so always return Ok.
+        Ok(())
+    }
 }

 #[cfg(test)]
```
```diff
@@ -436,17 +486,20 @@
         let vrings = [
             VringRwLock::new(mem.clone(), 0x1000).unwrap(),
             VringRwLock::new(mem.clone(), 0x2000).unwrap(),
+            VringRwLock::new(mem.clone(), 0x1000).unwrap(),
         ];
         vrings[0].set_queue_info(0x100, 0x200, 0x300).unwrap();
         vrings[0].set_queue_ready(true);
         vrings[1].set_queue_info(0x1100, 0x1200, 0x1300).unwrap();
         vrings[1].set_queue_ready(true);
+        vrings[2].set_queue_info(0x2100, 0x2200, 0x2300).unwrap();
+        vrings[2].set_queue_ready(true);

         backend.update_memory(mem).unwrap();

         let queues_per_thread = backend.queues_per_thread();
         assert_eq!(queues_per_thread.len(), 1);
-        assert_eq!(queues_per_thread[0], 0b11);
+        assert_eq!(queues_per_thread[0], 0b111);

         let config = backend.get_config(0, 8);
         assert_eq!(config.len(), 8);
```
```diff
@@ -569,6 +622,7 @@
         let vrings = [
             VringRwLock::new(mem.clone(), 0x1000).unwrap(),
             VringRwLock::new(mem.clone(), 0x2000).unwrap(),
+            VringRwLock::new(mem.clone(), 0x1000).unwrap(),
         ];

         backend.update_memory(mem).unwrap();
```
**Member:** Okay, for example, if this feature is not negotiated, we can skip all of this, no?