Skip to content

vsock: Enable live migrations (snapshot-restore)#936

Open
jemoreira wants to merge 2 commits intorust-vmm:mainfrom
jemoreira:suspend_resume
Open

vsock: Enable live migrations (snapshot-restore)#936
jemoreira wants to merge 2 commits intorust-vmm:mainfrom
jemoreira:suspend_resume

Conversation

@jemoreira
Copy link
Contributor

@jemoreira jemoreira commented Feb 11, 2026

Summary of the PR

The vsock device only accesses virtqueues when they are ready.

The most common case where this matters is after receiving a VHOST_USER_GET_VRING_BASE message from the client (because of a suspend operation, for example), if one of the host side applications or sister VMs sends us a message it must be kept queued until the virtqueues are set to ready again with a kick.

Enable live migrations (snapshot-restore) of the vsock device.

The virtio spec mandates that when restoring from a snapshot (such as when the VM was migrated to a different host) the device must send a TRANSPORT RESET event via the event virtqueue to the device. The device in turn, upon receiving the event, must drop all existing connections while keeping all listeners and read the CID again from the device configuration space and update the listeners with the (potentially different) CID.

On the device side care must be taken to ensure no new connections are established until that transport reset event is handled otherwise the guest will silently drop them and the peer may not know about it. Given that the driver must read the CID from the configuration space during normal boot and after doing a reset that's the best signal for the device to "activate" and start processing packets from the host and sibling VMs. Feature negotiation also happens during both normal boot and restore, but in the case of restore it happens before the transport reset and therefore can't be used as reliable signal to "activate" the device.

The device doesn't need to save/load any state for this. When loading a snapshot the device simply notes that it must send a TRANSPORT RESET event to the driver as soon as possible, which it then does when it receives a kick on the event vring. Independently of whether the device is started to restore a previous VM state or brand new (the device actually doesn't know until it set_device_state_fd is called), the device always starts in "inactive" state, meaning it will drop any packets coming from any source. As mentioned before the device "activates" once the driver has read the configuration space.

Unlike other VMMs, QEMU takes ownership of the event vring and handles sending of the TRANSPORT RESET event itself. This change attempts to handle both approaches by sending the TRANSPORT RESET event if the vring is kicked, but doesn't treat it as precondition to activate the device, instead choosing to activate unconditionally once the driver reads the configuration space.

There is a race in this implementation, that occurs when the snapshot was taken before the guest driver read the configuration and then it reads immediately upon restore, before the transport reset event was handled. While this could be mitigated by waiting for the reset event to be sent AND the config to be read AFTER the event was sent, QEMU's decision to take over the event queue makes that approach unviable. Hopefully, it will be very unlikely a snapshot will be taken before the driver is fully initialized as that would be of very little value in practice.

Requirements

Before submitting your PR, please make sure you addressed the following
requirements:

  • All commits in this PR have Signed-Off-By trailers (with
    git commit -s), and the commit message has max 60 characters for the
    summary and max 75 characters for each description line.
  • All added/changed functionality has a corresponding unit/integration
    test.
  • All added/changed public-facing functionality has entries in the "Upcoming
    Release" section of CHANGELOG.md (if no such section exists, please create one).
  • Any newly added unsafe code is properly documented.

@jemoreira jemoreira changed the title Suspend resume vsock: Enable live migrations (snapshot-restore) Feb 11, 2026
@jemoreira jemoreira marked this pull request as draft February 12, 2026 00:07
@jemoreira
Copy link
Contributor Author

Converted to draft to address the test failures

@jemoreira jemoreira force-pushed the suspend_resume branch 2 times, most recently from d8fce24 to 2bcfd05 Compare February 13, 2026 00:17
@jemoreira jemoreira marked this pull request as ready for review February 13, 2026 00:17
Copy link
Collaborator

@dorindabassey dorindabassey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR, just left a few comments.

@jemoreira
Copy link
Contributor Author

Hello, just a friendly reminder that this PR is ready for review.


// Queue mask to select vrings.
const QUEUE_MASK: u64 = 0b11;
const QUEUE_MASK: u64 = 0b111;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this should be controlled by a param in the CLI. IIRC in QEMU the event queue is handled by the vmm.

Did you test this with QEMU?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't test with QEMU, only with CrosVm. I was very surprised to learn that QEMU takes ownership of the third queue. Luckily no cmdline parameter is needed, instead the device always activates after the config is read. This unfortunately deprives us of the ability to avoid a particular race condition (see the details in the commit message), but it should be very unlikely to trigger anyways.

I have now tested that the device works well with QEMU, however when I tried taking a snapshot and restoring it failed with an error saying that the device doesn't support it, even though it advertises the DEVICE_STATE protocol feature. I suspect the issue is in the qemu frontend, not in the backend.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't test with QEMU, only with CrosVm. I was very surprised to learn that QEMU takes ownership of the third queue.

IIRC this was already in that way for vhost-vsock, so when we introduced vhost-user-vsock we shared the same code. The VMM knows perfectly when a snapshot/migration is starting, so why it's strage that QEMU does it?

Luckily no cmdline parameter is needed, instead the device always activates after the config is read. This unfortunately deprives us of the ability to avoid a particular race condition (see the details in the commit message), but it should be very unlikely to trigger anyways.

mmm, so why not have a cmdline parameter to enabled this and avoid the race at all?

I have now tested that the device works well with QEMU, however when I tried taking a snapshot and restoring it failed with an error saying that the device doesn't support it, even though it advertises the DEVICE_STATE protocol feature. I suspect the issue is in the qemu frontend, not in the backend.

yeah, the fronted needs to enable that in some way. Annoying...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why it's strage that QEMU does it?

Because the queue is called "event", not "transport_reset". If other events are added to the virtio-vsock spec then having qemu handling that queue will be a problem. It also seems to me that it doesn't match the intent of vhost-user, where reading and writing the queues is the responsibility of the backend, not the frontend; but that's just my (probably very uninformed) opinion.

mmm, so why not have a cmdline parameter to enabled this and avoid the race at all?

Because the probability of hitting that race is extremely low, possibly 0. I don't know for sure that it's valid to attempt to take a snapshot of an uninitialized driver.

On the other hand, having the extra flag that changes the behavior based on what the VMM does is a significant burden for the users. So far what QEMU and CrosVm do, but what about other VMMs? I doubt this behavior is document anywhere in an accessible manner, so it's possible that the only way to know for sure is to look at the VMM's source code. The user could also just try it with and without the flag and see what happens, but for VMMs that share the event queue with the backend both combinations will work most of the time.

So I between a very unlikely race and some usability issues I chose the former, but I'm fine either way. Just let me know your preference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why it's strage that QEMU does it?

Because the queue is called "event", not "transport_reset". If other events are added to the virtio-vsock spec then having qemu handling that queue will be a problem. It also seems to me that it doesn't match the intent of vhost-user, where reading and writing the queues is the responsibility of the backend, not the frontend; but that's just my (probably very uninformed) opinion.

As I mentioned, this comes from vhost (in-kernel) device implementation.

mmm, so why not have a cmdline parameter to enabled this and avoid the race at all?

Because the probability of hitting that race is extremely low, possibly 0. I don't know for sure that it's valid to attempt to take a snapshot of an uninitialized driver.

On the other hand, having the extra flag that changes the behavior based on what the VMM does is a significant burden for the users. So far what QEMU and CrosVm do, but what about other VMMs? I doubt this behavior is document anywhere in an accessible manner, so it's possible that the only way to know for sure is to look at the VMM's source code. The user could also just try it with and without the flag and see what happens, but for VMMs that share the event queue with the backend both combinations will work most of the time.

So I between a very unlikely race and some usability issues I chose the former, but I'm fine either way. Just let me know your preference.

I think we need to find another way, for example, can we discover if the event queue is offloaded to the device from the frontend or not (e.g. check if the device called set_vring_* etc.) and do this only if it's set up?

Comment on lines +434 to +436
// The last byte of the config is read when the driver is initializing or after it has
// processed a transport reset event. Either way, no transport reset will be pending
// after this.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm, so if a driver decides to not read the configuration, the device will not be activated? Is this conform to the spec?

Also, why we need to wait this, and not for example the feature negotiation?

Copy link
Contributor Author

@jemoreira jemoreira Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec says the driver needs to read the CID from the config space as part of device initialization process. When restoring from a snapshot, the spec says the driver MUST read the config space again because its CID may have changed and drop any existing connections, but keep any listeners, now associated to the new CID. The device informs the driver that a restore/migration occurred by sending the transport reset event.

Because the driver may eventually silently drop any existing connections the device should ensure no new connections are established until after the drop happens or it knows for sure it won't happen. The only common action that happens during a regular boot and a restore is the driver reading the config. The features are also read, but in the case of the restore this is done from the frontend side, not the driver, and happens before the transport reset, so it can't be used as the signal to activate.

I've updated the commit message with this information too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the info, this helps the review a lot.

let cid_map = self.cid_map.read().unwrap();
if cid_map.contains_key(&dst_cid) {
let (sibling_raw_pkts_queue, sibling_groups_set, sibling_event_fd) =
let (sibling_raw_pkts_queue_opt, sibling_groups_set, sibling_event_fd) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need this change? Can be split in another commit?

I hope these things are mentioned in the commit description.

Copy link
Contributor Author

@jemoreira jemoreira Mar 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to prevent packets from sibling VMs to be delivered to the guest driver before a potential transport reset event is processed (just like it's done for packages from the host). If any of these packets are allowed through a connection could be established that the transport reset would destroy later, but the sibling VM won't know about this. I don't know if just dropping the packet is the best course of action, but that's what the code already does for packets addressed to unknown CIDs and I just wanted to treat this case as if the CID is not reachable yet.

In fact, my first option was to delay adding the whole tuple to cid_map until activation, but the presence of the CID in the map is also used to check something in or near the main function, so I opted for just this instead.

I can split it into its own commit, but it doesn't make a lot of sense on its own. Let me know if you still prefer it separately.


/// Sends a TRANSPORT_RESET event to the guest driver. Returns true if it was able to send it,
/// false if there were no buffers available in the vring.
pub fn reset_transport(&mut self, vring: &VringRwLock, event_idx: bool) -> Result<bool> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, after a reset what will be the state of this device, can it be re-used or need to be stopped ?

I'm asking because I don't see any reset of connection etc.

So, yes please describe this a bit better in the commit description and please add a section in the documentation to explain how this live migration is supposed to work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a "device reset", the spec calls it a "transport reset", but effectively it just tells the driver that all connections it new about are gone. Those connections only existed in the driver side as the device has just been started from scratch when this event is sent.

I've added more details to the commit message. By documentation, do you mean the README.md file? I'm not sure what to add there beyond the fact that the device supports VM migrations as this is usually handled by the VMM itself (not the user directly) and the processes/protocols are described in the virtio and vhost-user specs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a "device reset", the spec calls it a "transport reset", but effectively it just tells the driver that all connections it new about are gone. Those connections only existed in the driver side as the device has just been started from scratch when this event is sent.

I'm a bit lost here, especially where this is called.

So we have:

            EVT_QUEUE_EVENT => {
                let reset_pending = &mut *self.transport_reset_pending.lock().unwrap();
                if *reset_pending {
                    thread.reset_transport(vring_evt, evt_idx)?;
                    *reset_pending = false;
                }
            }

IIUC this means that it's called when the driver fills the event queue. Why this?
I mean, why the guest will fill up that queue again after a snapshot?

What happen if the guest fills the event queue before we set self.transport_reset_pending?

In QEMU we use it in a different way: when the migration is completed (post-load), we use one of the buffer that the driver already queued in the event_queue.

Here for me is not clear at all, why after a snapshot/migration a driver needs to fill the event queue again with free descriptors.

I've added more details to the commit message. By documentation, do you mean the README.md file? I'm not sure what to add there beyond the fact that the device supports VM migrations as this is usually handled by the VMM itself (not the user directly) and the processes/protocols are described in the virtio and vhost-user specs.

Yep a new section in the README.md will be nice especially to explain how it supposed to interact with the driver, because I'm still a lot confused about it, so I'm pretty sure in the future we forgot about this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the section in the README file.

The most common case where this matters is after receiving a
VHOST_USER_GET_VRING_BASE message from the client, if one of the host
side applications or sister VMs sends us a message it must be kept
queued until the virtqueues are set to ready again with a kick.

Signed-off-by: Jorge E. Moreira <jemoreira@google.com>
Copy link
Member

@stefano-garzarella stefano-garzarella left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still really confused, especially on why we do this on EVT_QUEUE_EVENT. So a section in the readme that explain better all the snapshot/migration support will be nice to have.

In addition, I think a parameter to enable this only with some VMMs would be better IMO because I still think what QEMU is doing makes more sense.

Note: I'll be off and with limited internet access from March 7 to March 29, if others want to merge, I'm not against, but I'd like to have a section in the readme with a clear design on how this is going to work.


// Queue mask to select vrings.
const QUEUE_MASK: u64 = 0b11;
const QUEUE_MASK: u64 = 0b111;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't test with QEMU, only with CrosVm. I was very surprised to learn that QEMU takes ownership of the third queue.

IIRC this was already in that way for vhost-vsock, so when we introduced vhost-user-vsock we shared the same code. The VMM knows perfectly when a snapshot/migration is starting, so why it's strage that QEMU does it?

Luckily no cmdline parameter is needed, instead the device always activates after the config is read. This unfortunately deprives us of the ability to avoid a particular race condition (see the details in the commit message), but it should be very unlikely to trigger anyways.

mmm, so why not have a cmdline parameter to enabled this and avoid the race at all?

I have now tested that the device works well with QEMU, however when I tried taking a snapshot and restoring it failed with an error saying that the device doesn't support it, even though it advertises the DEVICE_STATE protocol feature. I suspect the issue is in the qemu frontend, not in the backend.

yeah, the fronted needs to enable that in some way. Annoying...

Comment on lines +434 to +436
// The last byte of the config is read when the driver is initializing or after it has
// processed a transport reset event. Either way, no transport reset will be pending
// after this.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the info, this helps the review a lot.


/// Sends a TRANSPORT_RESET event to the guest driver. Returns true if it was able to send it,
/// false if there were no buffers available in the vring.
pub fn reset_transport(&mut self, vring: &VringRwLock, event_idx: bool) -> Result<bool> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a "device reset", the spec calls it a "transport reset", but effectively it just tells the driver that all connections it new about are gone. Those connections only existed in the driver side as the device has just been started from scratch when this event is sent.

I'm a bit lost here, especially where this is called.

So we have:

            EVT_QUEUE_EVENT => {
                let reset_pending = &mut *self.transport_reset_pending.lock().unwrap();
                if *reset_pending {
                    thread.reset_transport(vring_evt, evt_idx)?;
                    *reset_pending = false;
                }
            }

IIUC this means that it's called when the driver fills the event queue. Why this?
I mean, why the guest will fill up that queue again after a snapshot?

What happen if the guest fills the event queue before we set self.transport_reset_pending?

In QEMU we use it in a different way: when the migration is completed (post-load), we use one of the buffer that the driver already queued in the event_queue.

Here for me is not clear at all, why after a snapshot/migration a driver needs to fill the event queue again with free descriptors.

I've added more details to the commit message. By documentation, do you mean the README.md file? I'm not sure what to add there beyond the fact that the device supports VM migrations as this is usually handled by the VMM itself (not the user directly) and the processes/protocols are described in the virtio and vhost-user specs.

Yep a new section in the README.md will be nice especially to explain how it supposed to interact with the driver, because I'm still a lot confused about it, so I'm pretty sure in the future we forgot about this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the commit there some something to be fixed:
the device must send a TRANSPORT RESET event via the event virtqueue to the device. -> to the driver
The device in turn, upon receiving the event -> The driver in turn

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The device doesn't need to save/load any state for this. When loading a
snapshot the device simply notes that it must send a TRANSPORT RESET
event to the driver as soon as possible, which it then does when it
receives a kick on the event vring.

So to recap, this is unclear to me, why the driver will send a kick to the event vring ?
Is that in the spec or in the implementation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the commit there some something to be fixed

Oops! too many d-words :). Fixed it.

Why the driver will send a kick to the event vring?

I can't reply to the other comment on this same topic for some reason, so I'll reply here too:

I mean, why the guest will fill up that queue again after a snapshot?

What happen if the guest fills the event queue before we set self.transport_reset_pending?

The driver probably doesn't (necessarily) push more buffers or send a kick after a restore, the fronted does (or should, at least CrosVm does).

I'll admit I haven't looked at QEMU's source in depth, but I'd image that if the rx queue has buffers in it a kick is sent to the device after restore to activate that queue, otherwise the device won't be able to send data to the driver until the driver pushes more buffers, which it has no reason to do since the ones it already pushed are just sitting there.

The backend could avoid waiting for the kick if it stored the queue state in its saved state, but there is still the question of when to send the buffer: Immediately after loading its state and replying to the "check" call? Wouldn't that be too soon, for example if it doesn't yet have call fd?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went looking at the QEMU source code after all, or to be precise I asked Gemini to look for me. It seems QEMU does not in fact send that kick like crosvm does and instead depends on the driver to kick the queues when responding to the transport event OR on the device implementation to store the state of the queue. The virtio spec doesn't mention that the driver should kick the queues on transport reset though. In any case this approach seems to work well with the vhost backend, but the vhost-user backend is marked by qemu as unmigratable.

Other vhost-user devices, like vhost-user-blk appear to have a "kick right away" logic similar to what crosvm does for vhost-user-vsock. So, maybe it would be a good idea to add it in vhost-user-vsock too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the commit there some something to be fixed

Oops! too many d-words :). Fixed it.

Why the driver will send a kick to the event vring?

I can't reply to the other comment on this same topic for some reason, so I'll reply here too:

I mean, why the guest will fill up that queue again after a snapshot?

What happen if the guest fills the event queue before we set self.transport_reset_pending?

The driver probably doesn't (necessarily) push more buffers or send a kick after a restore, the fronted does (or should, at least CrosVm does).

Sorry, why a frontend should inject a kick? This seems more an hack to me?

A kick should means: "there is something new in the avail ring, please process it". What new stuff are in the vring after that kick?

I'll admit I haven't looked at QEMU's source in depth, but I'd image that if the rx queue has buffers in it a kick is sent to the device after restore to activate that queue, otherwise the device won't be able to send data to the driver until the driver pushes more buffers, which it has no reason to do since the ones it already pushed are just sitting there.

The device IMO should know that is starting with a driver already initialized and should not wait for a kick, but start to process the queue just after starting.

That said, for a TX queue may have sense, but we are not using at all in that sense in this case, since no new buffers will be there. Also we are not processing anything from the guest, so I'm still really confused why a kick is needed if the device doesn't need to process anything from the guest, but this seems more a frontend -> backend notification. Again, this seems purely an hack, and should not be enabled by default IMHO.

The backend could avoid waiting for the kick if it stored the queue state in its saved state, but there is still the question of when to send the buffer: Immediately after loading its state and replying to the "check" call? Wouldn't that be too soon, for example if it doesn't yet have call fd?

What about VHOST_USER_SET_VRING_ENABLE event? (maybe we need to extend the

The [virtio
spec](https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-4950001)
mandates that when restoring from a snapshot (such as when the VM was
migrated to a different host) the device must send a TRANSPORT RESET
event via the event virtqueue to the driver. The driver in turn, upon
receiving the event, must drop all existing connections while keeping
all listeners and read the CID again from the device configuration
space and update the listeners with the (potentially different) CID.

On the device side care must be taken to ensure no new connections are
established until that transport reset event is handled otherwise the
guest will silently drop them and the peer may not know about it. Given
that the driver must read the CID from the configuration space during
normal boot and after doing a reset that's the best signal for the
device to "activate" and start processing packets from the host and
sibling VMs. Feature negotiation also happens during both normal boot
and restore, but in the case of restore it happens before the transport
reset and therefore can't be used as reliable signal to "activate" the
device.

The device doesn't need to save/load any state for this. When loading a
snapshot the device simply notes that it must send a TRANSPORT RESET
event to the driver as soon as possible, which it then does when it
receives a kick on the event vring. Independently of whether the device
is started to restore a previous VM state or brand new (the device
actually doesn't know until it `set_device_state_fd` is called), the
device always starts in "inactive" state, meaning it will drop any
packets coming from any source. As mentioned before the device
"activates" once the driver has read the configuration space.

Unlike other VMMs, QEMU takes ownership of the event vring and handles
sending of the TRANSPORT RESET event itself. This change attempts to handle
both approaches by sending the TRANSPORT RESET event if the vring is
kicked, but doesn't treat it as precondition to activate the device,
instead choosing to activate unconditionally once the driver reads the
configuration space.

There is a race in this implementation, that occurs when the snapshot
was taken before the guest driver read the configuration and then it
reads immediately upon restore, before the transport reset event was
handled. While this could be mitigated by waiting for the reset event to
be sent AND the config to be read AFTER the event was sent, QEMU's
decision to take over the event queue makes that approach unviable.
Hopefully, it will be very unlikely a snapshot will be taken before the
driver is fully initialized as that would be of very little value in
practice.

Signed-off-by: Jorge E. Moreira <jemoreira@google.com>
@jemoreira
Copy link
Contributor Author

Note: I'll be off and with limited internet access from March 7 to March 29, if others want to merge, I'm not against, but I'd like to have a section in the readme with a clear design on how this is going to work.

I am also going to be on vacation during that same period, we can resume after the 29th if you're still not convinced about this. I tried to address every comment you left with the latest push, except the command line flag because I think everyone is just going to run it in "qemu mode", see that it works with other VMMs too and never use it.


## Live migration

This device implementation advertises support for live migrations by offering the VHOST_USER_PROTOCOL_F_DEVICE_STATE protocol feature, however this doesn't work with Qemu yet as it marks its vsock frontend as "unmigratable". This feature does work with CrosVm and potentially other virtual machine managers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, for example, if this feature is not negotiated, we can skip all of this, no?


The state saving flow is trivial as the device doesn't save any state as mentioned.

The state loading flow is a bit more complicated because the virtio-vsock spec mandates that the device must send a VIRTIO_VSOCK_EVENT_TRANSPORT_RESET event to the driver. During a restore the backend is started no differently than during a regular boot. When the frontend sends the VHOST_USER_SET_DEVICE_STATE_FD command with LOAD direction the backend doesn't load anything, but it takes note that a transport reset event needs to be sent to the driver via the event vring when possible. In order to make sure this event is sent when the queue is ready, the backend waits for the event queue to be kicked before sending the event. While these kicks usually come from the driver, this particular one is actually sent by the vhost-user frontend. This implementation depends on the frontend to kick all queues with pending buffers after a restore because the driver is unlikely to do so as it probably did it before the snapshot was taken.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to make sure this event is sent when the queue is ready, the backend waits for the event queue to be kicked before sending the event. While these kicks usually come from the driver, this particular one is actually sent by the vhost-user frontend.

This is the hack IMO. The kick means: "hey, I'm the driver and there is a new available buffer for you in the virtqueue", while here we are using for something completely different, and even not in the spec. If this is what we want from the frontend (and I don't understand why we want this), we should put into the spec and not guess here that we will receive that event.

return Vec::new();
}

if offset + size == buf.len() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing this, can we add a set_enabled callback to the VhostUserBackend trait and do this in that callback?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would that set_enabled callback be called? This has to happen after the config is read because that's the only signal (according to the spec) the driver gives the device that the transport reset happened.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also another question, why a driver should read the config space after a live migration?

For the driver should be transparent, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay, here we are acking that the driver reset it. I need to reread it again, all of those assumptions are not clear to me and also not in the VIRTIO spec at all.

Should we extend it?

Comment on lines +372 to +376
let reset_pending = &mut *self.transport_reset_pending.lock().unwrap();
if *reset_pending {
thread.reset_transport(vring_evt, evt_idx)?;
*reset_pending = false;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, this should not be here IMO, but in some other place, because here we should handle events coming from the driver (or backends like unix socket, etc.).
This should be done as response of some vhost_user message, like the set_enabled, etc. not to a random kick injected by the frontend.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only place where the backend implementation is given access to the vrings.

The spec clearly says that "the back-end must start a ring upon receiving a kick", so without this kick the vring will not be ready and any attempt to write to it will simply fail with the NoReady error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, but maybe we need to find something else. Or put this behavior into the spec because we are really implementing something custom for crosvm. (I'll try to find something better when I'm back)

Also, why not doing this checks in any case? I mean for every event?

What is confusing me is that we are using a kick to do something else. If we need a notification mechanism between frontend and backend, we should add a new message, but we should not reuse the kick which should be only used by the driver to notify the device.

@stefano-garzarella
Copy link
Member

Note: I'll be off and with limited internet access from March 7 to March 29, if others want to merge, I'm not against, but I'd like to have a section in the readme with a clear design on how this is going to work.

I am also going to be on vacation during that same period, we can resume after the 29th if you're still not convinced about this. I tried to address every comment you left with the latest push, except the command line flag because I think everyone is just going to run it in "qemu mode", see that it works with other VMMs too and never use it.

Agree on that, I pointed out what is unclear to me. I think we can find a solution, but I don't like the kick injection TBH and I think we should avoid that. Let's continue when we are back ;-)

Enjoy your time off!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants