Description
While investigating seemingly much worse performance for the nvme device compared to virtio-block, @rmustacc pointed out we set the Volatile Write Cache bit for nvme devices but not the similar flush capability for virtio-block.
As a quick test, I tried clearing the vwc bit and rerunning pgbench with an nvme device, and got similar (if not very slightly better) results compared to virtio-block (both using the file backend). Given that the file backend also doesn't use any sync flags on open, the speedup we were seeing on virtio-block makes sense: it turns out it's faster to just assume things are synchronous (when in actuality they're not) and never call flush/fsync.
The issue is that we don't ever try to negotiate the VIRTIO_BLK_F_FLUSH feature today. Per the VIRTIO spec [1]:
An implementation that does not offer VIRTIO_BLK_F_FLUSH and does not commit completed writes will not be resilient to data loss in case of crashes.
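As a minimal sketch of what the negotiation check could look like (the function and variable names here are hypothetical; the feature-bit value itself comes from the VIRTIO spec, which assigns VIRTIO_BLK_F_FLUSH bit 9 for the block device):

```rust
// VIRTIO_BLK_F_FLUSH is feature bit 9 per the VIRTIO spec; the helper
// below is purely illustrative, not actual Propolis code.
const VIRTIO_BLK_F_FLUSH: u64 = 1 << 9;

/// Flush is only usable if the device offered the feature *and* the
/// driver accepted it during feature negotiation.
fn flush_negotiated(device_features: u64, driver_features: u64) -> bool {
    (device_features & driver_features & VIRTIO_BLK_F_FLUSH) != 0
}
```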
In addition to advertising and negotiating VIRTIO_BLK_F_FLUSH, we subsequently need to support VIRTIO_BLK_T_FLUSH commands and forward them appropriately to the backend.
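A rough sketch of that dispatch against a file backend follows. The request-type and status values are the ones defined by the VIRTIO spec; the helper itself and its signature are hypothetical, not actual Propolis code:

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom, Write};

// Request types defined by the VIRTIO spec for the block device.
const VIRTIO_BLK_T_IN: u32 = 0;    // read
const VIRTIO_BLK_T_OUT: u32 = 1;   // write
const VIRTIO_BLK_T_FLUSH: u32 = 4; // flush

// Status values the device writes back into the request footer.
const VIRTIO_BLK_S_OK: u8 = 0;
const VIRTIO_BLK_S_IOERR: u8 = 1;
const VIRTIO_BLK_S_UNSUPP: u8 = 2;

/// Handle one virtio-blk request against a file backend, returning the
/// status byte for the request footer.
fn handle_request(req_type: u32, backing: &mut File, sector: u64, buf: &mut [u8]) -> u8 {
    let res: io::Result<()> = match req_type {
        VIRTIO_BLK_T_IN => backing
            .seek(SeekFrom::Start(sector * 512))
            .and_then(|_| backing.read_exact(buf)),
        VIRTIO_BLK_T_OUT => backing
            .seek(SeekFrom::Start(sector * 512))
            .and_then(|_| backing.write_all(buf)),
        // Forward the flush to the backend; sync_all() wraps fsync(2).
        VIRTIO_BLK_T_FLUSH => backing.sync_all(),
        _ => return VIRTIO_BLK_S_UNSUPP,
    };
    match res {
        Ok(()) => VIRTIO_BLK_S_OK,
        Err(_) => VIRTIO_BLK_S_IOERR,
    }
}
```

With a backend like this, a guest write is only durable once the subsequent VIRTIO_BLK_T_FLUSH completes, which is exactly the contract VIRTIO_BLK_F_FLUSH advertises.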
Note: implementation-wise, we can also just choose to always commit writes even without flush support:

If VIRTIO_BLK_F_FLUSH was not offered by the device, the device MAY also commit writes to persistent device backend storage before reporting their completion.
But this relies on better support on the backend's side as well.
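For reference, that always-commit behavior could be sketched as a write-through backend like the following (the type and method names are hypothetical; `sync_data()` is the Rust stdlib wrapper around fdatasync(2)):

```rust
use std::fs::File;
use std::io::{self, Seek, SeekFrom, Write};

// Hypothetical write-through backend: commits every write to stable
// storage before reporting completion, as the spec permits when
// VIRTIO_BLK_F_FLUSH is not offered.
struct WriteThroughBackend {
    file: File,
}

impl WriteThroughBackend {
    /// Complete a guest write only after the data has reached stable
    /// storage, so a crash cannot lose an acknowledged write.
    fn write_sector(&mut self, sector: u64, data: &[u8]) -> io::Result<()> {
        self.file.seek(SeekFrom::Start(sector * 512))?;
        self.file.write_all(data)?;
        // Don't report completion until the data is durable;
        // sync_data() maps to fdatasync(2) on Unix.
        self.file.sync_data()
    }
}
```

The trade-off is the one the benchmark above exposed in reverse: syncing on every write is the safest default, but it gives up the batching benefit that an explicit flush command allows.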