block: Improves guest signaling for lower virtblk end-to-end latency#7532
elainewu1222 wants to merge 1 commit into cloud-hypervisor:main
Conversation
Signal the guest before processing queue submissions to enable earlier guest-side completion event handling, reducing end-to-end latency for block device operations. FIO benchmarks show up to 7.4% bandwidth improvement at iodepth 16 and a 4k block size with NVMe devices. Signed-off-by: wuxinyue <wuxinyue.wxy@antgroup.com>
))
})?;

self.try_signal_used_queue()?;
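The hunk above adds a try_signal_used_queue() call right after the completion-processing closure. For context, here is a compilable sketch of the resulting ordering, assuming a simplified worker structure; apart from try_signal_used_queue and process_queue_submit (which appear in the diff and the discussion below), every name and the surrounding control flow are hypothetical rather than taken from cloud-hypervisor:

```rust
// Illustrative sketch only; not the actual cloud-hypervisor code.
struct BlockWorker;

#[derive(Debug)]
struct Error;

impl BlockWorker {
    fn process_queue_complete(&mut self) -> Result<(), Error> {
        // Drain finished async I/O and place used entries on the used ring.
        Ok(())
    }

    fn try_signal_used_queue(&mut self) -> Result<(), Error> {
        // Inject an interrupt (or write the eventfd) if the driver wants one.
        Ok(())
    }

    fn process_queue_submit(&mut self) -> Result<(), Error> {
        // Pop new descriptors from the avail ring and submit them; requests
        // that complete synchronously (or fail) also land on the used ring.
        Ok(())
    }

    fn handle_completion_event(&mut self) -> Result<(), Error> {
        self.process_queue_complete()?;
        // Added by this patch: signal completed I/O right away instead of
        // waiting for the submission pass below to finish.
        self.try_signal_used_queue()?;
        self.process_queue_submit()?;
        // Assumed pre-existing signal; it only results in a second interrupt
        // if the submission pass above put more entries on the used ring.
        self.try_signal_used_queue()?;
        Ok(())
    }
}

fn main() {
    BlockWorker.handle_completion_event().unwrap();
}
```

The point of the reordering is that the guest can start reaping already-completed requests while the worker thread is still busy popping and submitting new ones.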
This is going to generate two interrupts into the guest, which has its own performance penalty.
I guess what this change is taking advantage of is that, because this is running in its own thread, the first interrupt will go into the guest, and it will take some time for the guest to start looking at the used queue while this thread is still processing the submission queue and putting used entries onto it. The guest may see a partially updated used queue, but it will then see the complete used queue on the second submission.
I think this is spec compliant because the driver must handle extra interrupts.
But I don't think it is memory safe - it may work on a system with TSO but not on a system with a weaker memory model.
Currently the VMM code will do a write fence as part of Queue.needs_notification - similarly the kernel code will do a read fence before processing the used queue. With your change there can be writes to the used queue that are not "protected" by those fences. And although the index is updated after writing the entry to the queue, there is no guarantee that the other thread will see them in that order without the fence.
Therefore I don't think this change is safe to include. However I think other folks should also do some research and think about it.
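To make the worry concrete, here is a toy illustration (not cloud-hypervisor code; the statics are stand-ins invented for this example) of an unordered publish: with only Relaxed stores, a weakly ordered CPU is permitted to make the index update visible before the entry it describes, whereas TSO hardware happens to preserve the store order.

```rust
use std::sync::atomic::{AtomicU16, AtomicU32, Ordering};

// Stand-ins for one used-ring element and the used index.
static ENTRY: AtomicU32 = AtomicU32::new(0);
static USED_IDX: AtomicU16 = AtomicU16::new(0);

// Device-side publish with no ordering between the two stores: on a weak
// memory model the other side could observe the incremented index while
// still reading a stale entry. On TSO/x86 the store order is preserved by
// the hardware, which is why the problem may not reproduce there.
fn publish_unfenced(value: u32) {
    ENTRY.store(value, Ordering::Relaxed);
    let idx = USED_IDX.load(Ordering::Relaxed);
    USED_IDX.store(idx.wrapping_add(1), Ordering::Relaxed);
}

fn main() {
    publish_unfenced(42);
}
```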
> This is going to generate two interrupts into the guest, which has its own performance penalty
The second interrupt is unlikely to happen in an async model: it is only generated when process_queue_submit touches the used queue, which means the request is a sync request or a failed request. In most read/write cases, the second interrupt will not be triggered.
The purpose of this change is to make the signal for all completed read/write IO happen earlier, rather than waiting for requests that might complete during the submit phase.
> With your change there can be writes to the used queue that are not "protected" by those fences
If you are saying that the guest side may see new requests after the signal, such a thing could also happen before this change: if the backend is fast enough, it can add new used items before the guest starts handling them.
The virtqueue itself already guarantees memory safety, and when to send the signal won't impact its efficiency.
For the signal timing, we can take the virtio event index mechanism as an example: it tries to reduce interrupts by checking the other side's progress. Seeing new requests after a signal is expected and even encouraged in that case.
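For reference, here is a small sketch of that event-index check as the virtio spec defines it (the same computation as the kernel's vring_need_event helper); this is illustrative and not cloud-hypervisor's implementation:

```rust
// A notification is needed only if the other side's event index falls
// within the range of indices published since the last notification.
// Wrapping arithmetic handles 16-bit ring index overflow.
fn needs_notification(event_idx: u16, new_idx: u16, old_idx: u16) -> bool {
    new_idx.wrapping_sub(event_idx).wrapping_sub(1) < new_idx.wrapping_sub(old_idx)
}

fn main() {
    // Driver asked to be notified once index 5 is used; we just moved
    // from 4 to 6, so the event index was crossed and we must notify.
    assert!(needs_notification(5, 6, 4));
    // Driver's event index (10) not yet reached: suppress the interrupt.
    assert!(!needs_notification(10, 6, 4));
}
```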
More info on how the virtqueue ensures memory safety on the used queue (a sketch follows the list):
- When a new item is added to the used queue, the used index is updated last and is stored using Release ordering -- a write barrier.
- The guest side always checks the used index before processing more entries, and a read barrier is placed after the used index read.
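A minimal sketch of those two bullets using Rust atomics; this is illustrative only and not the actual vm-virtio/cloud-hypervisor code:

```rust
use std::sync::atomic::{AtomicU16, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

struct UsedRing {
    // Stand-in for one used-ring element (id + len packed together).
    entry: AtomicU64,
    // The used index, published last.
    idx: AtomicU16,
}

fn main() {
    let ring = Arc::new(UsedRing {
        entry: AtomicU64::new(0),
        idx: AtomicU16::new(0),
    });

    // Device side: write the entry first, then publish the index with
    // Release ordering -- the "write barrier" from the first bullet.
    let device = {
        let ring = Arc::clone(&ring);
        thread::spawn(move || {
            ring.entry.store(0xdead_beef, Ordering::Relaxed);
            ring.idx.store(1, Ordering::Release);
        })
    };

    // Driver side: read the index with Acquire ordering -- the "read
    // barrier" from the second bullet. If the new index is observed, the
    // entry written before the Release store is guaranteed to be visible.
    loop {
        if ring.idx.load(Ordering::Acquire) == 1 {
            assert_eq!(ring.entry.load(Ordering::Relaxed), 0xdead_beef);
            break;
        }
    }

    device.join().unwrap();
}
```

This pairing is what makes the publication of used entries safe regardless of when the interrupt is injected.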
I need someone else to help me understand whether this fix is spec compliant and valuable - I'm not convinced. @sboeuf @likebreath
FWIW, I agree with @elainewu1222 on the safety of the operation. The queue access must be safe without the notification mechanism (indicating the notification mechanism is not tied to the safety of queue operations), and the index read/write is the mechanism to ensure that, as #7532 (comment) explains. Regarding TSO or not, correct code and correct compilers need to ensure the outcome looks sequentially consistent in the absence of data races on the target platform. Using acquire and release semantics on the index read/write is the way to ensure that; compilers will transform these into the correct operations or insert additional barriers for the platform as needed. Signaling the used queue right when the completion event is handled definitely reduces latency and sounds very reasonable, and I see nothing wrong with that approach. Regarding sending too many events, CHV supports ...
Signal the guest before processing queue submissions to enable earlier guest-side completion event handling, reducing end-to-end latency for block device operations.
FIO benchmarks show up to 7.4% bandwidth improvement at iodepth 16 and a 4k block size with NVMe devices.
Test stats
before the change: bw=2209MiB/s lat=112.62us
after the change: bw=2373MiB/s lat=104.92us
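(For reference, the headline numbers follow directly from these two measurements: (2373 - 2209) / 2209 is roughly 7.4% more bandwidth, and the drop from 112.62us to 104.92us is roughly a 6.8% latency reduction.)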
Test details
fio command: fio --name=direct_aio_randwrite --ioengine=libaio --iodepth=16 --rw=randwrite --bs=4k --direct=1 --size=4G --numjobs=4 --runtime=240 --group_reporting --time_based --filename=/dev/vdb
cloud-hypervisor setup:
machine info