block: Improves guest signaling for lower virtblk end-to-end latency #7532

Open

elainewu1222 wants to merge 1 commit into cloud-hypervisor:main from
elainewu1222:virtblk-signal-frontend-earlier

Conversation

@elainewu1222

Signal the guest before processing queue submissions to enable earlier guest-side completion event handling, reducing end-to-end latency for block device operations.

FIO benchmarks show up to 7.4% bandwidth improvement at 16 iodepth and 4k block size with NVMe devices.

test stats

before the change: bw=2209MiB/s lat=112.62us
after the change: bw=2373MiB/s lat=104.92us

Test details

fio command: fio --name=direct_aio_randwrite --ioengine=libaio --iodepth=16 --rw=randwrite --bs=4k --direct=1 --size=4G --numjobs=4 --runtime=240 --group_reporting --time_based --filename=/dev/vdb

cloud-hypervisor setup:

numactl --cpubind=0 {clh-path} \
        --memory size=40G,shared=on \
        --kernel {linux 6.1.103} \
        --disk path={boot-disk-path} path={nvme-disk-path},direct=on,num_queues=4,queue_affinity=[0@8,1@9,2@10,3@11] \
        --cmdline "nokaslr earlyprintk=ttyS0 console=hvc0 root=/dev/vda1 rw" \
        --cpus boot=4,affinity=[0@4,1@5,2@6,3@7] \
        --serial tty  \
        --api-socket /tmp/ch-socket \
        --device path=/sys/bus/pci/devices/$dev1 \
        --net "tap=tap1"

machine info

#lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              192
On-line CPU(s) list: 0-191
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           4
NUMA node(s):        8
Vendor ID:           HygonGenuine
BIOS Vendor ID:      Chengdu Hygon
CPU family:          24
Model:               4
Model name:          Hygon C86-4G (OPN:7470)
BIOS Model name:     Hygon C86-4G (OPN:7470)
Stepping:            1
CPU MHz:             2999.320
BogoMIPS:            5200.03
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-11,96-107
NUMA node1 CPU(s):   12-23,108-119
NUMA node2 CPU(s):   24-35,120-131
NUMA node3 CPU(s):   36-47,132-143
NUMA node4 CPU(s):   48-59,144-155
NUMA node5 CPU(s):   60-71,156-167
NUMA node6 CPU(s):   72-83,168-179
NUMA node7 CPU(s):   84-95,180-191
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip overflow_recov succor smca sme sev sev_es

Signal the guest before processing queue submissions to enable
earlier guest side completion event handling, reducing end-to-end
latency for block device operations.

FIO benchmarks show up to 7.4% bandwidth improvement at 16
iodepth and 4k block size with NVMe devices.

Signed-off-by: wuxinyue <wuxinyue.wxy@antgroup.com>
@elainewu1222 elainewu1222 requested a review from a team as a code owner December 2, 2025 03:58
@elainewu1222 elainewu1222 changed the title block: Improves guest signaling for lower end-to-end latency block: Improves guest signaling for lower virtblk end-to-end latency Dec 2, 2025
))
})?;

self.try_signal_used_queue()?;
Member


This is going to generate two interrupts into the guest, which has its own performance penalty.

I guess what this change is taking advantage of is that, because this is running in its own thread, the first interrupt will go into the guest; it will take some time for the guest to start looking at the used queue while this thread continues to process the submission queue and starts putting used entries onto it. The guest may see a partially updated used queue, but it will then see the complete used queue on the second interrupt.

I think this is spec compliant because the driver must handle extra interrupts.

But I don't think it is memory safe - it may work on a system with TSO but not on a system with a weaker memory model.

Currently the VMM code will do a write fence as part of Queue.needs_notification - similarly the kernel code will do a read fence before processing the used queue. With your change there can be writes to the used queue that are not "protected" by those fences. And although the index is updated after writing the entry to the queue, there is no guarantee that the other thread will see them in that order without the fence.

Therefore I don't think this change is safe to include. However I think other folks should also do some research and think about it.

Author

@elainewu1222 elainewu1222 Dec 3, 2025


This is going to generate two interrupts into the guest, which has its own performance penalty.

The second interrupt is unlikely to happen in an async model: it is only generated when process_queue_submit touches the used queue, which means the request is a sync request or a failed request. In most read/write cases, the second interrupt will not be triggered.
The purpose of this change is to make the signal for all completed read/write IO happen earlier, rather than waiting for requests that may complete during the submit phase.

With your change there can be writes to the used queue that are not "protected" by those fences

If you are talking about the guest side possibly seeing new requests after the signal, the same thing could already happen before this change: if the backend is fast enough, it can add new used items before the guest starts handling them.
The virtqueue itself already guarantees memory safety, and when the signal is sent does not affect that.
For the signal timing, the virtio event index mechanism is a good example. It tries to reduce interrupts by checking the other side's progress; seeing new requests after a signal is expected and even encouraged in that case.

Author

@elainewu1222 elainewu1222 Dec 3, 2025


More info on how the virtqueue ensures memory safety on the used queue:

  • When a new item is added to the used queue, the used index is updated last and is stored with Release ordering -- a write barrier.
  • The guest side always checks the used index before processing more entries, and a read barrier (Acquire) is placed after the used index read.

@rbradford
Member

I need someone else to help me understand whether this fix is spec compliant and valuable - I'm not convinced. @sboeuf @likebreath

@snue
Contributor

snue commented Feb 17, 2026

I need someone else to help me understand whether this fix is spec compliant and valuable - I'm not convinced. @sboeuf @likebreath

FWIW, I agree with @elainewu1222 on the safety of the operation. The queue access must be safe without the notification mechanism, and the index read/write is the mechanism to ensure that, as #7532 (comment) explains.

Regarding TSO or not, correct code and correct compilers need to ensure the outcome looks sequentially consistent in the absence of data races on the target platform. Using the acquire and release semantics on the index read/write is the way to ensure that. Compilers will transform these to the correct operations or insert additional barriers for the platform as needed.

Signaling the used queue right when the completion event is handled definitely reduces latency and sounds very reasonable. I see nothing wrong with that approach. The fio numbers show a clear effect for the "extreme" case. Regressions for a more general use case seem unlikely, but seeing some numbers could help raise confidence.

Regarding sending too many events, CHV supports VIRTIO_F_EVENT_IDX:

Similarly, if VIRTIO_F_EVENT_IDX has been negotiated, the device will check used_event to know if it needs to send a notification or not.
-- from a redhat blog post on virtqueues

(indicating the notification mechanism is not tied to the safety of queue operations)
