Multiple TCP Connections for Live Migrations #7669
Open
amphi wants to merge 14 commits into cloud-hypervisor:main from
Conversation
amphi force-pushed from a39f7b7 to b8a8caf
phip1611 (Member) requested changes on Feb 5, 2026
Awesome! Left a few remarks. I'll refrain from approving this because I'm biased (Sebastian and Julian are my colleagues)
Please, as discussed, add specific bandwidth numbers to the PR description and the commit series, showing how awesome this work is!
amphi force-pushed from b8a8caf to 0460ed6
This is not wired up to anywhere yet. We will use this to establish multiple connections for live migration.
On-behalf-of: SAP julian.stecklina@sap.com
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

This has no functional change, but it is a prerequisite for removing the lock that was used to obtain the MemoryManager instance.
On-behalf-of: SAP julian.stecklina@sap.com
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

... to avoid having to grab a lock when we receive a chunk of memory over the migration socket. This will come in handy when we have multiple threads for receiving memory.
On-behalf-of: SAP julian.stecklina@sap.com
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

To allow for accepting more connections in the migration receive code paths, we need to keep track of the listener. This commit adds a thin abstraction to be able to hold on to it regardless of whether it is a UNIX domain or TCP socket.
On-behalf-of: SAP julian.stecklina@sap.com
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
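A minimal sketch of what such a listener abstraction could look like (illustrative only; the names and exact shape in the PR may differ):

```rust
use std::net::{TcpListener, TcpStream};
use std::os::unix::net::{UnixListener, UnixStream};

/// Keep the listening socket around so further migration connections can be
/// accepted, regardless of the transport in use.
enum MigrationListener {
    Unix(UnixListener),
    Tcp(TcpListener),
}

/// A single accepted migration connection.
enum MigrationStream {
    Unix(UnixStream),
    Tcp(TcpStream),
}

impl MigrationListener {
    /// Accept one more connection on whichever listener we hold.
    fn accept(&self) -> std::io::Result<MigrationStream> {
        match self {
            MigrationListener::Unix(l) => l.accept().map(|(s, _)| MigrationStream::Unix(s)),
            MigrationListener::Tcp(l) => l.accept().map(|(s, _)| MigrationStream::Tcp(s)),
        }
    }
}
```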
We keep the listening socket around and accept as many connections as the sender wants to open. There are still some problems: we never tear these threads down again. We will handle this in subsequent commits.
On-behalf-of: SAP julian.stecklina@sap.com
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

In anticipation of using multiple threads for sending memory, refactor the sending code to be in a single place.
On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

... to be able to re-use it when establishing multiple send connections. I moved the receive socket creation out for symmetry.
On-behalf-of: SAP julian.stecklina@sap.com
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

... to simplify sending memory from multiple connections in future commits.
On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

For sending memory over multiple connections, we need a way to split up the work. With these changes, we can take a memory table and chop it into same-sized chunks for transmit.
On-behalf-of: SAP julian.stecklina@sap.com
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
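A rough illustration of that kind of splitting (the range type and chunk size are made up for the example; the PR's actual MemoryTable differs):

```rust
/// One contiguous range of guest memory, as (start address, length) in bytes.
type Range = (u64, u64);

/// Split a table of memory ranges into chunks of at most `chunk_size` bytes,
/// so each chunk can be handed to a different sender thread.
fn split_into_chunks(table: &[Range], chunk_size: u64) -> Vec<Vec<Range>> {
    let mut chunks = Vec::new();
    let mut current = Vec::new();
    let mut current_len = 0;

    for &(start, len) in table {
        let mut start = start;
        let mut len = len;
        while len > 0 {
            // Take as much of this range as still fits into the current chunk.
            let take = len.min(chunk_size - current_len);
            current.push((start, take));
            current_len += take;
            start += take;
            len -= take;
            if current_len == chunk_size {
                chunks.push(std::mem::take(&mut current));
                current_len = 0;
            }
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    // Two dirty ranges (10 MiB and 3 MiB), split into 4 MiB chunks.
    let table = [(0, 10 << 20), (64 << 20, 3 << 20)];
    for (i, chunk) in split_into_chunks(&table, 4 << 20).iter().enumerate() {
        println!("chunk {i}: {chunk:?}");
    }
}
```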
This does not actually use the additional connections yet, but we are getting closer!
On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

This will stop us from listening for more connections on the TCP socket when migration has finished. Tearing down the individual connections will come in a subsequent commit.
Co-authored-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP julian.stecklina@sap.com
On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>

... after the VM migration finishes.
On-behalf-of: SAP julian.stecklina@sap.com
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
This solves the race condition in the following scenario: Thread A is done working and waits at the barrier. Thread B encounters an error and sends it to the main thread. Thread A is still waiting at the barrier, and the main thread cannot abort that wait. With the custom gate, the main thread can simply open the gate, and all waiting threads will continue. Even if Thread A now receives the gate message that was sent for Thread B, the gate is already open and Thread A will not block.
On-behalf-of: SAP sebastian.eydam@sap.com
Signed-off-by: Sebastian Eydam <sebastian.eydam@cyberus-technology.de>
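A minimal sketch of such a gate, assuming a Mutex/Condvar-based implementation (illustrative; the PR's actual type may differ):

```rust
use std::sync::{Condvar, Mutex};

/// A barrier-like synchronization point that the main thread can force open,
/// so workers never stay stuck waiting after an error has been reported.
struct Gate {
    open: Mutex<bool>,
    cond: Condvar,
}

impl Gate {
    fn new() -> Self {
        Gate { open: Mutex::new(false), cond: Condvar::new() }
    }

    /// Block until the gate has been opened. If it is already open,
    /// return immediately instead of blocking.
    fn wait(&self) {
        let mut open = self.open.lock().unwrap();
        while !*open {
            open = self.cond.wait(open).unwrap();
        }
    }

    /// Open the gate and release every thread that is (or will be) waiting.
    fn open(&self) {
        *self.open.lock().unwrap() = true;
        self.cond.notify_all();
    }
}
```

The key property is that open() is sticky: a thread that arrives after the gate was opened returns immediately instead of blocking, which is what resolves the scenario described above.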
amphi force-pushed from 0460ed6 to 001ad37
Summary
This PR implements VM live migrations using multiple TCP connections. Most of this work is taken from @blitz, so kudos to him!
The send-migration HTTP command now accepts a `connections` parameter (defaults to 1) that specifies how many connections to use for live migration.
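For illustration, a request body for the vm.send-migration API endpoint might then look like this (the destination address is made up, and the exact shape of the body is an assumption based on the existing `destination_url` field rather than taken from this PR):

```json
{
  "destination_url": "tcp:192.168.1.20:4444",
  "connections": 4
}
```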
Benchmarks
We did a quick test on two of our servers, which have a 100G connection. We transferred a VM with 50GB of RAM; here are our results:
We also ran iperf between the two machines and got a throughput of 11.5 GiB/s, so I'd say the feature works pretty well.
Hint: MiB/s is mebibytes per second, GiB/s is gibibytes per second.
Design
If `connections` is larger than 1, the sender will attempt to establish additional TCP connections to the same migration destination. The main (initial) connection handles most of the migration protocol. The additional connections handle only `Memory` commands for transferring chunks of VM memory.

For each additional connection, a thread is created that receives chunks of memory from the main thread and sends those chunks to the receiver.
For each iteration of sending memory, the `MemoryTable` that describes dirty memory is split into chunks of fixed size (`CHUNK_SIZE`). These chunks are then distributed among the available threads using an MPSC channel wrapped in a Mutex. The channel has a configurable backlog of outstanding chunks to send (`BUFFERED_REQUESTS_PER_THREAD`), currently 64 chunks per thread, to keep memory consumption at a sensible level (otherwise, for VMs with a huge amount of memory, the backlog itself could take up a lot of additional memory).

We still use the original request-response scheme. Since we don't pipeline requests, but always wait for the other side to acknowledge them, there is a fundamental limit on the throughput we can reach. The original code expected only one ACK for the whole dirty memory table; we now have one ACK per chunk.
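Below is a rough, self-contained sketch of that distribution scheme (the constants mirror the description above; the chunk type, wire protocol, and channel sizing are simplified stand-ins for the PR's actual code):

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

const CHUNK_SIZE: u64 = 4 << 20; // illustrative: 4 MiB per chunk
const BUFFERED_REQUESTS_PER_THREAD: usize = 64;

/// One fixed-size piece of the dirty-memory table (illustrative stand-in).
#[derive(Debug)]
struct Chunk {
    guest_addr: u64,
    len: u64,
}

fn main() {
    let connections: usize = 4;

    // Bounded MPSC channel: the main thread blocks once the backlog is full,
    // which caps how much work is queued ahead of the slowest connection.
    let (tx, rx) = sync_channel::<Chunk>(BUFFERED_REQUESTS_PER_THREAD * connections);
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..connections)
        .map(|id| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Take the next chunk; the Mutex is released again before the
                // chunk is processed, so another worker can grab the next one.
                let next = rx.lock().unwrap().recv();
                match next {
                    // The real code would send a Memory command over this
                    // worker's own TCP connection and wait for the per-chunk ACK.
                    Ok(chunk) => println!("connection {id}: sending {chunk:?}"),
                    // Channel closed: this migration iteration is done.
                    Err(_) => break,
                }
            })
        })
        .collect();

    // Feed the workers one iteration's worth of dirty memory (64 MiB here).
    let mut addr = 0u64;
    while addr < 64 << 20 {
        tx.send(Chunk { guest_addr: addr, len: CHUNK_SIZE }).unwrap();
        addr += CHUNK_SIZE;
    }
    drop(tx); // closing the channel lets the workers drain and exit

    for worker in workers {
        worker.join().unwrap();
    }
}
```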
@blitz came up with this formula for the upper bound of throughput per connection:

`effective_throughput = chunk_size / (chunk_size / throughput_per_connection + round_trip_time)`

This formula is also in the code. We've played around with this, and with large enough chunks the impact seems negligible, especially since we can scale up the number of connections. Feel free to plug in your favorite numbers.
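For example, plugging some made-up numbers into that formula (a 4 MiB chunk, 1 GiB/s per connection, and a 0.5 ms round trip for the per-chunk ACK):

```rust
fn main() {
    // Illustrative numbers only, not measurements from this PR.
    let chunk_size: f64 = 4.0 * 1024.0 * 1024.0; // bytes
    let throughput_per_connection: f64 = 1024.0 * 1024.0 * 1024.0; // bytes/s
    let round_trip_time: f64 = 0.0005; // seconds

    let effective_throughput =
        chunk_size / (chunk_size / throughput_per_connection + round_trip_time);

    // With these numbers each connection reaches roughly 89% of its raw
    // throughput; larger chunks or more connections close the gap further.
    println!(
        "effective throughput: {:.2} MiB/s",
        effective_throughput / (1024.0 * 1024.0)
    );
}
```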