It's been two months since we released vzctl 4.7 and ploop 1.11 with some vital improvements to live migration that made it about 25% faster. Believe it or not, this is another "make it faster" post, making it, well, faster. That's right, we have more surprises in store for you! Read on and be delighted.
Asynchronous ploop send
This is something that should have been implemented at the very beginning, but the initial implementation looked like this (in C):
/* _XXX_ We should use AIO. ploopcopy cannot use cached reads and
 * has to use O_DIRECT, which introduces large read latencies.
 * AIO is necessary to transfer with maximal speed.
 */
Why are there no cached reads (and, in general, why is the ploop library using O_DIRECT, and why does the kernel also use direct I/O, bypassing the cache)? Note that ploop images are files that are used as block devices containing a filesystem. Access to that ploop block device and filesystem goes through the cache as usual, so if we did the same for the ploop image itself, it would result in double caching and a waste of RAM. Therefore, we do direct I/O on the lower level (when working with ploop images), and allow the usual Linux cache to be used on the upper level (when a container is accessing files inside the ploop, which is the most common operation).
So, the ploop image itself is not (usually) cached in memory, other tricks like read-ahead are not used by the kernel either, and reading each block from the image takes time as it is read directly from disk. In our test lab (OpenVZ running inside KVM with a networked disk) it takes about 10 milliseconds to read each 1 MB block. Then, sending each block to the destination system over the network also takes time: about 15 milliseconds in our setup. So, it takes 25 milliseconds to read and send a block of data, if we do it one after another. Oh wait, this is exactly how we do it!
Solution -- let's do reading and sending at the same time! In other words, while sending the first block of data we can already read the second one, and so on, bringing the time required to transfer one block down to MAX(Tread, Tsend) instead of Tread + Tsend.
The implementation uses POSIX threads. The main ploop send process spawns a separate sending thread and uses two buffers -- one for reading and one for sending -- which then swap roles. This works surprisingly well for something that uses threads, maybe because it is as simple as it could ever be (one thread, one mutex, one condition variable). If you happen to know something about pthreads programming, please review the appropriate commit 55e26e9606.
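To make the idea more concrete, here is a minimal sketch of the two-buffer scheme with one extra thread, one mutex and one condition variable. It is not the code from the commit above: the names (struct sender, copy_async, BLOCK_SIZE) are made up for illustration, it works on plain file descriptors, and it skips the O_DIRECT details (such as aligned buffers) that real ploop code has to deal with.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE (64 * 1024)  /* illustrative; the post talks about 1 MB blocks */

/* State shared by the reading (main) thread and the sending thread. */
struct sender {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    const char *buf;  /* block handed over for sending, NULL when idle */
    size_t len;       /* number of bytes in buf */
    int out_fd;       /* where the sender writes */
    int done;         /* reader has no more blocks */
    int err;          /* set by the sender if a write failed */
};

/* write() may be partial, so loop until the whole block is out. */
static int write_all(int fd, const char *p, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

static void *sender_thread(void *arg)
{
    struct sender *s = arg;

    pthread_mutex_lock(&s->lock);
    for (;;) {
        while (s->buf == NULL && !s->done)
            pthread_cond_wait(&s->cond, &s->lock);
        if (s->buf == NULL)
            break;  /* done, and nothing left to send */
        const char *buf = s->buf;
        size_t len = s->len;
        pthread_mutex_unlock(&s->lock);
        /* The slow part runs without the lock, so the reader can proceed. */
        int rc = write_all(s->out_fd, buf, len);
        pthread_mutex_lock(&s->lock);
        if (rc < 0)
            s->err = 1;
        s->buf = NULL;  /* hand the buffer back to the reader */
        pthread_cond_signal(&s->cond);
    }
    pthread_mutex_unlock(&s->lock);
    return NULL;
}

/* Copy in_fd to out_fd, reading block N+1 while block N is being sent.
 * Returns 0 on success, -1 on a read or write error. */
int copy_async(int in_fd, int out_fd)
{
    char bufs[2][BLOCK_SIZE];
    struct sender s = { .buf = NULL, .out_fd = out_fd };
    pthread_t tid;
    int cur = 0, ret = 0;

    pthread_mutex_init(&s.lock, NULL);
    pthread_cond_init(&s.cond, NULL);
    if (pthread_create(&tid, NULL, sender_thread, &s) != 0)
        return -1;

    for (;;) {
        /* This read overlaps with the send of the other buffer. */
        ssize_t n = read(in_fd, bufs[cur], BLOCK_SIZE);
        if (n <= 0) {
            if (n < 0)
                ret = -1;
            break;
        }
        pthread_mutex_lock(&s.lock);
        while (s.buf != NULL)  /* previous block is still in flight */
            pthread_cond_wait(&s.cond, &s.lock);
        int failed = s.err;
        if (!failed) {
            s.buf = bufs[cur];
            s.len = (size_t)n;
            pthread_cond_signal(&s.cond);
        }
        pthread_mutex_unlock(&s.lock);
        if (failed)
            break;
        cur ^= 1;  /* swap buffers */
    }

    pthread_mutex_lock(&s.lock);
    s.done = 1;
    pthread_cond_signal(&s.cond);
    pthread_mutex_unlock(&s.lock);
    pthread_join(tid, NULL);
    if (s.err)
        ret = -1;
    pthread_cond_destroy(&s.cond);
    pthread_mutex_destroy(&s.lock);
    return ret;
}
```

A single condition variable is enough here because the two threads never wait at the same time: the reader waits only while a block is pending, the sender only while none is.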
The new async send is used by default; you just need to have a newer ploop (1.12, not yet released). Clone and compile ploop from git if you are curious.
As for how much time we save using this, it happens to be 15 milliseconds instead of 25 per block of data, or 40% faster! Now, the real savings depend on the number of blocks that need to be migrated after the container is frozen. It can be anywhere from 5 to 100, so overall savings can be from 0.05 to 1 second.
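For those who like to check the arithmetic, here is a tiny sketch of the timing model, using the lab figures of 10 ms per read and 15 ms per send. The function names are ours, and the model assumes every block takes exactly the same time, which real disks and networks will not honour.

```c
/* Sequential copy: every block is read, then sent. */
long sequential_ms(long blocks, long t_read, long t_send)
{
    return blocks * (t_read + t_send);
}

/* Overlapped copy: only the very first read and the very last send
 * have nothing to overlap with; in between, the slower of the two
 * stages sets the pace. */
long pipelined_ms(long blocks, long t_read, long t_send)
{
    long slow = t_read > t_send ? t_read : t_send;

    if (blocks <= 0)
        return 0;
    return t_read + (blocks - 1) * slow + t_send;
}
```

With these numbers, 100 blocks take 2500 ms sequentially and 1510 ms overlapped, and 5 blocks take 125 ms and 85 ms. That is a touch less than the back-of-the-envelope 10 ms per block, because the first read and the last send cannot be hidden.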
ploop copy with feedback
The previous post on live migration improvements described an optimization of doing fdatasync() on the receiving ploop side before suspending the container on the sending side. It also noted that the implementation is sub-par:
The other problem is that sending side should wait for fsync to finish, in order to proceed with CT suspend. Unfortunately, there is no way to solve this one with a one-way pipe, so the sending side just waits for a few seconds. Ugly as it is, this is the best way possible (let us know if you can think of something better).
So, the problem is trivial -- there's a need for a bi-directional channel between ploop copy sender and receiver, so the receiver can say "All right, I have synced the freaking delta" back to the sender. In addition, we want to do it in a simple way, not much more complicated than it is now.
After some playing around with different approaches, it seemed that OpenSSH port forwarding, combined with tools like netcat, socat, or bash /dev/tcp feature, can do the trick of establishing a two-way pipe between ploop copy sides.
Netcat (or nc) comes in several flavors, which might or might not be available and/or compatible with each other. socat is a bit better, but one problem with it is that it ignores its child's exit code, so there is no (simple) way to detect an error. A problem with the /dev/tcp feature is that it is bash-specific, and bash itself might not be universally available (Debian and Ubuntu users are well aware of that fact).
So, all this was replaced with a tiny home-grown vznnc ("nano-netcat") tool. It is indeed nano, as there are only 200 lines of code in it. It can either listen on or connect to a specified port at localhost (note that ssh is doing the real networking for us), and run a program with its stdin/stdout (or a specified file descriptor) already connected to that TCP socket. Again, similar to netcat, but small, to the point, and hopefully bug-free.
Finally, with this in place, we can make the sending side of ploop copy wait for feedback from the receiving side, so it can suspend the container as soon as the remote side has finished syncing. This makes the whole migration a bit faster (by eliminating the previously hard-coded 5 second wait-for-sync delay), but it also helps to slash the frozen time: we suspend the container exactly when we should, so it is not given any extra time to write more data that we would need to copy while it's frozen.
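The handshake itself can be very small. Below is a sketch of what such feedback might look like over a two-way channel. The one-byte SYNC_ACK message, the function names and the timeout are all assumptions made for illustration, not the actual ploop copy protocol.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <unistd.h>

#define SYNC_ACK 'S'  /* hypothetical one-byte "delta is synced" message */

/* Receiver side: flush the received data to disk, then tell the sender. */
int sync_and_ack(int image_fd, int chan_fd)
{
    char ack = SYNC_ACK;

    if (fdatasync(image_fd) < 0)
        return -1;
    return write(chan_fd, &ack, 1) == 1 ? 0 : -1;
}

/* Sender side: wait until the ack arrives or timeout_ms expires.
 * Returns 0 on ack, -1 on timeout, error or an unexpected byte. */
int wait_for_ack(int chan_fd, int timeout_ms)
{
    struct pollfd p = { .fd = chan_fd, .events = POLLIN };
    char c;
    int rc;

    do {
        rc = poll(&p, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);
    if (rc <= 0)
        return -1;
    if (read(chan_fd, &c, 1) != 1)
        return -1;
    return c == SYNC_ACK ? 0 : -1;
}
```

The sender would call something like wait_for_ack() right before suspending the container, instead of sleeping for a fixed number of seconds.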
It's hard to measure the practical impact of the feature, but in our tests it saves about 3.5 seconds of total migration time, and from 0 to a few seconds of frozen time, depending on container's disk I/O activity.
For the feature to work, you need the latest versions of vzctl (4.8) and ploop (1.12) on both nodes (if one node doesn't have the newer tools, vzmigrate falls back to the old non-feedback way of copying). Note that those new versions are not yet released at the time of writing this, but you can get them from git and compile them yourself.
Using ssh connection multiplexing
It takes about 0.15 seconds to establish an ssh connection in our test lab. Your mileage may vary, but it can't be zero. Well, unless an OpenSSH feature of reusing an existing ssh connection is put to work! It's called connection multiplexing, and when used, once a connection is established, you can have subsequent ones in practically no time. As vzmigrate is a shell script and runs a number of commands on the remote (destination) side, using it might save us some time.
Unfortunately, it's a relatively new OpenSSH feature, and it is implemented in a rather ugly way in the version available in CentOS 6 -- you need to keep one open ssh session running. They fixed it in a later version by adding a special daemon-like mode enabled with the ControlPersist option, but alas not in CentOS 6. Therefore, we have to maintain a special "master" ssh connection for the duration of vzmigrate. For implementation details, see commits 06212ea3d and 00b9ce043.
This is still experimental, so you need to specify a flag to vzmigrate to use it. You won't believe it (so, go test it yourself!), but this alone slashes container frozen time by about 25% (which is great, as we fight for every microsecond here), and improves the total time taken by vzmigrate by up to 1 second (which is probably not that important but still nice).
The current setup used for development and testing is two OpenVZ instances running in KVM guests on a box. While it's convenient and oh so very space-saving, it is probably not the best approximation of real-world OpenVZ usage. So, we need some better hardware.
If you are able to spend $500 to $2000 on ebay, please let us know by email to donate at openvz dot org so we can arrange it. Now, quite a few people offered hosted servers with a similar configuration. While we are very thankful for all such offers, this time we are looking for physical, not hosted, hardware.
Update: Thanks to FastVPS, we got the Supermicro 4 node server. If you want to donate, see wiki: Donations.