It has been almost two years since we wrote about effective live migration with the ploop write tracker. It's time to write some more about it, since we have managed to make ploop live migration even more effective by means of some pretty simple optimizations. But let's not jump to the conclusion yet; it's a long and interesting story to tell.
As you know, live migration is not quite live, although it looks that way to the user. There is a short time period, usually a few seconds, during which the container being migrated is frozen. This time (shown if the -t or -v option to vzmigrate --live is used) is what needs to be optimized, making it as short as possible. In order to do that, one needs to dig into the details of what happens while a container is frozen.
Typical timings obtained via vzmigrate -t --live look like this. We ran a few iterations, migrating a container back and forth between two OpenVZ instances (running inside Parallels VMs on the same physical machine), so there are a few columns on the right side.
The first suspect to look at is that "undump + resume". Basically, it shows the timing of the vzctl restore command. Why is it so slow? It turns out that ploop mount takes a noticeable amount of time. Let's dig deeper into that process.
First, we implemented timestamps in ploop messages, raised the log level, and looked at what is going on. It turned out that adding deltas is not instant: it takes anywhere from 0.1 second to almost a second. After some more experiments and thinking, it became obvious that since the ploop kernel driver works with the data in delta files directly, bypassing the page cache, it needs to force those files to be written to the disk, and this costly operation happens while the container is frozen. Is it possible to do it earlier? Sure, we just need to force-write the deltas we have copied before suspending the container. Easy: just call fsync(), or better yet fdatasync(), since we don't really care about the metadata being written.
Unfortunately, there is no command line tool to do fsync or fdatasync, so we had to write one and call it from vzmigrate. Is it any better now? Yes indeed, delta adding times went down from tenths to hundredths of a second.
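Such a tool is essentially a thin wrapper around the fdatasync() system call. Here is a minimal sketch of what it might look like (illustrative only; the actual utility we wrote for vzmigrate may differ):

/* fsyncdata.c: a minimal sketch of a command-line fdatasync tool
 * (illustrative; the real utility called from vzmigrate may differ).
 * Usage: fsyncdata FILE... */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int i, fd, ret = 0;

	for (i = 1; i < argc; i++) {
		fd = open(argv[i], O_RDONLY);
		if (fd < 0) {
			perror(argv[i]);
			return 1;
		}
		/* Flush the file's data (but not its metadata) to disk */
		if (fdatasync(fd)) {
			perror(argv[i]);
			ret = 1;
		}
		close(fd);
	}
	return ret;
}

vzmigrate can then call it for every delta file it has just copied, right before suspending the container.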
Except for the top delta, of course, which we migrate using ploop copy. Surely we can't fsync it before suspending the container, because we keep copying it afterwards. Oh wait... actually we can! By adding an fsync before the CT suspend, we force the data written so far onto the disk, so the second fsync (which happens after everything is copied) will take less time. This time is shown as "Pcopy after suspend".
The problem is that ploop copy consists of two sides -- the sending one and the receiving one -- which communicate over a pipe (with ssh as a transport). It's the sending side that runs the command to freeze the container, and it's the receiving side that should do the fsync, so we need to pass some sort of "do the fsync" command. Better yet, do it without breaking the existing protocol, so nothing bad happens if there is an older version of ploop on the receiving side.
The "do the fsync" command ended up being a data block of 4 bytes, you can see the patch here. Older version will write these 4 bytes to disk, which is unnecessary but OK do to, and newer version will recognize it as a need to do fsync.
The other problem is that the sending side should wait for the fsync to finish before proceeding with the CT suspend. Unfortunately, there is no way to solve this with a one-way pipe, so the sending side just waits for a few seconds. Ugly as it is, this is the best way possible (let us know if you can think of something better).
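To illustrate the idea (this is not the actual patch), here is a sketch of the sending side: it emits a special 4-byte data block that an updated receiver treats as "do the fsync" (an old receiver simply writes 4 harmless bytes to disk), and then waits a few seconds, since the one-way pipe leaves us no reply channel. The header layout and the magic value are assumptions made for the sake of the example:

/* Illustrative only: the real ploop copy wire format and the actual patch
 * differ in details. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Assumed per-block header layout: magic label, block size, position */
struct block_hdr {
	uint32_t magic;
	uint32_t size;		/* 4 == the "do the fsync" pseudo-block */
	uint64_t pos;
};

#define COPY_MAGIC	0x4cc0ac3b	/* placeholder value */
#define FSYNC_WAIT_SEC	5		/* "just waits for a few seconds" */

int send_fsync_request(int out_fd)
{
	struct block_hdr h = { COPY_MAGIC, 4, 0 };
	char dummy[4];

	memset(dummy, 0, sizeof(dummy));
	if (write(out_fd, &h, sizeof(h)) != sizeof(h) ||
	    write(out_fd, dummy, sizeof(dummy)) != sizeof(dummy))
		return -1;

	/* No acknowledgement is possible over a one-way pipe, so just wait */
	sleep(FSYNC_WAIT_SEC);
	return 0;
}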
To summarize, what we have added is a couple of fsyncs (it's actually fdatasync() since it is faster), and here are the results:
As you can see, both "Pcopy after suspend" and "undump + resume" times decreased, shaving off about a second, which gives us about a 25% improvement. Now, take into account that the tests were done on otherwise idle nodes with mostly idle containers; we suspect that the benefit will be more apparent under I/O load. Checking whether this statement is true will be your homework for today!
During my last holiday on the sunny seaside of hospitable Turkey, at night, instead of quenching my thirst or resting after a long and tedious day at the beach, I was sitting in a hotel lobby with free Wi-Fi, trying to make live migration of a container on a ploop device work. I succeeded, with about 20 commits to ploop and another 15 to vzctl, so now I'd like to share my findings and tell the story.
Let's start from the basics and see how migration (i.e. moving a container from one OpenVZ server to another) is implemented. It's vzmigrate, a shell script which does the following (simplified for clarity):
1. Checks that the destination server is available via ssh w/o entering a password, that there is no container with the same ID on it, and so on.
2. Runs rsync of /vz/private/$CTID to the destination server.
3. Stops the container.
4. Runs a second rsync of /vz/private/$CTID to the destination.
5. Starts the container on the destination.
6. Removes it locally.
Obviously, two rsync runs are needed: the first one moves most of the data while the container is still up and running, and the second one moves the changes made between the first rsync run and the container stop.
Now, if we need live migration (the --online option to vzmigrate), then instead of CT stop we do vzctl checkpoint, and instead of start we do vzctl restore. As a result, a container moves to another system without your users noticing (processes are not stopped, just frozen for a few seconds; TCP connections are preserved; IP addresses do not change, etc. -- no cheating, just a little bit of magic).
So this is the way it was working for years, making users happy and singing in the rain. One fine day, though, ploop was introduced, and it was soon discovered that live migration was not working for ploop-based containers. I found a few reasons why (for example, one can't use rsync --sparse for copying ploop images, because the in-kernel ploop driver can't work with files that have holes). But the main thing I found is the proper way of migrating a ploop image: not with rsync, but with ploop copy.
ploop copy is a mechanism for efficiently copying a ploop image with the help of a built-in ploop kernel driver feature called the write tracker. One ploop copy process reads blocks of data from the ploop image and sends them to stdout (prepending each block with a short header consisting of a magic label, the block position, and its size). The other ploop copy process receives this data from stdin and writes it down to disk. If you connect these two processes via a pipe, with ssh $DEST in between, you are all set.
You could say that the cat utility can do almost the same thing. Right. The difference is that before starting to read and send data, ploop copy asks the kernel to turn on the write tracker, and the kernel starts to memorize the list of data blocks that are modified (written to). Then, after all the blocks are sent, ploop copy politely asks for this list and sends the blocks from it again, while the kernel is building another list. The process repeats a few times, and the list becomes shorter and shorter. After a few iterations (either the list is empty, or it is not getting shorter, or we just decide that we have done enough iterations), ploop copy executes an external command which should stop any disk activity for this ploop device. This command is either vzctl stop for offline migration or vzctl checkpoint for live migration; obviously, a stopped or frozen container will not write anything to disk. After that, ploop copy asks the kernel for the list of modified blocks again, transfers the blocks listed, and finally asks the kernel for the list one more time. If this time the list is not empty, something is very wrong: the stopping command hasn't done what it should, and we fail. Otherwise all is good, and ploop copy sends a marker telling that the transfer is over. So this is how the sending process works.
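For clarity, here is a rough sketch of that sending loop in C. All the helper names (track_start, track_get_dirty, send_all_blocks, and so on) are hypothetical placeholders for the real ploop library calls and ioctls; treat it as an illustration of the algorithm described above, not as the actual implementation:

#include <stdlib.h>

#define MAX_ITER 10	/* "we already did enough iterations" */

/* Hypothetical helpers standing in for the real ploop calls and ioctls */
extern int track_start(int devfd);				/* turn on the write tracker */
extern int track_get_dirty(int devfd, long *list, int max);	/* fetch and reset the dirty list */
extern int send_all_blocks(int outfd, int imgfd);		/* initial full-image pass */
extern int send_block(int outfd, int imgfd, long blk);		/* header + one data block */
extern int send_stop_marker(int outfd);				/* "the transfer is over" */

int ploop_copy_send_sketch(int devfd, int imgfd, int outfd, const char *freeze_cmd)
{
	long list[1024];
	int i, j, n, prev = -1;

	if (track_start(devfd) || send_all_blocks(outfd, imgfd))
		return -1;

	/* While the container keeps running, re-send the blocks it dirtied
	 * during the previous pass; stop when the list is empty, stops
	 * getting shorter, or we have done enough passes. */
	for (i = 0; i < MAX_ITER; i++) {
		n = track_get_dirty(devfd, list, 1024);
		if (n <= 0 || (prev >= 0 && n >= prev))
			break;
		for (j = 0; j < n; j++)
			if (send_block(outfd, imgfd, list[j]))
				return -1;
		prev = n;
	}

	/* Stop or freeze the container so it cannot write anything more */
	if (system(freeze_cmd) != 0)
		return -1;

	/* Final pass: send whatever was dirtied during the last iteration */
	n = track_get_dirty(devfd, list, 1024);
	for (j = 0; j < n; j++)
		if (send_block(outfd, imgfd, list[j]))
			return -1;

	/* Sanity check: a frozen container must not dirty any more blocks */
	if (track_get_dirty(devfd, list, 1024) != 0)
		return -1;

	return send_stop_marker(outfd);
}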
The receiving ploop copy process is trivial -- it just reads the blocks from stdin and writes them to the image file at the specified positions. If you want to see the code of both the sending and receiving sides, look no further.
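Here is a matching sketch of the receiving loop, again with an illustrative header layout (the real ploop copy wire format differs in details): read a header, read the data block, and pwrite() it to the given position in the local image, until the end-of-transfer marker arrives:

#include <stdint.h>
#include <unistd.h>

#define COPY_MAGIC 0x4cc0ac3b	/* placeholder value, as in the sketch above */

/* Assumed per-block header layout: magic label, block size, position */
struct block_hdr {
	uint32_t magic;
	uint32_t size;		/* 0 is used here as the end-of-transfer marker */
	uint64_t pos;
};

static int read_full(int fd, void *buf, size_t len)
{
	size_t got = 0;
	ssize_t r;

	while (got < len) {
		r = read(fd, (char *)buf + got, len - got);
		if (r <= 0)
			return -1;
		got += r;
	}
	return 0;
}

int ploop_copy_receive_sketch(int infd, int imgfd)
{
	struct block_hdr h;
	static char buf[1 << 20];	/* assume blocks fit into 1 MB */

	for (;;) {
		if (read_full(infd, &h, sizeof(h)))
			return -1;
		if (h.magic != COPY_MAGIC)
			return -1;
		if (h.size == 0)		/* end of transfer */
			break;
		if (h.size > sizeof(buf))
			return -1;
		if (read_full(infd, buf, h.size))
			return -1;
		if (pwrite(imgfd, buf, h.size, h.pos) != (ssize_t)h.size)
			return -1;
	}
	/* Make sure everything has hit the disk before reporting success */
	return fdatasync(imgfd);
}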
All right, in the migration description above, ploop copy is used in place of the second rsync run (steps 3 and 4). I'd like to note that this way is more efficient, because rsync has to figure out which files have changed and where, while ploop copy just asks the kernel. Also, because the "ask and send" process is iterative, the container is stopped or frozen as late as possible, and even if the container is actively writing data to disk, the period for which it is stopped is minimal.
Just out of pure curiosity I performed a quick non-scientific test, with "od -x /dev/urandom > file" running inside a container while live migrating it back and forth. The ploop copy time after the freeze was a bit over 1 second, and the total frozen time a bit less than 3 seconds. Similar numbers can be obtained with the traditional simfs+rsync migration, but only if the container is not doing any significant I/O. Then I tried to migrate a similar container on simfs with the same command running inside, and the frozen time increased to 13-16 seconds. I don't want to say these measurements are to be trusted; I just ran them without any precautions, with the OpenVZ instances running inside Parallels VMs and the physical server busy with something else...
Oh, the last thing: all this functionality is already included in the latest tool releases, ploop 1.3 and vzctl 3.3.