Effective live migration with ploop write tracker
Let's start from the basics and see how migration (i.e. moving a container from one OpenVZ server to another) is implemented. It's vzmigrate, a shell script which does the following (simplified for clarity):
1. Checks that a destination server is available via ssh w/o entering a password, that there is no container with the same ID on it, and so on.
2. Runs rsync of /vz/private/$CTID to the destination server
3. Stops the container
4. Runs a second rsync of /vz/private/$CTID to the destination
5. Starts the container on the destination
6. Removes it locally
Obviously, two rsync runs are needed, so the first one moves most of the data, while the container is still up and running, and the second one moves the changes made during the time period between the first rsync run and the container stop.
Now, if we need live migration (option --online to vzmigrate), then instead of CT stop we do vzctl checkpoint, and instead of start we do vzctl restore. As a result, a container will move to another system without your users noticing (process are not stopping, just freezing for a few seconds, TCP connections are migrating, IP addresses are not changed etc. -- no cheating, just a little bit of magic).
So this is the way it was working for years, making users happy and singing in the rain. One fine day, though, ploop was introduced and it was soon discovered that live migration is not working for ploop-based containers. I found a few reasons why (for example, one can't use rsync --sparse for copying ploop images, because in-kernel ploop driver can't work with files having holes). But the main thing I found is the proper way of migrating a ploop image: not with rsync, but with ploop copy.
ploop copy is a mechanism of effective copy of a ploop image with the help of build-in ploop kernel driver feature called write tracker. One ploop copy process is reading blocks of data from ploop image and sends them to stdout (prepending each block with a short header consisting of a magic label, a block position and its size). The other ploop copy process receives this data from stdin and writes them down to disk. If you connect these two processes via a pipe, and add ssh $DEST in between, you are all set.
You can say, cat utility can do almost the same thing. Right. The difference is, before starting to read and send data, ploop copy asks the kernel to turn on the write tracker, and the kernel starts to memorize a list of data blocks that are modified (written into). Then, after all the blocks are sent, ploop copy is politely expressing the interest in this list, and send the blocks from the list again, while the kernel is creating another list. The process repeats a few times, and the list becomes shorter and shorter. After a few iterations (either the list is empty, or it is not getting shorter, or we just decide that we already did enough iterations) ploop copy executes an external command, which should stop any disk activity for this ploop device. This command is either vzctl stop for offline migration or vzctl checkpoint for live migration; obviously, stopped or frozen container will not write anything to disk. After that, ploop copy asks the kernel for the list of modified blocks again, transfers the blocks listed, and finally asks the kernel for this list again. If this time the list is not empty, that means something is very wrong, meaning that stopping command haven't done what it should and we fail. Otherwise all is good and ploop copy sends a marker telling the transfer is over. So this is how the sending
The receiving ploop copy process is trivial -- it just reads the blocks from stdin and writes those to file (to a specified position). If you want to see the code of both sending and receiving sides, look no further.
All right, in the previous migration description ploop copy is used in place of second rsync run (steps 3 and 4). I'd like to note that this way it's more effective, because rsync is trying to figure out which files have changed and where, while ploop copy just asks about it from the kernel. Also, because the "ask and send" process is iterative, container will be stopped or frozen as late as it can be and, even if container is writing data to disk actively, the period for which it is stopped is minimal.
Just out of pure curiosity I performed a quick non-scientific test, having "od -x /dev/urandom > file" running inside a container and live migrating it back and forth. ploop copy time after the freeze was a bit over 1 second, and the total frozen time a bit less than 3 seconds. Similar numbers can be obtained from the traditional simfs+rsync migration, but only if a container is not doing any significant I/O. Then I tried to migrate similar container on simfs running the same command running inside, and the frozen time increased to 13-16 seconds. I don't want to say these measurements are to be trusted, I just ran it without any precautions, with OpenVZ instances running inside Parallels VMs, with the physical server busy with something else...
Oh, the last thing. All this functionality is already included into latest tools releases: ploop 1.3 and vzctl 3.3.
Update: comments disabled due to spam