It's been two months since we released vzctl 4.7 and ploop 1.11 with
some vital improvements to live migration that made it about 25% faster.
Believe it or not,
this is another "make it faster" post, making it, well, faster.
That's right, we have more surprises in store for you! Read on and be delighted.
Asynchronous ploop send
This is something that should have been implemented at the very beginning,
but the initial implementation looked like this (in C):
/* _XXX_ We should use AIO. ploopcopy cannot use cached reads and
* has to use O_DIRECT, which introduces large read latencies.
* AIO is necessary to transfer with maximal speed.
*/
Why are there no cached reads (and why, in general, does the ploop library
use O_DIRECT, with the kernel also doing direct I/O, bypassing the cache)?
Note that ploop images are files used as block devices containing a
filesystem. Access to that ploop block device and its filesystem goes
through the cache as usual, so if we did the same for the ploop image
itself, it would result in double caching and wasted RAM. Therefore, we
do direct I/O on the lower level (when working with ploop images), and
allow the usual Linux cache to be used on the upper level (when a container
is accessing files inside the ploop, which is the most common operation).
So, the ploop image itself is not (usually) cached in memory, other tricks
like read-ahead are not used by the kernel either, and reading each block
from the image takes time, as it is read directly from disk. In our test
lab (OpenVZ running inside KVM with a networked disk) it takes about 10
milliseconds to read each 1 MB block. Then, sending each block to the
destination system over the network also takes time -- about 15 milliseconds
in our setup. So, it takes 25 milliseconds to read and send a block of data
if we do it one after another. Oh wait, this is exactly how we do it!
Solution -- let's do reading and sending at the same time! In other words,
while sending the first block of data we can already read the second one,
and so on, bringing the time required to transfer one block down to
MAX(Tread, Tsend) instead of Tread + Tsend.
The implementation uses POSIX threads. The main ploop send process spawns
a separate sending thread and allocates two buffers -- one for reading and
one for sending -- which then swap places. This works surprisingly well for
something that uses threads, maybe because it is as simple as it could ever
be (one thread, one mutex, one condition variable). If you happen to know
something about pthreads programming, please review the appropriate
commit 55e26e9606.
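The commit itself deals with file descriptors and error handling, but the core pattern is easy to show in isolation. Below is a minimal sketch (my illustration, not the actual ploop source): two ping-pong buffers, one sender thread, one mutex, one condition variable. Disk reads and network sends are simulated with memcpy into in-memory arrays, so the scheme can be seen end to end.

```c
#include <pthread.h>
#include <string.h>

#define BLOCKS  8
#define BLKSIZE 4

static char disk[BLOCKS][BLKSIZE];   /* stands in for the ploop image */
static char remote[BLOCKS][BLKSIZE]; /* stands in for the receiving side */

static char buf[2][BLKSIZE];         /* the two ping-pong buffers */
static int full = -1;                /* buffer index ready to send; -1 = none */
static int done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* Sender thread: wait until a buffer is handed over, then "send" it. */
static void *sender(void *arg)
{
    int nsent = 0;
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (full < 0 && !done)
            pthread_cond_wait(&cond, &lock);
        if (full < 0) {              /* done, and nothing left to send */
            pthread_mutex_unlock(&lock);
            break;
        }
        int idx = full;
        pthread_mutex_unlock(&lock);

        /* "send" outside the lock, overlapping with the next disk read */
        memcpy(remote[nsent++], buf[idx], BLKSIZE);

        pthread_mutex_lock(&lock);
        full = -1;                   /* the buffer is free again */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int ploop_send(void)
{
    pthread_t tid;
    if (pthread_create(&tid, NULL, sender, NULL))
        return -1;
    for (int i = 0; i < BLOCKS; i++) {
        int idx = i % 2;
        /* "read" the next block from disk into the currently free buffer */
        memcpy(buf[idx], disk[i], BLKSIZE);
        pthread_mutex_lock(&lock);
        while (full >= 0)            /* the other buffer is still in flight */
            pthread_cond_wait(&cond, &lock);
        full = idx;                  /* hand the freshly read buffer over */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_lock(&lock);
    while (full >= 0)
        pthread_cond_wait(&cond, &lock);
    done = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return pthread_join(tid, NULL);
}
```

The point of the two buffers is that while the sender is busy with one, the reader can safely fill the other, which is exactly what brings the per-block time down to MAX(Tread, Tsend).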
The new async send is used by default; you just need a newer ploop
(1.12, not yet released). Clone and compile ploop from git if you are curious.
As for how much time we save using this, it happens to be 15 milliseconds
instead of 25 per block of data, or 40% faster! Now, the real savings depend
on the number of blocks that need to be migrated after the container is
frozen; it can be 5 or 100, so overall savings can range from 0.05 to 1 second.
ploop copy with feedback
The previous post on live migration improvements
described an optimization of doing fdatasync() on the receiving
ploop side before suspending the container on the sending side. It also
noted that the implementation was sub-par:
The other problem is that sending side should wait for fsync to finish,
in order to proceed with CT suspend. Unfortunately, there is no way to solve
this one with a one-way pipe, so the sending side just waits for a few
seconds. Ugly as it is, this is the best way possible (let us know if
you can think of something better).
So, the problem is trivial -- there's a need for a bi-directional
channel between the ploop copy sender and receiver, so the receiver can say
"All right, I have synced the freaking delta" back to the sender. In
addition, we want to do it in a simple way, not much more complicated than
it is now.
After some playing around with different approaches, it seemed
that OpenSSH
port forwarding, combined with tools like netcat, socat, or bash
/dev/tcp feature, can do the trick of establishing a two-way pipe
between ploop copy sides.
netcat (or nc) comes in various varieties, which might or might not be
available and/or compatible. socat is a bit better, but one problem with it
is that it ignores its child's exit code, so there is no (simple) way to
detect an error. A problem with the /dev/tcp feature is that it is
bash-specific, and bash itself might not be universally available
(Debian and Ubuntu users are well aware of that fact).
So, all this was replaced with a tiny home-grown vznnc ("nano-netcat")
tool. It is indeed nano, as there are only 200 lines of code in it. It can
either listen on or connect to a specified port at localhost (note that ssh
is doing the real networking for us), and run a program with its
stdin/stdout (or a specified file descriptor) already connected to that TCP
socket. Again, similar to netcat, but small, to the point, and hopefully
bug-free.
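For illustration, the core of such a tool boils down to a few socket calls. This is a sketch with made-up function names, not vznnc's actual code (see the ploop git tree for that): a listen side, a connect side, and the fd wiring before exec.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind and listen on 127.0.0.1; *port == 0 picks a free port,
 * and the chosen port number is written back. */
int nnc_listen(int *port)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0)
        return -1;
    struct sockaddr_in sa = { .sin_family = AF_INET };
    sa.sin_port = htons(*port);
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    socklen_t len = sizeof(sa);
    if (bind(s, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
        listen(s, 1) < 0 ||
        getsockname(s, (struct sockaddr *)&sa, &len) < 0) {
        close(s);
        return -1;
    }
    *port = ntohs(sa.sin_port);
    return s;
}

/* Connect to 127.0.0.1:port; ssh port forwarding does the real networking. */
int nnc_connect(int port)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0)
        return -1;
    struct sockaddr_in sa = { .sin_family = AF_INET };
    sa.sin_port = htons(port);
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    if (connect(s, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        close(s);
        return -1;
    }
    return s;
}

/* Wire the socket to the given fd (0/1 for stdin/stdout) and exec a program. */
int nnc_exec(int sock, int fd, char *const argv[])
{
    if (dup2(sock, fd) < 0)
        return -1;
    close(sock);
    execvp(argv[0], argv);
    return -1; /* reached only if exec failed */
}
```

The real vznnc adds option parsing, error reporting, and child exit-code propagation on top of this skeleton.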
Finally, with this in place, we can make the sending side of ploop copy
wait for feedback from the receiving side, so it can suspend the container
as soon as the remote side has finished syncing. This makes the whole
migration a bit faster (by eliminating the previously hard-coded 5-second
wait-for-sync delay), but it also helps to slash the frozen time, as we
suspend the container as soon as we can, so it won't be given any extra
time to write more data that we'll need to copy while it's frozen.
It's hard to measure the practical impact of the feature, but in our
tests it saves about 3.5 seconds of total migration time, and from 0
to a few seconds of frozen time, depending on container's disk I/O
activity.
For the feature to work, you need the latest versions of vzctl (4.8)
and ploop (1.12) on both nodes (if one node doesn't have the newer tools,
vzmigrate falls back to the old non-feedback way of copying). Note that
these new versions are not yet released at the time of writing, but you
can get them from git and compile them yourself.
Using ssh connection multiplexing
It takes about 0.15 seconds to establish an ssh connection in our test lab.
Your mileage may vary, but it can't be zero. Well, unless an OpenSSH feature
of reusing an existing ssh connection is put to work! It's called connection
multiplexing, and when it is used, once a connection is established, you can
have subsequent ones in practically no time. As vzmigrate is a shell script
and runs a number of commands on the remote (destination) side, using it
might save us some time.
Unfortunately, it's a relatively new OpenSSH feature, and it is implemented
in quite an ugly way in the version available in CentOS 6 -- you need to
keep one open ssh session running. Later versions fixed this by adding a
special daemon-like mode enabled with the ControlPersist option, but alas,
not in CentOS 6. Therefore, we have to maintain a special "master" ssh
connection for the duration of vzmigrate. For implementation details,
see commits
see commits
06212ea3d
and
00b9ce043.
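For reference, here is roughly what the mechanism looks like at the ssh command level (the host name and control socket path are placeholders; vzmigrate does the equivalent internally):

```shell
# Open a long-lived "master" connection in the background (-M master mode,
# -N no remote command, -f go to background, -S control socket path):
ssh -MNf -S /tmp/vzmigrate-ctl-%r@%h:%p dest.example.com

# Every subsequent ssh reuses the master's TCP connection and
# authentication, so it starts almost instantly:
ssh -S /tmp/vzmigrate-ctl-%r@%h:%p dest.example.com hostname

# When the migration is done, shut the master connection down:
ssh -S /tmp/vzmigrate-ctl-%r@%h:%p -O exit dest.example.com
```

On newer OpenSSH you could replace the explicit background master with `-o ControlMaster=auto -o ControlPersist=yes`, but as noted above, that option is not available in CentOS 6.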
This is still experimental, so you need to pass the --ssh-mux flag to
vzmigrate
to use it. You won't believe it (so, go test it yourself!), but this alone
slashes container frozen time by about 25% (which is great, as we fight for
every microsecond here), and improves the total time taken by vzmigrate
by up to 1 second (which is probably not that important but still nice).
What's next?
Current setup used for development and testing is two OpenVZ instances
running in KVM guests on a
ThinkCentre
box. While it's convenient and oh so very space-saving, it is probably
not the best approximation of real-world OpenVZ usage. So, we need
some better hardware:
If you are able to spend $500 to $2000 on ebay, please
let us know by email to donate at openvz dot org so we can arrange it.
Now, quite a few people have offered hosted servers with a similar
configuration. While we are very thankful for all such offers, this time
we are looking for physical, not hosted, hardware.
Update: Thanks to FastVPS, we got the Supermicro 4-node server. If you want to donate, see the wiki: Donations.
It has been almost two years since we wrote about effective live migration with the ploop write tracker. It's time to write some more about it, since we have managed to make ploop live migration yet more effective, by means of pretty simple optimizations. But let's not jump to the conclusion yet; it's a long and interesting story to tell.
As you know, live migration is not quite live, although it looks that way to a user. There is a short time period, usually a few seconds, during which the container being migrated is frozen. This time (shown if -t or -v option to vzmigrate --live is used) is what needs to be optimized, making it as short as possible. In order to do that, one needs to dig into details on what's happening when a container is frozen.
Typical timings obtained via vzmigrate -t --live look like this. We ran a few iterations migrating a container back and forth between two OpenVZ instances (running inside Parallels VMs on the same physical machine), so there are a few columns at the right side.
Apparently, the first suspect to look at is that "undump + resume". Basically, it shows the timing of the vzctl restore command. Why is it so slow? It turns out that ploop mount takes some noticeable time. Let's dig deeper into that process.
First, we implement timestamps in ploop messages, raise the log level, and see what is going on. Apparently, adding deltas is not instant; it takes anywhere from 0.1 second to almost a second. After some more experiments and thinking, it becomes obvious that since the ploop kernel driver works with data in delta files directly, bypassing the page cache, it needs to force those files to be written to disk, and this costly operation happens while the container is frozen. Is it possible to do it earlier? Sure, we just need to force-write the deltas we have just copied before suspending the container. Easy: just call fsync(), or better yet fdatasync(), since we don't really care about metadata being written.
Unfortunately, there is no command line tool to do fsync or fdatasync, so we had to write one and call it from vzmigrate. Is it any better now? Yes indeed, delta adding times went down from tenths to hundredths of a second.
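A minimal equivalent of such a tool (a sketch; function name is mine, not the one that ended up in the vzctl source) fits in a dozen lines of C:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Force a file's data (but not its metadata) to disk. Returns 0 on success. */
int sync_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }
    int ret = fdatasync(fd);
    if (ret)
        perror(path);
    close(fd);
    return ret;
}
```

Wrapping this in a main() that loops over its arguments gives a command line tool that vzmigrate can call on each copied delta before suspending the container.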
Except for the top delta, of course, which we migrate using ploop copy. Surely we can't fsync it before suspending the container, because we keep copying it afterwards. Oh wait... actually we can! By adding an fsync before CT suspend, we force the data to be written to disk, so the second fsync (which happens after everything is copied) will take less time. This time is shown as "Pcopy after suspend".
The problem is that ploop copy consists of two sides -- the sending one and the receiving one -- which communicate over a pipe (with ssh as a transport). It's the sending side that runs the command to freeze the container, and it's the receiving side that should do the fsync, so we need to pass some sort of "do the fsync" command. Better yet, we should do it without breaking the existing protocol, so nothing bad happens if there is an older version of ploop on the receiving side.
The "do the fsync" command ended up being a data block of 4 bytes; you can see the patch here. Older versions will write these 4 bytes to disk, which is unnecessary but OK to do, and newer versions will recognize it as a request to do fsync.
The other problem is that sending side should wait for fsync to finish, in order to proceed with CT suspend. Unfortunately, there is no way to solve this one with a one-way pipe, so the sending side just waits for a few seconds. Ugly as it is, this is the best way possible (let us know if you can think of something better).
To summarize, what we have added is a couple of fsyncs (actually fdatasync(), since it is faster), and here are the results:
As you see, both "pcopy after suspend" and "undump + resume" times decreased, shaving off about a second, which gives us about a 25% improvement. Now, taking into account that the tests were done on otherwise idle nodes with mostly idle containers, we suspect that the benefit will be even more apparent under I/O load. Checking whether this statement is true will be your homework for today!
But let's start with yet another ploop feature -- its expandable format. When you create a ploop container with, say, 10G of disk space, the ploop image is just slightly larger than the size of the actual container files. I just created a centos-6-x86 container -- the ploop image size is 747M, and inside the CT, df shows that 737M is used. Of course, for an empty ploop image (with a fresh filesystem and zero files) the ratio will be worse. Now, when the CT is writing data, the ploop image auto-grows to accommodate the data size.
Now, these images can be layered, or stacked. Imagine having a single ploop image, consisting of blocks. We can add another image on top of the first one, so that reads fall through to the lower image (because the upper one is still empty), while new writes end up being written to the upper (top) image. Perhaps this picture will save some more words here:
The new (top) image is now accumulating all the changes, while the old (bottom) one is in fact a read-only snapshot of the container filesystem. Such a snapshot is cheap and instant, because there is no need to copy a lot of data or do other costly operations. Of course, ploop is not limited to only two levels -- you can create many more (up to 255 if I remember correctly, which is way above any practical limit).
What can be done with such a snapshot? We can mount it and copy all the data to a backup (update: see openvz.org/Ploop/backup). Note that such backup is very fast, online and consistent. There's more to it though. A ploop snapshot, combined with a snapshot of a running container in memory (also known as a checkpoint) and a container configuration file(s), can serve as a real checkpoint to which you can roll back.
Consider the following scenario: you need to upgrade the web site backend inside your container. First, you take a container snapshot (I mean a complete snapshot, including an in-memory image of the running container). Then you upgrade, and realize your web site is all messed up and broken. A horror story, isn't it? No. You just switch to the before-upgrade snapshot and keep working as if nothing happened. It's like moving back in time, and all of this is done on a running container, i.e. you don't have to shut it down.
Finally, when you don't need a snapshot anymore, you can merge it back. Merging is the process by which changes from an upper level are written to the level below it, after which the upper level is removed. Such merging is of course not as instant as creating a snapshot, but it is online, so you can just keep working while ploop performs the merge.
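The whole scheme is easy to model. Below is a toy two-level model (my illustration, not ploop's on-disk format): each delta keeps a bitmap of which blocks it holds, writes land in the top delta, reads fall through to the bottom one, and a merge pushes the top level down.

```c
#include <string.h>

#define NBLOCKS 8
#define BLKSZ   16

/* One delta level: a bitmap of which blocks it holds, plus their data. */
struct delta {
    unsigned char present[NBLOCKS];
    char data[NBLOCKS][BLKSZ];
};

/* Writes always land in the top delta. */
void cow_write(struct delta *top, int blk, const char buf[BLKSZ])
{
    memcpy(top->data[blk], buf, BLKSZ);
    top->present[blk] = 1;
}

/* Reads fall through from the top delta to the one below it. */
const char *cow_read(const struct delta *top, const struct delta *bottom,
                     int blk)
{
    if (top->present[blk])
        return top->data[blk];
    return bottom->data[blk];   /* the read-only snapshot */
}

/* Merge: push the top delta's blocks down; the top can then be dropped. */
void cow_merge(struct delta *top, struct delta *bottom)
{
    for (int blk = 0; blk < NBLOCKS; blk++)
        if (top->present[blk]) {
            memcpy(bottom->data[blk], top->data[blk], BLKSZ);
            bottom->present[blk] = 1;
            top->present[blk] = 0;
        }
}
```

Note how creating a snapshot is just "start a new empty top delta" (no data is copied, which is why it is instant), while the merge has to move every block the top level accumulated.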
We have updated vzctl, ploop and vzquota recently (I wrote about vzctl here). Some changes in packaging are tricky, so let me explain why and give some hints.
For RHEL5-based kernel users (i.e. ovzkernel-2.6.18-028stabXXX) and earlier kernels
Since ploop is only supported in the RHEL6 kernel, we have removed the ploop dependency from vzctl-4.0 (the ploop library is loaded dynamically when needed and if available). Since you have an earlier vzctl version installed, you also have ploop installed. Now you can remove it, at the same time upgrading to vzctl-4.0. That "at the same time" part is done via yum shell:
That should fix it. In the meantime, think about upgrading your systems to RHEL6-based kernel which is better in terms of performance, features, and speed of development.
For RHEL6-based users (i.e. vzkernel-2.6.32-042stabXXX)
The new ploop library (1.5) requires a very recent RHEL6-based kernel (version 2.6.32-042stab061.1 or later) and is not supposed to work with earlier kernels. To protect ploop from earlier kernels, its packaging says "Conflicts: vzkernel < 2.6.32-042stab061.1", which usually prevents ploop 1.5 installation on systems running those kernels.
To fix this conflict, make sure you run the latest kernel, and then remove the old ones:
During my last holiday on the sunny seaside of hospitable Turkey, at night, instead of quenching my thirst or taking a rest after a long and tedious day at the beach, I was sitting in a hotel lobby, where they have free Wi-Fi, trying to make live migration of a container on a ploop device work. I succeeded, with about 20 commits to ploop and another 15 to vzctl, so now I'd like to share my findings and tell the story about it.
Let's start from the basics and see how migration (i.e. moving a container from one OpenVZ server to another) is implemented. It's vzmigrate, a shell script which does the following (simplified for clarity):
1. Checks that a destination server is available via ssh w/o entering a password, that there is no container with the same ID on it, and so on.
2. Runs rsync of /vz/private/$CTID to the destination server.
3. Stops the container.
4. Runs a second rsync of /vz/private/$CTID to the destination.
5. Starts the container on the destination.
6. Removes it locally.
Obviously, two rsync runs are needed: the first one moves most of the data while the container is still up and running, and the second one moves the changes made during the time between the first rsync run and the container stop.
Now, if we need live migration (option --online to vzmigrate), then instead of CT stop we do vzctl checkpoint, and instead of start we do vzctl restore. As a result, a container moves to another system without your users noticing (processes are not stopped, just frozen for a few seconds; TCP connections migrate; IP addresses do not change, etc. -- no cheating, just a little bit of magic).
So this is the way it had been working for years, making users happy and singing in the rain. One fine day, though, ploop was introduced, and it was soon discovered that live migration was not working for ploop-based containers. I found a few reasons why (for example, one can't use rsync --sparse for copying ploop images, because the in-kernel ploop driver can't work with files having holes). But the main thing I found is the proper way of migrating a ploop image: not with rsync, but with ploop copy.
ploop copy is a mechanism for effective copying of a ploop image with the help of a built-in ploop kernel driver feature called the write tracker. One ploop copy process reads blocks of data from the ploop image and sends them to stdout (prepending each block with a short header consisting of a magic label, the block position and its size). The other ploop copy process receives this data from stdin and writes it down to disk. If you connect these two processes via a pipe, and add ssh $DEST in between, you are all set.
You could say that the cat utility can do almost the same thing. Right. The difference is, before starting to read and send data, ploop copy asks the kernel to turn on the write tracker, and the kernel starts to memorize the list of data blocks that are modified (written to). Then, after all the blocks are sent, ploop copy politely expresses its interest in this list, and sends the blocks from the list again, while the kernel is creating another list. The process repeats a few times, and the list becomes shorter and shorter.
After a few iterations (either the list is empty, or it is not getting shorter, or we just decide that we have done enough iterations) ploop copy executes an external command which should stop any disk activity for this ploop device. This command is either vzctl stop for offline migration or vzctl checkpoint for live migration; obviously, a stopped or frozen container will not write anything to disk. After that, ploop copy asks the kernel for the list of modified blocks again, transfers the blocks listed, and finally asks the kernel for this list once more. If this time the list is not empty, something is very wrong: the stopping command hasn't done what it should have, and we fail. Otherwise all is good, and ploop copy sends a marker telling that the transfer is over. So this is how the sending process works.
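The iterative loop described above can be modeled in a few lines. In this toy simulation (my code, not ploop's), the kernel's tracker is just a bitmap, and container writes keep re-dirtying a shrinking set of blocks while the copy races with them:

```c
#include <string.h>

#define NBLK 16

/* A toy stand-in for the kernel's write tracker: which blocks are dirty. */
static unsigned char dirty[NBLK];
static int ct_frozen;

/* While the container runs, writes keep re-dirtying some blocks;
 * once frozen, nothing new gets dirty. */
static void container_writes(int n)
{
    if (!ct_frozen)
        for (int i = 0; i < n && i < NBLK; i++)
            dirty[i] = 1;
}

/* "Transfer" all currently dirty blocks and clear the list. */
static int send_dirty(void)
{
    int n = 0;
    for (int i = 0; i < NBLK; i++)
        if (dirty[i]) {
            dirty[i] = 0;
            n++;
        }
    return n;
}

/* The iterative send loop: returns the number of blocks still dirty after
 * the final pass under freeze -- which must be 0, or the migration fails. */
int ploop_copy_send(void)
{
    int prev = NBLK + 1, iter = 0;

    memset(dirty, 1, sizeof(dirty));  /* first pass: everything is "dirty" */
    for (;;) {
        int n = send_dirty();
        container_writes(prev / 2);   /* container writes racing with us */
        if (n == 0 || n >= prev || ++iter > 10)
            break;     /* list empty, not shrinking, or enough iterations */
        prev = n;
    }
    ct_frozen = 1;     /* vzctl checkpoint (or vzctl stop) */
    send_dirty();      /* final pass while the container is frozen */
    return send_dirty();
}
```

The stopping conditions mirror the text: an empty list, a list that stopped shrinking, or an iteration cap; everything left after the freeze is sent in one last pass, and a non-empty list after that would mean the freeze command failed.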
The receiving ploop copy process is trivial -- it just reads blocks from stdin and writes them to the file (at the specified position). If you want to see the code of both the sending and receiving sides, look no further.
All right, in the migration sequence described above, ploop copy is used in place of the second rsync run (steps 3 and 4). I'd like to note that this way is more effective, because rsync has to figure out which files have changed and where, while ploop copy just asks the kernel about it. Also, because the "ask and send" process is iterative, the container is stopped or frozen as late as possible, and even if the container is actively writing data to disk, the period for which it is stopped is minimal.
Just out of pure curiosity, I performed a quick non-scientific test, having "od -x /dev/urandom > file" running inside a container while live migrating it back and forth. The ploop copy time after the freeze was a bit over 1 second, and the total frozen time a bit less than 3 seconds. Similar numbers can be obtained with the traditional simfs+rsync migration, but only if the container is not doing any significant I/O. Then I tried to migrate a similar container on simfs with the same command running inside, and the frozen time increased to 13-16 seconds. I don't claim these measurements are to be trusted; I just ran the test without any precautions, with OpenVZ instances running inside Parallels VMs, and with the physical server busy with something else...
Oh, the last thing. All this functionality is already included in the latest tools releases: ploop 1.3 and vzctl 3.3.
OpenVZ has just introduced kernel and tools support for container-in-a-file technology, also known as ploop. This post tries to summarize why ploop is needed, and why it is a superior technology compared to what we had before.
Before ploop: simfs and vzquota
First of all, a few facts about the pre-ploop era technologies and their limitations.
As you are probably aware, a container file system was just a directory on the host, which a new container was chroot()-ed into. Although it seems like a good and natural idea, there are a number of limitations.
Since containers live on one and the same file system, they all share common properties of that file system (its type, block size, and other options). That means we can not configure the above properties on a per-container basis.
One such property that deserves a special item in this list is the file system journal. While a journal is a good thing to have, because it helps maintain file system integrity and improves reboot times (by eliminating fsck in many cases), it is also a bottleneck for containers. If one container fills up the in-memory journal (with lots of small operations leading to file metadata updates, e.g. file truncates), all the other containers' I/O will block waiting for the journal to be written to disk. In some extreme cases we saw up to 15 seconds of such blockage.
Since many containers share the same file system with limited space, in order to limit container disk space we had to develop per-directory disk quotas (i.e. vzquota).
Again, since many containers share the same file system, and the number of inodes on a file system is limited [for most file systems], vzquota also has to be able to limit inodes on a per-container (per-directory) basis.
In order for in-container (aka second-level) disk quota (i.e. standard per-user and per-group UNIX disk quota) to work, we had to provide a dummy file system called simfs. Its sole purpose is to have a superblock, which is needed for disk quota to work.
When doing a live migration without some sort of shared storage (like NAS or SAN), we sync the files to the destination system using rsync, which makes an exact copy of all files, except that their inode numbers on disk will change. If there are apps that rely on files' inode numbers being constant (which is normally the case), those apps will not survive the migration.
Finally, a container backup or snapshot is harder to do because there are lots of small files that need to be copied.
Introducing ploop
In order to address the above problems and ultimately make the world a better place, we decided to implement a container-in-a-file technology, not unlike what various VM products are using, but working as efficiently as all the other container bits and pieces in OpenVZ.
The main idea of ploop is to have an image file, use it as a block device, and create and use a file system on that device. Some readers will recognize that this is exactly what the Linux loop device does! Right, except that the loop device is very inefficient (say, using it leads to double caching of data in memory) and its functionality is very limited.
The ploop implementation in the kernel has a modular and layered design.
The top layer is the main ploop module, which provides a virtual block device to be used for the CT file system.
The middle layer is the image format module, which translates block device block numbers into image file block numbers. A simple format module called "raw" does a trivial 1:1 translation, the same as the existing loop device.
A more sophisticated format module keeps a translation table and is able to dynamically grow and shrink the image file. That means, if you create a container with 2GB of disk space, the image file size will not be 2GB, but less -- the size of the actual data stored in the container. It is also possible to support other image formats by writing other ploop format modules, such as one for QCOW2 (used by QEMU and KVM).
The bottom layer is the I/O module. Currently, modules for direct I/O on an ext4 device and for NFS are available. There are plans to also have a generic VFS module, which would be able to store images on any decent file system, but that needs an efficient direct I/O implementation in the VFS layer, which is still being worked on.
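The translation the expandable format module performs can be illustrated with a toy model (my sketch; in the real format the allocation table lives inside the image file itself, and block sizes differ):

```c
#include <stdint.h>

#define NBLOCKS_MAX 1024
#define UNALLOCATED UINT32_MAX

/* Expandable-format index: virtual block number -> block number inside the
 * image file. */
struct ploop_index {
    uint32_t map[NBLOCKS_MAX];
    uint32_t next_free;   /* the image grows by appending blocks at its end */
};

void index_init(struct ploop_index *idx)
{
    for (int i = 0; i < NBLOCKS_MAX; i++)
        idx->map[i] = UNALLOCATED;
    idx->next_free = 1;   /* block 0 is reserved for the header and index */
}

/* Never-written blocks occupy no space at all: reads of them are zeroes. */
uint32_t index_lookup(const struct ploop_index *idx, uint32_t vblk)
{
    return idx->map[vblk];
}

/* The first write to a virtual block appends a new block to the image. */
uint32_t index_alloc(struct ploop_index *idx, uint32_t vblk)
{
    if (idx->map[vblk] == UNALLOCATED)
        idx->map[vblk] = idx->next_free++;
    return idx->map[vblk];
}
```

This is why a 2GB container can live in an image much smaller than 2GB: image blocks are only allocated for virtual blocks that have actually been written to.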
Ploop benefits
In a nutshell:
File system journal is not bottleneck anymore
Large-size image files I/O instead of lots of small-size files I/O on management operations
Disk space quota can be implemented based on virtual device sizes; no need for per-directory quotas
Number of inodes doesn't have to be limited because this is not a shared resource anymore (each CT has its own file system)
Live backup is easy and consistent
Live migration is reliable and efficient
Different containers may use file systems of different types and properties
In addition:
Efficient container creation
[Potential] support for QCOW2 and other image formats