
Entries by tag: ploop

Yet more live migration goodness

It's been two months since we released vzctl 4.7 and ploop 1.11 with some vital improvements to live migration that made it about 25% faster. Believe it or not, this is another "make it faster" post, making it, well, faster. That's right, we have more surprises in store for you! Read on and be delighted.

Asynchronous ploop send

This is something that should have been implemented at the very beginning, but the initial implementation looked like this (in C):

/* _XXX_ We should use AIO. ploopcopy cannot use cached reads and
 * has to use O_DIRECT, which introduces large read latencies.
 * AIO is necessary to transfer with maximal speed.
 */

Why are there no cached reads (and, in general, why does the ploop library use O_DIRECT, with the kernel also doing direct I/O, bypassing the page cache)? Note that ploop images are files used as block devices containing a filesystem. Access to that ploop block device and filesystem goes through the cache as usual, so if we also cached the ploop image itself, that would result in double caching and wasted RAM. Therefore, we do direct I/O on the lower level (when working with ploop images), and let the usual Linux cache be used on the upper level (when a container accesses files inside the ploop, which is the most common operation).
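For illustration, here is a minimal sketch (in C) of what a single direct read from an image file looks like; the image path and block size are made up, and the real ploop library code is of course more involved:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE (1024 * 1024)   /* 1 MB block, illustrative */

int main(void)
{
    void *buf;
    int fd;

    /* O_DIRECT requires an aligned buffer; posix_memalign() takes care of that */
    if (posix_memalign(&buf, 4096, BLOCK_SIZE))
        return 1;

    /* Hypothetical image path; O_DIRECT bypasses the page cache,
     * so every read really hits the disk */
    fd = open("/vz/private/101/root.hdd/root.hdd", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (read(fd, buf, BLOCK_SIZE) < 0)
        perror("read");

    close(fd);
    free(buf);
    return 0;
}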

So, the ploop image itself is not (usually) cached in memory, other tricks like read-ahead are not used by the kernel either, and reading each block from the image takes time, as it is read directly from disk. In our test lab (OpenVZ running inside KVM with a networked disk) it takes about 10 milliseconds to read each 1 MB block. Then, sending each block to the destination system over the network also takes time, about 15 milliseconds in our setup. So it takes 25 milliseconds to read and send a block of data, if we do it one after another. Oh wait, this is exactly how we do it!

Solution -- let's do reading and sending at the same time! In other words, while sending the first block of data we can already read the second one, and so on, bringing the time required to transfer one block down to MAX(Tread, Tsend) instead of Tread + Tsend.

The implementation uses POSIX threads. The main ploop send process spawns a separate sending thread and uses two buffers -- one being read into while the other is being sent, after which they swap places. This works surprisingly well for something that uses threads, maybe because it is as simple as it could ever be (one thread, one mutex, one condition variable). If you happen to know something about pthreads programming, please review the appropriate commit 55e26e9606.
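Here is a minimal sketch of the idea in C, assuming hypothetical read_block() and send_block() helpers; it only illustrates the one-thread/one-mutex/one-condition-variable scheme, the real code in ploop differs:

#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>

#define BLOCK_SIZE (1024 * 1024)

/* Hypothetical helpers: read the next block from the image (return its
 * size, 0 at EOF) and send one block to the destination */
extern ssize_t read_block(void *buf);
extern void send_block(const void *buf, size_t len);

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static void *send_buf;      /* buffer currently handed over to the sender */
static size_t send_len;     /* its length; 0 means "nothing to send" */
static int done;            /* no more data will arrive */

static void *sender(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    for (;;) {
        while (!send_len && !done)
            pthread_cond_wait(&cond, &lock);
        if (!send_len && done)
            break;
        void *buf = send_buf;
        size_t len = send_len;
        /* Send without holding the lock so the reader can work in parallel */
        pthread_mutex_unlock(&lock);
        send_block(buf, len);
        pthread_mutex_lock(&lock);
        send_len = 0;
        pthread_cond_signal(&cond);   /* tell the reader the slot is free */
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

int async_send(void)
{
    void *buf[2] = { malloc(BLOCK_SIZE), malloc(BLOCK_SIZE) };
    int cur = 0;
    pthread_t th;

    pthread_create(&th, NULL, sender, NULL);

    for (;;) {
        /* Read the next block while the previous one is being sent */
        ssize_t len = read_block(buf[cur]);

        pthread_mutex_lock(&lock);
        while (send_len)              /* wait until the previous send is done */
            pthread_cond_wait(&cond, &lock);
        if (len <= 0) {
            done = 1;
            pthread_cond_signal(&cond);
            pthread_mutex_unlock(&lock);
            break;
        }
        send_buf = buf[cur];
        send_len = len;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        cur = 1 - cur;                /* swap the two buffers */
    }

    pthread_join(th, NULL);
    free(buf[0]);
    free(buf[1]);
    return 0;
}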

The new asynchronous send is used by default; you just need a newer ploop (1.12, not yet released). Clone and compile ploop from git if you are curious.

As for how much time this saves: it comes out to 15 milliseconds instead of 25 per block of data, or 40% faster! The real savings depend on the number of blocks that need to be migrated after the container is frozen -- it can be 5 or 100 -- so the overall gain ranges from about 0.05 to 1 second.

ploop copy with feedback

The previous post on live migration improvements described an optimization of doing fdatasync() on the receiving ploop side before suspending the container on the sending side. It also noted that the implementation was sub-par:

The other problem is that sending side should wait for fsync to finish, in order to proceed with CT suspend. Unfortunately, there is no way to solve this one with a one-way pipe, so the sending side just waits for a few seconds. Ugly as it is, this is the best way possible (let us know if you can think of something better).

So, the problem is trivial -- we need a bi-directional channel between the ploop copy sender and receiver, so the receiver can say "All right, I have synced the freaking delta" back to the sender. In addition, we want to do it in a simple way, not much more complicated than what we have now.

After playing around with different approaches, it seemed that OpenSSH port forwarding, combined with tools like netcat, socat, or bash's /dev/tcp feature, could do the trick of establishing a two-way pipe between the two ploop copy sides.

netcat (or nc) comes in various varieties, which might or might not be available and/or compatible. socat is a bit better, but one problem is that it ignores its child's exit code, so there is no (simple) way to detect an error. The problem with the /dev/tcp feature is that it is bash-specific, and bash itself might not be universally available (Debian and Ubuntu users are well aware of that fact).

So, all this was replaced with a tiny home-grown vznnc ("nano-netcat") tool. It is indeed nano, as there are only about 200 lines of code in it. It can either listen on or connect to a specified port on localhost (note that ssh does the real networking for us), and run a program with its stdin/stdout (or a specified file descriptor) already connected to that TCP socket. Again, similar to netcat, but small, to the point, and hopefully bug-free.
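To give an idea of what such a tool boils down to, here is a stripped-down sketch of the "connect" mode in C (no option parsing, minimal error handling; the real vznnc can also listen and pass the socket as an arbitrary file descriptor):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Usage: ./mini-nnc <port> <program> [args...]
 * Connects to 127.0.0.1:<port> and runs <program> with its
 * stdin/stdout attached to the TCP socket. */
int main(int argc, char *argv[])
{
    struct sockaddr_in addr = { .sin_family = AF_INET };
    int sock;

    if (argc < 3) {
        fprintf(stderr, "Usage: %s port program [args...]\n", argv[0]);
        return 1;
    }

    addr.sin_port = htons(atoi(argv[1]));
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0 || connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* Hand the socket to the child as stdin and stdout */
    dup2(sock, STDIN_FILENO);
    dup2(sock, STDOUT_FILENO);
    if (sock > STDOUT_FILENO)
        close(sock);

    execvp(argv[2], argv + 2);
    perror("execvp");
    return 127;
}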

Finally, with this in place, we can make the sending side of ploop copy wait for feedback from the receiving side, so it can suspend the container as soon as the remote side has finished syncing. This makes the whole migration a bit faster (by eliminating the previously hard-coded 5-second wait-for-sync delay), but it also helps to slash the frozen time: the container is suspended as soon as possible, so it is not given extra time to write more data that we would then have to copy while it is frozen.
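Conceptually, the sending side now does something like the following sketch (the helper names are made up and the real ploop protocol is different; the point is only that we block on an explicit acknowledgement instead of sleeping):

#include <stdio.h>
#include <unistd.h>

/* Hypothetical helpers, for illustration only */
extern void send_remaining_deltas(int data_fd);
extern int  run_suspend_command(void);        /* e.g. freeze the container */

/* data_fd carries image data to the receiver; feedback_fd is the back
 * channel established via vznnc over an ssh-forwarded port */
int copy_with_feedback(int data_fd, int feedback_fd)
{
    char ack;

    send_remaining_deltas(data_fd);

    /* Previously: sleep a few seconds and hope the receiver has fsync'ed.
     * Now: block until the receiver explicitly reports it has synced. */
    if (read(feedback_fd, &ack, 1) != 1) {
        fprintf(stderr, "no feedback from the receiving side\n");
        return -1;
    }

    return run_suspend_command();
}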

It's hard to measure the practical impact of this feature, but in our tests it saves about 3.5 seconds of total migration time, and from 0 to a few seconds of frozen time, depending on the container's disk I/O activity.

For the feature to work, you need the latest versions of vzctl (4.8) and ploop (1.12) on both nodes (if one node doesn't have the newer tools, vzmigrate falls back to the old, non-feedback way of copying). Note that these new versions are not yet released at the time of writing, but you can get them from git and compile them yourself.

Using ssh connection multiplexing

It takes about 0.15 seconds to establish an ssh connection in our test lab. Your mileage may vary, but it can't be zero. Well, unless the OpenSSH feature of reusing an existing ssh connection is put to work! It's called connection multiplexing, and once a connection is established, subsequent ones take practically no time. As vzmigrate is a shell script that runs a number of commands on the remote (destination) side, using it can save us some time.

Unfortunately, it's a relatively new OpenSSH feature, and the version shipped with CentOS 6 implements it rather awkwardly -- you need to keep one open ssh session running. This was fixed in a later version by adding a special daemon-like mode enabled with the ControlPersist option, but alas, not in CentOS 6. Therefore, we have to maintain a special "master" ssh connection for the duration of vzmigrate. For implementation details, see commits 06212ea3d and 00b9ce043.

This is still experimental, so you need to pass the --ssh-mux flag to vzmigrate to use it. You won't believe it (so go test it yourself!), but this alone slashes the container frozen time by about 25% (which is great, as we fight for every microsecond here), and improves the total time taken by vzmigrate by up to 1 second (which is probably not that important, but still nice).

What's next?

The current setup used for development and testing is two OpenVZ instances running in KVM guests on a ThinkCentre box. While it's convenient and oh so very space-saving, it is probably not the best approximation of real-world OpenVZ usage. So, we need some better hardware:

If you are able to spend $500 to $2000 on eBay for such hardware, please let us know by email to donate at openvz dot org so we can arrange it. Quite a few people have offered hosted servers with a similar configuration. While we are very thankful for all such offers, this time we are looking for physical, not hosted, hardware.

Update: Thanks to FastVPS, we got the Supermicro 4 node server. If you want to donate, see wiki: Donations.

ploop and live migration: 2 years later

It has been almost two years since we wrote about effective live migration with the ploop write tracker. It's time to write some more about it, since we have managed to make ploop live migration yet more effective by means of pretty simple optimizations. But let's not jump to the conclusion yet; it's a long and interesting story to tell.

As you know, live migration is not quite live, although it looks that way to a user. There is a short time period, usually a few seconds, during which the container being migrated is frozen. This time (shown if the -t or -v option to vzmigrate --live is used) is what needs to be optimized, making it as short as possible. In order to do that, one needs to dig into the details of what happens while a container is frozen.

Typical timings obtained via vzmigrate -t --live look like this. We ran a few iterations migrating a container back and forth between two OpenVZ instances (running inside Parallels VMs on the same physical machine), so there are a few columns at the right side.

(Software: vzctl 4.6.1, ploop-1.10)

              Iteration     1     2     3     4     5     6    AVG

        Suspend + Dump:   0.58  0.71  0.64  0.64  0.91  0.74  0.703
   Pcopy after suspend:   1.06  0.71  0.50  0.68  1.07  1.29  0.885
        Copy dump file:   0.64  0.67  0.44  0.51  0.43  0.50  0.532
       Undump + Resume:   2.04  2.06  1.94  3.16  2.26  2.32  2.297
                        ------  ----  ----  ----  ----  ----  -----
  Total suspended time:   4.33  4.16  3.54  5.01  4.68  4.85  4.428

The first suspect to look at is "undump + resume". Basically, it shows the timing of the vzctl restore command. Why is it so slow? It turns out that ploop mount takes noticeable time. Let's dig deeper into that process.

First, we implemented timestamps in ploop messages, raised the log level, and looked at what is going on. It turned out that adding deltas is not instant; it takes anywhere from 0.1 second to almost a second. After some more experiments and thinking, it became obvious that since the ploop kernel driver works with data in delta files directly, bypassing the page cache, it needs to force those files to be written to disk, and this costly operation happens while the container is frozen. Is it possible to do it earlier? Sure, we just need to force-write the deltas we have copied before suspending the container. Easy: just call fsync(), or better yet fdatasync(), since we don't really care about the metadata being written.

Unfortunately, there is no standard command-line tool to do fsync or fdatasync, so we had to write one and call it from vzmigrate. Is it any better now? Yes indeed: delta adding times went down from tenths to hundredths of a second.
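Such a tool is almost trivial; here is a minimal sketch (the actual helper we ship may differ in name and details):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Usage: ./fsyncdata FILE...
 * Forces the data of each given file to be written to disk. */
int main(int argc, char *argv[])
{
    int i, ret = 0;

    for (i = 1; i < argc; i++) {
        int fd = open(argv[i], O_RDONLY);

        if (fd < 0 || fdatasync(fd) < 0) {
            perror(argv[i]);
            ret = 1;
        }
        if (fd >= 0)
            close(fd);
    }

    return ret;
}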

Except for the top delta, of course, which we migrate using ploop copy. Surely, we can't fsync it before suspending the container, because we keep copying it afterwards. Oh wait... actually we can! By adding an fsync before the CT suspend, we force the data written so far onto the disk, so the second fsync (which happens after everything is copied) takes less time. This time is shown above as "Pcopy after suspend".

The problem is that ploop copy consists of two sides -- the sending one and the receiving one -- which communicate over a pipe (with ssh as a transport). It's the sending side that runs the command to freeze the container, and it's the receiving side that should do the fsync, so we need to pass some sort of "do the fsync" command. Better yet, we want to do it without breaking the existing protocol, so nothing bad happens if there is an older version of ploop on the receiving side.

The "do the fsync" command ended up being a data block of 4 bytes, you can see the patch here. Older version will write these 4 bytes to disk, which is unnecessary but OK do to, and newer version will recognize it as a need to do fsync.

The other problem is that sending side should wait for fsync to finish, in order to proceed with CT suspend. Unfortunately, there is no way to solve this one with a one-way pipe, so the sending side just waits for a few seconds. Ugly as it is, this is the best way possible (let us know if you can think of something better).

To summarize, what we have added is a couple of fsyncs (it's actually fdatasync() since it is faster), and here are the results:

(Software: vzctl 4.7, ploop-1.11)

              Iteration     1     2     3     4     5     6    AVG

        Suspend + Dump:   0.60  0.60  0.57  0.74  0.59  0.80  0.650
   Pcopy after suspend:   0.41  0.45  0.45  0.49  0.40  0.42  0.437 (-0.4)
        Copy dump file:   0.46  0.44  0.43  0.48  0.47  0.52  0.467
       Undump + Resume:   1.86  1.75  1.67  1.91  1.87  1.84  1.817 (-0.5)
                        ------  ----  ----  ----  ----  ----  -----
  Total suspended time:   3.33  3.24  3.12  3.63  3.35  3.59  3.377 (-1.1)

As you can see, both "Pcopy after suspend" and "undump + resume" times decreased, shaving off about a second in total, which gives us roughly a 25% improvement. Take into account that the tests were done on otherwise idle nodes with mostly idle containers; we suspect the benefit will be even more apparent under I/O load. Checking whether this is true will be your homework for today!

ploop snapshots and backups

OpenVZ ploop is a wonderful technology, and I want to share more of its wonderfulness with you. We have previously covered ploop in general, and its write tracker feature that helps speed up container migration in particular. This time, I'd like to talk about snapshots and backups.

But let's start with yet another ploop feature -- its expandable format. When you create a ploop container with, say, 10G of disk space, the ploop image is just slightly larger than the actual container files. I have just created a centos-6-x86 container: the ploop image size is 747M, and inside the CT df shows that 737M is used. Of course, for an empty ploop image (with a fresh filesystem and zero files) the ratio will be worse. When the CT writes data, the ploop image grows automatically to accommodate it.

Now, these images can be layered, or stacked. Imagine having a single ploop image, consisting of blocks. We can add another image on top of the first one, so that reads fall through to the lower image (because the upper one is still empty), while new writes end up in the upper (top) image. Perhaps this picture will save some more words here:

[Image: two stacked ploop images -- reads fall through to the lower image, writes go to the top one]

The new (top) image now accumulates all the changes, while the old (bottom) one is in fact a read-only snapshot of the container filesystem. Such a snapshot is cheap and instant, because there is no need to copy a lot of data or do other costly operations. Of course, ploop is not limited to two levels -- you can create many more (up to 255 if I remember correctly, which is way beyond any practical limit).
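The read path over such a stack can be pictured roughly like this (a conceptual sketch only; the real kernel code works with cluster-sized blocks and per-delta mapping tables, and all names here are made up):

#include <stdint.h>
#include <string.h>

#define CLUSTER_SIZE (1024 * 1024)
#define UNMAPPED UINT64_MAX

/* One delta (image) level: a map from virtual block number to an offset
 * inside this image file, or UNMAPPED if the block was never written here */
struct delta {
    struct delta *lower;   /* next image down the stack, NULL at the bottom */
    uint64_t *map;
    int img_fd;
};

/* Hypothetical helper that reads from the image file itself */
extern int read_image(int img_fd, uint64_t off, void *buf, size_t len);

int ploop_read_block(struct delta *top, uint64_t blk, void *buf)
{
    struct delta *d;

    /* Walk down the stack: the first level that has the block wins */
    for (d = top; d != NULL; d = d->lower)
        if (d->map[blk] != UNMAPPED)
            return read_image(d->img_fd, d->map[blk], buf, CLUSTER_SIZE);

    /* The block was never written at any level: it reads as zeroes */
    memset(buf, 0, CLUSTER_SIZE);
    return 0;
}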

What can be done with such a snapshot? We can mount it and copy all the data to a backup (update: see openvz.org/Ploop/backup). Note that such a backup is fast, online, and consistent. There's more to it, though. A ploop snapshot, combined with an in-memory snapshot of the running container (also known as a checkpoint) and the container configuration file(s), can serve as a real checkpoint to which you can roll back.

Consider the following scenario: you need to upgrade the web site backend inside your container. First, you take a container snapshot (I mean a complete snapshot, including the in-memory image of the running container). Then you upgrade, and realize your web site is all messed up and broken. A horror story, isn't it? No. You just switch back to the before-upgrade snapshot and keep working as if nothing had happened. It's like moving back in time, and all of this is done on a running container, i.e. you don't have to shut it down.

Finally, when you don't need a snapshot anymore, you can merge it back. Merging means that the changes from an upper level are written down to the level beneath it, and then the upper level is removed. Merging is of course not as instant as creating a snapshot, but it is online, so you can keep working while ploop performs the merge.
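In terms of the sketch above (reusing its struct delta, CLUSTER_SIZE, UNMAPPED and read_image), merging could look roughly like this; again purely illustrative, the real merge is done by the kernel, online and block by block:

/* Hypothetical helpers, continuing the read-path sketch above */
extern int write_image(int img_fd, uint64_t off, const void *buf, size_t len);
extern uint64_t allocate_block(struct delta *d);   /* grow the lower image */

/* Merge the top delta into the one below it; nblocks is the number of
 * virtual blocks of the device, buf is a CLUSTER_SIZE scratch buffer */
int merge_top_delta(struct delta *top, uint64_t nblocks, void *buf)
{
    struct delta *lower = top->lower;
    uint64_t blk;

    for (blk = 0; blk < nblocks; blk++) {
        if (top->map[blk] == UNMAPPED)
            continue;                        /* nothing changed at this block */

        if (read_image(top->img_fd, top->map[blk], buf, CLUSTER_SIZE))
            return -1;

        if (lower->map[blk] == UNMAPPED)     /* never written below: allocate */
            lower->map[blk] = allocate_block(lower);

        if (write_image(lower->img_fd, lower->map[blk], buf, CLUSTER_SIZE))
            return -1;
    }

    /* Once every block is written down, the top image can be dropped */
    return 0;
}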

All this can be performed from the command line using vzctl. For details, see the vzctl(8) man page, section Snapshotting. Here's a quick howto:

Create a snapshot:
vzctl snapshot $CTID [--id $UUID] [--name name] [--description desc] [--skip-suspend] [--skip-config]

Mount a snapshot (say, to copy the data to a backup):
vzctl snapshot-mount $CTID --id $UUID --target $DIR

Roll back to a snapshot:
vzctl snapshot-switch $CTID --id $UUID

Delete a snapshot (merging its data into the lower-level image):
vzctl snapshot-delete $CTID --id $UUID

vzctl / ploop update

We have updated vzctl, ploop and vzquota recently (I wrote about vzctl here). Some changes in packaging are tricky, so let me explain why and give some hints.

For RHEL5-based kernel users (i.e. ovzkernel-2.6.18-028stabXXX) and earlier kernels

Since ploop is only supported by the RHEL6-based kernel, we have removed the ploop dependency from vzctl-4.0 (the ploop library is loaded dynamically, when needed and if available). Since you have an earlier vzctl version installed, you also have ploop installed. Now you can remove it while upgrading to vzctl-4.0 at the same time. That "at the same time" part is done via yum shell:

# yum shell
> remove ploop ploop-lib
> update vzctl
> run

That should fix it. In the meantime, think about upgrading your systems to the RHEL6-based kernel, which is better in terms of performance, features, and speed of development.

For RHEL6-based users (i.e. vzkernel-2.6.32-042stabXXX)

The new ploop library (1.5) requires a very recent RHEL6-based kernel (version 2.6.32-042stab061.1 or later) and is not supposed to work with earlier kernels. To protect against that, its packaging says "Conflicts: vzkernel < 2.6.32-042stab061.1", which prevents ploop 1.5 from being installed on systems that still have those kernels.

To fix this conflict, make sure you run the latest kernel, and then remove the old ones:

# yum remove "vzkernel < 2.6.32-042stab061.1"

Then you can run update as usual:

# yum update

Hope that helps

Update: comments disabled due to spam.

Effective live migration with ploop write tracker

During my last holiday, at night, on the sunny seaside of hospitable Turkey, instead of quenching my thirst or resting after a long and tedious day at the beach, I was sitting in the hotel lobby, where they have free Wi-Fi, trying to make live migration of a container on a ploop device work. I succeeded, with about 20 commits to ploop and another 15 to vzctl, so now I'd like to share my findings and tell the story.

Let's start from the basics and see how migration (i.e. moving a container from one OpenVZ server to another) is implemented. It's vzmigrate, a shell script which does the following (simplified for clarity):

1. Checks that a destination server is available via ssh w/o entering a password, that there is no container with the same ID on it, and so on.
2. Runs rsync of /vz/private/$CTID to the destination server
3. Stops the container
4. Runs a second rsync of /vz/private/$CTID to the destination
5. Starts the container on the destination
6. Removes it locally

Two rsync runs are needed so that the first one moves most of the data while the container is still up and running, and the second one moves only the changes made between the first rsync run and the container stop.

Now, if we need live migration (the --online option to vzmigrate), then instead of a CT stop we do vzctl checkpoint, and instead of a start we do vzctl restore. As a result, a container moves to another system without your users noticing (processes are not stopped, just frozen for a few seconds; TCP connections are migrated; IP addresses do not change, etc. -- no cheating, just a little bit of magic).

So this is the way it worked for years, making users happy and singing in the rain. One fine day, though, ploop was introduced, and it was soon discovered that live migration did not work for ploop-based containers. I found a few reasons why (for example, one can't use rsync --sparse for copying ploop images, because the in-kernel ploop driver can't work with files that have holes). But the main thing I found is the proper way of migrating a ploop image: not with rsync, but with ploop copy.

ploop copy is a mechanism for efficiently copying a ploop image with the help of a built-in ploop kernel driver feature called the write tracker. One ploop copy process reads blocks of data from the ploop image and sends them to stdout (prepending each block with a short header consisting of a magic label, the block position, and its size). The other ploop copy process receives this data from stdin and writes it to disk. If you connect these two processes via a pipe, and add ssh $DEST in between, you are all set.

You could say that the cat utility can do almost the same thing. Right. The difference is that before starting to read and send data, ploop copy asks the kernel to turn on the write tracker, and the kernel starts to memorize the list of data blocks that get modified (written to). Then, after all the blocks have been sent, ploop copy politely asks for this list and sends the listed blocks again, while the kernel builds another list. The process repeats a few times, and the list becomes shorter and shorter. After a few iterations (either the list is empty, or it is not getting any shorter, or we just decide we have done enough iterations), ploop copy executes an external command which should stop any disk activity on this ploop device. This command is either vzctl stop for offline migration or vzctl checkpoint for live migration; obviously, a stopped or frozen container will not write anything to disk. After that, ploop copy asks the kernel for the list of modified blocks again, transfers the blocks listed, and finally asks the kernel for the list one more time. If this time the list is not empty, something is very wrong: the stopping command hasn't done what it should, and we fail. Otherwise all is good, and ploop copy sends a marker telling the transfer is over. So this is how the sending process works.
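Schematically, the sending side does something like the following sketch, with hypothetical helpers standing in for the real write-tracker ioctl API and the actual I/O code:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers standing in for the real write-tracker API */
extern int    start_write_tracker(void);
extern size_t get_dirty_blocks(unsigned long *blocks, size_t max);
extern void   send_blocks(const unsigned long *blocks, size_t count);
extern void   send_all_blocks(void);
extern int    run_stop_command(void);  /* "vzctl stop" or "vzctl checkpoint" */
extern void   send_end_marker(void);

#define MAX_ITER 10

int ploop_copy_send(void)
{
    static unsigned long blocks[65536];
    size_t prev = SIZE_MAX;
    int iter;

    start_write_tracker();          /* kernel starts recording modified blocks */
    send_all_blocks();              /* first pass: the whole image */

    for (iter = 0; iter < MAX_ITER; iter++) {
        size_t n = get_dirty_blocks(blocks, 65536);

        if (n == 0 || n >= prev)    /* list is empty or not shrinking anymore */
            break;
        send_blocks(blocks, n);     /* meanwhile the kernel builds a new list */
        prev = n;
    }

    if (run_stop_command())         /* stop or freeze the container */
        return -1;

    /* Final pass: whatever was dirtied before the freeze took effect */
    send_blocks(blocks, get_dirty_blocks(blocks, 65536));

    if (get_dirty_blocks(blocks, 65536) != 0)
        return -1;                  /* still dirty? the stop command failed */

    send_end_marker();
    return 0;
}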

The receiving ploop copy process is trivial -- it just reads the blocks from stdin and writes them to the image file (at the specified position). If you want to see the code of both the sending and receiving sides, look no further.

All right: in the migration procedure described above, ploop copy is used in place of the container stop and the second rsync run (steps 3 and 4). I'd like to note that this is more efficient, because rsync has to figure out which files have changed and where, while ploop copy simply asks the kernel. Also, because the "ask and send" process is iterative, the container is stopped or frozen as late as possible, and even if the container is actively writing data to disk, the period for which it is stopped is minimal.

Just out of pure curiosity, I performed a quick non-scientific test: I kept "od -x /dev/urandom > file" running inside a container and live migrated it back and forth. The ploop copy time after the freeze was a bit over 1 second, and the total frozen time a bit less than 3 seconds. Similar numbers can be obtained from the traditional simfs+rsync migration, but only if the container is not doing any significant I/O. When I migrated a similar simfs-based container with the same command running inside, the frozen time increased to 13-16 seconds. I am not saying these measurements are to be trusted -- I ran them without any precautions, with the OpenVZ instances running inside Parallels VMs and the physical server busy with something else...

Oh, one last thing: all this functionality is already included in the latest tool releases, ploop 1.3 and vzctl 3.3.



Update: comments disabled due to spam

Introducing container in a file aka ploop

OpenVZ has just introduced kernel and tools support for container-in-a-file technology, also known as ploop. This post tries to summarize why ploop is needed, and why it is superior to what we had before.

Before ploop: simfs and vzquota


First of all, a few facts about the pre-ploop era technologies and their limitations.

As you are probably aware, a container file system was just a directory on the host, which a new container was chroot()-ed into. Although it seems like a good and natural idea, there are a number of limitations.

Since all containers live on one and the same file system, they share that file system's properties (its type, block size, and other options). That means we cannot configure these properties on a per-container basis.

One such property that deserves a special mention is the file system journal. While a journal is a good thing to have, because it helps maintain file system integrity and improves reboot times (by eliminating fsck in many cases), it is also a bottleneck for containers. If one container fills up the in-memory journal (with lots of small operations leading to file metadata updates, e.g. file truncates), all the other containers' I/O blocks, waiting for the journal to be written to disk. In some extreme cases we saw up to 15 seconds of such blockage.

Since many containers share the same file system with limited space, in order to limit each container's disk space we had to develop per-directory disk quotas (i.e. vzquota).

Again, since many containers share the same file system, and the number of inodes on a file system is limited [for most file systems], vzquota also has to limit inodes on a per-container (per-directory) basis.

In order for the in-container (aka second-level) disk quota (i.e. the standard per-user and per-group UNIX disk quota) to work, we had to provide a dummy file system called simfs. Its sole purpose is to have a superblock, which the disk quota needs to work.

When doing a live migration without some sort of shared storage (like NAS or SAN), we sync the files to the destination system using rsync, which makes an exact copy of all files -- except that their inode numbers on disk change. If some applications rely on file inode numbers staying constant (which they normally are), those applications do not survive the migration.

Finally, a container backup or snapshot is harder to do because there are a lot of small files that need to be copied.

Introducing ploop


In order to address the above problems and ultimately make the world a better place, we decided to implement container-in-a-file technology, not unlike what various VM products use, but working as efficiently as all the other container bits and pieces in OpenVZ.

The main idea of ploop is to have an image file, use it as a block device, and create and use a file system on that device. Some readers will recognize that this is exactly what the Linux loop device does! Right -- except that the loop device is very inefficient (for example, using it leads to double caching of data in memory) and its functionality is very limited.

The ploop implementation in the kernel has a modular, layered design.

The top layer is the main ploop module, which provides a virtual block device to be used for the CT file system.

The middle layer is the image format module, which translates block device block numbers into image file block numbers. A simple format module called "raw" does a trivial 1:1 translation, the same as the existing loop device.

A more sophisticated format module keeps a translation table and is able to dynamically grow and shrink the image file. That means that if you create a container with 2GB of disk space, the image file size will not be 2GB but less -- the size of the actual data stored in the container. It is also possible to support other image formats by writing additional ploop format modules, such as one for QCOW2 (used by QEMU and KVM).

The bottom layer is the I/O module. Currently, modules for direct I/O on ext4 and for NFS are available. There are plans to also have a generic VFS module, which will be able to store images on any decent file system, but that requires an efficient direct I/O implementation in the VFS layer, which is still being worked on.
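To make the layering more concrete, here is a rough sketch of how such a design could be expressed in C; the structure and function names are invented for illustration and do not match the actual kernel code:

#include <stddef.h>
#include <stdint.h>

/* Bottom layer: how to read/write the backing store (direct I/O on ext4,
 * NFS, potentially a generic VFS backend) */
struct ploop_io_ops {
    int (*read)(int fd, uint64_t off, void *buf, size_t len);
    int (*write)(int fd, uint64_t off, const void *buf, size_t len);
};

/* Middle layer: image format -- maps virtual block numbers to offsets in
 * the image file ("raw" is a 1:1 mapping; the expandable format keeps a
 * translation table and allocates blocks on demand) */
struct ploop_format_ops {
    int (*translate)(void *fmt_data, uint64_t vblock, uint64_t *img_off);
    int (*allocate)(void *fmt_data, uint64_t vblock, uint64_t *img_off);
};

/* Top layer: the virtual block device a container file system lives on */
struct ploop_device {
    const struct ploop_format_ops *fmt;
    const struct ploop_io_ops *io;
    void *fmt_data;
    int img_fd;
};

int ploop_dev_read(struct ploop_device *dev, uint64_t vblock,
                   void *buf, size_t len)
{
    uint64_t off;

    /* The top layer only knows virtual block numbers; the format module
     * turns them into image offsets, and the I/O module does the work */
    if (dev->fmt->translate(dev->fmt_data, vblock, &off))
        return -1;
    return dev->io->read(dev->img_fd, off, buf, len);
}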

Ploop benefits


In a nutshell:

  • The file system journal is not a bottleneck anymore
  • I/O on a few large image files instead of lots of small files during management operations
  • Disk space quota can be implemented based on virtual device sizes; no need for per-directory quotas
  • Number of inodes doesn't have to be limited because this is not a shared resource anymore (each CT has its own file system)
  • Live backup is easy and consistent
  • Live migration is reliable and efficient
  • Different containers may use file systems of different types and properties

In addition:

  • Efficient container creation
  • [Potential] support for QCOW2 and other image formats
  • Support for different storage types


Update: comments are disabled due to spam.
