• k001

Donation: a way to help or just say thanks

For many years, when people asked us how they could help the project, our typical answer was something like "just use it, file bugs, spread the word". Some people asked specifically about donating money, and we had to say "currently we have no way to accept it", which basically amounted to "no, we don't need your money". That was not a good answer. Even if our big sponsor generously helps us with everything we need, that doesn't mean your money would be useless.

Today we have opened a PayPal account to accept donations.

How are we going to spend your money? In general:

  • Hardware for development and testing
  • Travel budget for conferences, events etc.
  • Accolades for active contributors

In particular, right now we need:

  • About $200 to cover shipping expenses for donated hardware
  • About $300 for network equipment
  • About $2000 for hard drives

A Donations page has been created on our wiki to track donations and spending. We hope it will see some good progress in the coming days, with a little help from you!

NOTE that if you feel like spending $500 or more, there is yet another way to do it -- a support contract from Parallels, which comes with access to their excellent support team and 10 free support tickets.

  • k001

Yet more live migration goodness

It's been two months since we released vzctl 4.7 and ploop 1.11 with some vital improvements to live migration that made it about 25% faster. Believe it or not, this is another "make it faster" post, making it, well, faster. That's right, we have more surprises in store for you! Read on and be delighted.

Asynchronous ploop send

This is something that should have been implemented from the very beginning, but instead the initial implementation carried this comment (in C):

/* _XXX_ We should use AIO. ploopcopy cannot use cached reads and
 * has to use O_DIRECT, which introduces large read latencies.
 * AIO is necessary to transfer with maximal speed.
 */
Why are there no cached reads (and why, in general, does the ploop library use O_DIRECT, with the kernel also doing direct I/O, bypassing the cache)? Note that ploop images are files used as block devices containing a filesystem. Access to that ploop block device and filesystem goes through the cache as usual, so if we did the same for the ploop image itself, it would result in double caching and a waste of RAM. Therefore, we do direct I/O at the lower level (when working with ploop images), and allow the usual Linux cache to be used at the upper level (when the container accesses files inside the ploop, which is the most common operation).

So, the ploop image itself is not (usually) cached in memory, other tricks like read-ahead are not used by the kernel either, and reading each block from the image takes time, as it is read directly from disk. In our test lab (OpenVZ running inside KVM with a networked disk) it takes about 10 milliseconds to read each 1 MB block. Then, sending each block to the destination system over the network also takes time -- about 15 milliseconds in our setup. So, it takes 25 milliseconds to read and send a block of data, if we do it one after another. Oh wait, this is exactly how we do it!

Solution -- let's do reading and sending at the same time! In other words, while sending the first block of data we can already read the second one, and so on, bringing the time required to transfer one block down to MAX(Tread, Tsend) instead of Tread + Tsend.

The implementation uses POSIX threads. The main ploop send process spawns a separate sending thread and uses two buffers -- one for reading and one for sending -- which then swap places. This works surprisingly well for something that uses threads, maybe because it is as simple as it could ever be (one thread, one mutex, one condition variable). If you happen to know something about pthreads programming, please review the appropriate commit 55e26e9606.

The new async send is used by default, you just need to have a newer ploop (1.12, not yet released). Clone and compile ploop from git if you are curious.

As for how much time this saves, it happens to be 15 milliseconds instead of 25 per block of data, or 40% faster! Now, the real savings depend on the number of blocks that need to be migrated after the container is frozen -- it can be 5 or 100 -- so the overall savings can be from 0.05 to 1 second.

ploop copy with feedback

The previous post on live migration improvements described an optimization of doing fdatasync() on the receiving ploop side before suspending the container on the sending side. It also noted that the implementation was sub-par:

The other problem is that sending side should wait for fsync to finish, in order to proceed with CT suspend. Unfortunately, there is no way to solve this one with a one-way pipe, so the sending side just waits for a few seconds. Ugly as it is, this is the best way possible (let us know if you can think of something better).

So, the problem is trivial -- we need a bi-directional channel between the ploop copy sender and receiver, so the receiver can tell the sender "All right, I have synced the freaking delta". In addition, we want to do it in a simple way, without making things much more complicated than they are now.

After some playing around with different approaches, it seemed that OpenSSH port forwarding, combined with tools like netcat, socat, or the bash /dev/tcp feature, could do the trick of establishing a two-way pipe between the ploop copy sides.

netcat (or nc) exists in several varieties, which might or might not be installed and/or compatible with each other. socat is a bit better, but one problem with it is that it ignores its child's exit code, so there is no (simple) way to detect an error. A problem with the /dev/tcp feature is that it is bash-specific, and bash itself might not be universally available (Debian and Ubuntu users are well aware of that fact).

So, all this was replaced with a tiny home-grown vznnc ("nano-netcat") tool. It is indeed nano, as there are only about 200 lines of code in it. It can either listen on or connect to a specified port on localhost (note that ssh does the real networking for us), and run a program with its stdin/stdout (or a specified file descriptor) already connected to that TCP socket. Again, similar to netcat, but small, to the point, and hopefully bug-free.

Finally, with this in place, we can make the sending side of ploop copy wait for feedback from the receiving side, so it can suspend the container as soon as the remote side has finished syncing. This makes the whole migration a bit faster (by eliminating the previously hard-coded 5-second wait-for-sync delay), but it also helps to slash the frozen time, as we suspend the container exactly when we should, so it is not given any extra time to write more data that we would then have to copy while it is frozen.

It's hard to measure the practical impact of this feature, but in our tests it saves about 3.5 seconds of total migration time, and from 0 to a few seconds of frozen time, depending on the container's disk I/O activity.

For the feature to work, you need the latest versions of vzctl (4.8) and ploop (1.12) on both nodes (if one node has older tools, vzmigrate falls back to the old non-feedback way of copying). Note that these new versions are not yet released at the time of writing, but you can get the code from git and compile it yourself.

Using ssh connection multiplexing

It takes about 0.15 seconds to establish an ssh connection in our test lab. Your mileage may vary, but it can't be zero. Well, unless the OpenSSH feature of reusing an existing ssh connection is put to work! It's called connection multiplexing, and when it is used, once a connection is established, subsequent ones come up in practically no time. As vzmigrate is a shell script that runs a number of commands on the remote (destination) side, using it might save us some time.

Unfortunately, it's a relatively new OpenSSH feature, and the version available in CentOS 6 implements it in quite an ugly way -- you need to keep one open ssh session running. This was fixed in a later version by adding a special daemon-like mode enabled with the ControlPersist option, but alas, not in CentOS 6. Therefore, we have to maintain a special "master" ssh connection for the duration of vzmigrate. For implementation details, see commits 06212ea3d and 00b9ce043.
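For reference, here is roughly what connection multiplexing looks like in ssh_config terms. The option names are real OpenSSH ones; the host alias and socket path are just examples.

```
# ~/.ssh/config
Host dst-node
    ControlMaster auto                  # first connection becomes the master
    ControlPath ~/.ssh/mux-%r@%h:%p     # control socket reused by later ones
    # ControlPersist 10m                # daemon-like master mode; needs a
                                        # newer OpenSSH than CentOS 6 ships,
                                        # which is why vzmigrate has to keep
                                        # a master session open itself
```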

This is still experimental, so you need to pass the --ssh-mux flag to vzmigrate to use it. You won't believe it (so go test it yourself!), but this alone slashes container frozen time by about 25% (which is great, as we fight for every microsecond here), and improves the total time taken by vzmigrate by up to 1 second (which is probably not that important, but still nice).

What's next?

The current setup used for development and testing is two OpenVZ instances running in KVM guests on a ThinkCentre box. While it's convenient and oh so very space-saving, it is probably not the best approximation of real-world OpenVZ usage. So, we need some better hardware:

If you are able to spend $500 to $2000 on eBay, please let us know by email to donate at openvz dot org so we can arrange it. Now, quite a few people have offered hosted servers with a similar configuration. While we are very thankful for all such offers, this time we are looking for physical, not hosted, hardware.

Update: Thanks to FastVPS, we got the Supermicro 4-node server. If you want to donate, see the wiki: Donations.
  • k001

An interview with ANK

This is a rare interview with the legendary Alexey Kuznetsov (a.k.a. ANK), who happens to work for Parallels. Alan Cox once said he had long thought that "Kuznetsov" was a collective name for a secret group of Russian programmers -- because no single man can write so much code at once.

The interview is part of a "work places" series. I tried to do my best translating it into English, but it's far from perfect. I guess it is still a very interesting read.

Q: Who are you and what do you do?

Since the mid-90s I have been one of the Linux maintainers. Back then all the communication was done via conferences and Linux mailing lists. Pretty often I argued aggressively with someone there, I don't remember over what. Now it's fun to recall. Being a maintainer, I wasn't just making something on my own, I also had to control others: kicking out those who were making rubbish (from my point of view), and supporting those who were making something non-rubbish. All these conflicts were exhausting me. At some point I started noticing I was becoming "bronzed" [Alexey is referring to a superiority complex -- Kir]. You say or do some crap, and then learn that it is now considered the right way, since ANK said so.

I started to doubt: maybe I was just using my authority to maintain the status quo. Every single morning started with a fight with myself, then with the world. In 2003 I got fed up with it, so I withdrew from the public eye, and later switched to a different field of knowledge. At that time I started my first project at Parallels. The task was to implement live migration of containers, and it was very complicated.

Now at Parallels we work on the Parallels Cloud Storage project, developing cluster file systems for storing virtual machine images. The technology itself is a few years old already; we did a release recently, and are now working on improving it.

Q: What does your workplace look like?

My workplace is a bunch of computers. But I only work on a notebook, currently a Lenovo T530. The other computers here are used for various purposes. This display standing here, I never use it, nor this keyboard. Well, only if something breaks. Here we have different computers, including a Power Mac, an Intel and an AMD box. I was using those in different years for different experiments. Once I needed to create a cluster of 3 machines right here at my workplace. One machine here is really old, and its sole purpose is to manage a power switch, so I can reboot all the others when working remotely from home. Here I have two Mac Minis and a Power Mac. They are always on, but I use them rarely, only when I need to see something in Parallels Desktop.

Q: What software do you use?

I don't use anything except Google Chrome. Well, and an editor and a compiler, if they qualify as software. I also store some personal data and notes in Evernote.

I only use a text console. For everything. In fact, on newer notebooks, as the screens get better, the console mode works worse and worse. So now I work in a graphical environment, running a few full-screen terminals on top of it. It looks like a good old Unix console. So this is how I live, read email, work.

I do have a GMail account; I use it to read email from my phone. Sometimes it is needed. Or, when I see someone sent me a PDF, I have no choice but to forward that email to where I can open the PDF. Same for PPT. But this is purely theoretical; in practice I never work with PPT.

I use Linux. Currently it is Fedora 13 -- not the newest one, to say the least. I always use the version that served as the base for the corresponding RHEL release. Every few years a new Red Hat [Enterprise Linux] is released, so I install a new system. Then I do not change anything for a few years. Say, 5 years. I can't think of any new OS feature that would force me to update. I am just using the system as an editor, the same way I used it 20 years ago.

I have a phone, a Motorola RAZR Maxx, running Android. I don't like iOS: you can't change anything in there. Not that I like customizations, I like the possibility to customize. I got a Motorola because I hate Samsung. This hatred is absolutely irrational. I had no happiness with any Samsung product; they either didn't work for me or they broke. I need a phone to make calls and check email, that is all I need. Everything else is disabled -- to save the battery.

I also read news over RSS every day, like a morning paper. Now in Feedly; before that it was Google Reader, until they closed it. I have a list of bloggers I read, I won't mention their names. I am also reading Russian and foreign media. There's a nice English-language service, News360. It adapts to what I like and gives me the relevant news. I can't say whether it works or not, but the fact is, what it shows me really is interesting to me. It was showing a lot of sports news at first, but then those disappeared.

I don't use instant messengers like Skype or ICQ; it's just meaningless. If you need something, write an email. If you need it urgently, call. Email and phone cover everything.

Speaking of social networks, I do have a Facebook account with two friends -- my wife and my sister. I view this account only when they post a picture, I don't wander there for no reason.

Q: Is there a use for paper in your work?

It's a mess. I don't have a pen, so when I need one I cannot find it. If I am close to the notebook and need to write something down, I write it to a file. If I don't have the notebook around, I write it to my phone. For these situations I recently started using Google Keep, a service for storing small notes. It is convenient so far. Otherwise I use Evernote. Well, I don't have a system for that. But I do have a database of everything on my notebook: all my email over the years, all the files and notes. All this stuff is indexed. The total size is about 10 gigabytes, since I don't have any graphics. Well, if we threw away all the junk from there, almost nothing would remain.

Q: Is there a dream configuration?

What I have here is more than enough for me. This last notebook is probably the best notebook I ever had.

I was getting used to it for a long time, and swore a lot. I have been using only ThinkPads for a long time. They are similar from version to version, but each next one gets physically bigger and heavier. This is annoying. In this model they changed the keyboard; I had to get used to it, but now I realize this is the best keyboard I ever had. In general, I am pretty satisfied with ThinkPads. Well, if it had a Retina screen and weighed a kilogram less -- that would be ideal.
  • k001

ploop and live migration: 2 years later

It has been almost two years since we wrote about effective live migration with ploop write tracker. It's time to write some more about it, since we have managed to make ploop live migration yet more effective, by means of pretty simple optimizations. But let's not jump to the resolution yet; it's a long and interesting story to tell.

As you know, live migration is not quite live, although it looks that way to the user. There is a short time period, usually a few seconds, during which the container being migrated is frozen. This time (shown if the -t or -v option to vzmigrate --live is used) is what needs to be optimized, making it as short as possible. In order to do that, one needs to dig into the details of what happens while a container is frozen.

Typical timings obtained via vzmigrate -t --live look like this. We ran a few iterations, migrating a container back and forth between two OpenVZ instances (running inside Parallels VMs on the same physical machine), so there are a few columns on the right side.

(Software: vzctl 4.6.1, ploop-1.10)

              Iteration     1     2     3     4     5     6    AVG

        Suspend + Dump:   0.58  0.71  0.64  0.64  0.91  0.74  0.703
   Pcopy after suspend:   1.06  0.71  0.50  0.68  1.07  1.29  0.885
        Copy dump file:   0.64  0.67  0.44  0.51  0.43  0.50  0.532
       Undump + Resume:   2.04  2.06  1.94  3.16  2.26  2.32  2.297
                        ------  ----  ----  ----  ----  ----  -----
  Total suspended time:   4.33  4.16  3.54  5.01  4.68  4.85  4.428

Apparently, the first suspect to look at is that "Undump + Resume". Basically, it shows the timing of the vzctl restore command. Why is it so slow? It turns out that ploop mount takes some noticeable time. Let's dig deeper into that process.

First, we implemented timestamps in ploop messages, raised the log level, and looked at what is going on. It turned out that adding deltas is not instant, taking anywhere from 0.1 second to almost a second. After some more experiments and thinking, it became obvious: since the ploop kernel driver works with data in delta files directly, bypassing the page cache, it needs to force those files to be written to disk, and this costly operation happens while the container is frozen. Is it possible to do it earlier? Sure, we just need to force-write the deltas we have just copied before suspending the container. Easy: just call fsync(), or better yet fdatasync(), since we don't really care about metadata being written.

Unfortunately, there is no command-line tool to do fsync or fdatasync, so we had to write one and call it from vzmigrate. Is it any better now? Yes indeed, delta adding times went down from tenths to hundredths of a second.

Except for the top delta, of course, which we migrate using ploop copy. Surely we can't fsync it before suspending the container, because we keep copying it afterwards. Oh wait... actually we can! By adding an fsync before CT suspend, we force the data written so far onto the disk, so the second fsync (which happens after everything is copied) will take less time. This time is shown as "Pcopy after suspend".

The problem is that ploop copy consists of two sides -- the sending one and the receiving one -- which communicate over a pipe (with ssh as a transport). It's the sending side that runs the command to freeze the container, and it's the receiving side that should do the fsync, so we need to pass some sort of "do the fsync" command. Better yet, do it without breaking the existing protocol, so nothing bad happens if there is an older version of ploop on the receiving side.

The "do the fsync" command ended up being a data block of 4 bytes, you can see the patch here. Older version will write these 4 bytes to disk, which is unnecessary but OK do to, and newer version will recognize it as a need to do fsync.

The other problem is that sending side should wait for fsync to finish, in order to proceed with CT suspend. Unfortunately, there is no way to solve this one with a one-way pipe, so the sending side just waits for a few seconds. Ugly as it is, this is the best way possible (let us know if you can think of something better).

To summarize, what we have added is a couple of fsyncs (it's actually fdatasync() since it is faster), and here are the results:

(Software: vzctl 4.7, ploop-1.11)

              Iteration     1     2     3     4     5     6    AVG

        Suspend + Dump:   0.60  0.60  0.57  0.74  0.59  0.80  0.650
   Pcopy after suspend:   0.41  0.45  0.45  0.49  0.40  0.42  0.437 (-0.4)
        Copy dump file:   0.46  0.44  0.43  0.48  0.47  0.52  0.467
       Undump + Resume:   1.86  1.75  1.67  1.91  1.87  1.84  1.817 (-0.5)
                        ------  ----  ----  ----  ----  ----  -----
  Total suspended time:   3.33  3.24  3.12  3.63  3.35  3.59  3.377 (-1.1)

As you can see, both "Pcopy after suspend" and "Undump + Resume" times decreased, shaving off about a second of time, which gives us about a 25% improvement. Now, taking into account that the tests were done on otherwise idle nodes with mostly idle containers, we suspect the benefit will be even more apparent under I/O load. Checking whether this statement is true will be your homework for today!

  • dowdle

How about an OpenVZ CentOS Variant?

I've used RHEL, CentOS and Fedora for many years... and as many of you already know... back in January, CentOS became a sponsored project of Red Hat. For the upcoming CentOS 7 release they are going beyond just the normal release that is an as-perfect-as-possible clone of RHEL. They have this concept of variants... where Special Interest Groups (SIGs) are formed around making special-purpose builds of CentOS... spins or remixes if you will. I don't know a lot about it yet but I think I have the basic concept correct.

Looking at the numbers, I see:

Top host distros
CentOS 56,725
Scientific 2,471
RHEL 869
Debian 576
Fedora 111
Ubuntu 82
Gentoo 54
openSUSE 18
ALT Linux 10
Sabayon 6


Top 10 CT distros
centos 245,468
debian 106,350
ubuntu 83,197
OR 8,354
gentoo 7,017
pagoda 4,024
scientific 3,604
fedora 3,173
seedunlimited 1,965

Although reporting is optional, the popularity of CentOS as both an OpenVZ host and an OpenVZ container surely has to do with the fact that the two stable branches of the OpenVZ kernel are derived from RHEL kernels.

Wouldn't it be nice if there were a CentOS variant that has the OpenVZ kernel and utils pre-installed? I think so.

While I have made CentOS remixes in the past just for my own personal use... I have not had any official engagement with the CentOS community. I was curious if there were some OpenVZ users out there who are already affiliated with the CentOS Project and who might want to get together in an effort to start a SIG and ultimately an OpenVZ CentOS 7 variant. Anyone? I guess if not, I could make a personal goal of building a CentOS and/or Scientific Linux 6-based remix that includes OpenVZ... as well as working on it after RHEL7 and clones are released... and after such time the OpenVZ Project has released a stable branch based on the RHEL7 kernel.

I will acknowledge up front that some of the top CentOS devs / contributors have historically been fairly nasty to OpenVZ users on the #centos IRC channel. They generally did not want to help someone using a CentOS system running under an OpenVZ kernel... but then again... their reputation is for being obnoxious to many groups of people. :) I don't think we should let that stop us.

Comments, feedback, questions?
  • k001

Why do we still use gzip for templates?

OpenVZ precreated OS templates are tarballs of pre-installed Linux distributions. While there are other ways to create a container, the easiest one is to take such a tarball and extract its contents. This is what takes 99.9% of vzctl create execution time.

To save some space and improve download speeds, those tarballs are compressed with the good ol' gzip tool. For example, the CentOS 6 template tar.gz is about 200 MB in size, while the uncompressed tar would be about 550 MB. But why don't we use more efficient compression tools, such as bzip2 or xz? Say, the same CentOS 6 tarball compressed with xz is as lightweight as 120 MB! Here are the numbers again:

centos-6-x86.tar.gz: 203M
centos-6-x86.tar.xz: 122M
centos-6-x86.tar: 554M

So, why don't we switch to xz, which apparently looks way better? Well, there are other criteria to optimize for besides file size and download speed. In fact, the main optimization target is container creation speed! I just ran a quick non-scientific test on my notebook to prove my point, measuring the time it takes to run tar xf on a tarball:

time tar xf tar.gz: ~7 seconds
time tar xf tar.xz: ~13 seconds

See, it takes twice the time if we switch to xz! Note that this ratio didn't change much when I switched from a fast SSD to a (relatively slow) rotating hard disk drive:

time tar xf tar.gz: ~8 seconds
time tar xf tar.xz: ~16 seconds

Note that while I call it non-scientific, I still ran each test at least three times, with proper syncs, rms and cache drops in between.

Now, do we want to trade a doubling of container creation time for saving 80 MB of disk space? We sure don't!
  • dowdle

Video: Containers and the Cloud

James Bottomley, CTO of Server Virtualization at Parallels, gave a presentation entitled "Containers and The Cloud: Do you need another virtual environment?" on Oct 23. The Linux Foundation recently posted it to YouTube.

There is a lot of good information in the video even for us OpenVZ folks. Enjoy: