For the last two weeks or so we've been working on vzstats -- a way to get some statistics about OpenVZ usage. The system consists of a server, deployed to http://stats.openvz.org/, and clients installed onto OpenVZ machines (hardware nodes). This is currently in beta testing, with 71 servers participating at the moment. If you want to participate, read http://openvz.org/vzstats and run yum install vzstats on your OpenVZ boxes.
So far we have some interesting results. We are not sure how representative they are -- probably they aren't; many more servers need to participate -- but they are interesting nevertheless. Let's share a few preliminary findings.
First, it looks like almost no one is using 32-bit on the host system anymore. This is reasonable and expected. Indeed, who needs a system limited to 4 GB of RAM nowadays?
Second, many hosts stay on the latest stable RHEL6-based OpenVZ kernel. This is pretty good, and above our expectations.
Third, very few run ploop-based containers. We don't understand why. Maybe we should write more about the features you get from ploop, such as instant snapshots and improved live migration.
The OpenVZ Project has a booth at SCALE 11x. Kir and I will be staffing it. I wrote a lengthy blog post with the day 1 report where you can read more if you want:
I talked with Kir for about an hour last night and asked him a lot of questions. I got a lot of inside information that I will be sharing in the coming days... I promise... with pictures even.
If you are in the Southern California area and can make it to SCALE, please stop by our booth. We are #93. Assuming there is a reasonable Internet connection at the booth, I will also try to be "Live from SCALE" in the #openvz IRC channel. The times are Saturday 10:00 - 18:00 PST and Sunday 10:00 - 16:00 PST.
You know I love to write about performance comparisons, right?
I just came across a paper I had seen some time ago and forgotten about, but which is now linked from the Wikipedia article about OpenVZ. The paper (presented at OLS'08) compares the performance of OpenVZ, Linux-VServer, Xen, KVM and QEMU with that of non-virtualized Linux. The results are shocking! OpenVZ is twice as slow as the reference system when doing bzip2 -9 of an ISO image! Even better, on a dbench test OpenVZ performance was about 13% of the reference system (which is almost eight times slower), while Linux-VServer gave about 98%.
Well guys, I want to tell you just one thing. If someone offers to sell you a car for 13% of the usual price -- don't buy it, it's a scam and the seller is not to be trusted. If someone tells you OpenVZ is 2 (or 8) times slower than a non-virtualized system -- don't buy that either!
I mean, I do know how complicated it is to get sane test results. I spent about a year leading the SWsoft QA team, mostly testing Linux kernels and trying to make sure test results were sane and reproducible (not dependent on the phase of the moon, etc.). There are lots and lots of factors involved. But the main idea is simple: if the results don't look plausible, if you can't explain them, dig deeper and find out why. Once you do, you will know how to exclude all the bad factors and conduct a proper test.
We have updated vzctl, ploop and vzquota recently (I wrote about vzctl here). Some changes in packaging are tricky, so let me explain why and give some hints.
For users of RHEL5-based (i.e. ovzkernel-2.6.18-028stabXXX) and earlier kernels
Since ploop is only supported in the RHEL6-based kernel, we have removed the ploop dependency from vzctl-4.0 (the ploop library is loaded dynamically when needed and if available). Since you had an earlier vzctl version installed, you also have ploop installed. Now you can remove it, upgrading to vzctl-4.0 at the same time. That "at the same time" part is done via yum shell:
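Here is a sketch of one such yum shell transaction ("#" is the root shell prompt, ">" the yum shell prompt); the package names (ploop, ploop-lib) are the ones the OpenVZ repository currently uses, so adjust them if your system differs:

    # yum shell
    > remove ploop ploop-lib
    > update vzctl
    > run
    > quit

The "run" command executes the removal and the update as a single transaction, so the dependency check passes.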
That should fix it. In the meantime, think about upgrading your systems to the RHEL6-based kernel, which is better in terms of performance, features, and speed of development.
For RHEL6-based kernel users (i.e. vzkernel-2.6.32-042stabXXX)
The new ploop library (1.5) requires a very recent RHEL6-based kernel (version 2.6.32-042stab061.1 or later) and is not supposed to work with earlier kernels. To protect ploop from earlier kernels, its packaging says "Conflicts: vzkernel < 2.6.32-042stab061.1", which usually prevents installing ploop 1.5 on systems that have those kernels.
To fix this conflict, make sure you run the latest kernel, and then remove the old ones:
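A sketch of one way to do that (the old kernel version below is just a placeholder; package-cleanup comes from the yum-utils package):

    yum update vzkernel      # install the latest kernel
    reboot                   # ...and boot into it
    uname -r                 # should now show 2.6.32-042stab061.1 or later

    package-cleanup --oldkernels --count=1    # keep only the newest kernel
    # or remove a specific old package explicitly:
    # yum remove vzkernel-2.6.32-042stabXXX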
The OpenVZ project is 7 years old as of last month. It's hard to believe the number, but looking back, we have done a lot of things together with you, our users.
One of the main project goals was (and still is) to get containers support included upstream, i.e. into the vanilla Linux kernel. In practice, the OpenVZ kernel is a fork of the Linux kernel, and we don't like it that way, for a number of reasons. The main ones are:
We want everyone to benefit from containers, not just those using the OpenVZ kernel. Yes to world domination!
We'd like to concentrate on new features, improvements and bug fixes, rather than forward porting our changes to the next kernel.
So, we were (and still are) working hard to bring in-kernel containers support upstream, and many key pieces are already there in the kernel -- for example, PID and network namespaces, cgroups and the memory controller. This is the functionality that the lxc tool and the libvirt library use. We also use the features we have merged upstream, so with every new kernel branch we have to port less, and the size of our patch set decreases.
CRIU
One such feature is checkpoint/restore, the ability to save a running container's state and then restore it. The main use of this feature is live migration, but there are other usage scenarios as well. While the feature has been present in the OpenVZ kernel since April 2006, it was never accepted into the upstream Linux kernel (nor was the other implementation, proposed by Oren Laadan).
For the last year we have been working on the CRIU project, which aims to reimplement most of the checkpoint/restore functionality in userspace, with bits of kernel support where required. As of now, most of the additional kernel patches needed for CRIU are already in kernel 3.6, and a few more are on their way to 3.7 or 3.8. Speaking of the CRIU tools, they are currently at version 0.2, released on the 20th of September, which already has limited support for checkpointing and restoring an upstream container. Check criu.org for more details, and give it a try. Note that this project is not only for containers -- you can checkpoint any process tree -- it's just that a container works better because it is clearly separated from the rest of the system.
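To give an idea of what "giving it a try" looks like, here is a rough sketch of checkpointing and restoring a single process tree with the tool. The option names are written from memory and may differ between versions, so check criu.org and the built-in help before copying anything; the PID and the directory are arbitrary examples:

    crtools dump    -t 1234 -D /tmp/dump    # freeze and dump the process tree rooted at PID 1234
    crtools restore -t 1234 -D /tmp/dump    # bring it back from the saved image files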
One of the most important things about CRIU is that we are NOT developing it behind closed doors. As usual, we have a wiki and git, but the most important thing is that every patch goes through the public mailing list, so everyone can join the fun.
vzctl for upstream kernel
We have also released vzctl 4.0 recently (on the 25th of September). As you can see from the number, it is a major release, and the main feature is support for non-OpenVZ kernels. Yes, it's true -- now you can get a feel of OpenVZ without installing the OpenVZ kernel. Any recent 3.x kernel should work.
As with the OpenVZ kernel, you can use the ready-made container images we provide for OpenVZ (so-called "OS templates") or employ your own. You can create, start, stop, and delete containers, and set various resource parameters such as RAM and CPU limits. Networking (aside from the routed setup) is also supported -- you can either move a network interface from the host system into the container (--netdev_add) or use a bridged setup (--netif_add). I personally run this stuff on my Fedora 17 desktop using the stock F17 kernel -- it just works!
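If you want to try it yourself, the basic workflow looks something like this -- a sketch assuming vzctl 4.0, a recent 3.x kernel, and a downloaded centos-6-x86_64 OS template; the container ID and the limits are arbitrary examples:

    vzctl create 101 --ostemplate centos-6-x86_64   # create a container from an OS template
    vzctl set 101 --ram 512M --cpus 2 --save        # set RAM and CPU limits
    vzctl set 101 --netif_add eth0 --save           # bridged networking (or --netdev_add to move a host NIC)
    vzctl start 101
    vzctl exec 101 ps ax                            # run a command inside the container
    vzctl stop 101 && vzctl destroy 101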
Having said all that, the OpenVZ kernel is surely in much better shape when it comes to containers support -- it has more features (such as live container snapshots and live migration), better resource management capabilities, and is overall more stable and secure. But the fact that the kernel is now optional makes the whole thing more appealing (or so I hope).
You can find information on how to set up and start using vzctl on the vzctl for upstream kernel wiki page. The page also lists known limitations and pointers to other resources. I definitely recommend giving it a try and sharing your experience! As usual, any bugs found should be reported to the OpenVZ bugzilla.
In the last week or so I've had several questions from new OpenVZ users about container migration. They want to know how it works, what the difference is between live/online and offline migration... and how long it will take. I made a silent screencast a few years ago but it is rather dated... and since I can easily add sound now, I thought I'd make an updated screencast. I didn't really put a lot of production value into it, but overall it meets its objective. I posted it to archive.org as a webm but they automatically convert it to flash and embed it... so enjoy whichever format you prefer.
Checkpoint/restore, or CPT for short, is a nice feature of OpenVZ, and probably the most amazing one. In a nutshell, it's a way to freeze a container and dump its complete state (processes, memory, network connections, etc.) to a file on disk, and then restore from that dump, resuming process execution as if nothing happened. This opens the way to nifty things such as live container migration, fast reboots, high availability setups and so on.
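For reference, this is roughly how CPT is driven from the command line with vzctl -- a minimal sketch, with container ID 101 and the dump file path as arbitrary examples (note that the subcommand is spelled chkpnt):

    vzctl chkpnt 101 --dumpfile /vz/dump/ct101.dump    # freeze CT 101 and dump its state to a file
    vzctl restore 101 --dumpfile /vz/dump/ct101.dump   # resume it from that dump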
It is our ultimate goal to merge all the bits and pieces of OpenVZ into the mainstream Linux kernel. It's not a big secret that we failed miserably trying to merge the checkpoint/restore functionality (and yes, we have tried more than once). The fact that everyone else failed as well soothes the pain a bit, but is not really helpful. The reason is simple: the CPT code is big, complex, and touches way too many places in the kernel.
So we* came up with the idea of implementing most of the CPT stuff in user space, i.e. as a separate program, not as part of the Linux kernel. In practice this is impossible because some kernel trickery is still required here and there, but the whole point was to limit the kernel intervention to the bare minimum.
Guess what? It worked even better than we expected. As of today, after about a year of development, up to 90% of the stuff that needs to be in the kernel is already there, and the rest is ready and seems to be relatively easy to merge (see this table to get an idea of what's in and what's not).
As for the user space stuff, we are happy to announce the release of CRtools version 0.1. Now, let me step aside and quote Pavel's announcement:
The tool can already be used for checkpointing and restoring various individual applications. And the greatest thing about this so far is that most of the below functionality has the required kernel support in the recently released v3.5!
So, we support now
* x86_64 architecture
* process' linkage
* process groups and sessions (without ttys though :\ )
* memory mappings of any kind (shared, file, etc.)
* threads
* open files (shared between tasks and partially opened-and-unlinked)
* pipes and fifos with data
* unix sockets with packet queues contents
* TCP and UDP sockets (TCP connections support exists, but needs polishing)
* inotifies, eventpoll and eventfd
* tasks' sigactions setup, credentials and itimers
* IPC, mount and PID namespaces
Though namespaces support is in there, we do not yet support an LXC container c/r, but we're close to it :)
I'd like to thank everyone who took part in new kernel APIs discussions, the feedback was great! Special thanks goes to Linus for letting the kernel parts in early, instead of making them sit out of tree till becoming stable enough.
There are still things for which we don't have the kernel support merged (SysVIPC and various anon file descriptors, i.e. inotify, eventpoll, eventfd) yet. We have the kernel branch with the stuff applied available at
What's next? We will be rebasing OpenVZ to Linux kernel 3.5 (most probably) and will try to reuse CRIU for checkpointing and restoring OpenVZ containers, effectively killing a huge chunk of the out-of-tree kernel code that we have in the OpenVZ kernel.
* In fact it was Pavel Emelyanov, our chief kernel architect, but it feels oh so nice to say "we" that we can't refrain.
During my last holiday at the sunny seaside of hospitable Turkey, at night, instead of quenching my thirst or resting after a long and tedious day at the beach, I was sitting in the hotel lobby, where they have free Wi-Fi, trying to make live migration of a container on a ploop device work. I succeeded, with about 20 commits to ploop and another 15 to vzctl, so now I'd like to share my findings and tell the story.
Let's start with the basics and see how migration (i.e. moving a container from one OpenVZ server to another) is implemented. It is done by vzmigrate, a shell script which does the following (simplified for clarity):
1. Checks that a destination server is available via ssh w/o entering a password, that there is no container with the same ID on it, and so on.
2. Runs rsync of /vz/private/$CTID to the destination server.
3. Stops the container.
4. Runs a second rsync of /vz/private/$CTID to the destination.
5. Starts the container on the destination.
6. Removes it locally.
Obviously, two rsync runs are needed: the first one moves most of the data while the container is still up and running, and the second one moves the changes made between the first rsync run and the container stop.
Now, if we need live migration (the --online option to vzmigrate), then instead of stopping the CT we do vzctl checkpoint, and instead of starting it we do vzctl restore. As a result, the container moves to another system without your users noticing (processes are not stopped, just frozen for a few seconds, TCP connections are migrated, IP addresses do not change, etc. -- no cheating, just a little bit of magic).
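To make the sequence more tangible, here is a very rough sketch of the --online flow in plain shell. This is not the real vzmigrate script (which also transfers the container config, handles errors, and so on); $CTID and $DEST are placeholders, and the checkpoint subcommand is spelled chkpnt in vzctl:

    rsync -a /vz/private/$CTID/ root@$DEST:/vz/private/$CTID/    # bulk copy while the CT is still running
    vzctl chkpnt $CTID --dumpfile /tmp/$CTID.dump                # freeze and dump (vzctl stop for offline migration)
    rsync -a /vz/private/$CTID/ root@$DEST:/vz/private/$CTID/    # copy what changed in the meantime
    scp /tmp/$CTID.dump root@$DEST:/tmp/
    ssh root@$DEST "vzctl restore $CTID --dumpfile /tmp/$CTID.dump"   # resume on the destination
    vzctl destroy $CTID                                          # remove the local copy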
So this is the way it worked for years, making users happy and singing in the rain. One fine day, though, ploop was introduced, and it was soon discovered that live migration does not work for ploop-based containers. I found a few reasons why (for example, one can't use rsync --sparse for copying ploop images, because the in-kernel ploop driver can't work with files that have holes). But the main thing I found is the proper way of migrating a ploop image: not with rsync, but with ploop copy.
ploop copy is a mechanism for efficiently copying a ploop image, with the help of a built-in ploop kernel driver feature called the write tracker. One ploop copy process reads blocks of data from the ploop image and sends them to stdout (prepending each block with a short header consisting of a magic label, the block position and its size). The other ploop copy process receives this data from stdin and writes it to disk. If you connect these two processes via a pipe, and add ssh $DEST in between, you are all set.
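For illustration only, the pipeline looks roughly like this. The option names here (-s for the sending side's device, -d for the destination image file, -F for the command that stops disk activity) are written from memory and may differ between ploop versions, and in real life vzmigrate sets all of this up for you; the device name and paths are placeholders:

    ploop copy -s /dev/ploopNNNNN -F "vzctl chkpnt $CTID" \
        | ssh root@$DEST "ploop copy -d /vz/private/$CTID/root.hdd/root.hdd"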
You could say that the cat utility can do almost the same thing. Right. The difference is that before starting to read and send data, ploop copy asks the kernel to turn on the write tracker, and the kernel starts memorizing the list of data blocks that get modified (written to). Then, after all the blocks are sent, ploop copy politely expresses its interest in this list and sends the blocks from the list again, while the kernel creates another list. The process repeats a few times, and the list becomes shorter and shorter.
After a few iterations (either the list is empty, or it is not getting any shorter, or we just decide that we have done enough iterations), ploop copy executes an external command which should stop any disk activity on this ploop device. This command is either vzctl stop for offline migration or vzctl checkpoint for live migration; obviously, a stopped or frozen container will not write anything to disk. After that, ploop copy asks the kernel for the list of modified blocks again, transfers the blocks listed, and finally asks the kernel for the list one more time. If this time the list is not empty, something is very wrong: the stopping command hasn't done what it should have, and we fail. Otherwise all is good, and ploop copy sends a marker telling the receiving side that the transfer is over. So this is how the sending process works.
The receiving ploop copy process is trivial -- it just reads blocks from stdin and writes them to the file (at the specified position). If you want to see the code of both the sending and the receiving sides, look no further.
All right, so in the migration description above, ploop copy takes the place of steps 3 and 4 (the container stop and the second rsync run). I'd like to note that this is more efficient, because rsync has to figure out which files have changed and where, while ploop copy simply asks the kernel. Also, because the "ask and send" process is iterative, the container is stopped or frozen as late as possible, so even if the container is actively writing data to disk, the period for which it is stopped is minimal.
Just out of pure curiosity, I performed a quick non-scientific test, with "od -x /dev/urandom > file" running inside a container while live migrating it back and forth. The ploop copy time after the freeze was a bit over 1 second, and the total frozen time a bit less than 3 seconds. Similar numbers can be obtained with the traditional simfs+rsync migration, but only if the container is not doing any significant I/O. Then I tried to migrate a similar container on simfs with the same command running inside, and the frozen time increased to 13-16 seconds. I don't claim these measurements are to be trusted; I just ran the test without any precautions, with the OpenVZ instances running inside Parallels VMs, and with the physical server busy with something else...
Oh, one last thing. All this functionality is already included in the latest tool releases: ploop 1.3 and vzctl 3.3.
In case it doesn't work with vzctl-3.1 (I haven't tried it; we are going to release 3.2 really soon), try the recent pre-3.2 vzctl nightly build; see more info at http://wiki.openvz.org/Download/vzctl/nightly
The fall semester is just around the corner... so it is impossible for me to break away for a trip to Seattle. I hope one or more of you guys can blog so I can attend vicariously.