The OpenVZ project turned 7 years old last month. It's hard to believe the number, but looking back, we've done a lot of things together with you, our users.
One of the main project goals was (and still is) to get container support included upstream, i.e. into the vanilla Linux kernel. In practice, the OpenVZ kernel is a fork of the Linux kernel, and we don't like it that way, for a number of reasons. The main ones are:
We want everyone to benefit from containers, not just those running the OpenVZ kernel. Yes to world domination!
We'd like to concentrate on new features, improvements and bug fixes, rather than forward porting our changes to the next kernel.
So, we were (and still are) working hard to bring in-kernel container support upstream, and many key pieces are already there in the kernel -- for example, PID and network namespaces, cgroups and the memory controller. This is the functionality that the lxc tools and the libvirt library use. We also use the features we merged into upstream, so with every new kernel branch we have to port less, and the size of our patch set decreases.
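If you want a quick taste of what those upstream pieces alone give you, a shell session in its own namespaces is just one command away. This is a minimal sketch using util-linux's unshare; the option names are those of current util-linux and a reasonably recent kernel (back when these pieces were merged you would have had to call clone()/unshare() with the right flags yourself):

    # run as root: spawn a shell in new PID, mount and network namespaces,
    # with /proc remounted so 'ps' shows only the new namespace
    unshare --fork --pid --mount-proc --net /bin/bash

    # inside the new shell:
    ps ax      # only bash and ps are visible
    ip link    # only a (down) loopback device is present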
CRIU
One such feature is checkpoint/restore, the ability to save a running container's state and then restore it. The main use of this feature is live migration, but there are other usage scenarios as well. While the feature has been present in the OpenVZ kernel since April 2006, it was never accepted into the upstream Linux kernel (nor was the alternative implementation proposed by Oren Laadan).
For the last year we have been working on the CRIU project, which aims to reimplement most of the checkpoint/restore functionality in user space, with bits of kernel support where required. As of now, most of the additional kernel patches needed for CRIU are already there in kernel 3.6, and a few more are on their way to 3.7 or 3.8. Speaking of the CRIU tools, they are currently at version 0.2, released on the 20th of September, and they already have limited support for checkpointing and restoring an upstream container. Check criu.org for more details, and give it a try. Note that this project is not only for containers -- you can checkpoint any process tree -- it's just that a container works better because it is clearly separated from the rest of the system.
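To give a rough idea of what this looks like in practice, here is a minimal sketch of a dump/restore cycle for a single process started from a shell. The command and option names below reflect later CRIU releases (the 0.2-era binary was still called crtools), so treat them as illustrative rather than exact:

    # run as root; start a throwaway process to play with
    sleep 1000 &
    PID=$!

    # dump its state into an images directory (the task is killed after a successful dump)
    mkdir -p /tmp/criu-img
    criu dump -t $PID -D /tmp/criu-img --shell-job

    # bring it back from the images
    criu restore -D /tmp/criu-img --shell-job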
One of the most important things about CRIU is that we are NOT developing it behind closed doors. As usual, we have a wiki and git, but most importantly, every patch goes through the public mailing list, so everyone can join the fun.
vzctl for upstream kernel
We have also released vzctl 4.0 recently (on the 25th of September). As you can see from the number, it is a major release, and the main feature is support for non-OpenVZ kernels. Yes, it's true -- now you can get a feel of OpenVZ without installing the OpenVZ kernel. Any recent 3.x kernel should work.
As with the OpenVZ kernel, you can use the ready-made container images we have for OpenVZ (so-called "OS templates") or employ your own. You can create, start, stop, and delete containers, and set various resource parameters such as RAM and CPU limits. Networking (aside from route-based) is also supported -- you can either move a network interface from the host system into the container (--netdev_add) or use a bridged setup (--netif_add). I personally run this stuff on my Fedora 17 desktop using the stock F17 kernel -- it just works!
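For the curious, a typical session looks roughly like this. The CT ID and template name are made up, and it assumes the template tarball is already in your template cache, so take it as a sketch rather than a copy-paste recipe:

    # create a container from a cached OS template
    vzctl create 101 --ostemplate centos-6-x86_64

    # give it 512 MB of RAM, 1 GB of swap and a bridged network interface
    vzctl set 101 --ram 512M --swap 1G --save
    vzctl set 101 --netif_add eth0 --save

    # start it and get a shell inside
    vzctl start 101
    vzctl enter 101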
Having said all that, the OpenVZ kernel is surely in much better shape as far as container support is concerned -- it has more features (such as live container snapshots and live migration), better resource management capabilities, and is more stable and secure overall. But the fact that the kernel is now optional makes the whole thing more appealing (or so I hope).
You can find information on how to set up and start using vzctl on the "vzctl for upstream kernel" wiki page. The page also lists known limitations and pointers to other resources. I definitely recommend giving it a try and sharing your experience! As usual, any bugs found should be reported to the OpenVZ bugzilla.
Checkpoint/restore, or CPT for short, is a nice feature of OpenVZ, and probably the most amazing one. In a nutshell, it's a way to freeze a container and dump its complete state (processes, memory, network connections, etc.) to a file on disk, and then restore from that dump, resuming process execution as if nothing had happened. This opens the way to nifty things such as live container migration, fast reboots, high availability setups, etc.
It is our ultimate goal to merge all bits and pieces of OpenVZ to the mainstream Linux kernel. It's not a big secret that we failed miserably trying to merge the checkpoint/restore functionality (and yes, we have tried more than once). The fact that everyone else failed as well soothes the pain a bit, but is not really helpful. The reason is simple: CPT code is big, complex, and touches way too many places in the kernel.
So we* came up with the idea of implementing most of the CPT stuff in user space, i.e. as a separate program rather than as a part of the Linux kernel. Doing it entirely in user space is impossible in practice, because some kernel trickery is still required here and there, but the whole point was to limit the kernel intervention to the bare minimum.
Guess what? It worked even better than we expected. As of today, after about a year of development, up to 90% of the stuff that needs to be in the kernel is already there, and the rest is ready and seems to be relatively easy to merge (see this table to get an idea of what's in and what's not).
As for the user space stuff, we are happy to announce the release of CRtools version 0.1. Now, let me step aside and quote Pavel's announcement:
The tool can already be used for checkpointing and restoring various individual applications. And the greatest thing about this so far is that most of the below functionality has the required kernel support in the recently released v3.5!
So, we support now
* x86_64 architecture
* process' linkage
* process groups and sessions (without ttys though :\ )
* memory mappings of any kind (shared, file, etc.)
* threads
* open files (shared between tasks and partially opened-and-unlinked)
* pipes and fifos with data
* unix sockets with packet queues contents
* TCP and UDP sockets (TCP connections support exists, but needs polishing)
* inotifies, eventpoll and eventfd
* tasks' sigactions setup, credentials and itimers
* IPC, mount and PID namespaces
Though namespaces support is in there, we do not yet support an LXC container c/r, but we're close to it :)
I'd like to thank everyone who took part in new kernel APIs discussions, the feedback was great! Special thanks goes to Linus for letting the kernel parts in early, instead of making them sit out of tree till becoming stable enough.
There are still things for which we don't have the kernel support merged (SysVIPC and various anon file descriptors, i.e. inotify, eventpoll, eventfd) yet. We have the kernel branch with the stuff applied available at
What's next? We will be rebasing OpenVZ to Linux kernel 3.5 (most probably) and will try to reuse CRIU for checkpoint and restore of OpenVZ containers, effectively killing a huge chunk of the out-of-tree kernel code that we have in the OpenVZ kernel.
* - In fact it was Pavel Emelyanov, our chief kernel architect, but it feels oh so nice to say "we" that we can't refrain.
OpenVZ has just introduced kernel and tools support for the container-in-a-file technology, also known as ploop. This post tries to summarize why ploop is needed, and why it is superior to what we had before.
Before ploop: simfs and vzquota
First of all, a few facts about the pre-ploop era technologies and their limitations.
As you are probably aware, a container file system used to be just a directory on the host, which a new container was chroot()-ed into. Although it seems like a good and natural idea, it has a number of limitations.
Since all containers live on one and the same file system, they share the common properties of that file system (its type, block size, and other options). That means we cannot configure these properties on a per-container basis.
One such property that deserves a special item in this list is the file system journal. While a journal is a good thing to have, because it helps maintain file system integrity and improves reboot times (by eliminating fsck in many cases), it is also a bottleneck for containers. If one container fills up the in-memory journal (with lots of small operations leading to file metadata updates, e.g. file truncates), all the other containers' I/O will block waiting for the journal to be written to disk. In some extreme cases we saw up to 15 seconds of such blockage.
Since many containers share the same file system with limited space, in order to limit containers' disk space we had to develop per-directory disk quotas (i.e. vzquota).
Again, since many containers share the same file system, and the number of inodes on a file system is limited [for most file systems], vzquota also has to be able to limit inodes on a per-container (per-directory) basis.
In order for the in-container (aka second-level) disk quota (i.e. the standard per-user and per-group UNIX disk quota) to work, we had to provide a dummy file system called simfs. Its sole purpose is to have a superblock, which is needed for disk quota to work.
When doing a live migration without some sort of shared storage (like NAS or SAN), we sync the files to the destination system using rsync, which makes an exact copy of all files, except that their inode numbers on disk will change. If some apps rely on files' inode numbers being constant (which is normally the case), those apps will not survive the migration.
Finally, a container backup or snapshot is harder to do because there are a lot of small files that need to be copied.
Introducing ploop
In order to address the above problems and ultimately make the world a better place, we decided to implement a container-in-a-file technology, not unlike what various VM products use, but working as efficiently as all the other container bits and pieces in OpenVZ.
The main idea of ploop is to have an image file, use it as a block device, and create and use a file system on that device. Some readers will recognize that this is exactly what the Linux loop device does! Right, the only thing is that the loop device is very inefficient (for example, using it leads to double caching of data in memory) and its functionality is very limited.
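For readers who have never played with it, the traditional loop device setup that ploop improves upon looks roughly like this (paths are arbitrary):

    # create a 1 GB image file and attach it to the first free loop device
    dd if=/dev/zero of=/var/tmp/ct.img bs=1M count=1024
    LOOPDEV=$(losetup -f --show /var/tmp/ct.img)

    # put a file system on it and mount it
    mkfs.ext4 "$LOOPDEV"
    mkdir -p /mnt/ct
    mount "$LOOPDEV" /mnt/ct

Ploop keeps the same "image file as a block device" idea, but without the double caching and with the extra format and I/O layers described below.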
The ploop implementation in the kernel has a modular and layered design.
The top layer is the main ploop module, which provides a virtual block device to be used for the CT file system.
The middle layer is the image format module, which translates block device block numbers into image file block numbers. A simple format module called "raw" does a trivial 1:1 translation, the same as the existing loop device.
A more sophisticated format module keeps a translation table and is able to dynamically grow and shrink the image file. That means that if you create a container with 2GB of disk space, the image file size will not be 2GB, but less -- the size of the actual data stored in the container. It is also possible to support other image formats by writing other ploop format modules, such as one for QCOW2 (used by QEMU and KVM).
The bottom layer is the I/O module. Currently, modules for direct I/O on an ext4 device and for NFS are available. There are plans to also have a generic VFS module, which will be able to store images on any decent file system, but that needs an efficient direct I/O implementation in the VFS layer, which is still being worked on.
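In practice you don't deal with these layers directly; a recent vzctl can create a ploop-backed container in one step. A sketch (assuming the ploop packages are installed and your vzctl build supports the --layout option; the exact on-disk layout of the private area may differ):

    # create a container whose root file system lives in a ploop image
    vzctl create 102 --layout ploop --diskspace 10G --ostemplate centos-6-x86_64

    # the image file and its DiskDescriptor.xml live under the container's private area
    ls /vz/private/102/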
Ploop benefits
In a nutshell:
The file system journal is no longer a bottleneck
I/O on a few large image files instead of I/O on lots of small files during management operations
Disk space quota can be implemented based on virtual device sizes; no need for per-directory quotas
The number of inodes doesn't have to be limited, because it is no longer a shared resource (each CT has its own file system)
Live backup is easy and consistent
Live migration is reliable and efficient
Different containers may use file systems of different types and properties
In addition:
Efficient container creation
[Potential] support for QCOW2 and other image formats
Are you ready for the next cool feature? Please welcome CT console.
Available in the RHEL6-based kernel since 042stab048.1, this feature is pretty simple to use. Run vzctl attach CTID to attach to the container's console, and you will be able to see all the messages CT init writes to the console, run a getty on it, or anything else.
Please note that the console is persistent, i.e. it is available even if the container is not running. That way, you can run vzctl attach (later renamed to vzctl console, see the update below) and then (in another terminal) vzctl start. That also means that if a container is stopped, vzctl attach still works.
Press Esc followed by . (a period) to detach from the console.
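A quick way to see it in action, using the command name from this post (see the update below about the rename) and a made-up CT ID:

    # terminal 1: attach to the (persistent) console of a stopped container
    vzctl attach 101

    # terminal 2: start the container and watch its boot messages appear in terminal 1
    vzctl start 101

    # back in terminal 1: press Esc followed by . to detach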
The feature (vzctl git commit) will be available in the upcoming vzctl-3.0.31. I have just made a nightly build of vzctl (version 3.0.30.2-18.git.a1f523f) available so you can test this. Check http://wiki.openvz.org/Download/vzctl/nightly for information on how to get a nightly build.
Update: the feature is renamed to vzctl console.
Update: comments disabled due to spam.
The best feature of the new (RHEL6-based) 042 series of the OpenVZ kernels is definitely vSwap. The short story is, we used to have 22 user beancounter parameters, which every seasoned OpenVZ user knows by heart. Each of these parameters is there for a reason, but 22 knobs are a bit too much to manage for a mere mortal, especially bearing in mind that
many of them are interdependent;
the sum of all limits should not exceed the resources of a given physical server.
Keeping this configuration optimal (or even consistent) is quite a challenging task even for a senior OpenVZ admin (with the probable exception of an ex-airline pilot). This complexity is the main reason why there are multiple articles and blog entries complaining that OpenVZ is worse than Xen, or that it is not suitable for hosting Java apps. We do have some workarounds to mitigate this complexity, such as:
This is still not the way to go. While we think highly of our users, we do not expect all of them to be ex-airline pilots. To reduce the complexity, the number of per-container knobs and handles should be brought down to some decent number, or at least most of these knobs should be optional.
We worked on that for a few years, and the end result is called vSwap (where V is for Vendetta, oh, pardon me, Virtual).
The vSwap concept is as simple as a rectangle. For each container, there are only two required parameters: the memory size (known as physpages) and the swap size (swappages). Almost everyone (not only an admin, but even an advanced end user) knows what RAM and swap are. On a physical server, if there is not enough memory, the system starts to swap out memory pages to disk and then swap some other pages back in, which results in severe performance degradation but keeps the system from failing miserably.
It's about the same with vSwap, except that
RAM and swap are configured on a per container basis;
no I/O is performed until it is really necessary (this is why swap is virtual).
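Configuration-wise, that boils down to just two vzctl parameters per container. A sketch (depending on the vzctl version, the same thing is spelled either --physpages/--swappages or the friendlier --ram/--swap aliases):

    # give container 101 256 MB of RAM and 512 MB of virtual swap
    vzctl set 101 --physpages 256M --swappages 512M --save

    # the same with a newer vzctl
    vzctl set 101 --ram 256M --swap 512M --save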
Some vSwap internals
Now there are only two knobs per container on the dashboard, namely RAM and swap, and all the complexity is hidden under the hood. I am going to describe just a bit of that undercover mechanics and explain what the "Reworked VSwap kernel memory accounting" line from the 042stab040.1 kernel changelog stands for.
The biggest problem is, RAM for containers is not just RAM. First of all, there is a need to distinguish between
the user memory,
the kernel memory,
the page cache,
and the directory entry cache.
The user memory is more or less clear; it is simply the memory that programs allocate for themselves in order to run. It is relatively easy to account for, and relatively simple to limit (but read on).
The kernel memory is a really complex thingie. Right, it is the memory that the kernel allocates for itself in order for the programs in a particular container to run. This includes a lot of stuff I'd rather not dive into if I want to keep this piece an article, not a tome. Having said that, two particular kernel memory types are worth explaining.
First is the page cache, the kernel mechanism that caches disk contents in memory (which would be unused otherwise) to minimize I/O. When a program reads some data from a disk, that data is read into the page cache first, and when a program writes to a disk, the data goes to the page cache (and is then eventually written (flushed) to disk). In case of repeated disk access (which happens quite often), data is taken from the page cache, not from the real disk, which greatly improves overall system performance, since a disk is much slower than RAM. Now, some of the page cache is used on behalf of a container, and this amount must be charged to "RAM used by this container" (i.e. physpages).
Second is the directory entry cache (dcache for short), yet another sort of cache, and another sort of kernel memory. Disk contents form a tree of files and directories, and such a tree is quite tall and wide. In order to read the contents of, say, the /bin/sh file, the kernel has to read the root (/) directory, find the 'bin' entry in it, read the /bin directory, find the 'sh' entry in it and finally read it. Although these operations are not very complex, there is a multitude of them; they take time and are repeated often for most of the "popular" files. In order to improve performance, the kernel keeps directory entries in memory -- this is what the dcache is for. The memory used by the dcache should also be accounted and limited, since otherwise it is easily exploitable (not only by root, but also by an ordinary user, since any user is free to change into directories and read files).
Now, the physical memory of a container is the sum of its user memory, kernel memory, page cache, and dcache. Technically, the dcache is accounted into the kernel memory, and the kernel memory is then accounted into the physical memory, but that's not overly important.
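You can watch this accounting at work through the good old /proc/user_beancounters interface, where the physpages and swappages lines show the current usage ("held"), the configured barrier/limit and the fail counter. A sketch (exact field layout may differ between kernel versions):

    # inside a container (on the host, per-CT sections are listed by CT ID)
    grep -E 'physpages|swappages' /proc/user_beancounters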
Improvements in the new 042stab04x kernels
Better reclamation and memory balancing
What should the kernel do if a container hits its physical memory limit? Free some pages by writing their contents to the abovementioned virtual swap. Well, not quite yet. Remember that there are also the page cache and the dcache, so the kernel can easily discard some of the pages from these caches, which is way cheaper than swapping out.
The process of finding some free memory is known as reclamation. The kernel needs to decide very carefully when to start reclamation, how many and which exact pages to reclaim in every particular situation, and when it is the right time to swap out rather than discard some of the cache contents.
Remember, we have four types of memory (kernel, user, dcache and page cache) and only one knob which limits the sum of them all. It would be easier for the kernel, but not for the user, to have separate limits for each type of memory. But for the user's convenience and simplicity, the kernel only has one knob for these four parameters, so it needs to balance between the four. One major improvement in the 042stab040 kernel is that such balancing is now performed better.
Stricter memory limit
During the lifetime of a container, the kernel might face a situation when it needs more kernel memory, or user memory, or perhaps more dcache entries, while the memory of the container is tight (i.e. close to the limit), so it needs to either reclaim or swap. The problem is that in some situations neither reclamation nor swapping is possible, so the kernel can either fail miserably (say, by killing a process) or go beyond the limit and hope that everything will be fine and mommy won't notice. Another big improvement in the 042stab040 kernel is that it reduces the number of such situations; in other words, the new kernel obeys the memory limit more strictly.
Polishing
Finally, the kernel is now in a pretty good shape, so we can afford some polishing, minor optimizations, and fine tuning. Such polishing was performed in a few subsystems, including checkpointing, user beancounters, netfilter, kernel NFS server and VZ disk quota.
Some numbers
In total, there are 53 new patches in 042stab040.1 compared to the previous 039 kernels. On top of that, 042stab042.1 adds another 30. We hope that the end result is improved stability and performance.
Instead of having a nice drink in a bar, I spent this Friday night splitting the RHEL6-based OpenVZ kernel branch/repository into two, so now we have 'rhel6' and 'rhel6-testing' branches/repos. Let me explain why.
When we made the initial port of OpenVZ to the RHEL6 kernel and released the first kernel (in October 2010, named 042test001), I created a repository named openvz-kernel-rhel6 (or just rhel6), and this repository was marked as "development, unstable". Then, after almost a year, we announced it as "testing" and then, finally, "stable" (in August 2011, with kernel 042stab035.1).
After that, all the kernels in that repository were supposed to be stable, because they are incremental improvements of the kernel we call stable. In theory, that is. In practice, of course, there can always be new bugs (both introduced by us and by the Red Hat folks releasing the kernel updates we rebase onto). Thus a kernel update from a repo which is supposed to be stable can do bad things.
Better late than never, I fixed the situation tonight by basically renaming the "rhel6" repository to "rhel6-testing" and creating a new repository called just "rhel6". For now, I have put 042stab037.1 (the latest kernel that has passed our internal QA) into rhel6 (aka stable), while all the other kernels, up to and including 042stab039.3, are in the rhel6-testing repo.
Now, very similarly to what we do with RHEL5 kernels, all the new fresh-from-the-build-farm kernels will appear in the rhel6-testing repo, at about the same time they go to internal QA. Then, the kernels that get QA approval will appear in the rhel6 (aka stable) repo. What it means for you as a user is that you can now choose whether to stay at the bleeding edge and have the latest stuff, or to take a conservative approach and have less frequent and delayed updates, but be more confident about kernel quality and stability.
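On a typical yum-based install, switching between the two is a matter of which repo is enabled. The repo IDs and package name below are from memory, so double-check them against your /etc/yum.repos.d/openvz.repo; consider this a sketch:

    # conservative: stable kernels only
    yum-config-manager --disable openvz-kernel-rhel6-testing
    yum-config-manager --enable  openvz-kernel-rhel6

    # bleeding edge: also pull kernels that are still in QA
    yum-config-manager --enable openvz-kernel-rhel6-testing
    yum update vzkernel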
And we are coming to Prague, too! This time, there will be as many as six people and two talks from us, plus we will hold a memory cgroup controller meeting.
The following OpenVZ/Parallels people are coming:
James Bottomley, Parallels virtualization CTO
Kir Kolyshkin, OpenVZ project manager
Pavel Emelyanov, OpenVZ kernel team leader (he's also taking part in Linux Kernel Summit)
Glauber Costa, OpenVZ kernel developer
Maxim Patlasov, OpenVZ kernel developer
Andrey Vagin, OpenVZ kernel developer
Two talks will be presented. Since the linuxsymposium.org site is currently down, let me quote the talk descriptions here.
1. Container in a file by Maxim Patlasov.
One of the feature differences between hypervisors and containers is the ability to store a virtual machine image in a single file, since most containers exist as a chroot within the host OS rather than as fully independent entities. However, the ability to save and restore state in a machine image file is invaluable in managing virtual machine life cycles in the data centre.
This talk will début a new loopback device which gives all the advantages of virtual machine images by storing the container in a file, while preserving the benefits of sharing significant portions with the host OS. We will compare and contrast the technology with the traditional loopback device, and describe some changes to the ext4 filesystem which make it more friendly to the new loopback device's needs.
This talk will be technical in nature but should be accessible to people interested in cloud, virtualisation and container technologies.
2. OpenVZ and Linux kernel testing by Andrey Vagin.
One of the less appealing but very important parts of software development is testing. This talk tries to summarize our 10+ years of experience in Linux kernel testing (including OpenVZ and Red Hat Enterprise Linux kernels). An overall description of our test system is provided, followed by details on some of the interesting test cases developed. Finally, a few anecdotal cases of bugs found will be presented.
In a sense, the talk is an answer to Andrew Morton's question from 2007: "I'm curious. For the past few months, people@openvz.org have discovered (and fixed) an ongoing stream of obscure but serious and quite long-standing bugs. How are you discovering these bugs?"
The talk is of interest to those concerned about kernel quality, and in general to people doing development and testing.
Finally, there will be a memcg meeting. Since LinuxCon will be right after the Kernel Summit, a number of kernel guys will still be there so anyone interested in cgroups can come. This meeting is a continuation of our recent discussion at Linux Plumbers (see etherpad and presentations).
Guys, I am very proud to inform you that today we mark the RHEL6 kernel branch as stable. Below is a copy-paste from the relevant announce@ post. I personally highly recommend the RHEL6-based OpenVZ kernel to everyone -- it is a major step forward compared to RHEL5.
In other news, Parallels has just released Virtuozzo Containers for Linux 4.7, bringing the same cool stuff (VSwap et al.) to commercial customers. Despite being only a "dot" (or "minor") release, this product incorporates an impressive amount of man-hours of the best Parallels engineers.
== Stable: RHEL6 ==
This is to announce that the RHEL6-based kernel branch (starting from kernel 042stab035.1) is now marked as stable, and is now the recommended branch to use.
We are not aware of any major bugs or show-stoppers in this kernel. As always, we still recommend testing any new kernel before rolling it out to production.
The new features of the RHEL6-based kernel branch (as compared to the previous stable kernel branch, RHEL5) include better performance, better scalability (especially on high-end SMP systems), and better resource management (notably, vSwap support -- see http://wiki.openvz.org/VSwap).
Also, from now we no longer maintain the following kernel branches:
* 2.6.27
* 2.6.32
No new releases of the above kernels are expected. Existing users (if any) are advised to switch to other (maintained) branches, such as RHEL6-2.6.32 or RHEL5-2.6.18.
This change does not affect vendor OpenVZ kernels (such as Debian or Ubuntu) -- those will still be supported for the lifetime of their distributions via the usual means (i.e. bugzilla.openvz.org).
== Development: none ==
Currently, there are no non-stable kernels in development. Eventually we will port to Linux kernel 3.x, but it might not happen this year. Instead, we are currently focused on bringing more of OpenVZ features to mainstream Linux kernels.
We have had checkpoint/restart (CPT) and live migration in OpenVZ for ages (well, OK, since 2007 or so), allowing containers to be freely moved between physical servers without any service interruption. It is a great feature which is valued by our users. The problem is we can't merge it upstream, i.e. into the vanilla kernel.
Various people from our team worked on that, and they all gave up. Then Oren Laadan tried very hard to merge his CPT implementation -- unfortunately, it didn't work out very well either. The thing is, checkpointing is a complex thing, and the patch implementing it is very intrusive.
Recently, our kernel team leader Pavel Emelyanov got a new idea: move most of the checkpointing complexity out of the kernel and into user space, thus minimizing the amount of in-kernel changes needed. In about two weeks he wrote a working prototype. So far the reaction is mostly positive, and he's going to submit a second RFC version to lkml for review.
For more details, read the lwn.net article. After all, while I am the one sitting next to Pavel, Mr. Corbet's ability to explain complex stuff in simple terms is way better than mine.