This is a topic I always wanted to write about, but I was afraid my explanation would end up very cumbersome. This is no longer the case, as we now have a picture that is worth a thousand words!
The picture describes how we develop kernel releases. It's a bit more complicated than the linearity of version 1 -> version 2 -> version 3. The reason is that we are balancing between adding new features, fixing bugs, and rebasing to newer kernels, all while trying to maintain stability for our users. This is our convoluted way of achieving all that:
As you can see, we create a new branch when rebasing to a newer upstream (e.g. RHEL6) kernel, as regressions are quite common during a rebase. At the same time, we keep maintaining the older branch, in which we add stability and security fixes. Sometimes we create a new branch to add some bold feature that takes a longer time to stabilize. Stability patches are then forward-ported to the new branch, which eventually either becomes stable or is obsoleted by yet another new one.
Of course there is a lot of work behind these curtains, including rigorous internal testing of new releases. In addition to that, we usually provide those kernels to our users (in the rhel6-testing repo) so they can test new stuff before it hits production servers, and we can fix more bugs earlier (more on that here). If you are not taking part in this testing, well, it's never too late to start!
Oh, such a provocative subject! Not really. Many people do believe that OpenVZ is obsolete, and when I ask why, the three most popular answers are:
1. The OpenVZ kernel is old and obsolete, because it is based on 2.6.32, while everyone in 2013 runs 3.x.
2. LXC is the future, OpenVZ is the past.
3. OpenVZ is no longer developed; it was even removed from Debian Wheezy.
Let me try to address all these misconceptions, one by one.
1. "OpenVZ kernel is old". Current OpenVZ kernels are based on kernels from Red Hat Enterprise Linux 6 (RHEL6 for short). This is the latest and greatest version of enterprise Linux distribution from Red Hat, a company who is always almost at the top of the list of top companies contributing to the Linux kernel development (see 1, 2, 3, 4 for a few random examples). While no kernel being ideal and bug free, RHEL6 one is a good real world approximation of these qualities.
What people at Red Hat do for their enterprise Linux is take an upstream kernel and basically fork it, ironing out the bugs and cherry-picking security fixes, driver updates, and sometimes new features from upstream. They do so for about half a year or more before a release, so the released kernel already looks "old and obsoleted" if one only goes by the version number. Well, don't judge a book by its cover, and don't judge a kernel by its number. Of course it is neither old nor obsolete -- it's just more stable and secure. And then, after a release, it is very well maintained, with modern hardware support, regular releases, and prompt security fixes. This makes it a great base for the OpenVZ kernel. In a sense, we are standing on the shoulders of a red-hatted giant (and since this is open source, they are standing just a little bit on our shoulders, too).
RHEL7 is being worked on right now, and it will be based on some 3.x kernel (possibly 3.10). We will port the OpenVZ kernel to RHEL7 once it becomes available. In the meantime, the RHEL6-based OpenVZ kernel is the latest and greatest, so please don't be fooled by the fact that uname shows 2.6.32.
2. OpenVZ vs LXC. The OpenVZ kernel was historically developed separately, i.e. aside from the upstream Linux kernel. This mistake was recognized in 2005, and since then we have kept working on merging OpenVZ bits and pieces into the upstream kernel. It took way longer than expected; we are still in the middle of the process, with some great stuff (like net namespaces and CRIU, more than 2000 changesets in total) merged, while some other features are still on our TODO list. In the future (another eight years? who knows...) the OpenVZ kernel functionality will probably be fully upstream, so OpenVZ will just be a set of tools. We are happy to see that Parallels is not the only company interested in containers for Linux, so it might happen a bit earlier. For now, though, we still rely on our organic, non-GMO, home-grown kernel (although it is already optional).
Now what is LXC? In fact, it is just another user-space tool (not unlike vzctl) that works on top of a recent upstream kernel (again, not unlike vzctl). As we merge our stuff upstream, the LXC tools will start using the new features and therefore benefit from this work. So far at least half of the kernel functionality used by LXC was developed by our engineers, and while we don't work on the LXC tools themselves, it would not be an overstatement to say that Parallels is the biggest LXC contributor.
So, both OpenVZ and LXC are actively developed and have their future. We might even merge our tools at some point; the idea was briefly discussed during the last containers mini-conf at Linux Plumbers. LXC is not a successor to OpenVZ, though: they are two different projects, although not entirely separate (since the OpenVZ team contributes to the kernel a lot, and both tools use the same kernel functionality). OpenVZ is essentially LXC++, because it adds some more stuff that is not (yet) available in the upstream kernel (such as stronger isolation and better resource accounting, plus some auxiliary features like ploop).
3. OpenVZ is no longer developed, removed from Debian. The Debian kernel team decided to drop the OpenVZ (as well as a few other) kernel flavors from Debian 7 a.k.a. Wheezy. This is completely understandable: kernel maintenance takes time and other resources, and they probably don't have enough. That doesn't mean, though, that OpenVZ is not developed. It feels strange to even have to argue this, but please check our software updates page (or the announce@ mailing list archives). We have made about 80 software releases this year so far, which works out to roughly two releases every week, and most of those are new kernels. So no, it is in no way abandoned.
The best feature of the new (RHEL6-based) 042 series of OpenVZ kernels is definitely vSwap. The short story is that we used to have 22 user beancounter parameters, which every seasoned OpenVZ user knows by heart. Each of these parameters is there for a reason, but 22 knobs are a bit too many for a mere mortal to manage, especially bearing in mind that
many of them are interdependent;
the sum of all limits should not exceed the resources of a given physical server (a fragment of an old-style configuration is shown right below).
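For illustration, here is roughly what a handful of those 22 knobs look like in a container configuration file. This is only a sketch: the numbers are made up, and each value is a "barrier:limit" pair.

    # fragment of an old-style container config (e.g. /etc/vz/conf/<CTID>.conf)
    KMEMSIZE="14372700:14790164"
    PRIVVMPAGES="65536:69632"
    TCPSNDBUF="1720320:2703360"
    DCACHESIZE="3409920:3624960"
    NUMPROC="240:240"
    # ...and 17 more of the same kind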
Keeping this configuration optimal (or even consistent) is quite a challenging task even for a senior OpenVZ admin (with the probable exception of an ex-airline pilot). This complexity is the main reason there are multiple articles and blog entries complaining that OpenVZ is worse than Xen, or that it is not suitable for hosting Java apps. We do have some workarounds to mitigate this complexity, but they are still not the way to go. While we think highly of our users, we do not expect all of them to be ex-airline pilots. To tackle the complexity, the number of per-container knobs and handles should be reduced to some decent number, or at least most of these knobs should be made optional.
We worked on that for a few years, and the end result is called vSwap (where V is for Vendetta, oh, pardon me, Virtual).
The vSwap concept is as simple as a rectangle. For each container, there are only two required parameters: the memory size (known as physpages) and the swap size (swappages). Almost everyone (not only an admin, but even an advanced end user) knows what RAM and swap are. On a physical server, if there is not enough memory, the system starts to swap memory pages out to disk and then swap other pages back in, which results in severe performance degradation but keeps the system from failing miserably.
It's about the same with vSwap, except that
RAM and swap are configured on a per-container basis;
no I/O is performed until it is really necessary (this is why the swap is virtual). A configuration sketch follows right below.
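Setting this up is a one-liner per container. A minimal sketch, assuming container 101 already exists and you are running a vSwap-capable (042-series) kernel with a reasonably recent vzctl:

    # give container 101 512 MB of RAM and 1 GB of (virtual) swap
    vzctl set 101 --ram 512M --swap 1G --save

Under the hood this simply sets physpages and swappages for the container; the rest of the old beancounter knobs become optional.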
Some vSwap internals
Now there are only two knobs per container on the dashboard, namely RAM and swap, and all the complexity is hidden under the hood. I am going to describe just a bit of that undercover mechanics and explain what the "Reworked VSwap kernel memory accounting" line from the 042stab040.1 kernel changelog stands for.
The biggest problem is that RAM for containers is not just RAM. First of all, there is a need to distinguish between
the user memory,
the kernel memory,
the page cache,
and the directory entry cache.
The user memory is more or less clear: it is simply the memory that programs allocate for themselves to run. It is relatively easy to account for, and relatively simple to limit (but read on).
The kernel memory is a really complex thingie. Simply put, it is the memory that the kernel allocates for itself in order for programs in a particular container to run. This includes a lot of stuff I'd rather not dive into if I want to keep this piece an article, not a tome. Having said that, two particular kernel memory types are worth explaining.
First is the page cache, the kernel mechanism that caches disk contents in memory (which would otherwise sit unused) to minimize I/O. When a program reads some data from a disk, that data is read into the page cache first, and when a program writes to a disk, the data goes to the page cache (and is eventually written, i.e. flushed, to disk). In case of repeated disk access (which happens quite often), data is taken from the page cache rather than from the real disk, which greatly improves overall system performance, since a disk is much slower than RAM. Now, some of the page cache is used on behalf of a container, and this amount must be charged to "RAM used by this container" (i.e. physpages).
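You can see the page cache at work on any Linux box. A quick illustration (the file path is just a placeholder for any large file you have handy):

    free -m                                    # note the "cached" column
    time cat /path/to/large-file > /dev/null   # first read goes to the disk
    time cat /path/to/large-file > /dev/null   # repeated read is served from the page cache and is much faster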
Second is the directory entry cache (dcache for short), yet another sort of cache and another sort of kernel memory. Disk contents form a tree of files and directories, and such a tree is quite tall and wide. In order to read the contents of, say, the /bin/sh file, the kernel has to read the root (/) directory, find the 'bin' entry in it, read the /bin directory, find the 'sh' entry in it, and finally read the file itself. Although these operations are not very complex, there is a multitude of them; they take time and are repeated often for most of the "popular" files. In order to improve performance, the kernel keeps directory entries in memory — this is what the dcache is for. The memory used by the dcache should also be accounted for and limited, since otherwise it is easily exploitable (not only by root, but also by an ordinary user, since any user is free to change into directories and read files).
Now, the physical memory of a container is the sum of its user memory, kernel memory, page cache and dcache. Technically, the dcache is accounted into the kernel memory, and the kernel memory is accounted into the physical memory, but that is not overly important.
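If you want to see where a container stands against these limits, the kernel exposes the beancounters in /proc. A minimal sketch (columns are held/maxheld/barrier/limit/failcnt; physpages and swappages are counted in memory pages, 4 KB on x86):

    # on the host this lists every container; inside a container it shows only its own counters
    grep -E 'physpages|swappages' /proc/user_beancounters

    # the dcache shows up in the kernel's dentry slab (run as root on the host)
    grep dentry /proc/slabinfo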
Improvements in the new 042stab04x kernels
Better reclamation and memory balancing
What to do if a container hits its physical memory limit? Free some pages by writing their contents to the abovementioned virtual swap. Well, not quite yet. Remember that there are also the page cache and the dcache, so the kernel can simply discard some of the pages from these caches, which is way cheaper than swapping out.
The process of finding some free memory is known as reclamation. The kernel needs to decide very carefully when to start reclamation, how many pages (and exactly which ones) to reclaim in every particular situation, and when it is the right time to swap out rather than discard some of the cache contents.
Remember, we have four types of memory (kernel, user, dcache and page cache) and only one knob which limits the sum of all of them. It would be easier for the kernel, but not for the user, to have a separate limit for each type of memory. But for the user's convenience and simplicity the kernel has only one knob covering these four, so it needs to balance between them. One major improvement in the 042stab040 kernel is that this balancing is now performed better.
Stricter memory limit
During the lifetime of a container, the kernel might face a situation when it needs more kernel memory, or user memory, or perhaps more dcache entries, while the memory for the container is tight (i.e. close to the limit), so it needs to either reclaim or swap. The problem is that there are some situations in which neither reclamation nor swapping is possible, so the kernel can either fail miserably (say, by killing a process) or go beyond the limit and hope that everything will be fine and mommy won't notice. Another big improvement in the 042stab040 kernel is that it reduces the number of such situations; in other words, the new kernel obeys the memory limit more strictly.
Polishing
Finally, the kernel is now in pretty good shape, so we can afford some polishing, minor optimizations, and fine tuning. Such polishing was performed in a few subsystems, including checkpointing, user beancounters, netfilter, the kernel NFS server, and VZ disk quota.
Some numbers
In total, there are 53 new patches in 042stab040.1 compared to the previous 039 kernels. On top of that, 042stab042.1 adds another 30. We hope that the end result is improved stability and performance.
Instead of having a nice drink in a bar, I spent this Friday night splitting the RHEL6-based OpenVZ kernel branch/repository into two, so now we have 'rhel6' and 'rhel6-testing' branches/repos. Let me explain why.
When we made the initial port of OpenVZ to the RHEL6 kernel and released the first kernel (in October 2010, named 042test001), I created a repository named openvz-kernel-rhel6 (or just rhel6), and this repository was marked as "development, unstable". Then, after almost a year, we announced it as "testing" and finally as "stable" (in August 2011, with kernel 042stab035.1).
After that, all the kernels in that repository were supposed to be stable, because they are incremental improvements of the kernel we call stable. That's the theory. In practice, of course, there can always be new bugs (both introduced by us and by the Red Hat folks releasing their kernel updates, onto which we rebase). Thus a kernel update from a repo which is supposed to be stable can do bad things.
Better late than never, I have fixed the situation tonight by basically renaming the "rhel6" repository to "rhel6-testing" and creating a new repository called just "rhel6". For now, I have put 042stab037.1 (the latest kernel that has passed our internal QA) into rhel6 (aka stable), while all the other kernels, up to and including 042stab039.3, are in the rhel6-testing repo.
Now, very similar to what we do with RHEL5 kernels, all the new fresh-from-the-build-farm kernels will appear in the rhel6-testing repo at about the same time they go to internal QA. Then the kernels that receive QA approval will appear in the rhel6 (aka stable) repo. What it means for you as a user is that you can now choose whether to stay on the bleeding edge and have the latest stuff, or take a conservative approach with less frequent and delayed updates, but be more confident about kernel quality and stability.
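In practice the choice boils down to which repository is enabled on the host. A sketch of the two options, assuming the repo IDs match those in the openvz.repo file shipped with the packages (check /etc/yum.repos.d/openvz.repo; the exact names on your system may differ):

    # conservative: take kernels from the stable repo only (the default setup)
    yum install vzkernel

    # bleeding edge: pull the latest kernel from the testing repo for this one transaction
    yum --enablerepo=openvz-kernel-rhel6-testing install vzkernel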
Guys, I am very proud to inform you that today we are marking the RHEL6 kernel branch as stable. Below is a copy-paste from the relevant announce@ post. I personally highly recommend the RHEL6-based OpenVZ kernel to everyone -- it is a major step forward compared to RHEL5.
In other news, Parallels has just released Virtuozzo Containers for Linux 4.7, bringing the same cool stuff (vSwap et al.) to commercial customers. Despite being only a "dot" (or "minor") release, this product incorporates an impressive amount of man-hours of the best Parallels engineers.
== Stable: RHEL6 ==
This is to announce that RHEL6-based kernel branch (starting from kernel 042stab035.1) is now marked as stable, and it is now the recommended branch to use.
We are not aware of any major bugs or show-stoppers in this kernel. As always, we still recommend testing any new kernel before rolling it out to production.
New features of the RHEL6-based kernel branch (as compared to the previous stable kernel branch, RHEL5) include better performance, better scalability (especially on high-end SMP systems), and better resource management (notably, vSwap support -- see http://wiki.openvz.org/VSwap).
Also, from now on we no longer maintain the following kernel branches:
* 2.6.27
* 2.6.32
No more new releases of the above kernels are expected. Existing users (if any) are recommended to switch to other (maintained) branches, such as RHEL6-2.6.32 or RHEL5-2.6.18.
This change does not affect vendor OpenVZ kernels (such as Debian or Ubuntu) -- those will still be supported for the lifetime of their distributions via the usual means (i.e. bugzilla.openvz.org).
== Development: none ==
Currently, there are no non-stable kernels in development. Eventually we will port to a Linux 3.x kernel, but it might not happen this year. Instead, we are currently focused on bringing more OpenVZ features to the mainstream Linux kernel.
We have just released a new RHEL6-based kernel, 042test005. It is shaping up pretty well — as you can see from the changelog, it's not just bug fixes but also performance improvements. If you haven't tried it yet, I suggest you do it today! Do not postpone this until 2011 — after all, this is what will become the next stable OpenVZ kernel.
The RHEL6 kernel needs an appropriate (i.e. recent) Linux distribution. If you don't want the latest Fedora releases, can't afford RHEL6, and are tired of waiting for CentOS 6, I suggest you go with Scientific Linux 6 (SL6). This is yet another RHEL6 clone, developed and used by CERN, Fermilab, and other similar institutions.