Andrew Morton gave a keynote at the recent LinuxWorld Expo in San Francisco. A fair portion of his talk was devoted to the need for testing new kernels, and part of it covered what will appear in the kernel soon. A couple of slides were specifically about containers, including OpenVZ.
A nice recap of what he said is at zdnet.com, here's the quote: Additionally, and contrary to popular thinking, the debate over whether open source virtualization engines will fragment the industry is null and void since the kernel supports and will support all open source solutions – be it Xen, KVM, OpenVZ or VMware, Morton said.
A few days ago one of the OpenVZ kernel team members, Pavel Emelyanov, posted a one-line patch to fix a bug in the Linux kernel. He received the following reply from Andrew Morton, one of the upstream kernel maintainers:
I'm curious. For the past few months, people@openvz.org have discovered (and fixed) an ongoing stream of obscure but serious and quite long-standing bugs.
How are you discovering these bugs?
Andrew added later:
hm, OK, I was visualising some mysterious Russian bugfinding machine or something.
Don't stop ;)
So, here is the story behind that bug.
A few months ago, in the course of OpenVZ kernel testing, our QA (Quality Assurance) team found a strange issue. The thing is, every container (VE) in OpenVZ has a set of resource usage counters (and limits) called beancounters. All the usage counters should be zero when a VE is stopped, since all of its resources have been released by then. The issue was that a resource called kmemsize (the kernel memory used on behalf of a given VE) had a usage counter of 78 bytes after the VE was stopped -- which effectively means 78 bytes of kernel memory were lost (or leaked, as programmers say).
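To make the accounting idea concrete, here is a minimal toy model in Python. It is not OpenVZ code: the `Beancounter` class and its method names are made up for illustration, and the real beancounters live in the kernel. It only shows why a non-zero counter on a stopped container points to a leak, and how a limit caps the damage.

```python
class Beancounter:
    """Toy per-container counter for one resource (hypothetical model)."""

    def __init__(self, resource, limit):
        self.resource = resource
        self.limit = limit   # maximum bytes the container may hold
        self.held = 0        # bytes currently charged to the container

    def charge(self, nbytes):
        # Refuse the allocation if it would push usage over the limit.
        if self.held + nbytes > self.limit:
            raise MemoryError(f"{self.resource}: limit of {self.limit} bytes exceeded")
        self.held += nbytes

    def uncharge(self, nbytes):
        self.held -= nbytes

    def leaked_on_stop(self):
        # After the container stops, anything still held was never released.
        return self.held


kmemsize = Beancounter("kmemsize", limit=1_000_000)
kmemsize.charge(4096)    # some kernel object, released below
kmemsize.charge(78)      # an object whose release gets forgotten
kmemsize.uncharge(4096)
print(kmemsize.leaked_on_stop())  # 78
```

The same counter that exposes the leak also bounds it: once `held` reaches `limit`, further charges fail instead of consuming more memory.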
Who cares about 78 bytes, especially on a server with 16 gigabytes (17,179,869,184 bytes) of RAM? We do. Pavel checked the beancounters debug information, which showed that one struct user object had leaked. He then tried to reproduce the leak, but with no luck.
Bugs that cannot be reproduced are tough. The only option left was to audit the kernel source code. That involved finding all the places where a struct user object is referenced, and checking the code for correctness (the term "correctness" in this context means that every object that is allocated must later be released). It took him 4 hours to do the audit, and he found one place where the reference to an object might be lost (which means it could not later be released). It's the same as if you lent a book to a friend and later forgot whom you gave it to: you have lost the reference and you can't get the book back.
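The lost-reference pattern can be sketched in a few lines of Python. This is a hypothetical illustration only, not the actual kernel code path Pavel found: `get_user`, `put_user`, `buggy_switch` and `fixed_switch` are invented names, and the real bug involved kernel C code that this post does not show.

```python
class UserObject:
    """Stand-in for a reference-counted kernel object (hypothetical)."""

    def __init__(self, uid):
        self.uid = uid
        self.refcount = 0


live = {}  # uid -> UserObject, all objects that are still allocated


def get_user(uid):
    # Take a reference, allocating the object on first use.
    obj = live.setdefault(uid, UserObject(uid))
    obj.refcount += 1
    return obj


def put_user(obj):
    # Drop a reference and free the object when the last one is gone.
    obj.refcount -= 1
    if obj.refcount == 0:
        del live[obj.uid]


def buggy_switch(holder, new_uid):
    # Bug: the old reference is overwritten without being dropped.
    holder["user"] = get_user(new_uid)


def fixed_switch(holder, new_uid):
    old = holder.get("user")
    holder["user"] = get_user(new_uid)
    if old is not None:
        put_user(old)  # release the reference we are about to forget


def run(switch):
    live.clear()
    holder = {"user": get_user(1)}
    switch(holder, 2)
    put_user(holder["user"])  # normal cleanup of the current reference
    return sorted(live)       # uids of objects that were never freed


print(run(buggy_switch))  # [1]
print(run(fixed_switch))  # []
```

An audit like the one described amounts to checking that every `get_user` on every code path is matched by a `put_user`.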
In this case, after the problem was found, fixing it was pretty straightforward. So Pavel wrote a fix and demo code to trigger the bug, tested the fix, and sent it to the Linux kernel mailing list.
Why is this particular incident so important?
* It's OpenVZ code (beancounters) which helped to detect the leak in the first place -- as the bug is very hard to trigger (unless you know how) and the leak is small enough that it might not be discovered at all.
* It demonstrates the OpenVZ developers' dedicated attitude. They never dismiss real bugs as "works for me" or "invalid", and work to find the root cause and fix the problem.
* This bug is in fact a security issue. An ordinary user (actually two users are needed in this case) could exploit the bug and eat all the kernel memory, thus bringing the whole system down. Worse scenarios are possible as well.
* Incidentally, OpenVZ is protected from this security issue -- because the kmemsize beancounter (which helped to find it) limits kernel memory usage per Virtual Environment.
Most important of all, this is just one out of 305 kernel patches by our team which were accepted into the mainstream Linux kernel during a one-year period. Almost one patch a day, excluding weekends and holidays. And we are not going to stop! :-)
Recently, I had the opportunity to present at a session of the Gelato Itanium Conference and Expo in San Jose. It was a good fit because they had a special track on virtualization, and OpenVZ (and the Virtuozzo product) is the only stable virtualization technology available now for Itanium servers.
Once again, I was able to talk with Andrew Morton (a kernel hacker, the right hand of Linus Torvalds) and was encouraged about the prospect of OS virtualization and OpenVZ in the Linux kernel. That is something we would really like to see and have been working towards. This article summarizes Andrew’s remarks noting “OpenVZ already has thousands of systems out there” and “as far as containerization standard in mainline goes, ‘most of the stakeholders are playing together quite nicely’”.
Yes, we are and we’ll keep at it so we can realize our goal.
I am happy to report that the OpenVZ project has made an initial port to Linux kernel 2.6.20. The resulting kernel is now available from the GIT repository.
This is a work in progress -- so far we have checked that the kernel compiles and passes some tests on x86; other architectures have not even been tried yet. So it is definitely not for the faint of heart.
Note that the OpenVZ versioning scheme was changed (or, rather, simplified) with this branch -- the branch number and the test/stab word were dropped, and now it is just 'ovzNNN' in the RPM release and kernel "extraversion" fields (where NNN is a three-digit number, starting from 001). So, what was meant to be 030test001 has become ovz001. Hope this will lead to less confusion.
Binary and source RPMs will be released some time next week.
Good news for all of us on the virtualization front!
The latest prepatch for the stable Linux kernel tree, 2.6.19-rc1, now includes some pieces of OS-level virtualization from OpenVZ, IBM, and Eric Biederman. Those patches have been sitting in the -mm (Andrew Morton’s) tree for some time already, and now, during the “2.6.19 merge window,” Andrew has submitted them to Linus Torvalds. So they are now a part of “vanilla” Linux, and will ship in the final 2.6.19 kernel when it is released.
I am really happy it is a community work and a community process (like I said before). We see different parties bringing in code and expertise, reviewing each other's code, making suggestions, exchanging ideas and improving things, to everybody's benefit!
These are just the first steps. Much more is needed to have full OS-level virtualization in the mainstream Linux kernel. Don’t worry — we are already working on that. A few days ago Kirill sent another iteration (v5) of beancounter patchset for further review and possible inclusion. Beancounters can be used to implement per-VE limits and guarantees for certain resources such as memory.
A bit of shameless PR for our kernel team leader, Kirill Korotaev:
From RHSA-2006-0617: Important: kernel security update [...] * a flaw in the restore_all code path of the 4/4GB split support of non-hugemem kernels that allowed a local user to cause a denial of service (panic) (CVE-2006-2932, Important) [...] Red Hat would like to thank Wei Wang of McAfee Avert Labs and Kirill Korotaev for reporting issues fixed in this erratum.
What could it mean, besides the fact that the OpenVZ team is a valuable contributor to the mainstream kernel? It also means we care a lot about the stability and security of OpenVZ, and we do a lot of testing and QA, which is good not only for the OpenVZ kernel but for the mainstream kernel as well.
In a broader sense, this is a nice example of how collaborative open source development works. A nice example of “everybody wins” strategy. Indeed, in open source everybody wins.
Update: our kernel team found a bug in this blog post. Looks like the bug belongs to the infamous "off-by-one" category. :) There are actually 5 patches from OpenVZ developers, not 6: Greg sent one patch twice, and thus I counted it twice. Fixed.