We have had checkpoint/restart (CPT) and live migration in OpenVZ for ages (well, OK, since 2007 or so), allowing containers to be freely moved between physical servers without any service interruption. It is a great feature which is valued by our users. The problem is that we can't merge it upstream, i.e. into the vanilla kernel.
Various people from our team worked on that, and they all gave up. Then Oren Laadan tried very hard to merge his CPT implementation; unfortunately, it didn't work out very well either. The thing is, checkpointing is complex, and the patch implementing it is very intrusive.
Recently, our kernel team leader Pavel Emelyanov came up with a new idea: move most of the checkpointing complexity out of the kernel and into user space, thus minimizing the amount of in-kernel changes needed. In about two weeks he wrote a working prototype. So far the reaction is mostly positive, and he's going to submit a second RFC version to lkml for review.
For more details, read the lwn.net article. After all, while I am sitting next to Pavel, Mr. Corbet's ability to explain complex stuff in simple terms is way better than mine.
Make sure you use the latest vzctl (3.0.26.2), otherwise vzctl enter won't work with Ubuntu 11.04 containers. As usual, report all bugs to http://bugzilla.openvz.org/
This is to announce that from now on devel@ is a separate list, not mirroring containers@ or anything. From now on, if the topic is openvz-specific, like a patch to OpenVZ, please use devel@. If the topic is about containers (as appearing in mainline), use containers@.
Let me explain. Initially, when we started moving the OpenVZ project forward, we wanted to discuss all things container-related on a mailing list, and therefore I created devel@. Later, when other parties joined, it was decided to create the containers at osdl.org mailing list (remember, OSDL later became the Linux Foundation). At that time I was worried that the discussions would split, and decided to just subscribe our devel@ to containers@, so that devel@ became a superset of containers@ (i.e. every message posted to containers@ would appear on devel@, but not vice versa).
Of course it ended up being a big mess. Better late than never: the mess is no more!
Some software bugs, while being simple and stupid, have an interesting and long-lasting life. Here is the story of such a very simple bug with a lifespan of about 5 years (or more? I don't know when it was introduced). The bug isn't worth looking at otherwise, so I'll try to be brief; more info is available from the links.
OK,
If you are a seasoned C programmer, skip this post entirely (or try to find bugs in it). If you know C but don't consider yourself an expert, please read on -- it might be helpful.
I was working a bit on vzctl today (my target was bug #1757, which is still a work in progress) and ... I am not sure how, but I ended up declaring most functions in src/vzlist.c as static. I thought it wouldn't have any practical value. I was wrong!
In C, if you declare the function as static, it means its visibility is limited to the translation unit (i.e. a file) in which it is defined. In other words, you can only call/use a static function from another function in the same file.
Now, in the vzctl sources vzlist.c is linked into only one binary, vzlist, and therefore I thought it didn't make much sense to declare its functions as static. Nevertheless I did it (see git commit).
Next thing I got is a set of compiler warnings! OK, all right, let's take a look...
First set of warnings is self-explanatory. See:
vzlist.c:825:14: warning: ‘parse_var’ defined but not used
vzlist.c:1075:14: warning: ‘remove_sp’ defined but not used
vzlist.c:1357:12: warning: ‘get_stop_quota_stats’ defined but not used
Easy! In some ancient time, these functions were used, now the code has changed and no one needs these three, but they were not removed for some reason (probably just forgotten). Solution: remove the dead code (see git commit).
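Why does static make the difference here? A minimal sketch, with made-up function names that are not from vzlist.c:

```c
/* Hypothetical helper, not from vzlist.c.
 * static gives it internal linkage: only code in this file can call it,
 * so the compiler can see every possible caller. */
static int square(int x)
{
    return x * x;
}

/* No static: external linkage, so any other file linked into the
 * binary may call this function. */
int sum_of_squares(int a, int b)
{
    return square(a) + square(b);
}
```

If sum_of_squares() stopped calling square(), gcc -Wall would report that square is defined but not used. Without static the compiler has to assume some other file might still call the function, so it stays silent, and dead code goes unnoticed.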
Second set of warnings looks similar:
vzlist.c:400:1: warning: ‘dcachesize_m_sort_fn’ defined but not used
vzlist.c:400:1: warning: ‘dcachesize_l_sort_fn’ defined but not used
vzlist.c:400:1: warning: ‘dcachesize_b_sort_fn’ defined but not used
vzlist.c:400:1: warning: ‘dcachesize_f_sort_fn’ defined but not used
vzlist.c:411:1: warning: ‘diskinodes_s_sort_fn’ defined but not used
vzlist.c:411:1: warning: ‘diskinodes_h_sort_fn’ defined but not used
Hmm... all these *_sort_fn are sort functions generated by means of a few #define statements, and they are used when vzlist needs to sort its output by some column or parameter (vzlist -s). It is very strange that these are not used, because they should be. Let's take a closer look... zOMG! it's a bug!
Apparently, someone was using the copy-paste technique* and forgot to change the names of the functions. The bug is that when you ask vzlist to sort its output by, say, dcachesize failcounter values, it sorts by dcachesize held values instead, because the wrong sort function is used. Such bugs are hard to notice manually, and there are no autotests for vzlist.
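To show how this class of bug arises, here is a simplified sketch. It is not the actual vzlist.c code: the struct, the SORT_FN macro, the column table and sort_by() are all hypothetical names invented for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record with two sortable columns. */
struct res {
    unsigned long held;
    unsigned long failcnt;
};

/* Generates a static qsort() comparator named <name>_sort_fn
 * that compares one field of struct res in ascending order. */
#define SORT_FN(name, field)                                        \
static int name##_sort_fn(const void *a, const void *b)            \
{                                                                   \
    const struct res *x = a, *y = b;                                \
    return (x->field > y->field) - (x->field < y->field);           \
}

SORT_FN(held, held)
SORT_FN(failcnt, failcnt)

struct column {
    const char *name;
    int (*sort_fn)(const void *, const void *);
};

static const struct column columns[] = {
    { "held",    held_sort_fn },
    /* The copy-paste bug: writing held_sort_fn here as well still
     * compiles, silently sorts by the wrong field, and leaves
     * failcnt_sort_fn unused, which is what the warning exposes. */
    { "failcnt", failcnt_sort_fn },
};

/* Sort v[0..n-1] by the named column; returns -1 for an unknown name. */
int sort_by(struct res *v, size_t n, const char *col)
{
    size_t i;

    for (i = 0; i < sizeof(columns) / sizeof(columns[0]); i++) {
        if (strcmp(columns[i].name, col) == 0) {
            qsort(v, n, sizeof(*v), columns[i].sort_fn);
            return 0;
        }
    }
    return -1;
}
```

Because the generated comparators are static, a table entry that points at the wrong one turns into a compiler warning instead of a silent misbehavior.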
You probably thought we had abandoned the 2.6.27 kernel branch. Well, we ourselves thought we had (although it was not yet officially announced). Then, all of a sudden, kernel 2.6.27-repin.1 is released, rebasing to the latest upstream kernel (2.6.27.57) and fixing OpenVZ bug #1593.
The thing is, this kernel is named after Ilya Repin, a leading Russian painter and sculptor of the Peredvizhniki artistic school. One of his best paintings is called "Unexpected Return", and I happened to enjoy the original in the Tretyakov Gallery here in Moscow a couple of weeks ago. So here it is: the unexpected return of the 2.6.27 kernel. It took Ilya 4 years to finish the painting; it took Pavel 6 months to release the fix. Better late than never, that is.
I have added vswap configuration samples to vzctl git. Basically, you set physpages and swappages and leave every other beancounter at unlimited. For example, this is what ve-vswap-256m-conf.sample looks like:
As you can see, physpages (i.e. RAM size) is set to 256 megabytes, while swappages (i.e. swap size) is set to 512 megabytes; all the other beancounters are unlimited. Wow, it's never been easier to configure your containers!
Now we can put this to use with the RHEL6-based kernel. This is what we see from inside the container:
[root@localhost ~]# vzctl enter 103
entered into CT 103
[root@localhost /]# free
             total       used       free     shared    buffers     cached
Mem:        262144      23936     238208          0          0      10968
-/+ buffers/cache:      12968     249176
Swap:       524288          0     524288
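The totals that free prints above are in kilobytes, and they match the 256 MB and 512 MB settings exactly. Here is a small sketch of that arithmetic. The helper names are made up, and the page size is passed in as a parameter because it is an assumption (4096 bytes is common on x86):

```c
/* Hypothetical helpers, not part of vzctl. */

/* Megabytes to the kilobyte figures that free prints. */
unsigned long mb_to_kb(unsigned long mb)
{
    return mb * 1024UL;
}

/* Megabytes to a page count; page_size is in bytes and is an
 * assumption supplied by the caller (4096 is common on x86). */
unsigned long mb_to_pages(unsigned long mb, unsigned long page_size)
{
    return mb * 1024UL * 1024UL / page_size;
}
```

With these, mb_to_kb(256) gives the 262144 shown for Mem and mb_to_kb(512) gives the 524288 shown for Swap.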
Hard CPU limit (the ability to specify that you don't want this container to use more than X per cent of CPU no matter what) is back in the latest RHEL6-based kernel, 042test006.1, which has just been released.
The feature was only available in the stable (i.e. RHEL4- and RHEL5-based) kernels, and was missing from all of our development kernels from 2.6.20 to 2.6.32. So while it was always there in the stable branches, it feels like it's back.
In order to use the CPU limit feature, set the limit using vzctl set $CTID --cpulimit X, where X is in per cent of one single CPU. For example, if you have a single 2 GHz CPU and want container 123 to use no more than 1 GHz, use vzctl set 123 --cpulimit 50. If you have a 2 GHz quad-core system and want the container to use no more than 4 GHz, use vzctl set 123 --cpulimit 200. Well, in the second case it might be better to just use --cpus 2. Anyway, see the vzctl man page.
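The arithmetic behind those two examples can be sketched as follows. The helper is hypothetical and not part of vzctl; it only restates the rule that the limit is a percentage of one CPU:

```c
/* Hypothetical helper, not part of vzctl.
 * Returns the --cpulimit value (per cent of a single CPU) needed to
 * cap a container at wanted_mhz on a host whose cores each run at
 * core_mhz. Values above 100 mean more than one core's worth. */
unsigned int cpulimit_percent(unsigned int wanted_mhz, unsigned int core_mhz)
{
    return wanted_mhz * 100U / core_mhz;
}
```

For the cases above, cpulimit_percent(1000, 2000) yields 50 and cpulimit_percent(4000, 2000) yields 200.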
We have just released a new RHEL6-based kernel, 042test005. It is shaping up pretty well: as you can see from the changelog, it's not just bug fixes but also performance improvements. If you haven't tried it yet, I suggest you do so today! Do not postpone this until 2011. After all, this is what will become the next stable OpenVZ kernel.
The RHEL6 kernel needs an appropriate (i.e. recent) Linux distribution. If you don't want the latest Fedora releases, can't afford RHEL6, and are tired of waiting for CentOS 6, I suggest you go with Scientific Linux 6 (SL6). This is yet another RHEL6 clone, developed and used by CERN, Fermilab and other similar institutions.