James Bottomley, CTO, Server Virtualization for Parallels gave a presentation entitled "Containers and The Cloud: Do you need another virtual environment?" on Oct 23. The Linux Foundation recently posted it to YouTube.
There is a lot of good information in the video even for us OpenVZ folks. Enjoy:
vzctl 4.6 build hit download.openvz.org (and its many mirrors around the world) last week. Let's see what's in store, shall we?
First and foremost, I/O limits, but the feature is already described in great detail. What I want to add is that the feature was sponsored by GleSYS Internet Services AB, and is one of the first results of the OpenVZ partnership program in action. This program is a great opportunity for you to help keep the project up and running, and also to experience our expert support service in case you need it. Think of it as a two-way support option. Anyway, I digress. What was it about? Oh yes, vzctl.
Second, improvements to UBC setting in VSwap mode. Previously, if you set RAM and swap, all other UBC parameters not set explicitly were set to unlimited. Now they are just left unset (meaning that the default in-kernel setting is used, whatever it is). Plus, in addition to physpages and swappages, vzctl sets lockedpages and oomguarpages (to RAM), vmguarpages (to RAM+swap).
Plus, there is a new parameter vm_overcommit, and it works in the following way -- if set, it is used as a multiplier to ram+swap to set privvmpages. In layman's terms, this is the ratio of virtual memory (privvmpages) to real memory (ram+swap). Again, physpages limits RAM, and physpages+swappages limits real memory used by a container. On the other side, privvmpages limits memory allocated by a container. While it depends on the application, generally not all allocated memory is used -- sometimes allocated memory is 5 or 10 times more than used memory. What vm_overcommit gives is a way to set this gap.
For example, giving container $CTID 2 GB of RAM, 4 GB of swap and a vm_overcommit of 3 limits it to (2+4)*3, i.e. 18 GB of virtual memory. That means this container can allocate up to 18 GB of memory, but can't use more than 6 GB. So, vm_overcommit is just a way to set privvmpages implicitly, as a function of physpages and swappages. Oh, and if you are lost in all those *pages, we have extensive documentation at http://openvz.org/UBC.
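To make the arithmetic concrete, here is a small shell sketch of how privvmpages could be derived from RAM, swap and vm_overcommit. The function name and the 4 KiB page size are assumptions made for illustration; this is not vzctl's actual code.

```shell
# Hypothetical helper: compute privvmpages (in pages) from RAM and swap
# sizes (in GB) and a vm_overcommit multiplier.
# Assumes 4096-byte pages and integer multipliers (shell arithmetic only).
PAGE_SIZE=4096
PAGES_PER_GB=$(( 1024 * 1024 * 1024 / PAGE_SIZE ))

privvmpages_for() {
    ram_gb=$1; swap_gb=$2; overcommit=$3
    echo $(( (ram_gb + swap_gb) * overcommit * PAGES_PER_GB ))
}

privvmpages_for 2 4 3   # prints 4718592, i.e. 18 GB worth of 4 KiB pages
```

With an overcommit of 1 the result equals RAM+swap expressed in pages, so allocated memory could never exceed real memory.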
A first version of the vzoversell utility has been added. This is a proposed vzmemcheck replacement for VSwap mode. Currently it just summarizes RAM and swap limits for all VSwap containers, and compares them to the RAM and swap available on the host. Surely you can oversell RAM (as long as you have enough swap), but the sum of all RAM+swap limits should not exceed RAM+swap on the node, and the main purpose of this utility is to check that constraint.
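The constraint itself is simple enough to sketch in a few lines of shell. The function below is a hypothetical illustration, not the real vzoversell: it takes the host's RAM+swap followed by each container's RAM+swap limit, all in MB.

```shell
# Hypothetical check in the spirit of vzoversell: compare the sum of
# per-container RAM+swap limits against the host's RAM+swap (all in MB).
check_oversell() {
    host_total=$1; shift
    sum=0
    for limit in "$@"; do
        sum=$(( sum + limit ))
    done
    if [ "$sum" -le "$host_total" ]; then
        echo "ok: $sum of $host_total MB committed"
    else
        echo "oversold: $sum of $host_total MB committed"
        return 1
    fi
}

# Host with 16 GB RAM + 16 GB swap; three containers limited to 6 GB each.
check_oversell 32768 6144 6144 6144   # prints: ok: 18432 of 32768 MB committed
```

The non-zero return status in the oversold case makes the check easy to use from a monitoring script.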
vztmpl-dl got a new --list-orphans option. It lists all local templates that are not available from the download server(s) (and therefore can't be updated by vztmpl-dl). Oh, by the way, since vzctl 4.5 you can use
vzubc got some love, too. It now skips unlimited UBCs by default, in order to improve the signal-to-noise ratio. If you want the old behavior (i.e. all UBCs), use the -v flag.
Surely, there's a bunch of other fixes and improvements; please read the changelog if you want to know it all. One thing in particular is worth mentioning: a hack to vzctl console. As you might know, in OpenVZ a container's console is sort of eternal, meaning you can attach to it before a container is even started, and it keeps its state even if you detach from it. That creates a minor problem, though -- if someone runs, say, vim in the console, then detaches and reattaches, vim does not redraw anything and the console shows nothing. To work around this, one needs to press Ctrl-L (it is also recognized by bash and other software). But it's a bit inconvenient to do that after every reattach. This is not required if the terminal size has changed (i.e. you detach from the console, change your xterm size, then run vzctl console again), because in this case vim notices the change and redraws accordingly. So what vzctl now does after reattach is tell the underlying terminal its size twice -- first the wrong size (with an incremented number of rows), then the right size (the one of the terminal vzctl is running in). This forces vim (or whatever is running on the container console) to redraw.
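The "resize twice" trick can be sketched in shell. This only illustrates the idea; the function name is made up, and vzctl itself does not use this code.

```shell
# Sketch of the "resize twice" trick. Prints the two sizes (rows cols)
# that would be reported to the terminal, in order: first a deliberately
# wrong size with one extra row, then the real size.
redraw_sizes() {
    rows=$1; cols=$2
    echo "$(( rows + 1 )) $cols"
    echo "$rows $cols"
}

# On a real tty each pair could be applied with stty, for example:
#   redraw_sizes 24 80 | while read r c; do stty rows "$r" cols "$c" </dev/tty; done
redraw_sizes 24 80
```

Each size change delivers a window-change notification to the program running on the terminal, which is what makes a full-screen application repaint.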
Finally, new vzctl (as well as other utilities) are now in our Debian wheezy repo at http://download.openvz.org/debian,
Enjoy, and don't forget to report bugs to http://bugzilla.openvz.org/
Today we are releasing a somewhat small but very important OpenVZ feature: per-container disk I/O bandwidth and IOPS limiting.
OpenVZ has had an I/O priority feature for a while, which lets one set a per-container I/O priority -- a number from 0 to 7. It works like this: if two similar containers with similar I/O patterns but different I/O priorities run on the same system, the container with a prio of 0 (lowest) will get I/O speed about 2-3 times lower than the container with a prio of 7 (highest). This works for some scenarios, but not all.
So, I/O bandwidth limiting was introduced in Parallels Cloud Server, and as of today is available in OpenVZ as well. Using the feature is very easy: you set a limit for a container (in megabytes per second), and watch it obeying the limit. For example, here I try doing I/O without any limit set first:
root@host# vzctl enter 777
root@CT:/# cat /dev/urandom | pv -c - >/bigfile
88MB 0:00:10 [8.26MB/s] [ <=> ]
^C
Now let's set the I/O limit to 3 MB/s:
root@host# vzctl set 777 --iolimit 3M --save
UB limits were set successfully
Setting iolimit: 3145728 bytes/sec
CT configuration saved to /etc/vz/conf/777.conf
root@host# vzctl enter 777
root@CT:/# cat /dev/urandom | pv -c - >/bigfile3
39.1MB 0:00:10 [ 3MB/s] [ <=> ]
^C
If you run it yourself, you'll notice a spike of speed at the beginning, and then it goes down to the limit. This is the so-called burstable limit at work: it allows a container to exceed its limit (up to 3x) for a short time.
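For reference, the numbers in the session above can be reproduced with a bit of shell arithmetic. The helper is hypothetical; only the M suffix appears in the example, so treating K and G the same way is an assumption, as is applying the 3x burst factor as a plain multiplier.

```shell
# Hypothetical helper: convert a limit such as "3M" to bytes per second.
# Only "M" is shown in the text; K and G handling is an assumption.
to_bytes() {
    case $1 in
        *K) echo $(( ${1%K} * 1024 )) ;;
        *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
        *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

limit=$(to_bytes 3M)
echo "limit: $limit bytes/sec"            # 3145728, as in vzctl's output above
echo "burst: $(( limit * 3 )) bytes/sec"  # short-term ceiling at the 3x factor
```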
In the above example we tested writes. Reads work the same way, except when read data are in fact coming from the page cache (such as when you are reading the file which you just wrote). In this case, no actual I/O is performed, therefore no limiting.
The second feature is a limit on I/O operations per second, or just IOPS limit. For more info on what IOPS is, go read the linked Wikipedia article -- all I can say here is that for traditional rotating disks the hardware capabilities are pretty limited (75 to 150 IOPS is a good guess, or 200 if you have high-end server class HDDs), while for SSDs this is much less of a problem. IOPS limit is set in the same way as iolimit (
Finally, to play with this stuff, you need:testing kernels.
Oh, such a provocative subject! Not really. Many people do believe that OpenVZ is obsolete, and when I ask why, the three most popular answers are:
1. OpenVZ kernel is old and obsolete, because it is based on 2.6.32, while everyone in 2013 runs 3.x.
2. LXC is the future, OpenVZ is the past.
3. OpenVZ is no longer developed, it was even removed from Debian Wheezy.
Let me try to address all these misconceptions, one by one.
1. "OpenVZ kernel is old". Current OpenVZ kernels are based on kernels from Red Hat Enterprise Linux 6 (RHEL6 for short). This is the latest and greatest version of the enterprise Linux distribution from Red Hat, a company that is almost always near the top of the list of companies contributing to Linux kernel development (see 1, 2, 3, 4 for a few random examples). While no kernel is ideal and bug-free, the RHEL6 one is a good real-world approximation of these qualities.
What people in Red Hat do for their enterprise Linux is take an upstream kernel and basically fork it, ironing out the bugs, cherry-picking security fixes, driver updates, and sometimes new features from upstream. They do so for about half a year or more before a release, so the released kernel already seems "old and obsolete" if one is looking at the kernel version number. Well, don't judge a book by its cover, don't judge a kernel by its number. Of course it's neither old nor obsolete -- it's just more stable and secure. And then, after a release, it is very well maintained, with modern hardware support, regular releases, and prompt security fixes. This makes it a great base for the OpenVZ kernel. In a sense, we are standing on the shoulders of a red hatted giant (and since this is open source, they are standing just a little bit on our shoulders, too).
RHEL7 is being worked on right now, and it will be based on some 3.x kernel (possibly 3.10). We will port the OpenVZ kernel to RHEL7 once it becomes available. In the meantime, the RHEL6-based OpenVZ kernel is the latest and greatest, and please don't be fooled by the fact that uname shows 2.6.32.
2. OpenVZ vs LXC. The OpenVZ kernel was historically developed separately, i.e. aside from the upstream Linux kernel. This mistake was recognized in 2005, and since then we have kept working on merging OpenVZ bits and pieces into the upstream kernel. It took way longer than expected. We are still in the middle of the process, with some great stuff (like net namespace and CRIU, more than 2000 changesets in total) merged, while some other features are still on our TODO list. In the future (another eight years? who knows...) OpenVZ kernel functionality will probably be fully upstream, so OpenVZ will just be a set of tools. We are happy to see that Parallels is not the only company interested in containers for Linux, so it might happen a bit earlier. For now, though, we still rely on our organic non-GMO home grown kernel (although it is already optional).
Now what is LXC? In fact, it is just another user-space tool (not unlike vzctl) that works on top of a recent upstream kernel (again, not unlike vzctl). As we work on merging our stuff upstream, LXC tools will start using new features and therefore benefit from this work. So far at least half of kernel functionality used by LXC was developed by our engineers, and while we don't work on LXC tools, it would not be an overestimation to say that Parallels is the biggest LXC contributor.
So, both OpenVZ and LXC are actively developed and have their future. We might even merge our tools at some point; the idea was briefly discussed during the last containers mini-conf at Linux Plumbers. LXC is not a successor to OpenVZ, though. They are two different projects, although not entirely separate (since the OpenVZ team contributes to the kernel a lot, and both tools use the same kernel functionality). OpenVZ is essentially LXC++, because it adds some more stuff that is not (yet) available in the upstream kernel (such as stronger isolation, better resource accounting, plus some auxiliary features like ploop).
3. OpenVZ no longer developed, removed from Debian. The Debian kernel team decided to drop OpenVZ (as well as a few other) kernel flavors from Debian 7 a.k.a. Wheezy. This is completely understandable: kernel maintenance takes time and other resources, and they probably don't have enough. That doesn't mean though that OpenVZ is not developed. It's really strange to have to argue that, but please check our software updates page (or the announce@ mailing list archives). We have made about 80 software releases this year so far. That works out to about 2 releases every week. Most of those are new kernels. So no, in no way is it abandoned.
As for Debian Wheezy, we are providing our repository with OpenVZ kernel and tools, as it was announced just yesterday.
Good news, everyone!
Many people use OpenVZ on Debian. In fact, Debian was one of the distributions that came with OpenVZ kernel and tools. Unfortunately, it's not that way anymore, since Debian 7 "Wheezy" dropped the OpenVZ kernel. A workaround was to take an RPM-packaged OpenVZ kernel and convert it to .deb using the alien tool, but the process is manual and somewhat unnatural.
Finally, now we have a working build system for Debian kernel packages, and a repository for Debian Wheezy with the latest and greatest OpenVZ kernels, as well as tools. In fact, we have two: one for stable, one for testing kernels and tools. Kernel debs are built and released at the same time as rpms. Currently we have vzctl/vzquota/ploop in the 'wheezy-test' repository only -- once we are sure they work as expected, we will move those into the stable 'wheezy' repo.
To enable these repos:
To install the kernel:
More info is available from https://wiki.openvz.org/Installation_on
Vzstats, which we introduced at the end of May, turns 4 months today. While it's still a baby, we can already say it is showing some great potential. Current stats are at about 13 000 servers running OpenVZ, with more than 200 000 containers. The number of newly registered hosts is not going down, so I assume we'll have more than 20K active servers and at least 300K containers by the end of the year.
Let's see what an average OpenVZ host looks like. It's an 8-core machine with 16 GB of RAM and 700 GB of disk space on /vz, of which less than 20% is used. This Joe is hosting only 8 containers, of which 4 are CentOS, 2 Debian, 1 Ubuntu and 1 is something else. Pretty average, eh?
What would be the ultimate OpenVZ host then? This one is a 64-core monster with 1 TB of RAM and 50 TB of disk space, running about 1000 containers.
You can see many more stats like that at the project server site, http://stats.openvz.org/. The problem, though, is that not all stats can be represented in a meaningful way. For example, with vzstats 0.5.1 we introduced the top-ps script, showing the top 5 processes in every running container. The idea was to have something similar to Debian's popcon, but taking process names rather than package names into account. We have received lots of data but, frankly speaking, we don't really know how to process all this. I tried a number of queries and came up with some stats like "60% of all containers run a web server", but all in all, this is not so useful (or interesting) for OpenVZ development. What's more, users raised a concern that such stuff should not be looked into at all. That sounds about right, so we have just released vzstats-0.5.2 with the top-ps script removed.
We would like to thank all users who are participating in vzstats. Please continue to do so, and if you have any concerns, please speak up -- we are listening.
Currently, our best kernel line is the one that is based on Red Hat Enterprise Linux 6 kernels (RHEL6 for short). This is our most feature-rich, up-to-date yet stable kernel -- i.e. the best. The second-best option is the RHEL5-based kernel -- a few years older, so neither vSwap nor ploop, but still good.
There is a dilemma of either releasing a new kernel version earlier, or delaying it for more internal testing. We figured we can do both! Each kernel branch (RHEL6 and RHEL5) comes via two channels -- testing and stable. In terms of yum, we have four kernel repositories defined in the openvz.repo file; their names should be self-explanatory:
The process of releasing kernels is the following: right after building a kernel, we push it out to the appropriate -testing repository, so it is available as soon as possible. We then do some internal QA on it (which can be either basic or thorough, depending on the amount of our changes, and on whether we did a rebase to a newer RHEL6 kernel). Based on the QA report, sometimes we do another build with a few more patches, and repeat the process. Once the kernel looks good to our QA, we move it from testing to stable. In some rare cases (such as when we do one simple but quite important fix), new kernels go right into stable.
So, our users can enjoy being stable, or being up-to-the-moment, or both. In fact, if you have more than a few servers running OpenVZ, we strongly suggest that you dedicate one or two boxes to running -testing kernels, and report any bugs found to OpenVZ bugzilla. This is good for you, because you will be able to catch bugs early, and let us fix them before they hit your production systems. This is good for us, too, because no QA department is big enough to catch all possible bugs in a myriad of hardware and software configurations and use cases.
Enabling -testing repo is easy: just edit openvz.repo, setting
A shiny new vzctl 4.4 was released just today. Let's take a look at its new features.
As you know, vzctl was able to download OS templates automatically for quite some time now, when vzctl create --ostemplate was used with a template which is not available locally. Now, we have just moved this script to a standard /usr/sbin place and added a corresponding vztmpl-dl(8) man page. Note you can use the script to update your existing templates as well.
The next few features are meant to make OpenVZ more hassle-free. Specifically, this release adds a post-install script to configure some system aspects, such as changing some parameters in /etc/sysctl.conf and disabling SELinux. This is something that had to be done manually before, so it was described in the OpenVZ installation guide. Now, it's just one less manual step, and one less paragraph in the Quick installation guide.
Another "make it easier" feature is automatic nameserver propagation from the host to the container. Before vzctl 4.4 there was a need to set a nameserver for each container, in order for DNS to work inside a container. So, the usual case was to check your host's /etc/resolv.conf, find out what the nameservers are, and set those using something like
Now, since defaults for most container parameters can be set in global OpenVZ configuration file /etc/vz/vz.conf, if it contains a line like
Another small new feature is ploop-related. When you start (or mount) a ploop-based container, fsck for its inner filesystem is executed. This mimics the way a real server works -- it runs fsck on boot. Now, there is a 1/30 or so probability that fsck will actually do filesystem check (it does that every Nth mount, where N is about 30 and can be edited with tune2fs). For a large container, fsck could be a long operation, so when we start containers on boot from the /etc/init.d/vz initscript, we skip such check to not delay containers start-up. This is implemented as a new
Thanks to our user and contributor Mario Kleinsasser, vzmigrate is now able to migrate containers between boxes with different VE_ROOT/VE_PRIVATE values. For example, if one server runs Debian with /var/lib/vz and another is CentOS with /vz, vzmigrate is smart enough to notice that and do the proper conversion. Thank you, Mario!
Another vzmigrate enhancement is the option -f/--nodeps, which can be used to disable some pre-migration checks. For example, in case of live migration the destination CPU capabilities (such as SSE3) are cross-checked against the ones of the source server, and if some caps are missing, migration is not performed. In fact, not too many applications are optimized to use all CPU capabilities, so there is a fair chance that live migration will work anyway. This --nodeps option is exactly for such cases -- i.e. you can use it if you know what you are doing.
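The kind of cross-check described here can be sketched as follows. This is an illustration only, not vzmigrate's code: the function name is made up, and the flag lists are assumed to be space-separated strings such as those on the "flags" line of /proc/cpuinfo.

```shell
# Hypothetical pre-migration check: print CPU flags present on the source
# host but missing on the destination. Both arguments are space-separated
# flag lists.
missing_caps() {
    src=$1; dst=$2
    for flag in $src; do
        case " $dst " in
            *" $flag "*) ;;                  # destination has this flag
            *) printf '%s\n' "$flag" ;;      # missing on destination
        esac
    done
}

missing_caps "sse sse2 sse3 ssse3" "sse sse2 ssse3"   # prints: sse3
```

An empty result means the destination supports everything the source does; a non-empty one is the situation where a check like this would refuse to migrate unless told otherwise.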
This is more or less it regarding new features. Oh, it makes sense to note that default OS template is now centos-6-x86, and NEIGHBOR_DEVS parameter is commented out by default, because this increases the chances container networking will work "as is".
Fixes? There are a few -- to vzmigrate, vzlist, vzctl convert, vzctl working on top of upstream kernel (including some fixes for CRIU-based checkpointing), and the build system. Documentation (those man pages) is updated to reflect all the new options and changes.
A list of contributors to this vzctl release is quite impressive, too -- more than 10 people.
As always, if you find a bug in vzctl, please report it to bugzilla.openvz.org.
OpenVZ ploop is a wonderful technology, and I want to share more of its wonderfulness with you. We have previously covered ploop in general and its write tracker feature to help speed up container migration in particular. This time, I'd like to talk about snapshots and backups.
But let's start with yet another ploop feature -- its expandable format. When you create a ploop container with say 10G of disk space, the ploop image is just slightly larger than the size of the actual container files. I just created a centos-6-x86 container -- the ploop image size is 747M, and inside the CT df shows that 737M is used. Of course, for an empty ploop image (with a fresh filesystem and zero files) the ratio will be worse. Now, when the CT is writing data, the ploop image automatically grows to accommodate the data size.
Now, these images can be layered, or stacked. Imagine having a single ploop image, consisting of blocks. We can add another image on top of the first one, so that new reads will fall through to the lower image (because the upper one is still empty), while new writes will end up being written to the upper (top) image. Perhaps this image will save some more words here:
The new (top) image is now accumulating all the changes, while the old (bottom) one is in fact the read-only snapshot of the container filesystem. Such a snapshot is cheap and instant, because there is no need to copy a lot of data or do other costly operations. Of course, ploop is not limited to only two levels -- you can create many more (up to 255 if I remember correctly, which is way above any practical limit).
What can be done with such a snapshot? We can mount it and copy all the data to a backup (update: see openvz.org/Ploop/backup). Note that such backup is very fast, online and consistent. There's more to it though. A ploop snapshot, combined with a snapshot of a running container in memory (also known as a checkpoint) and a container configuration file(s), can serve as a real checkpoint to which you can roll back.
Consider the following scenario: you need to upgrade your web site backend inside your container. First, you do a container snapshot (I mean a complete snapshot, including an in-memory image of the running container). Then you upgrade, and realize your web site is all messed up and broken. Horror story, isn't it? No. You just switch to the before-upgrade snapshot and keep working as before. It's like moving back in time, and all this is done on a running container, i.e. you don't have to shut it down.
Finally, when you don't need a snapshot anymore, you can merge it back. Merging is when changes from an upper level are written to a lower level (i.e. the one under it), and then the upper level is removed. Such merging is of course not as instant as creating a snapshot, but it is online, so you can just keep working while ploop is busy merging.
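The layering rules (reads fall through, writes go to the top, merge folds the top into the layer under it) can be shown with a toy model in shell. This is only an analogy under loud assumptions: here a layer is a directory and a block is a file, and all function names are made up; real ploop images are block-level files handled by the kernel.

```shell
# Toy model of two stacked layers: a directory per layer, a file per block.

# read_block TOP BASE BLOCK: read from the top layer, else fall through.
read_block() {
    if [ -f "$1/$3" ]; then
        cat "$1/$3"
    elif [ -f "$2/$3" ]; then
        cat "$2/$3"
    fi
}

# write_block LAYER BLOCK DATA: writes always land in the given (top) layer.
write_block() {
    printf '%s\n' "$3" > "$1/$2"
}

# merge_layers TOP BASE: fold the top layer's blocks into the base, drop top.
merge_layers() {
    for f in "$1"/*; do
        [ -e "$f" ] || continue
        mv "$f" "$2/"
    done
    rmdir "$1"
}

demo=$(mktemp -d) && mkdir "$demo/base" "$demo/top"
write_block "$demo/base" b0 "old data"     # state at snapshot time
write_block "$demo/top" b0 "new data"      # change made after the snapshot
read_block "$demo/top" "$demo/base" b0     # prints: new data
merge_layers "$demo/top" "$demo/base"
cat "$demo/base/b0"                        # prints: new data
rm -rf "$demo"
```

Until the merge, the base directory still holds "old data", which is what makes it usable as a read-only snapshot.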
All this can be performed from the command line using vzctl. For details, see vzctl(8) man page, section Snapshotting. Here's a quick howto:
Create a snapshot:
Mount a snapshot (say to copy the data to a backup):
Rollback to a snapshot:
Delete a snapshot (merging its data to a lower level image):
Big news for every serious OpenVZ user. Finally we have it!
Parallels, a sponsor behind the OpenVZ project, is now offering an OpenVZ Maintenance Partnership program. The program provides bug resolution support and feature development to the OpenVZ community. The OpenVZ Maintenance Partnership has a small annual fee and provides two benefits to partnership members.
Partnership members will receive a support ID that will allow them to submit up to 10 high priority bugs per year. These bugs will be placed at the highest priority level in the development stack.
Partnership members will also be able to submit a feature request(s) which will be reviewed by the Parallels engineering team. They will work with you to clarify the requirements and implementation options and provide an implementation estimate and a schedule.
Learn more and join the OpenVZ Maintenance Partnership here