On vSwap and 042stab04x kernel improvements


The best feature of the new (RHEL6-based) 042 series of OpenVZ kernels is definitely vSwap. The short story is, we used to have 22 user beancounter parameters, which every seasoned OpenVZ user knows by heart. Each of these parameters is there for a reason, but 22 knobs are a bit too complex for a mere mortal to manage, especially bearing in mind that
  • many of them are interdependent;
  • the sum of all limits should not exceed the resources of a given physical server.
Keeping this configuration optimal (or even consistent) is quite a challenging task even for a senior OpenVZ admin (with the probable exception of an ex-airline pilot). This complexity is the main reason why there are multiple articles and blog entries complaining that OpenVZ is worse than Xen, or that it is not suitable for hosting Java apps.

We do have some workarounds to mitigate this complexity, but this is still not the way to go. While we think highly of our users, we do not expect all of them to be ex-airline pilots. To solve the complexity, the number of per-container knobs and handles should be reduced to some decent number, or at least most of these knobs should be optional.

We worked on that for a few years, and the end result is called vSwap (where V is for Vendetta, oh, pardon me, Virtual).

The vSwap concept is as simple as a rectangle. For each container, there are only two required parameters: the memory size (known as physpages) and the swap size (swappages). Almost everyone (not only an admin, but even an advanced end user) knows what RAM and swap are. On a physical server, if there is not enough memory, the system starts to swap out memory pages to disk, then swap in some other pages, which results in severe performance degradation but keeps the system from failing miserably.

It's about the same with vSwap, except that
  • RAM and swap are configured on a per container basis;
  • no I/O is performed until it is really necessary (this is why swap is virtual).
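
To make the two knobs a bit more concrete: physpages and swappages are counted in memory pages, so a human-friendly size has to be converted to a page count somewhere. The sketch below shows that arithmetic for 1G of RAM and 2G of swap. It assumes 4 KiB pages (typical for x86), and the helper name size_to_pages is made up for illustration; it is not part of vzctl.

```python
PAGE_SIZE = 4096  # bytes; assumption: 4 KiB pages, as on x86

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def size_to_pages(size: str) -> int:
    """Convert a size such as '1G' or '512M' to a number of memory pages.

    Hypothetical helper for illustration only.
    """
    number, unit = size[:-1], size[-1].upper()
    if unit not in UNITS:
        raise ValueError(f"unknown unit in {size!r}")
    total_bytes = int(number) * UNITS[unit]
    # Round up so that a partial page still counts as a whole page.
    return -(-total_bytes // PAGE_SIZE)


ram_pages = size_to_pages("1G")    # RAM limit, in pages
swap_pages = size_to_pages("2G")   # swap limit, in pages
print(ram_pages, swap_pages)
```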

Some vSwap internals

Now, there are only two knobs per container on the dashboard, namely RAM and swap, and all the complexity is hidden under the hood. I am going to describe just a bit of those undercover mechanics and explain what the "Reworked VSwap kernel memory accounting" line from the 042stab040.1 kernel changelog stands for.

The biggest problem is, RAM for containers is not just RAM. First of all, there is a need to distinguish between
  • the user memory,
  • the kernel memory,
  • the page cache,
  • and the directory entry cache.
The user memory is more or less clear: it is simply the memory that programs allocate for themselves to run. It is relatively easy to account for, and it is relatively simple to limit (but read on).

The kernel memory is a really complex thingie. Right, it is the memory that the kernel allocates for itself in order for programs in a particular container to run. This includes a lot of stuff I'd rather not dive into if I want to keep this piece an article, not a tome. Having said that, two particular kernel memory types are worth explaining.

First is the page cache, the kernel mechanism that caches disk contents in memory (that would be unused otherwise) to minimize I/O. When a program reads some data from a disk, that data is read into the page cache first, and when a program writes to a disk, the data goes to the page cache (and is eventually written (flushed) to disk). In case of repeated disk access (which happens quite often), data is taken from the page cache, not from the real disk, which greatly improves overall system performance, since a disk is much slower than RAM. Now, some of the page cache is used on behalf of a container, and this amount must be charged to "RAM used by this container" (i.e. physpages).

Second is the directory entry cache (dcache for short), yet another sort of cache and another sort of kernel memory. Disk contents are a tree of files and directories, and such a tree is quite tall and wide. In order to read the contents of, say, the /bin/sh file, the kernel has to read the root (/) directory, find the 'bin' entry in it, read the /bin directory, find the 'sh' entry in it, and finally read the file. Although these operations are not very complex, there is a multitude of them, they take time, and they are repeated often for most of the "popular" files. In order to improve performance, the kernel keeps directory entries in memory; this is what dcache is for. The memory used by dcache should also be accounted and limited, since otherwise it's easily exploitable (not only by root, but also by an ordinary user, since any user is free to change into directories and read files).
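
The /bin/sh walk above can be mimicked with a toy model. This is only a sketch of the idea, not how the kernel implements dcache: the class ToyDcache, the nested-dict "disk" and the disk_reads counter are all invented for illustration. It shows that a repeated lookup costs no new directory reads, and also that the cache grows with every distinct path a user touches, which is why it has to be accounted and limited.

```python
class ToyDcache:
    """Toy directory entry cache (illustrative, not the kernel's design)."""

    def __init__(self, tree):
        self.tree = tree        # nested dicts standing in for the on-disk tree
        self.cache = {}         # (parent path, entry name) -> child node
        self.disk_reads = 0     # directory reads that had to go to the "disk"

    def lookup(self, path):
        node, parent = self.tree, "/"
        for name in path.strip("/").split("/"):
            key = (parent, name)
            if key not in self.cache:
                self.disk_reads += 1           # simulate reading the directory
                self.cache[key] = node[name]   # KeyError if the entry is missing
            node = self.cache[key]
            parent = parent.rstrip("/") + "/" + name
        return node


tree = {"bin": {"sh": "shell binary", "ls": "ls binary"}}
dcache = ToyDcache(tree)
dcache.lookup("/bin/sh")   # reads "/" to find 'bin', then "/bin" to find 'sh'
dcache.lookup("/bin/sh")   # both entries are cached, no new reads
dcache.lookup("/bin/ls")   # 'bin' is cached, one new read to find 'ls'
print(dcache.disk_reads, len(dcache.cache))
```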

Now, the physical memory of a container is the sum of its user memory, the kernel memory, the page cache and the dcache. Technically, dcache is accounted into the kernel memory, then kernel memory is accounted into the physical memory, but it's not overly important.
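
The nesting just described (dcache counted as part of kernel memory, kernel memory counted as part of physical memory) can be written down as a small sketch. The class and field names are made up, and treating "other kernel memory" as a separate counter is an assumption made only to keep the arithmetic visible.

```python
from dataclasses import dataclass


@dataclass
class ContainerMemory:
    """Page counts charged to one container (all names are illustrative)."""
    user: int = 0
    page_cache: int = 0
    dcache: int = 0
    other_kernel: int = 0   # assumption: kernel memory other than the dcache

    @property
    def kernel(self) -> int:
        # dcache is accounted into kernel memory
        return self.other_kernel + self.dcache

    @property
    def physpages(self) -> int:
        # kernel memory, in turn, is accounted into physical memory
        return self.user + self.kernel + self.page_cache


ct = ContainerMemory(user=1000, page_cache=300, dcache=50, other_kernel=150)
print(ct.kernel, ct.physpages)
```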

Improvements in the new 042stab04x kernels

Better reclamation and memory balancing

What to do if a container hits its physical memory limit? Free some pages by writing their contents to the abovementioned virtual swap? Well, not quite yet. Remember that there are also a page cache and a dcache, so the kernel can easily discard some of the pages from these caches, which is way cheaper than swapping out.

The process of finding some free memory is known as reclamation. The kernel needs to decide very carefully when to start reclamation, how many and which exact pages to reclaim in every particular situation, and when it is the right time to swap out rather than discard some of the cache contents.
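
As a rough illustration of "discard cache first, swap only if that is not enough", here is a toy reclaim function. It is deliberately naive: the fixed order (page cache, then dcache, then swap) and the function name are assumptions for the sake of the example, while the real kernel makes these decisions far more carefully. The third return value shows the unpleasant case where neither discarding nor swapping brings the container under its limit.

```python
def reclaim(usage, limit, swap_free):
    """Try to bring total usage under `limit` (all values are page counts).

    usage: dict with 'user', 'page_cache' and 'dcache' entries.
    Returns (new usage, pages swapped out, whether the limit is now met).
    Toy model only; the order of steps is an assumption.
    """
    usage = dict(usage)                 # do not modify the caller's dict
    excess = sum(usage.values()) - limit
    swapped = 0
    # Step 1: discard cache pages, which is cheaper than swapping.
    for kind in ("page_cache", "dcache"):
        if excess <= 0:
            break
        dropped = min(usage[kind], excess)
        usage[kind] -= dropped
        excess -= dropped
    # Step 2: only then move user pages to the (virtual) swap, if it has room.
    if excess > 0:
        swapped = min(usage["user"], excess, swap_free)
        usage["user"] -= swapped
        excess -= swapped
    return usage, swapped, excess <= 0


# Slightly over the limit: dropping some page cache is enough, nothing is swapped.
print(reclaim({"user": 900, "page_cache": 150, "dcache": 50}, 1000, 500))
# Far over the limit with little swap left: the limit cannot be met.
print(reclaim({"user": 1200, "page_cache": 100, "dcache": 50}, 1000, 100))
```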

Remember, we have four types of memory (kernel, user, dcache and page cache) and only one knob that limits the sum of them all. It would be easier for the kernel, but not for the user, to have separate limits for each type of memory. But for the user's convenience and simplicity, the kernel has only one knob for these four types, so it needs to balance between them. One major improvement in the 042stab040 kernel is that this balancing is now performed better.

Stricter memory limit

During the lifetime of a container, the kernel might face a situation when it needs more kernel memory, or user memory, or perhaps more dcache entries, and the memory for the container is tight (i.e. close to the limit), so it needs to either reclaim or swap. The problem is that there are some situations when neither reclamation nor swapping is possible, so the kernel can either fail miserably (say, by killing a process) or go beyond the limit and hope that everything will be fine and mommy won't notice. Another big improvement in the 042stab040 kernel is that it reduces the number of such situations; in other words, the new kernel obeys the memory limit more strictly.


Finally, the kernel is now in pretty good shape, so we can afford some polishing, minor optimizations, and fine tuning. Such polishing was performed in a few subsystems, including checkpointing, user beancounters, netfilter, the kernel NFS server and VZ disk quota.

Some numbers

In total, there are 53 new patches in 042stab040.1 compared to the previous 039 kernels. On top of that, 042stab042.1 adds another 30. We hope that the end result is improved stability and performance.


( 14 comments — Leave a comment )
Nov. 16th, 2011 08:39 pm (UTC)

"User Beancounters manual" link actually points to the vzsplit man page :)
Nov. 17th, 2011 05:06 pm (UTC)
thanks, fixed
Nov. 17th, 2011 04:20 pm (UTC)
Thanks for finally explaining this and giving me some hope that things will improve. I have been struggling with the new RHEL6 kernel memory management. Page caching in particular. Will have to give the 040+ kernels a try and see if you have smoothed out the bumps.
Nov. 17th, 2011 05:09 pm (UTC)
Can you be more specific? I am not aware of any major bugs in 042stab03x kernels memory management.
Nov. 24th, 2011 06:08 am (UTC)
How should a config then look like?

How should a config file look, then? Just VSWAP and MEMINFO inserted? I'm a little bit confused right now.

Nov. 24th, 2011 11:58 am (UTC)
Re: How should a config then look like?
Something like this (an example for 1G RAM and 2G swap)


And we will make it look simpler in future versions.
Nov. 24th, 2011 01:31 pm (UTC)
Re: How should a config then look like?
OK thanks but i still have to set KMEMSIZE, DCACHESIZE and LOCKEDPAGES? When will we see a vzcfgvalide which support these templates?
Nov. 24th, 2011 01:38 pm (UTC)
Re: How should a config then look like?

It's still a work in progress, we will fix it in newer vzctl versions.

> When will we see a vzcfgvalide which support these templates?

I am still working on it. So far I have made vzsplit work in the new mode.
Florian Bantner
Oct. 15th, 2013 09:30 pm (UTC)
I still didn't get how I can check if I'm using vSwap. The ram allocation works but user_beancounter shows unlimited for most values. and "vzlist -o vswap 123" tells me "Unknown field: vzswap" (on Wheezy)
Oct. 17th, 2013 12:48 am (UTC)
Which version of vzctl does this vzlist come from?
Florian Bantner
Oct. 23rd, 2013 08:55 am (UTC)
dpkg -l | grep vzctl
ii vzctl amd64 server virtualization solution - control tools
Oct. 28th, 2013 02:26 am (UTC)
This is a very old version. Please get 4.5 from http://ftp.openvz.org/debian
Grzegorz Palak
Nov. 4th, 2013 05:43 pm (UTC)

Is it possible to disable the software emulation of slowed-down swap when using vswap?
I have lots of spare memory, more than 40GB, which I would like to use as fast tmpfs shared storage between VEs.

Nov. 7th, 2013 11:35 pm (UTC)
If you are talking about tmpfs, then just bind-mount it to all containers that you need, that's it. This has nothing to do with vswap.