The latest prepatch for the stable Linux kernel tree, 2.6.19-rc1, now includes some pieces of OS-level virtualization from OpenVZ, IBM, and Eric Biederman. These patches had been sitting in the -mm tree (Andrew Morton’s) for some time, and during the “2.6.19 merge window” Andrew submitted them to Linus Torvalds. They are now part of “vanilla” Linux and will ship with the final 2.6.19 kernel.
So, what exactly went into the Linux kernel? Essentially, three sets of patches that implement three features needed for any OS-level virtualization solution.
First is IPC virtualization, otherwise known as IPC namespace, contributed by OpenVZ’s Kirill Korotaev and Pavel Emelianov. IPC stands for inter-process communication. This is functionality that enables different processes to create shared memory segments, send messages to each other, and use semaphores. In a virtualized system, you don’t want a container (VE) to see IPC objects from another container.
Second is utsname() virtualization (otherwise known as UTS namespace), contributed by Serge Hallyn from IBM. utsname() returns basic information about the running kernel (the same as displayed by uname -a): the kernel version/release, host and domain names, and system architecture (for example, i686). Previously, the kernel had a single utsname structure, visible to all processes. Why do we need to virtualize it? At the very least, every virtualized system should have its own hostname. We might want to change other fields, too.
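From user space, this information is reachable through the POSIX uname() call. The sketch below, assuming a Linux or other POSIX system, formats the portable fields of struct utsname into one line, roughly what uname -a shows. The helper name format_utsname() is made up for this example, and the domain name field is left out because it is a GNU extension.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

/*
 * Write "sysname nodename release version machine" into buf.
 * Returns 0 on success, -1 if uname() fails or buf is too small.
 */
int format_utsname(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    struct utsname u;
    if (uname(&u) == -1)
        return -1;

    /* nodename is the hostname: the field each container wants for itself. */
    int n = snprintf(buf, len, "%s %s %s %s %s",
                     u.sysname, u.nodename, u.release, u.version, u.machine);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With UTS namespaces, the same call made inside two different containers can report two different hostnames, even though both run on one kernel.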
Third is preliminary work needed to introduce the PID namespaces feature, mostly contributed by Eric W. Biederman (with some bits from Oleg Nesterov and IBM's Sukadev Bhattiprolu and Cedric Le Goater). Every container (VE) should be able to use its own set of process IDs (PIDs), and should not see another container's PIDs. Eric's approach is to avoid using a numeric pid directly in the kernel and to use a pointer to struct pid instead, a structure that could hold both the PID and the VEID (i.e. container ID). The submitted patch set cleans up the places in the kernel that use a PID directly, switching them to struct pid.
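The idea can be illustrated with a small user-space sketch in C. This is not the kernel's actual struct pid: the type pid_sketch, its fields, and both helper functions are hypothetical and exist only to show why a bare number stops being enough once two containers can each have a process with PID 1.

```c
#include <stddef.h>

/* Simplified, hypothetical stand-in for the kernel's struct pid. */
struct pid_sketch {
    int nr;   /* numeric process ID */
    int veid; /* container (VE) ID; field name is an assumption */
};

/* Two references name the same process only if both PID and VEID match. */
int same_process(const struct pid_sketch *a, const struct pid_sketch *b)
{
    return a != NULL && b != NULL && a->nr == b->nr && a->veid == b->veid;
}

/*
 * Return the PID as seen from container `veid`, or 0 if the process
 * belongs to a different container and should stay invisible.
 */
int pid_nr_in(const struct pid_sketch *p, int veid)
{
    return (p != NULL && p->veid == veid) ? p->nr : 0;
}
```

For example, given a = {1, 101} and b = {1, 102}, comparing the raw numbers would wrongly treat them as one process, while same_process(&a, &b) returns 0 and pid_nr_in(&a, 102) hides a from container 102.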
I am really happy that this is community work and a community process (as I said before). We see different parties bringing in code and expertise, reviewing each other's code, making suggestions, exchanging ideas, and improving things, to everybody's benefit!
These are just the first steps. Much more is needed to have full OS-level virtualization in the mainstream Linux kernel. Don’t worry: we are already working on that. A few days ago Kirill sent another iteration (v5) of the beancounter patch set for further review and possible inclusion. Beancounters can be used to implement per-VE limits and guarantees for certain resources such as memory.