Fedora 22 was released today. Congrats Fedora Project!
I updated the Fedora 22 OS Template I contributed so it was current with the release today... and for the fun of it I recorded a screencast showing how to make a Fedora 22 MATE Desktop GUI container... and how to connect to it via X2GO.
OpenVZ Project Leader Kir Kolyshkin gave a presentation on Saturday, April 25th, 2015 at LinuxFest Northwest entitled "OpenVZ, Virtuozzo, and Docker". I recorded it, but I think my SD card was having issues because there are a few bad spots in the recording... but it is totally watchable. Enjoy!
Looking forward to 2015, we have very exciting news to share on the future of OpenVZ. But first, let's take a quick look into OpenVZ history.
Linux containers are an ancient technology, going back to the last century. Indeed, it was 1999 when our engineers started adding bits and pieces of container technology to Linux kernel 2.2. Well, not exactly "containers", but rather "virtual environments" at that time -- as it often happens with new technologies, the terminology was different (the term "container" was coined by Sun only five years later, in 2004).
Anyway, in 2000 we ported our experimental code to kernel 2.4.0test1, and in January 2002 we released Virtuozzo 2.0. From there it went on and on, with more releases, newer kernels, an improved feature set (like the addition of live migration) and so on.
It was 2005 when we finally realized we had made the mistake of not employing the open source development model for the whole project from the very beginning. This is when OpenVZ was born as a separate entity, to complement the commercial Virtuozzo (which was later renamed Parallels Cloud Server, or PCS for short).
Now it's time to admit -- over the course of the years OpenVZ became just a little bit too separate, essentially becoming a fork (perhaps even a stepchild) of Parallels Cloud Server. While the kernel is the same between the two, the userspace tools (notably vzctl) differ. This results in slight incompatibilities between configuration files, command line options, etc. What's more, userspace development effort has to be duplicated.
Better late than never; we are going to fix it now! We are going to merge OpenVZ and Parallels Cloud Server into a single common open source code base. The obvious benefit for OpenVZ users is, of course, more features and better-tested code. There will be other much-anticipated changes, rolled out in a few stages.
As a first step, we will open the git repository of the RHEL7-based Virtuozzo kernel early next year (2015, that is). This has become possible as we changed the internal development process to be more git-friendly (before that we relied on lists of patches a la quilt, managed with a home-grown set of scripts). We have worked on this kernel for quite some time already, initially porting our patchset to kernel 3.6, then rebasing it to the RHEL7 beta, then to the final RHEL7. While it is still in development, we will publish it so anyone can follow the development process.
Our kernel development mailing list will also be made public. The big advantage for those who want to participate in the development process is that you'll see our proposed changes discussed on this mailing list before the maintainer adds them to the repository, rather than months later when the code is published, and we'll consider any patch sent to the mailing list. This should allow the community to become full participants in development rather than mere bystanders, as they were previously.
Bug tracking systems have also diverged over time. Internally, we use JIRA (this is where all those PCLIN-xxxx and PSBM-xxxx codes come from), while OpenVZ relies on Bugzilla. For the new unified product, we are going to open up JIRA, which we find to be more usable than Bugzilla. Similar to what Red Hat and other major Linux vendors do, we will limit access to security-sensitive issues in order not to compromise our user base.
Last but not least, the name. We had a lot of discussions about naming, had a few good candidates, and finally unanimously agreed on this one:
Virtuozzo Core
Please stay tuned for more news (including a more formal press release from Parallels). Feel free to ask any questions, as we don't even have a FAQ yet.
This is a topic I always wanted to write about but was afraid my explanation would end up very cumbersome. This is no longer the case, as we now have a picture that is worth a thousand words!
The picture describes how we develop kernel releases. It's a bit more complicated than a linear version 1 -> version 2 -> version 3 progression. The reason is that we are balancing adding new features, fixing bugs, and rebasing to newer kernels, while trying to maintain stability for our users. This is our convoluted way of achieving all of this:
As you can see, we create a new branch when rebasing to a newer upstream (i.e. RHEL6) kernel, as regressions are quite common during a rebase. At the same time, we keep maintaining the older branch, to which we add stability and security fixes. Sometimes we create a new branch to add some bold feature that takes a longer time to stabilize. Stability patches are then forward-ported to the new branch, which eventually either becomes stable or is obsoleted by yet another new branch.
Of course there is a lot of work behind these curtains, including rigorous internal testing of new releases. In addition to that, we usually provide those kernels to our users (in the rhel6-testing repo) so they can test new stuff before it hits production servers, and we can fix more bugs earlier (more on that here). If you are not taking part in this testing, well, it's never too late to start!
For many years, when people asked us how they could help the project, our typical answer was something like "just use it, file bugs, spread the word". Some people were asking specifically about how to donate money, and we had to say "currently we don't have a way to accept donations", so it was basically "no, we don't need your money". It was not a good answer. Even if our big sponsor is generously helping us with everything we need, that doesn't mean your money would be useless.
Today we have opened a PayPal account to accept donations. Here:
How are we going to spend your money? In general:
Hardware for development and testing
Travel budget for conferences, events etc.
Accolades for active contributors
In particular, right now we need:
About $200 to cover shipping expenses for donated hardware
About $300 for network equipment
About $2000 for hard drives
A Donations page has been created on our wiki to track donations and spending. We hope that it will see some good progress in the coming days, with a little help from you!
NOTE that if you feel like spending $500 or more, there is yet another way to spend it -- a support contract from Parallels, with access to their excellent support team and 10 free support tickets.
It's been two months since we released vzctl 4.7 and ploop 1.11 with
some vital improvements
to live migration that made it about 25% faster. Believe it or not,
this is another "make it faster" post, making it, well, faster.
That's right, we have more surprises in store for you! Read on and be delighted.
Asynchronous ploop send
This is something that should have been implemented at the very beginning,
but the initial implementation looked like this (in C):
/* _XXX_ We should use AIO. ploopcopy cannot use cached reads and
* has to use O_DIRECT, which introduces large read latencies.
* AIO is necessary to transfer with maximal speed.
*/
Why are there no cached reads (and, in general, why does the ploop library use O_DIRECT, while the kernel also does direct I/O, bypassing the page cache)?
Note that ploop images are files used as block devices containing a filesystem. Access to that ploop block device and the filesystem on it goes through the page cache as usual, so if we did the same for the ploop image itself, it would result in double caching and wasted RAM. Therefore, we do direct I/O on the lower level (when working with ploop images), and let the usual Linux cache work on the upper level (when the container accesses files inside the ploop, which is the most common operation).
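To illustrate the constraint, here is a minimal hedged sketch (not the actual ploop library code) of reading one 1 MB block from an image file with O_DIRECT; note that direct I/O requires aligned buffers, and every read goes straight to the disk, with no readahead and no caching:

/* Sketch only: read one 1 MB block from an image file using O_DIRECT.
 * Direct I/O bypasses the page cache, so the buffer must be suitably
 * aligned and each read hits the disk directly.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE (1024 * 1024)

int main(int argc, char **argv)
{
	void *buf;
	ssize_t n;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "Usage: %s <image>\n", argv[0]);
		return 1;
	}

	/* O_DIRECT requires alignment, typically to the logical block size */
	if (posix_memalign(&buf, 4096, BLOCK_SIZE))
		return 1;

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	n = read(fd, buf, BLOCK_SIZE);	/* goes straight to disk, no cache */
	printf("read %zd bytes\n", n);

	close(fd);
	free(buf);
	return 0;
}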
So, the ploop image itself is not (usually) cached in memory, other tricks like read-ahead are also not used by the kernel, and reading each block from the image takes time, as it is read directly from disk. In our test lab (OpenVZ running inside KVM with a networked disk) it takes about 10 milliseconds to read each 1 MB block. Then, sending each block to the destination system over the network also takes time -- about 15 milliseconds in our setup. So it takes 25 milliseconds to read and send a block of data, if we do it one after another. Oh wait, this is exactly how we do it!
Solution -- let's do reading and sending at the same time! In other words,
while sending the first block of data we can already read the second one,
and so on, bringing the time required to transfer one block down to
MAX(Tread, Tsend) instead of Tread + Tsend.
The implementation uses POSIX threads. The main ploop send process spawns a separate sending thread and allocates two buffers -- one for reading and one for sending -- which then change places. This works surprisingly well for something that uses threads, maybe because it is as simple as it could ever be (one thread, one mutex, one condition variable). If you happen to know something about pthreads programming, please review the appropriate commit 55e26e9606.
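For the curious, here is a heavily simplified sketch of the double-buffering idea, assuming plain read()/write() in place of the real O_DIRECT image reads and migration connection writes. This is an illustration, not the actual ploop_send code:

/* Sketch: read the next block while the previous one is being sent,
 * using two buffers, one extra thread, one mutex and one condition
 * variable. Not the real ploop code.
 */
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCKSIZE (1024 * 1024)

static char buf_a[BLOCKSIZE], buf_b[BLOCKSIZE];
static char *send_buf = buf_b;	/* buffer currently owned by the sender */
static ssize_t send_len;	/* size of the block handed over (0 = none) */
static int done;
static int out_fd;		/* where the data goes (e.g. a pipe to ssh) */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static void *sender(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!done || send_len) {
		char *p;
		ssize_t len;

		if (!send_len) {
			pthread_cond_wait(&cond, &lock);
			continue;
		}
		p = send_buf;
		len = send_len;
		pthread_mutex_unlock(&lock);
		write(out_fd, p, len);		/* overlaps with the next read */
		pthread_mutex_lock(&lock);
		send_len = 0;
		pthread_cond_signal(&cond);	/* the buffer is free again */
	}
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Copy data from in (the image) to out (the connection), asynchronously */
int async_copy(int in, int out)
{
	char *read_buf = buf_a;
	pthread_t tid;
	ssize_t n;

	out_fd = out;
	pthread_create(&tid, NULL, sender, NULL);

	while ((n = read(in, read_buf, BLOCKSIZE)) > 0) {
		char *tmp;

		pthread_mutex_lock(&lock);
		while (send_len)		/* wait for the previous send */
			pthread_cond_wait(&cond, &lock);
		tmp = send_buf;			/* the two buffers change place */
		send_buf = read_buf;
		read_buf = tmp;
		send_len = n;
		pthread_cond_signal(&cond);
		pthread_mutex_unlock(&lock);
	}

	pthread_mutex_lock(&lock);
	done = 1;
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	pthread_join(tid, NULL);
	return n < 0 ? -1 : 0;
}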
The new async send is used by default; you just need a newer ploop (1.12, not yet released). Clone and compile ploop from git if you are curious.
As for how much time we save with this, it happens to be 15 milliseconds instead of 25 per block of data, or 40% faster! Now, the real savings depend on the number of blocks that need to be migrated after the container is frozen; it can be 5 or 100, so the overall savings range from 0.05 to 1 second.
ploop copy with feedback
The previous post on live migration improvements
described an optimization of doing fdatasync() on the receiving
ploop side before suspending the container on the sending side. It also noted
that the implementation was sub-par:
The other problem is that the sending side should wait for the fsync to finish in order to proceed with CT suspend. Unfortunately, there is no way to solve this with a one-way pipe, so the sending side just waits for a few seconds. Ugly as it is, this is the best way possible (let us know if you can think of something better).
So, the problem is trivial -- there's a need for a bi-directional channel between the ploop copy sender and receiver, so the receiver can tell the sender "All right, I have synced the freaking delta". In addition, we want to do it in a simple way, not much more complicated than what we have now.
After some playing around with different approaches, it seemed that OpenSSH port forwarding, combined with tools like netcat, socat, or the bash /dev/tcp feature, could do the trick of establishing a two-way pipe between the two ploop copy sides.
netcat (or nc) comes in various varieties, which might or might not be installed and/or mutually compatible. socat is a bit better, but one problem with it is that it ignores its child's exit code, so there is no (simple) way to detect an error. A problem with the /dev/tcp feature is that it is bash-specific, and bash itself might not be universally available (Debian and Ubuntu users are well aware of that fact).
So, all this was replaced with a tiny home-grown vznnc ("nano-netcat") tool. It is indeed nano, as there are only 200 lines of code in it. It can either listen on or connect to a specified port on localhost (note that ssh does the real networking for us), and run a program with its stdin/stdout (or a specified file descriptor) already connected to that TCP socket. Again, similar to netcat, but small, to the point, and hopefully bug-free.
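Conceptually, the tool does something like the following sketch -- connect to (or listen on) a TCP port on localhost, attach the socket to the child's stdin/stdout, and exec the requested program. This illustrates the idea only; it is not the actual vznnc source:

/* Sketch of a "nano-netcat": connect to (or listen on) a localhost TCP
 * port, then run a program with its stdin/stdout attached to the socket.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct sockaddr_in addr;
	int listen_mode, sock, fd;

	if (argc < 4) {
		fprintf(stderr, "Usage: %s -l|-c <port> <prog> [args...]\n",
			argv[0]);
		return 1;
	}
	listen_mode = !strcmp(argv[1], "-l");

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(atoi(argv[2]));
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); /* ssh does the real networking */

	sock = socket(AF_INET, SOCK_STREAM, 0);
	if (sock < 0) {
		perror("socket");
		return 1;
	}

	if (listen_mode) {
		if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
		    listen(sock, 1) < 0) {
			perror("bind/listen");
			return 1;
		}
		fd = accept(sock, NULL, NULL);	/* wait for the other side */
	} else {
		if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
			perror("connect");
			return 1;
		}
		fd = sock;
	}
	if (fd < 0) {
		perror("accept");
		return 1;
	}

	/* Attach the socket to the child's stdin and stdout, then exec it */
	dup2(fd, STDIN_FILENO);
	dup2(fd, STDOUT_FILENO);
	execvp(argv[3], &argv[3]);
	perror("execvp");
	return 1;
}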
Finally, with this in place, we can make the sending side of ploop copy wait for feedback from the receiving side, so it can suspend the container as soon as the remote side has finished syncing. This makes the whole migration a bit faster (by eliminating the previously hard-coded 5-second wait-for-sync delay), but it also helps to slash the frozen time, since we suspend the container as soon as we should, so it isn't given extra time to write more data that we would then need to copy while it's frozen.
It's hard to measure the practical impact of this feature, but in our tests it saves about 3.5 seconds of total migration time, and from zero to a few seconds of frozen time, depending on the container's disk I/O activity.
For the feature to work, you need the latest versions of vzctl (4.8) and ploop (1.12) on both nodes (if one node doesn't have the newer tools, vzmigrate falls back to the old, non-feedback way of copying). Note that these new versions are not yet released at the time of writing, but you can get them from git and compile them yourself.
Using ssh connection multiplexing
It takes about 0.15 seconds to establish an ssh connection in our test lab. Your mileage may vary, but it can't be zero. Well, unless the OpenSSH feature of reusing an existing ssh connection is put to work! It's called connection multiplexing, and when it is used, once a connection is established, subsequent ones take practically no time. As vzmigrate is a shell script and runs a number of commands on the remote (destination) side, using it might save us some time.
Unfortunately, it's a relatively new OpenSSH feature, and it is implemented in a rather ugly way in the version available in CentOS 6 -- you need to keep one open ssh session running. This was fixed in a later version by adding a special daemon-like mode, enabled with the ControlPersist option, but alas, not in CentOS 6. Therefore, we have to maintain a special "master" ssh connection for the duration of vzmigrate. For implementation details,
see commits
06212ea3d
and
00b9ce043.
This is still experimental, so you need to pass the --ssh-mux flag to vzmigrate to use it. You won't believe it (so go test it yourself!), but this alone slashes container frozen time by about 25% (which is great, as we fight for every microsecond here), and improves the total time taken by vzmigrate by up to 1 second (which is probably not that important, but still nice).
What's next?
The current setup used for development and testing is two OpenVZ instances running in KVM guests on a ThinkCentre box. While it's convenient and oh so very space-saving, it is probably not the best approximation of real-world OpenVZ usage. So, we need some better hardware:
If you are able to spend $500 to $2000 on ebay, please
let us know by email to donate at openvz dot org so we can arrange it.
Now, quite a few people have offered hosted servers with a similar configuration. While we are very thankful for all such offers, this time we are looking for physical, not hosted, hardware.
Update: Thanks to FastVPS, we got the Supermicro 4-node server. If you want to donate, see the wiki: Donations.
It has been almost two years since we wrote about effective live migration with the ploop write tracker. It's time to write some more about it, since we have managed to make ploop live migration even more effective by means of some pretty simple optimizations. But let's not jump to the conclusion yet; it's a long and interesting story to tell.
As you know, live migration is not quite live, although it looks that way to a user. There is a short time period, usually a few seconds, during which the container being migrated is frozen. This time (shown if the -t or -v option to vzmigrate --live is used) is what needs to be optimized, making it as short as possible. In order to do that, one needs to dig into the details of what happens while a container is frozen.
Typical timings obtained via vzmigrate -t --live look like this. We ran a few iterations migrating a container back and forth between two OpenVZ instances (running inside Parallels VMs on the same physical machine), so there are a few columns on the right side.
Apparently, the first suspect to look at is that "undump + resume". Basically, it shows the timing of the vzctl restore command. Why is it so slow? It turns out that ploop mount takes a noticeable amount of time. Let's dig deeper into that process.
First, we implemented timestamps in ploop messages, raised the log level and looked at what was going on. Apparently, adding deltas is not instant; it takes anywhere from 0.1 second to almost a full second. After some more experiments and thinking, it became obvious that since the ploop kernel driver works with data in delta files directly, bypassing the page cache, it needs to force those files to be written to disk, and this costly operation happens while the container is frozen. Is it possible to do it earlier? Sure, we just need to force-write the deltas we have just copied before suspending the container. Easy: just call fsync(), or better yet fdatasync(), since we don't really care about metadata being written.
Unfortunately, there is no command line tool to do fsync or fdatasync, so we had to write one and call it from vzmigrate. Is it any better now? Yes indeed: delta adding times went down from tenths to hundredths of a second.
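For reference, such a tool only needs a handful of lines. Here is a hedged sketch of what a minimal "fdatasync these files" helper could look like -- an illustration, not the exact tool shipped with vzctl/ploop:

/* Sketch: fdatasync every file given on the command line. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int i, ret = 0;

	for (i = 1; i < argc; i++) {
		int fd = open(argv[i], O_RDONLY);

		if (fd < 0 || fdatasync(fd) < 0) {
			perror(argv[i]);
			ret = 1;
		}
		if (fd >= 0)
			close(fd);
	}
	return ret;
}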
Except for the top delta, of course, which we migrate using ploop copy. Surely we can't fsync it before suspending the container, because we keep copying it afterwards. Oh wait... actually we can! By adding an fsync before CT suspend, we force the data copied so far to be written to disk, so the second fsync (which happens after everything is copied) will take less time. This time is shown as "Pcopy after suspend".
The problem is that ploop copy consists of two sides -- the sending one and the receiving one -- which communicate over a pipe (with ssh as a transport). It's the sending side that runs the command to freeze the container, and it's the receiving side that should do the fsync, so we need to pass some sort of "do the fsync" command. Better yet, do it without breaking the existing protocol, so nothing bad happens if there is an older version of ploop on the receiving side.
The "do the fsync" command ended up being a data block of 4 bytes, you can see the patch here. Older version will write these 4 bytes to disk, which is unnecessary but OK do to, and newer version will recognize it as a need to do fsync.
The other problem is that the sending side should wait for the fsync to finish in order to proceed with CT suspend. Unfortunately, there is no way to solve this with a one-way pipe, so the sending side just waits for a few seconds. Ugly as it is, this is the best way possible (let us know if you can think of something better).
To summarize, what we have added is a couple of fsyncs (it's actually fdatasync() since it is faster), and here are the results:
As you can see, both "Pcopy after suspend" and "undump + resume" times decreased, shaving off about a second, which gives us about a 25% improvement. Now, take into account that the tests were done on otherwise idle nodes with mostly idle containers; we suspect the benefit will be even more apparent under I/O load. Checking whether this statement is true will be your homework for today!
I've used RHEL, CentOS and Fedora for many years... and as many of you already know... back in January, CentOS became a sponsored project of Red Hat. For the upcoming CentOS 7 release they are going beyond just the normal release that is an as-perfect-as-possible clone of RHEL. They have this concept of variants... where Special Interest Groups (SIGs) are formed around making special-purpose builds of CentOS... spins or remixes, if you will. I don't know a lot about it yet but I think I have the basic concept correct.
Although reporting is optional, the popularity of CentOS as both an OpenVZ host and an OpenVZ container surely has to do with the fact that the two stable branches of the OpenVZ kernel are derived from RHEL kernels.
Wouldn't it be nice if there were a CentOS variant that has the OpenVZ kernel and utils pre-installed? I think so.
While I have made CentOS remixes in the past just for my own personal use... I have not had any official engagement with the CentOS community. I was curious if there are some OpenVZ users out there who are already affiliated with the CentOS Project and who might want to get together in an effort to start a SIG and ultimately an OpenVZ CentOS 7 variant. Anyone? If not, I guess I could make it a personal goal to build a CentOS and/or Scientific Linux 6-based remix that includes OpenVZ... as well as to keep working on it after RHEL7 and its clones are released... and after the OpenVZ Project has released a stable branch based on the RHEL7 kernel.
I will acknowledge up front that some of the top CentOS devs / contributors have historically been fairly nasty to OpenVZ users on the #centos IRC channel. They generally did not want to help someone using a CentOS system running under an OpenVZ kernel... but then again... their reputation is for being obnoxious to many groups of people. :) I don't think we should let that stop us.
In light of the CRIU 1.1 release that happened last week, we will be doing a Hangout on Air to talk about CRIU's past and future, and to answer your questions. The event page is here; it is going to happen as soon as this Friday, Feb 7th, at 6:00am PST / 9:00am EST / 15:00 CET / 18:00 MSK. Feel free to ask your questions now (go to the event page and click on "Play").
I tried it and was able to migrate a CentOS 7 container... but the Fedora 22 one seems to be stuck in the "started" phase. It creates a /vz/private/{ctid} dir on the destination host (with the same…
The fall semester is just around the corner... so it is impossible for me to break away for a trip to Seattle. I hope one or more of you guys can blog so I can attend vicariously.