|Here is a rare interview with a legendary Alexey Kuznetsov, who happen to work for Parallels. Alan Cox once said we had thought for a long time that "Kuznetsov" is a collective name for a secret group of Russian programmers -- because no single man can write so much code at once.|
An interview is taken by lifehacker.ru and is part of "work places" series. I tried to do my best to translate it to English, but it's far from perfect. I guess this is still a very interesting reading.
Q: Who are you and what you do?
Since mid-90s I was one of Linux maintainers. Back then all the communication was done via conferences and Linux mailing lists. Pretty often I was aggressively arguing with someone there, don't remember for which reasons. Now it's fun to recall. Being a maintainer, I wasn't just making something on my own, but had to control others. Kicking out those who were making rubbish (from my point of view), and supporting those who were making something non-rubbish. All these conflicts, they were exhausting me. Since some moment I started noticing I am becoming "bronzed" [Alexey is referring to superiority complex -- Kir]. You said or did some crap, and then learn that this is becoming the right way now, since ANK said so.
I started to doubt, maybe I am just using my authority to maintain status quo. Every single morning started with a fight with myself, then with the world. In 2003 I got fed by it, so I went away from public, and later switched to a different field of knowledge. At that time I started my first project in Parallels. The task was to implement live migration of containers, and it was very complicated.
Now in Parallels we work on Parallels Cloud Storage project, developing cluster file systems for storing virtual machine images. The technology itself is a few years old already, we did a release recently, and are now working on improving it.
Q: How does your workplace look like?
My workplace is a bunch of computers. But I only work on a notebook, currently it's Lenovo T530. Other computers here are used for various purposes. This display, standing here, I never use it, nor this keyboard. Well, only if something breaks. Here we have different computers, including a Power Mac, an Intel and an AMD. I was using those in different years for different experiments. Once I needed to create a cluster of 3 machines right here at my workplace. One machine here is really old, and its sole purpose is to manage a power switch, so I can reboot all the others when working remotely from home. Here I have two Mac Minis and a Power Mac. They are always on, but I use them rarely, only when I need to see something in Parallels Desktop.
Q: What software do you use?
I don't use anything except for Google Chrome. Well, an editor and a compiler, if they qualify for software. I also store some personal data and notes in Evernote.
I only use a text console. For everything. In fact, on a newer notebooks, when the screen is getting better, the console mode is working worse and worse. So I am working in a graphical environment now, running a few full-screen terminals on top of it. It looks like a good old Unix console. So this is how I live, read email, work.
I do have a GMail account, I use it to read email from my phone. Sometimes it is needed. Or, when I see someone sent me a PDF, I have nothing else to do than to forward this email to where I can open that PDF. Same for PPT. But this is purely theoretical, in practice I never work with PPT.
I use Linux. Currently it is Fedora 13 -- not a newest one, to say at least. I am always using a version that was a base for a corresponding RHEL release. Every few years a new Red Hat [Enterprise Linux] is released, so I install a new system. When I do not change anything for a few years. Say, 5 years. I can't think of any new feature of an OS that would force me to update. I am just using the system as an editor, same as I have used it 20 years ago.
I have a phone, Motorola RAZR Maxx, running Android. I don't like iOS. You can't change anything in there. Not that I like customizations, I like a possibility to customize. I got a Motorola because I hate Samsung. This hatred is absolutely irrational. I had no happiness with any Samsung product, they either did't work for me or they break. I need a phone to make calls and check emails, that is all I need. Everything else is disabled -- to save the battery.
I am also reading news over RSS every day, like a morning paper. Now Feedly, before it was Google Reader, until they closed it. I have a list of bloggers I read, I won't mention their names. I am also reading Russian and foreign media. Lenta.ru, for example. There's that nice english-language
service, News 360. It fits for what I like and gives me the relevant news. I am not able to say if it works or not, but the fact is, what it shows to me is really interesting to me. It was showing a lot of sports news at first, but then they disappeared.
I don't use instant messengers like Skype or ICQ, it's just meaningless. If you need something, write an email. If you need it urgently, call. Email and phone covers everything.
Speaking of social networks, I do have a Facebook account with two friends -- my wife and my sister. I view this account only when they post a picture, I don't wander there for no reason.
Q: Is there a use for paper in your work?
It's a mess. I don't have a pen, so when I would need it I could not find it. If I am close to the notebook and I need to write something -- I write to a file. If I don't have a notebook around, I write to my phone. For these situations I recently started to use Google Keep, a service to store small notes. It is convenient so far. Otherwise I use Evernote. Well, I don't have a system for that. But I do have a database of everything on my notebook: perpetual emails, all the files and notes. All this stuff is indexed. Total size is about 10 gigabytes, since I don't have any graphics. Well, if we throw away all the junk from there, almost nothing will remain.
Q: Is there a dream configuration?
What I have here is more than enough for me. This last notebook is probably the best notebook I ever had.
I was getting used to it for a long time, swore a lot. I only use Thinkpads for a long time. They are similar from version to version, but each next one is getting bigger and heavier, physically. This is annoying. This model, they changed the keyboard. I had to get used to it, but now I realize this is the best keyboard I ever had. In general, I am pretty satisfied with ThinkPads. Well, if it would had a Retina screen and be just 1 kilogram less weight -- that would be ideal.
|It has been almost two years since we wrote about effective live migration with ploop write tracker. It's time to write some more about it, since we have managed to make ploop live migration yet more effective, by means of pretty simple optimizations. But let's not jump to resolution yet, it's a long and interesting story to tell.
As you know, live migration is not quite live, although it looks that way to a user. There is a short time period, usually a few seconds, during which the container being migrated is frozen. This time (shown if
Typical timings obtained via
(Software: vzctl 4.6.1, ploop-1.10) Iteration 1 2 3 4 5 6 AVG Suspend + Dump: 0.58 0.71 0.64 0.64 0.91 0.74 0.703 Pcopy after suspend: 1.06 0.71 0.50 0.68 1.07 1.29 0.885 Copy dump file: 0.64 0.67 0.44 0.51 0.43 0.50 0.532 Undump + Resume: 2.04 2.06 1.94 3.16 2.26 2.32 2.297 ------ ---- ---- ---- ---- ---- ----- Total suspended time: 4.33 4.16 3.54 5.01 4.68 4.85 4.428
Apparently, the first suspect to look at is that "undump + resume". Basically, it shows timing of vzctl restore command. Why it is so slow? Apparently, ploop mount takes some noticeable time. Let's dig deeper into that process.
First, implement timestamps in ploop messages, raise the log level and see what is going on here. Apparently, adding deltas is not instant, takes any time from 0.1 second to almost a second. After some more experiments and thinking it becomes obvious that since ploop kernel driver works with data in delta files directly, bypassing the page cache, it needs to force those files to be written to the disk, and this costly operation happens while container is frozen. Is it possible to do it earlier? Sure, we just need to force write the deltas we just copied before suspending a container. Easy, just call fsync(), or yet better fdatasync(), since we don't really care about metadata being written.
Unfortunately, there is no command line tool to do fsync or fdatasync, so we had to write one and call it from vzmigrate. Is it any better now? Yes indeed, delta adding times went down to from tenths to hundredths of a second.
Except for the top delta, of course, which we migrate using ploop copy. Surely, we can't fsync it before suspending container, because we keep copying it after. Oh wait... actually we can! By adding an fsync before CT suspend, we force the data be written on disk, so the second fsync (which happens after everything is copied) will take less time. This time is shown as "Pcopy after suspend".
The problem is that ploop copy consists of two sides -- the sending one and the receiving one -- which communicate over a pipe (with ssh as a transport). It's the sending side which runs the command to freeze the container, and it's the receiving side which should do fsync, so we need to pass some sort of "do the fsync" command. Yet better, do it without breaking the existing protocol, so nothing bad will happen in case there is an older version of ploop on the receiving side.
The "do the fsync" command ended up being a data block of 4 bytes, you can see the patch here. Older version will write these 4 bytes to disk, which is unnecessary but OK do to, and newer version will recognize it as a need to do fsync.
The other problem is that sending side should wait for fsync to finish, in order to proceed with CT suspend. Unfortunately, there is no way to solve this one with a one-way pipe, so the sending side just waits for a few seconds. Ugly as it is, this is the best way possible (let us know if you can think of something better).
To summarize, what we have added is a couple of fsyncs (it's actually fdatasync() since it is faster), and here are the results:
(Software: vzctl 4.7, ploop-1.11) Iteration 1 2 3 4 5 6 AVG Suspend + Dump: 0.60 0.60 0.57 0.74 0.59 0.80 0.650 Pcopy after suspend: 0.41 0.45 0.45 0.49 0.40 0.42 0.437 (-0.4) Copy dump file: 0.46 0.44 0.43 0.48 0.47 0.52 0.467 Undump + Resume: 1.86 1.75 1.67 1.91 1.87 1.84 1.817 (-0.5) ------ ---- ---- ---- ---- ---- ----- Total suspended time: 3.33 3.24 3.12 3.63 3.35 3.59 3.377 (-1.1)
As you see, both "pcopy after suspend" and "undump + resume" times decreased, shaving off about a second of time, which gives us about 25% improvement. Now, take into account that tests were done on an otherwise idle nodes with mostly idle containers, we suspect that the benefit will be more apparent with I/O load. Let checking if this statement is true will be your homework for today!
|I've used RHEL, CentOS and Fedora for many years... and as many of you already know... back in January, CentOS became a sponsored project of Red Hat. For the upcoming CentOS 7 release they are going beyond just the normal release that is an as-perfect-as-possible clone of RHEL. They have this concept of variants... where Special Interest Groups (SIGs) are formed around making special purpose builds of CentOS... spins or remixs if you will. I don't know a lot about it yet but I think I have the basic concept correct.|
Looking at the numbers on stats.openvz.org I see:
Top host distros
ALT Linux 10
Top 10 CT distros
Although reporting is optional, the popularity of CentOS as both an OpenVZ host and an OpenVZ container surely has to do with the fact that the two stable branches of the OpenVZ kernel are derived from RHEL kernels.
Wouldn't be nice if there were a CentOS variant that has the OpenVZ kernel and utils pre-installed? I think so.
While I have made CentOS remixes in the past just for my own personal use... I have not had any official engagement with the CentOS community. I was curious if there were some OpenVZ users out there who are already affiliated with the CentOS Project and who might want to get together in an effort to start a SIG and ultimately an OpenVZ CentOS 7 variant. Anyone? I guess if not, I could make a personal goal of building a CentOS and/or Scientific Linux 6-based remix that includes OpenVZ... as well as working on it after RHEL7 and clones are released... and after such time the OpenVZ Project has released a stable branch based on the RHEL7 kernel.
I will acknowledge up front that some of the top CentOS devs / contributors have historically been fairly nasty to OpenVZ users on the #centos IRC channel. They generally did not want to help someone using a CentOS system running under an OpenVZ kernel... but then again... their reputation is for being obnoxious to many groups of people. :) I don't think we should let that stop us.
Comments, feedback, questions?
|Seven problems with Linux containers is a detailed LWN.net description of the talk I gave at Southern California Linux Expo (aka SCALE) last week in Los Angeles. Couldn't have written it better myself!|
|In a light of CRIU 1.1 release that happened last week, we will be doing a hangout on air to tell more about CRIU past and the future, and will be answering your questions. Event page is here, it is going to happen as soon as this Friday, Feb 7th, at 6:00am PST / 9:00am EST / 15:00 CET / 18:00 MSK. Feel free to ask your questions now (go to event page and click on "Play").|
|OpenVZ precreated OS templates are tarballs of a pre-installed Linux distributions. While there are other ways to create a container, the easiest one is to take such a tarball and extract its contents. This is what takes 99.9% of |
To save some space and improve download speeds, those tarballs are compacted with good ol'
So, why don't we switch to xz which apparently looks way better? Well, there are other criteria to optimize for, except for file size and download speed. In fact, the main optimization target is container creation speed! I just ran a quick non-scientific test on my notebook in order to proof my words, measuring the time it takes to run
time tar xf tar.gz: ~7 seconds
time tar xf tar.xz: ~13 seconds
See, it takes twice the time if we switch to xz! Note that this ratio doesn't change much when I switched from fast SSD to (relatively slow) rotating hard disk drive:
time tar xf tar.gz: ~8 seconds
time tar xf tar.xz: ~16 seconds
Note, while I call it non-scientific, I still ran each test at least three times, with proper syncs, rms and cache drops in between.
Now, do we want to trade a double increase of container creation time for saving 80 MB of disk space. We sure don't!
|James Bottomley, CTO, Server Virtualization for Parallels gave a presentation entitled, "Containers and The Cloud: Do you need another virtual environment?" on Oct 23. The Linux Foundation recently posted it to YouTube.|
There is a lot of good information in the video even for us OpenVZ folks. Enjoy:
|vzctl 4.6 build hit download.openvz.org (and its many mirrors around the world) last week. Let's see what's in store, shall we?|
First and foremost, I/O limits, but the feature is already described in great details. What I want to add is the feature was sponsored by GleSYS Internet Services AB, and is one of the first results of OpenVZ partnership program in action. This program is a great opportunity for you to help keep the project up and running, and also experience our expert support service just in case you'd need it. Think of it as a two-way support option. Anyway, I digress. What was it about? Oh yes, vzctl.
Second, improvements to UBC setting in VSwap mode. Previously, if you set RAM and swap, all other UBC parameters not set explicitly were set to unlimited. Now they are just left unset (meaning that the default in-kernel setting is used, whatever it is). Plus, in addition to physpages and swappages, vzctl sets lockedpages and oomguarpages (to RAM), vmguarpages (to RAM+swap).
Plus, there is a new parameter vm_overcommit, and it works in the following way -- if set, it is used as a multiplier to ram+swap to set privvmpages. In layman terms, this is a ratio of real memory (ram+swap) to virtual memory (privvmpages). Again, physpages limits RAM, and physpages+swappages limits real memory used by a container. On the other side, privvmpages limits memory allocated by a container. While it depends on the application, generally not all allocated memory is used -- sometimes allocated memory is 5 or 10 times more than used memory. What vm_overcommit gives is a way to set this gap. For example, command
is telling OpenVZ to limit container $CTID with 2 GB of RAM, 4 GB of swap, and (2+4)*3, i.e. 18 GB of virtual memory. That means this container can allocate up to 18 GB of memory, but can't use more than 6 GB. So, vm_overcommit is just a way to set privvmpages implicitly, as a function of physpages and swappages. Oh, and if you are lost in all those *pages, we have extensive documentation at http://openvz.org/UBC.
A first version of vzoversell utility is added. This is a proposed vzmemcheck replacement for VSwap mode. Currently it just summarizes RAM and swap limits for all VSwap containers, and compares it to RAM and swap available on the host. Surely you can oversell RAM (as long as you have enough swap), but sum of all RAM+swap limits should not exceed RAM+swap on the node, and the main purpose of this utility is to check that constraint.
vztmpl-dl got a new --list-orphans option. It lists all local templates that are not available from the download server(s) (and therefore can't be updated by vztmpl-dl). Oh, by the way, since vzctl 4.5 you can use
vzubc got some love, too. It now skips unlimited UBCs by default, in order to improve the signal to noise ratio. If you want old behavior, (i.e. all UBCs), use -v flag.
Surely, there's a bunch of other fixes and improvements, please read the changelog if you want to know it all. One thing in particular worth mentioning, it's a hack to vzctl console. As you might know, in OpenVZ container's console is sort of eternal, meaning you can attach to it before a container is even started, and it keeps its state even if you detach from it. That creates a minor problem, though -- if someone run, say, vim in console, then detaches and reattaches, vim is not redrawing anything and the console shows nothing. To workaround, one needs to press Ctrl-L (it is also recognized by bash and other software). But it's a bit inconvenient to do that every time after reattach. Although, this is not required if terminal size has changed (i.e. you detach from console, change your xterm size, then run vzctl console again), because in this case vim is noting the change and redraws accordingly. So what vzctl now does after reattach is telling the underlying terminal its size twice -- first the wrong size (with incremented number of rows), then the right size (the one of the terminal vzctl is running in). This forces vim (or whatever is running on container console) to redraw.
Finally, new vzctl (as well as other utilities) are now in our Debian wheezy repo at http://download.openvz.org/debian,
Enjoy, and don't forget to report bugs to http://bugzilla.openvz.org/
Today we are releasing a somewhat small but very important OpenVZ feature: per-container disk I/O bandwidth and IOPS limiting.
OpenVZ have I/O priority feature for a while, which lets one set a per-container I/O priority -- a number from 0 to 7. This is working in a way that if two similar containers with similar I/O patterns, but different I/O priorities are run on the same system, a container with a prio of 0 (lowest) will have I/O speed of about 2-3 times less than that of a container with a prio of 7 (highest). This works for some scenarios, but not all.
So, I/O bandwidth limiting was introduced in Parallels Cloud Server, and as of today is available in OpenVZ as well. Using the feature is very easy: you set a limit for a container (in megabytes per second), and watch it obeying the limit. For example, here I try doing I/O without any limit set first:
root@host# vzctl enter 777 root@CT:/# cat /dev/urandom | pv -c - >/bigfile 88MB 0:00:10 [8.26MB/s] [ <=> ] ^C
Now let's set the I/O limit to 3 MB/s:
root@host# vzctl set 777 --iolimit 3M --save UB limits were set successfully Setting iolimit: 3145728 bytes/sec CT configuration saved to /etc/vz/conf/777.conf root@host# vzctl enter 777 root@CT:/# cat /dev/urandom | pv -c - >/bigfile3 39.1MB 0:00:10 [ 3MB/s] [ <=> ] ^C
If you run it yourself, you'll notice a spike of speed at the beginning, and then it goes down to the limit. This is so-called burstable limit working, it allows a container to over-use its limit (up to 3x) for a short time.
In the above example we tested writes. Reads work the same way, except when read data are in fact coming from the page cache (such as when you are reading the file which you just wrote). In this case, no actual I/O is performed, therefore no limiting.
Second feature is I/O operations per second, or just IOPS limit. For more info on what is IOPS, go read the linked Wikipedia article -- all I can say here is for traditional rotating disks the hardware capabilities are pretty limited (75 to 150 IOPS is a good guess, or 200 if you have high-end server class HDDs), while for SSDs this is much less of a problem. IOPS limit is set in the same way as iolimit (
Finally, to play with this stuff, you need:testing kernels.
|Oh, such a provocative subject! Not really. Many people do believe that OpenVZ is obsoleted, and when I ask why, three most popular answers are:|
1. OpenVZ kernel is old and obsoleted, because it is based on 2.6.32, while everyone in 2013 runs 3.x.
2. LXC is the future, OpenVZ is the past.
3. OpenVZ is no longer developed, it was even removed from Debian Wheezy.
Let me try to address all these misconceptions, one by one.
1. "OpenVZ kernel is old". Current OpenVZ kernels are based on kernels from Red Hat Enterprise Linux 6 (RHEL6 for short). This is the latest and greatest version of enterprise Linux distribution from Red Hat, a company who is always almost at the top of the list of top companies contributing to the Linux kernel development (see 1, 2, 3, 4 for a few random examples). While no kernel being ideal and bug free, RHEL6 one is a good real world approximation of these qualities.
What people in Red Hat do for their enterprise Linux is they take an upstream kernel and basically fork it, ironing out the bugs, cherry-picking security fixes, driver updates, and sometimes new features from upstream. They do so for about half a year or more before a release, so the released kernel is already "old and obsoleted", as it seems if one is looking at the kernel version number. Well, don't judge a book by its cover, don't judge a kernel by its number. Of course it's not old, neither obsoleted -- it's just more stable and secure. And then, after a release, it is very well maintained, with modern hardware support, regular releases, and prompt security fixes. This makes it a great base for OpenVZ kernel. In a sense, we are standing on the shoulders of a red hatted giant (and since this is open source, they are standing just a little bit on our shoulders, too).
RHEL7 is being worked on right now, and it will be based on some 3.x kernel (possibly 3.10). We will port OpenVZ kernel to RHEL7 once it will become available. In the meantime, RHEL6-based OpenVZ kernel is latest and greatest, and please don't be fooled by the fact that uname shows 2.6.32.
2. OpenVZ vs LXC. OpenVZ kernel was historically developed separately, i.e. aside from the upstream Linux kernel. This mistake was recognized in 2005, and since then we keep working on merging OpenVZ bits and pieces to the upstream kernel. It took way longer than expected, we are still in the middle of the process with some great stuff (like net namespace and CRIU, totally more than 2000 changesets) merged, while some other features are still in our TODO list. In the future (another eight years? who knows...) OpenVZ kernel functionality will probably be fully upstream, so it will just be a set of tools. We are happy to see that Parallels is not the only company interested in containers for Linux, so it might happen a bit earlier. For now, though, we still rely on our organic non-GMO home grown kernel (although it is already optional).
Now what is LXC? In fact, it is just another user-space tool (not unlike vzctl) that works on top of a recent upstream kernel (again, not unlike vzctl). As we work on merging our stuff upstream, LXC tools will start using new features and therefore benefit from this work. So far at least half of kernel functionality used by LXC was developed by our engineers, and while we don't work on LXC tools, it would not be an overestimation to say that Parallels is the biggest LXC contributor.
So, both OpenVZ and LXC are actively developed and have their future. We might even merge our tools at some point, the idea was briefly discussed during last containers mini-conf at Linux Plumbers. LXC is not a successor to OpenVZ, though, they are two different projects, although not entirely separate (since OpenVZ team contributes to the kernel a lot, and both tools use the same kernel functionality). OpenVZ is essentially LXC++, because it adds some more stuff that are not (yet) available in the upstream kernel (such as stronger isolation, better resource accounting, plus some auxiliary ones like ploop).
3. OpenVZ no longer developed, removed from Debian. Debian kernel team decided to drop OpenVZ (as well as few other) kernel flavors from Debian 7 a.k.a. Wheezy. This is completely understandable: kernel maintenance takes time and other resources, and they probably don't have enough. That doesn't mean though that OpenVZ is not developed. It's really strange to argue that, but please check our software updates page (or the announce@ mailing list archives). We made about 80 software releases this year so far. This accounts for 2 releases every week. Most of those are new kernels. So no, in no way it is abandoned.
As for Debian Wheezy, we are providing our repository with OpenVZ kernel and tools, as it was announced just yesterday.