As you know, live migration is not quite live, although it looks that way to the user. There is a short period, usually a few seconds, during which the container being migrated is frozen. This time (reported when the -t or -v option to vzmigrate --live is used) is what needs to be optimized, making it as short as possible. To do that, one needs to dig into the details of what happens while a container is frozen.
Typical timings obtained via vzmigrate -t --live look like this. We ran a few iterations, migrating a container back and forth between two OpenVZ instances (running inside Parallels VMs on the same physical machine), so there are six columns of results plus the average in the rightmost column. All times are in seconds.
(Software: vzctl 4.6.1, ploop-1.10)
Iteration               1     2     3     4     5     6    AVG
Suspend + Dump:        0.58  0.71  0.64  0.64  0.91  0.74  0.703
Pcopy after suspend:   1.06  0.71  0.50  0.68  1.07  1.29  0.885
Copy dump file:        0.64  0.67  0.44  0.51  0.43  0.50  0.532
Undump + Resume:       2.04  2.06  1.94  3.16  2.26  2.32  2.297
                       ----  ----  ----  ----  ----  ----  -----
Total suspended time:  4.33  4.16  3.54  5.01  4.68  4.85  4.428
Apparently, the first suspect to look at is that "Undump + Resume" line; it is essentially the timing of the vzctl restore command. Why is it so slow? It turns out that ploop mount takes a noticeable amount of time. Let's dig deeper into that process.
First, implement timestamps in ploop messages, raise the log level, and see what is going on. It turns out that adding deltas is not instant: it takes anywhere from 0.1 seconds to almost a second. After some more experiments and thinking, the reason becomes obvious: since the ploop kernel driver works with the data in delta files directly, bypassing the page cache, it needs to force those files to be written to disk, and this costly operation happens while the container is frozen. Is it possible to do it earlier? Sure, we just need to force-write the deltas we have copied before suspending the container. Easy: just call fsync(), or better yet fdatasync(), since we don't really care about metadata being written.
Unfortunately, there is no standard command-line tool to do fsync or fdatasync, so we had to write one and call it from vzmigrate. Is it any better now? Yes indeed: delta-adding times went down from tenths to hundredths of a second.
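For illustration, a minimal version of such a tool fits in a dozen lines of C: open each file given on the command line and fdatasync() it. This is just a sketch of the idea; the actual tool that went into vzctl may differ:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

/* Force the data of each file given as an argument to disk. */
int main(int argc, char **argv)
{
	int i, ret = 0;

	for (i = 1; i < argc; i++) {
		int fd = open(argv[i], O_RDONLY);

		if (fd < 0 || fdatasync(fd) < 0) {
			perror(argv[i]);
			ret = 1;
		}
		if (fd >= 0)
			close(fd);
	}
	return ret;
}

vzmigrate can then run this on every delta right after copying it, while the container is still running.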
Except for the top delta, of course, which we migrate using ploop copy. Surely we can't fsync it before suspending the container, because we keep copying it afterwards. Oh wait... actually we can! By adding an fsync before the CT suspend, we force the data to be written to disk, so the second fsync (which happens after everything is copied) has less left to write and takes less time (the overall ordering is sketched in code a few paragraphs below). This is the time shown as "Pcopy after suspend".
The problem is that ploop copy consists of two sides -- the sending one and the receiving one -- which communicate over a pipe (with ssh as a transport). It is the sending side that runs the command to freeze the container, and it is the receiving side that should do the fsync, so we need to pass some sort of "do the fsync" command through the pipe. Better yet, do it without breaking the existing protocol, so nothing bad happens if there is an older version of ploop on the receiving side.
The "do the fsync" command ended up being a data block of 4 bytes, you can see the patch here. Older version will write these 4 bytes to disk, which is unnecessary but OK do to, and newer version will recognize it as a need to do fsync.
The other problem is that the sending side has to wait for the fsync to finish before proceeding with the CT suspend. Unfortunately, there is no way to solve this with a one-way pipe, so the sending side just waits for a few seconds. Ugly as it is, this seems to be the best option available (let us know if you can think of something better).
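Putting the pieces together, the sending side conceptually does the following. This is a simplified sketch with made-up function names and an arbitrary wait time; the real logic lives in ploop copy and vzmigrate:

#include <unistd.h>

/* Illustrative stubs, not the real ploop copy API. */
static void copy_blocks(int only_dirty) { (void)only_dirty; /* send blocks over the pipe */ }
static void send_fsync_request(void) { /* send the 4-byte marker */ }
static void suspend_container(void) { /* freeze the CT on the source */ }

int main(void)
{
	copy_blocks(0);		/* bulk copy while the CT keeps running */
	send_fsync_request();	/* flush what has been copied so far */
	sleep(5);		/* one-way pipe: no ack, so just give the
				   receiver time to finish its fsync */
	suspend_container();
	copy_blocks(1);		/* copy the blocks dirtied meanwhile */
	send_fsync_request();	/* second fsync: cheap, little data left */
	return 0;
}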
To summarize, all we have added is a couple of fsyncs (actually fdatasync(), since it is faster), and here are the results:
(Software: vzctl 4.7, ploop-1.11)
Iteration               1     2     3     4     5     6    AVG
Suspend + Dump:        0.60  0.60  0.57  0.74  0.59  0.80  0.650
Pcopy after suspend:   0.41  0.45  0.45  0.49  0.40  0.42  0.437  (-0.4)
Copy dump file:        0.46  0.44  0.43  0.48  0.47  0.52  0.467
Undump + Resume:       1.86  1.75  1.67  1.91  1.87  1.84  1.817  (-0.5)
                       ----  ----  ----  ----  ----  ----  -----
Total suspended time:  3.33  3.24  3.12  3.63  3.35  3.59  3.377  (-1.1)
As you can see, both "Pcopy after suspend" and "Undump + Resume" times decreased, shaving off about a second in total, which gives us roughly a 25% improvement. Now, taking into account that these tests were done on otherwise idle nodes with mostly idle containers, we suspect the benefit will be even more pronounced under I/O load. Checking whether that is true will be your homework for today!

