Kir Kolyshkin (k001) wrote in openvz,

ploop and live migration: 2 years later

It has been almost two years since we wrote about effective live migration with the ploop write tracker. It's time to write some more about it, since we have managed to make ploop live migration even more effective by means of some pretty simple optimizations. But let's not jump to the conclusion yet; it's a long and interesting story to tell.

As you know, live migration is not quite live, although it looks that way to a user. There is a short period, usually a few seconds, during which the container being migrated is frozen. This time (shown if the -t or -v option to vzmigrate --live is used) is what needs to be optimized, making it as short as possible. To do that, one needs to dig into the details of what happens while a container is frozen.

Typical timings obtained via vzmigrate -t --live look like this. We ran a few iterations, migrating a container back and forth between two OpenVZ instances (running inside Parallels VMs on the same physical machine), hence one column per iteration. All times are in seconds.

(Software: vzctl 4.6.1, ploop-1.10)

              Iteration     1     2     3     4     5     6    AVG

        Suspend + Dump:   0.58  0.71  0.64  0.64  0.91  0.74  0.703
   Pcopy after suspend:   1.06  0.71  0.50  0.68  1.07  1.29  0.885
        Copy dump file:   0.64  0.67  0.44  0.51  0.43  0.50  0.532
       Undump + Resume:   2.04  2.06  1.94  3.16  2.26  2.32  2.297
                        ------  ----  ----  ----  ----  ----  -----
  Total suspended time:   4.33  4.16  3.54  5.01  4.68  4.85  4.428

The first suspect to look at is "undump + resume". Basically, it shows the timing of the vzctl restore command. Why is it so slow? Apparently, ploop mount takes a noticeable amount of time. Let's dig deeper into that process.

First, we implement timestamps in ploop messages, raise the log level, and see what is going on. It turns out that adding deltas is not instant: it takes anywhere from 0.1 second to almost a second. After some more experiments and thinking it becomes obvious that, since the ploop kernel driver works with data in delta files directly, bypassing the page cache, it needs to force those files to be written to disk, and this costly operation happens while the container is frozen. Is it possible to do it earlier? Sure, we just need to force-write the deltas we have just copied before suspending the container. Easy: just call fsync(), or better yet fdatasync(), since we don't really care about metadata being written.

Unfortunately, there is no command line tool to do fsync or fdatasync, so we had to write one and call it from vzmigrate. Is it any better now? Yes indeed, delta adding times went down from tenths to hundredths of a second.
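
To make the idea concrete, here is a minimal sketch of such a helper in Python. It is not the tool that ships with vzctl; the function names sync_file and main are made up for this illustration, and it assumes a Linux host, where syncing through a read-only descriptor is allowed.

```python
"""Flush the data of each named file to disk (illustrative sketch only)."""
import os
import sys

# os.fdatasync is not available on every platform; fall back to os.fsync there.
_sync = getattr(os, "fdatasync", os.fsync)


def sync_file(path):
    """Open path read-only and block until its data reaches stable storage."""
    fd = os.open(path, os.O_RDONLY)
    try:
        _sync(fd)
    finally:
        os.close(fd)


def main(paths):
    """Sync every path; return 0 on success, 1 if any file could not be synced."""
    status = 0
    for path in paths:
        try:
            sync_file(path)
        except OSError as err:
            print(f"{path}: {err.strerror}", file=sys.stderr)
            status = 1
    return status
```

To use it as a command, one would end the file with raise SystemExit(main(sys.argv[1:])) and pass the copied delta files as arguments before suspending the container.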

Except for the top delta, of course, which we migrate using ploop copy. Surely we can't fsync it before suspending the container, because we keep copying it afterwards. Oh wait... actually we can! By adding an fsync before CT suspend, we force the data copied so far to be written to disk, so the second fsync (which happens after everything is copied) has less left to flush and takes less time. This time is shown as "Pcopy after suspend".
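
The two-sync idea can be sketched as follows. This is an illustration only, not ploop code: copy_with_presync, write_all and timed_sync are hypothetical names, "bulk" stands for the data copied while the container is still running, and "tail" for the small remainder copied after suspend.

```python
import os
import time

# os.fdatasync is not available on every platform; fall back to os.fsync there.
_sync = getattr(os, "fdatasync", os.fsync)


def write_all(fd, data):
    """Write all of data to fd, retrying on short writes."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def timed_sync(fd):
    """Flush fd's data to disk and return how long the call blocked, in seconds."""
    start = time.monotonic()
    _sync(fd)
    return time.monotonic() - start


def copy_with_presync(path, bulk, tail):
    """Write bulk, sync, write tail, sync again; return both sync durations."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        write_all(fd, bulk)
        presync = timed_sync(fd)  # container still running: this cost is hidden
        write_all(fd, tail)
        final = timed_sync(fd)    # container frozen: only the tail is still dirty
        return presync, final
    finally:
        os.close(fd)
```

With a large bulk and a small tail, the second duration is the one the container owner actually feels, and it only has to cover the data written since the first sync.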

The problem is that ploop copy consists of two sides -- the sending one and the receiving one -- which communicate over a pipe (with ssh as a transport). It's the sending side which runs the command to freeze the container, and it's the receiving side which should do fsync, so we need to pass some sort of "do the fsync" command. Yet better, do it without breaking the existing protocol, so nothing bad will happen in case there is an older version of ploop on the receiving side.

The "do the fsync" command ended up being a data block of 4 bytes; you can see the patch here. An older version will write these 4 bytes to disk, which is unnecessary but OK to do, and a newer version will recognize the block as a request to do fsync.
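
A rough sketch of how a receiver might tell such a command apart from ordinary data is shown below. The framing is hypothetical and is not ploop's actual wire format: it assumes each block arrives as a (position, payload) pair and that real data blocks are never exactly 4 bytes long. The names SYNC_LEN, handle_block_new and handle_block_old are invented for this example.

```python
import os

SYNC_LEN = 4  # assumption: genuine data blocks are never this small

# os.fdatasync is not available on every platform; fall back to os.fsync there.
_sync = getattr(os, "fdatasync", os.fsync)


def handle_block_new(fd, pos, data):
    """Newer receiver: a 4-byte block is a sync request, anything else is data."""
    if len(data) == SYNC_LEN:
        _sync(fd)  # payload is ignored; flush what has been written so far
        return "synced"
    os.pwrite(fd, data, pos)
    return "written"


def handle_block_old(fd, pos, data):
    """Older receiver: unaware of sync requests, it writes every block it gets."""
    os.pwrite(fd, data, pos)
    return "written"
```

The point of the design is that both receivers accept the same stream: the older one does a small harmless write, the newer one does the sync, and the sender does not need to know which version it is talking to.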

The other problem is that the sending side has to wait for the fsync to finish before it can proceed with CT suspend. Unfortunately, there is no way to solve this one with a one-way pipe, so the sending side just waits for a few seconds. Ugly as it is, this is the best way possible (let us know if you can think of something better).

To summarize, what we have added is a couple of fsyncs (it's actually fdatasync() since it is faster), and here are the results:

(Software: vzctl 4.7, ploop-1.11)

              Iteration     1     2     3     4     5     6    AVG

        Suspend + Dump:   0.60  0.60  0.57  0.74  0.59  0.80  0.650
   Pcopy after suspend:   0.41  0.45  0.45  0.49  0.40  0.42  0.437 (-0.4)
        Copy dump file:   0.46  0.44  0.43  0.48  0.47  0.52  0.467
       Undump + Resume:   1.86  1.75  1.67  1.91  1.87  1.84  1.817 (-0.5)
                        ------  ----  ----  ----  ----  ----  -----
  Total suspended time:   3.33  3.24  3.12  3.63  3.35  3.59  3.377 (-1.1)

As you can see, both "pcopy after suspend" and "undump + resume" times decreased, shaving off about a second, which gives us an improvement of about 24%. Now, take into account that the tests were done on otherwise idle nodes with mostly idle containers; we suspect that the benefit will be more apparent under I/O load. Checking whether this statement is true will be your homework for today!
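
For the curious, the improvement can be recomputed from the "Total suspended time" rows of the two tables above. The snippet is just arithmetic on those published numbers; the variable names are ours.

```python
from statistics import mean

# "Total suspended time" per iteration, in seconds, copied from the two tables.
before = [4.33, 4.16, 3.54, 5.01, 4.68, 4.85]  # vzctl 4.6.1, ploop-1.10
after = [3.33, 3.24, 3.12, 3.63, 3.35, 3.59]   # vzctl 4.7, ploop-1.11

saved = mean(before) - mean(after)
improvement = saved / mean(before)

print(f"saved {saved:.2f} s, {improvement:.1%} less frozen time")
# prints: saved 1.05 s, 23.7% less frozen time
```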

Tags: live migration, openvz, optimization, ploop, vzctl
