WL#3569: Online backup: Backup synchronization algorithm

Affects: Server-6.0 — Status: Complete

Description
High Level Architecture
Low Level Design

DESCRIPTION
The main backup algorithm.

DELIVERABLES
- Documentation of backup algorithm (in this WL)
- Implementation of synchronization phase of backup

NOTES
- Sinisa writes 2006-11-01:
  "lock for all kind of updates" ... But this lock will be fully
   implemented by new transactional lock, as per Monty's specs. This lock
   will lock transactional engines in one way and non-transactional in
   another way. I think we should re-examine WL 3569 in the light of this
   new development."
  We need to check with Monty what kind of locks he is working on...
  Anyone has any reference to a WL?  Please add here if you have...

LIMITATIONS
- Commit blocker is not implemented in this WL.  See WL#3956.

The main task of the algorithm is to properly initialize and synchronize
creation of local backup images by several storage engines. 

Engines can use different methods for creating their backup images. These
methods can be divided into two broad classes:

"At end", or Point Behind methods:

Engine scans its data while continuing normal operation. Changes to the data are
recorded in a log which is added to the backup image. The validity point of the
image can be created only after the data scan is complete.

"At begin", or Point Forward methods:

Engine can create a snapshot of its data at any given moment and continue to
work normally. The snapshot will be sent to backup kernel after its creation.
The validity point of the backup image is the moment at which snapshot was created.


The backup algorithm treats both methods in a uniform way. From it's point of
view a process of creating backup image by a storage engine is divided into
three phases:

 1. Initial phase in which engine makes preparations before it can create a 
    validity point. For "at begin" engines this phase will be empty as these 
    engines are ready for creating a validity point right away. For "at end" 
    engines this is the data scan phase.

 2. Waiting phase. In this phase engine is ready for creating a validity point 
    but it should continue its normal operation since it must wait for other 
    engines to be ready. It is possible that engine (e.g. "at end" type) sends
    additional log data during that phase.

 3. Final phase -- after all engines have created their validity points. For "at 
    end" engines this phase will be empty since they have already sent all their 
    backup image data (they may still need to send small amount of "closing 
    data"). For "at begin" engines this is the phase during which all 
    the backup data will be sent.


  +- start backup     +- ready for creating validity point
  |                   |
  |                   |                  +- validity point    +- end of backup
  |                   |                  |                    |
  |-------------------|------------------|--------------------|---> time
    initial phase        waiting phase        final phase



In any of the above phases an engine can send data to the backup kernel. All
these data will be collected and form the backup image for that engine. As seen
already, some engines ("at end") will send majority of the data in the initial
phase while other ("ar begin") in the final phase. It might be also necessary to
send data in the waiting phase to keep the backup image in sync with the ongoing
data changes. Note: in this model it is possible to handle even more general
backup schemas in which data is sent in all three phases.

As far as the backup kernel is concerned, it doesn't matter what kind of data is
sent in each phase of a backup process. These can be logical records, physical
data blocks, log entries or any other kind of data used by a particular backup
method. What is crucial is that the backup kernel knows when the initial phase
ends and engine is ready for creating a validity point. This information is used
to synchronize several backup images. 

Backup image synchronization
============================

The synchronization happens after all engines are finished with the initial
phase of their backup process and are ready for creating validity points for
their local backup images. 

Consistency between all the local backup images is achieved by having the backup
kernel to decide when the validity points should be created. Backup kernel picks
the same moment for all participating engines and the locking mechanism
described below ensures that all backup images are valid for exactly the same
point in time.

For this to work, engines must be able to create validy point of their backup
images upon a request from the backup kernel. It is important that:

a) The validity point is created instantly, without any waiting.
b) The validity point can be created at any time choosen by the kerenel.

Requirement a) is important because other engines will be blocked during
creation of the validity point and the blocking period should be as short as
possible. Reqirement b) is here for the same reason -- backup kernel (and other
engines) can not wait until a suitable moment arrives.

Guilhem gave example showing why b) is an issue here. In MyISAM engine (and
possibly others) it can be important that at the time the validity point is
established index files are in sync with data files. Thus there should be no
ongoing table updates when one wants to create a validity point. For that reason
the engine must do extra preparations (waiting for all ongoing updates to finish
and blocking any new updates) before it can satisfy requirement b), that is, be
ready for creating a validity point at any given time.

Taking into account all these issues, the synchronization protocol which will
ensure consistency between different backup images works as follows.

  1. Backup kernels ask all engines to prepare for synchronization. The engines
     should not block during that stage but they should do any preparations they
     need in order to be ready for instant creation of a validity point in 
     step 2.

  2. When all engines signalled that they are ready, backup kernel asks them to 
     create their validity points. Each engine creates it and freezes its 
     current (backup) state so that the validity point remains valid 
     until the end of synchronization process. (In most cases freezing would 
     mean just blocking any commits).

  3. When validity points for all images are created, the kernel unfreezes each 
     engine. At that moment the synchronization process is finished.

The reason for freezing engines in step 2 is illustrated below:

               Backup for           Backup for
                engine 1.            engine 2.
                   |                    |
                   |                    |
request to create  |                    |
validity point  -->+ t1                 |
                   :                    |
request to create  :                    |
validity point  ----------------------->+  t2
                   :                    : <--- global validity point
continue --------->+ t3                 :
                   |                    :
continue ------------------------------>+  t4
                   |                    |


The local validity points of engines 1 and 2 are at t1 and t2, respectively. If
any data changes in engine 1 between time t1 and t2 then its state at t2 will be
different from the state at the validity point t1. As a consequence the global
image will be inconsistent since it will describe state of engine 2 at time t2 
and state of engine 1 at time t1.

If the first engine is frozen between time t1 and t2 then the global image is
consistent as it describes state of both engines at time t2 (state of engine 1
at t2 is the same as its state at t1).

Timing backup image creation
============================

For efficiency reasons it is important that the initial phase of all engines
involved in creating a backup image ends at approximately the same time so that
no engine waits unnecessarily. 

For example any "at begin" engines whose initial phase is empty should be
initialized only after all the "at end" engines have finished their initial data
transfer. 

As another example consider two "at end" engines such that first of them needs
to send 1MB of data in the intial phase and the other one 1TB of data. If both
engines are initialized at the same moment, then the first one will have to wait
for a long time untill the second one finishes its initial transfer:

        +- E1 ready for v.p. after sending 1MB of data
        |
        |  E1 has to wait until E2 finishes its initial transfer
E1  |===|----------------------------------------------------------|----
  
E2  |============================================================|-|----
            E2 needs to send 1TB in the initial phase...           |
                                                                   +- creating
                                                                      validity
                                                                      point.

To minimize the waiting period, backup algorithm initializes backup process on
different engines at different times based on the estimates of the amount of
data to be sent in the initial phase (if avaliable).

In the above example, if the engines report 1MB and 1TB as estimates of their
initial data transfer size then the timing will be as follows:


        E1 initialized after E2 has sent 999MB of data -----+
E1                                                          |===|--|----
  
E2  |============================================================|-|----
            E2 needs to send 1TB in the initial phase...           |
                                                                   +- creating
                                                                      validity
                                                                      point.

In particular, "at begin" engines will return 0 as the estimate which will
result in initializing them *after* any other engines have sent their initial
data. Since it might be difficult to estimate size of the data, engine can
return UNKNOWN_SIZE as an estimate. Such engine will be assumed to be of "at
end" type.

No LLD needed, the most important thing with this WL was to get the 
algorithm right.
-- Lars, 2007-07-05

Note that there has been no push of code assigned to this
specific task. Per Lars:
"This WL describes the general algorithm, but the code is pushed in
other WLs.  It was just convenient to document the algorithm in the
WL."
-- Trudy Pelzer, 2007-07-05