WL#3169: Online backup: Server kernel

Affects: Server-6.0   —   Status: Complete   —   Priority: Low

The task of the backup kernel is to create an image of the current state of a
server instance or to restore the state from such an image.

The backup image contains a snapshot of (selected elements of) the current
instance state at some definite point in time. However, not all elements of
the state are stored in the backup image, only the ones listed below, which
form the so-called "backup state".

Note: only destructive restore will be implemented in this WL.

The restore operation has two variants: destructive and nondestructive. The
first variant completely replaces the current state of an instance with the
state stored in the backup image. Nondestructive restore merges the state from
the backup image with the current state, changing the state of only those
components which are stored in the image and leaving other components
unchanged. In a nondestructive restore the content of tables/databases stored
in the backup image is restored, while any tables/databases present in the
current instance but not stored in the image remain untouched. Note that this
may result in referential inconsistency, as non-restored tables can refer to
the restored ones.

It is assumed that the backup kernel will always create backup images which
can be used for a full, destructive restore, but the user can decide whether
to perform a destructive or nondestructive one.

Additional note on nondestructive restore.
Docs team please take note for emphasis in RefMan:

The problem of potential inconsistency is inherent in the design,
which allows a user to back up/restore only selected database(s).
This is not related to the implementation but to the semantics of
such an operation. When restoring a single database and leaving
others untouched, we create a situation where some tables are in the
restored state (from the past) while others are in the current state.
Since cross-database references are possible, there can be
inconsistencies of at least two types:

   1. State inconsistency: table A refers to B but A is in a state 
      from time t1 while B is in a state from time t2. Thus we create 
      a global state which has never occurred before.
   2. Referential inconsistency: table A refers to B but in the restore 
      process B was deleted or a column in it was removed/retyped. This 
      results in an erroneous state.


The "backup state" is the part of the instance state which
should be stored in a backup image. This is not the whole state, since,
for instance, the state of currently active threads or of ongoing
(not yet committed) transactions is not a part of it.

The backup state consists of two main parts: 
1. the instance metadata, and 
2. data stored in tables. 
The metadata is split into several items listed in WL#3713 --
the main part of it is the structure of backed-up tables. 

The backup state changes over time so we should speak about a state at a given
time t. This state reflects the situation resulting from all transactions
which are *committed* at time t. Transactions which are "in progress" do
not affect the backup state as defined here.

1. For consistency let us consider statements which are not part of a
   transaction as transactions consisting of that single statement.

2. Due to limitations in the current XA handling code it is possible
   to have "partially committed" transactions. This will limit the
   functionality of the backup system as well.

3. The exact interaction of backup and replication subsystems must be 
   thought over -- possibly when some prototype of backup is ready.

Validity Point

The backup process starts at some time t1 and continues until time t2, 
producing a backup image which contains the backup state at some time t, 
where t1 < t < t2. This time point t is called the validity point of the 
backup image. After restoring from the image the state will be the same as 
it was at time t.

Backup is considered correct regardless of where its validity point is located
between t1 and t2 but users may prefer to have it as close to t1 as possible.

Handling of errors

In the first version of the backup kernel, whenever an error is detected:
1. the current operations are canceled, 
2. the error is reported, and 
3. the normal operation of the instance is resumed. 

Canceling creation of a backup image does not affect database state -- it
continues its operation as if the backup request was never issued.

Canceling the restore process might result in changed content of the tables
being restored. However, any other tables remain unaffected. Global data like
user accounts should also be unchanged.


- An instance on which the restore operation is performed must have the same
  set of storage engines as the one on which the backup was created (this is
  because the backup image is created by individual storage engines and only
  the engine which created it can restore from it).

- The possibility to restore selected databases/tables can lead to referential
  inconsistencies. This cannot be avoided in a situation where some tables
  are changed and some are not. However, the backup kernel can detect this
  and issue a warning.


Requirements

0. Should correctly save and restore the backup state as described above.

1. The database should remain functional during the backup process as much
   as possible. Specifically:

    a) storage engines should not be locked,
    b) individual tables should not be locked,
    c) it should be possible to process queries (perhaps with some
       restrictions, such as no DDL operations).

   However, it is ok to block operations which refer to data which is
   currently being restored.

2. Should be possible to use backup image for setting up replication.

3. Format of the image data should be streamable.


4. Possibility to back up only a part of the backup state (selected
   databases and tables).

5. Possibility to restore only a part of the state saved in the backup image.

6. Possibility to choose between destructive and nondestructive restore
   (Not in version one.)

7. Extra requirements on the image format:

    a) Possibility to translate to known backup formats like XBSA (needs
       to be investigated further what this implies).

    b) Possibility to analyze and process the image by external tools.
       This can be provided on different levels:

         b1. image format is completely closed and can be used only for
             restore operation.
         b2. there is some kind of table of contents listing databases
             and tables stored in the image.
         b3. the table data is stored in an open format so that it can
             be understood by external tools

    c) Data compression. This may influence the image format if we want
       to be able to extract partial state (selected databases/tables)
       and still have the data compressed.

    d) Data consistency checking: a possibility to easily detect that
       the image is corrupted.

8. Possibility to use backup/restore functionality to initialize replication.

The main design decision is that the image of table data is created by
individual storage engines and not by the backup kernel. 

The engines are free to choose whatever method they like to create 
such an image and they are also free to put the image data in a 
format of their choice (which is mostly opaque to the
kernel). The metadata image is created by the backup kernel.

Given that a set of tables can be stored on several engines, 
the main duties of the backup kernel are:

 - backup/restore of metadata,
 - initiating backup/restore of table data on all involved engines (with
   correct timing to minimize resource consumption),
 - ensuring that partial backup images from different engines are all
   synchronized and correspond to the same point in time,
 - fetching backup images from all engines and putting them into the global
   backup image,
 - upon restore, extracting partial images from the global one and feeding
   them to the storage engines,
 - creating a correct environment for storage engines to perform
   backup/restore tasks (supplying arguments, creating tables, doing
   necessary global locking, etc.),
 - providing an interface to the SQL layer (handling backup related SQL
   commands, implementing the backup C API),
 - detecting and reacting to errors during backup/restore.


These algorithms implement protocols for correct synchronization of several
backup images created by individual storage engines. They are described in
WL#3569 and WL#3571.

Each storage engine chooses the format of the backup image most appropriate
for its internal representation of data and the backup method used. However,
to support selective restore from a given backup image (restoring only
selected tables), the backup image is divided into several "data streams"
corresponding to individual tables.

Each stream contains data needed to restore one table. There is a special
"shared data stream" to which the engine can write any data not connected to
any particular table.

It is completely up to the storage engine how it distributes its backup image
among these streams. It is, for instance, possible that all data will be sent
into the shared stream and the per-table streams will be empty, or vice versa.
However, it is important to keep in mind that upon restore only the shared
stream and the streams corresponding to the tables being restored will be
sent back to the engine.

As an example, consider a request for backing up tables t1, t2 and t3 on some
storage engine. The engine creates a backup image consisting of four data
streams:

#0: the shared data
#1: data for table t1
#2: data for table t2
#3: data for table t3

Later, a user wants to restore tables t1 and t3 only. The backup kernel will
send streams #0, #1 and #3 to the storage engine, but not #2. Hence stream #2
should not contain any data which would be needed to restore t1 or t3.

[Lars wants the backup image to consist of objects, e.g. tables, config data, 
auto_inc state, meta_data, triggers, SP, SF etc.  This to make it possible, in
future releases after release 1.0, to select what objects to take backup of and 
what objects to restore.]


A backup image created by a storage engine is labelled with the name of the
engine and a version number (obtained when the image was created).
It is a *strict* requirement that the storage engine provides backward
compatibility for image formats.

This means that if storage engine X supports version v of image format then it
*must* be able to restore data from all images labelled by "X" and with versions
w <= v. Thus introducing new backup image formats should be done with care.


- For differentiation, should we introduce incompatible backup image "flavours"
   (e.g. "community" and "enterprise" backup formats)?
   [Lars thinks not for release 1.0]

- Have backup image format names independent from engine names.
  For instance, the logical backup format created by the default algorithms,
  or a backup format shared by many different versioning engines.
  [Lars thinks yes, lets make the default format engine agnostic.]

- Give a possibility for a storage engine to handle backup images 
  created by a different one.
  [Lars thinks yes, in those cases when the engine has not 
  made it impossible.]


Data transfer protocol

This is the protocol used to fetch backup image data from storage engines, or
to send this data to them, in a controlled way.

Design goals and decisions:

- memory for data buffers is allocated by the kernel
  (reason: safer than allocation by storage engines),

- the backup server pulls data from the engines (reason: this gives the
  server precise control over the speed of data transfer from different
  engines, which is needed for synchronization),

- flexibility allowing for creating backup images either in parallel 
  threads or in the main thread of the backup kernel, also allowing 
  single/multi buffer solutions (reason: more freedom for storage engine 
  implementors; parallelism and multiple buffers can increase performance),

- no callbacks: the kernel polls engines for information.
The transfer protocol in both directions is based on placing requests for
reading/filling data buffers with the storage engine. The buffers are
allocated by the kernel, and the kernel decides the size of each buffer. An
engine which internally manipulates data of a different size must repack the
data to fit into the buffers supplied by the kernel.

The requests are processed by the storage engine in the order in which they
arrive. It is up to the engine to decide how to process them -- synchronously
using the server thread, or asynchronously by spawning dedicated thread(s).
The backup kernel doesn't know whether the engine uses separate threads to
process requests or not and is designed to behave well in both scenarios.

Requests are identified by pointers to data buffers. Using this
identification the backup kernel can poll for the status of previously
submitted requests.

Details of the protocol and its implementation are described in WL#3473.

Disclaimer: the fixed buffer size design was specifically requested by Brian,
who thinks that it is necessary for efficiency and correct error handling.
The implementor (Rafal) does not agree with that opinion and thinks that it
is possible, and better, to allow engines to send/receive chunks of data of
variable sizes chosen by the engine. Nevertheless, Brian's solution is being
implemented now.

Backup functionality that must be provided by storage engines

1. Giving an estimate of the size of the backup image to be produced and of
how much of it will be sent in the initial phase of the backup
synchronization protocol (see WL#3569).

2. Informing about the backup image version used.

3. Creating, upon request, a backup image of a given list of tables stored in
the engine. The backup data should be split into several data streams as
described above.

4. Establishing the validity point of the backup image using the
synchronization protocol (see WL#3569). This requires being able to freeze
the engine's state, blocking any operations which might change it.

5. Restoring selected tables from a previously created backup image, using
the data from the streams corresponding to these tables (and the shared data
stream) as described above. Backup image formats of any version less than or
equal to that reported in point 2 should be supported.

6. Implementing the above data transfer protocol for backup data transfers from
and to the backup kernel.

7. Canceling, upon request, an ongoing restore or backup process, cleaning
up, and resuming normal operation.

"Default" backup

Some storage engines may not support this API.
The server then performs the backup for them.
It is planned to use mysqlbackup with a full lock on the
involved tables until a better solution is developed.
The design is given in other WLs.
-- Lars, 2007-07-05