WL#4063: Online Backup: Finalize backup stream format

Affects: Server-6.0   —   Status: Complete

Remove the following limitations of the format used in the backup prototype:
 - Remove chunk size limit
 - Handle items other than tables in the meta-data part
 - Add all needed info to the header
 - Add datablock flags (1-2 bytes), e.g. to signal end of table
 - Store charset info for all strings saved in the stream
 - Get the final format reviewed
 - Make necessary changes to the stream writing/reading code
 - Make buffer sizes configurable

Fix and document the format to be used in the first release. Update the stream
writing/reading code to the new format. Add a unit test program for the library.

Benefits:
 - Backups no longer fail because of internal buffer size limits
 - Backward compatibility becomes possible
 - The format can store images created on servers which use
   non-standard charsets
  
Warning: The backup image format description included here is no longer 100%
accurate. For the most up-to-date description see the doxygen documentation of
the backup code, specifically the comments in stream_v1.c and
stream_v1_transport.c.

Backup stream format
=====================

Each backup stream starts with a 10-byte prefix containing a magic number and
the image's format version number. For backward compatibility, this must be the
case for all versions of online backup!

[backup stream] = [ magic number:8 ! version:2 ! backup image ]

[backup image] is stored in the format indicated by the version number.
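
As an illustration, here is a minimal sketch of reading this prefix in C. The
actual magic value is defined by the backup code and is only a placeholder
below; the sketch also assumes the 2-byte version number is stored least
significant byte first, as fixed width numbers are elsewhere in the format.

#include <stdio.h>

/* Sketch: read the 10-byte stream prefix, i.e. 8 bytes of magic
   followed by a 2-byte format version. */
int read_stream_prefix(FILE *f, unsigned int *version)
{
  unsigned char buf[10];

  if (fread(buf, 1, sizeof(buf), f) != sizeof(buf))
    return -1;                  /* truncated stream */

  /* ... compare buf[0..7] against the expected magic number here ... */

  /* assumption: version stored least significant byte first */
  *version= buf[8] | ((unsigned int)buf[9] << 8);
  return 0;
}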

Version 1 of backup image format
================================

A backup archive is organized as a sequence of data chunks. There are two
independent aspects of the format:

a) transport layer format: determines how to extract data chunks from the byte
   stream,

b) image format: determines how to interpret the contents of the chunks.

We start with a description of the image format, assuming that data chunks have
already been extracted and ignoring (for the moment) how that was done. In
diagrams, chunk boundaries are marked with '|':

  ... | chunk of data | more chunks ...

Boundaries between fields inside a chunk are marked with '!':

  ... ! a field ! more bytes ...

A backup image consists of three main parts.

[backup image]= [ preamble | table data | summary(*) ]

[Preamble] contains the following information:

- version of the server which created the archive (X-Y-Z-ext),
- time when archive was created (start of backup),
- global image flags (16 bits),
- information about table data snapshots stored in the image,
- catalogue listing all items stored in the image,
- meta-data for the items.

[Summary] contains the following information:

- time of the validity point
- binlog position at the validity point

Information stored in the summary is known only after the archive has been
created. Depending on whether it was possible to rewind the output stream after
backup, the summary can be stored in the preamble or at the end of the archive.
Its position is indicated by flags stored in the preamble.

[preamble] = [ header | summary (*) | catalogue | meta data ]

Header
------

[header]= [ flags:2 ! creation time ! #of snapshots:1 ! server version !
            extra data | snapshot descriptions ]
  
[Flags] informs about the position of [summary] and whether it contains binlog
position info. It also tells what byte order (endianness) was used on the
machine which created the image.

[flags]= [ unused:.5 ! 
           has_binlog_pos:.1 ! 
           big_endian:.1 ! 
           summary_in_preamble:.1 ]
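
As a hedged illustration, assuming the fields above are listed from the most
significant bit down (so summary_in_preamble occupies the least significant
bit), the flags could be tested as below; the bit assignment and macro names
are our assumptions, not the library's definitions.

/* Hypothetical bit assignment: last listed field = bit 0. */
#define IMG_FLAG_SUMMARY_IN_PREAMBLE  0x01
#define IMG_FLAG_BIG_ENDIAN           0x02
#define IMG_FLAG_HAS_BINLOG_POS       0x04

int summary_is_inline(unsigned int flags)
{
  return (flags & IMG_FLAG_SUMMARY_IN_PREAMBLE) != 0;
}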

[Snapshot descriptions] contains descriptions of the table data snapshots
present in the image. Each description is stored in a separate chunk (the
number of snapshots is given in the header).
  
[snapshot descriptions]= [ snapshot description | ... | snapshot description ]  

For each snapshot the following information is stored:

- version of the snapshot's format,
- info about backup engine which created this snapshot,
- snapshot options, 
- number of tables in the snapshot.

[snapshot description] = [ image type:1 ! format version: 2 ! global options:2 ! 
                           #of tables ! backup engine info ! extra data ]
  
[Image type] is encoded as follows:

  0 - snapshot created by a native backup driver (BI_NATIVE),
  1 - snapshot created by the built-in blocking driver (BI_DEFAULT),
  2 - snapshot created by the built-in driver using a consistent read
      transaction (BI_CS).
  
The format of [backup engine info] depends on the snapshot type. It is empty
for the default and CS snapshots. For native snapshots it has the format:
  
[backup engine info (native)] = [ storage engine name ! storage engine version ]

Server and engine versions are stored as follows.
    
[server version] = [ major:1 ! minor:1 ! release:1 ! extra string ]
[engine version] = [ major:1 ! minor:1 ]

Summary
-------

If the binlog position is stored, both the position of the last binlog entry
(at VP time) and the beginning of the event group containing that entry are
stored.

[summary]= [ vp time ! end time ! binlog pos ! binlog group pos ]
  
[binlog pos]= [ pos:4 ! binlog file name ]

[binlog group pos] uses the same format as [binlog pos]. Times are stored as
follows.

[time]= [ year and month:2 ! day of month:1 ! hour:1 ! min:1 ! sec:1 ] 
[year and month]= [ year:.12 ! month:.4 ]
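
A short sketch of packing the [year and month] field, assuming the year
occupies the high 12 bits and the month the low 4 bits (the order in which
they are listed above); the 2-byte result is stored least significant byte
first, like other fixed width numbers.

/* Pack year (12 bits) and month (4 bits) into the 2-byte field. */
void write_year_month(unsigned char buf[2], unsigned int year,
                      unsigned int month)
{
  unsigned int ym= ((year & 0x0FFF) << 4) | (month & 0x0F);

  buf[0]= ym & 0xFF;            /* least significant byte first */
  buf[1]= (ym >> 8) & 0xFF;
}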

Catalogue
---------

The catalogue describes what items are stored in the image. It doesn't contain
any meta-data, which is stored in a separate section. The catalogue only lists
the items, providing the information needed to identify and select them.

[catalogue]= [ charsets ! 0x00 ! users ! 0x00 ! databases | 
               db catalogue | ... | db catalogue ]

The [charsets], [users] and [databases] sections are separated by a 0x00
marker. The catalogue starts with a list of charsets, where each charset is
identified by its name. In other places of the image, charsets are identified
by their positions in this list. The number of charsets is limited to 256 so
that one byte is enough to identify a charset.
  
[charsets]= [ charset name ! ... ! charset name ]
  
The first two entries in [charsets] have special meaning and should always be
present.

The first charset is the one used to encode all strings stored in the
preamble. This should be a universal charset like UTF8, capable of
representing any string.

The second charset in the list is the default charset of the server on which
the image was created. It can be the same as the first charset.

The remaining charsets are those used by the items stored in the image and
thus needed to restore these items.
  
[users]= [ user name ! ... ! user name ]
  
The user list should always be non-empty and start with the root user. It
lists those users for whom any privileges are stored in the image.
  
After [users], a list of all databases follows. If the list is empty, it
consists of a single null string. Otherwise it has the format:
  
[databases]= [ db info ! ... ! db info ]
  
[db info]= [ db name ! db flags:1 ! optional extra data ]
[db flags]= [ has extra data:.1 ! unused:.7 ]

[optional extra data]= [data len:2 ! the data:(data len) ]
  
[optional extra data] is present only if indicated in the flags. If there are 
no databases in the image, the database list is empty and there are no database 
catalogues.

[catalogue (no databases)] = [ charsets ! 0x00 ! users ! 0x00 ]

A database catalogue lists all tables and other per-db items belonging to the
given database.

[db catalogue]= [ db-item info ! ... ! db-item info ]
  
Each entry in the catalogue describes a single item. Currently only tables are 
stored in the database catalogue.
  
[db-item info]= [ type:2 ! name ! optional item data ]
  
[optional item data] is used only for tables:
  
[optional item data (table)]= [ flags:1 ! snapshot no:1 ! optional extra data ]
  
[snapshot no] tells which snapshot contains the table's data. Presence of
extra data is indicated by a flag.

[flags]= [ has_extra_data:.1 ! unused:.7 ]    
[optional extra data]= [ data_len:1 ! extra data:(data_len) ] 

If a database is empty, its catalogue consists of two 0x00 bytes.

[db catalogue (empty)] = [ 0x00 0x00 ]


Meta data
---------

Meta data is stored as a sequence of entries, each containing the data
required to restore a single item. Normally this is an SQL CREATE statement
for the item, but an entry can also contain other data stored in binary
format.

The order of entries is significant: they are stored in such an order that
items can be created while reading the entries without breaking any
dependencies.

The meta data section is divided into three main parts, storing meta data for
global items, tables, and other items (per-db and per-table).
  
[meta data]= [ global items | tables | other items ]

All sections of [meta data] contain lists of meta-data entries, each holding
meta-data for a single item. Meta-data for databases is stored in the [global
items] section. [Tables] are grouped by database (this is OK since foreign key
constraints can be disabled when tables are created). This makes it easy to
skip tables upon selective restore of databases. For other per-database items
such grouping is not possible because of possible inter-database dependencies.
This is why other per-db items are stored separately in the [other items]
section.

[tables] = [ tables from db1 | ... | tables from dbN ]

[Other items] has two parts: one for all per-database items (except tables)
and one for all per-table items. These are separated by a 0x00 0x00 marker.

[other items]= [ per-db items ! 0x00 0x00 ! per-table items ]
  
If there are no databases in the image, [meta data] consists of [global items]
only.

[meta data (no databases)]= [ global items ]
    
Meta data item lists can be empty or consist of several item entries. An empty
item list consists of two 0x00 bytes, which cannot start any valid
[item entry].

[item list] = [ item entry ! ... ! item entry ]
[item list (empty)]= [ 0x00 0x00 ]
  
If a list starts with a byte different from 0x00, it is a sequence of
meta data item entries, each having the following format:
  
[item entry]= [ type:1 ! flags:1 ! position in the catalogue ! 
                optional extra data ! optional CREATE statement ]

The type of an item is encoded as follows:

  1 - character set,
  2 - user,
  3 - privilege,
  4 - database,
  5 - table,
  6 - view.

Actually, in the first version of online backup, meta-data will be stored only
for entries of types 3, 4, and 5.
  
Item meta-data contains a CREATE statement, other binary data, or both.
[flags] informs which meta-data elements are present in the entry.

[flags]= [ has_extra_data:.1 ! has_create_stmt:.1 ! unused:.6 ]
  
Item position in the catalogue is represented by 1-3 numbers, depending on its 
type:
  
[item position (global)]= [db no]
[item position (table)]= [ snap no ! pos in snapshot's table list]
[item position (other per-db item)]= [ pos in db item list ! db no ]
[item position (per-table item)] = 
                      [ pos in table's item list ! db no ! table pos ]
  
Note that a table is identified by its position inside the snapshot to which
it belongs.
  
[optional extra data]= [ data_len:2 ! extra data:(data_len) ]  

Table data
----------

Table data is a sequence of data chunks.

[table data]= [ table data chunk | ... | table data chunk ]
 
[table data chunk]= [ snapshot no:1 ! seq no:2 ! flags:1 ! table no ! data ]
 
Snapshots are numbered from 1, thus a [table data chunk] never starts with a
0x00 byte. This can be used to mark the end of a sequence of such chunks.
Data chunks of each snapshot are numbered consecutively ([seq no]), which can
be used to detect discontinuities in a backup stream. Currently only one flag
is used, indicating the last data chunk for a given table.
  
[flags]= [ unused:.7 ! last data block:.1 ]


Formats for basic types
-----------------------

Numbers in fixed width format are stored with the least significant bytes
coming first. There are 2-byte and 4-byte fixed width numbers.

Numbers can also be stored in a variable length format. In this format a
number is stored as a sequence of bytes, each byte holding 7 bits of the
number (least significant bits coming first). The most significant bit of each
byte tells whether more bytes follow (if it is set) or the current byte is the
last one (if it is not set).

Example: the binary number 100101011101011010110100011110 is split into the
7-bit groups 10 0101011 1010110 1011010 0011110 and stored in 5 bytes (least
significant group first): 0x9E 0xDA 0xD6 0xAB 0x02.

Note: we store only unsigned values.
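
For illustration, here is a self-contained sketch of this encoding in C. The
function names are ours; in the library this encoding is exposed through
bstream_wr_num()/bstream_rd_num(), described later.

#include <stddef.h>

/* Write x in the variable length format: 7 bits per byte, least
   significant group first, high bit set when more bytes follow.
   Returns the number of bytes written (at most 5 for 32-bit values). */
size_t write_varnum(unsigned char *buf, unsigned long x)
{
  size_t n= 0;

  do
  {
    unsigned char b= x & 0x7F;
    x >>= 7;
    if (x)
      b|= 0x80;                 /* more bytes follow */
    buf[n++]= b;
  } while (x);

  return n;
}

/* Read a number in the same format; returns bytes consumed. */
size_t read_varnum(const unsigned char *buf, unsigned long *out)
{
  unsigned long x= 0;
  size_t n= 0;
  int shift= 0;
  unsigned char b;

  do
  {
    b= buf[n++];
    x|= (unsigned long)(b & 0x7F) << shift;
    shift+= 7;
  } while (b & 0x80);

  *out= x;
  return n;
}

Encoding the number from the example above (0x2575AD1E) produces exactly the
bytes 0x9E 0xDA 0xD6 0xAB 0x02.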

Strings are stored using the following encoding.

[string] = [ size ! bytes:(size) ]

The size is stored using the above variable length encoding. All strings are
stored using the same universal character set, which is listed in the image's
catalogue as the first entry.

Carrier format
--------------

Data is stored in fixed-size blocks. The size of a block is stored in the
first 4 bytes of each of the first several blocks. The redundancy helps detect
data corruption.

archive data=
[ first block ! initial block ! ... ! initial block !
  regular block ! ... ! regular block ]

first block=
[ block_size:4 ! #of initial blocks:1 !
  block data:(block_size-5) ]

[block_size] is given in units of 512 bytes.

initial block=
[ block_size:4 ! block data:(block_size-4) ]

regular block=
[ block data:(block_size) ]

A single chunk can span several blocks, and several chunks can fit into a
single block. Therefore data chunks are split into fragments, each of which
fits into a single block. These fragments are packed into blocks.

block data= [ fragment ! ... ! fragment ]

where [fragment] is one of
 
[EOC marker] = [ 0x80 ]
[EOS marker] = [ 0xC0 ]
[fragment extending to the end of block, last in a chunk] = [ 0x00 ! payload ]
[fragment extending to the end of block, more follow] = [ 0x40 ! payload ]
[limited size fragment] = [ fragment header:1 ! payload ]
 
The header of [limited size fragment] consists of a two-bit fragment type
followed by a six-bit, non-zero value encoding the length of the fragment:
 
[fragment header] = [ type:.2 ! value:.6 ]
 
There are four types of fragments determined by the first 2 bits of the header:
 
 00 - small fragment which is the last fragment of a chunk,
 01 - small fragment with more fragments following it,
 10 - big fragment (more fragments follow),
 11 - huge fragment (more fragments follow).
 
Encoding of the size of the fragment depends on its type:
 
 - for small fragments: size= value  
 - for big fragments:   size= value << 6
 - for huge fragments:  size= value << 12
 
For small fragments, the second bit of the header determines whether it is the
last fragment of a chunk or more fragments follow. A chunk can't end with a
big or huge fragment, so for these fragments more fragments are always
expected. An [EOC marker] ends a chunk, even if the preceding fragment
indicated that more fragments would follow.
   
The biggest fragment size is 63*(2^12) = 258048 bytes, roughly 252 KiB. The
format of the fragment header constrains the possible fragment sizes. If a
chunk of data has a size that cannot be encoded by a single fragment header,
it is split into several fragments of encodable sizes.


Post review changes
===================

Some small changes were introduced when implementing the above specifications.

1. "Carrier format" was renamed to "Transport layer format"

2. In transport layer, [block_size] stored in initial blocks is expressed in
bytes, not units of 512 bytes. Since we use 4 bytes for [block_size], the
biggest possible block size is 0xFFFF = 64k - 1.

3. The meaning of "last fragment of a chunk" flag has been reversed - if this
flag is set then this is the last fragment of a chunk. The updated
specifications for [fragment] are as follows:

 [EOC marker] = [ 0x80 ]
 [EOS marker] = [ 0xC0 ]
 [fragment extending to the end of block, last in a chunk] = [ 0x40 ! payload ]
 [fragment extending to the end of block, more follow] = [ 0x00 ! payload ]
 [limited size fragment] = [ header:1 ! payload ]

 The header of [limited size fragment] consists of a two-bit fragment type
 followed by a six-bit, non-zero value encoding the length of the fragment:

 [fragment header] = [ type:.2 ! value:.6 ]

 There are four types of fragments determined by the first 2 bits of the header:

 00 - small fragment with more fragments following it,
 01 - small fragment which is the last fragment of a chunk,
 10 - big fragment (more fragments follow),
 11 - huge fragment (more fragments follow).

 Encoding of the size of the fragment depends on its type:

 - for small fragments: size= value
 - for big fragments:   size= value << 6
 - for huge fragments:  size= value << 12
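
To make the post-review rules concrete, here is a hedged sketch of decoding a
fragment header in C (the type and struct names are illustrative, not the
library's). A zero six-bit value turns each type into one of the special cases
listed above: 0x80 (type 10, value 0) is the EOC marker, 0xC0 (type 11,
value 0) the EOS marker, and 0x00/0x40 are fragments extending to the end of
the block.

#include <stddef.h>

enum frag_kind { FRAG_EOC, FRAG_EOS, FRAG_TO_END, FRAG_SIZED };

struct frag_info {
  enum frag_kind kind;
  size_t size;          /* payload size, only meaningful for FRAG_SIZED */
  int last;             /* non-zero if last fragment of a chunk */
};

void decode_fragment_header(unsigned char hdr, struct frag_info *fi)
{
  unsigned type=  hdr >> 6;     /* top two bits: fragment type */
  unsigned value= hdr & 0x3F;   /* low six bits: non-zero for sized fragments */

  fi->size= 0;
  fi->last= 0;

  if (value == 0)               /* markers and to-end-of-block fragments */
  {
    switch (type)
    {
    case 2:  fi->kind= FRAG_EOC; return;        /* 0x80 */
    case 3:  fi->kind= FRAG_EOS; return;        /* 0xC0 */
    default: fi->kind= FRAG_TO_END;             /* 0x00 or 0x40 */
             fi->last= (type == 1);
             return;
    }
  }

  fi->kind= FRAG_SIZED;
  fi->last= (type == 1);        /* 01 = small fragment, last in its chunk */
  switch (type)
  {
  case 0:
  case 1:  fi->size= value;       break;        /* small fragment */
  case 2:  fi->size= value << 6;  break;        /* big fragment */
  case 3:  fi->size= value << 12; break;        /* huge fragment */
  }
}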

The backup stream format is implemented as a library, libbackupstream, of
functions for writing and reading streams in this format. The library has a C
API so that it can be used from programs written in plain C (not necessarily
C++).

The library is designed so that it can be compiled and used independently of
the rest of the code. This is why it doesn't use any mysql/mysys functions or
types, only standard C types and types derived from them.

There are low level functions for writing/reading chunks of data using the
carrier format. These functions use the bstream_blob type to represent a
region of memory holding the data to be written or a buffer into which data
should be read.

int bstream_write(backup_stream*, bstream_blob*)   - write data from a given blob
int bstream_end_chunk(backup_stream*)              - close a chunk
int bstream_flush(backup_stream*)                  - flush stream

int bstream_read(backup_stream*, bstream_blob*)    - read into a blob
int bstream_next_chunk(backup_stream*)             - move to next chunk of data

Higher level functions write various parts of the backup archive using these
low level functions. They use the following structures:

st_bstream_image_header - global information about the backup image,
st_bstream_data_chunk   - describes a single data chunk in the table data
                          section.

Functions bstream_{wr,rd}_catalogue() and bstream_{wr,rd}_meta_data() need
access to the image's catalogue listing all items stored in the image. The
backup stream library doesn't specify how this catalogue is implemented. It is
represented by an st_bstream_image_header* pointer, and the application using
this library must implement the service functions for browsing and populating
the catalogue with data (see the services API below).

int bstream_wr_preamble(backup_stream*, struct st_bstream_image_header*) 
int bstream_rd_preamble(backup_stream*, struct st_bstream_image_header*) 

int bstream_wr_header(backup_stream*, struct st_bstream_image_header*)
int bstream_rd_header(backup_stream*, struct st_bstream_image_header*)

int bstream_wr_summary(backup_stream*, struct st_bstream_image_header*)
int bstream_rd_summary(backup_stream*, struct st_bstream_image_header*)

int bstream_wr_catalogue(backup_stream*, struct st_bstream_image_header*)
int bstream_rd_catalogue(backup_stream*, struct st_bstream_image_header*)

int bstream_wr_meta_data(backup_stream *, struct st_bstream_image_header*)
int bstream_rd_meta_data(backup_stream *, struct st_bstream_image_header*)

int bstream_wr_data_chunk(backup_stream*, struct st_bstream_data_chunk*)
int bstream_rd_data_chunk(backup_stream*, struct st_bstream_data_chunk*)
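
As a usage illustration, here is a sketch of the overall write sequence built
from these functions. It assumes the summary is stored at the end of the
stream (the header flags must indicate this placement) and that a non-zero
return value signals an error; both are our assumptions.

#include <stddef.h>

/* Sketch: write a complete backup image. Assumes the libbackupstream
   header declaring backup_stream, the st_bstream_* structures and the
   bstream_wr_* functions has been included. */
int write_image(backup_stream *s, struct st_bstream_image_header *hdr,
                struct st_bstream_data_chunk *chunks, size_t nchunks)
{
  size_t i;

  if (bstream_wr_preamble(s, hdr))      /* header, catalogue, meta data */
    return -1;

  for (i= 0; i < nchunks; i++)          /* table data section */
    if (bstream_wr_data_chunk(s, &chunks[i]))
      return -1;

  if (bstream_wr_summary(s, hdr))       /* summary at end of stream */
    return -1;

  return bstream_flush(s);
}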

To write and read basic types, the following functions are used.

int bstream_wr_byte(backup_stream*, unsigned short int); 
int bstream_rd_byte(backup_stream*, unsigned short int*); 

int bstream_wr_int2(backup_stream*, unsigned int);
int bstream_rd_int2(backup_stream*, unsigned int*);

int bstream_wr_int4(backup_stream*, unsigned long int);
int bstream_rd_int4(backup_stream*, unsigned long int*);

int bstream_wr_num(backup_stream*, unsigned long int);
int bstream_rd_num(backup_stream*, unsigned long int*);

int bstream_wr_string(backup_stream*, bstream_blob); 
int bstream_rd_string(backup_stream*, bstream_blob*);

int bstream_wr_time(backup_stream*, bstream_time_t*); 
int bstream_rd_time(backup_stream*, bstream_time_t*); 

Services API
------------

This API consists of the functions which are used by the backup stream library
and should be implemented by the application using it.

Functions for populating catalogue with items:

int bcat_reset(struct st_bstream_image_header *catalogue) 
    - prepare catalogue for inserting items.

int bcat_close(struct st_bstream_image_header *catalogue) 
    - signal that all items have been inserted.

int bcat_add_item(struct st_bstream_image_header *catalogue, 
                  struct st_bstream_item_info *item)
    - add item to the catalogue.

Functions for accessing items stored in the catalogue:

struct st_bstream_item_info* 
bstream_cat_get_item(struct st_bstream_image_header *catalogue, 
                     unsigned long int pos)

struct st_bstream_dbitem_info* 
bstream_cat_get_db_item(struct st_bstream_image_header *catalogue, 
                        struct st_bstream_db_info *db, 
                        unsigned long int pos);

struct st_bstream_titem_info* 
bstream_cat_get_table_item(struct st_bstream_image_header *catalogue, 
                           struct st_bstream_table_info *table, 
                           unsigned long int pos);

Functions creating iterators which iterate over items in the catalogue:

void* bstream_cat_iterator_get(struct st_bstream_image_header *, 
                               unsigned int type);
void  bstream_cat_iterator_free(struct st_bstream_image_header *catalogue, 
                                void *iter);

struct st_bstream_item_info* 
bstream_cat_iterator_next(struct st_bstream_image_header *catalogue, 
                          void *iter);

Iterator over items belonging to a given database:

void* bstream_cat_db_iterator_get(struct st_bstream_image_header *catalogue, 
                                  struct st_bstream_db_info *db);
void  bstream_cat_db_iterator_free(struct st_bstream_image_header *catalogue, 
                                   struct st_bstream_db_info *db, 
                                   void *iter);
struct st_bstream_dbitem_info* 
bstream_cat_db_iterator_next(struct st_bstream_image_header *catalogue, 
                             struct st_bstream_db_info *db, 
                             void *iter);

Memory management services
--------------------------

The library doesn't do its own memory management. To allocate/free memory,
it uses external functions bstream_alloc() and bstream_free(). These functions
should be defined by the application using this library.
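
A minimal sketch of these services routed to the standard allocator; the exact
signatures are not given here, so the ones below are assumptions.

#include <stdlib.h>

/* Assumed signatures; an application could instead route these
   to its own allocator. */
unsigned char* bstream_alloc(unsigned long int size)
{
  return (unsigned char*) malloc(size);
}

void bstream_free(unsigned char *ptr)
{
  free(ptr);
}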

Abstract stream
---------------

The backup stream library doesn't determine the final destination of the data
written or where the bytes to be read come from. Instead, the user of the
library defines the underlying read/write mechanism by storing pointers to
functions performing basic I/O operations inside an st_abstract_stream
structure.
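
Since the layout of st_abstract_stream is defined by the library headers and
not reproduced here, the following is a purely illustrative sketch of the
idea: basic I/O operations supplied through function pointers, backed by
stdio in this case.

#include <stdio.h>
#include <stddef.h>

/* Illustrative only; not the real st_abstract_stream layout. */
struct demo_stream {
  int (*write)(void *ctx, const unsigned char *buf, size_t len);
  int (*read)(void *ctx, unsigned char *buf, size_t len);
  void *ctx;                    /* here: a FILE* */
};

static int file_write(void *ctx, const unsigned char *buf, size_t len)
{
  return fwrite(buf, 1, len, (FILE*) ctx) == len ? 0 : -1;
}

static int file_read(void *ctx, unsigned char *buf, size_t len)
{
  return fread(buf, 1, len, (FILE*) ctx) == len ? 0 : -1;
}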


A patch introducing the backup stream library can be found here

Here is another patch adding support for storing the block size in the stream
and reading it from there. It also fixes several bugs in the library.