WL#4901: Ideas for improving the first backup image format.

Affects: Server-6.x   —   Status: On-Hold

This WL collects ideas for improving the backup image format that accumulated
over several months of development of the MySQL Backup system.

The backup image format used in the development was essentially frozen from the
time it was first proposed (WL#4063). Although this original design of the
format proved to be sufficient for the current functionality of the system, some
problems and possible enhancements have been discovered. 

Since the current version of the system does not support multiple image formats,
any changes in the existing format would break full backward compatibility. This
is why the format is frozen.

However, since the MySQL Backup system has not yet been officially released,
it may still be possible to update the backup image format used by the first
release, taking into account the experience gained while developing the system.
This could improve the quality of the first release of the MySQL Backup system
and simplify its further development.

Even if it is decided that the image format cannot be changed now, the ideas
collected here can be used for developing future formats.

Simplify catalog coordinates
============================

Objects saved in a backup image are collected in the image's catalog and can be
identified by catalog coordinates. Currently the coordinates are:

for global objects: 
 - position in the list of global objects of given type.

for tables:
 - snapshot number,
 - position in the list of tables belonging to that snapshot.

for per-database objects: 
 - database number,
 - position in the database catalog list.

The different treatment of tables and other per-database objects complicates the
format and seems to be redundant. A simpler addressing scheme could be used:

for global objects: 
 - position in the list of global objects of given type.

for per-database objects: 
 - database number,
 - position in the database catalog list.

Tables would then be treated the same as other per-database objects. There would
be no need to split the database catalog into separate lists of tables and other
objects; a single list of all objects belonging to a given database could be
used instead. As in the current format, a table's snapshot number and its
position within the snapshot would be stored in the catalog entry of that table.
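
The simplified addressing scheme can be sketched roughly as follows. This is
only an illustration of the idea, not the actual catalog code: the class names
(``GlobalCoord``, ``DbCoord``, ``CatalogEntry``, ``Catalog``) and the object
type labels are hypothetical. The point it shows is that a table and a view
share one per-database list, while the table's snapshot data lives in its
catalog entry rather than in its coordinates.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class GlobalCoord:
    obj_type: str  # type of global object (illustrative label)
    pos: int       # position in the list of global objects of that type


@dataclass(frozen=True)
class DbCoord:
    db_num: int    # database number
    pos: int       # position in that database's single catalog list


@dataclass
class CatalogEntry:
    name: str
    kind: str                            # "table", "view", "trigger", ...
    snapshot_num: Optional[int] = None   # set for tables only
    snapshot_pos: Optional[int] = None   # position within that snapshot


@dataclass
class Catalog:
    global_objects: Dict[str, List[CatalogEntry]] = field(default_factory=dict)
    databases: List[List[CatalogEntry]] = field(default_factory=list)

    def add_database(self) -> int:
        """Register a database and return its database number."""
        self.databases.append([])
        return len(self.databases) - 1

    def add_global(self, obj_type: str, entry: CatalogEntry) -> GlobalCoord:
        entries = self.global_objects.setdefault(obj_type, [])
        entries.append(entry)
        return GlobalCoord(obj_type, len(entries) - 1)

    def add_db_object(self, db_num: int, entry: CatalogEntry) -> DbCoord:
        # Tables and non-table objects go into the same list.
        self.databases[db_num].append(entry)
        return DbCoord(db_num, len(self.databases[db_num]) - 1)

    def lookup(self, coord: Union[GlobalCoord, DbCoord]) -> CatalogEntry:
        if isinstance(coord, GlobalCoord):
            return self.global_objects[coord.obj_type][coord.pos]
        return self.databases[coord.db_num][coord.pos]
```

With this layout the table data section can still locate a table through the
``snapshot_num`` and ``snapshot_pos`` fields of its entry, so no separate
table coordinate type is needed.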

Simplify metadata section of the image
======================================

Metadata section contains a list of entries, each storing metadata for one of
the objects. The order of entries is important, as it ensures correct handling
of object dependencies.

To support selective restore of chosen databases, this list was arranged as
follows:

- first, metadata for all global items is stored,
- then comes metadata for all tables, grouped by database,
- finally, the metadata for all other objects.

Thus the format of the image imposes certain restrictions on the order in which
objects' metadata is stored. This complicates the code for writing and reading
this section of the image, while the benefits are doubtful.

A much simpler and cleaner approach would be to store metadata for all objects
as a single list of entries. The image format would put no restrictions on the
order in which metadata is stored - the application which writes the image would
be free to arrange them in the most appropriate way.
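
A minimal sketch of how a reader might still perform selective restore over a
single flat list is given below. It assumes, purely for illustration, that each
metadata entry records which database it belongs to (``None`` for global
objects); the ``MetaEntry`` type and ``select_for_restore`` function are
hypothetical names, not part of the existing code.

```python
from typing import Iterable, List, NamedTuple, Optional


class MetaEntry(NamedTuple):
    db: Optional[str]  # owning database, or None for a global object
    kind: str          # "tablespace", "table", "view", ...
    name: str
    ddl: str           # metadata payload (here: a create statement)


def select_for_restore(entries: Iterable[MetaEntry],
                       wanted_dbs: Iterable[str]) -> List[MetaEntry]:
    """Keep global entries and entries of the wanted databases.

    The writer's ordering is preserved, so whatever dependency order the
    writing application chose is still respected after filtering.
    """
    wanted = set(wanted_dbs)
    return [e for e in entries if e.db is None or e.db in wanted]
```

Because the filter is a single pass that keeps relative order, the format itself
does not need to group entries by type or by database.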

Remove per-table items
======================

Currently, the metadata section has a dedicated sub-section for storing metadata
for per-table objects. However, this sub-section cannot be used, because there
is no space in the catalog to store per-table object info. Thus the format could
be simplified, and confusion avoided, by removing this sub-section.

If the above simplification of the metadata section is implemented, this will
happen automatically. At the same time, it would remain easy to add per-table
items later, if needed.

Add flags field to summary section
==================================

The image header contains a flags field. However, the header is written to the
stream at the beginning of the backup process, and the values of some flags can
be known only at the end of that process. For example, only after the validity
point (VP) do we know whether the binlog was enabled at that time and whether
the image contains a valid VP binlog position.

The flags that are known only at the end of the process could be stored in the
summary section of the image. The final set of flags would be obtained as the
bitwise OR of the flags from the header and from the summary.
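
The combination rule can be illustrated with a short sketch. The flag names and
bit values below are made up for the example and do not correspond to the real
image format.

```python
from enum import IntFlag


class ImageFlags(IntFlag):
    # Hypothetical flag bits, for illustration only.
    INLINE_SUMMARY = 0x01    # known when the header is written
    BIG_ENDIAN = 0x02        # known when the header is written
    BINLOG_POS_VALID = 0x04  # known only after the validity point


def effective_flags(header_flags: int, summary_flags: int) -> ImageFlags:
    """Final flags are the bitwise OR of header and summary flags."""
    return ImageFlags(header_flags) | ImageFlags(summary_flags)
```

A reader would call ``effective_flags`` once both sections have been read; a
flag set in either place counts as set, so the header never needs rewriting.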

Location of the summary section
===============================

The current format allows the summary section to be stored either at the end of
the backup image or in the preamble (as indicated by a flag in the image
header). There are some issues:

- The current code does not support writing or reading a summary inlined in the
  preamble.
- If an inlined summary is supported, it will probably need a fixed length, so
  that a "hole" of known size can be left in the image for storing the summary
  there. The current format of the summary makes it variable length (since we
  don't know the length of the binlog file path).
- Even when the summary is inlined in the preamble, perhaps a copy should be
  added at the end of the image. This way the reading code could be simpler,
  because the summary would always be present at the end of the image.
- With the summary both in the preamble and at the end, the variable-size
  problem could be solved as follows: in the preamble, a fixed space is reserved
  for the summary. If some parts of the summary do not fit into that space, this
  is indicated by a special flag, and the reading code can get the missing
  information from the second copy at the end of the image.
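
The last idea, a fixed-size slot in the preamble with a fallback to the copy at
the end, might look roughly like the sketch below. The slot size, field layout
and flag value are all assumptions chosen for the example; the real summary
section contains different fields.

```python
import struct

SLOT_SIZE = 64         # hypothetical size of the reserved preamble slot
FLAG_TRUNCATED = 0x01  # hypothetical "does not fit" flag
_SLOT_HDR = "<BQQH"    # flags, vp_time, binlog_pos, path length (19 bytes)
_TAIL_HDR = "<QQH"     # vp_time, binlog_pos, path length (18 bytes)


def pack_trailer(vp_time: int, binlog_pos: int, path: bytes) -> bytes:
    """Full, variable-length summary stored at the end of the image."""
    return struct.pack(_TAIL_HDR, vp_time, binlog_pos, len(path)) + path


def pack_slot(vp_time: int, binlog_pos: int, path: bytes) -> bytes:
    """Fixed-length summary for the preamble; drops the path if too long."""
    room = SLOT_SIZE - struct.calcsize(_SLOT_HDR)
    flags = 0
    if len(path) > room:
        flags |= FLAG_TRUNCATED
        path = b""
    body = struct.pack(_SLOT_HDR, flags, vp_time, binlog_pos, len(path)) + path
    return body.ljust(SLOT_SIZE, b"\0")  # pad to the reserved size


def read_summary(slot: bytes, trailer: bytes):
    """Prefer the preamble slot; fall back to the trailer when truncated."""
    flags, vp_time, pos, n = struct.unpack_from(_SLOT_HDR, slot)
    if flags & FLAG_TRUNCATED:
        vp_time, pos, n = struct.unpack_from(_TAIL_HDR, trailer)
        start = struct.calcsize(_TAIL_HDR)
        return vp_time, pos, trailer[start:start + n]
    start = struct.calcsize(_SLOT_HDR)
    return vp_time, pos, slot[start:start + n]
```

The sketch also makes the cost visible: the reader needs two code paths and the
writer must seek back to fill the slot, which supports the suggestion below to
leave the inlined summary out of the first version.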

Because having an inlined summary is not essential, I (Rafal) would suggest
removing this possibility from the first version of the image format. This would
agree with the current code, which cannot use this feature even though the
format theoretically supports it.

Add image comment field
=======================

It was suggested that it would be good to store a user-provided comment in the
backup image. This should be a simple extension of the existing format, as the
image header contains a variable-length area for extra data. Thus a comment
field could probably be added while maintaining full compatibility with the
existing format.

Remove group position from binlog coordinates
=============================================

Binlog coordinates of the VP are stored in the summary section. Apart from the
coordinates of the last binlog event at VP time, there is also space for storing
the coordinates of the event group to which this event belongs. Storing of event
group coordinates is not implemented in the code. It is also not clear whether
it is necessary or useful to store group coordinates. If it is decided that they
are not needed, the summary section format could be simplified by removing this
field.