WL#7069: Provide data dictionary information in serialized form

Affects: Server-8.0 — Status: Complete

Description
Requirements
High Level Architecture
Low Level Design

Executive summary
#################

Meta data must be provided in a new serialized form, since the current .FRM
files will be abandoned. The new serialized format will be used by NDB as a
replacement for “.FRM shipping”, and by InnoDB to be stored in .IBD files to
support transportable table spaces, and to support recovery as a last resort.

To support this functionality, we need to provide an extension of the handler-
and handlerton APIs with virtual functions for handling serialized meta data.
Each storage engine can handle this information according to its preference, 
e.g.:

- NDB can ship the meta data across the cluster to other nodes for re-creation.
- InnoDB can store the meta data in tablespace files to make them self 
  contained. 

Other storage engines, e.g. MyISAM, can rely on the default implementation,
which will serialize the meta data and save it into a separate file, hence
providing the same benefits as the .FRM files for simple engines:

- Simple backup/restore by copying all files related to tables.
- Recover from a severe crash by re-creating the dictionary based on the 
  serialized dictionary data. 
- Retrieve needed table definitions from an offline server.

Support must also be provided for re-creating dictionary items based on
serialized meta data.


High level description
######################

With the new DD, there is still a need for serialized dictionary information for
the following purposes:

- Repair dictionary: If normal recovery does not succeed, it may still be 
  possible to get hold of dictionary information and table data and import this 
  into a new server instance. 

- View and browse dictionary definitions: Dictionary definitions can be 
  provided offline, independently of a running MySQL server.

- Simple (but unreliable) backup: Backup may be done by copying files, but is 
  unreliable since transaction consistency is not guaranteed. Restore must be 
  done by an explicit import command.

- Ship dictionary information: MySQL Cluster must be able to ship table meta 
  data to a remote MySQL server within the same cluster.

Thus, we need to support the following core functionality: 

1. Provide serialized meta data for various dictionary items, and support SE 
   specific handling of the data. The default handling will be to write the 
   serialized data to a single file. Specific engines may take other actions, 
   e.g. store it in a tablespace file (InnoDB) or ship it to a different 
   process (NDB). 

2. Provide an internal API for (re-)creating a dictionary item based on 
   serialized meta data. For external usage by an end user, re-creating 
   dictionary items may be done by means of explicit commands for importing 
   tables (or, for InnoDB, tablespaces). 

Changes in current functionality: Auto discovery
================================================

Auto discovery is the mechanism which is invoked when a requested table is not
defined in the dictionary. In 5.7, this happens if a requested .FRM file is not
present. Then, the storage engines implementing the discover() function in the
handlerton API are invoked and asked to provide the requested .FRM file somehow.
As of now, two storage engines support the auto discovery mechanism:

- MySQL Cluster will try to retrieve the file from the NDB dictionary.
- Archive will try to retrieve the file from within the corresponding ARZ data 
  file. 

For 8.0 and beyond, auto discovery is relevant when a requested table is not
present in the data dictionary; this will be the situation equivalent to a
missing .FRM file. Thus, MySQL Cluster will still depend on this mechanism for
propagating meta data. However, it is suggested to abandon the auto discovery
functionality for the Archive storage engine. DROP TABLE of an archive table
will remove the .ARZ file, so abandoning auto discovery will not break
recoverability afte dropping a table.

Changes in current functionality: Populating the dictionary cache
=================================================================

In 5.7, .FRM files contain meta data. On a cache miss, the .FRM file is looked
up in the file system; this is the way the dictionary cache is populated as of
5.7. This has the side effect that for some storage engines, it has been
possible to "import" tables by copying files into the proper directory. Below,
we will refer to this side effect as "auto import". 

With the new DD, this "auto import" side effect will disappear. The equivalent
of copying FRM files into the proper directory would actually be to allow
inserting rows into the data dictionary tables. For obvious reasons, this cannot
be allowed. Thus, we suggest to abandon the "auto import" mechanism and instead
provide an explicit SQL command for importing tables. 

Use Cases
=========

The following use cases may be relevant: 

Detect inconsistency when opening a table or importing a tablespace
-------------------------------------------------------------------

Storing serialized dictionary information makes it possible to detect
inconsistencies when opening tables, and when importing tablespaces (see below).
Checking for consistency should be done by the SQL
layer. Two types of consistency checks are relevant: Data vs. meta data, and
meta data vs meta data (i.e., serialized dictionary information compared to the
meta data stored in the  global data dictionary). 

Transportable tablespaces (InnoDB)
----------------------------------

Transportable tablespaces as of 5.7 is supported using a .CFG file for
transferring required meta data. With the new DD, the .CFG file will not be
required when importing the data. Instead, the SDI may be used to re-create
table(s) or validate the schema. One of the restrictions of transportable
tablespaces as of now, is that they are only guaranteed to work if the import 
and export is done on the same server version. Thus, directly importing a pre-
8.0 tablespace into a server supporting the new data dictionary is not 
supported, nor is import in the other direction.

With the new DD, the restriction regarding import and export being supported 
only within the same server version, will still apply as of now. However, in the
future, this restriction may be relaxed.

When importing a tablespace where the meta data refers to contents in other
tablespaces, import may be aborted or not depending on the circumstances, and
depending on the implemented support for multi tablespace operations. The
important thing in the context of this worklog is to keep the opportunity open
in the implementation to support multi tablespace operations in the future. 

Simple backup/restore or export/import (MyISAM)
-----------------------------------------------

For simple engines, tables may be backed up by copying files, as mentioned
above. InnoDB will rely on transportable tablespaces (see above) for achieving
this functionality. 

The files that are copied may not be transaction consistent. There is no special
support for ensuring transaction consistency in this scenario, this must be done
by other means, e.g. by issuing FLUSH TABLES WITH READ LOCK for the duration of
the data copying. Alternatively, an explicit export command may be provided.

For “restoring” the data, the files must be copied into the relevant location
under the server's data directory. Then, an explicit import command can be used
to read the serialized dictionary information and update the server's data
dictionary. 

Disaster recovery
-----------------

If the data dictionary and the data file where it is located are left readable
and consistent after a server crash, the server should start as usual without
any need for the serialized dictionary data. There will be a recovery mechanism
implemented within InnoDB, but this will not make use of the serialized
dictionary information. However, if the data dictionary cannot be read in the
usual way, parts of it may still be repaired using the serialized dictionary
information: 

- If the server fails to initialize the data dictionary, it may be started 
  without the system tablespace files, i.e., the files must be moved elsewhere, 
  or the data directory must be changed, forcing the server to initialize a new 
  system tablespace. This will be equivalent to starting a new server instance. 

- Tables and tablespaces may then be imported from the old server to the new 
  server instance.

If the serialized meta data in the definition files or the tablespace files is
corrupted, tool support may be needed to repair it. Such tool support is beyond
the scope of this worklog. 


Table definition shipping
-------------------------

MySQL Cluster is shipping meta data between the different MySQL servers within
the cluster to re-create the dictionary items in order to have the same
dictionary contents available on each server. Shipping serialized dictionary
information should be quite similar to shipping the .FRM files. When a table is
created, its meta data should be serialized and shipped to the other MySQL
servers in the cluster. On the receiving nodes, the meta data should be
submitted to a call to the new dictionary API in order to create the dictionary
entry for the table. Thus, this should work pretty much in the same way as
today. There are, however two situations adding some complexity: Backup/restore
and upgrade/downgrade (see below). 

Backup/restore
--------------

In addition to the simple file copy based backup/restore procedure mentioned
above, there are also other backup/restore solutions: 

- MySQL Enterprise Backup: This is a commercial product that is not part of the 
  MySQL server. It will be up to them to adapt to the new data dictionary if 
  any changes are needed. 

- Mysqldump: This utility performs logical backup which dumps the contents of 
  the server as SQL statements for re-creating the dictionary items, and also 
  with the data contents as insert statements. Thus, restore is done by 
  executing the resulting SQL statements via the mysql client. Since this is a 
  logical backup, it will most likely not be affected by the changes to the 
  serialization format. 

- MySQL Cluster: There is special backup/restore functionality that works only 
  for MySQL Cluster. This functionality is implemented by a backup command that 
  is part of the NDB management daemon, and a restore utility which is a 
  separate executable. The NDB data dictionary will be part of the backup. With 
  pre-8.0 server versions, this means that the .FRM files that are stored 
  within the NDB data dictionary will also be part of the backup. For MySQL 
  Cluster using server versions 8.0 and later, the new serialized dictionary 
  information will be stored in the NDB data dictionary instead of the .FRM 
  files, and hence, the new serialized data will be in the backup files too. 
  This means that backup/restore between MySQL Cluster versions that do not 
  have the same type of data dictionary (i.e., one version is using the old 
  data dictionary and the other version is using the new data dictionary) is 
  supported only by using a workaround as described below. 

In general, except when using mysqldump, restore of an old backup (from a MySQL
server not supporting the new data dictionary) directly into a server supporting
the new data dictionary is not guaranteed to work. The suggested  workaround is
the following: 

1. Start a MySQL server version compatible with the backup version (e.g. the 
   same version as the backup was taken from). 
2. Restore the backup into the MySQL server. 
3. Upgrade the MySQL server. 
4. Backup the server contents using the upgraded server version, now supporting 
   the new data dictionary (or alternatively, export InnoDB tablespaces, copy 
   MyISAM files, etc.). 
5. Make the data available to the target server by restore, import, or another 
   mechanism as appropriate. 

Upgrade/downgrade
-----------------

Upgrade/downgrade between versions that both support the new dictionary should
not be very different from the current situation in 5.7. The new data dictionary
should not present any new problems, regardless of the storage engine involved.
No special problems
should be present when running with MySQL Cluster either. The important thing to
consider is that new server versions must be able to de-serialize and interpret
old serialized meta data (please note that we are talking about the old server
still using the new serialized format, but since this format may change over
time, compatibility must be ensured). This is required for crash recovery,
importing tablespaces, restoring old MySQL Cluster backups (backed up by a
cluster version supporting the new data dictionary), etc. 

For upgrade/downgrade between versions that not both support the new dictionary,
there are two situations to consider: Upgrade with and without MySQL Cluster (in
either case, downgrade is not supported). 

Without MySQL Cluster
.....................

This scenario is based on offline upgrade where the server is stopped, a
dedicated tool is used to support the upgrade by generating meta data on a form
that can be understood by the new server version, and then, the new version of
the server is started. 

With MySQL Cluster
..................

The upgrade procedure for MySQL Cluster is to first upgrade the management
server(s), then all the NDB processes, and finally the MySQL servers. In the
context we are discussing here, context, the server upgrade is the difficult
issue. In theory, we could do the upgrade offline if downtime is acceptable,
however, for MySQL Cluster, we also need to be able to support online upgrade.
This means online in terms of the cluster as a whole, each individual MySQL
server will be temporarily down while being restarted, of course. Each MySQL
server process can be upgraded individually. However, in a cluster where there
are several servers running concurrently, there are additional problems: 

- The upgrade tool will generate serialized 
  dictionary information, which will then be imported into the new 8.0 server. 
  The old MySQL servers will not understand this information if receiving it, 
  so shipping this information must probably be disabled until upgrade is 
  finished. 

- For the same reason, it also makes sense to disable DDL operations for the 
  duration of the upgrade. 

- When importing the serialized meta data into the starting 8.0 server, the 
  serialized information will also be added to the data dictionary stored in 
  NDB. Since the NDB processes have already been upgraded, they will be able to 
  support this, but they also need to be able to support the old MySQL servers 
  until upgrade is finished. Hence, NDB may need to support holding both the 
  old and new serialized dictionary information blobs at the same time. 

It is up the the MySQL Cluster team how to handle this. The description above is
just a very high level suggestion. 

Responsibilities
================

This section will define the scope of this worklog. We will also outline what we
expect from other worklogs within the same release, and what we assume will be
relevant to implement in forthcoming releases.

Within the Scope of this Worklog
--------------------------------
Within the scope of this worklog (WL#7069), we will do the following:

- Define the format of the serialized dictionary information for tables and 
  their strongly related dictionary information.

- Define which contents of the data dictionary objects should be included in 
  the serialized meta data. 

- Define and implement an API to be used for serializing dictionary information 
  with the defined format and contents.

- Define a virtual extension of the handler API to be called when a table is 
  changed (created, altered or deleted), providing the meta data as a 
  parameter. 

- Provide a default implementation of this API writing serialized information 
  to a file.

- Define extensions of the handlerton API for handling tablespaces to support 
  maintenance of serialized dictionary information in the tablespace 
  dictionary. 

- Extend the tablespace handling functions to also maintain the serialized 
  dictionary information in the tablespace dictionary. 

- Extend current functions that create and alter tables and table related items 
  to also make use of the tablespace API and handler API extension. 

- Define and implement an API for (re-)creating tables and their strongly 
  related items based on serialized dictionary information. 

Within the Scope of other Worklogs
----------------------------------

We expect the following from other worklogs within the same release:

- Runtime/InnoDB: Write serialized data to file (into an .IBD file), i.e., 
  extend the handlerton API to provide necessary functions. 

- Runtime/InnoDB: Implement support importing and exporting tables and 
  tablespaces

- InnoDB: Provide support for extracting serialized dictionary information from 
  tablepace files.

- Server general: Make sure mysqldump will handle the new data dictionary 
  implementation. Being a logical backup, mysqldump is probably not affected by 
  the changes in the new data dictionary. 

- MySQL Cluster: Verify and implement support for backup/restore and 
  upgrade/downgrade. Modify current usage of .FRM file based functions 
  (writefrm etc.) to instead use functions in the new data dictionary API for 
  storing the serialized meta data in the MySQL server.

Definitions and Abbreviations
=============================

- DO: Dictionary object, i.e., an object in the DD cache representing, e.g., a 
  table. 
- PDO: Persistent dictionary object: All dictionary objects that survive a 
  system restart (i.e., not objects like those representing temporary tables). 
- SDI: Serialized dictionary information, meta data that is stored in addition 
  to the data dictionary itself.
- OID: Dictionary object identity, i.e., the "id" column from a DD table. 
- Strongly related dictionary objects: Aggregation or object pointers are 
  strong relations, e.g. table to columns.
- Weakly related dictionary objects: An object referring to another object by 
  means of its id data member, name or  similar has a weak relation to that 
  object, e.g. table to schema.

Functional Requirements
=======================
 
Below, there are separate sections for storage engine specific requirements
(InnoDB and MySQL Cluster), followed by general requirements grouped by overall
functionality. 

InnoDB requirements
-------------------

IR-01. For each tablespace, only SDI related to the tables contained in the
tablespace are stored. If a table is stored in several tablespaces, the SDI is
also stored in the same tablespaces as the table data. If a tablespace is spread
over several files, an identical copy of the SDI may be stored in each file. It
is up to the storage engine to decide how to distribute the SDI among the
tablespace files. 

IR-02. The SDI is stored in the tablespace files with the the OID and object
type as primary key along with a BLOB containing the SDI.

IR-03. There must be one SDI blob for each table in each of the tablespaces
where the table data is stored, and this blob shall contain all required
information that is strongly related to the table, including foreign key
constraints, indexes, etc. Having a blob for the entire table space would be too
much, and one blob per index would be too scattered. 

IR-04. In addition to the strongly related dictionary information specified in
IR-03, the containing schema SDI shall be serialized along with the
table to ease reconstruction on import. The tablespace SDI shall also be
serialized and stored in the tablespace dictionary. 

MySQL Cluster requirements
--------------------------

CR-01. There shall be a function to read the SDI from the global data
dictionary. This function will replace the current "readfrm" function. The SDI
shall be on a format that can be stored (a sequence of bytes). 

CR-02. There shall be a function for creating a new dictionary entry (the meta
data) using the SDI. This function will replace the current "writefrm" function.
[NDB does not really need a physical file containing the SDI, just some way of
creating the meta data (dictionary entry) using the SDI]. 

CR-03. There shall be a function to compare two SDI blobs to determine if they
are equal or not. This function will replace the current "cmpfrm" function. 

CR-04. It shall be possible to check if a table exists, given the complete name
of the table (complete name = schema + table name). A table exists if there is a
dictionary entry for it.

CR-05. The SDI shall contain the same information as the old .FRM files.
Required information not present in the dictionary objects as explicit fields
may be represented in the se_private_data field as key=value pairs. 

CR-06. It shall be possible to retrieve the table identity from the SDI. Table
identity is based on the complete table name (schema + table name), plus engine
ownership. This means that the SDI must contain these pieces of information.

CR-07. The mechanism of checking for an NDB file to determine engine ownership
shall be replaced by explicitly representing engine ownership in the meta data.

CR-08. It shall be possible to open an existing table using table identity
(complete name + engine ownership). 

CR-09. It shall be possible to force the data dictionary to swap the meta data
of an old table with a new one. 

CR-10. The SDI shall be architecture independent. The information is put into
the NDB backup files, and may be
restored on an architecture with, e.g., different endianness. Architecture
independence must be ensured for all information, including, e.g., column 
defaults. 

CR-11. There shall be support for iterating over existing tables. This is needed
during server start for checking for meta data staleness. 

CR-12. There shall be support for iterating over databases/schemata. This is
needed during server start for checking for meta data staleness. 

General requirements
--------------------

SQL statements and SDI Contents
...............................

FR-01. The following SQL statements shall have their implementation extended to
also manage SDI: 

- InnoDB only, not implemented yet: CREATE/DROP/IMPORT/EXPORT TABLESPACE
- CREATE/ALTER/DROP/RENAME/TRUNCATE TABLE
- CREATE/DROP INDEX

Thus, the following PDOs shall be supported in terms of generating serialized
dictionary information: Tables, indexes, and columns. Foreign keys will also be
included in the SDI when they are supported by the global DD. Additionally, for
tablespace operations, SDI must be managed, both for the tablespace itself, and
for the tables (and related information) present in the tablespace.

FR-02. The following PDOs shall be supported in terms of (re-)creation: All PDOs
specified in FR-01.

FR-03. The following information shall be stored in a SDI: The set of strongly
related dictionary objects for the PDO. Additionally, schemata and
tablespace information shall be stored for tables to support re-creation.

The server layer must ensure that all required fields will be present both when
writing and after reading back the SDI. The SDI should start with:

• dd.version
• engine name
• DO type

Retrieving SDIs Internally
--------------------------

FR-04. In a running server, it shall be possible to retrieve the SDI for any PDO
given by FR-01 at any time (lock waits must be considered, though). 

FR-05. The SQL layer shall provide functions for retrieving the SDI from a PDO.

FR-06. When a PDO listed in FR-01 is created, changed or deleted, the PDO, its
associated tablespace PDO, and resulting SDI shall be submitted as parameters to
a function pointer in the handlerton object, if provided by the storage engine. 

The exact negotiation between the SQL layer and the storage engine while
allocating, creating and serializing the meta data will be specified in the low
level design further below. 

Storing SDIs to File
--------------------

FR-07. If the function pointer variables in FR-06 have not been set by the
storage engine (have the value NULL) a default implementation
will write the SDI for the PDO to a file.

FR-08. The name of the file in FR-07 shall be generated by the server: 

- Use the character conversions currently used for the .FRM file names, but 
  restrict the conversion to e.g. the first 16 characters of the table name. 
- Add a string representation of the OID of the dictionary object and a file 
  name extension ".SDI". The OID ensures uniqueness, which is required since 
  several tables may map to same name.
- The relationship between the DD object name and the filename need not be an 
  invariant, i.e., the names may be changed independently. 
- The existing names may be kept on upgrade from 5.7 to 5.8. It will be 
  preferable to avoid renaming

Re-creating PDOs
---------------

FR-09. It shall be possible to repair table meta data using stored SDIs after a
server crash where the data dictionary is left corrupted, or in other ways is
unreadable. We must assume that the referenced bits in the tablespace file(s)
can be read. This repair mechanism is the last resort of crash recovery if the
InnoDB internal recovery mechanism cannot recover the dictionary. 

FR-10. The contents of the .SDI files may be edited to ensure correctness and
consistency. The SDI blobs stored in the tablespace files may not be edited.
Editing support is outside the scope of this worklog. 

Upgrade
-------

FR-11. The SDI format must be versioned in a way that ensures backwards
compatibility for dictionary items where new functionality has not been applied.
If a server is upgraded, the SDI blobs must be upgraded to match the format of
the new version. The SDI format version number will be the same as the DD
version number stored in the dd.Version table. 

FR-12. When upgrading from a previous MySQL server version not supporting the
new DD, to a MySQL server version supporting it, an external offline tool must
be run to generate the new data dictionary. The tool will generate SDI
representing the old meta data, and the SDI may then be imported into the new
server. 

Requirements for serialization software
=======================================

We may make use of third party software for generating the actual serialized
representation. The following requirements are relevant.

Top priority requirements
-------------------------

PR-01. The software shall support serializing an object structure detecting
multiple inclusion of the same instance, cycles, etc. according to the needs of
the data dictionary structures, i.e., there is no need to implement handling of
situations that will not occur. 

PR-02. The software shall generate a platform independent representation. 

PR-03. De-serializing SDIs from a supported DD version shall be supported.

PR-04. Extending the schema definitions of the new data dictionary shall be easy
to support as far as SDI handling is concerned. Adding new function calls for
(de)serializing added members, and implementing new functions for 
(de)serializing added classes is acceptable. 

Highly recommended
------------------

HR-01. Interfacing to external tools shall be supported and encouraged. It shall
be possible to interpret the information for usage in other tools, and
preferably even in other programming languages. Thus, either the representation
must be based on an open standard, or the software used for (de)serializaton
must be available as a separate utility, library, or even in a different
programming language. 

HR-02. The solution shall support querying object fields without de-serializing
into the complete data dictionary object structure. NDB may need to retrieve
e.g. table names to check for table existence, other engines may want to
retrieve the engine private data to verify that the required capabilities are
provided, etc. 

HR-03. It should be possible to submit a non default memory allocator, to
support allocating on memroot rather them on the heap.

External interfaces (visible to users)
======================================

EI-1. New file format: Serialized dictionary information on a JSON format,
replacing the .FRM files.

EI-2. Protocol extension for NDB (MySQL Cluster responsibility). May simply
continue using the same protocol, but with different serialized meta data. 

EI-3. Error messages: New error messages will be required. 

Internal interfaces
===================

II-1. Extension of the handlerton API: Extensions to provide engine specific
handling of tablespace meta data changes, using the serialization API. 

II-2. Extension of the handlerton API: We also need handlerton extensions for
engine specific handling of serialized meta data, regardless of object type
(e.g., InnoDB will store it in a tablespace, NDB will ship to other servers, etc.

Serialization software
######################

Classes of alternatives
=======================

There are several fundamentally different approaches to this: 

1. The classes to be serialized can be defined in a dedicated language, and C++
code (with class definitions, serialization support, etc.) can be generated for
these definitions. An example of this is the Google Protobuf implementation. 

2. Existing class definitions can be extended with third party support for
serializing the objects directly. Thus, code must be implemented to explicitly
define which members should be serialized, but the actual serialization and
encoding is handled by the third party software. The Boost serialization API is
an example of this. 

3. Existing class definitions can be extended with transformations into an
intermediate representation which has third party support for serialization into
a standardized format. Here, as in the previous item, own code must be
implemented to say which members to include in the serialization. However,
unlike the previous item, the transformation into an intermediate representation
may need type conversions, handling platform dependencies, etc. An example is
using a JSON based approach, e.g. rapidjson, or a slightly more general
mechanim, like the Boost property tree, which can generate JSON as well as XML. 

Using alternative 1 to serialize the new data dictionary objects directly into a
binary representation will probably be unacceptable since it would mean that the
dictionary classes would be generated by an external tool. We could use the
external class definition language to define classes to be used in an
intermediate representation, but it would require some work to keep the
definitions open to support extensibility, and it would be of little advantage
compared to mechanism 2 or 3 above. 

Alternative 2 provides good support for all the absolute requirements, but is
weaker on the recommended requirements. In particular, the stored representation
is not likely to follow an open standard, making it hard to interpret the
serialized date in external tools. 

Alternative 3 will need special care in the implementation to support PR-01, but
is likely to provide good support for the other requirements. Thus, this is the
alternative we would like to choose. 

We have chosen to use rapidjson due to good performance and functionality
provided (SAX + DOM). 


Overall design
##############

The implementation can be divided into 3 primary tasks:

1. Implement a serializer which has an overloaded serialize functions accepting
objects of various types, including 
- Tables
  - Columns
  - Indices
  - Foreign keys
  - Partitions

- Schema
- Tablespace

The serialize method will serialize the object itself and all its
strongly related dictionary objects. Each DO impl type that is to be serialized
will have virtual member functions for serializing and deserializing itself and
closely connected objects. This implies that the (de)serialize mfs will not be
part of the DO interface.

Additional logic (setting up rapidjson etc.) will be placed in functions that
will call the virtual member functions.


2. Extend the handlerton API to support engine specific handling of serialized
meta data. For InnoDB, the handlerton api will provide the interface for
storing/retrieving SDI from tablespace files.

For other engines, specific behavior for storing and retrieving
the SDI may also be put in the handlerton interface.

The handlerton API for storing and and retrieving sdi-blobs
should be private, meaning that client code using the new dictionary
should not have to know details of how the handlerton for a
particular storage engine manipulates sdi-blobs. 


3. Change client code to use new API. 

We will later:

- Implement new commands for importing MyISAM tables and InnoDB tablespaces. 

- Implement new server command line options to scan for MyISAM files and InnoDB
tablespace files. 

- Extend import to make use of the new SDI items for verification or
re-creation of the dictionary item. 


Low Level Design
################

Definitions:

DO_TYPE_ID: This is used in this worklog as a placeholder for a
datatype that will be used to identify dictionary objects. Most of the
current dictionary code appears to be using a )-string for this. But this may change as part
of the new cache implementation. 

Below, the overall items outlined above are explained in more detail. 


Serialization
=============

Need to provide functions accepting an object in the global data
dictionary as a parameter, and will generate a json string
representing the object. All json strings must contain the dd version
which created it as a top-level attribute of type
.


Scope
=====

1. All top level dictionary objects; Table, Tablespace, Schema, can be
the starting point for serialization. The table blob may refer to the
schema by means of the Object_id. The added information may be used
on de-serialization for validation or re-creation of the items.

2. All strongly related dictionary items will be transitively
serialized into a nested structure. Weakly related child objects are
represented by their Object_id

3. The serialization may be extended to handle multiple references to
the same item, cycles, etc. depending on the needs.

4. Events, collations and charsets will not be serialized. But a
serialized schema will include their Object_id.

5. Triggers may be serialized in the future, but this is not included
in the scope of this worklog.


SDI data type
=============

It would be convenient to have a datatype for the sdi-blobs when
manipulating them in C++ code. The sdi-api uses a
(char*, uint64) pair for this. Since the SDI-blobs will contain json
it would be convenient to have a string-like type to ease parameter
passing and memory management. An utf-8 encoded unicode string should
be able to hold any json string we generate, and this can be stored in
an std::string (but not all positions in that string is necessarily av
valid codepoint). So an sdi-blob type could be as simple as:

typedef std::string sdi_t;

Identifying SDIs
================

The interface use an SDI_KEY, where SDI_KEY is defined as follows

/* The following structure is required to uniquely identify a
Serialized Dictionary Info Object */
struct sdi_key {
	uint64 id;   
	uint32 type; 
};

typedef struct sdi_key SDI_KEY;

to identify an SDI.

Note that unless SDI_KEY and DO_TYPE_ID can be merged, there needs to
be a mapping between the two.


Mapping Dictionary Objects to Tablespaces
=========================================

Each of the three top-level dictionary objects which can be the start
point for serialization must (for InnoDB, at least), be stored in one
or more tablespaces:

- Tablespace SDIs are stored in the tablespace they describe.

- Table SDIs are stored in the same tablespace as the table itself.

- Schemas SDIs must be stored in all tablespaces that contain a table
  which belongs to the schema. This is a bit tricky as we don't have a
  schema -> table reference in the dictionary. So to compute the set
  of tablespaces to which we must store a schema SDI we need to (pseudo
  code):

std::set tablespaces;
for (auto t : dd::tables) {
  if (t.getSchema() == s) {
     tablespaces.insert(t.getTablespace());
  }
}

for (auto t : tablespaces) {
    sdi_set(t, schema_sdi, ...)
}


Serializing Object Relationships
================================

There are 4 different object relationships among the dictionary
objects which will be serialized:

1) Owning (Parent-Child) Tight connections:

Table => Column
Table => Index
Table => Foreign Key
Table => Partition

Column => Column_element

Index => Index_element

Foreign_key => Foreign_key_element

Partition => Partition_value
Partition => Partition_index


Embedded in the the serialized parent.

2) Non-owning Tight connections:

Partition_index --> Index
Foreign_key     --> Index
Foreign_key     --> Column
Index_element   --> Column

Represented by raw pointer member variables.

These connections require special attention as the raw pointer cannot stored
directly in the SDI, as the pointer value would be incorrect after
deserialization into new objects. Since the pointees always have an ordinal in
its parent's list which can be used as a logical pointer, it is sufficient to
include this ordinal position in the SDI in order to be able to obtain the (new)
pointer value during deserialization. 

3) Loose connections:

Abstract Table  --> Schema
Table           --> Tablespace
Partition       --> Tablespace
Partition_index --> Tablespace
Index           --> Tablespace

These connections are represented by the associated object's Object_is in the 
dictionary object and the corresponding dictionary tables. In general, it is 
unlikely that this Object_id is still a valid reference at the time of 
deserialization. To mitigate this the connection is instead represented by the 
associated dictionary object's Entity_object::name() attribute. Note that this 
does not guarantee that the identified object is "correct", as in identical to 
the object which was referenced at the time of serialization. The code invoking 
deserialization must make sure that the objects referenced by name exist and are 
suitable.

Initially, deserialization will fail if the references cannot be resolved. Fall 
back to defaults and/or explicitly provided alternatives, may be added later if 
there are use-cases which require this.
 

4) References to objects that are not serialized: 

Table -> Collation
Schema -> Collation
Column -> Collation

Represented by Object_id. The assumption here is that the id of the
non-serialized object will not change. If new ids are added and an SDI
containing such new ids (e.g. new collations) are imported into a server version
which does not have those collation ids, a default id has to be chosen/an error
reported.

4) Back-references, i.e. references back to the closely connected
parent object:

Table     <- Tablespace file
Table     <- Partition
Table     <- Foreign_key
Table     <- Index
Table     <- Column
Partition <- Partition_value
Partition <- Partition_index
Index     <- Index_element
Column    <- Column_type_element
Collection <- Column_type_element

These are omitted to avoid cycles, and are recreated during deserialization.


Representing Binary Values in Json
==================================

Binary values (where endianness matters) will be stored as base64
encoded strings in Json. The DD contains 3 such values:

- Partition_value_impl::value
- Column_type_element::name (*)
- Column_impl::default_value

In the following we use the notation  for such values.

(*) This value is not really binary, but it is a string value which
can be in any character set, and for this reason it is stored as
VARBINARY in the dictionary. If we can be sure that any such string
can be safely stored and retrieved as sequence of bytes, it may not be
necessary to use base64 encoding here.


Json Schema for Dictionary Objects
==================================

Sdi_header :=
  "dd_version": ,
  "dd_object_type": , TBD
  "dd_object": Schema | Table | Tablespace

Schema := {
  "name": , // Entity_object_impl::m_name
  // reference to non-serialized object
  "default_collation": ,
  "created": ,
  "last_altered": 
}

Tablespace_file := {
  "ordinal_position": ,
  "filename": ,
  "se_private_data": 
}

Tablespace := {
  "name": , // Entity_object_impl::m_name
  "comment": ,
  "options": Properties,
  "se_priavate_data": ,
  "engine": ,
  "files": [ Tablespace_file, ...]
}

Partition_value := {
  "max_value" : ,
  "null_value" : ,
  "list_num" : ,
  "column_num" : ,
  "value" : ,
  //"partition" : back-reference to tightly connected owner omitted
}

Partition_index := {
  "options" : ,
  "se_private_data" : ,
  //"partition" :  back-reference to tightly connected owner omitted
  "index" : 
  "tablespace" : ,
}

Partition := {
  "name": , // Entity_object_impl::m_name
  "level": ,
  "number": ,
  "comment": ,
  "options": ,
  "se_private_data": Properties,
  // "table": back-reference to tightly connected owner omitted
  "values": [Partition_value, ...],
  "indexes": [Partition_index,...],
  "tablespace": 
}

Foreign_key_element := {
  // "foreign_key": back-reference to tightly connected owner omitted          
  "column": ,
  "ordinal_position": ,
  "referenced_column_name": 
}

Foreign_key := {
  "name": , // Entity_object_impl::m_name
  "match_option": ,
  "update_rule": ,
  "delete_rule": ,
  "unique_constraint": ,
  "referenced_table_catalog_name": , 
  "referenced_table_schema_name": ,
  "referenced_table_name": ,
  // "table": back-reference to tightly connected owner omitted
  "elements": [Foreign_key_element,...]
}

Index_element := {
  "ordinal_position" : ,
  "length" : ,
  "order" : ,
  "options" : ,
  "hidden" : ,
  //"index" : back-reference to tightly connected owner omitted
  "column" : 
}

Index := {
  "name": , // Entity_object_impl::m_name
  "hidden": ,
  "is_generated": ,
  "ordinal_position": ,
  "comment": ,
  "options": ,
  "se_private_data": ,
  "type": 
  "algorithm": 
  // "table": back-reference to tightly connected owner omitted
  "elements": [Index_element,...],
  "tablespace": ,
  "engine": 
}

Column_type_element := {
  "name" : , // Column_type_element_impl::m_name
  "index" : ,
  //"column": back-reference to tightly connected owner omitted
  //"collection": back-reference to containing collection omitted
}

Column := {
  "name": , // Entity_object_impl::m_name
  "type": ,
  "is_nullable": ,
  "is_zerofill": ,
  "is_auto_increment": ,
  "hidden": ,
  "ordinal_position": ,
  "char_length": ,
  "numeric_precision": ,
  "numeric_scale" : ,
  "datetime_precision" : ,
  "default_value_null" : ,
  "default_value" : ,
  "default_option" : ,
  "update_option" : ,
  "comment" : ,
  "options" : ,
  "se_private_data" : ,
  // "table": back-reference to tightly connected owner omitted
  "enum_elements" : [Column_type_element,...],
  "set_elements" : [Column_type_element,...]
}

Table := {
  "name": , // Entity_object_impl::m_name
  // Abstract table members
  "version" : ,
  "created" : ,
  "last_altered" : ,
  "options" : ,
  "columns" : [Column,...],
  "schema" : ,

  // Table members
  "hidden" : ,
  "engine" : ,
  "comment" : ,
  "default_values" : ,
  "se_private_data" : ,
  "partition_type" : ,
  "partition_expression" : ,
  "default_partitioning" : ,
  "subpartition_type" : ,
  "subpartition_expression" : ,
  "default_subpartitioning" : ,

  "indexes" : [Index,...],
  "foreign_keys" : [Foreign_key,...],
  "partitions" : [Partition,...],

  // reference to non-serialized object
  "collation" : , 
  "tablespace" : 
}


Serialization Using Rapidjson
=============================

In Rapidjson the conversion of types to its Json string representation
is handled by a rapidjson::Writer, which expects to send its output to
something which implements the rapidjson stream concept. Rapidjson
provides a StringBuffer with UTF-encoding and the ability to use
custom allocators which implements this concept.

This leads to the question of whether there needs to be a serializer
class (and object), or if free (overloaded) functions of the form

serialize(rapidjson::Writer *w, T t)
{
  w->Int(t.getX()).Bool(t.getY()). ...
}

would be sufficient. When using rapidjson there does not seem
to be a state that would need to be captured by a Serializer object as
that is already handled by the rapidjson::StringBuffer.

Rapidjson's allocator concept is simpler than that of std C++, and
allows, IIUC, the use of stateful allocators, so writing a wrapper for
using memroot should be doable.

In order to pass the resulting string to the sdi-api it is necessary to
materialize the whole sdi in memory. 
Assuming that the average dictionary object is less than 1 kb,
and Innodb imposes a limit of 1000 columns per table and we assume that the
number of indices, foreign keys and (partitions) are not any greater,
we end up with a maximum size for the json string around 4 Mb, which
should not be a problem.

De-serialization Using Rapidjson
================================

When using Rapidjson to parse a json document, the result is a
rapidjson::Document which contains a graph (tree) of rapidjson value
objects. In order to create/populate a dictionary object based on this
information it will be necessary to traverse this structure and
identify the various dictionary objects described and intstantiate and
populate them based on the information in the json object graph. The
new dictionary objects are not connected or checked against the
dictionary, and the client which is invoking de-serialization is
responsible for merging the returned object graph with objects
already existing in the dictionary, if applicable.

T* de_serialize(const std::string &json_doc) 
{
  Document tdoc;
  // Use rapidjson to parse json string
  if (tdoc.Parse<0>().HasParseError(json_doc.c_str()))
  {
    return std::null_ptr; // error handling?
  }
  
  // traverse object graph and create and populate 
  T *skeleton = dd::create_object(); 
  skeletonp->setX(jsonObject["key"].GetInt())
  [...]
  return skeleton;
}

Pointer-Hydrating after De-serialization
========================================

When an DO contains raw pointer to another object which will be
serialized into the same sdi, it becomes necessary to somehow restore
those pointers when deserializing. The deseriailization process will
re-create the objects being pointed to, but we are not guaranteed that
the referenced object has been created before we create the object
referencing it.

To overcome this we choose to include the Object_id for all objects
(even if this value likely is incorrect for the DD which the objects
are being created) when serializing, and replace raw pointers with
the pointee's Object_id. This way we can create a mapping Object_id ->
Object during deserialization, and use this update the raw pointer
variables so that they refer to the correct new object.

Serialization Example (InnoDB)
==============================

ALTER SCHEMA CHARACTER SET = 'utf8';
-> Change dd::Schema_impl.default_collation
-> UPDATE SCHEMATA SET default_collation_id =  WHERE id = <>
-> Serialize dd::Schema_impl to schema_SDI
-> Iterate across tables in schema, and do
--> sdi_set(table.tablespace(), ..., schema_SDI, ...)) 

ALTER TABLE ADD COLUMN ... ;
-> Add new dd:Column_impl object, including any associated Column_type_element
objects.
-> Add pointer to Column_impl object to dd::Table_impl
-> INSERT INTO COLUMN_TYPE_ELEMENTS VALUES (...)
-> INSERT INTO COLUMNS VALUES (...)
-> Serialize dd::Table_impl to table_SDI (will include the strongly connected
new Column_impl and Column_type_impl objects)
-> sdi_set(t.tablespace(), ..., table_SDI, ...)


Handlerton extensions
=====================

The handlerton must be extended to provide storage engine
specific handling of the serialized data:

The following is a preliminary suggestion that will need to be adapted later:

  bool (*sdi_set_schema)(const class dd::Tablespace *tablespace,
                         const class dd::Schema &schema, const dd::sdi_t sdi);
  bool (*sdi_set_table)(const class dd::Tablespace *tablespace,
                        const class dd::Table &table, const dd::sdi_t sdi);
  bool (*sdi_set_tablespace)(const class dd::Tablespace &tablespace,
                             const dd::sdi_t sdi);

  bool (*sdi_delete_schema)(const class dd::Tablespace *tablespace,
                            const class dd::Schema &schema);
  bool (*sdi_delete_table)(const class dd::Tablespace *tablespace,
                           const class dd::Table &table);
  bool (*sdi_delete_tablespace)(const class dd::Tablespace &tablespace);


Usage of the SDIs will require functions to retrieve SDIs from InnoDB
tablespaces (proposal):

- std::string handlerton::dd_get_sdi()
  Need to get tablespace id and SDI key.
  This needs to wrap a call to sdi_get_len_by_id() followed by an allocation, 
  followed by a call to sdi_get_by_id()

  and as a result it must be able to extract tablespace id and sdi_key from x.


Changes to Existing Code
========================

Existing code will invoke interact with serialization by storing and deleting
SDIs. This will happen in the following (member) functions in defined in
table_share.cc:

Table_share_utils::create_dd_user_table(...)

dd_remove_table(...)

dd_rename_table(...)

There are some specific challenges related to storing SDIs in files:

* File operations are not transactional, so a "best effort" approach must 
employed to try to avoid losing valuable information in case of a
failure/abort. Typically this is achieved by writing to a temporary file and
then move/rename this file to its proper when it "looks certain" that the
operation will succeed.  The assumption being that the move/rename will not
fail, or at least be atomic.

* The naming of SDI files (as mandated by FR-08 above) implies that only those 
rename operations which modify the first 16 characters of the table name will
actually store the new version of the SDI in a new file name, (the OID does not
change even if the object is later modified). In all other cases the file name
is unchanged so that a plain write of the new SDI risk destroying the old
version if the write were to fail. For DDL operations that actually change the
name, the old SDI file will be intact, but will have to be removed when the new
file has been successfully written to avoid having garbage SDI files left in
the data directory. 

Approach:

If the name of the SDI file does not change; do
  When the final object version is available, create the SDI and write it to a
temporary file name.
  When the transaction is "known to commit", move the temporary table name to
the correct one. This will then both remove the temporary and the old version
of the SDI file.

If the name of the SDI file DOES change; do
  Store the name of the old SDI file somewhere (tied to THD?)
  When the final object version is available, create the SDI and write it to
the new file name. 
  When the transaction is "known to commit", remove the old SDI file.

The current approach where write_sdi() is called from dd::rename_table will
require that the old SDI file name be stored since dd::rename_table only sees
the old table name and at the intermediate name, or the intermediate name and
the new name. Even if the old name gets stored it is not clear that it is ok to
rename SDI file and remove the old SDI file at the end of the last
dd::rename_table.

This approach also implies that it must be possible to identify the
intermediate tables as here is no point in writing SDIs for intermediate
tables, as the name of the table is part of the SDI. I.e. and SDI for an
intermediate table has the wrong content, so it cannot simply be renamed to the
proper name at the end of the transaction.