WL#9538: InnoDB_New_DD: Integrating InnoDB SDI with new DD

Affects: Server-8.0   —   Status: Complete

Main objective of this WL is to integrate new DD with InnoDB new dictionary branch.

This WL will handle the below cases:

1. Removal of one SDI copy in InnoDB tablespaces

2. Fix the integration issues. Make sure SDI in InnoDB tablespace 
   is updated after ALTER.

3. On upgrade from 5.7, SDI is created for tablespaces

1. SDI should be present in all InnoDB tablespaces except
   temporary tablespace(ibtmp*) and undo tablespaces
2. SDI should be updated after ALTERs
3. ibd2sdi should be able extract SDI from InnoDB tablespace when server
   is offline
4. SDI stored in tablespace is in JSON format and written by Server.
   So the output from ibd2sdi should be JSON parseable.

1. ALTER performance shouldn't be degraded because of SDI.
Reasons for changing design:

1. The original requirement of two copies is not met

   The original requirement from support is that, we should have two copies
   of SDI and these two copies should be far away in .ibd file. So even if one
   SDI is corrupted because of bad disk(page corruption), there are more 
   chances to recover if the SDI copies are far.

   This far requirement cannot be met. So there is equal chance of
   corrupting both copies. 

[Proposed Change]: Remove one copy

2. Increase in .ibd size
   Each copy will create one root page. Assuming 16K page size, 
   32K size increase per .ibd file.

   This might look OK. But with 1 million tables, the overhead is 20G for SDI

   For file per table tablespaces, there will be only one entry 
   (or at max 2-3 entries). So even one SDI index page is overkill on
   storage space.

   We cannot go for further optimization because the SDI storage promised is
   "transactional". To be transactional, undo logging & redo logging is

   And undo logging is not possible without an index page. So, we have to 
   live with one SDI index page even for file_per_table tablespaces.

[Proposed Change]: One copy will reduce the size but still 10G overhead.

3. Trying to be too perfect

   Perfect here means, even after a DDL crash, the aim is to have one copy of
   SDI as it was before DDL and another copy of SDI after WL.

   We realized that the main purpose of SDI is to extract data when we only
   have an .ibd. So as long as we can get the schema for clustered index,
   we can extract clustered index data and rebuild other indexes. 

   For file per table, when a clustered index is changed, we will always have
   intermediate .ibd. This intermediate ibd will contain new SDI.
[Proposed Change]: See below

4. Flushing tricks assumed in design

   It is assumed that we could flush one copy of SDI before trx commit and
   another after trx commit.

   This is not straight forward. We need more tweaks in flushing to achieve
   this. Lesser the tweaks, the better

[Proposed Change]: Since the changes to SDI pages are undo & redo logged, there
   is no need to flush SDI pages during ALTER. But it is required only for
   those cases where undo & redo is not available. SDI will not be consistent
   without applying undo & redo.
   For version 1, we will not do any extra flushing of SDI pages. Undo & redo
   has to be available to make SDI consistent.

5. No version for SDI

   One mistake in existing design is that, there is no version tag for on-disk
   SDI. So it is hard to change formats of SDI.

[Proposed Change]: Since one copy is removed, we will use that 4 bytes as
   version number at FIL_PAGE_FILE_FLUSH_LSN offset on pages 1 & 2.

Summary(tasks in this WL)
1. Remove one SDI copy. Remove all code which deals with two copies.

2. Use 4 bytes of FIL_PAGE_FILE_FLUSH_LSN offset on pages 1 & 2 as version 1

3. SDI updates should be done in the same trx used for DDL. There will be only
   single transacation that covers entire DDL.
   For integration purpose, we will
   create a internal transaction to do SDI changes.

4. InnoDB will compress the SDI and then store it. This will reduce the size
   of the SDI to be stored. This applies to uncompressed tablespaces as well.

   ibd2sdi will decompress the SDI first.

   To handle compression, we need two extra fields compressed_len and
   ucompressed_len. So the schema of SDI table becomes

          type             INT UNSIGNED,
          id               BIGINT UNSIGNED,
          uncompressed_len INT UNSIGNED,
          compressed_len   INT UNSIGNED,
          compressed_data  BLOB NOT NULL,
          PRIMARY KEY (type,id));

    Compression library used is zlib.

5. Enable SDI in mysql tablespace (mysql.ibd)

6. API changes:

   i) Change dd::sdi_key_t from (id, type) to (type, id)
      After this change, objects of same type will be stored together.
      In future, accessing objects of same type could be made faster,
      because they are stored continuously

      Since the format of SDI is different, ibd2sdi tool should handle this.

   ii)Change return types of SDI APIs from bool to uint32_t. This allows APIs
      to return more error code. Zero on success, Non-zero on failure

  iii)Add dd::Table* to SDI APIs
      Unlike other SE's which stores SDI in files, InnoDB stores SDI in

      So a dd::Table object can be serialized and stored only after it has
      valid se_private_id. This is because of presence of valid se_private_id
      indicates table is created, so a tablespace (implicit) is also created.

      Similary for partitions, a valid se_private_id of all partitions indicate
      the tablespaces for partitions are created.

      So sdi_set() can be called twice for same dd::Table object. Once without
      valid se_private_id and later with valid se_private_id.

      SE is responsible for this check and it should avoid storing sdi if
      dd::Table or Partitions do not have valid se_private_id.

      Thus, to do these checks, dd::Table* is added to SDI APIs
      sdi_set() & sdi_delete()

      Side notes:
      For partitions, the SDI is stored as a whole(includes all partition info).

      For example, if table has 10 partitions, and lets say each partition goes
      to a separate tablespace(file_per_table), then SDI will be stored 10 times    
      in all the partitioned tablespaces. 

      SDI is with all partitions info, stored in each tablespace.

      Another example: 
      If there is table with 10 partitions, 4 of them in one tablespace
      (general tablespace) and remaining 6 as file_per_table tablespaces.

      In this case, there will be 7 sdi_set(). Once per each tablespace of each
      partitions. For tablespace where it is already stored, duplicate stores
      are avoided.

      In the above case, where 4 partitions are in one general tablespace,
      there will only one sdi_set(table_blob_with_all_partiton_info);
      instead of 4.