WL#7053: InnoDB: Provide storage for tablespace dictionary

Affects: Server-8.0   —   Status: Complete   —   Priority: Medium

Introduction
------------------------------------------------------------------------
This worklog is about providing transactional storage for serialized data
dictionary information within persistent InnoDB tablespaces. The data dictionary
will be stored in all InnoDB tablespaces that contain
persistent data dictionary objects, except in the Data Dictionary Tablespace.

Definitions
------------------------------------------------------------------------
Shared Tablespace
 A tablespace that can contain multiple tables.

User Tablespace
 A tablespace created by an SQL command that contains user created data
 and that uses an extension '.ibd'.

Single Table User Tablespace:
 A User Tablespace which is created for a single table in a CREATE TABLE
 command when innodb_file_per_table=ON.  

Shared User Tablespace:
 A User Tablespace that can contain multiple tables and is created with
 CREATE TABLESPACE.  In future releases they could contain
 separate indexes and off-page columns.

System Tablespace:
 This tablespace is the default location of user tables as well as all other
 InnoDB system page types.  The first data file in this tablespace is called
 'ibdata1' by default.

Data Dictionary Tablespace:
 A shared tablespace where the data dictionary tables are stored.
 This may be different from the system tablespace.

Shared Temporary Tablespace:
 A tablespace containing temporary data. This tablespace is streamlined for
 performance so it is not redo logged, not covered by persistent undo logs,
 and is not linked to a persistent data dictionary. Since temporary tables
 are not persistent, temporary tablespaces will not contain serialized
 dictionary information.

Undo Log Tablespace:
 A persistent or transient tablespace that contains undo log pages.
 As these do not contain any data dictionary items, they do not contain
 serialized dictionary information either.

Tablespace object:
 Currently, a tablespace object describes a complete table but
 in the future, we could;
 * Allow off-page columns such as blobs to be stored in a separate
   tablespace than the clustered index leaf pages.
 * Allow a secondary index to be stored separate from the clustered
   primary data records.

SDI
 Serialized Dictionary Information - The BLOB record that this worklog
 provides storage for.

Background
------------------------------------------------------------------------
Currently, InnoDB stores all dictionary information in the system tablespace
(eg ibdata1). There is no dictionary information in the user tablespace
(eg t1.ibd, t2.ibd etc).

InnoDB user tablespaces do not contain any metadata that would allow it to
interpret the contents of the user tablespace, or to detect mismatch with the
data dictionary (residing in system tablespace) when opening the user tablespace.

Purpose
-------------------------------------------------------------------------
The reasons for providing an API and a storage location for serialized
dictionary information are as follows:

* Make tablespaces (single-table and shared tablespaces) self descriptive.
* Provides redundancy (one copy in the Global DD tablespace, multiple copies
  in tablespaces containing the data).
* Server core can validate the tablespace dictionary to detect mismatches.
* External tools can dump the tablespace dictionary.
InnoDB will not be aware of the structure of the Data Dictionary information
passed from the server, hence it is a SDI BLOB to InnoDB.

The SDI describes a data dictionary object.

InnoDB will not detect mismatches or resolve conflicts between the SDI in the
tablespace and the data dictionary tables.

There will be either zero or multiple copies of the SDI in each tablespace
containing persistent data dictionary objects, except in data dictionary
tablespaces.

Each copy of the SDI BLOB will be stored in a transactional pseudo-table
consisting of a clustered index only. Modifications are made with undo logging,
ROLLBACK and MVCC.

Each pseudo-table is assigned a 64-bit table_id that is derived from the 32-bit
space_id. For example for the kth copy of the table;

    #define sdi_table_id(k) ib_id_t(-k) << 32 | space_id

The sdi_table_id will only be used for undo logging.  For this reason they will
not be stored into dd.tables.se_private_id.  Also, the table_id of normal tables
stored in dd.tables.se_private_id must not conflict with sdi_table_id.

When a database is upgraded from an earlier MySQL version, the SDI may be
missing. Likewise, when importing a user tablespace from an earlier version,
there will not be any SDI. InnoDB will work whether or not the SDI is available.
This simplifies the upgrade from older MySQL versions, because it will
not be necessary to access or modify the user tablespace files during the upgrade.

There will be a flag in the first pages of the tablespace to indicate if the SDI
is present or not.

Earlier versions of MySQL must refuse to open tablespaces that contain SDI. When
downgrading or exporting to an earlier version, the SDI can be stripped,
possibly with an external tool.
InnoDB will not be aware of the SDI structure and will not do anything
automatically about these SDI BLOBs. The MySQL server should use the new API to
control the SDI creation or manipulation with a DDL transaction.

The data dictionary information for a given tablespace is split into several
records, so that we can avoid rewriting all of the data when an index or table
is created or dropped in a shared tablespace that contains thousands of indexes
or tables.

The SDI records will be stored in one clustered index B-tree, similar to a
regular table that only defines a primary key, no secondary indexes. The primary
key will allow quick access to tablespace objects in a tablespace.

In case a record does not fit in the clustered index leaf page, it may be stored
off-page (a linked list of BLOB pages).

The SDI B-tree can be present even in file-per-table .ibd files and in any
tablespace that can contain DD objects.  This will help us in using a common
interface for all tablespace types.

InnoDB will prevent the writing of SDI only to Temporary tablespaces.
It is not expected that SDI will be written to a Global Data Dictionary
tablespace, but InnoDB will not prevent it.

Undo-Logging & Redo-Logging
----------------------------
There are five types of pages related to tablespace dictionary storage.
① clustered index leaf pages (containing the actual data records)
② BLOB pages (if some record is too long to fit in a leaf page)
③ clustered index node pointer pages (search structure)
④ file segment inode pages (keeping track of allocated pages)
⑤ allocation bitmap pages within the tablespace

To keep global DD and tablespace DD in sync, all the changes to the above
pages will be done in a *single* user transaction, even though it may have
multiple mtr_commit's.

We follow the existing semantics of a transaction. i.e. We redo-log all
changes to the pages, and we will write an undo-log record before starting to
modify a clustered index record.

Since the external tool which reads tablespace dictionary expects the SDI
pages to be in a sane state without having access to the redo and undo logs,
we will have to treat the SDI metadata table specially
and flush the data pages (①,②,③) ASAP. The pages ④,⑤ containing allocation
meta-data will not need special treatment, as the tool would only be
interested in the data pages.

New Page Types
--------------
The tablespace dictionary will use new page types to differentiate from the
normal INDEX pages.

FIL_PAGE_SDI                  /*!< SDI Index page */
FIL_PAGE_SDI_BLOB             /*!< Uncompressed SDI BLOB page*/
FIL_PAGE_SDI_ZBLOB            /*!< Compressed SDI BLOB page */

2-Copies of SDI
------------------------
The API described below supports more than 2 copies of each SDI but this worklog
task will allow only two. InnoDB will store SDI index root page numbers in pages
0, 1 & 2 and there are only 8 free bytes available in pages 1 & 2
(FIL_PAGE_FILE_FLUSH_LSN).  Since we need 4 bytes per root page number, we can
only use two SDI BLOB indexes without drastically changing the InnoDB header
page format.

Storing index root page numbers in multiple pages will protect the SDI from
root page corruption. We can still look into pages 1 & 2 to find the root page
numbers of SDI Index.

Flushing of 2 Copies of SDI in InnoDB
------------------------------------------
We will flush one copy before commit, and the rest after commit, so that in case
of a crash, there is some chance that the offline tool, which does not
have the benefit of using the REDO and UNDO logs, can see both a copy as it
would be after commit, and as it was before commit.

There will be no SQL interface to the tablespace dictionary but
the conceptual representation of the new API is:
------------------------------------------------

CREATE TABLE data_dictionary_information (
    type INT UNSIGNED,
    id BIGINT UNSIGNED,
    data BLOB NOT NULL,
    PRIMARY KEY (type,id));


type        - The type of tablespace object (table, index, column, FK, …)
id          - The id is unique only for specific type. For example, there
              can be an index type object and a BLOB column type object
              with the same id.
data        - The SDI BLOB passed from the MySQL server.
PRIMARY KEY - The composite primary key (type, id) will *uniquely* identify 
	      every tablespace object.


SQL representation of what the API would do:

-- listing all tablespace objects
SELECT type,id FROM data_dictionary_information;
-- fetching the information for a given object
SELECT data FROM data_dictionary_information WHERE id=... AND type=...;
-- insert new information, or update an existing one
REPLACE INTO data_dictionary_information VALUES (id, type, data);
-- delete information
-- (on DROP TABLE, or for example DROP INDEX when it was the last index
    tree of the table in this tablespace)
DELETE FROM data_dictionary_information WHERE id=... AND type=...;

With file-per-table tablespace files, the REPLACE will be an INSERT on CREATE
TABLE, and it will be an UPDATE on ALTER TABLE. With file-per-table tablespace
files, DROP TABLE need not update the information, because it implicitly does a
DROP TABLESPACE.

Handlerton API
--------------

The reads done by this API should use READ COMMITTED isolation level. This is
because we get mismatch when using READ UNCOMMITTED if SDI is read during
rollback of DDL transaction in crash recovery.

copy_num in the below APIs must be from 0 to dd_get_num_sdi_copies(space_id)-1.
If dd_get_num_sdi_copies(space_id) returns 0, the below methods must not be called.
If copy_num is out of bounds, the APIs will return non-zero indicating error.

Even though tablespace_id in APIs is 64bit, InnoDB will use least significant
32bits. If invalid tablespace_id is passed to APIs, a non-zero value will
be returned to indicate 'tablespace does not exist' error.

/* The following structure is required to uniquely identify a
Serialized Dictionary Info Object */
struct sdi_key {
	uint64 id;
	uint32 type;
};

typedef struct sdi_key SDI_KEY;

The tablespace id can be retrieved from the APIs in the below way:
        uint32	space_id;
	if (tablespace.se_private_data().get_uint32("id", &space_id)) {
		return(true); // error, attribute not found
	}


  /**
    Create SDI in a tablespace. This API should be used when upgrading
    a tablespace with no SDI.
    @param[in]      tablespace     tablespace object
    @param[in]      num_of_copies  number of SDI copies
    @return false on success and true on failure
  */
  bool sdi_create(const dd::Tablespace &tablespace,
                  uint32 num_of_copies);

  /**
    Drop SDI in a tablespace. This API should be used only when
    SDI is corrupted.
    @param[in]      tablespace     tablespace object
    @return false on success and true on failure
  */
  bool sdi_drop(const dd::Tablespace &tablespace);

  /**
    Get SDI keys in a tablespace into a vector
    @param[in]      tablespace     tablespace object
    @param[in,out]  list           list of SDI Keys
    @param[in]      copy_num       SDI copy to operate on
    @return false on success and true on failure
  */
  bool sdi_get_keys_by_id(const dd::Tablespace& tablespace,
                          sdi_vector_t &vector,
                          uint32 copy_num);


/* Since the caller of this api will not know the blob length, blob retrieval
should be done in the following way.

    i.   Allocate initial memory of some size (Lets say 64KB)
    ii.  Pass the allocated memory to the below api.
    iii. If passed buffer is sufficient, sdi_get_by_id() copies the sdi
         to the buffer passed and returns success, else sdi_len is modified
         with the actual length of the SDI (and returns false on failure).
         For genuine errors, sdi_len is returned as UINT64_MAX
    iv.  If sdi_len != UINT64_MAX, retry the call after allocating the memory
         of sdi_len
    v.   Free the memory after using SDI (responsibility of caller)
*/

  /**
    Retrieve a SDI for a given SDI key
    @param[in]      tablespace     tablespace object
    @param[in]      sdi_key        SDI key to uniquely identify SDI obj
    @param[in,out]  sdi            SDI retrieved from tablespace
                                   A non-null pointer must be passed in
    @param[in,out]  sdi_len        in: length of the memory allocated
                                   out: actual length of SDI
    @param[in]      copy_num       SDI copy to operate on
    @return false on success and true on failure
  */
  bool sdi_get_by_id(const dd::Tablespace &tablespace,
                     const SDI_KEY *sdi_key,
                     void *sdi, uint64 *sdi_len, uint32 copy_num);

  /**
    Insert/Update a SDI for a given SDI key
    @param[in]  tablespace     tablespace object
    @param[in]  sdi_key        SDI key to uniquely identify SDI obj
    @param[in]  sdi            SDI to write into the tablespace
    @param[in]  sdi_len        length of SDI BLOB returned
    @return 0 on success and non zero on failure
  */
  bool sdi_set(const dd::Tablespace &tablespace,
               const SDI_KEY *sdi_key,
               const void *sdi, uint64 sdi_len);

  /**
    Delete SDI for a given SDI key
    @param[in]  tablespace     tablespace object
    @param[in]  sdi_key        SDI key to uniquely identify SDI obj
    @return false on success and true on failure
  */
  bool sdi_delete(const dd::Tablespace &tablespace,
                  const SDI_KEY *sdi_key);

  /**
    Flush the SDI copies
    @param[in]  tablespace     tablespace object
    @return false on success and true on failure
  */
  bool sdi_flush(const dd::Tablespace &tablespace);

  /**
    Return the number of SDI copies stored in tablespace
    @param[in]  tablespace     tablespace object
  */
  int sdi_get_num_copies(const dd::Tablespace &tablespace);


Note: the intended usage is as follows:

if (sdi_set(tablespace_id, ..., 0) == 0 
    && ... && sdi_set(tablespace_id, ..., 0) == 0) {
   sdi_flush(tablespace_id);
   if (sdi_set(tablespace_id, ..., 1) == 0
       && ... && sdi_set(tablespace_id, ..., 1) == 0) 
   {
     ha_commit_trans();
     sdi_flush(tablespace_id);
     return;
  }
}

ha_rollback_trans();
sdi_flush(tablespace_id);

Note the same applies to sdi_delete() as well

*/

APIs for IMPORT/EXPORT
----------------------
The APIs sdi_get_ids_by_name(), sdi_get_by_name() are required for IMPORT as we
will have only files and don't have space_id available. 
table_id for SDI copies
-----------------------
table_id = (~copy_num << 32) | space_id

And the existing logic which generates table_id is modified to 
wrap around if it reaches the SDI table_id

dict_hdr_get_new_id(): is fixed to wrap around on reaching the lowest possible
table_id

We will not store SDI table_id in dd.tables.sdi_table_id. It will be calculated
on the fly based on space_id & copy_num. Server will not be aware of this table.

index_id for SDI copies
------------------------
index_id must be unique only when there is AHI(Adaptive Hash Index) enabled on a
table. Since AHI is disabled on SDI tables, a fixed index_id can be used for
SDI tables.

We will index_id -1UL for copy 0 and -2UL for copy 1 for SDI indexes.

dict_index_t & dict_table_t for SDI copies
----------------------------------------------
SDI tables are just like any other tables in InnoDB except that the schema
of SDI tables are not stored in SYS_TABLES (or the upcoming global DD).
The dict_table_t of SDI table is created in-memory with the below schema.

CREATE TABLE data_dictionary_information (
    type INT UNSIGNED,
    id BIGINT UNSIGNED,
    data BLOB NOT NULL,
    PRIMARY KEY (type,id));

Each tablespace will either have 0 or 2 SDI tables
Each SDI table will have dict_index_t, dict_table_t

dict_sdi_create_idx_in_mem():
Creates in-memory SDI table & index as long as the tablespace object
(fil_space_t) is already loaded into cache.

Note that dict_table_t is added to dict_sys->table_id_hash upon opening a table.
We explicity remove this table from cache when the tablespace is
dropped.


Creation of SDI B-Trees
-----------------------
fsp_sdi_create_idx():

Creates two SDI copies in a tablespace. It sets an FSP FLAG to indicate the
presence of SDI copies.

/** Set SDI Index bit in tablespace flags */
#define FSP_FLAGS_SET_SDI_IDX(flags)                    \
                (flags | (1 << FSP_FLAGS_POS_SDI_IDX))


The two root page numbers (each 4 byte) are combined into one 8byte integer and
stored in FIL_PAGE_FILE_FLUSH_LSN of pages 1 & 2.

SDI tables are not created in Temporary tablespaces.

The memcached APIs can be re-used to do insert/update/delete into SDI tables.

sdi_set()
---------
1. Check if tablespace has SDI, if not return
2. Create insert tuple from the arguments
3. Open cursor on the SDI table using SDI table_id
4. form search tuple and check if row exists (ib_cursor_read_row())
5. If the row exists, we have to update the existing row
     i.  For update, create search tuple and poistion the cursor on the
        existing row. 
    ii. Create the update vector based on the insert tuple & the tuple read
         above
   iii. Use memcached API ib_cursor_udpate_row() to do the update
6. if the row doesn't exist, use ib_cursor_insert_row() to do the insert.
7. Close the cursor

sdi_get_keys_by_id()
--------------------

1. Check if tablespace has SDI, if not return
2. Open cursor on the SDI table which positions the cursor on the first user
   record
3. Use memcached API ib_cursor_read_row() to read the current row pointed by
   cursor
4. Read the fields 0 & 1 from tuple, create a SDI Key struct and add it to 
   vector
5. Move cursor to next record
6. Repeat until we reach end of index
7. Close cursor

sdi_get_by_id()
---------------
1. First the caller allocates some memory assuming SDI will into it (lets say
   64 KB)
2. Position the cursor on the row.
3. get the length of SDI from the retrieved row. We take len =
   min(actual_len_sdi, memory_size passed) and copy that to memory passed.
4. If actual_len is higher then mem size available, we return error and the
   actual_len so that caller can retry again after allocating memory of size
   actual_len.

sdi_delete()
------------
1. Check if tablespace has SDI, if not return
2. Create search tuple from the arguments of API
3. Open cursor on the SDI table using SDI table_id
4. Position the cursor on the matching user record using the search tuple
5. If success, use memcached API to delete the record under cursor using
   ib_cursor_delete_row()
6. Close cursor

sdi_get_num_copies(space_id)
----------------------------
1. Retrieve page_size for the tablespace
2. Retrieve page 0 header
3. Retireve FSP FLAGS from page 0 and check if DD_INFO flag is set
3. If FSP_FLAG for SDI is not set, return 0
4. Else, retrieve root page number of SDI tables from pages 1 & 2 and check
   that they are not zero
5. Return 2 after the above verification

sdi_flush()
-----------
1. For a SDI table, call buf_flush_sync_all_buf_pools() which will do
   flush to lsn.

Debug Code for testing the above APIs
=====================================
To test above APIs, Memcached APIs will be modified to call the APIs introduced
in this WL.

The changes to Memcached APIs are under UNIV_MEMCACHED_SDI. UNIV_MEMCACHED_SDI
is enabled only on debug builds.

Memcached APIs modified to test SDI APIs
----------------------------------------
1. innodb_remove() - sdi_delete()
2. innodb_get()    - sdi_create(), sdi_get_keys_by_id(), 
                     sdi_get_len_by_id() & sdi_get_by_id()
3. innodb_store()  - sdi_sdi()

Memcached commands for testing SDI APIs:
----------------------------------------
1. Create SDI Copies in Tablespace             get    sdi_create_
2. To insert a ROW in SDI copies               set    sdi_:id:type
3. To get a ROW from SDI copy                  get    sdi_:id:type:copy_num
4. To delete a ROW from SDI copies             delete sdi_:id:type
5. To list all keys in SDI copy                get    sdi_list_copy_num
6. Drop SDI Copies in Tablespace               get    sdi_drop_

InnoDB internal tasks (rollback and purge)
==========================================
We would like to enforce an invariant that at most one copy of the SDI
in a tablespace may be dirty in the buffer pool at any given time.
This would ensure that the WL#7066 tool will always have access to
at least one non-corrupted copy of the SDI at any time.

Unfortunately, we cannot easily implement such an invariant. The reason
is that there can be multiple purge worker threads executing operations
on both copies of the SDI in the same tablespace at the same time.
We will conduct a ‘best effort’, but it will not prevent the following:

thread1> [purge something in SDI1]; mtr.commit();
thread2> [purge something in SDI2]; mtr.commit();
thread1> if (table_id == SDI1 || table_id == SDI2) flush();
thread2> if (table_id == SDI1 || table_id == SDI2) flush();

In this scenario, if the purge operation is merging pages
(pessimistic operation) and if the hardware crashes during the first
call to flush() in such a way that we lose the redo log (and possibly
also undo, DD or system tablespaces), we may end up in a situation
where both copies of the SDI are corrupted. This is because the flush
could have written only a subset of all the pages that were affected
by the pessimistic operations.

Flushing of SDI copies during purge & rollback
-----------------------------------------------
For purge & rollback, for SDI tables, we will try to flush after each operation
(not after every mtr_commit()).

        /* If table is SDI table, we will try to flush as early
        as possible. */
        if (dict_table_is_sdi(node->table->id)) {
                buf_flush_sync_all_buf_pools();
        }

Followup actions post this WL is pushed (to DD branch)
======================================================
1. Remove the supressin added in the innodb_bug16318052.test
    - (This is because SDI indexes are not recreated after TRUNCATE)
    - WL#6795 which maps Truncate to DROP+CREATE will solve this problem

2. Fix sdi_create_alter.inc with the TRUNCATE TABLE (after WL#6795)

3. Fix the tablespace id wrap around in dict_hdr_get_new_id (WL#7141)

4. In dict_sdi_create_idx_in_mem(), use only REC_FORMAT_DYNAMIC for SDI tables. 
   Currently system tablespace doesn't allow REC_FORMAT_DYNAMIC ( after WL#7704)

5. In dict_sdi_create_idx_in_mem(),usecommon "index name" for both SDI Indexes.
   (after WL#7412)

6. Remove space_discarded parameter from dict_sdi_create_idx_in_mem() after
   IMPORT/EXPORT is implemented in WL#7412. As of now it is needed because
   the fil_space_t & the .ibd is removed after DISCARD TABLESPACE

7. Remove dict_sdi_get_clust_index() after WL#7412. As of now, it is used only
   in IMPORT/EXPORT code.

8. Fix fil_ibd_open() to use tablespace flags instead of table flags.
   dict_tf_to_fsp_flags() should be removed (Bug#????). Two occurences.

9. Currently rollback to savepoint is implemented for sdi_set() & sdi_delete() 
   APIs. To be determined if this is necessary.

10. After TRUNCATE, we do not recreate the SDI indexes. Remove this adjustment
    in In row_trunc_complete() (row0trun.cc) after WL#6795

11. In fil_node_open_file(), Remove the adjustment and enable the assert. 
    After dict_tf_to_fsp_flags() removal.

12. Add crash recovery test for SDI. (tested manually, crash  the server in
    middle of SDI inserts/updates & restart the server to see the rollback of
    transaction). The debug crash points do not work from memcached tests.

13. Fix memcached.memc286_sdi_create.test to use actual SDI(when server stores
    SDI in tablespaces). As of now the test uses SHOW CREATE TABLE output as 
    SDI and TABLE_ID as SDI_KEY & SDI_TYPE