WL#7053: InnoDB: Provide storage for tablespace dictionary
Affects: Server-8.0
—
Status: Complete
Introduction ------------------------------------------------------------------------ This worklog is about providing transactional storage for serialized data dictionary information within persistent InnoDB tablespaces. The data dictionary will be stored in all InnoDB tablespaces that contain persistent data dictionary objects, except in the Data Dictionary Tablespace. Definitions ------------------------------------------------------------------------ Shared Tablespace A tablespace that can contain multiple tables. User Tablespace A tablespace created by an SQL command that contains user created data and that uses an extension '.ibd'. Single Table User Tablespace: A User Tablespace which is created for a single table in a CREATE TABLE command when innodb_file_per_table=ON. Shared User Tablespace: A User Tablespace that can contain multiple tables and is created with CREATE TABLESPACE. In future releases they could contain separate indexes and off-page columns. System Tablespace: This tablespace is the default location of user tables as well as all other InnoDB system page types. The first data file in this tablespace is called 'ibdata1' by default. Data Dictionary Tablespace: A shared tablespace where the data dictionary tables are stored. This may be different from the system tablespace. Shared Temporary Tablespace: A tablespace containing temporary data. This tablespace is streamlined for performance so it is not redo logged, not covered by persistent undo logs, and is not linked to a persistent data dictionary. Since temporary tables are not persistent, temporary tablespaces will not contain serialized dictionary information. Undo Log Tablespace: A persistent or transient tablespace that contains undo log pages. As these do not contain any data dictionary items, they do not contain serialized dictionary information either. Tablespace object: Currently, a tablespace object describes a complete table but in the future, we could; * Allow off-page columns such as blobs to be stored in a separate tablespace than the clustered index leaf pages. * Allow a secondary index to be stored separate from the clustered primary data records. SDI Serialized Dictionary Information - The BLOB record that this worklog provides storage for. Background ------------------------------------------------------------------------ Currently, InnoDB stores all dictionary information in the system tablespace (eg ibdata1). There is no dictionary information in the user tablespace (eg t1.ibd, t2.ibd etc). InnoDB user tablespaces do not contain any metadata that would allow it to interpret the contents of the user tablespace, or to detect mismatch with the data dictionary (residing in system tablespace) when opening the user tablespace. Purpose ------------------------------------------------------------------------- The reasons for providing an API and a storage location for serialized dictionary information are as follows: * Make tablespaces (single-table and shared tablespaces) self descriptive. * Provides redundancy (one copy in the Global DD tablespace, multiple copies in tablespaces containing the data). * Server core can validate the tablespace dictionary to detect mismatches. * External tools can dump the tablespace dictionary.
InnoDB will not be aware of the structure of the Data Dictionary information passed from the server, hence it is a SDI BLOB to InnoDB. The SDI describes a data dictionary object. InnoDB will not detect mismatches or resolve conflicts between the SDI in the tablespace and the data dictionary tables. There will be either zero or multiple copies of the SDI in each tablespace containing persistent data dictionary objects, except in data dictionary tablespaces. Each copy of the SDI BLOB will be stored in a transactional pseudo-table consisting of a clustered index only. Modifications are made with undo logging, ROLLBACK and MVCC. Each pseudo-table is assigned a 64-bit table_id that is derived from the 32-bit space_id. For example for the kth copy of the table; #define sdi_table_id(k) ib_id_t(-k) << 32 | space_id The sdi_table_id will only be used for undo logging. For this reason they will not be stored into dd.tables.se_private_id. Also, the table_id of normal tables stored in dd.tables.se_private_id must not conflict with sdi_table_id. When a database is upgraded from an earlier MySQL version, the SDI may be missing. Likewise, when importing a user tablespace from an earlier version, there will not be any SDI. InnoDB will work whether or not the SDI is available. This simplifies the upgrade from older MySQL versions, because it will not be necessary to access or modify the user tablespace files during the upgrade. There will be a flag in the first pages of the tablespace to indicate if the SDI is present or not. Earlier versions of MySQL must refuse to open tablespaces that contain SDI. When downgrading or exporting to an earlier version, the SDI can be stripped, possibly with an external tool.
InnoDB will not be aware of the SDI structure and will not do anything automatically about these SDI BLOBs. The MySQL server should use the new API to control the SDI creation or manipulation with a DDL transaction. The data dictionary information for a given tablespace is split into several records, so that we can avoid rewriting all of the data when an index or table is created or dropped in a shared tablespace that contains thousands of indexes or tables. The SDI records will be stored in one clustered index B-tree, similar to a regular table that only defines a primary key, no secondary indexes. The primary key will allow quick access to tablespace objects in a tablespace. In case a record does not fit in the clustered index leaf page, it may be stored off-page (a linked list of BLOB pages). The SDI B-tree can be present even in file-per-table .ibd files and in any tablespace that can contain DD objects. This will help us in using a common interface for all tablespace types. InnoDB will prevent the writing of SDI only to Temporary tablespaces. It is not expected that SDI will be written to a Global Data Dictionary tablespace, but InnoDB will not prevent it. Undo-Logging & Redo-Logging ---------------------------- There are five types of pages related to tablespace dictionary storage. ① clustered index leaf pages (containing the actual data records) ② BLOB pages (if some record is too long to fit in a leaf page) ③ clustered index node pointer pages (search structure) ④ file segment inode pages (keeping track of allocated pages) ⑤ allocation bitmap pages within the tablespace To keep global DD and tablespace DD in sync, all the changes to the above pages will be done in a *single* user transaction, even though it may have multiple mtr_commit's. We follow the existing semantics of a transaction. i.e. We redo-log all changes to the pages, and we will write an undo-log record before starting to modify a clustered index record. Since the external tool which reads tablespace dictionary expects the SDI pages to be in a sane state without having access to the redo and undo logs, we will have to treat the SDI metadata table specially and flush the data pages (①,②,③) ASAP. The pages ④,⑤ containing allocation meta-data will not need special treatment, as the tool would only be interested in the data pages. New Page Types -------------- The tablespace dictionary will use new page types to differentiate from the normal INDEX pages. FIL_PAGE_SDI /*!< SDI Index page */ FIL_PAGE_SDI_BLOB /*!< Uncompressed SDI BLOB page*/ FIL_PAGE_SDI_ZBLOB /*!< Compressed SDI BLOB page */ 2-Copies of SDI ------------------------ The API described below supports more than 2 copies of each SDI but this worklog task will allow only two. InnoDB will store SDI index root page numbers in pages 0, 1 & 2 and there are only 8 free bytes available in pages 1 & 2 (FIL_PAGE_FILE_FLUSH_LSN). Since we need 4 bytes per root page number, we can only use two SDI BLOB indexes without drastically changing the InnoDB header page format. Storing index root page numbers in multiple pages will protect the SDI from root page corruption. We can still look into pages 1 & 2 to find the root page numbers of SDI Index. Flushing of 2 Copies of SDI in InnoDB ------------------------------------------ We will flush one copy before commit, and the rest after commit, so that in case of a crash, there is some chance that the offline tool, which does not have the benefit of using the REDO and UNDO logs, can see both a copy as it would be after commit, and as it was before commit. There will be no SQL interface to the tablespace dictionary but the conceptual representation of the new API is: ------------------------------------------------ CREATE TABLE data_dictionary_information ( type INT UNSIGNED, id BIGINT UNSIGNED, data BLOB NOT NULL, PRIMARY KEY (type,id)); type - The type of tablespace object (table, index, column, FK, …) id - The id is unique only for specific type. For example, there can be an index type object and a BLOB column type object with the same id. data - The SDI BLOB passed from the MySQL server. PRIMARY KEY - The composite primary key (type, id) will *uniquely* identify every tablespace object. SQL representation of what the API would do: -- listing all tablespace objects SELECT type,id FROM data_dictionary_information; -- fetching the information for a given object SELECT data FROM data_dictionary_information WHERE id=... AND type=...; -- insert new information, or update an existing one REPLACE INTO data_dictionary_information VALUES (id, type, data); -- delete information -- (on DROP TABLE, or for example DROP INDEX when it was the last index tree of the table in this tablespace) DELETE FROM data_dictionary_information WHERE id=... AND type=...; With file-per-table tablespace files, the REPLACE will be an INSERT on CREATE TABLE, and it will be an UPDATE on ALTER TABLE. With file-per-table tablespace files, DROP TABLE need not update the information, because it implicitly does a DROP TABLESPACE. Handlerton API -------------- The reads done by this API should use READ COMMITTED isolation level. This is because we get mismatch when using READ UNCOMMITTED if SDI is read during rollback of DDL transaction in crash recovery. copy_num in the below APIs must be from 0 to dd_get_num_sdi_copies(space_id)-1. If dd_get_num_sdi_copies(space_id) returns 0, the below methods must not be called. If copy_num is out of bounds, the APIs will return non-zero indicating error. Even though tablespace_id in APIs is 64bit, InnoDB will use least significant 32bits. If invalid tablespace_id is passed to APIs, a non-zero value will be returned to indicate 'tablespace does not exist' error. /* The following structure is required to uniquely identify a Serialized Dictionary Info Object */ struct sdi_key { uint64 id; uint32 type; }; typedef struct sdi_key SDI_KEY; The tablespace id can be retrieved from the APIs in the below way: uint32 space_id; if (tablespace.se_private_data().get_uint32("id", &space_id)) { return(true); // error, attribute not found } /** Create SDI in a tablespace. This API should be used when upgrading a tablespace with no SDI. @param[in] tablespace tablespace object @param[in] num_of_copies number of SDI copies @return false on success and true on failure */ bool sdi_create(const dd::Tablespace &tablespace, uint32 num_of_copies); /** Drop SDI in a tablespace. This API should be used only when SDI is corrupted. @param[in] tablespace tablespace object @return false on success and true on failure */ bool sdi_drop(const dd::Tablespace &tablespace); /** Get SDI keys in a tablespace into a vector @param[in] tablespace tablespace object @param[in,out] list list of SDI Keys @param[in] copy_num SDI copy to operate on @return false on success and true on failure */ bool sdi_get_keys_by_id(const dd::Tablespace& tablespace, sdi_vector_t &vector, uint32 copy_num); /* Since the caller of this api will not know the blob length, blob retrieval should be done in the following way. i. Allocate initial memory of some size (Lets say 64KB) ii. Pass the allocated memory to the below api. iii. If passed buffer is sufficient, sdi_get_by_id() copies the sdi to the buffer passed and returns success, else sdi_len is modified with the actual length of the SDI (and returns false on failure). For genuine errors, sdi_len is returned as UINT64_MAX iv. If sdi_len != UINT64_MAX, retry the call after allocating the memory of sdi_len v. Free the memory after using SDI (responsibility of caller) */ /** Retrieve a SDI for a given SDI key @param[in] tablespace tablespace object @param[in] sdi_key SDI key to uniquely identify SDI obj @param[in,out] sdi SDI retrieved from tablespace A non-null pointer must be passed in @param[in,out] sdi_len in: length of the memory allocated out: actual length of SDI @param[in] copy_num SDI copy to operate on @return false on success and true on failure */ bool sdi_get_by_id(const dd::Tablespace &tablespace, const SDI_KEY *sdi_key, void *sdi, uint64 *sdi_len, uint32 copy_num); /** Insert/Update a SDI for a given SDI key @param[in] tablespace tablespace object @param[in] sdi_key SDI key to uniquely identify SDI obj @param[in] sdi SDI to write into the tablespace @param[in] sdi_len length of SDI BLOB returned @return 0 on success and non zero on failure */ bool sdi_set(const dd::Tablespace &tablespace, const SDI_KEY *sdi_key, const void *sdi, uint64 sdi_len); /** Delete SDI for a given SDI key @param[in] tablespace tablespace object @param[in] sdi_key SDI key to uniquely identify SDI obj @return false on success and true on failure */ bool sdi_delete(const dd::Tablespace &tablespace, const SDI_KEY *sdi_key); /** Flush the SDI copies @param[in] tablespace tablespace object @return false on success and true on failure */ bool sdi_flush(const dd::Tablespace &tablespace); /** Return the number of SDI copies stored in tablespace @param[in] tablespace tablespace object */ int sdi_get_num_copies(const dd::Tablespace &tablespace); Note: the intended usage is as follows: if (sdi_set(tablespace_id, ..., 0) == 0 && ... && sdi_set(tablespace_id, ..., 0) == 0) { sdi_flush(tablespace_id); if (sdi_set(tablespace_id, ..., 1) == 0 && ... && sdi_set(tablespace_id, ..., 1) == 0) { ha_commit_trans(); sdi_flush(tablespace_id); return; } } ha_rollback_trans(); sdi_flush(tablespace_id); Note the same applies to sdi_delete() as well */ APIs for IMPORT/EXPORT ---------------------- The APIs sdi_get_ids_by_name(), sdi_get_by_name() are required for IMPORT as we will have only files and don't have space_id available.
table_id for SDI copies ----------------------- table_id = (~copy_num << 32) | space_id And the existing logic which generates table_id is modified to wrap around if it reaches the SDI table_id dict_hdr_get_new_id(): is fixed to wrap around on reaching the lowest possible table_id We will not store SDI table_id in dd.tables.sdi_table_id. It will be calculated on the fly based on space_id & copy_num. Server will not be aware of this table. index_id for SDI copies ------------------------ index_id must be unique only when there is AHI(Adaptive Hash Index) enabled on a table. Since AHI is disabled on SDI tables, a fixed index_id can be used for SDI tables. We will index_id -1UL for copy 0 and -2UL for copy 1 for SDI indexes. dict_index_t & dict_table_t for SDI copies ---------------------------------------------- SDI tables are just like any other tables in InnoDB except that the schema of SDI tables are not stored in SYS_TABLES (or the upcoming global DD). The dict_table_t of SDI table is created in-memory with the below schema. CREATE TABLE data_dictionary_information ( type INT UNSIGNED, id BIGINT UNSIGNED, data BLOB NOT NULL, PRIMARY KEY (type,id)); Each tablespace will either have 0 or 2 SDI tables Each SDI table will have dict_index_t, dict_table_t dict_sdi_create_idx_in_mem(): Creates in-memory SDI table & index as long as the tablespace object (fil_space_t) is already loaded into cache. Note that dict_table_t is added to dict_sys->table_id_hash upon opening a table. We explicity remove this table from cache when the tablespace is dropped. Creation of SDI B-Trees ----------------------- fsp_sdi_create_idx(): Creates two SDI copies in a tablespace. It sets an FSP FLAG to indicate the presence of SDI copies. /** Set SDI Index bit in tablespace flags */ #define FSP_FLAGS_SET_SDI_IDX(flags) \ (flags | (1 << FSP_FLAGS_POS_SDI_IDX)) The two root page numbers (each 4 byte) are combined into one 8byte integer and stored in FIL_PAGE_FILE_FLUSH_LSN of pages 1 & 2. SDI tables are not created in Temporary tablespaces. The memcached APIs can be re-used to do insert/update/delete into SDI tables. sdi_set() --------- 1. Check if tablespace has SDI, if not return 2. Create insert tuple from the arguments 3. Open cursor on the SDI table using SDI table_id 4. form search tuple and check if row exists (ib_cursor_read_row()) 5. If the row exists, we have to update the existing row i. For update, create search tuple and poistion the cursor on the existing row. ii. Create the update vector based on the insert tuple & the tuple read above iii. Use memcached API ib_cursor_udpate_row() to do the update 6. if the row doesn't exist, use ib_cursor_insert_row() to do the insert. 7. Close the cursor sdi_get_keys_by_id() -------------------- 1. Check if tablespace has SDI, if not return 2. Open cursor on the SDI table which positions the cursor on the first user record 3. Use memcached API ib_cursor_read_row() to read the current row pointed by cursor 4. Read the fields 0 & 1 from tuple, create a SDI Key struct and add it to vector 5. Move cursor to next record 6. Repeat until we reach end of index 7. Close cursor sdi_get_by_id() --------------- 1. First the caller allocates some memory assuming SDI will into it (lets say 64 KB) 2. Position the cursor on the row. 3. get the length of SDI from the retrieved row. We take len = min(actual_len_sdi, memory_size passed) and copy that to memory passed. 4. If actual_len is higher then mem size available, we return error and the actual_len so that caller can retry again after allocating memory of size actual_len. sdi_delete() ------------ 1. Check if tablespace has SDI, if not return 2. Create search tuple from the arguments of API 3. Open cursor on the SDI table using SDI table_id 4. Position the cursor on the matching user record using the search tuple 5. If success, use memcached API to delete the record under cursor using ib_cursor_delete_row() 6. Close cursor sdi_get_num_copies(space_id) ---------------------------- 1. Retrieve page_size for the tablespace 2. Retrieve page 0 header 3. Retireve FSP FLAGS from page 0 and check if DD_INFO flag is set 3. If FSP_FLAG for SDI is not set, return 0 4. Else, retrieve root page number of SDI tables from pages 1 & 2 and check that they are not zero 5. Return 2 after the above verification sdi_flush() ----------- 1. For a SDI table, call buf_flush_sync_all_buf_pools() which will do flush to lsn. Debug Code for testing the above APIs ===================================== To test above APIs, Memcached APIs will be modified to call the APIs introduced in this WL. The changes to Memcached APIs are under UNIV_MEMCACHED_SDI. UNIV_MEMCACHED_SDI is enabled only on debug builds. Memcached APIs modified to test SDI APIs ---------------------------------------- 1. innodb_remove() - sdi_delete() 2. innodb_get() - sdi_create(), sdi_get_keys_by_id(), sdi_get_len_by_id() & sdi_get_by_id() 3. innodb_store() - sdi_sdi() Memcached commands for testing SDI APIs: ---------------------------------------- 1. Create SDI Copies in Tablespace get sdi_create_ 2. To insert a ROW in SDI copies set sdi_:id:type 3. To get a ROW from SDI copy get sdi_:id:type:copy_num 4. To delete a ROW from SDI copies delete sdi_:id:type 5. To list all keys in SDI copy get sdi_list_copy_num 6. Drop SDI Copies in Tablespace get sdi_drop_ InnoDB internal tasks (rollback and purge) ========================================== We would like to enforce an invariant that at most one copy of the SDI in a tablespace may be dirty in the buffer pool at any given time. This would ensure that the WL#7066 tool will always have access to at least one non-corrupted copy of the SDI at any time. Unfortunately, we cannot easily implement such an invariant. The reason is that there can be multiple purge worker threads executing operations on both copies of the SDI in the same tablespace at the same time. We will conduct a ‘best effort’, but it will not prevent the following: thread1> [purge something in SDI1]; mtr.commit(); thread2> [purge something in SDI2]; mtr.commit(); thread1> if (table_id == SDI1 || table_id == SDI2) flush(); thread2> if (table_id == SDI1 || table_id == SDI2) flush(); In this scenario, if the purge operation is merging pages (pessimistic operation) and if the hardware crashes during the first call to flush() in such a way that we lose the redo log (and possibly also undo, DD or system tablespaces), we may end up in a situation where both copies of the SDI are corrupted. This is because the flush could have written only a subset of all the pages that were affected by the pessimistic operations. Flushing of SDI copies during purge & rollback ----------------------------------------------- For purge & rollback, for SDI tables, we will try to flush after each operation (not after every mtr_commit()). /* If table is SDI table, we will try to flush as early as possible. */ if (dict_table_is_sdi(node->table->id)) { buf_flush_sync_all_buf_pools(); } Followup actions post this WL is pushed (to DD branch) ====================================================== 1. Remove the supressin added in the innodb_bug16318052.test - (This is because SDI indexes are not recreated after TRUNCATE) - WL#6795 which maps Truncate to DROP+CREATE will solve this problem 2. Fix sdi_create_alter.inc with the TRUNCATE TABLE (after WL#6795) 3. Fix the tablespace id wrap around in dict_hdr_get_new_id (WL#7141) 4. In dict_sdi_create_idx_in_mem(), use only REC_FORMAT_DYNAMIC for SDI tables. Currently system tablespace doesn't allow REC_FORMAT_DYNAMIC ( after WL#7704) 5. In dict_sdi_create_idx_in_mem(),usecommon "index name" for both SDI Indexes. (after WL#7412) 6. Remove space_discarded parameter from dict_sdi_create_idx_in_mem() after IMPORT/EXPORT is implemented in WL#7412. As of now it is needed because the fil_space_t & the .ibd is removed after DISCARD TABLESPACE 7. Remove dict_sdi_get_clust_index() after WL#7412. As of now, it is used only in IMPORT/EXPORT code. 8. Fix fil_ibd_open() to use tablespace flags instead of table flags. dict_tf_to_fsp_flags() should be removed (Bug#????). Two occurences. 9. Currently rollback to savepoint is implemented for sdi_set() & sdi_delete() APIs. To be determined if this is necessary. 10. After TRUNCATE, we do not recreate the SDI indexes. Remove this adjustment in In row_trunc_complete() (row0trun.cc) after WL#6795 11. In fil_node_open_file(), Remove the adjustment and enable the assert. After dict_tf_to_fsp_flags() removal. 12. Add crash recovery test for SDI. (tested manually, crash the server in middle of SDI inserts/updates & restart the server to see the rollback of transaction). The debug crash points do not work from memcached tests. 13. Fix memcached.memc286_sdi_create.test to use actual SDI(when server stores SDI in tablespaces). As of now the test uses SHOW CREATE TABLE output as SDI and TABLE_ID as SDI_KEY & SDI_TYPE
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.