WL#9538: InnoDB_New_DD: Integrating InnoDB SDI with new DD
Affects: Server-8.0 — Status: Complete — Priority: Medium
Main objective of this WL is to integrate new DD with InnoDB new dictionary branch. This WL will handle the below cases: 1. Removal of one SDI copy in InnoDB tablespaces 2. Fix the integration issues. Make sure SDI in InnoDB tablespace is updated after ALTER. 3. On upgrade from 5.7, SDI is created for tablespaces
Functional: ----------- 1. SDI should be present in all InnoDB tablespaces except temporary tablespace(ibtmp*) and undo tablespaces 2. SDI should be updated after ALTERs 3. ibd2sdi should be able extract SDI from InnoDB tablespace when server is offline 4. SDI stored in tablespace is in JSON format and written by Server. So the output from ibd2sdi should be JSON parseable. Non-Functional: --------------- 1. ALTER performance shouldn't be degraded because of SDI.
Reasons for changing design: 1. The original requirement of two copies is not met The original requirement from support is that, we should have two copies of SDI and these two copies should be far away in .ibd file. So even if one SDI is corrupted because of bad disk(page corruption), there are more chances to recover if the SDI copies are far. This far requirement cannot be met. So there is equal chance of corrupting both copies. [Proposed Change]: Remove one copy 2. Increase in .ibd size Each copy will create one root page. Assuming 16K page size, 32K size increase per .ibd file. This might look OK. But with 1 million tables, the overhead is 20G for SDI pages. For file per table tablespaces, there will be only one entry (or at max 2-3 entries). So even one SDI index page is overkill on storage space. We cannot go for further optimization because the SDI storage promised is "transactional". To be transactional, undo logging & redo logging is necessary. And undo logging is not possible without an index page. So, we have to live with one SDI index page even for file_per_table tablespaces. [Proposed Change]: One copy will reduce the size but still 10G overhead. 3. Trying to be too perfect Perfect here means, even after a DDL crash, the aim is to have one copy of SDI as it was before DDL and another copy of SDI after WL. We realized that the main purpose of SDI is to extract data when we only have an .ibd. So as long as we can get the schema for clustered index, we can extract clustered index data and rebuild other indexes. For file per table, when a clustered index is changed, we will always have intermediate .ibd. This intermediate ibd will contain new SDI. [Proposed Change]: See below 4. Flushing tricks assumed in design It is assumed that we could flush one copy of SDI before trx commit and another after trx commit. This is not straight forward. We need more tweaks in flushing to achieve this. Lesser the tweaks, the better [Proposed Change]: Since the changes to SDI pages are undo & redo logged, there is no need to flush SDI pages during ALTER. But it is required only for those cases where undo & redo is not available. SDI will not be consistent without applying undo & redo. For version 1, we will not do any extra flushing of SDI pages. Undo & redo has to be available to make SDI consistent. 5. No version for SDI One mistake in existing design is that, there is no version tag for on-disk SDI. So it is hard to change formats of SDI. [Proposed Change]: Since one copy is removed, we will use that 4 bytes as version number at FIL_PAGE_FILE_FLUSH_LSN offset on pages 1 & 2. Summary(tasks in this WL) ------------------------- 1. Remove one SDI copy. Remove all code which deals with two copies. 2. Use 4 bytes of FIL_PAGE_FILE_FLUSH_LSN offset on pages 1 & 2 as version 1 3. SDI updates should be done in the same trx used for DDL. There will be only single transacation that covers entire DDL. For integration purpose, we will create a internal transaction to do SDI changes. 4. InnoDB will compress the SDI and then store it. This will reduce the size of the SDI to be stored. This applies to uncompressed tablespaces as well. ibd2sdi will decompress the SDI first. To handle compression, we need two extra fields compressed_len and ucompressed_len. So the schema of SDI table becomes CREATE TABLE SDI( type INT UNSIGNED, id BIGINT UNSIGNED, uncompressed_len INT UNSIGNED, compressed_len INT UNSIGNED, compressed_data BLOB NOT NULL, PRIMARY KEY (type,id)); Compression library used is zlib. 5. Enable SDI in mysql tablespace (mysql.ibd) 6. API changes: ------------ i) Change dd::sdi_key_t from (id, type) to (type, id) After this change, objects of same type will be stored together. In future, accessing objects of same type could be made faster, because they are stored continuously Since the format of SDI is different, ibd2sdi tool should handle this. ii)Change return types of SDI APIs from bool to uint32_t. This allows APIs to return more error code. Zero on success, Non-zero on failure iii)Add dd::Table* to SDI APIs Unlike other SE's which stores SDI in files, InnoDB stores SDI in tablespaces. So a dd::Table object can be serialized and stored only after it has valid se_private_id. This is because of presence of valid se_private_id indicates table is created, so a tablespace (implicit) is also created. Similary for partitions, a valid se_private_id of all partitions indicate the tablespaces for partitions are created. So sdi_set() can be called twice for same dd::Table object. Once without valid se_private_id and later with valid se_private_id. SE is responsible for this check and it should avoid storing sdi if dd::Table or Partitions do not have valid se_private_id. Thus, to do these checks, dd::Table* is added to SDI APIs sdi_set() & sdi_delete() Side notes: ------------ For partitions, the SDI is stored as a whole(includes all partition info). For example, if table has 10 partitions, and lets say each partition goes to a separate tablespace(file_per_table), then SDI will be stored 10 times in all the partitioned tablespaces. SDI is with all partitions info, stored in each tablespace. Another example: If there is table with 10 partitions, 4 of them in one tablespace (general tablespace) and remaining 6 as file_per_table tablespaces. In this case, there will be 7 sdi_set(). Once per each tablespace of each partitions. For tablespace where it is already stored, duplicate stores are avoided. In the above case, where 4 partitions are in one general tablespace, there will only one sdi_set(table_blob_with_all_partiton_info); instead of 4.
Copyright (c) 2000, 2018, Oracle Corporation and/or its affiliates. All rights reserved.