WL#7816: InnoDB: Persist the "corrupted" flag in the data dictionary
Affects: Server-8.0 — Status: Complete — Priority: Medium
When InnoDB notices corruption in an index tree, it needs to flag the index as corrupted, both in its internal data dictionary cache and in persistent storage. Because the Global Data Dictionary (DD) is "above" the storage engine in the locking hierarchy, it would be tricky to initiate an update of DD tables from the storage engine; it could easily lead to deadlocks. It could also be tricky to propagate an error all the way up the stack so that the upper layer would know to mark the DD object as corrupted. To persist the attribute, we will piggy-back on the InnoDB redo log, which can be written to even from low-level code and background threads, such as page I/O handlers.
FR1: When we find that an index is corrupted, we will write a redo log record but not write back to the DD tables. This way, we can record a corrupted index almost anytime and anywhere. The redo log will be flushed right after the mtr commits.

FR2: We will introduce an engine-private system table, called the 'DD buffer table' below, to store the corruption bits of indexes before every checkpoint. We should make sure that the same corruption bit is not written to this table over and over.

FR3: During recovery, we will read the corruption bits from both the redo log and the DD buffer table, and merge the results to mark the in-memory table and index objects.

FR4: The data will eventually be written to the DD tables. The write-back is supposed to happen on:
  1) Slow shutdown (shutdown after SET GLOBAL innodb_fast_shutdown=0)
  2) EXPORT TABLESPACE

WL#7816 will only write data to the redo log and the DD buffer table.
Overall
=======

The rough idea of this worklog is to persist the corruption bits of an index via the redo log, and eventually write them back to the DD tables. Persisting takes 3 steps:

1. Once we find a corrupted index, we write the bit to the redo log.
2. On every checkpoint, we write all in-memory corruption bits to the DD buffer table if they have not been written there yet.
3. We finally write these corruption bits back to the DD tables on:
   3.1 EXPORT TABLESPACE
   3.2 Slow shutdown

As said in FR4, step 3 is not included in this WL.

Some notes:

On shutdown, all the MDL_EXCLUSIVE requests will trivially succeed, because all user transactions have been kicked out.

On EXPORT TABLESPACE, we must be able to hold MDL_EXCLUSIVE during some part of the operation. It can be downgraded to MDL_SHARED once the data has been written. This assumes that no fast-changing metadata will change as a result of read activity that might run concurrently with the EXPORT.

For now this mechanism is designed for this worklog only, but it can easily be applied to other fast-changing bits that need to be persisted, such as update_time, auto_inc, COUNT(*), etc. To keep the mechanism general, the text below sometimes says 'fast-changing metadata' instead of 'corruption bit'.

redo log
========

We record the corrupted flag in the redo log so that we don't access the DD tables that often, and so that persistence of the corrupted flag is crash-safe. After a crash, we will scan the redo log during restart to collect the corruption information.

A new mtr log type named MLOG_CORRUPTED_INDEX will be introduced for this purpose. In order to know which index is corrupted, we need to log pair<table_id, <space_id, index_id>>. We don't have any lookup by index_id, so table_id is a must. Also, index_id might in the future be unique only within a tablespace rather than InnoDB-wide, so we also log the space_id in advance.
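To make the logged triple concrete, here is a minimal sketch of how the payload of such a record could be serialized and parsed. It is only an illustration: the struct name CorruptedIndexRecord, the helpers encode() and decode(), and the fixed-width big-endian layout are all assumptions made for this example, not the encoding InnoDB actually uses for redo records.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical payload of one "corrupted index" redo record:
// pair<table_id, <space_id, index_id>> from the text above.
struct CorruptedIndexRecord {
  std::uint64_t table_id;
  std::uint32_t space_id;
  std::uint64_t index_id;
};

// Assumed layout for this sketch: 8 + 4 + 8 bytes, big-endian.
static const std::size_t kRecordSize = 20;

// Append the low `bytes` bytes of `value` in big-endian order.
static void put_be(std::vector<std::uint8_t>& out, std::uint64_t value,
                   int bytes) {
  for (int shift = (bytes - 1) * 8; shift >= 0; shift -= 8) {
    out.push_back(static_cast<std::uint8_t>(value >> shift));
  }
}

// Read `bytes` big-endian bytes starting at `p`.
static std::uint64_t get_be(const std::uint8_t* p, int bytes) {
  std::uint64_t value = 0;
  for (int i = 0; i < bytes; ++i) value = (value << 8) | p[i];
  return value;
}

std::vector<std::uint8_t> encode(const CorruptedIndexRecord& rec) {
  std::vector<std::uint8_t> out;
  put_be(out, rec.table_id, 8);
  put_be(out, rec.space_id, 4);
  put_be(out, rec.index_id, 8);
  return out;
}

// Returns false if the buffer is too short to hold a whole record,
// as can happen when the parser reaches the end of the scanned log.
bool decode(const std::vector<std::uint8_t>& buf, CorruptedIndexRecord* rec) {
  if (buf.size() < kRecordSize) return false;
  rec->table_id = get_be(buf.data(), 8);
  rec->space_id = static_cast<std::uint32_t>(get_be(buf.data() + 8, 4));
  rec->index_id = get_be(buf.data() + 12, 8);
  return true;
}
```

Carrying table_id in the record is what lets recovery find the table without a lookup by index_id, and space_id is carried so that the format would survive index_id becoming unique only per tablespace.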
To be safe against crashes, we had better flush the corrupted-index redo log record after the related mtr.commit(), by invoking log_write_up_to(mtr.commit_lsn(), true). Thanks to this, in most cases (except when the server crashes immediately afterwards) the corrupted index ID will be persisted in the redo log. Refer to LLD 4.1.1 about the checkpoint issue.

dirty_status in table
=====================

A flag indicating whether any fast-changing metadata needs to be written back to either the DD buffer table or the DD tables; we don't want to write the same data back over and over. This dirty_status is table-level (in dict_table_t). It has one of 3 values:

1. dirty: some fast-changing metadata is marked only in memory, without any write-back. (So we have to write it back to the DD buffer table on the next checkpoint, and to the DD tables on EXPORT or slow shutdown.)
2. buffered: the fast-changing metadata is marked both in memory and in the DD buffer table. (We have to write it back to the DD tables on EXPORT or slow shutdown.)
3. clean: the DD tables are up to date, and the DD buffer table holds no entry for the table.

DD buffer table
===============

It should be an engine-private system table, which is created along with the other system tables and never discarded.

Table structure:

  CREATE TABLE DD_buffer_table (
    table_id BIGINT NOT NULL PRIMARY KEY,
    fast_changing_metadata BLOB NOT NULL)
  ENGINE=InnoDB;

We will write all fast-changing metadata into the blob field as key-value pairs, so that we can write different metadata for different tables dynamically.

We introduce this table to avoid 'copy across checkpoint'. In the original design, all corruption bits had to be rewritten on every checkpoint, which could mean writing the same corruption bits over and over, because we rarely write them back to the DD tables while the server is running. It is difficult to write the corruption bits back to the Global DD during a checkpoint, because of locking that would conflict with user DMLs.

At first, the DD buffer table is empty.
On checkpoint, we will issue some REPLACE statements to write dirty fast-changing metadata back to it, if and only if dirty_status is dirty. After we write the metadata back to the DD tables, we can delete the corresponding rows in the DD buffer table.

For 'DROP/ALTER/TRUNCATE', which modify the table_id of a table, we should update the table_id field accordingly:

1. ha_innobase::delete_table() (DROP TABLE, ALTER TABLE…DROP PARTITION, and ALTER TABLE…ALGORITHM=COPY): We can simply try to delete the corresponding row in the DD buffer table. When the DDL transaction gets committed, the data will be gone.
2. ha_innobase::commit_inplace_alter_table(commit=true) (ALTER TABLE…ALGORITHM=INPLACE, OPTIMIZE TABLE): If the table is being rebuilt, we will delete the DD buffer table entry for the old table.
3. ha_innobase::truncate() (TRUNCATE TABLE, ALTER TABLE…TRUNCATE PARTITION): Later, truncate() will invoke create() and delete_table(). The DD buffer table entry for the old table_id will be deleted in delete_table().

With the DD buffer table, the status is set and converted as follows:

1. At first, the status should be 'clean' (but see 5 below).
2. Once we write a redo log record for any fast-changing metadata, we set the corresponding status to 'dirty'.
3. On checkpoint, if we find that a status is 'dirty', we should write/update the metadata back to the DD buffer table and set the status to 'buffered'.
4. On EXPORT or slow shutdown, if we find that a status is either 'dirty' or 'buffered', we should write the metadata back to the DD tables and set the status to 'clean' again. After the true write-back, we can delete the corresponding row in the DD buffer table.
5. On opening a table, we have to check the DD buffer table first. If there is a row for the current table, we should read the metadata, update the table object in the cache, and set the status to 'buffered'.
6. On removing a table object from the cache, we should write/update the metadata back to the DD buffer table.
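The six rules above amount to a small state machine, which can be sketched as follows. This is only a model for reading the rules: the names DirtyStatus, BufferTable, Table and the four functions are hypothetical, and the DD buffer table is reduced to the set of table IDs that currently have a row.

```cpp
#include <cstdint>
#include <set>

enum class DirtyStatus { Clean, Dirty, Buffered };

// Stand-in for the DD buffer table: only the set of table IDs that have a row.
struct BufferTable {
  std::set<std::uint64_t> rows;
};

// Stand-in for the cached table object (dict_table_t in the text).
struct Table {
  std::uint64_t id;
  DirtyStatus status;
};

// Rules 1 and 5: a table starts 'clean' unless the DD buffer table
// already holds a row for it, in which case it starts 'buffered'.
Table on_open(std::uint64_t id, const BufferTable& buf) {
  const bool has_row = buf.rows.count(id) != 0;
  return Table{id, has_row ? DirtyStatus::Buffered : DirtyStatus::Clean};
}

// Rule 2: a redo log record was written for some fast-changing metadata.
void on_metadata_logged(Table& t) { t.status = DirtyStatus::Dirty; }

// Rules 3 and 6: on checkpoint, or when the table leaves the cache,
// dirty metadata goes to the DD buffer table.
void flush_to_buffer_table(Table& t, BufferTable& buf) {
  if (t.status != DirtyStatus::Dirty) return;  // nothing new to write
  buf.rows.insert(t.id);
  t.status = DirtyStatus::Buffered;
}

// Rule 4: on EXPORT or slow shutdown, write back to the DD tables
// and drop the now redundant row from the DD buffer table.
void write_back_to_dd(Table& t, BufferTable& buf) {
  if (t.status == DirtyStatus::Clean) return;
  buf.rows.erase(t.id);
  t.status = DirtyStatus::Clean;
}
```

Because flush_to_buffer_table() does nothing unless the status is 'dirty', repeated checkpoints do not rewrite the same row, which is the point of introducing the status.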
Rules 5 and 6 of the status transitions exist because we don't write back to the DD tables when evicting a table from the cache.

Write-back
==========

We can only write back to the DD tables on EXPORT or slow shutdown, while we write back to the DD buffer table on every checkpoint if necessary.

In order to do the write-back efficiently, we introduce a list called dirty_tables that links all tables whose dirty_status is not 'clean'. After we mark an index as corrupted, we should add its table to this list, and only remove the table once it becomes 'clean' again. We also introduce a mutex, dict_dirty_tables_mutex, to protect the dirty tables list.

Write-back to the DD buffer table should be done in log_checkpoint, covered by a new background transaction.

Recovery and Startup
====================

1. Scan

During restart, we have to merge the metadata from both the redo log (after the last checkpoint) and the DD buffer table. We will read the redo log first, and then read metadata from the DD buffer table. If we read nothing from either place, then it could be that no metadata was changed after the latest slow shutdown.

We need one structure to store the persistent data of a table. Currently we only need an array to store the IDs of corrupted indexes, but we may need it to store other fast-changing metadata later. Let's call this struct recv_table_t. We also need a map to maintain the relation between a table and its persistent data. The map would look like std::map<table_id_t, recv_table_t>.

While scanning the redo log, when we find an MLOG_CORRUPTED_INDEX record with pair<table_id_t, <space_id, index_id>>, we first check whether table_id exists in the map. If it does, we add <space_id, index_id> to the entry for that table_id; if not, we create a new entry in the map and add <space_id, index_id> to it.

After the redo log, we have to scan the DD buffer table. Every row whose table_id is already in the map can simply be discarded.
If the table_id does not exist in the map, we should create a new entry for this row and add it to the map. In particular, if the DD buffer table is empty, we can say that all corruption bits or fast-changing metadata have been written back to the DD tables.

2. Apply

After the scan step, we might have several entries in the map. We then iterate over the map and mark as corrupted all in-memory indexes (dict_index_t) that appear in the per-table entries (recv_table_t). We can drop the map once we have finished marking all the in-memory indexes. We don't need to write the corruption flags back to the DD tables right away; they will be written according to the rules above.

If any corrupted flag is changed after recovery, we still need to write a redo log record for it, so that the next recovery can handle it correctly.

Corrupted tablespace
====================

Currently we try to mark a tablespace as corrupted in background I/O threads, for example in buf_page_io_complete, if the tablespace is not the system tablespace. There are some issues here:

1. If this happens during recovery and the dictionary system hasn't been booted, we are not able to mark the tablespace and would crash the server.
2. We only know the space ID of the corrupted page, but not the table ID or index ID. Since we support multi-table tablespaces, marking the whole space as corrupted is not a good strategy here. Besides, if the corrupted page belongs to a secondary index, we don't have to mark the whole tablespace as corrupted, no matter which type of tablespace it is.

In general, it seems we cannot properly mark an index or space as corrupted in buf_page_io_complete until buf_page_get_gen() starts taking dict_index_t* as a parameter. So we will stop marking anything as corrupted in this function; when we find a corrupted page, we just print the error messages and return FALSE. We will try to mark the index to which the page belongs as corrupted later, in an upper layer.
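The scan step of "Recovery and Startup" can be sketched as below. The type names table_id_t and recv_table_t come from the text; everything else (IndexKey, RecoveryMap, the two functions, and the use of std::set instead of an array so that duplicate redo records collapse) is an assumption made for this illustration.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <utility>

using table_id_t = std::uint64_t;
// <space_id, index_id>, as logged in the redo record.
using IndexKey = std::pair<std::uint32_t, std::uint64_t>;

// Persistent data collected for one table during recovery.
// Assumption: a set, so the same index logged twice is stored once.
struct recv_table_t {
  std::set<IndexKey> corrupted_indexes;
};

using RecoveryMap = std::map<table_id_t, recv_table_t>;

// Called for every corrupted-index redo record found after the last
// checkpoint; creates the map entry on first use.
void add_from_redo(RecoveryMap& map, table_id_t table_id,
                   std::uint32_t space_id, std::uint64_t index_id) {
  map[table_id].corrupted_indexes.emplace(space_id, index_id);
}

// Called for every row of the DD buffer table, after the redo scan.
// A row whose table_id is already in the map is discarded, as the
// text specifies. Returns true if the row was added.
bool add_from_buffer_table(RecoveryMap& map, table_id_t table_id,
                           const std::set<IndexKey>& row) {
  if (map.count(table_id) != 0) return false;
  map[table_id].corrupted_indexes = row;
  return true;
}
```

The apply step would then iterate over the map, open each table, and mark every listed index as corrupted before dropping the map.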
1. new SYS_TABLE_INFO_BUFFER table

In fsp_header_init(), we will create the clustered B-tree for 'SYS_TABLE_INFO_BUFFER'.

Add a new class DDTableBuffer to manipulate the 'SYS_TABLE_INFO_BUFFER' table, with functions like replace/get/truncate/remove, etc. In its constructor, we will create the 'SYS_TABLE_INFO_BUFFER' in-memory index object and initialize it. All functions here use InnoDB low-level functions to access the system table, as this should be a table with only a clustered index, no delete-marked records, no locking, no undo logging, no purge, and no adaptive hash index.

Something special in truncate() is that we introduce btr_truncate() and btr_truncate_recover() to do the low-level truncate without changing the root page. In btr_truncate(), we set PAGE_MAX_TRX_ID of the clustered index to -1ULL as a mark, log it, and start the low-level truncate. After it finishes, we use page_create to reset the mark and log that as well. When we construct the DDTableBuffer, we have to call btr_truncate_recover() to check the clustered index: it first reads PAGE_MAX_TRX_ID from the root page to see if it is -1ULL, and if so, calls btr_truncate() again to finish the truncate.

In dict_boot(), we will create the DDTableBuffer object.

In dict_close(), we will delete the DDTableBuffer object.

In row_drop_table_for_mysql(), we have to delete the corresponding row in DDTableBuffer.

In dict_load_table_one(), we should read the corrupted indexes from the system table and set them in the in-memory table object. We add a low-level function dict_table_read_dd_table_buffer_on_load() for this.

In commit_inplace_alter_table(), if it is a rebuild operation, we can remove the row of the table from the system table.

In log_checkpoint(), we will persist all dirty tables to the system table.

In dict_table_remove_from_cache_low(), we will write the metadata back to the system table if the table has some dirty metadata.

2. dirty status

Add a new enum named table_dirty_status, which has 3 values: METADATA_DIRTY, METADATA_BUFFERED, METADATA_CLEAN.
We also introduce dirty_status in dict_table_t.

In dict_table_read_dd_table_buffer_on_load(), we set it to BUFFERED if we can read the row of the table from the DD Table Buffer; otherwise, we set it to CLEAN.

In dict_set_corrupted(), we will set it to DIRTY if it is not already DIRTY.

In dict_table_persist_to_dd_table_buffer_low(), we will set it to BUFFERED.

In MetadataRecover::apply(), we may set it to DIRTY if it was CLEAN but becomes dirty after applying the metadata.

3. dirty_dict_tables

This list stores the tables whose status is not clean. We also introduce a debug variable called in_dirty_dict_tables_list in dict_table_t for assertions.

In dict_set_corrupted(), if the table is not in this list yet, we will add it.

In dict_load_table_one(), if we find a row in the DD Table Buffer, we will add the table to this list.

In dict_table_remove_from_cache_low(), commit_inplace_alter_table() and row_drop_table_for_mysql(), if the table is in this list, we will remove it.

4. persisting

4.1 overall

The main idea here is that, when we want to write metadata back to the system table or to the redo log, we initialize a metadata object for the table. Then we can either iterate over the Persisters so that each persister serializes its own kind of metadata for write-back, or just pick the one persister needed to write a redo log record. The format should be the same in both cases.

We can easily support other persisters for other metadata. Every persister writes its own mark byte to distinguish its data from the others'.

When writing redo logs, we introduce a new mlog id called MLOG_TABLE_METADATA for all persistent metadata, so we don't need to introduce more ids for other metadata.

4.1.1 Checkpoint issue

We want to write dirty metadata back from the in-memory table objects to SYS_TABLE_INFO_BUFFER on each checkpoint. We should also make sure that no more redo log records for fast-changing persistent metadata are written between the write-back and acquiring the log mutex, or else we might lose these redo log records during recovery.
Suppose a checkpoint proceeds like this:

1. We finish writing dirty metadata back to SYS_TABLE_INFO_BUFFER.
2. Some threads still write redo log records for more persistent metadata.
3. We acquire the log mutex, take the oldest_lsn as the checkpoint LSN, and write the MLOG_CHECKPOINT mark later.
4. We finish the whole checkpoint.
5. The server crashes.

On recovery, we will miss the redo log records written in step 2: since we find the MLOG_CHECKPOINT mark, we skip the redo log records before it.

To prevent missing redo log records, we will introduce a new rw-lock for the checkpoint. The checkpoint should hold this rw-lock in X mode before doing the write-back, and release it only after it holds the log mutex. Every thread that wants to write redo log records for dirty persistent metadata must first acquire the rw-lock in S mode. This way we can make sure that no redo log records rush in at step 2.

4.2 new classes/structures

Add a manager class named Persisters, which may hold several persisters to persist metadata. It has interfaces to add/del/get persisters.

Add an interface Persist to serialize/deserialize and log the persistent metadata. It has interfaces like write/write_size/read, and a defined function write_log to write the metadata redo log records. Then add a class CorruptedIndexPersister that implements the interface Persist for corrupted indexes.

Add a new structure dict_persist_t containing the rw-lock mentioned above, the dirty_dict_tables list, the DDTableBuffer object, and the Persisters manager class. We also introduce a mutex to protect the things in this structure, plus dirty_status and in_dirty_dict_tables_list in dict_table. Please also refer to the comments in this structure, which describe the latch level issue.

Add a class PersistentMetadata for all fast-changing persistent metadata; currently it is only used for corrupted indexes.

Add an enum persistent_type_t for the names of persisters so that we can look them up.
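As a rough illustration of how the classes in 4.2 could fit together, here is a simplified sketch. The names Persisters, CorruptedIndexPersister, PersistentMetadata and persistent_type_t come from the text; the base class is called Persister here, and all method signatures, the enumerator PM_INDEX_CORRUPTED, and the byte layout (mark byte, one-byte count, 8-byte big-endian IDs) are assumptions for this example rather than the actual InnoDB definitions.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

// Hypothetical type tag; its value doubles as the persister's mark byte.
enum persistent_type_t : std::uint8_t { PM_INDEX_CORRUPTED = 1 };

struct PersistentMetadata {
  std::uint64_t table_id = 0;
  std::vector<std::uint64_t> corrupted_index_ids;
};

class Persister {
 public:
  virtual ~Persister() = default;
  // Append this persister's part of `m` to `out`; return bytes written.
  virtual std::size_t write(const PersistentMetadata& m,
                            std::vector<std::uint8_t>& out) const = 0;
  // Parse one part starting at `pos`; advance `pos` on success.
  virtual bool read(PersistentMetadata& m, const std::vector<std::uint8_t>& in,
                    std::size_t& pos) const = 0;
};

class CorruptedIndexPersister : public Persister {
 public:
  std::size_t write(const PersistentMetadata& m,
                    std::vector<std::uint8_t>& out) const override {
    const std::size_t n = m.corrupted_index_ids.size();
    if (n == 0 || n > 255) return 0;  // empty, or too many for this toy format
    out.push_back(PM_INDEX_CORRUPTED);  // mark byte
    out.push_back(static_cast<std::uint8_t>(n));
    for (std::uint64_t id : m.corrupted_index_ids)
      for (int shift = 56; shift >= 0; shift -= 8)
        out.push_back(static_cast<std::uint8_t>(id >> shift));
    return 2 + 8 * n;
  }
  bool read(PersistentMetadata& m, const std::vector<std::uint8_t>& in,
            std::size_t& pos) const override {
    if (pos + 2 > in.size() || in[pos] != PM_INDEX_CORRUPTED) return false;
    const std::size_t n = in[pos + 1];
    if (pos + 2 + 8 * n > in.size()) return false;  // truncated input
    std::size_t p = pos + 2;
    for (std::size_t i = 0; i < n; ++i) {
      std::uint64_t id = 0;
      for (int b = 0; b < 8; ++b) id = (id << 8) | in[p++];
      m.corrupted_index_ids.push_back(id);
    }
    pos = p;
    return true;
  }
};

class Persisters {
 public:
  // Create (or return the existing) persister for `type`.
  Persister* add(persistent_type_t type) {
    if (type != PM_INDEX_CORRUPTED) return nullptr;  // only kind known here
    std::unique_ptr<Persister>& slot = persisters_[type];
    if (!slot) slot.reset(new CorruptedIndexPersister());
    return slot.get();
  }
  Persister* get(persistent_type_t type) const {
    auto it = persisters_.find(type);
    return it == persisters_.end() ? nullptr : it->second.get();
  }
  void remove(persistent_type_t type) { persisters_.erase(type); }
  // Serialize every kind of metadata of one table, as for write-back.
  std::vector<std::uint8_t> write(const PersistentMetadata& m) const {
    std::vector<std::uint8_t> out;
    for (const auto& entry : persisters_) entry.second->write(m, out);
    return out;
  }

 private:
  std::map<persistent_type_t, std::unique_ptr<Persister>> persisters_;
};
```

Writing to the system table would go through Persisters::write(), while writing a redo log record would pick a single persister with get(); both paths share the same serialized format, as 4.1 requires.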
4.3 functions

We will initialize/close the persisters with dict_persisters_init() and dict_persisters_close() in the InnoDB startup and shutdown functions.

Add init_metadata() to initialize the metadata for a table.

Add dict_table_read_metadata() to read metadata info from a buffer.

Add mlog_write_initial_dict_log_record() to initialize the metadata log records. To use mach_u64_write_much_compressed() in this function, we have to introduce some related functions like mach_parse_u64_much_compressed() and mlog_parse_initial_dict_log_record(), etc.

Add dict_table_persist_to_dd_table_buffer() to write the metadata of one table back to the system table.

Add dict_persist_to_dd_table_buffer() to write the metadata of all tables with dirty metadata back to the system table; it is used in log_checkpoint(). The only trouble here is that we have to check whether the current thread owns the dict_sys mutex or not.

Add dict_table_persist_to_dd_table_buffer_low() to do the true write-back; here we iterate over the Persisters and set the table to METADATA_BUFFERED.

Remove dict_set_corrupted_index_cache_only().

Rewrite dict_set_corrupted(); it will set the status to METADATA_DIRTY if not already set. The existing dict_set_corrupted() and dict_set_corrupted_index_cache_only() are replaced with the current dict_set_corrupted().

5. Recovery

The main idea here is that we collect the redo log records about corrupted index IDs during recovery. There could be several IDs belonging to the same table, so we merge them. After we have all the persistent metadata, we apply it to the in-memory table objects. When applying the metadata for one table, we open the table first, which loads all buffered metadata of this table into the in-memory table object. So we simply apply the merged metadata directly to the table object.

Add a MetadataRecover class for parsing and merging the metadata read from the redo log, and then applying it to the in-memory tables. Its instance is defined in recv_sys_t.
Add dict_table_apply_persistent_metadata() to apply a metadata object to the in-memory table object.

In recv_parse_log_rec(), we should ask the MetadataRecover object to parse redo log records of type MLOG_PERSISTENT_METADATA.

Add recv_apply_table_metadata() to do the apply; it is called when InnoDB starts, just after redo log parsing has finished.

6. marking space as corrupted in buffer

Remove buf_mark_space_corrupt() and dict_find_table_by_space(), since we shouldn't mark a space as corrupted if there is more than one table in the same space.

Move buf_read_page_handle_error() from buf0rea.cc to buf0buf.cc, since it will be called there.

In buf_page_io_complete(), we don't mark the space as corrupted but just handle the error internally.
Copyright (c) 2000, 2019, Oracle Corporation and/or its affiliates. All rights reserved.