WL#6501: InnoDB: Make Internal InnoDB TRUNCATE TABLE statement to be atomic for single tablespace
Status: Complete
RATIONALE ========= * To make Internal InnoDB TRUNCATE TABLE statement to be atomic, reinitialize the original tablespace header with the same space id and then physically truncate its .ibd file during the truncation for single tablespace. SUMMARY ======= 1. Introduce a new redo log entry type named MLOG_FILE_TRUNCATE. 2. Recreate the tablespace and table indexes by applying MLOG_FILE_TRUNCATE redo log entry during recovery when failing to create complete indexes during normal truncation, instead of dropping the table directly by flagging DICT_OP_TABLE for single tablespace. 3. Reinitialize the original tablespace header with the same space id and then physically truncate its .ibd file during the normal truncation for single tablespace, instead of dropping original tablespace and creating a new one with different space id. LIMITATIONS =========== 1. Don't support to create FTS index when recreating the tablespace and table indexes by applying MLOG_FILE_TRUNCATE redo log entry during recovery as the data dictionary can't be accessed in that phase. 2. Don't resolve issues around FOREIGN KEY constraint on the truncating table, which is another separate problem. 3. Truncate is not atomic for partitioned table as partition create more than one ibd file. 4. Truncate is not atomic for master-slave architecture as MySQL bin-logging still doesn't support atomic DDL. This means on crash of master during truncate operation, table on master can be out-of-sync from slave table copy. User needs to note this and fixup the things accordingly if master crashes on truncate. CONTEXT ======= Internal InnoDB TRUNCATE TABLE statement is not atomic. Since it firstly delete the original tablespace and then create a new tablespace with different space id. So the space id mismatch or missing tablespace problem is caused if a crash happened after the .ibd file is deleted, and before the new .ibd file is created. USECASES ======== There is a main point to the solution, a checkpoint will be written down upon the truncation finish successfully. So 1. To failed truncate, The TRUNCATE redo log record will be applied during recovery. 2. To successful truncate, the TRUNCATE redo log record will not be applied during recovery. To the following case: create table t1(a int not null, b int not null, c int not null, primary key (a), index (b), index (c)) engine=innodb; # INSERT1 let $n=10000; while ($n) { eval insert into t1 values($n,$n,$n); dec $n; } # TRUNCATION truncate table t1; # INSERT2 let $n=5000; while ($n) { eval insert into t1 values($n,$n,$n); dec $n; } The above TRUNCATE and INSERTs will generate the following operations on the redo log: 1. write log records for INSERT1 (The ibd data file is extended) 2. write MLOG_FILE_TRUNCATE for TRUNCATE 3. write log records for freeing indexes 4. write log records for creating indexes 5. log checkpoint 6. write log records for INSERT2 (The ibd data file is extended) If the server crashed after step 2, and before step 5, a new tablespace and indexes will be created when applying the TRUNCATE redo log record, and every hashed log record which belongs to the tablespace will not be applied to pages of the tablespace. The real size of the tablespace can be got from node size of the ibd file for judging whether the hashed log record is outside bounds of the tablespace or not. If the server crashed during step 6, The recovery will start from the last checkpoint written down by the successful truncation. So the TRUNCATE redo log record will be not applied. Every hashed log record from step 6 will be applied into pages of the tablespace normally. The ibd data file is extended by INSERT2. So the real size of the tablespace can be got from node size of the ibd file during recovery, and no one page offset is outside the real size of the tablespace. EXAMPLES ======== The innodb internal TRUNCATE TABLE statement can always truncate a table successfully when recovery from a crash during truncation after the feature. set global innodb_file_per_table=on; TRUNCATE TABLE t1;
The design will influence recovery subsystem, a TRUNCATE redo log entry is written into redo log and it is executed to recreate a tablespace in scan phase during recovery. A checkpoint is introduced during normal truncation for discarding pages which are outside the bounds of the tablespace during recovery. Requirements ============ RECOVERY CHANGED for single tablespace: ----------------------------------------------------------------------------- 1. Parse the TRUNCATE redo log entry in the scanning phase, without inserting the log record into the hash table. 2. Truncate the original tablespace with space id, table name, data directory path and tablespace flags taken out from the TRUNCATE redo log record. 3. Initialize the new tablespace header with the same space id. 4. Create all new index trees with table format, index ids and index types taken out from the TRUNCATE redo log record. 5. Write these created pages into ibd file handle and flush it to disc, in case I/o-handler thread deletes the bitmap page from buffer after recovery. 6. Don't apply these obsolete hashed log records to the pages of the truncated tablespace, which is created by applying the TRUNCATE redo log record. 7. Discard the page if it is outside the bounds of the tablespace, which is created by applying the TRUNCATE redo log record. NORMAL TRUNCATION CHANGED for single tablespace: ----------------------------------------------------------------------------- 1. Create a TRUNCATE redo log record with MLOG_FILE_TRUNCATE type, and then write space id, table name, data directory path, tablespace flags, table format, index ids, index types, number of index fields and index field information of the table. 2. Free all the index trees associated with rows in SYS_INDEXES table. 3. Invalidate in the buffer pool all pages belonging to the tablespace. 4. Remove all insert buffer entries for the tablespace. 5. Reinitialize the original tablespace header with the same space id. 6. Create all new index trees associated with rows in SYS_INDEXES table. 7. Write down checkpoint to redo log after commit the truncation. 8. Truncate the .ibd file physically for the single tablespace.
I. The RECOVERY part is designed by the following steps: 1. Parse the TRUNCATE redo log entry in the scanning phase, without inserting the log record into the hash table. a. Read TRUNCATE redo log record into buffer from log file segment by log_group_read_log_seg(...). b. Scan and store the TRUNCATE redo log record into parsing buffer by recv_scan_log_recs(...). c. Parse the TRUNCATE redo log record as file redo log record written about an .ibd file operation by fil_op_log_parse_or_replay(...) in recv_parse_log_recs(...). 2. Truncate the original tablespace with space id, table name, data directory path and tablespace flags taken out from the TRUNCATE redo log record by invoking fil_truncate_tablespace(...). a. Get the path of the .ibd file by table name and data directory path if the table have a remote data directory by invoking os_file_make_remote_pathname(...) or fil_make_ibd_name(...). b. Get the space and node of the tablespace by space id by invoking fil_space_get_by_id(...) and UT_LIST_GET_FIRST(...) c. Physically truncate the .ibd file of the tablespace and set the size of space and node with truncated size in pages. 3. Initialize the new tablespace header with the same space id. a. Initialize the first extent descriptor page for the tablespace by fsp_header_init(...). b. Initialize the second bitmap page for the tablespace by ibuf_bitmap_page_init(...). 4. Create all new index trees with table format, index ids and index types taken out from the TRUNCATE redo log entry. a. Create the root node for a new clustered index tree with index id in the fourth page of the tablespace if index type contains DICT_CLUSTERED bit. b. Create the root node for a new secondary index tree with index id in the (4+n_secondary_indexes)th page of the tablespace if index type does not contain DICT_CLUSTERED bit. c. Create a new file segment inode in the third page of the tablespace. 5. Write these created pages into .ibd file handle and flush it to disc, in case I/o-handler thread deletes the bitmap page from buffer after recovery. a. Fetch these created pages from buffer by space id and page no, and write them into opened ibd file handle. b. Invoke fil_flush(...) to sync the ibd file to disc. 6. Don't apply these obsolete hashed log records to the pages of the tablespace, which is created by applying the TRUNCATE redo log entry. a. Created a set for table spaces which will be created by applying the TRUNCATE redo log entries. b. Clear the set of truncated table spaces at the begin of the recovery. c. Add the table space to the set at the begin of applying the TRUNCATE redo log entry, we may need apply multiple TRUNCATE redo log entries. d. Clear the set of truncated table spaces at the end of the recovery. 7. Discard the page if it is outside the bounds of the tablespace, which is created by applying the TRUNCATE redo log entry. a. To failed truncate, the file node size of the new created tablespace is 4. So some obsolete hashed log records will be applied into a page, the offset of the page is bigger than 4, which indicates that the page is outside the bounds of the tablespace. b. Firstly unfix and unlatch the page, then remove it from page_hash and LRU if it is outside the bounds of the tablespace. (To successful truncate, no page is outside the bounds of the tablespace. So every hashed log record can be normally applied into pages of the tablespace.) II. NORMAL TRUNCATION part is designed by the following steps: 1. Collects space id, table name, data directory path, tablespace flags, table format, index ids, index types, number of index fields and index field information of the table from SYS_INDEXES and then write these information into the TRUNCATE log record with the following format: /* The TRUNCATE struct is used when applying the TRUNCATE log record during recovery */ struct truncate_log_t { header: ulint type; /* redo log type MLOG_FILE_TRUNCATE used for locating the TRUNCATE log record */ ulint space_id; /* space id used for initializing the header of the tablespace */ ulint format_flags; /* page format used for checking if create a compact page*/ body: ulint flags; /* tablespace flags for checking if the table is compressed and using a remote data directory */ const char* dir_path; /* remote data directory used for locating the .ibd file when truncating it */ const char* tablename, /* table name used for making path of the .ibd file when truncating it */ ulint index_num; /* number of indexes used for creating index pages */ const index_id_t* index_id_buf; /* all the index ids used for creating index pages */ const ulint* type_buf; /* types for indexes used for creating index pages */ const ulint* n_fields_buf; /* number of index fields used for creating compressed index pages */ const ulint* field_len_buf; /* the length of index field used for creating compressed index pages */ const byte* fields_buf; /* index field information used for creating compressed index pages */ } 2. Free all the pages but the root page first, then we free the root page. a. Free all the pages but the root page first by invoking btr_free_but_not_root(...). b. Then we free the root page, the mini-transaction marks the B-tree totally freed by btr_free_root(...). 3. Invalidate in the buffer pool all pages belonging to the tablespace. a. Drop all page hash index entries belonging to the table in batch by invoking buf_LRU_flush_or_remove_pages(...) 4. Remove all insert buffer entries for the tablespace. a. Deletes all entries in the insert buffer for the given space id by invoking ibuf_delete_for_discarded_space(...). 5. Reinitialize the original tablespace header with the same space id. a. Initialize the first extent descriptor page for the tablespace by fsp_header_init(...). b. Initialize the second bitmap page for the tablespace by ibuf_bitmap_page_init(...). 6. Create all new index trees associated with rows in SYS_INDEXES table by invoking dict_create_index_tree(..). a. Create the root node for a new clustered index tree with index id in the fourth page of the tablespace if index type contains DICT_CLUSTERED bit. b. Create the root node for a new secondary index tree with index id in the (4+n_secondary_indexes)th page of the tablespace if index type does not contain DICT_CLUSTERED bit. c. Create a new file segment inode in the third page of the tablespace. 7. Write down a checkpoint to redo log by invoking log_make_checkpoint_at(...) after commit the truncation, and before physically truncate the single-table tablespace. 8. Truncate the .ibd file physically for the single tablespace by invoking fil_truncate_tablespace(...). a. Get the path of the .ibd file by table name and data directory path if the table have a remote data directory by invoking os_file_make_remote_pathname(...) or fil_make_ibd_name(...). b. Get the space and node of the tablespace by space id by invoking fil_space_get_by_id(...) and UT_LIST_GET_FIRST(...) c. Physically truncate the .ibd file of the tablespace and set the size of space and node with truncated size in pages.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.