WL#6501: InnoDB: Make Internal InnoDB TRUNCATE TABLE statement to be atomic for single tablespace
Status: Complete
RATIONALE
=========
* To make Internal InnoDB TRUNCATE TABLE statement to be atomic, reinitialize
the original tablespace header with the same space id and then physically
truncate its .ibd file during the truncation for single tablespace.
SUMMARY
=======
1. Introduce a new redo log entry type named MLOG_FILE_TRUNCATE.
2. Recreate the tablespace and table indexes by applying MLOG_FILE_TRUNCATE
redo log entry during recovery when failing to create complete indexes
during normal truncation, instead of dropping the table directly by
flagging DICT_OP_TABLE for single tablespace.
3. Reinitialize the original tablespace header with the same space id and
then physically truncate its .ibd file during the normal truncation for
single tablespace, instead of dropping original tablespace and creating
a new one with different space id.
LIMITATIONS
===========
1. Don't support to create FTS index when recreating the tablespace
and table indexes by applying MLOG_FILE_TRUNCATE redo log entry
during recovery as the data dictionary can't be accessed in that phase.
2. Don't resolve issues around FOREIGN KEY constraint on the truncating
table, which is another separate problem.
3. Truncate is not atomic for partitioned table as partition create more than
one ibd file.
4. Truncate is not atomic for master-slave architecture as MySQL bin-logging
still doesn't support atomic DDL. This means on crash of master during truncate
operation, table on master can be out-of-sync from slave table copy. User needs
to note this and fixup the things accordingly if master crashes on truncate.
CONTEXT
=======
Internal InnoDB TRUNCATE TABLE statement is not atomic. Since it firstly
delete the original tablespace and then create a new tablespace with
different space id. So the space id mismatch or missing tablespace
problem is caused if a crash happened after the .ibd file is deleted,
and before the new .ibd file is created.
USECASES
========
There is a main point to the solution, a checkpoint will be written down
upon the truncation finish successfully. So
1. To failed truncate, The TRUNCATE redo log record will be applied
during recovery.
2. To successful truncate, the TRUNCATE redo log record will not be
applied during recovery.
To the following case:
create table t1(a int not null, b int not null, c int not null,
primary key (a), index (b), index (c)) engine=innodb;
# INSERT1
let $n=10000;
while ($n)
{
eval insert into t1 values($n,$n,$n);
dec $n;
}
# TRUNCATION
truncate table t1;
# INSERT2
let $n=5000;
while ($n)
{
eval insert into t1 values($n,$n,$n);
dec $n;
}
The above TRUNCATE and INSERTs will generate the following operations
on the redo log:
1. write log records for INSERT1 (The ibd data file is extended)
2. write MLOG_FILE_TRUNCATE for TRUNCATE
3. write log records for freeing indexes
4. write log records for creating indexes
5. log checkpoint
6. write log records for INSERT2 (The ibd data file is extended)
If the server crashed after step 2, and before step 5, a new tablespace
and indexes will be created when applying the TRUNCATE redo log record,
and every hashed log record which belongs to the tablespace will not
be applied to pages of the tablespace. The real size of the tablespace
can be got from node size of the ibd file for judging whether the hashed
log record is outside bounds of the tablespace or not.
If the server crashed during step 6, The recovery will start from the last
checkpoint written down by the successful truncation. So the TRUNCATE redo
log record will be not applied. Every hashed log record from step 6 will
be applied into pages of the tablespace normally. The ibd data file is
extended by INSERT2. So the real size of the tablespace can be got from
node size of the ibd file during recovery, and no one page offset is
outside the real size of the tablespace.
EXAMPLES
========
The innodb internal TRUNCATE TABLE statement can always truncate
a table successfully when recovery from a crash during truncation
after the feature.
set global innodb_file_per_table=on;
TRUNCATE TABLE t1;
The design will influence recovery subsystem, a TRUNCATE redo log entry is written into redo log and it is executed to recreate a tablespace in scan phase during recovery. A checkpoint is introduced during normal truncation for discarding pages which are outside the bounds of the tablespace during recovery. Requirements ============ RECOVERY CHANGED for single tablespace: ----------------------------------------------------------------------------- 1. Parse the TRUNCATE redo log entry in the scanning phase, without inserting the log record into the hash table. 2. Truncate the original tablespace with space id, table name, data directory path and tablespace flags taken out from the TRUNCATE redo log record. 3. Initialize the new tablespace header with the same space id. 4. Create all new index trees with table format, index ids and index types taken out from the TRUNCATE redo log record. 5. Write these created pages into ibd file handle and flush it to disc, in case I/o-handler thread deletes the bitmap page from buffer after recovery. 6. Don't apply these obsolete hashed log records to the pages of the truncated tablespace, which is created by applying the TRUNCATE redo log record. 7. Discard the page if it is outside the bounds of the tablespace, which is created by applying the TRUNCATE redo log record. NORMAL TRUNCATION CHANGED for single tablespace: ----------------------------------------------------------------------------- 1. Create a TRUNCATE redo log record with MLOG_FILE_TRUNCATE type, and then write space id, table name, data directory path, tablespace flags, table format, index ids, index types, number of index fields and index field information of the table. 2. Free all the index trees associated with rows in SYS_INDEXES table. 3. Invalidate in the buffer pool all pages belonging to the tablespace. 4. Remove all insert buffer entries for the tablespace. 5. Reinitialize the original tablespace header with the same space id. 6. Create all new index trees associated with rows in SYS_INDEXES table. 7. Write down checkpoint to redo log after commit the truncation. 8. Truncate the .ibd file physically for the single tablespace.
I. The RECOVERY part is designed by the following steps:
1. Parse the TRUNCATE redo log entry in the scanning phase, without inserting
the log record into the hash table.
a. Read TRUNCATE redo log record into buffer from log file segment
by log_group_read_log_seg(...).
b. Scan and store the TRUNCATE redo log record into parsing buffer
by recv_scan_log_recs(...).
c. Parse the TRUNCATE redo log record as file redo log record
written about an .ibd file operation by
fil_op_log_parse_or_replay(...) in recv_parse_log_recs(...).
2. Truncate the original tablespace with space id, table name, data
directory path and tablespace flags taken out from the TRUNCATE
redo log record by invoking fil_truncate_tablespace(...).
a. Get the path of the .ibd file by table name and data directory
path if the table have a remote data directory by invoking
os_file_make_remote_pathname(...) or fil_make_ibd_name(...).
b. Get the space and node of the tablespace by space id by invoking
fil_space_get_by_id(...) and UT_LIST_GET_FIRST(...)
c. Physically truncate the .ibd file of the tablespace and set
the size of space and node with truncated size in pages.
3. Initialize the new tablespace header with the same space id.
a. Initialize the first extent descriptor page for the tablespace
by fsp_header_init(...).
b. Initialize the second bitmap page for the tablespace by
ibuf_bitmap_page_init(...).
4. Create all new index trees with table format, index ids and index types
taken out from the TRUNCATE redo log entry.
a. Create the root node for a new clustered index tree with index id
in the fourth page of the tablespace if index type contains
DICT_CLUSTERED bit.
b. Create the root node for a new secondary index tree with index id
in the (4+n_secondary_indexes)th page of the tablespace if index
type does not contain DICT_CLUSTERED bit.
c. Create a new file segment inode in the third page of the tablespace.
5. Write these created pages into .ibd file handle and flush it to disc,
in case I/o-handler thread deletes the bitmap page from buffer after
recovery.
a. Fetch these created pages from buffer by space id and page no, and
write them into opened ibd file handle.
b. Invoke fil_flush(...) to sync the ibd file to disc.
6. Don't apply these obsolete hashed log records to the pages of the
tablespace, which is created by applying the TRUNCATE redo log entry.
a. Created a set for table spaces which will be created by applying
the TRUNCATE redo log entries.
b. Clear the set of truncated table spaces at the begin of the recovery.
c. Add the table space to the set at the begin of applying the TRUNCATE
redo log entry, we may need apply multiple TRUNCATE redo log entries.
d. Clear the set of truncated table spaces at the end of the recovery.
7. Discard the page if it is outside the bounds of the tablespace, which
is created by applying the TRUNCATE redo log entry.
a. To failed truncate, the file node size of the new created tablespace
is 4. So some obsolete hashed log records will be applied into a page,
the offset of the page is bigger than 4, which indicates that the page
is outside the bounds of the tablespace.
b. Firstly unfix and unlatch the page, then remove it from page_hash and
LRU if it is outside the bounds of the tablespace.
(To successful truncate, no page is outside the bounds of the tablespace.
So every hashed log record can be normally applied into pages of the
tablespace.)
II. NORMAL TRUNCATION part is designed by the following steps:
1. Collects space id, table name, data directory path, tablespace flags,
table format, index ids, index types, number of index fields and index
field information of the table from SYS_INDEXES and then write these
information into the TRUNCATE log record with the following format:
/* The TRUNCATE struct is used when applying the TRUNCATE log record
during recovery */
struct truncate_log_t {
header:
ulint type; /* redo log type MLOG_FILE_TRUNCATE
used for locating the TRUNCATE
log record */
ulint space_id; /* space id used for initializing
the header of the tablespace */
ulint format_flags; /* page format used for checking
if create a compact page*/
body:
ulint flags; /* tablespace flags for checking
if the table is compressed and
using a remote data directory */
const char* dir_path; /* remote data directory used for
locating the .ibd file when
truncating it */
const char* tablename, /* table name used for making path of
the .ibd file when truncating it */
ulint index_num; /* number of indexes used for
creating index pages */
const index_id_t* index_id_buf; /* all the index ids used for
creating index pages */
const ulint* type_buf; /* types for indexes used for
creating index pages */
const ulint* n_fields_buf; /* number of index fields used for
creating compressed index pages */
const ulint* field_len_buf; /* the length of index field used for
creating compressed index pages */
const byte* fields_buf; /* index field information used for
creating compressed index pages */
}
2. Free all the pages but the root page first, then we free the root page.
a. Free all the pages but the root page first by invoking
btr_free_but_not_root(...).
b. Then we free the root page, the mini-transaction marks the
B-tree totally freed by btr_free_root(...).
3. Invalidate in the buffer pool all pages belonging to the tablespace.
a. Drop all page hash index entries belonging to the table in batch
by invoking buf_LRU_flush_or_remove_pages(...)
4. Remove all insert buffer entries for the tablespace.
a. Deletes all entries in the insert buffer for the given space id
by invoking ibuf_delete_for_discarded_space(...).
5. Reinitialize the original tablespace header with the same space id.
a. Initialize the first extent descriptor page for the tablespace
by fsp_header_init(...).
b. Initialize the second bitmap page for the tablespace by
ibuf_bitmap_page_init(...).
6. Create all new index trees associated with rows in SYS_INDEXES table
by invoking dict_create_index_tree(..).
a. Create the root node for a new clustered index tree with index id
in the fourth page of the tablespace if index type contains
DICT_CLUSTERED bit.
b. Create the root node for a new secondary index tree with index id
in the (4+n_secondary_indexes)th page of the tablespace if index
type does not contain DICT_CLUSTERED bit.
c. Create a new file segment inode in the third page of the tablespace.
7. Write down a checkpoint to redo log by invoking log_make_checkpoint_at(...)
after commit the truncation, and before physically truncate the
single-table tablespace.
8. Truncate the .ibd file physically for the single tablespace by invoking
fil_truncate_tablespace(...).
a. Get the path of the .ibd file by table name and data directory
path if the table have a remote data directory by invoking
os_file_make_remote_pathname(...) or fil_make_ibd_name(...).
b. Get the space and node of the tablespace by space id by invoking
fil_space_get_by_id(...) and UT_LIST_GET_FIRST(...)
c. Physically truncate the .ibd file of the tablespace and set
the size of space and node with truncated size in pages.
Copyright (c) 2000, 2025, Oracle Corporation and/or its affiliates. All rights reserved.