WL#6501: InnoDB: Make Internal InnoDB TRUNCATE TABLE statement to be atomic for single tablespace

Status: Complete

Description
High Level Architecture
Low Level Design

RATIONALE
=========
* To make Internal InnoDB TRUNCATE TABLE statement to be atomic, reinitialize
  the original tablespace header with the same space id and then physically
  truncate its .ibd file during the truncation for single tablespace.


SUMMARY
=======
1. Introduce a new redo log entry type named MLOG_FILE_TRUNCATE.

2. Recreate the tablespace and table indexes by applying MLOG_FILE_TRUNCATE
   redo log entry during recovery when failing to create complete indexes
   during normal truncation, instead of dropping the table directly by
   flagging DICT_OP_TABLE for single tablespace.

3. Reinitialize the original tablespace header with the same space id and
   then physically truncate its .ibd file during the normal truncation for
   single tablespace, instead of dropping original tablespace and creating
   a new one with different space id.


LIMITATIONS
===========
1. Don't support to create FTS index when recreating the tablespace
   and table indexes by applying MLOG_FILE_TRUNCATE redo log entry
   during recovery as the data dictionary can't be accessed in that phase.

2. Don't resolve issues around FOREIGN KEY constraint on the truncating
   table, which is another separate problem.

3. Truncate is not atomic for partitioned table as partition create more than
one ibd file. 

4. Truncate is not atomic for master-slave architecture as MySQL bin-logging
still doesn't support atomic DDL. This means on crash of master during truncate
operation, table on master can be out-of-sync from slave table copy. User needs
to note this and fixup the things accordingly if master crashes on truncate.


CONTEXT
=======
Internal InnoDB TRUNCATE TABLE statement is not atomic. Since it firstly
delete the original tablespace and then create a new tablespace with
different space id. So the space id mismatch or missing tablespace
problem is caused if a crash happened after the .ibd file is deleted,
and before the new .ibd file is created.


USECASES
========

There is a main point to the solution, a checkpoint will be written down
upon the truncation finish successfully. So
    1. To failed truncate, The TRUNCATE redo log record will be applied
       during recovery.
    2. To successful truncate, the TRUNCATE redo log record will not be
       applied during recovery.

To the following case:
create table t1(a int not null, b int not null, c int not null,
                primary key (a), index (b), index (c)) engine=innodb;

# INSERT1
let $n=10000;
while ($n)
{
  eval insert into t1 values($n,$n,$n);
  dec $n;
}

# TRUNCATION
truncate table t1;

# INSERT2
let $n=5000;
while ($n)
{
  eval insert into t1 values($n,$n,$n);
  dec $n;
}

The above TRUNCATE and INSERTs will generate the following operations
on the redo log:
1. write log records for INSERT1 (The ibd data file is extended)
2. write MLOG_FILE_TRUNCATE for TRUNCATE
3. write log records for freeing indexes
4. write log records for creating indexes
5. log checkpoint
6. write log records for INSERT2 (The ibd data file is extended)

If the server crashed after step 2, and before step 5, a new tablespace
and indexes will be created when applying the TRUNCATE redo log record,
and every hashed log record which belongs to the tablespace will not
be applied to pages of the tablespace. The real size of the tablespace
can be got from node size of the ibd file for judging whether the hashed
log record is outside bounds of the tablespace or not.

If the server crashed during step 6, The recovery will start from the last
checkpoint written down by the successful truncation. So the TRUNCATE redo
log record will be not applied. Every hashed log record from step 6 will
be applied into pages of the tablespace normally. The ibd data file is
extended by INSERT2. So the real size of the tablespace can be got from
node size of the ibd file during recovery, and no one page offset is
outside the real size of the tablespace.


EXAMPLES
========
The innodb internal TRUNCATE TABLE statement can always truncate
a table successfully when recovery from a crash during truncation
after the feature.
set global innodb_file_per_table=on;
TRUNCATE TABLE t1;

The design will influence recovery subsystem, a TRUNCATE redo log entry
is written into redo log and it is executed to recreate a tablespace
in scan phase during recovery. A checkpoint is introduced during normal
truncation for discarding pages which are outside the bounds of the
tablespace during recovery.

Requirements
============

RECOVERY CHANGED for single tablespace:
-----------------------------------------------------------------------------
1. Parse the TRUNCATE redo log entry in the scanning phase, without inserting
   the log record into the hash table.

2. Truncate the original tablespace with space id, table name, data
   directory path and tablespace flags taken out from the TRUNCATE
   redo log record.

3. Initialize the new tablespace header with the same space id.

4. Create all new index trees with table format, index ids and index types
   taken out from the TRUNCATE redo log record.

5. Write these created pages into ibd file handle and flush it to disc,
   in case I/o-handler thread deletes the bitmap page from buffer after
   recovery.

6. Don't apply these obsolete hashed log records to the pages of the
   truncated tablespace, which is created by applying the TRUNCATE
   redo log record.

7. Discard the page if it is outside the bounds of the tablespace, which
   is created by applying the TRUNCATE redo log record.


NORMAL TRUNCATION CHANGED for single tablespace:
-----------------------------------------------------------------------------
1. Create a TRUNCATE redo log record with MLOG_FILE_TRUNCATE type, and then
   write space id, table name, data directory path, tablespace flags, table
   format, index ids, index types, number of index fields and index field
   information of the table.

2. Free all the index trees associated with rows in SYS_INDEXES table.

3. Invalidate in the buffer pool all pages belonging to the tablespace.

4. Remove all insert buffer entries for the tablespace.

5. Reinitialize the original tablespace header with the same space id.

6. Create all new index trees associated with rows in SYS_INDEXES table.

7. Write down checkpoint to redo log after commit the truncation.

8. Truncate the .ibd file physically for the single tablespace.

I. The RECOVERY part is designed by the following steps:


1. Parse the TRUNCATE redo log entry in the scanning phase, without inserting
   the log record into the hash table.

    a. Read TRUNCATE redo log record into buffer from log file segment
       by log_group_read_log_seg(...).

    b. Scan and store the TRUNCATE redo log record into parsing buffer
       by recv_scan_log_recs(...).

    c. Parse the TRUNCATE redo log record as file redo log record
       written about an .ibd file operation by
       fil_op_log_parse_or_replay(...) in recv_parse_log_recs(...).

2. Truncate the original tablespace with space id, table name, data
   directory path and tablespace flags taken out from the TRUNCATE
   redo log record by invoking fil_truncate_tablespace(...).

    a. Get the path of the .ibd file by table name and data directory
       path if the table have a remote data directory by invoking
       os_file_make_remote_pathname(...) or fil_make_ibd_name(...).

    b. Get the space and node of the tablespace by space id by invoking
       fil_space_get_by_id(...) and UT_LIST_GET_FIRST(...)

    c. Physically truncate the .ibd file of the tablespace and set
       the size of space and node with truncated size in pages.

3. Initialize the new tablespace header with the same space id.

    a. Initialize the first extent descriptor page for the tablespace
       by fsp_header_init(...).

    b. Initialize the second bitmap page for the tablespace by
       ibuf_bitmap_page_init(...).

4. Create all new index trees with table format, index ids and index types
   taken out from the TRUNCATE redo log entry.

    a. Create the root node for a new clustered index tree with index id
       in the fourth page of the tablespace if index type contains
       DICT_CLUSTERED bit.

    b. Create the root node for a new secondary index tree with index id
       in the (4+n_secondary_indexes)th page of the tablespace if index
       type does not contain DICT_CLUSTERED bit.

    c. Create a new file segment inode in the third page of the tablespace.

5. Write these created pages into .ibd file handle and flush it to disc,
   in case I/o-handler thread deletes the bitmap page from buffer after
   recovery.

    a. Fetch these created pages from buffer by space id and page no, and
       write them into opened ibd file handle.

    b. Invoke fil_flush(...) to sync the ibd file to disc.

6. Don't apply these obsolete hashed log records to the pages of the
   tablespace, which is created by applying the TRUNCATE redo log entry.

    a. Created a set for table spaces which will be created by applying
       the TRUNCATE redo log entries.

    b. Clear the set of truncated table spaces at the begin of the recovery.

    c. Add the table space to the set at the begin of applying the TRUNCATE
       redo log entry, we may need apply multiple TRUNCATE redo log entries.

    d. Clear the set of truncated table spaces at the end of the recovery.

7. Discard the page if it is outside the bounds of the tablespace, which
   is created by applying the TRUNCATE redo log entry.

    a. To failed truncate, the file node size of the new created tablespace
       is 4. So some obsolete hashed log records will be applied into a page,
       the offset of the page is bigger than 4, which indicates that the page
       is outside the bounds of the tablespace.

    b. Firstly unfix and unlatch the page, then remove it from page_hash and
       LRU if it is outside the bounds of the tablespace.

    (To successful truncate, no page is outside the bounds of the tablespace.
     So every hashed log record can be normally applied into pages of the
     tablespace.)




II. NORMAL TRUNCATION part is designed by the following steps:

1. Collects space id, table name, data directory path, tablespace flags,
   table format, index ids, index types, number of index fields and index
   field information of the table from SYS_INDEXES and then write these
   information into the TRUNCATE log record with the following format:

/* The TRUNCATE struct is used when applying the TRUNCATE log record
during recovery */
struct truncate_log_t {
header:
	ulint		type;		/* redo log type MLOG_FILE_TRUNCATE
					used for locating the TRUNCATE
					log record */
	ulint		space_id;	/* space id used for initializing
					the header of the tablespace */
	ulint		format_flags;	/* page format used for checking
					if create a compact page*/
body:
	ulint		flags;		/* tablespace flags for checking
					if the table is compressed and
					using a remote data directory */
	const char*	dir_path;	/* remote data directory used for
					locating the .ibd file when
					truncating it */
	const char*	tablename,	/* table name used for making path of
					the .ibd file when truncating it */
	ulint		index_num;	/* number of indexes used for
					creating index pages */
	const index_id_t* index_id_buf; /* all the index ids used for
					creating index pages */
	const ulint*	type_buf;	/* types for indexes used for
					creating index pages */
	const ulint*	n_fields_buf;	/* number of index fields used for
					creating compressed index pages */
	const ulint*	field_len_buf;	/* the length of index field used for
					creating compressed index pages */
	const byte*	fields_buf;	/* index field information used for
					creating compressed index pages */
}

2. Free all the pages but the root page first, then we free the root page.

    a. Free all the pages but the root page first by invoking
       btr_free_but_not_root(...).

    b. Then we free the root page, the mini-transaction marks the
       B-tree totally freed by btr_free_root(...).

3. Invalidate in the buffer pool all pages belonging to the tablespace.

    a. Drop all page hash index entries belonging to the table in batch
       by invoking buf_LRU_flush_or_remove_pages(...)

4. Remove all insert buffer entries for the tablespace.

    a. Deletes all entries in the insert buffer for the given space id
       by invoking ibuf_delete_for_discarded_space(...).

5. Reinitialize the original tablespace header with the same space id.

    a. Initialize the first extent descriptor page for the tablespace
       by fsp_header_init(...).

    b. Initialize the second bitmap page for the tablespace by
       ibuf_bitmap_page_init(...).

6. Create all new index trees associated with rows in SYS_INDEXES table
   by invoking dict_create_index_tree(..).

    a. Create the root node for a new clustered index tree with index id
       in the fourth page of the tablespace if index type contains
       DICT_CLUSTERED bit.

    b. Create the root node for a new secondary index tree with index id
       in the (4+n_secondary_indexes)th page of the tablespace if index
       type does not contain DICT_CLUSTERED bit.

    c. Create a new file segment inode in the third page of the tablespace.

7. Write down a checkpoint to redo log by invoking log_make_checkpoint_at(...)
   after commit the truncation,  and before physically truncate the
   single-table tablespace.

8. Truncate the .ibd file physically for the single tablespace by invoking
   fil_truncate_tablespace(...).

    a. Get the path of the .ibd file by table name and data directory
       path if the table have a remote data directory by invoking
       os_file_make_remote_pathname(...) or fil_make_ibd_name(...).

    b. Get the space and node of the tablespace by space id by invoking
       fil_space_get_by_id(...) and UT_LIST_GET_FIRST(...)

    c. Physically truncate the .ibd file of the tablespace and set
       the size of space and node with truncated size in pages.