WL#7816: InnoDB: Persist the "corrupted" flag in the data dictionary

Affects: Server-8.0   —   Status: Complete

When InnoDB notices corruption in an index tree, it needs to flag the
index as corrupted, both in its internal data dictionary cache and in
persistent storage.

Because the Global Data Dictionary sits "above" the storage engine in the
locking hierarchy, it would be tricky to initiate an update of DD tables
from the storage engine; it could easily lead to deadlocks. It would also
be tricky to propagate an error all the way up the stack, so that the
upper layer would know to mark the DD object as corrupted.

To persist the attribute, we will piggy-back on the InnoDB redo log,
which can be written to even from low-level code and background threads,
such as page I/O handlers.
FR1: We will write a redo log record, but not write back to DD tables,
     when we find that an index is corrupted. This way, we can record a
     corrupted index at nearly any time and from nearly anywhere. The
     redo log is flushed right after the mtr commits.

FR2: We will introduce an engine-private system table, referred to as
     the 'DD buffer table' below, to store the corruption bits of indexes
     before every checkpoint. We must make sure the same corruption bit
     is not written to this table over and over.

FR3: During recovery, we will read the corruption bits from both the
     redo log and the DD buffer table, and merge the results to mark the
     in-memory table and index objects.

FR4: The data will be written to DD tables.
     The write-back is supposed to happen when:
     1) Slow shutdown (shutdown after SET GLOBAL innodb_fast_shutdown=0)
     2) EXPORT TABLESPACE
     Note that WL#7816 itself will only write the data to the redo log
     and the DD buffer table.

Overall
=======

The rough idea of this worklog is that we persist the corruption bits of
an index via the redo log, and finally write them back to DD tables.
There are three steps to this persisting:

1. Once we find a corrupted index, we write the bit to the redo log.
2. On every checkpoint, we write all in-memory corruption bits to the
   DD buffer table, if they haven't been written there yet.
3. We finally write these corruption bits back to DD tables on:
   3.1 EXPORT TABLESPACE
   3.2 Slow shutdown

As stated in FR4, step 3 is not included in this WL.

Some notes:

On shutdown, all the MDL_EXCLUSIVE requests will trivially succeed,
because all user transactions have been kicked out. On EXPORT TABLESPACE,
we must be able to hold MDL_EXCLUSIVE during part of the operation. It
can be downgraded to MDL_SHARED once the data has been written. This
assumes that no fast-changing metadata will change as a result of read
activity that might run concurrently with the EXPORT.

This mechanism is designed for this worklog only, but it can easily be
applied to other fast-changing bits that need persisting, such as
update_time, auto_inc, COUNT(*), etc. We have to consider the
compatibility of this mechanism, so below I will sometimes say
'fast-changing metadata' instead of 'corruption bit'.


redo log
========

We record the corrupted flag in the redo log, so that we don't have to
access DD tables often and the persistence of the corrupted flag is crash
safe. After a crash, we will scan the redo log during restart to collect
the corruption information. A new mtr log type named MLOG_CORRUPTED_INDEX
is introduced for this purpose.

In order to know which index is corrupted, we need to log the
<table_id, index_id> pair. We don't have any lookup by index_id, so
table_id is a must. Also, we might later make index_id unique only within
a tablespace, no longer unique InnoDB-wide, so we also log the space_id
here in advance.

To survive a crash, we should flush the corrupted-index redo log right
after the related mtr.commit(), by invoking
	log_write_up_to(mtr.commit_lsn(), true);
This way, in most cases (except for a crash happening very soon after),
the corrupted index id is persisted in the redo log.
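
For illustration, a minimal sketch of the logging path (the helper
write_corrupted_index_log() is hypothetical and stands in for the real
record serialization; mtr_t and log_write_up_to() are existing
interfaces):

/* Sketch: log a corrupted index and make the mark durable. */
static
void
dict_index_log_corrupted(dict_index_t*	index)
{
	mtr_t	mtr;

	mtr.start();

	/* Write an MLOG_CORRUPTED_INDEX record carrying the
	<table_id, index_id> pair and the space_id. */
	write_corrupted_index_log(index->table->id, index->space,
				  index->id, &mtr);

	mtr.commit();

	/* Flush the redo log up to this mtr's commit LSN, so the
	mark survives all but an immediate crash. */
	log_write_up_to(mtr.commit_lsn(), true);
}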

Refer to LLD 4.1.1 about the checkpoint issue.

dirty_status in table
=====================

This is a table-level flag (in dict_table_t) indicating that some
fast-changing metadata needs to be written back to either the DD buffer
table or DD tables; we track it because we don't want to do the
write-back over and over.

This flag has one of three status values:
1. dirty: some fast-changing metadata is only marked in memory, without
          any write-back (so we have to write it to the DD buffer table
          on the next checkpoint, and to DD tables on EXPORT or slow
          shutdown).
2. buffered: some fast-changing metadata is marked both in memory and
             in the DD buffer table (we still have to write it back to
             DD tables on EXPORT or slow shutdown).
3. clean: the DD tables are up to date, and the DD buffer table holds
          nothing for this table.


DD buffer table
===============
It should be an engine-private system table, which is created along with
other system tables and never discarded.

Table structure:
CREATE TABLE DD_buffer_table (
    table_id BIGINT NOT NULL PRIMARY KEY,
    fast_changing_metadata BLOB NOT NULL) ENGINE=InnoDB;

We will write all fast-changing metadata into the blob field as
key-value pairs, so that we can write different metadata for different
tables dynamically.
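
For example, the blob could be laid out as follows (the exact encoding is
up to the persisters described in LLD 4; the layout here is illustrative):

fast_changing_metadata :=
	{ <1-byte persister mark> <that persister's payload> } ...

corrupted-index payload :=
	<number of corrupted indexes> { <space_id> <index_id> } ...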

We introduce this table to avoid 'copy across checkpoint'. In the
original design, all corruption bits would be re-written on every
checkpoint, which could mean writing the same corruption bits over and
over, because we rarely write them back to DD tables while the server is
running. Writing the corruption bits back to the Global DD during a
checkpoint is difficult because the locks required would conflict with
user DMLs.

At first, the DD buffer table is empty. On checkpoint, we issue REPLACE
operations to write dirty fast-changing metadata to it, if and only if
dirty_status is dirty. After we write the metadata back to DD tables, we
can delete the corresponding rows in the DD buffer table.

For DROP/ALTER/TRUNCATE, which change the table_id of a table, we should
update the DD buffer table accordingly:
1. ha_innobase::delete_table()
(DROP TABLE, ALTER TABLE…DROP PARTITION, and ALTER TABLE…ALGORITHM=COPY):
   We simply try to delete the corresponding row in the DD buffer table.
   When the DDL transaction commits, the data is gone.
2. ha_innobase::commit_inplace_alter_table(commit=true)
(ALTER TABLE…ALGORITHM=INPLACE, OPTIMIZE TABLE):
   If the table is being rebuilt, we delete the DD buffer table entry
   for the old table.
3. ha_innobase::truncate()
(TRUNCATE TABLE, ALTER TABLE…TRUNCATE PARTITION):
   truncate() will later invoke create() and delete_table(). The DD
   buffer table entry for the old table_id is deleted in delete_table().

With the DD buffer table, the status is set and converted as follows:
1. At first, the status is 'clean' (but see 5 below).
2. Once we write a redo log record for any fast-changing metadata, we set
   the corresponding status to 'dirty'.
3. On checkpoint, if we find a status of 'dirty', we write/update the
   metadata in the DD buffer table and set the status to 'buffered'.
4. On EXPORT or slow shutdown, if we find a status of either 'dirty' or
   'buffered', we write the metadata back to DD tables and set the status
   to 'clean' again. After this true write-back, we can delete the
   corresponding row in the DD buffer table.
5. On opening a table, we check the DD buffer table first; if there is a
   row for the current table, we read the metadata, update the table
   object in the cache and set the status to 'buffered'.
6. On removing a table object from the cache, we write/update the
   metadata in the DD buffer table.

Rules 5 and 6 exist because we don't do a write-back when evicting a
table from the cache.


Write-back
==========

We can only write back to DD tables on EXPORT or slow shutdown, while we
write back to the DD buffer table on every checkpoint, if necessary.

To do the write-back efficiently, we introduce a list called dirty_tables
that links all tables whose dirty_status is not 'clean'. After we mark an
index as corrupted, we add its table to this list, and we only remove the
table once it becomes 'clean' again. We also introduce a mutex,
dict_dirty_tables_mutex, to protect the dirty tables list.

Write-back to the DD buffer table is done in log_checkpoint(), covered by
a new background transaction.
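
A sketch of the checkpoint-time write-back (names follow this section;
dict_table_persist_to_dd_table_buffer_low() is described in the LLD, and
iterating the intrusive list is simplified here):

/* Called from log_checkpoint(): flush every 'dirty' table's
fast-changing metadata into the DD buffer table. */
void
dict_persist_to_dd_table_buffer()
{
	mutex_enter(&dict_dirty_tables_mutex);

	for (dict_table_t* table : dirty_tables) {
		if (table->dirty_status == METADATA_DIRTY) {
			/* REPLACE the serialized metadata into the
			DD buffer table; this also sets the status
			to METADATA_BUFFERED. */
			dict_table_persist_to_dd_table_buffer_low(table);
		}
	}

	mutex_exit(&dict_dirty_tables_mutex);
}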

Recovery and Startup
====================

1. Scan

During restart, we have to merge the metadata from both the redo log
(after the last checkpoint) and the DD buffer table. We read the redo log
first, and then read the metadata from the DD buffer table. If we read
nothing from either place, no metadata was changed after the latest slow
shutdown.

We need a structure to store the persistent data of a table. Currently we
only need an array to store the IDs of corrupted indexes, but we may need
it to store other fast-changing metadata later. Let's call this struct
recv_table_t.

We also need a map to maintain the relation between a table and its
persistent data. The map looks like std::map<table_id_t, recv_table_t>.
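
A sketch of these recovery-side structures (field and type names are
illustrative):

/* Per-table persistent data collected during recovery. */
struct recv_table_t {
	/* Corrupted indexes of this table, as <space_id, index_id>
	pairs, since index_id may only be unique within a tablespace. */
	std::vector<std::pair<space_id_t, space_index_t> >
			corrupted_ids;
};

/* Maps a table_id to the persistent data collected for it. */
typedef std::map<table_id_t, recv_table_t>	recv_table_map_t;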

While scanning the redo log, when we find an MLOG_CORRUPTED_INDEX record
carrying a <table_id, index_id> pair (plus the space_id), we first check
whether the table_id already exists in the above map. If it does, we add
the <space_id, index_id> pair to that table's entry; if not, we create a
new entry in the map and add the pair to it.

After the redo log, we scan the DD buffer table. Every row whose table_id
is already in the above map can simply be discarded. If the table_id is
not yet in the map, we create a new entry for the row and add it to the
map.

In particular, if the DD buffer table is empty, we know that all
corruption bits or other fast-changing metadata have been written back to
DD tables.

2. Apply

After the scan step, we may have several entries in the map. We then
iterate over the map and mark all in-memory indexes (in dict_index_t)
that appear in a recv_table_t as corrupted. We can drop the map once we
finish marking all the in-memory indexes.

We don't need to write the corruption flags back to DD tables right away;
they will be written according to the rules above. If any corrupted flag
changes after recovery, we still write a redo log record for it, so the
next recovery can handle it correctly.


Corrupted tablespace
====================

Currently we try to mark a tablespace as corrupted from background I/O
threads, for example in buf_page_io_complete(), if the tablespace is not
the system tablespace. There are some issues here:
1. If this happens during recovery, before the dictionary system has been
   booted, we are unable to mark the tablespace and would crash the
   server.
2. We only know the space id of the corrupted page, not the table id or
   index id. Since we support multi-table tablespaces, marking the whole
   space as corrupted is not a good strategy. Besides, if the corrupted
   page belongs to a secondary index, we don't have to mark the whole
   tablespace as corrupted, no matter which type of tablespace it is.

In general, we cannot properly mark an index or space as corrupted in
buf_page_io_complete() until buf_page_get_gen() starts taking a
dict_index_t* parameter. So we prevent marking anything as corrupted in
this function, and just print the error messages and return FALSE when we
find a corrupted page. We will mark the index to which the page belongs
as corrupted later, in an upper layer.

Low-Level Design
================

1. new SYS_TABLE_INFO_BUFFER table

In fsp_header_init(), we will create the clustered index B-tree for
'SYS_TABLE_INFO_BUFFER'.

Add a new class DDTableBuffer to manipulate the 'SYS_TABLE_INFO_BUFFER'
table, with functions like replace()/get()/truncate()/remove(), etc. In
its constructor, we create the 'SYS_TABLE_INFO_BUFFER' in-memory index
object and initialize it.

All functions here use InnoDB low-level functions to access the system
table, since this is a table with only a clustered index, no
delete-marked records, no locking, no undo logging, no purge and no
adaptive hash index.
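
A sketch of the interface (method signatures are illustrative):

/* Wrapper around SYS_TABLE_INFO_BUFFER using low-level access only. */
class DDTableBuffer {
public:
	/* Create and initialize the in-memory index object. */
	DDTableBuffer();
	~DDTableBuffer();

	/* Insert or update the metadata blob stored for a table. */
	dberr_t	replace(table_id_t id, const byte* metadata, ulint len);

	/* Read the metadata blob of a table; empty if there is none. */
	std::string* get(table_id_t id);

	/* Delete the row of a table, e.g. on DROP TABLE. */
	dberr_t	remove(table_id_t id);

	/* Empty the table, via btr_truncate(). */
	void	truncate();
};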

Something special about truncate() is that we introduce btr_truncate()
and btr_truncate_recover() to do the low-level truncate without changing
the root page. In btr_truncate(), we set and log the PAGE_MAX_TRX_ID of
the clustered index to -1ULL, then start the low-level truncate; after it
finishes, we use page_create() to reset the mark, and log that too. When
we construct the DDTableBuffer, we call btr_truncate_recover() to check
the clustered index: it first reads the PAGE_MAX_TRX_ID from the root
page to see if it is -1ULL, and if so, calls btr_truncate() again to
finish the truncate.
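
A sketch of the recovery check (simplified; the real code has to handle
the B-tree latches as well):

/* If a previous truncate of this index was interrupted by a crash,
its root page still carries the -1ULL mark; redo the truncate. */
void
btr_truncate_recover(dict_index_t*	index)
{
	mtr_t	mtr;

	mtr.start();
	page_t*		root = btr_root_get(index, &mtr);
	trx_id_t	max_trx_id = page_get_max_trx_id(root);
	mtr.commit();

	if (max_trx_id == trx_id_t(-1)) {
		/* We crashed in the middle of a truncate: finish it. */
		btr_truncate(index);
	}
}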

In dict_boot(), we will create the DDTableBuffer object.

In dict_close(), we will delete the DDTableBuffer object.

In row_drop_table_for_mysql(), we have to delete the corresponding row in
the DDTableBuffer.

In dict_load_table_one(), we should read the corrupted indexes from the
system table and set them on the in-memory table object. We add a
low-level function dict_table_read_dd_table_buffer_on_load() for this.

In commit_inplace_alter_table(), if it's a rebuild operation, we remove
the table's row from the system table.

In log_checkpoint(), we would persist all dirty tables to the system table.

In dict_table_remove_from_cache_low(), we will write the metadata back to
the system table if the table has dirty metadata.

2. dirty status

Add a new enum named table_dirty_status, which has three values:
METADATA_DIRTY, METADATA_BUFFERED and METADATA_CLEAN. We also introduce a
dirty_status field in dict_table_t.
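
A sketch (the comments summarize the rules from the 'dirty_status'
section above; placement within dict_table_t is illustrative):

/* Dirty state of a table's fast-changing persistent metadata. */
enum table_dirty_status {
	/* Changed in memory only: write to the DD buffer table on the
	next checkpoint, and to DD tables on EXPORT or slow shutdown. */
	METADATA_DIRTY = 0,
	/* In the DD buffer table, but not yet in DD tables. */
	METADATA_BUFFERED,
	/* DD tables are up to date; nothing buffered for this table. */
	METADATA_CLEAN
};

struct dict_table_t {
	/* ... existing members ... */

	/* Protected by the mutex in dict_persist_t (see 4.2). */
	table_dirty_status	dirty_status;
};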

In dict_table_read_dd_table_buffer_on_load(), we set it to
METADATA_BUFFERED if we can read the table's row from the DD Table
Buffer; otherwise, we set it to METADATA_CLEAN.

In dict_set_corrupted(), we set it to METADATA_DIRTY if it is not already
DIRTY.

In dict_table_persist_to_dd_table_buffer_low(), we set it to
METADATA_BUFFERED.

In MetadataRecover::apply(), we may set it to METADATA_DIRTY if it was
CLEAN but becomes DIRTY after applying the metadata.


3. dirty_dict_tables

This list stores the tables whose dirty_status is not clean. We also
introduce a debug variable called in_dirty_dict_tables_list in
dict_table_t for assertions.

In dict_set_corrupted(), if the table is not yet in this list, we add it.

In dict_load_table_one(), if we find a row for the table in the DD Table
Buffer, we add the table to this list.

In dict_table_remove_from_cache_low(), commit_inplace_alter_table() and
row_drop_table_for_mysql(), if the table is in this list, we remove it.


4. persisting

4.1 overall

The main idea here is that when we want to write metadata back to the
system table or to the redo log, we initialize a metadata object for
every table; we can then iterate over the Persisters manager to get the
persisters that serialize the different kinds of metadata for write-back,
or just pick one persister to write the redo log. The format is the same
in both cases, so we can easily support other persisters for other
metadata.

Every persister writes its own mark byte to distinguish its payload from
the others.

When writing redo logs, we introduce a single new mlog id called
MLOG_TABLE_METADATA for all persistent metadata, so we don't have to
introduce more ids for other metadata.
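
A sketch of the record body (layout illustrative; the same mark-byte
serialization goes into the DD buffer table blob described above):

MLOG_TABLE_METADATA :=
	<table_id (much-compressed)>
	{ <1-byte persister mark> <persister payload> } ...

e.g. the corrupted-index persister writes, after its mark byte, the
number of corrupted indexes followed by their <space_id, index_id> pairs.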

4.1.1 Checkpoint issue

We want to write dirty metadata from the in-memory table objects back to
SYS_TABLE_INFO_BUFFER on each checkpoint, and we should also make sure
that no more redo logs for fast-changing persistent metadata are written
between the write-back and acquiring the log mutex, or else we might lose
those redo logs during recovery.

Suppose we did a checkpoint like this:
1. We finish writing back dirty metadata to SYS_TABLE_INFO_BUFFER.
2. Some threads still write redo logs for more persistent metadata.
3. We acquire the log mutex, take the oldest_lsn as the checkpoint lsn
   and write the MLOG_CHECKPOINT mark later.
4. The whole checkpoint finishes.
5. The server crashes.

On recovery, we will miss the redo logs written in step 2: since we found
the MLOG_CHECKPOINT mark, we skip the redo logs before it.

To prevent missing redo logs, we introduce a new rw-lock for checkpoints:
we hold this rw-lock in X mode before we do the write-back and release it
only after we have acquired the log mutex. Every thread that wants to
write redo logs for dirty persistent metadata must first acquire the
rw-lock in S mode. This makes sure no redo logs can rush in during
step 2.
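
A sketch of the protocol (the member name 'lock' in dict_persist_t is
illustrative; the rw-lock itself is described in 4.2):

/* Checkpoint thread */
rw_lock_x_lock(&dict_persist->lock);
dict_persist_to_dd_table_buffer();	/* write back dirty metadata */
log_mutex_enter();
rw_lock_x_unlock(&dict_persist->lock);	/* safe: log mutex is held */
/* ... take oldest_lsn as the checkpoint lsn,
write the MLOG_CHECKPOINT mark ... */
log_mutex_exit();

/* Any thread logging dirty persistent metadata */
rw_lock_s_lock(&dict_persist->lock);
/* ... write the MLOG_TABLE_METADATA redo log ... */
rw_lock_s_unlock(&dict_persist->lock);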

4.2 new classes/structures

Add a manager class named Persisters, which may hold several persisters
for persisting the metadata. It has interfaces to add/remove/get
persisters.

Add an interface Persist to serialize/deserialize and log the persistent
metadata. It has methods like write()/write_size()/read() and a defined
function write_log() to write the metadata redo logs. Then add a class
CorruptedIndexPersister that implements the Persist interface for
corrupted indexes.

Add a new structure dict_persist_t containing the mentioned rw-lock, the
dirty_dict_tables list, the DDTableBuffer object and the Persisters
manager. We also introduce a mutex to protect the members of this
structure, plus dirty_status and in_dirty_dict_tables_list in
dict_table_t. Please also refer to the comments in this structure, which
describe the latching order.

Add a class PersistentMetadata for all fast-changing persistent metadata;
currently it is only used for corrupted indexes.

Add an enum persistent_type_t for the names of persisters so that we can
look for them.
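
A sketch of these classes (signatures are illustrative):

/* Interface for serializing one kind of persistent metadata. */
class Persist {
public:
	virtual ~Persist() {}

	/* Serialize the metadata, mark byte included; returns the
	number of bytes written. */
	virtual ulint	write(const PersistentMetadata&	metadata,
			      byte*			buffer,
			      ulint			size) const = 0;

	/* Upper bound of the serialized size. */
	virtual ulint	write_size(
			      const PersistentMetadata&	metadata) const = 0;

	/* Deserialize from a buffer back into the metadata object. */
	virtual ulint	read(PersistentMetadata&	metadata,
			     const byte*		buffer,
			     ulint			size) const = 0;

	/* Shared by all persisters: wrap the serialization in an
	MLOG_TABLE_METADATA redo record for the given table. */
	void	write_log(table_id_t			id,
			  const PersistentMetadata&	metadata,
			  mtr_t*			mtr) const;
};

/* Persister for the corrupted-index bits: its payload is the number
of corrupted indexes followed by their <space_id, index_id> pairs. */
class CorruptedIndexPersister : public Persist { /* ... */ };

/* Manager looking up persisters by persistent_type_t. */
class Persisters {
public:
	Persist*	get(persistent_type_t type) const;
	Persist*	add(persistent_type_t type);
	void		remove(persistent_type_t type);
};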

4.3 functions

We will initialize/close the persisters with dict_persisters_init() and
dict_persisters_close() in the InnoDB startup and shutdown functions.

Add init_metadata() to initialize metadata for a table.

Add dict_table_read_metadata() to read metadata info from a buffer.

Add mlog_write_initial_dict_log_record() to write the initial part of the
metadata log records. To use mach_u64_write_much_compressed() in this
function, we have to introduce some related functions like
mach_parse_u64_much_compressed() and mlog_parse_initial_dict_log_record(),
etc.

Add dict_table_persist_to_dd_table_buffer() to write a single table's
metadata back to the system table.

Add dict_persist_to_dd_table_buffer() to write back the metadata of all
tables with dirty metadata to the system table; it is called from
log_checkpoint(). The only trouble here is that we have to check whether
the current thread owns the dict_sys mutex or not.

Add dict_table_persist_to_dd_table_buffer_low() to do the true
write-back; it iterates over the Persisters and sets the table to
METADATA_BUFFERED.

Remove dict_set_corrupted_index_cache_only(), and rewrite
dict_set_corrupted() to set the status to METADATA_DIRTY if it is not set
yet. All callers of the old dict_set_corrupted() and
dict_set_corrupted_index_cache_only() now use the new
dict_set_corrupted().

5. Recovery

The main idea here is that we collect the redo logs about corrupted index
ids during recovery. Several ids may belong to the same table, so we
merge them. After we have all the persistent metadata, we apply it to the
in-memory table objects. When applying the metadata of one table, we open
the table first, so that all buffered metadata of this table is loaded
into the in-memory table object; then we simply apply the merged metadata
directly to the table object.

Add a MetadataRecover class for parsing and merging the metadata read
from redo logs, and then applying it to the in-memory tables. An instance
of it is defined in recv_sys_t.
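
A sketch of its interface (signatures are illustrative):

/* Collects and merges persistent metadata found in the redo log, then
applies it to in-memory tables; one instance lives in recv_sys_t. */
class MetadataRecover {
public:
	/* Parse one MLOG_TABLE_METADATA record and merge it into the
	metadata already collected for this table_id. */
	byte*	parseMetadataLog(table_id_t	id,
				 byte*		ptr,
				 byte*		end);

	/* Open each table and apply the merged metadata to its
	in-memory object; called after redo log parsing finishes. */
	void	apply();
};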

Add dict_table_apply_persistent_metadata() to apply a metadata object to
the in-memory table object.

In recv_parse_log_rec(), we ask the MetadataRecover object to parse the
redo logs of type MLOG_TABLE_METADATA.

Add recv_apply_table_metadata() to do the apply; it is called when InnoDB
starts, just after we finish parsing the redo logs.


6. marking space as corrupted in buffer

Remove buf_mark_space_corrupt() and dict_find_table_by_space(), since we
shouldn't mark a whole space as corrupted when there can be more than one
table in it.

Move buf_read_page_handle_error() from buf0rea.cc to buf0buf.cc, since it
is called there.

In buf_page_io_complete(), we don't mark the space as corrupted but just
handle the error internally.