WL#6204: InnoDB persistent max value for autoinc columns

Affects: Server-8.0   —   Status: Complete

See BUG#199 on MySQL bugs. Currently, when a table is opened, InnoDB
executes the equivalent of: SELECT MAX(c) FROM t; where c is the AUTOINC
column name. This step initialises the column's next autoinc value, and
allocation of autoinc values starts from this point. InnoDB also does this
when it executes 'ALTER TABLE AUTO_INCREMENT=N;'.

Suppose there is a table like this:

(1) CREATE TABLE t(a BIGINT PRIMARY KEY AUTO_INCREMENT) ENGINE=InnoDB
AUTO_INCREMENT=42;
(2) BEGIN; INSERT INTO t VALUES(0),(0),(0); -- inserts 42,43,44
(3) COMMIT; -- let us assume that this will make a checkpoint
(4) BEGIN; INSERT INTO t VALUES(0); -- 45;

From here, consider the following scenarios:

Scenario 1:
(5) -- do nothing here
(6) -- kill server and restart

Scenario 2:
(5) ROLLBACK;
(6) -- shutdown the server normally then restart, or crash and restart

Scenario 3:
(5) COMMIT; -- let's assume this would flush the redo logs
(6) -- shutdown the server normally then restart, or crash and restart

After the server restarts, in all these scenarios we do the following:
(7) INSERT INTO t VALUES(0); -- should insert 46
(8) SELECT * FROM t; -- should return 42,43,44,46

In scenarios 1 and 2, the INSERT at step #7 currently starts allocation
from 45 (ignoring the offset setting), reusing the value allocated in
step #4 instead of continuing after it. Only in scenario 3 does it start
from 46, which is what we expect. InnoDB should keep track of the maximum
value, preserve it across restarts, and continue allocation from there.

InnoDB should at least make sure the result of scenario 2 is the same as
scenario 3, because there is no crash. InnoDB should also try to make the
result of scenario 1 the same as scenario 3, but it can't guarantee the
latter for performance reasons, which are discussed later.

So we plan to implement the following features in this worklog:
1. The AUTOINC counters will be persisted through redo logs.
2. During recovery, all AUTOINC counters will be collected from redo logs
and applied to the in-memory counters if they're bigger.
3. There won't be any rollback of AUTOINC counters.
4. 'SELECT MAX(c) FROM t' will no longer be needed, except during IMPORT
TABLESPACE.
5. The largest updated counter will be logged and will survive a restart.
6. For performance, we may write some extra redo logs, but no new
mini-transactions (mtr).

FR1: AUTO_INCREMENT counters should be strictly increasing while the
server is running, as they are now. Rollback of a transaction won't revert
the counter, and 'ALTER TABLE' won't change the counter to a smaller value
either.

FR2: Allocated counters won't be reused after the server restarts
normally; this must be guaranteed.

FR3: Allocated counters won't be reused after the server restarts from a
crash; this can NOT be guaranteed.

FR4: Redo logs for AUTO_INCREMENT counters will be written whenever the
counters change. During recovery, InnoDB will scan the redo logs to
collect counter changes and apply them to the in-memory table objects.
There is no need to flush the redo logs immediately.

FR5: Once the counter of a table has changed, it will be written back to
the 'DDTableBuffer' on checkpoint.

FR6: This worklog only writes counters to redo logs and the DDTableBuffer.

FR7: SELECT MAX(counter) is needed neither at initialization nor during
'ALTER TABLE'. It is still needed when we import a tablespace without a
correct config file; in that case, scanning the index is the only way to
get the current max counter.

FR8: There is a behaviour change: if the counter is updated to a value
larger than the current next value, we will remember the larger value and
allocate counters from it next time. This has several advantages, which
are mentioned later.


NFR1: We should guarantee that there is no significant performance
regression after this worklog. We need to acquire new rw-locks and write
redo logs (currently with no need to flush them immediately) when we
update the in-memory counter, but these should not affect performance too
much. We should not introduce a new mtr in this worklog, since mtr_commit
is time-consuming.

In the rest of this worklog, the word 'counter' is used as shorthand for
the 'auto_increment counter'.


How to persist
==============

First of all, considering performance, we should not introduce a new mtr
to write the redo logs; mtr_commit is time-consuming and would affect
performance. So we will write redo logs for the counter within the mtr
already started for modifications of the clustered index (both insert and
update). This means we log the counter for every row rather than in
batches.

In fact, the counter is already logged in the existing page redo logs
when we insert the row into the clustered index and/or the counter into
related indexes.

Theoretically, we could re-use these page redo logs for our purpose of
persisting the counter; alternatively, we could persist the counter the
way WL#7816 does. Let's compare how these two approaches work.

1. The redo log approach of WL#7816

With this approach, for every new PM_TABLE_AUTO_INC log introduced for the
counter, we have to log the metadata type, the table id and the counter,
which consumes 1 + 5 (actually 1-5) + 1 + 5 (actually 1-5) = 12 bytes at
most (the compressed fields take 1-5 bytes each). This log is required
whenever we have to update the counter, either before an INSERT statement
or before an UPDATE which introduces a bigger counter than the current
one.

This approach is easier and clearer.
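
To make the byte accounting concrete, here is a small self-contained
sketch of how such a counter log record could be serialized. The encoding,
constants and helper names are illustrative assumptions, not the actual
InnoDB redo log format:

        #include <cstdint>

        /* Illustrative constants; not the real InnoDB values. */
        static const uint8_t MLOG_TABLE_METADATA = 62; /* assumed log type */
        static const uint8_t PM_TABLE_AUTO_INC   = 2;  /* assumed md type */

        /* Variable-length encoding in the spirit of InnoDB's compressed
           format: 1-5 bytes for values that fit in 32 bits. */
        static uint8_t* write_compressed(uint8_t* p, uint64_t n) {
                while (n >= 0x80) {
                        *p++ = static_cast<uint8_t>(n) | 0x80;
                        n >>= 7;
                }
                *p++ = static_cast<uint8_t>(n);
                return p;
        }

        /* Serialize one counter record: 1 (log type) + 1-5 (table id)
           + 1 (metadata type) + 1-5 (counter) = 12 bytes at most for
           values in the 32-bit range, matching the estimate above. */
        static uint8_t* log_autoinc(uint8_t* p, uint64_t table_id,
                                    uint64_t counter) {
                *p++ = MLOG_TABLE_METADATA;
                p = write_compressed(p, table_id);
                *p++ = PM_TABLE_AUTO_INC;
                p = write_compressed(p, counter);
                return p;
        }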

2. Re-use existing page redo logs

If we want to use the existing page redo logs, we have to add some more
information to the current redo logs. Since there is no table id there, we
would have to log it additionally, so that during recovery we can know
which counter belongs to which table.

Besides, we don't know which field in the logged record is the counter,
so we would have to record that in one byte. This affects both the
clustered index and secondary indexes, and may also affect redo logs for
tables without AUTO_INC.

In the current recovery, we start from the checkpoint first and only then
boot the dict system, so we can't get the table information before
scanning the redo logs, which makes it impossible to learn the field
information from the table definition.

Considering these points, it would cost an extra 5 (actually 1-5) + 1 = 6
bytes for INSERT redo logs on the clustered index. The byte indicating
whether the index is clustered would presumably be required for every
index. Besides, we have many insert logs for copying entries from an old
page to a new page, and the extra byte would be needed there too.

So approach 1 looks like the better choice, and this design is based on
it.


Persist by redo logs
====================

We will use the redo log approach for persisting the auto_inc counter, as
we do in WL#7816. The counter columns will be maintained by InnoDB in main
memory; however, every change of the counter will be written to the redo
logs. There are 3 steps to the persisting:

1. Every time we write an insert log (or possibly an update log) for the
clustered index, we write the current updated counter to the redo log. We
log the absolute value instead of the increment.
2. On every checkpoint, we write all in-memory changed counters to the
DDTableBuffer if they haven't been written yet.
3. We finally write these counters back to the DD tables on:
   3.1 EXPORT TABLESPACE
   3.2 Slow shutdown

Once again, step 3 is not included in this worklog.

Since this worklog is based on the mechanism introduced by WL#7816, we
will have the same strategy for writing redo logs, marking
dict_table_t::dirty_status, adding the table to the dirty tables list,
writing dirty tables back to the DDTableBuffer, reading and analyzing redo
logs during recovery, and finally applying the redo logs to in-memory
table objects. Some exceptions are mentioned later.

Since we will write the redo logs in low-level functions, we have to
remember the position of the counter field in the clustered index. We
remember the field position when we create the table, and update it on
'ALTER TABLE' if the field order changes. After we IMPORT a tablespace, we
have to set it again. We can get this value from the server in these
cases. On INSERT/UPDATE, we simply read the remembered position and
retrieve the corresponding counter value from the clustered index record.
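
A minimal sketch of reading the counter on the INSERT/UPDATE path, using
InnoDB's tuple accessors; the integer decoding is a simplified assumption
(an unsigned BIGINT column, stored big-endian):

        /* Sketch against InnoDB internals (data0data.h). Read the counter
           from a clustered index entry using the remembered position. */
        static ib_uint64_t get_autoinc_from_entry(const dtuple_t* entry,
                                                  ulint pos) {
                const dfield_t* field = dtuple_get_nth_field(entry, pos);
                if (dfield_is_null(field)) {
                        return 0; /* no counter value in this row */
                }
                const byte* data =
                        static_cast<const byte*>(dfield_get_data(field));
                ulint len = dfield_get_len(field);
                ib_uint64_t value = 0;
                for (ulint i = 0; i < len; i++) {
                        value = (value << 8) | data[i]; /* big-endian */
                }
                return value;
        }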

Continuing scenario 1 from the High-Level Description:

(1) CREATE TABLE: Without WL#7813, InnoDB can't write the counter back to
the DD tables, so InnoDB just writes redo logs for the initial counter.
InnoDB needs to mark the dirty_status of this table as dirty and add the
table to the dirty tables list.

(2) INSERT: InnoDB updates the counter in memory and writes redo logs
accordingly (no new MLOG type is introduced, just a new type in
persistent_type_t), recording that the next counter value will be 45. The
dirty_status of this table is marked as METADATA_DIRTY, and the table is
added to the dirty_tables list if it isn't there already.

(3) COMMIT: The transaction undo log is marked as committed. If
innodb_flush_log_at_trx_commit = 1, the redo logs for the counters are
flushed and the change cannot be lost; otherwise, durability depends on
whether the logs have been flushed to disk.

(4) BEGIN; INSERT: Again, InnoDB writes a redo log, among other things, to
update the counter to 46.

(5) The same as (3) if there is a ROLLBACK or COMMIT; otherwise nothing to
do.

(6) The server crashes and restarts.
(6.1) InnoDB finds the latest checkpoint.
(6.2) InnoDB collects counters from the redo logs written since the
checkpoint and puts them into a map, where every table with a counter has
an entry.
(6.3) After scanning the redo logs and collecting the counters, InnoDB
applies the collected counters to the in-memory tables. At this step,
InnoDB opens the in-memory table and reads the counter from the
DDTableBuffer first, then applies the counter from the redo logs to the
table object.
(6.4) InnoDB rolls back any recovered transactions, rolling back the
insert of a=45. There is no rollback of the counter value, so the hole
(a=45) will not be reused.
(7) INSERT: InnoDB uses the recovered counter value of 46, writes redo
logs, etc., similar to step (2).
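
A minimal sketch of the collection step (6.2); the container and function
names are illustrative, not the actual recovery code:

        #include <cstdint>
        #include <map>

        typedef uint64_t table_id_t;

        /* Largest counter seen per table while scanning redo logs written
           since the last checkpoint. */
        static std::map<table_id_t, uint64_t> recovered_autoinc;

        /* Called for each parsed counter record. A logged value of 0 is a
           reset marker (see 'Writing redo logs' below) and overwrites
           whatever was collected so far for the table. */
        static void collect_autoinc(table_id_t table_id, uint64_t counter) {
                uint64_t& slot = recovered_autoinc[table_id];
                if (counter == 0 || counter > slot) {
                        slot = counter;
                }
        }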

The metadata of tables in METADATA_DIRTY status is written back to the
DDTableBuffer on every checkpoint, and the status is then marked as
METADATA_BUFFERED. Updating the persisted AUTO_INC counter in the
DDTableBuffer should be in-place in most cases, unless the counter needs
more bytes to store or we need to mark some index as corrupted at the same
time.
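
A sketch of this checkpoint-time write-back under the WL#7816 framework;
the list name and write_back_to_dd_table_buffer() are assumed names:

        /* Illustrative sketch: on checkpoint, persist every dirty counter
           to DDTableBuffer and downgrade the status to METADATA_BUFFERED. */
        static void persist_dirty_autoinc_on_checkpoint() {
                for (dict_table_t* table : dirty_dict_tables) {
                        if (table->dirty_status != METADATA_DIRTY) {
                                continue; /* already buffered */
                        }
                        mutex_enter(table->autoinc_persisted_mutex);
                        ib_uint64_t counter = table->autoinc_persisted;
                        mutex_exit(table->autoinc_persisted_mutex);

                        write_back_to_dd_table_buffer(table, counter);
                        table->dirty_status = METADATA_BUFFERED;
                }
        }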

Scenarios 2 and 3 follow similar steps. The big difference is that in
step (6), InnoDB doesn't need to collect counters from the redo logs,
since there was a checkpoint before the shutdown. So InnoDB gets the
latest counter from the DDTableBuffer at startup, and the hole will not be
reused either.

In the INSERT steps, we don't flush the redo logs immediately, for
performance reasons; flushing the redo logs would slow down the server
significantly. That's why we can't guarantee that no allocated counters
are reused in scenario 1.

In step (3), we don't force a flush on COMMIT/ROLLBACK, as we don't want
to change the behaviour implied by innodb_flush_log_at_trx_commit. This
doesn't affect row-based replication, since the redo logs for the counter
should always be flushed before the corresponding page redo logs.

In step (6.3), before we apply a counter from the redo logs to the
in-memory table object, we must compare the two counters, the one from the
redo logs and the one in the table object. Only if the counter from the
redo logs is bigger do we apply it to the table object. This is because we
don't write redo logs for the counter incrementally, and a larger counter
could have been written earlier.
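
A minimal sketch of the apply step (6.3), assuming the map collected in
step (6.2); the function name is illustrative:

        /* Illustrative sketch: apply a counter collected from the redo
           logs after the value from DDTableBuffer has been loaded into
           table->autoinc_persisted. Apply only if bigger, since a larger
           absolute value may have been logged earlier than a smaller one.
           table->autoinc itself is computed later, at open time. */
        static void apply_recovered_autoinc(dict_table_t* table,
                                            ib_uint64_t logged) {
                mutex_enter(table->autoinc_persisted_mutex);
                if (logged > table->autoinc_persisted) {
                        table->autoinc_persisted = logged;
                }
                mutex_exit(table->autoinc_persisted_mutex);
        }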

In ha_innobase::truncate() and ha_innobase::delete_table(), we just
remove the entry from the DDTableBuffer if it exists, as in the current
implementation. In ha_innobase::commit_inplace_alter_table()
(commit=true), if the table is rebuilt, we should write the metadata of
the newly created table back to the DDTableBuffer after we delete the
entry for the old table.


UPDATE
======

Currently we don't remember an updated counter that is larger than the
current counter. There are several reasons why we should remember the
largest counter:
1. To prevent the possible DUPLICATE KEY error after an update of the
counter column, which is a well-documented problem.
2. To keep InnoDB's behaviour self-consistent. Currently we don't remember
the updated max counter and don't allocate from it, but the server would
allocate counters from the max counter after a restart. We should make the
two cases behave the same.
3. To keep InnoDB's behaviour the same as MyISAM's. MyISAM remembers the
max counter after an update.
4. It would be easier to convert FTS_DOC_ID to the autoinc datatype if the
ID is not reused.

Now we introduce FR8, so the biggest updated counter will always be
remembered. We remember it in the same way as described above, writing
redo logs within the existing mtr when updating the clustered index. We
only write redo logs for an UPDATE when the counter is updated to a bigger
value. Besides, we have to update the in-memory counter accordingly, as
sketched below.

INSERT isn't affected by this: if a larger counter is inserted, the
in-memory counter is updated accordingly.
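
To illustrate, here is a sketch of the UPDATE-path handling. It uses the
hypothetical get_autoinc_from_entry() reader sketched earlier;
row_log_autoinc_counter() is the function this worklog introduces (its
guard is sketched in the Low-Level Design below), and the
dict_table_autoinc_* calls are the existing counter-mutex helpers:

        /* Illustrative sketch: on UPDATE of the clustered index, persist
           and remember the counter only when it grows (FR8). */
        static void update_autoinc_if_greater(dict_table_t* table,
                                              const dtuple_t* row,
                                              mtr_t* mtr) {
                ib_uint64_t counter =
                        get_autoinc_from_entry(row, table->autoinc_index);
                if (counter == 0) {
                        return; /* no counter in this row */
                }

                /* Logs within the existing mtr, only if bigger than
                   table->autoinc_persisted. */
                row_log_autoinc_counter(table, counter, mtr);

                /* Also raise the allocation counter so the next INSERT
                   allocates past the updated value. */
                dict_table_autoinc_lock(table);
                dict_table_autoinc_update_if_greater(table, counter + 1);
                dict_table_autoinc_unlock(table);
        }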


ALTER TABLE
===========

Since the behaviours of ALGORITHM=INPLACE and ALGORITHM=COPY are expected
to be the same, we used to have to scan the index containing the counter
to find the max counter for INPLACE operations. Now that we always
remember the biggest counter on UPDATE/INSERT, we don't need to scan here
any more.

For ALTER TABLE with ALGORITHM=COPY, the rows of the table are written
one by one, so redo logs for the counter would be written over and over
again if the table contains lots of rows. This can be prevented by
skipping the redo logs for tables with a temporary table name. The real
counter can be set after the new table is renamed from its temporary name,
at which point we write the redo log for the new (non-temporary) table's
counter. We should also persist the counter in the DDTableBuffer, so that
we don't miss any previous updates to this table. The old entry in the
DDTableBuffer, if it exists, is deleted when the old table is removed.
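
A minimal sketch of the temporary-name check; the '#sql' prefix is an
assumption about how the intermediate table is named:

        #include <cstring>

        /* Illustrative sketch: skip counter logging for the intermediate
           table of ALTER TABLE ... ALGORITHM=COPY, identified here by its
           name (in 'db/table' form) starting with "#sql". */
        static bool should_log_autoinc(const char* table_name) {
                const char* name = strchr(table_name, '/');
                name = (name != nullptr) ? name + 1 : table_name;
                return strncmp(name, "#sql", 4) != 0;
        }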

We can also issue 'ALTER TABLE t AUTO_INCREMENT = N' to reset the counter
of a table. One existing restriction (feature) here is that we can only
reset the counter to a bigger value. These features work as before, and we
log the new counter whenever it changes.


IMPORT TABLESPACE
=================

Counters in the tablespace have already been recorded in the config file,
and we can read them from there directly and start from them, as long as
the config file is not missing.

But if the config file is missing, we have no way to get the correct
counter other than scanning the index on the autoinc column for the max
value. This is the only case in which the SELECT MAX(c) scan can't be
skipped entirely.
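
A sketch of the IMPORT-time fallback; scan_index_for_max_autoinc() is a
hypothetical helper standing in for the real B-tree scan:

        /* Illustrative sketch: during IMPORT TABLESPACE, prefer the
           counter recorded in the .cfg file; scan the index only when the
           file is missing. */
        static ib_uint64_t import_get_autoinc(dict_table_t* table,
                                              bool cfg_found,
                                              ib_uint64_t cfg_counter) {
                if (cfg_found) {
                        return cfg_counter; /* value from the config file */
                }
                ib_uint64_t max_value = 0;
                scan_index_for_max_autoinc(table, table->autoinc_index,
                                           &max_value);
                return max_value + 1; /* next counter to allocate */
        }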

The framework and mechanism of this worklog are fairly similar to
WL#7816, except for the following modifications:

1. In-memory table

We introduce 3 new members in dict_table_t: autoinc_persisted,
autoinc_persisted_mutex and autoinc_index.

        /** Mutex protecting the persisted autoincrement counter. */
        ib_mutex_t*                             autoinc_persisted_mutex;

        /** Autoinc counter value that has been persisted in redo logs or
        DDTableBuffer. It's mainly used when we want to write the counter
        back to DDTableBuffer.
        This is different from the 'autoinc' above, which could be bigger
        than this one, because 'autoinc' is updated right after some
        counters are allocated, while this counter is only updated after
        those counters have been written to the redo logs. Once all
        allocated counters have been written to redo logs, 'autoinc'
        should be exactly the next counter after this persisted one.
        We want this counter because when we need to write the counter
        back to DDTableBuffer, we had better keep it consistent with the
        counter that has been written to redo logs. Besides, we can't
        easily read 'autoinc' directly, because the autoinc_lock is
        required and there could be a deadlock.
        This variable is protected by autoinc_persisted_mutex. */
        ib_uint64_t                             autoinc_persisted;

        /** The position of the autoinc counter field in the clustered
        index. This is set on CREATE/ALTER/OPEN TABLE and IMPORT
        TABLESPACE, and used in modifications of the clustered index, such
        as INSERT/UPDATE. There is no conflicting concurrent access, so no
        protection is needed. */
        ulint                                   autoinc_index;


2. Persistent classes

AutoIncPersister is a new class implementing Persister; it writes and
reads the autoinc counter for a table. In PersistentTableMetadata, we can
set and get the persisted autoinc counter.
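
A skeletal sketch of the new class under the WL#7816 Persister framework;
the method signatures here are simplified assumptions:

        /* Illustrative skeleton; Persister and PersistentTableMetadata
           come from WL#7816. */
        class AutoIncPersister : public Persister {
        public:
                /* Serialize the autoinc counter held in the metadata into
                   a log record buffer; return the bytes written. */
                ulint write(const PersistentTableMetadata& metadata,
                            byte* buffer, ulint size) const;

                /* Parse an autoinc counter from a log record and store it
                   into the metadata; return the bytes consumed. */
                ulint read(PersistentTableMetadata& metadata,
                           const byte* buffer, ulint size,
                           bool* corrupt) const;
        };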

3. Writing redo logs

If the counter exists in the table, we write redo logs for it when:
1) some DDL modifies the counter;
2) entries are inserted into the clustered index;
3) entries in the clustered index are updated and the counter is updated
to a higher value.

3.1 DDLs ('ALTER TABLE', 'RENAME TABLE' and 'TRUNCATE TABLE')
We have to handle 'ALTER TABLE', 'RENAME TABLE' and 'TRUNCATE TABLE'
specially. There are several different cases:
a) Reset the auto_increment to 0 (row_truncate_table_for_mysql())
b) Reset it to some smaller value other than 0
(ha_innobase::commit_inplace_alter_table())
c) Reset it to a bigger or equal value
(ha_innobase::commit_inplace_alter_table() or
row_rename_table_for_mysql())

For a), we just set it to 0.
For b), if we find the counter is about to be set to a value smaller than
table::autoinc_persisted, we check what the biggest counter in the table
is, in commit_get_autoinc(), so that we don't set the counter to a value
smaller than the existing biggest counter. This behaviour is kept as-is,
and the tree search is necessary. For example, if autoinc_persisted is 10
and the biggest counter in the table is 8 because rows 9 and 10 have been
deleted, then resetting the counter to 8 or 9 is allowed, but any value
less than 8 is prohibited and we use 8 instead.
For c), we can set it directly, without any search.

We introduce dict_table_set_and_persist_autoinc() to write the logs. When
we apply the counters found in the redo logs to the in-memory table, we
basically compare the counter in the redo log with the one in the
in-memory table, and apply it only if it is bigger. This would make
resetting the counter to a smaller value impossible. So, to reset the
counter to a smaller value, we have to write the counter to the
DDTableBuffer first, and then write a redo log with a counter value of 0,
indicating that any counter recovered before this point should be
discarded. This way, the counter read from the DDTableBuffer, which could
be smaller than the discarded one, is not overwritten.
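
A sketch of that reset sequence; dict_table_set_and_persist_autoinc() is
introduced by this worklog, while write_back_to_dd_table_buffer() is the
assumed helper used in the checkpoint sketch above:

        /* Illustrative sketch of resetting the counter to a smaller
           value: persist the new value to DDTableBuffer first, then log 0
           as a 'discard what was collected so far' marker for recovery. */
        static void set_and_persist_smaller_autoinc(dict_table_t* table,
                                                    ib_uint64_t value,
                                                    mtr_t* mtr) {
                mutex_enter(table->autoinc_persisted_mutex);
                table->autoinc_persisted = value;
                mutex_exit(table->autoinc_persisted_mutex);

                /* Synchronous write-back, so that recovery can trust the
                   value in DDTableBuffer. */
                write_back_to_dd_table_buffer(table, value);

                /* The 0 marker tells recovery to drop counters collected
                   from earlier redo records for this table. */
                row_log_autoinc_counter(table, 0, mtr);
        }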

In row_rename_table_for_mysql(), we only handle the case where a
temporary table is renamed to a non-temporary one, which happens during
'ALTER TABLE ... ALGORITHM=COPY'. Since we don't write redo logs for every
entry inserted while copying data from the old table, we have to log the
counter at the end.

3.2 DMLs such as INSERT and UPDATE
We re-use the existing mtr to do the logging. So in
row_ins_clust_index_entry_low(), we write the log if the insertion
succeeds, or if there is a duplicate key error, since the counter could
still have been advanced. In row_upd_clust_rec(), we check whether the
update makes the counter bigger, and if so the counter is logged.

row_log_autoinc_counter() is introduced to write the redo logs. We only
write logs when the counter is 0 or bigger than table::autoinc_persisted.
We read the table::dirty_status bit directly, without a mutex, for better
performance; this is safe, as explained in the code comments.
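
A sketch of that guard; only the function name comes from this worklog,
the body and the helpers are illustrative assumptions:

        /* Illustrative sketch: write a redo record only for the reset
           marker (0) or for a counter that grows. */
        static void row_log_autoinc_counter(dict_table_t* table,
                                            ib_uint64_t counter,
                                            mtr_t* mtr) {
                mutex_enter(table->autoinc_persisted_mutex);
                bool worth_logging =
                        (counter == 0
                         || counter > table->autoinc_persisted);
                mutex_exit(table->autoinc_persisted_mutex);
                if (!worth_logging) {
                        return;
                }

                write_autoinc_redo_record(table->id, counter, mtr);

                if (counter != 0) {
                        /* The 0 marker doesn't change the persisted
                           value, which the DDL path has already set. */
                        mutex_enter(table->autoinc_persisted_mutex);
                        table->autoinc_persisted = counter;
                        mutex_exit(table->autoinc_persisted_mutex);
                }

                /* Reading dirty_status without a mutex is safe here: a
                   stale read only causes a redundant re-mark, never a
                   missed write-back. */
                if (table->dirty_status != METADATA_DIRTY) {
                        mark_table_metadata_dirty(table); /* assumed */
                }
        }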

There are new functions such as row_get_autoinc_counter(),
row_parse_int_from_field() and row_parse_int() to read the counter from an
entry.

4. Open

During open, we calculate the next counter if the current table::autoinc
is zero or equal to table::autoinc_persisted, which means we have just
recovered the counter from the redo logs and the DDTableBuffer.
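
A minimal sketch of this initialization; next_value() stands in for the
real auto_increment_offset/auto_increment_increment aware computation:

        /* Illustrative sketch: initialize the in-memory allocation
           counter from the recovered persisted counter at open time. */
        static void init_autoinc_on_open(dict_table_t* table) {
                dict_table_autoinc_lock(table);
                if (table->autoinc == 0
                    || table->autoinc == table->autoinc_persisted) {
                        /* Freshly recovered: the next value to allocate
                           comes right after the persisted one. */
                        table->autoinc =
                                next_value(table->autoinc_persisted);
                }
                dict_table_autoinc_unlock(table);
        }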

5. Update table::autoinc

In update_row(), we have to check whether there is a counter and whether
it is being updated to a bigger value; if so, we update table::autoinc to
the bigger value. This is new behaviour introduced by this worklog.