WL#6204: InnoDB persistent max value for autoinc columns
Affects: Server-8.0 — Status: Complete — Priority: Medium
See BUG#199 on MySQL bugs. Currently InnoDB does the following when a table is opened: SELECT MAX(c) FROM t; where c is the AUTOINC column name. This step is used to initialise the column's next autoinc value and allocation of autoinc values starts from this point. InnoDB also does this when it executes 'ALTER TABLE AUTO_INCREMENT=N;'. Suppose there is a table like this: (1) CREATE TABLE t(a BIGINT PRIMARY KEY AUTO_INCREMENT) ENGINE=InnoDB AUTO_INCREMENT=42; (2) BEGIN; INSERT INTO t VALUES(0),(0),(0); -- inserts 42,43,44 (3) COMMIT; -- let us assume that this will make a checkpoint (4) BEGIN; INSERT INTO t VALUES(0); -- 45; Then there would be some different scenarios here: Scenario 1: (5) -- do nothing here (6) -- kill server and restart Scenario 2: (5) ROLLBACK; (6) -- shutdown the server normally then restart, or crash and restart Scenario 3: (5) COMMIT; -- let's assume this would flush the redo logs (6) -- shutdown the server normally then restart, or crash and restart After server restarts in all these scenaiors, we do the following: (7) INSERT INTO t VALUES(0); -- should insert 46 (8) SELECT * FROM t; -- should return 42,43,44,46 In scenario 1 and 2, the INSERT at step#7 would currently start the allocation from 45(ignoring the offset setting), not from whatever was the last value from step#4. Only in scenario 3, it would start from 46 and that's what we expect. InnoDB should keep track of the maximum value and on restart preserve that max value and start from there. InnoDB should at least make sure the result from scenario 2 is the same with scenario 3, because there is no crash. And InnoDB should try to make the result from scenario 1 the same as scenario 3. InnoDB can't guarantee the later because we should consider the performance issue, which would be mentioned later. So we plan to implement following features in this worklog: 1. The AUTOINC counters would get persisted through redo logs 2. All AUTOINC counters would be collected from redo logs and applied to in-memory counters if they're bigger. 3. There won't be any rollback of AUTOINC counters. 4. 'SELECT MAX(c) FROM t' would not be needed except when IMPORT tablespace. 5. The largest updated counter will be logged and will not change with reboot. 6. Considering performance, we could write some extra redo logs, but no new MTRs.
FR1: Counters of AUTO_INCREMENT should be strictly incremental, as it works now, when the server is running. Rollback of transaction won't revert the counter. 'ALTER TABLE' would not change the counter to a smaller value too. FR2: Allocated counters won't be reused after the server restart normally, this must be guaranteed. FR3: Allocated counters won't be reused after the server restart from a crash, this can NOT be guaranteed. FR4: Redo logs for AUTO_INCREMENT counters would be written once they're changed. During recovery, InnoDB will scan the redo logs to collect counter changes and apply them to in-memory table object. There is no need to flush the redo logs immediately. FR5: Once the counter of a table gets changed, it will be written back to the 'DDTableBuffer' on checkpoint. FR6: This worklog only write counters to redo logs and DDTableBuffer. FR7: SELECT max(counter) is not needed neither at initialization nor during 'ALTER TABLE'. But it should be done if we want to import a tablespace without a correct config file, in this case, we could only scan the index to get current max counter. FR8: There is a behaviour change that if we update the counter to some larger value than current next value, we will remember the larger value and allocate counter from it next time. This one has several advantages, which will be mentioned later. NFR1: We should guarantee that there won't be significant regression after this worklog. We need to acquire new rw-locks, write redo logs(currently no need to flush immediately) when we want to update the in-memory counter, but these should not affect the performance too much. We should not introduce new mtr in this worklog since mtr_commit is time consuming.
The word 'counter' would be used for short to represent the 'auto_increment counter'. How to persist ============== First of all, considering the performance, we should not introduce a new mtr to write redo logs. mtr_commit would be time consuming and affect performance. So we will write redo logs for the counter by the mtr started in modification of clustered index(both insert and update). This means we should log counter for every row, but not in batch. In fact, the counter is logged in existing page redo logs, when we insert the row to clustered index and(or) the counter to related indexes. Theoretically, we can re-use these page redo logs for our purpose of persisting counter, besides, we can persist the counter like the way implemented in WL#7816. Let's compare how these 2 approaches work. 1. The redo log approach like WL#7816 By this way, for every new PM_TABLE_AUTO_INC log introduced for counter, we have to log the metadata type, table id and the counter, it consumes us 1 + 5(should be 1-5) + 1 + 5(should be 1-5) = 12 bytes(at most, in most cases). This log is required when we have to update the counter, Either before INSERT statement or before some UPDATE which introduces a bigger counter than current one. This would be easier and clearer. 2. Re-use existing page redo logs If we want to use existing page redo logs, we have to add some more info to current redo logs. Since there is no table id there, we have to log it additionally, so that during recovery, we can know which counter belonges to which table. Besides, we don't know which field in the logged record is the counter, we have to record this in one byte. This would affect both cluster index and secondary index, and maybe affect redo logs for tables without AUTO_INC. In current recovery, we start recovery from checkpoint first and then boot the dict system, so we can't get the table information before scanning redo logs, which leads to the impossibility that we can learn the field information from table definition. Considering these, it would cost us extra 5(should be 1-5) + 1 = 6 bytes(at most, in most cases) for INSERT redo logs on cluster index. But I guess the byte indicating if it's a cluster is a must for every index. Besides, we have many insert logs for copying entries from old page to new page, the extra byte is needed too. So approach 1 could be a better choice. This design will be based on it. Persist by redo logs ==================== We will use the redo logs approach for persisting the auto_inc counter as what we do in WL#7816. The counter columns would be maintained by InnoDB in main memory. However, every changes of the counter would be written to the redo logs. There are 3 steps for the persisting: 1. Every time we write an insert log(maybe an update log) for clustered index, we will write current updated counter to redo log. We will log the absolute value instead of the increment. 2. On every checkpoint, we will write all in-memory changed counters to the DDTableBuffer if they haven't. 3. We will finally write these counters back to DD tables when: 3.1 EXPORT TABLESPACE 3.2 Slow shutdown Once again, step 3 is not included in this worklog. Since this worklog is based on the mechanism introduced by WL#7816, we will have the same stategy of writing redo logs, marking dict_table_t::dirty_status, adding the table to dirty tables list, writing dirty tables back to DDTableBuffer, reading and analyzing redo logs during recovery and finally applying redo logs to in-memory table objects. Some exceptions would be mentioned later. Since we will write redo logs in low-level functions, we have to remember the index of the counter field in the clustered index. We will remember the index of field when we create the table, and get it updated when 'ALTER TABLE' if the sequence changes. After we IMPORT a tablespace, we have to set it again. We can get the index value from server in these cases. On INSERT/UPDATE, we only read the index value directly and retrieve the corresponding counter value from record of clustered index. Continuing the scenario 1 in High-Level Description: (1) CREATE TABLE: Without WL#7813, InnoDB can't write back the counter to DD tables, so InnoDB just writes redo logs for the initial counter in InnoDB. InnoDB needs to mark dirty_status of this table as dirty and add this table to dirty tables list. (2) INSERT: InnoDB updates the counter in memory, and write redo logs accordingly(no MLOG type will be introduced, by just some type in persistent_type_t), to signal that the next counter value will be 45. The dirty_status of this table would be marked as METADATA_DIRTY, and added to the dirty_tables list if it hasn't. (3) COMMIT: The transaction undo log is marked as committed. If innodb_flush_log_at_trx_commit = 1, then redo logs for counters will be flushed and no going back again, otherwise, it depends on if the logs are flushed to disk. (4) BEGIN; INSERT: Again, InnoDB writes a redo log for it to update the counter to 46, among other things. (5) The same as (3) if there is a ROLLBACK or COMMIT, otherwise nothing to do. (6) The server crashes and restarts (6.1) InnoDB finds the latest checkpoint. (6.2) InnoDB will start to collect counters from redo logs and put them into a map, where every table with a counter has an entry there, since the checkpoint. (6.3) After scanning and collecting redo logs, InnoDB will apply the collected counters to in-memory tables. At this step, InnoDB will open the in-memory table and read counters from DDTableBuffer first, then apply the counters from redo logs to the table object. (6.4) InnoDB rolls back any recovered transactions, rolling back the insert to a=45. There is no rollback of the counter value, the hole (a=45) will not be reused. (7) INSERT: InnoDB uses the recovered counter value of 46, and write redo logs, etc. similar to step (2). The metatdata of tables in METADATA_DIRTY status would be written back to DDTableBuffer on every checkpoint and the status would be marked as METADATA_BUFFERED. Updating the persisted AUTO_INC counter in DDTableBuffer should be in-place in most cases, unless the counter needs more bytes to store or we need to mark some index as corrupted at the mean time. In scenario 2 and 3, we should have similar steps. The big difference is that in step (6), InnoDB don't need to collect counters from redo logs since there was a checkpoint before shutdown. So InnoDB will get the latest counter from DDTableBuffer when startup, and the hole will not be reused as well. In steps of INSERT, we don't flush the redo logs immediately due to performance. Flush redo logs will slow down the server significantly. That's why we can't guarantee that in scenario 1, no allocated counters would be reused. In step 3, we don't force a flush on COMMIT/ROLLBACK, as we don't want to change the behaviour of what innodb_flush_log_at_trx_commit indicates. This wouldn't affect row-based replication, since the redo logs of counter should always flush before page redo logs. In step 6.3, before we apply counters from redo logs to in-memory table object, we should compare the two counters, one from redo logs and one from table object. Only if the counter from redo logs is bigger, we need to apply it to table object. It's due to we don't write redo logs for counter incrementally and larger counter could be written eariler. In ha_innobase::truncate() and ha_innobase::delete_table(), we just remove the entry in DDTableBuffer if it exists as current implementation. In ha_innobase::commit_inplace_alter_table(commit=true), if it's rebuilt, we should write-back the metadata of the newly created table to DDTableBuffer after we delete the entry for old table. UPDATE ====== Currently we don't remember the updated counter which is larger than current counter. There are several reasons we should remember the largest counter: 1. To prevent the possible DUPLICATE KEY error after update on counters, which is documented well. 2. To keep behaviour consistency in InnoDB itself. Current behaviour is that we don't remember the update max counter and don't allocate from it. But server would allocate counters from the max counter if it restarts. We should make them the same. 3. To keep behaviour in InnoDB the same with MyISAM. MyISAM will remember the max counter after update. 4. It would be easier to transform FTS_DOC_ID to autoinc datatype if the ID is not reused. Now we want to introduce FR8, so the updated biggest counter would always be remembered. We will remember the biggest counter in the same way described above, writing redo logs within the existing mtr when updating the clustered index. We should only write redo logs for UPDATE when the counter is updated to bigger. Besides, we have to update the in-memory counter accordingly. INSERT won't be affected by this. If a larger counter is inserted, the in-memory counter would be updated accordingly. ALTER TABLE =========== Since the behaviours for both ALGORITHM = INPLACE and COPY are expected to be the same, we have to scan the index where the counter is to get the max counter for INPLACE operations. Now that we will always remember the biggest counter in UPDATE/INSERT, we don't need to scan here any more. For the ALTER TABLE with ALGORITHM=COPY, rows of the table would be written one by one. Thus redo logs for counter would be written time and time again if the table contains lots of rows. This can be prevented by skipping redo logs for those tables with temporary table name. The real counter can be set after the new table is renamed from temporary table, and we have to write redo log for the new table(non-temporary one)'s counter. We should make the counter persisted in DDTableBuffer, so that we won't miss any previous updates in this table. The old entry in DDTableBuffer if exists would be deleted when the old table is removed. Also we can issue 'ALTER TABLE t AUTO_INCREMENT = N' to reset the counter of a table. One existing restriction(feature) here is we must only reset the counter to a bigger one. These features should work as is and we will log the new counter as long as it gets changed. IMPORT TABLESPACE ================= Counters in the tablespace have already been remembered in the config file, and we should be able to read them there directly and start from it, if the config file was not missing. But if the config file was missing, we have no way to get the correct counter, apart from scanning the index on the autoinc column and get the max value. This is why we can skip the SELECT MAX(c) thoroughly.
The framework and mechanism of this worklog is fairly similar to WL#7816, except following modifications: 1. In-memory table We introduced 3 variables: autoinc_persisted, autoinc_persisted_mutex and autoinc_index in dict_table_t. /** Mutex protecting the persisted autoincrement counter. */ ib_mutex_t* autoinc_persisted_mutex; /** Autoinc counter value that has been persisted in redo logs or DDTableBuffer. It's mainly used when we want to write counter back to DDTableBuffer. This is different from the 'autoinc' above, which could be bigger than this one, because 'autoinc' will get updated right after some counters are allocated, but we will write the counter to redo logs and update this counter later. Once all allocated counters have been written to redo logs, 'autoinc' should be exact the next counter of this persisted one. We want this counter because when we need to write the counter back to DDTableBuffer, we had better keep it consistency with the counter that has been written to redo logs. Besides, we can't read the 'autoinc' directly easily, because the autoinc_lock is required and there could be a deadlock. This variable is protected by autoinc_persisted_mutex. */ ib_uint64_t autoinc_persisted; /** The index of autoinc counter field in clustered index. This would be set when CREATE/ALTER/OPEN TABLE and IMPORT TABLESPACE, and used in modifications to clustered index, such as INSERT/UPDATE. There should be no conflict to access it, so no protection is needed. */ ulint autoinc_index; 2. Persistent classes AutoIncPersister is a new class implements Persister, which writes and reads the autoinc counter for a table. In PersistentTableMetadata, we can set&get the persisted autoinc counter. 3. Writing redo logs If counter exists in the table, we write redo logs of counters when: 1) Some DDL that will modify the counter 2) Insert entries into clustered index 3) Update entries on clustered index if the counter is updated to higher one 3.1 DDLs('ALTER TABLE' and 'TRUNCATE TABLE') We have to handle 'ALTER TABLE', 'RENAME TABLE' and 'TRUNCATE TABLE' specially. There are several different cases: a) Reset the auto_increment to 0(row_truncate_table_for_mysql) b) Reset to some smaller one, other than 0(ha_innobase::commit_inplace_alter_table()) c) Reset to bigger or equal one(ha_innobase::commit_inplace_alter_table() or row_rename_table_for_mysql()) For a), we will just set it to 0. For b), if we find it's going to be set to smaller than the table::autoinc_persisted, we should check what's that biggest counter in the table in commit_get_autoinc(), so that we won't set the counter to one that's smaller than existing biggest counter. This is kept as is and the tree search is necessary. For example, the autoinc_persisted is 10, and the biggest counter in table is 8 since we have deleted 9 and 10, reseting the counter to 8 or 9 is allowable, but any value less than 8 is prohibited and we will use 8 instead. For c), we should be able to set it directly, without any search. We introduce dict_table_set_and_persist_autoinc() to write the logs. When we try to apply the counters found in redo logs to in-memory table, we would basically compare the counter in redo log with the one in in-memory table, if the former is bigger than we apply the counter. This would make resetting the counter to smaller one impossible. So one solution is to reset the counter smaller, we have to write the counter to DDTableBuffer first, and then write a redo log with counter of value 0 to indicate that if we found counter of value 0 during recovery, the recovered counter before this point should be discarded, so that the counter read from DDTableBuffer, which could be smaller than the discarded one, would not be over-written. In row_rename_table_for_mysql(), we just handle the case that a temp table is renamed to a non-temp one, which happens during 'ALTER TALBE ... ALGORITHM = COPY'. Since we don't write redo logs for every inserted entry during copy data from old table, we have to mark the counter at last. 3.2 DMLs like INSERT and UPDATE We will re-use the existing mtr to do the logging. So in row_ins_clust_index_entry_low(), we will write log if insertion succeeds, or there is a duplicated key since the counter could still be set to bigger. In row_upd_clust_rec(), we will check if the update to counter is bigger, if so the counter should be logged. row_log_autoinc_counter() is introduced to write the redo logs. We only write logs when counter is 0 or is bigger than table::autoinc_persisted. We will read the table::dirty_status bit directly without mutex to get better performance, since it should be safe as commented. There're new functions like row_get_autoinc_counter(), row_parse_int_from_field(), row_parse_int(), etc. to read the counter from entry. 4. Open During open, we will calculate the next counter if current table::autoinc is zero, or equal to table::autoinc_persisted, which means we just recover the counter from redo log and DDTableBuffer. 5. Update table::autoinc In update_row(), we have to check if there is a counter and it's to be updated to a bigger one, if so, we will update the table::autoinc to the bigger one, this is new behaviour in this worklog.
Copyright (c) 2000, 2018, Oracle Corporation and/or its affiliates. All rights reserved.