WL#5526: InnoDB: Online ADD INDEX
Affects: Server-Prototype Only — Status: Complete — Priority: Medium
Currently, InnoDB supports fast ADD INDEX and DROP INDEX: indexes are created or dropped without copying the contents of the entire table. See http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-create-index-overview.html for details.

This worklog tracks online ADD INDEX and DROP INDEX.

Limitations of the current implementation in the MySQL 5.1 InnoDB Plugin and MySQL 5.5:
• Does not copy history (old versions of records) → only transactions that started after CREATE INDEX may use the index
• Does not allow concurrent writes to the table (but allows reads)

This feature will remove only the latter limitation, that is, allow the table to be modified while secondary index(es) are being created, by creating a temporary log file. The size of this log file is a multiple of innodb_sort_buffer_size, and it can be limited by a new parameter, innodb_online_alter_log_max_size.

The following limitations will remain for now:
• CREATE PRIMARY KEY and CREATE FULLTEXT INDEX will lock the table (no online operation)
• ADD COLUMN and DROP COLUMN require a table copy until WL#5854
• DROP FOREIGN KEY, RENAME INDEX, RENAME COLUMN will be in WL#5545
== Replacing the ALTER TABLE API (WL#5534, WL#5548) ==

Replace add_index(), prepare_drop_index() and final_drop_index() with the generic DDL methods implemented in WL#5534. This allows us to support things like ADD INDEX i, DROP INDEX i without problems.

If a successful index creation needs to be rolled back because the meta-data lock upgrade failed in WL#5534, we cannot immediately drop the index(es), because the table will typically be in use by other threads (this is what timed out the MDL upgrade in the first place). Instead, we mark the index as ‘aborted’ and will attempt to drop it when we get the chance (reference count has dropped to zero, or the index is being evicted from the data dictionary cache).

== Online index creation ==

The algorithm:
(1) Create temporary file(s) for index change log(s)
  • one file for all indexes being created, or one log file per index
  • a logical variant of the InnoDB change (insert) buffer
  • record format: (operation, index tuple)
  • operation: INSERT, DELETE, PURGE
  • written by row_log_online_op()
  • written by DML and purge, just like the InnoDB change buffer
  • an update is logged as (INSERT,new_value), (DELETE,old_value)
(2) Scan the clustered index of the table
  • For every index being created, write an index entry to a merge sort file
(3) Merge sort the buffers (one for every index being created)
(4) Insert the sorted entries to the new index B-trees
(5) Start applying the change logs to the index B-trees: row_log_apply()
(6) X-lock, sync and delete the logs, update SYS_INDEXES
(7) Upgrade the meta-data lock, do commit_inplace_alter_table(commit=true), rename the .frm and the SYS_INDEXES

Steps (1), (5), and (6) are modifications to the current algorithm.

Mark Callaghan at Facebook describes this algorithm in his blog post about an application-level online schema change implementation: http://www.facebook.com/note.php?note_id=430801045932

There is an anomaly.
Between steps (6) and (7), a DML operation can fail due to a uniqueness violation in a created UNIQUE INDEX. The index exists in InnoDB at that point, but not in MySQL. Thus, we cannot report the index name to MySQL. It is somewhat wrong to report a duplicate key for DML before the DDL has fully finished. Other types of uniqueness violation would be reported to the DDL thread.

For ‘aborted’ (lazily dropped) indexes, DML threads will invoke row_log_online_op() as if the index were still being created online. That function would do nothing, returning ‘it was buffered’ to the DML thread. If online index creation completes, the function would return ‘it was not buffered’, and the DML thread would insert to the B-tree as usual.

Online index creation can also be aborted when the log file written by row_log_online_op() exceeds innodb_online_alter_log_max_size (a new parameter). In this case, DML threads will continue business as usual, and the DDL operation will fail.

== Memory usage ==

Index creation uses two large data structures for every index that is being created:

(a) Buffers for merge sort.
* There is no limit for the temporary files, other than what is imposed by the file system (parameter 'tmpdir').
* 3 in-memory buffers will be allocated for the merge sort, per ALTER TABLE or CREATE INDEX statement. The buffer size in bytes is 'innodb_sort_buffer_size'.
* Fulltext index creation uses a different configuration, due to parallel sort ('innodb_ft_sort_pll_degree').

(b) Logical change buffer (for online index creation). This is analogous to the buffer-pool-based change buffer. Instead of buffering (space_id,page_no,operation,record) for secondary index leaf pages, it records (operation,record).
* Two memory buffers of 'innodb_sort_buffer_size' will be allocated for writing and reading the log file. When a log block is about to be written to a temporary file, the file size will be checked against 'innodb_online_alter_log_max_size' (in bytes).
If the maximum file size would be exceeded, the index creation will fail with DB_ONLINE_LOG_TOO_BIG.
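
The size check described above can be illustrated with a small sketch. This is not the InnoDB code: OnlineLog, append_block() and the LogErr enum are hypothetical names chosen for illustration. Only innodb_sort_buffer_size, innodb_online_alter_log_max_size and DB_ONLINE_LOG_TOO_BIG come from this worklog, and the sketch assumes that the check is made once per block, before the block is written.

```cpp
#include <cstdint>

// Hypothetical result codes; only DB_ONLINE_LOG_TOO_BIG is named in the worklog.
enum class LogErr { DB_SUCCESS, DB_ONLINE_LOG_TOO_BIG };

// Toy model of the temporary modification log of one online index build.
struct OnlineLog {
    std::uint64_t block_size;  // stands in for innodb_sort_buffer_size
    std::uint64_t max_size;    // stands in for innodb_online_alter_log_max_size
    std::uint64_t file_size;   // bytes already written to the temporary file
    bool too_big;              // set once the limit would have been exceeded

    OnlineLog(std::uint64_t block, std::uint64_t max)
        : block_size(block), max_size(max), file_size(0), too_big(false) {}

    // Called when a full in-memory block is about to be written to the file.
    LogErr append_block() {
        if (too_big || file_size + block_size > max_size) {
            // The block is not written; the flag records that the DDL
            // operation has to fail.
            too_big = true;
            return LogErr::DB_ONLINE_LOG_TOO_BIG;
        }
        file_size += block_size;
        return LogErr::DB_SUCCESS;
    }
};
```

With a 1024-byte block and a 2048-byte limit, two blocks fit and the third attempt reports DB_ONLINE_LOG_TOO_BIG without growing the file.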
InnoDB: WL#5526 online ADD INDEX, WL#5548 online DROP INDEX

=== First part: Replace the old ALTER TABLE API (WL#5534).

Replace add_index(), prepare_drop_index() and final_drop_index() with the generic DDL methods implemented in WL#5534. This allows us to support things like ADD INDEX i(i), DROP INDEX i without problems.

The new methods are roughly used as follows:

ha_innobase::check_if_supported_inplace_alter_table(): Check if the operation is at all supported, and what kind of lock is needed. Creating a PRIMARY KEY or a FULLTEXT INDEX requires the table to be X-locked. All other supported operations can be done 'online', allowing concurrent modification of the table.

ha_innobase::prepare_inplace_alter_table(): For index creation, start a dictionary transaction and create the data dictionary objects. For all operations, check if it is allowed.

ha_innobase::inplace_alter_table(): Create any indexes, if index creation was requested. Otherwise, do nothing. This method will be invoked for both online and offline index creation. If the operation is online, allocate a modification log for the secondary index(es) that will be created. If the operation is offline for any reason, do not allocate modification logs. This method will also notice and report any duplicate key errors.

ha_innobase::commit_inplace_alter_table(): Commit or roll back the operation. Rollback may be initiated after a failed operation, or after a successful operation when MySQL fails to upgrade the meta-data lock for rewriting the .frm file. If commit=true, drop the indexes that were requested to be dropped (WL#5548). Before dropping, rename the indexes to start with "\377" so that crash recovery will flag them as incomplete indexes and drop them.

If a successful index creation needs to be rolled back, we cannot immediately drop the index(es), because the table will typically be in use by other threads (this is what timed out the MDL upgrade in the first place).
Instead, we mark the index as 'aborted' or 'zombie' and will attempt to drop it when we get the chance (reference count has dropped to zero, or the index is being evicted from the data dictionary cache).

innobase_alter_table_flags(), check_column_being_renamed(), column_is_being_renamed(), foreign_key_column_is_being_renamed(): Remove. Remove the column rename changes from ha_innobase::check_if_incompatible_data(). All these checks will be performed as part of check_if_supported_inplace_alter_table().

add_index(), final_add_index(), prepare_drop_index(), final_drop_index(): Remove. These are replaced by prepare_inplace_alter_table(), inplace_alter_table(), and commit_inplace_alter_table().

innodb_online_alter_log_max_size (srv_online_max_size): New parameter for limiting the amount of modification log that is allowed to accumulate during online index creation. Add DB_ONLINE_LOG_TOO_BIG to enum db_err for exceeding this.

=== Second part: online index creation (WL#5534).

The algorithm:
(1) row_log_allocate() creates a temporary file for each index being created
  * a logical variant of the InnoDB change (insert) buffer
  * row_log_online_op(index, tuple, trx_id, enum row_op op)
  * enum row_op: INSERT, DELETE_MARK, DELETE_UNMARK, PURGE
  * invoked by DML, rollback, and purge, just like the InnoDB change buffer
  * update invokes (INSERT,new_value), (DELETE_MARK,old_value)
(2) row_merge_read_clustered_index() scans the clustered index of the table
  * For every index being created, write an index entry to a merge sort file
(3) row_merge_sort() the buffers (one for every index being created)
(4) Insert the sorted entries to the new index B-tree(s)
(5) row_log_apply() the change logs to the index B-tree(s)
  * this will 'publish' the index inside InnoDB
(6) MySQL upgrades the meta-data lock
(7) commit_inplace_alter_table(commit=true) will rename the created index to non-temporary name, and drop any indexes that were requested to be dropped

Steps (2), (3), (4) are unaffected by
this change.

There is an anomaly. Between steps (5) and (7), a DML operation can fail due to a uniqueness violation in a created UNIQUE INDEX. The index exists in InnoDB at that point, but not in MySQL. Thus, we cannot report the index name to MySQL. It is somewhat wrong to report a duplicate key for DML before the DDL has fully finished. Other types of uniqueness violation observed during index creation would be reported to the DDL thread.

Another anomaly is that when step (6) fails to upgrade the meta-data lock, MySQL will invoke commit_inplace_alter_table(commit=false) to drop any created indexes. These cannot be dropped immediately, because failure to upgrade the meta-data lock means that other threads must be operating on the table, and potentially accessing the index trees. Therefore, we must merely flag the indexes ONLINE_INDEX_ABORTED or ONLINE_INDEX_ABORTED_DROPPED. For such indexes, DML threads will invoke row_log_online_op() as if the index was still being created online. That function would do nothing, returning 'it was buffered' to the DML thread. If online index creation completed successfully, the function would return 'it was not buffered', and the DML thread would insert to the B-tree as usual.

Online index creation can also be aborted when the log file written by row_log_online_op() exceeds the new parameter innodb_online_alter_log_max_size (a new error DB_ONLINE_LOG_TOO_BIG). In this case, DML threads will continue business as usual, and the DDL operation will fail.
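
The interplay between the bulk build and the buffered changes can be pictured with a toy model. The following is only a sketch under simplifying assumptions: a std::set of strings stands in for the index B-tree, an in-memory vector stands in for the temporary log file, the operations are reduced to INSERT and DELETE, and the names ToyOnlineBuild, dml(), build() and apply_log() are hypothetical. It does not model locking, purge, uniqueness checks, or the meta-data lock upgrade.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Simplified operation set; the worklog lists INSERT, DELETE_MARK,
// DELETE_UNMARK and PURGE for the real log.
enum class RowOp { INSERT, DELETE };

struct ToyOnlineBuild {
    std::set<std::string> index;                     // stands in for the new index B-tree
    std::vector<std::pair<RowOp, std::string>> log;  // stands in for the modification log
    bool online = true;                              // true while the index is being built

    // DML path: buffer the change while the build is in progress,
    // otherwise modify the index directly.
    void dml(RowOp op, const std::string& key) {
        if (online) {
            log.emplace_back(op, key);
        } else {
            apply(op, key);
        }
    }

    // Steps (2) to (4): take the scanned entries, sort them, insert them.
    void build(std::vector<std::string> scanned) {
        std::sort(scanned.begin(), scanned.end());
        index.insert(scanned.begin(), scanned.end());
    }

    // Step (5): replay the buffered changes in order, then publish the index.
    void apply_log() {
        for (const auto& rec : log) {
            apply(rec.first, rec.second);
        }
        log.clear();
        online = false;
    }

private:
    void apply(RowOp op, const std::string& key) {
        if (op == RowOp::INSERT) {
            index.insert(key);
        } else {
            index.erase(key);
        }
    }
};
```

In this model, changes made through dml() before apply_log() only reach the log, and changes made afterwards go straight to the set, which mirrors the 'it was buffered' and 'it was not buffered' outcomes described above.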

=== Third part: new counters for INFORMATION_SCHEMA.INNODB_METRICS

ddl_background_drop_indexes: Number of indexes waiting to be dropped after failed index creation
ddl_online_create_index: Number of indexes being created online
ddl_pending_alter_table: Number of ALTER TABLE, CREATE INDEX, DROP INDEX in progress

=== Detailed change description

dict_index_get_online_status(), dict_index_set_online_status(): New functions, for determining the status of index creation (enum online_index_status):
  ONLINE_INDEX_COMPLETE: the index is complete and ready for access
  ONLINE_INDEX_CREATION: the index is being created online (allowing concurrent modifications, not allowing index lookups)
  ONLINE_INDEX_ABORTED: online index creation was aborted and the index should be dropped as soon as index->table->n_ref_count reaches 0
  ONLINE_INDEX_ABORTED_DROPPED: the online index creation was aborted, the index was dropped from the data dictionary and the tablespace, and it should be dropped from the data dictionary cache as soon as index->table->n_ref_count reaches 0

dict_index_is_online_ddl(): Determine if the index is or was being created online (TRUE) or it is usable for lookups (FALSE).

dict_table_any_online_ddl(): Determine if a table contains any indexes that are or were being created online.

dict_index_online_log(): Wrapper for row_log_online_op(), to resolve circular include file dependencies.

dict_index_online_trylog(): Try logging an operation during online index creation. If the index is complete, return FALSE so that the operation will be performed directly on the index.

dict_index_struct: Remove to_be_dropped, and add online_status. Add a union around search_info and a new member, online_log. The adaptive hash index will not be used during online index creation.

dict_table_struct: Add the field drop_aborted, for noting that the table may contain 'aborted' or 'zombie' indexes that have to be dropped as soon as possible.
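
The status values and the behaviour of the 'trylog' check can be summarised in a short sketch. The enumerator names are taken from the description above; everything else (toy_index, its n_logged counter, and the toy_ function names) is hypothetical and only illustrates the decision, not the actual InnoDB data structures or signatures. The numeric values of the enumerators are not meant to match the real header.

```cpp
// Enumerator names as listed in the worklog; order and values are illustrative.
enum online_index_status {
    ONLINE_INDEX_COMPLETE,
    ONLINE_INDEX_CREATION,
    ONLINE_INDEX_ABORTED,
    ONLINE_INDEX_ABORTED_DROPPED
};

// Hypothetical stand-in for the index object.
struct toy_index {
    online_index_status online_status;
    unsigned long n_logged;  // number of operations written to the log
};

// True if the index is or was being created online,
// false if it is usable for lookups.
inline bool toy_index_is_online_ddl(const toy_index& index) {
    return index.online_status != ONLINE_INDEX_COMPLETE;
}

// Returns true if the caller must not touch the B-tree ('it was buffered'),
// false if the operation has to be performed directly on the index.
inline bool toy_index_online_trylog(toy_index& index) {
    switch (index.online_status) {
    case ONLINE_INDEX_COMPLETE:
        return false;
    case ONLINE_INDEX_CREATION:
        ++index.n_logged;  // the operation is written to the log
        return true;
    case ONLINE_INDEX_ABORTED:
    case ONLINE_INDEX_ABORTED_DROPPED:
        return true;       // nothing is logged; the index will be dropped
    }
    return false;
}
```

For an aborted index the function still reports 'buffered' although nothing is recorded, so DML threads never modify an index tree that is waiting to be dropped.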

btr_root_raise_and_insert(), btr_page_split_and_insert(), btr_attach_half_pages(), btr_insert_on_non_leaf_level(): Add undo logging and locking flags. Add the flag BTR_CREATE_FLAG, which allows operations to bypass row_log_online_op() when an index is being created online.

btr_validate_index(): Skip indexes that are being created online.

btr_cur_latch_leaves(): Add the latch_mode BTR_MODIFY_TREE_APPLY_LOG, to be invoked from row_log_apply(). It can skip most of the latching, because the log will be applied by a single thread.

enum btr_latch_mode: Add BTR_MODIFY_TREE_APPLY_LOG and BTR_MODIFY_LEAF_APPLY_LOG, exclusively reserved for row_log_apply(), which is single-threaded for any given index that is being created online.

btr_cur_search_to_nth_level(): Add the latch_mode BTR_MODIFY_TREE_APPLY_LOG and BTR_MODIFY_LEAF_APPLY_LOG. Do not update the adaptive hash index for indexes that are being built online.

btr_cur_open_at_index_side(), btr_cur_open_at_rnd_pos(): Disallow the latch_mode BTR_MODIFY_TREE_APPLY_LOG and BTR_MODIFY_LEAF_APPLY_LOG. These functions are not to be called during online index creation.

btr_cur_ins_lock_and_undo(), btr_cur_optimistic_insert(), btr_cur_pessimistic_insert(), btr_cur_upd_lock_and_undo(), btr_cur_update_in_place(), btr_cur_optimistic_update(), btr_cur_pessimistic_update(), btr_cur_optimistic_delete(), btr_cur_pessimistic_delete(): Assert that the index is not being built online, or the BTR_CREATE_FLAG is being passed (from row_log_apply()).

btr_search_drop_page_hash_index(), btr_search_build_page_hash_index(), btr_search_get_info(): Assert that the index is not being created online.

dict_build_index_def_step(): Record only the first table_id created in the transaction. Crash recovery would drop that table if the data dictionary transaction is found to be incomplete.

dict_table_try_drop_aborted(), dict_table_try_drop_aborted_and_mutex_exit(): Try to drop any 'aborted' or 'zombie' indexes.

dict_table_close(), dict_table_open_on_id(), dict_table_open_on_name_low(), dict_table_open_on_name(), dict_table_open_on_name_no_stats(): Add the parameter try_drop, for trying to drop incomplete indexes when dict_locked=FALSE.

dict_table_remove_from_cache_low(): Try to drop 'aborted' or 'zombie' indexes.

dict_index_add_to_cache(): Assert that the index is not being created online. The flag would be set later.

dict_index_remove_from_cache_low(): Clean up after aborted online index creation.

dict_table_get_foreign_constraint(): Consider both referencing and referenced indexes.

dict_foreign_find_index(): Add const qualifiers. Remove the reference to index->to_be_dropped. This will be checked elsewhere.

dict_foreign_find_equiv_index(): Replaced by dict_foreign_find_index().

dict_table_replace_index_in_foreign_list(): Renamed to dict_foreign_replace_index(). This will not work properly until WL#6049 (meta-data locking for foreign key checks) has been implemented.

dict_table_check_for_dup_indexes(): Replace the parameter ibool tmp_ok with enum check_name: CHECK_ALL_COMPLETE, CHECK_ABORTED_OK, CHECK_PARTIAL_OK.

dict_lru_validate(), dict_lru_find_table(): Make static.

dict_load_columns(): Check errors from dict_load_column_low() a little earlier.

dict_stats_update_transient_for_index(): Refactored from dict_stats_update_transient(). We need to be able to update the statistics for a particular index, once the index has been created.

dict_stats_update_persistent(): Skip indexes that are corrupted or being created online.

dict_stats_fetch_from_ps_for_index(), dict_stats_update_for_index(): New functions, for updating index statistics after index creation.

dict_stats_delete_index_stats(): Take the table and index name as the parameter, instead of taking a dict_index_t. We will drop the statistics after the object has been freed.

innobase_index_reserve_name: A global constant for the predefined name GEN_CLUST_INDEX.

convert_error_code_to_mysql(): Make static in ha_innodb.cc.
The ALTER TABLE code in handler0alter.cc will invoke a new function my_error_innodb() instead, so that my_error() will be invoked exactly once for each error.

ha_innobase::info_low(): Ignore indexes that are being created online.

ha_innobase::check(): Ignore indexes that are being created or dropped.

my_error_innodb(): Error reporting for most conditions in DDL operations (except old_alter_table=1 or CREATE TABLE or DROP TABLE). The errors DB_DUPLICATE_KEY, DB_TABLESPACE_ALREADY_EXISTS, and DB_ONLINE_LOG_TOO_BIG must be handled by the caller.

innobase_check_index_keys(): Replace key_info, num_of_keys with Alter_inplace_info.

innobase_create_index_field_def(): Add const qualifiers.

innobase_create_index_def(): Add const qualifiers. Rename key_primary to key_clustered, because we do create a new clustered index also when creating the FTS_DOC_ID column.

innobase_copy_index_field_def(): Remove. When creating the clustered index (and rebuilding the table), all index definitions will be copied from the MySQL data dictionary, not from the InnoDB data dictionary.

innobase_fts_check_doc_id_col(), innobase_fts_check_doc_id_index_in_def(): Add const qualifiers.

innobase_fts_check_doc_id_index(): Add ha_alter_info, for checking indexes that are to be added.

innobase_create_key_def(): Rename to innobase_create_key_defs(). Move some handling of full-text index creation to the caller.

innobase_check_column_length(): Change the return type from int to bool.

innobase_find_equiv_index(): Similar to dict_foreign_find_index(), but searches the to-be-added indexes instead of existing ones.

innobase_create_fts_doc_id_idx(): Add the parameter new_clustered.

innobase_add_index_cleanup(): Remove.

online_retry_drop_indexes_low(), online_retry_drop_indexes(): Drop 'aborted' or 'zombie' indexes. Invoked by prepare_inplace_alter_table() while sufficient locks are being held.

prepare_inplace_alter_table_dict(): Prepare the data dictionary for inplace ALTER TABLE (or CREATE INDEX or DROP INDEX).

i_s_fts_index_table_fill_one_index(), i_s_fts_config_fill(): Assert that the index is not being created online. Fulltext indexes are never created online.

row_upd_build_sec_rec_difference_binary(): Take the rec offsets as a parameter, to avoid having to recompute them. Remove the parameter trx.

ins_node_create_entry_list(): Make static.

struct trx_struct: Correct the comment of trx->table_id. It was wrong already when index creation was implemented in the InnoDB Plugin.

lock_clust_rec_cons_read_sees(), lock_rec_create(), lock_rec_enqueue_waiting(), lock_rec_add_to_queue(), lock_rec_lock_fast(), lock_rec_lock_slow(), lock_rec_lock(), lock_rec_queue_validate(), lock_rec_convert_impl_to_expl(), lock_sec_rec_read_check_and_lock(), lock_clust_rec_read_check_and_lock(), lock_get_table(), lock_rec_get_index(), lock_rec_get_index_name(), lock_table_locks_lookup(): Assert that the index is not being created online. These assertions should not be reached for DML threads, because they should be buffering the changes with row_log_online_op(). The row_log_apply() thread will be passing BTR_NO_LOCKING_FLAG, skipping the locking.

lock_rec_insert_check_and_lock(), lock_sec_rec_modify_check_and_lock(): Assert that the index is not being created online, or BTR_CREATE_FLAG is being passed.

lock_rec_insert_check_and_lock(): Remove a bogus assertion about LOCK_S during index creation. Index creation is passing the BTR_NO_LOCKING_FLAG, skipping locking altogether.

opt_calc_index_goodness(): Ignore indexes that are being created online.

row_ins_must_modify_rec(): Update a comment about the uniqueness of node pointers.

row_purge_parse_undo_rec(), row_undo_mod_upd_exist_sec(), row_upd(): Proceed if there are any indexes being created online.

row_ins_index_entry(), row_purge_remove_sec_if_poss(), row_undo_ins_remove_sec_rec(), row_undo_mod_del_mark_or_remove_sec(), row_undo_mod_del_unmark_sec_and_undo_update(): Invoke dict_index_online_trylog().

row_upd_sec_online(): Auxiliary function for logging the update or delete of a record whose index is being created online. Invoked by row_upd_sec_step().

row_create_table_for_mysql(), row_drop_table_for_mysql(): Assert that at most one table is being created or dropped per transaction, or the table is an auxiliary table for full-text search index.

rec_offs_any_null_extern(): Make available in non-debug builds. This will be called when scanning all rows during index creation, in row_merge_read_clustered_index().

row_log_allocate(), row_log_free(), row_log_online_op(), row_log_get_max_trx(), row_log_apply(): The modification log for buffering changes during online index creation.

enum row_op: Index record modification operations buffered by row_log_online_op(): ROW_OP_INSERT, ROW_OP_DELETE_MARK, ROW_OP_DELETE_UNMARK, ROW_OP_PURGE, ROW_OP_DELETE_PURGE.

merge_index_def_struct: Add key_number for the MySQL key number that is being created, or ULINT_UNDEFINED if none.

row_merge_dup_report(), row_merge_file_create_low(), row_merge_file_destroy_low(): Make public, so that these can be called from row0log.cc.

row_merge_drop_index(): Replace with row_merge_drop_indexes_dict() and row_merge_drop_indexes().

row_merge_rename_index_to_add(), row_merge_rename_index_to_drop(): New functions, used in commit_inplace_alter_table() to guarantee somewhat satisfactory crash recovery.

row_merge_build_indexes(): Add the flag 'online'. Add key_numbers. If online index creation fails, flag all created indexes as 'aborted' or 'zombie'.

row_merge_buf_encode(): Refactored from row_merge_buf_write().

row_merge_insert_index_tuples(): Replace trx with trx_id. Remove the parameter 'table'. Remove the dummy query graph and invoke the b-tree functions directly.

row_merge_drop_temp_indexes(): Use direct SQL to drop all temporary indexes.

row_merge_read_clustered_index(): Do not commit the mini-transaction when switching pages except when there is a lock wait on the clustered index tree lock.

MONITOR_MUTEX_INC(), MONITOR_MUTEX_DEC(): New macros, to be used when the mutex protecting the counter is to be acquired and released.

MONITOR_ATOMIC_INC(), MONITOR_ATOMIC_DEC(): Define these for non-atomic builds as well. Use a new mutex (monitor_mutex) for protection in that case.

row_ins_index_entry_big_rec(): A new function, for inserting the externally stored fields (off-page columns) of a clustered index entry.

rb:854 approved by Jimmy Yang