WL#5526: InnoDB: Online ADD INDEX

Affects: Server-Prototype Only — Status: Complete

Currently, InnoDB supports fast add/drop indexes - create or drop indexes
without copying the contents of the entire table. See 
http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-create-index-overview.html
for details,

This worklog is to track online ADD INDEX and DROP INDEX

Limitations of current implementation in MySQL 5.1 InnoDB Plugin and MySQL 5.5:

• Does not copy history (old versions of records)
 → only transactions that started after CREATE INDEX may use the index
• Does not allow concurrent writes to the table (but allows reads)

This feature will only remove the latter limitation, that is, allow the table to
be modified while secondary index(es) are being created, by creating a temporary
log file. The size of this log file is a multiple of innodb_sort_buf_size, and
it can be limited by a new parameter, innodb_online_alter_log_max_size.

The following limitations will remain for now:

• CREATE PRIMARY KEY and CREATE FULLTEXT INDEX will lock the table (no online
operation)
• ADD COLUMN, and DROP COLUMN require a table copy until WL#5854
• DROP FOREIGN KEY, RENAME INDEX, RENAME COLUMN will be in WL#5545

== Replacing the ALTER TABLE API (WL#5534, WL#5548) ==

Replace add_index(), prepare_drop_index() and final_drop_index() with the
generic DDL methods implemented in WL#5534. This allows us to support things
like ADD INDEX i, DROP INDEX i without problems.

If a successful index creation needs to be rolled back because the meta-data
lock upgrade failed in WL#5534, we cannot immediately drop the index(es),
because the table will typically be in use by other threads (this is what timed
out the MDL upgrade in the first place). Instead, we mark the index as ‘aborted’
and will attempt to drop it when we get the chance (reference count has dropped
to zero, or the index is being evicted from the data dictionary cache).

== Online index creation ==

The algorithm:

(1) Create temporary file(s) for index change log(s)
 • one file for all indexes being created, or one log file per index
 • a logical variant of the InnoDB change (insert) buffer
 • record format: (operation, index tuple)
  • operation: INSERT, DELETE, PURGE
  • written by row_log_online_op()
 • written by DML and purge, just like the InnoDB change buffer
  • Update (INSERT,new_value), (DELETE,old_value)
(2) Scan the clustered index of the table
 • For every index being created, write an index entry to a merge sort file
(3) Merge sort the buffers (one for every index being created)
(4) Insert the sorted entries to the new index B-trees
(5) Start applying the change logs to the index B-trees: row_log_apply()
(6) X-lock, sync and delete the logs, update SYS_INDEXES
(7) upgrade the meta-data lock, do commit_inplace_alter_table(commit=true)
rename the .frm and the SYS_INDEXES

Step (1), (5), and (6) are modifications to the current algorithm.

Mark Callaghan at Facebook describes this algorithm in his blog post about an
application-level online schema change implementation:
http://www.facebook.com/note.php?note_id=430801045932

There is an anomaly. Between steps (6) and (7), a DML operation can fail due to
a uniqueness violation in a created UNIQUE INDEX. The index exists in InnoDB at
that point, but not in MySQL. Thus, we cannot report the index name to MySQL. It
is somewhat wrong to report a duplicate key for DML before the DDL has fully
finished. Other types of uniqueness violation would be reported to the DDL thread.

For ‘aborted’ (lazily dropped) indexes, DML threads will invoke
row_log_online_op() as if the index was still being created online. That
function would do nothing, returning ‘it was buffered’ to the DML thread. If
online index creation completes, the function would return ‘it was not
buffered’, and the DML thread would insert to the B-tree as usual.

Online index creation can also be aborted when the log file written by
row_log_online_op() exceeds innodb_online_alter_log_max_size (a new parameter).
In this case, DML threads will continue business as usual, and the DDL operation
will fail.

== Memory usage ==

Index creation uses two large data structures for every index that is being created:

(a) Buffers for merge sort.
* There is no limit for the temporary files, other than what is imposed by the
file system (parameter 'tmpdir').
* 3 in-memory buffers will be allocated for the merge sort, per ALTER TABLE or
CREATE INDEX statement. The buffer size in bytes is 'innodb_sort_buffer_size'.
* Fulltext index creation uses a different configuration, due to parallel sort
('innodb_ft_sort_pll_degree').
(b) Logical change buffer (for online index creation). This is analogous to the
buffer-pool-based change buffer. Instead of buffering
(space_id,page_no,operation,record) for secondary index leaf pages, it records
(operation,record).
* Two memory buffers of 'innodb_sort_buf_size' will be allocated for writing and
reading the log file. When a log block is about to be written to a temporary
file, the file size will be checked against 'innodb_online_alter_log_max_size'
(in bytes). If the maximum file size would be exceeded, the index creation will
fail with DB_ONLINE_LOG_TOO_BIG.

InnoDB: WL#5526 online ADD INDEX, WL#5548 online DROP INDEX

=== First part: Replace the old ALTER TABLE API (WL#5534).

Replace add_index(), prepare_drop_index() and final_drop_index() with
the generic DDL methods implemented in WL#5534. This allows us to
support things like ADD INDEX i(i), DROP INDEX i without problems. The
new methods are roughly used as follows:

ha_innobase::check_if_supported_inplace_alter_table(): Check if the
operation is at all supported, and what kind of lock is
needed. Creating a PRIMARY KEY or a FULLTEXT INDEX requires the table
to be X-locked. All other supported operations can be done 'online',
allowing concurrent modification of the table.

ha_innobase::prepare_inplace_alter_table(): For index creation, start
a dictionary transaction and create the data dictionary objects. For
all operations, check if it is allowed.

ha_innobase::inplace_alter_table(): Create any indexes, if index
creation was requested.  Otherwise, do nothing. This method will be
invoked for both online and offline index creation. If the operation
is online, allocate a modification log for the secondary index(es)
that will be created. If the operation is offline for any reason, do
not allocate modification logs. This method will also notice and
report any duplicate key errors.

ha_innobase::commit_inplace_alter_table(): Commit or roll back the
operation. Rollback may be initiated after a failed operation, or
after a successfull operation when MySQL fails to upgrade the
meta-data lock for rewriting the .frm file.

If commit=true, drop the indexes that were requested to be dropped
(WL#5548). Before dropping, rename the indexes to start with "\377" so
that crash recovery will flag it as an incomplete index and drop it.

If a successful index creation needs to be rolled back, we cannot
immediately drop the index(es), because the table will typically be in
use by other threads (this is what timed out the MDL upgrade in the
first place). Instead, we mark the index as 'aborted' or 'zombie' and
will attempt to drop it when we get the chance (reference count has
dropped to zero, or the index is being evicted from the data
dictionary cache).

innobase_alter_table_flags(), check_column_being_renamed(),
column_is_being_renamed(), foreign_key_column_is_being_renamed():
Remove. Remove the column rename changes from
ha_innobase::check_if_incompatible_data(). All these checks will be
performed as part of check_if_supported_inplace_alter_table().

add_index(), final_add_index(), prepare_drop_index(),
final_drop_index(): Remove. These are replaced by the
prepare_inplace_alter_table(), inplace_alter_table(), and
commit_inplace_alter_table().

innodb_online_alter_log_max_size (srv_online_max_size): New parameter
for limiting the amount of modification log that is allowed to
accumulate during online index creation. Add DB_ONLINE_LOG_TOO_BIG to
enum db_err for exceeding this.

=== Second part: online index creation (WL#5534).

The algorithm:

(1) row_log_allocate() creates a temporary file for each index being created
 * a logical variant of the InnoDB change (insert) buffer
 * row_log_online_op(index, tuple, trx_id, enum row_op op)
  * enum row_op: INSERT, DELETE_MARK, DELETE_UNMARK, PURGE
  * invoked by DML, rollback, and purge, just like the InnoDB change buffer
  * update invokes (INSERT,new_value), (DELETE_MARK,old_value)
(2) row_merge_read_clustered_index() scans the clustered index of the table
 * For every index being created, write an index entry to a merge sort file
(3) row_merge_sort() the buffers (one for every index being created)
(4) Insert the sorted entries to the new index B-tree(s)
(5) row_log_apply() the change logs to the index B-tree(s)
 * this will 'publish' the index inside InnoDB
(6) MySQL upgrades the meta-data lock
(7) commit_inplace_alter_table(commit=true) will rename the created index to
non-temporary name, and drop any indexes that were requested to be dropped

Steps (2), (3), (4) are unaffected by this change.

There is an anomaly. Between steps (5) and (7), a DML operation can
fail due to a uniqueness violation in a created UNIQUE INDEX. The
index exists in InnoDB at that point, but not in MySQL. Thus, we
cannot report the index name to MySQL. It is somewhat wrong to report
a duplicate key for DML before the DDL has fully finished. Other types
of uniqueness violation observed during index creation would be
reported to the DDL thread.

Another anomaly is that when step (6) fails to upgrade the meta-data
lock, MySQL will invoke commit_inplace_alter_table(commit=false) to
drop any created indexes. These cannot be dropped immediately, because
failure to upgrade the meta-data lock means that other threads must be
operating on the table, and potentially accessing the index trees.
Therefore, we must merely flag the indexes ONLINE_INDEX_ABORTED or
ONLINE_INDEX_ABORTED_DROPPED. For such indexes, DML threads will
invoke row_log_online_op() as if the index was still being created
online. That function would do nothing, returning 'it was buffered' to
the DML thread. If online index creation completed successfully, the
function would return 'it was not buffered', and the DML thread would
insert to the B-tree as usual.

Online index creation can also be aborted when the log file written by
row_log_online_op() exceeds the new parameter
innodb_online_alter_log_max_size (a new error DB_ONLINE_LOG_TOO_BIG).
In this case, DML threads will continue business as usual, and the DDL
operation will fail.

=== Third part: new counters for INFORMATION_SCHEMA.INNODB_METRICS

ddl_background_drop_indexes
  Number of indexes waiting to be dropped after failed index creation
ddl_online_create_index
  Number of indexes being created online
ddl_pending_alter_table
  Number of ALTER TABLE, CREATE INDEX, DROP INDEX in progress

=== Detailed change description

dict_index_get_online_status(), dict_index_set_online_status(): New functions,
for determining the status of index creation (enum online_index_status):
ONLINE_INDEX_COMPLETE:
  the index is complete and ready for access
ONLINE_INDEX_CREATION:
  the index is being created online
  (allowing concurrent modifications, not allowing index lookups)
ONLINE_INDEX_ABORTED:
  online index creation was aborted and the index
  should be dropped as soon as index->table->n_ref_count reaches 0
ONLINE_INDEX_ABORTED_DROPPED:
  the online index creation was aborted, the index was dropped from
  the data dictionary and the tablespace, and it should be dropped
  from the data dictionary cache as soon as index->table->n_ref_count
  reaches 0

dict_index_is_online_ddl(): Determine if the index is or was being
created online (TRUE) or it is useable for lookups (FALSE).

dict_table_any_online_ddl(): Determine if a table contains any indexes
that are or were being created online.

dict_index_online_log(): Wrapper for row_log_online_op(), to resolve
circular include file dependencies.

dict_index_online_trylog(): Try logging an operation during online
index creation. If the index is complete, return FALSE so that the
operation will be performed directly on the index.

dict_index_struct: Remove to_be_dropped, and add online_status. Add a
union around search_info and a new member, online_log. The adaptive
hash index will not be used during online index creation.

dict_table_struct: Add the field drop_aborted, for noting that the
table may contain 'aborted' or 'zombie' indexes that have to be
dropped as soon as possible.

btr_root_raise_and_insert(), btr_page_split_and_insert(),
btr_attach_half_pages(), btr_insert_on_non_leaf_level(): Add undo
logging and locking flags. Add the flag BTR_CREATE_FLAG, which allows
operations to bypass row_log_online_op() when an index is being
created online.

btr_validate_index(): Skip indexes that are being created online.

btr_cur_latch_leaves(): Add the latch_mode BTR_MODIFY_TREE_APPLY_LOG,
to be invoked from row_log_apply(). It can skip most of the latching,
because the log will be applied by a single thread.

enum btr_latch_mode: Add BTR_MODIFY_TREE_APPLY_LOG and
BTR_MODIFY_LEAF_APPLY_LOG, exclusively reserved for row_log_apply(),
which is single-threaded for any given index that is being created
online.

btr_cur_search_to_nth_level(): Add the latch_mode
BTR_MODIFY_TREE_APPLY_LOG and BTR_MODIFY_LEAF_APPLY_LOG. Do not update
the adaptive hash index for indexes that are being built online.

btr_cur_open_at_index_side(), btr_cur_open_at_rnd_pos(): Disallow the
latch_mode BTR_MODIFY_TREE_APPLY_LOG and
BTR_MODIFY_LEAF_APPLY_LOG. These functions are not to be called during
online index creation.

btr_cur_ins_lock_and_undo(), btr_cur_optimistic_insert(),
btr_cur_pessimistic_insert(),
btr_cur_upd_lock_and_undo(), btr_cur_update_in_place(),
btr_cur_optimistic_update(), btr_cur_pessimistic_update(),
btr_cur_optimistic_delete(), btr_cur_pessimistic_delete() : Assert that
the index is not being built online, or the BTR_CREATE_FLAG is being
passed (from row_log_apply()).

btr_search_drop_page_hash_index(), btr_search_build_page_hash_index(),
btr_search_get_info(): Assert that the index is not being created online.

dict_build_index_def_step(): Record only the first table_id created in
the transaction. Crash recovery would drop that table if the
data dictionary transaction is found to be incomplete.

dict_table_try_drop_aborted(), dict_table_try_drop_aborted_and_mutex_exit():
Try to drop any 'aborted' or 'zombie' indexes.

dict_table_close(), dict_table_open_on_id(),
dict_table_open_on_name_low(), dict_table_open_on_name(),
dict_table_open_on_name_no_stats(): Add the parameter try_drop, for
trying to drop incomplete indexes when dict_locked=FALSE.

dict_table_remove_from_cache_low(): Try to drop 'aborted' or 'zombie'
indexes.

dict_index_add_to_cache(): Assert that the index is not being created
online. The flag would be set later.

dict_index_remove_from_cache_low(): Clean up after aborted online
index creation.

dict_table_get_foreign_constraint(): Consider both referencing and
referenced indexes.

dict_foreign_find_index(): Add const qualifiers. Remove the reference
to index->to_be_dropped. This will be checked elsewhere.

dict_foreign_find_equiv_index(): Replaced by dict_foreign_find_index().

dict_table_replace_index_in_foreign_list(): Renamed to
dict_foreign_replace_index(). This will not work properly until
WL#6049 (meta-data locking for foreign key checks) has been
implemented.

dict_table_check_for_dup_indexes(): Replace the parameter ibool tmp_ok
with enum check_name: CHECK_ALL_COMPLETE, CHECK_ABORTED_OK,
CHECK_PARTIAL_OK.

dict_lru_validate(), dict_lru_find_table(): Make static.

dict_load_columns(): Check errors from dict_load_column_low() a little
earlier.

dict_stats_update_transient_for_index(): Refactored from
dict_stats_update_transient(). We need to be able to update the
statistics for a particular index, once the index has been created.

dict_stats_update_persistent(): Skip indexes that are corrupted or
being created online.

dict_stats_fetch_from_ps_for_index(): dict_stats_update_for_index():
New functions, for updating index statistics after index creation.

dict_stats_delete_index_stats(): Take the table and index name as the
parameter, instead of taking a dict_index_t. We will drop the
statistics after the object has been freed.

innobase_index_reserve_name: A global constant for the predefined name
GEN_CLUST_INDEX.

convert_error_code_to_mysql(): Make static in ha_innodb.cc. The ALTER
TABLE code in handler0alter.cc will invoke a new function
my_error_innodb() instead, so that my_error() will be invoked exactly
once for each error.

ha_innobase::info_low(): Ignore indexes that are being created online.

ha_innobase::check(): Ignore indexes that are being created or dropped.

my_error_innodb(): Error reporting for most conditions in DDL
operations (except old_alter_table=1 or CREATE TABLE or DROP TABLE).
The errors DB_DUPLICATE_KEY, DB_TABLESPACE_ALREADY_EXISTS, and
DB_ONLINE_LOG_TOO_BIG must be handled by the caller.

innobase_check_index_keys(): Replace key_info[], num_of_keys with
Alter_inplace_info.

innobase_create_index_field_def(): Add const qualifiers.

innobase_create_index_def(): Add const qualifiers. Rename key_primary
to key_clustered, because we do create a new clustered index also when
creating the FTS_DOC_ID column.

innobase_copy_index_field_def(): Remove. When creating the clustered
index (and rebuilding the table), all index definitions will be copied
from the MySQL data dictionary, not from the InnoDB data dictionary.

innobase_fts_check_doc_id_col(),
innobase_fts_check_doc_id_index_in_def(): Add const qualifiers.

innobase_fts_check_doc_id_index(): Add ha_alter_info, for checking
indexes that are to be added.

innobase_create_key_def(): Rename to innobase_create_key_defs(). Move
some handling of full-text index creation to the caller.

innobase_check_column_length(): Change the return type from int to bool.

innobase_find_equiv_index(): Similar to_foreign_find_index(), but
searches the to-be-added indexes instead of existing ones.

innobase_create_fts_doc_id_idx(): Add the parameter new_clustered.

innobase_add_index_cleanup(): Remove.

online_retry_drop_indexes_low(), online_retry_drop_indexes(): Drop
'aborted' or 'zombie' indexes.  Invoked by
prepare_inplace_alter_table() while sufficient locks are being held.

prepare_inplace_alter_table_dict(): Prepare the data dictionary for
inplace ALTER TABLE (or CREATE INDEX or DROP INDEX).

i_s_fts_index_table_fill_one_index(), i_s_fts_config_fill(): Assert
that the index is not being created online. Fulltext indexes are never
being created online.

row_upd_build_sec_rec_difference_binary(): Take the rec offsets as a
parameter, to avoid having to recompute it. Remove the parameter trx.

ins_node_create_entry_list(): Make static.

struct trx_struct: Correct the comment of trx->table_id.  It was wrong
already when index creation was implemented in the InnoDB Plugin.

lock_clust_rec_cons_read_sees(), lock_rec_create(),
lock_rec_enqueue_waiting(), lock_rec_add_to_queue(),
lock_rec_lock_fast(), lock_rec_lock_slow(), lock_rec_lock(),
lock_rec_queue_validate(), lock_rec_convert_impl_to_expl(),
lock_sec_rec_read_check_and_lock(), lock_clust_rec_read_check_and_lock(),
lock_get_table(), lock_rec_get_index(), lock_rec_get_index_name(),
lock_table_locks_lookup(): Assert that the index is not being created
online. These assertions should not be reached for DML threads,
because they should be buffering the changes with row_log_online_op().
The row_log_apply() thread will be passing BTR_NO_LOCKING_FLAG,
skipping the locking.

lock_rec_insert_check_and_lock(),
lock_sec_rec_modify_check_and_lock(): Assert that the index is not
being created online, or BTR_CREATE_FLAG is being passed.

lock_rec_insert_check_and_lock(): Remove a bogus assertion about
LOCK_S during index creation. Index creation is passing the
BTR_NO_LOCKING_FLAG, skipping locking altogether.

opt_calc_index_goodness(): Ignore indexes that are being created online.

row_ins_must_modify_rec(): Update a comment about the uniqueness of
node pointers.

row_purge_parse_undo_rec(), row_undo_mod_upd_exist_sec(), row_upd():
Proceed if there are any indexes being created online.

row_ins_index_entry(), row_purge_remove_sec_if_poss(),
row_undo_ins_remove_sec_rec(), row_undo_mod_del_mark_or_remove_sec(),
row_undo_mod_del_mark_or_remove_sec(),
row_undo_mod_del_unmark_sec_and_undo_update(): Invoke
dict_index_online_trylog().

row_upd_sec_online(): Auxiliary function for logging the update or
delete of a record whose index is being created online. Invoked by
row_upd_sec_step().

row_create_table_for_mysql(), row_drop_table_for_mysql(): Assert that
at most one table is being created or dropped per transaction, or the
table is an auxiliary table for full-text search index.

rec_offs_any_null_extern(): Make available in non-debug builds. This
will be called when scanning all rows during index creation, in
row_merge_read_clustered_index().

row_log_allocate(), row_log_free(), row_log_online_op(),
row_log_get_max_trx(), row_log_apply(): The modification log for
buffering changes during online index creation.

enum row_op: Index record modification operations buffered by
row_log_online_op(): ROW_OP_INSERT, ROW_OP_DELETE_MARK,
ROW_OP_DELETE_UNMARK, ROW_OP_PURGE, ROW_OP_DELETE_PURGE.

merge_index_def_struct: Add key_number for the MySQL key number that
is being created, or ULINT_UNDEFINED if none.

row_merge_dup_report(), row_merge_file_create_low(),
row_merge_file_destroy_low(): Make public, so that these can be called
from row0log.cc.

row_merge_drop_index(): Replace with row_merge_drop_indexes_dict() and
row_merge_drop_indexes().

row_merge_rename_index_to_add(), row_merge_rename_index_to_drop(): New
functions, used in commit_inplace_alter_table() to guarantee somewhat
satisfactory crash recovery.

row_merge_build_indexes(): Add the flag 'online'. Add key_numbers[].
If online index creation fails, flag all created indexes as 'aborted'
or 'zombie'.

row_merge_buf_encode(): Refactored from row_merge_buf_write().

row_merge_insert_index_tuples(): Replace trx with trx_id. Remove the
parameter 'table'. Remove the dummy query graph and invoke the b-tree
functions directly.

row_merge_drop_temp_indexes(): Use direct SQL to drop all temporary
indexes.

row_merge_read_clustered_index(): Do not commit the mini-transaction
when switching pages except when there is a lock wait on the clustered
index tree lock.

MONITOR_MUTEX_INC(), MONITOR_MUTEX_DEC(): New macros, to be used when
the mutex protecting the counter is to be acquired and released.

MONITOR_ATOMIC_INC(), MONITOR_ATOMIC_DEC(): Define these for
non-atomic builds as well. Use a new mutex (monitor_mutex) for
protection in that case.

row_ins_index_entry_big_rec(): A new function, for inserting the
externally stored fields (off-page columns) of a clustered index entry.

rb:854 approved by Jimmy Yang