WL#6326: InnoDB: fix index->lock contention
Status: Complete
The next problem to be solved for scalability of RW workload is index->lock contention. The lock represents whole of the index tree structure. It can be split into each block->lock in the tree.
The acquisition of node pointer page latches is protected by index->lock latch. Before WL#6326, index->lock protected all node pointer pages either in S or X mode, and no individual block->lock were acquired on node pointer pages. After WL#6326, node pointer pages are protected by individual block->lock S-latch or X-latch. The acquisition of node pointer page latches is covered by index->lock, for preventing deadlocks. (0) Definition: B-tree level. (0.1) The leaf pages of the B-tree are at level 0. (0.2) The parent of a page at level L has level L+1. (The level of the root page is equal to the tree height.) (0.3) The B-tree lock (index->lock) has level=HEIGHT+1, i.e., it is the parent of the root page. There are 3 locking modes of index->lock: (1) When holding index->lock S-latch: (1.1) Page latches must be acquired in descending order of tree level. (1.2) When acquiring the first node pointer page latch at a given B-tree level, we must hold the parent latch (at level+1). (1.3) If there is a node pointer page already latched on the same level, we can only acquire its right sibling page latch. (1.4) Node pointer page latches must be released in child-to-parent order. (This intends to prevent deadlocks with the index->lock SX-lock case.) (1.4.1) A node pointer page latch at level L can only be released when not holding any latches at levellock SX-latch: Rules (1.2) and (1.3) are relaxed and merged into (2.2), and rule (1.4) is removed. This means that we can skip page latch acquisition at some tree levels, and can acquire latches in a less restricted order. (2.1) [same as (1.1)]: Page latches must be acquired in descending order of tree level. (2.2) When acquiring a node pointer page latch at level L, we must hold the left sibling page latch (at level L) or some ancestor latch (at level>L). (2.3) [derived from (2.1), (2.2)] The first node pointer page latch acquired can be any node pointer page. (3) When holding index->lock X-latch: We can acquire node pointer page latches in any order. NOTE: WL#6326 does not affect the latching rules of leaf pages: Reads need index->lock S-latch for the node pointer traversal. Once the leaf level is reached, we can release index->lock (and with the WL#6326 changes, all node pointer latches). Once we are on a leaf page, we can safely acquire the right sibling leaf page latch, release the old page latch, and do a left-to-right index scan. Modifications on a single leaf page (BTR_MODIFY_LEAF) are covered by index->lock S-latch throughout the operation. Operations involving page splits or merges (BTR_MODIFY_TREE) and page allocations are covered by index->lock X-latch. EXAMPLES: [1] More than one non-leaf block->lock latch order for each index->lock level which the thread already has already lock S> latched case (*allows other lock S> and lock SX> case) (1st latch page) Latching page should be started from root page Graph 1 a | +---+---+ | | b c | | +-+-+ +-+-+ | | | | d e f g - the 1st latching should be 'a' (latching lower level pages) 1st latch page on the level needs its parent page's lock Graph 2 (a) | +---+---+ | | b c | | +-+-+ +-+-+ | | | | d e f g * "()" means latched - in this case, only 'b' or 'c' as next latch - cannot skip to level 1 ('d'~'g') (right side pages) After 2nd page on the same level should be contiguous right page of the latched page. "And its parent page should be locked already also." Graph 3 (a) | +---+---+ | | (b) c | | +-+-+ +-+-+ | | | | (d) e f g * "()" means latched - in this case, at level 1, only 'e' as next latch - if we want to latch 'f' also, 'e' and 'c' should be latched before. (left side pages) prohibited Graph 4 (a) | +---+---+ | | (b) c | | +-+-+ +-+-+ | | | | d (e) f g * "()" means latched - in this case, we cannot latch 'd' - if want to latch 'd', need to release 'e' already lock SX> latched case (*allow other lock S> case only) (1st) 1st page in the tree can be every page, because no other index->lock SX that is allowed not to start from root page. Graph 5 a | +---+---+ | | b c | | +-+-+ +-+-+ | | | | d e f g - all are OK as 1st latch (latching lower level pages) 1st page on the level needs its ancestor page's lock And should not have descendant page's lock already (downward only) Graph 6 (a) | +---+---+ | | b c | | +-+-+ +-+-+ | | | | d e f g * "()" means latched - in this case, not only level 2 but also level 1 as next latch - but we cannot latch 'b' after 'd' latched (downward only) (right side pages) : (*important for validate level, stats_analyze etc..) After 2nd page on the same level should be contiguous right page of the latched page. "And no need its parent or ancestor page's lock" Graph 7 a | +---+---+ | | (b) c | | +-+-+ +-+-+ | | | | (d) e f g * "()" means latched - we can latch 'e'->'f'->'g', without latching 'a' or 'c' (left side pages) : (*important for BTR_MODIFY_TREE to modify next_page value of the left page) if common ancestor page is locked, left side contiguous page can be latched * it is enough to exclude (right side) of lock S> above. solved at the ancestor page's level Graph 8 (a) | +---+---+ | | b (c) | | +-+-+ +-+-+ | | | | d e (f) g * "()" means latched - we can latch 'e' next, because ancestor 'a' is already latched already lock X> latched case (*exclude all others) Every order is OK lock> Single block only ------- * The key point of the rule set is "how to allow to latch left side pages". - allowed thread should be only 1 --> index->lock SX is needed - how to control with other index->lock S threads --> lock S>: start from root always downward direct child only lock SX>: ancestor page is needed This rule can avoid deadlock between left side latch and right side latch. -------- (yasufumi comment: left side and right side for lock SX> case might be allowed "not contiguous", if the 1st latched page's descendant. because the 1st page lock blocks the other threads for the branch under the 1st page.)
[2] change for index->lock level btr_validate_index() index->lock X -> SX btr_validate_level() changed also btr_validate_level() index->lock X -> SX block->lock X -> SX Using several mtr. Locking block down from root page along with left side of index until target level. Scanning the (level - 1) and the (level) with locking to right side. (* change: previously, (level - 1) blocks were not locked) | | +------ +------- * block->lock is SX because used recursively. (S is not allowed to be used in recursively) btr_cur_search_to_nth_level() btr_cur_open_at_index_side_func() btr_cur_open_at_rnd_pos_func() BTR_SEARCH_LEAF BTR_MODIFY_LEAF index->lock S (not changed) block->lock (not-leaf) NO -> Slatch root block (S) latch root-1 block (S) .... latch leaf+1 block (S) (latch leaf block) (not changed) (release index->lock) (not changed) release root ~ leaf+1 blocks BTR_MODIFY_TREE index->lock X -> SX block->lock (not-leaf) NO -> X (Because has index->lock with SX latch, no other threads can modify tree structure. So, we don't need block->lock for just confirm needed blocks.) confirm needed blocks latch root block (SX), if not will be modified (X) , if will be modified (SX is for segment operation should be done later.) lock other not-leaf blocks (X) in tree order, (blocks which will be modified only) BTR_SEARCH_TREE (new) index->lock SX block->lock (not-leaf) S block->lock is same to BTR_SEARCH_LEAF BTR_CONT_MODIFY_TREE BTR_CONT_SEARCH_TREE (new) Only lock the target block. To lock the target by SX, BTR_CONT_SEARCH_TREE is added. (The reason why not S for cont_search_tree is, btr_validate_level() needs recursive latching.) ut_ad(own index->lock) X -> SX block->lock (not-leaf) NO (not changed) target block->lock (not-leaf) X (BTR_CONT_MODIFY_TREE) (not changed) SX (BTR_CONT_SEARCH_TREE) (new) BTR_SEARCH_PREV BTR_MODIFY_PREV index->lock S (not changed) block->lock (not-leaf) NO -> S Basically same to BTR_SEARCH_LEAF, BTR_MODIFY_LEAF. But if the node_ptr to leaf page is the leftest record in the page, locks also the left uncle page. (If also node_ptr is the leftest in the grand-parent page, its left page also. ....) dict_stats_analyze_index() index->lock S -> SX Because dict_stats_analyze_index_level() scans the each levels in the tree. "latch right page without parent page's latch" needs SX of index->lock for the new latching order. ibuf_tree_root_get() index->lock X -> SX root page X -> SX Mainly for flst access called from ibuf_add_free_page(), ibuf_delete_rec() ibuf_insert_low(), ibuf_remove_free_page() No performance problem also for ibuf_is_empty(). row_ins_sec_index_entry_low() wider index->lock for temp table with fast index creation BTR_MODIFY_LEAF S (not changed) !BTR_MODIFY_LEAF (BTR_MODIFY_TREE base intended) X -> SX (because BTR_MODIFY_TREE) row_purge_remove_sec_if_poss_tree() index->lock for temp table with fast index creation X -> SX (because BTR_MODIFY_TREE intended) row_purge_remove_sec_if_poss_leaf() index->lock for temp table with fast index creation S (not changed) (BTR_MODIFY_LEAF intended) row_purge_upd_exist_or_extern_func() "Free possible externally stored fields" path index->lock X -> SX just prevent the other modifying tree root page X -> SX for page freeing in the segment row_undo_ins_remove_clust_rec() wider index->lock for if (online = dict_index_is_online_ddl(index)) S (not changed) (BTR_MODIFY_LEAF) row_undo_ins_remove_sec_low() BTR_MODIFY_LEAF index->lock S (not changed) BTR_MODIFY_TREE index->lock X -> SX row_undo_mod_del_mark_or_remove_sec_low() wider index->lock for temp table with fast index creation BTR_MODIFY_LEAF S (not changed) BTR_MODIFY_TREE X -> SX (because BTR_MODIFY_TREE) row_undo_mod_del_unmark_sec_and_undo_update() wider index->lock for temp table with fast index creation BTR_MODIFY_LEAF S (not changed) BTR_MODIFY_TREE X -> SX (because BTR_MODIFY_TREE) row_undo_mod_clust() wider index->lock for if (online = dict_index_is_online_ddl(index)) S (not changed) (BTR_MODIFY_LEAF) row_ins_clust_index_entry_low() wider index->lock for if (mode == BTR_MODIFY_LEAF && dict_index_is_online_ddl(index)) S (not changed) row_upd_clust_step() wider index->lock for if (dict_index_is_online_ddl(index)) S (not changed) (BTR_MODIFY_LEAF) row_upd_sec_index_entry() wider index->lock for temp table with fast index creation S (not changed) (BTR_MODIFY_LEAF) [3] other changes for block->lock write IO S -> SX to block the other SX-lock of updating not-user data in the page fsp list accesses X -> SX updating not-user records can be relaxed to SX calling fut_get_ptr(RW_S_LATCH -> RW_SX_LATCH)
Copyright (c) 2000, 2025, Oracle Corporation and/or its affiliates. All rights reserved.