WL#6326: InnoDB: fix index->lock contention

Status: Complete   —   Priority: Medium

The next problem to be solved for scalability of RW workload is index->lock
contention. The lock represents whole of the index tree structure. It can be
split into each block->lock in the tree.
The acquisition of node pointer page latches is protected by index->lock latch.

Before WL#6326, index->lock protected all node pointer pages either in S or X
mode, and no individual block->lock were acquired on node pointer pages.

After WL#6326, node pointer pages are protected by individual block->lock
S-latch or X-latch. The acquisition of node pointer page latches is covered by
index->lock, for preventing deadlocks.

(0) Definition: B-tree level.

(0.1) The leaf pages of the B-tree are at level 0.
(0.2) The parent of a page at level L has level L+1. (The level of the root page
is equal to the tree height.)
(0.3) The B-tree lock (index->lock) has level=HEIGHT+1, i.e., it is the parent
of the root page.

There are 3 locking modes of index->lock:

(1) When holding index->lock S-latch:

(1.1) Page latches must be acquired in descending order of tree level.

(1.2) When acquiring the first node pointer page latch at a given B-tree level,
we must hold the parent latch (at level+1).

(1.3) If there is a node pointer page already latched on the same level, we can
only acquire its right sibling page latch.

(1.4) Node pointer page latches must be released in child-to-parent order.
(This intends to prevent deadlocks with the index->lock SX-lock case.)
(1.4.1) A node pointer page latch at level L can only be released when not
holding any latches at level<L (children), or
(1.4.2) All node pointer page latches can be released in one go, without
acquiring any latches in between.

(1.5) [derived from (1.1), (1.2)] The first node pointer page latch acquired
must be the root page latch.

(2) When holding index->lock SX-latch:

Rules (1.2) and (1.3) are relaxed and merged into (2.2), and rule (1.4) is
removed. This means that we can skip page latch acquisition at some tree levels,
and can acquire latches in a less restricted order.

(2.1) [same as (1.1)]: Page latches must be acquired in descending order of tree
level.

(2.2) When acquiring a node pointer page latch at level L, we must hold the left
sibling page latch (at level L) or some ancestor latch (at level>L).

(2.3) [derived from (2.1), (2.2)] The first node pointer page latch acquired can
be any node pointer page.

(3) When holding index->lock X-latch:

We can acquire node pointer page latches in any order.

NOTE: WL#6326 does not affect the latching rules of leaf pages:

Reads need index->lock S-latch for the node pointer traversal. Once the leaf
level is reached, we can release index->lock (and with the WL#6326 changes, all
node pointer latches). Once we are on a leaf page, we can safely acquire the
right sibling leaf page latch, release the old page latch, and do a
left-to-right index scan.

Modifications on a single leaf page (BTR_MODIFY_LEAF) are covered by index->lock
S-latch throughout the operation.

Operations involving page splits or merges (BTR_MODIFY_TREE) and page
allocations are covered by index->lock X-latch.

EXAMPLES:

[1] More than one non-leaf block->lock latch order for each index->lock level
which the thread already has
  already <index->lock S> latched case (*allows other <index->lock S> and
<index->lock SX> case)
	(1st latch page)
	Latching page should be started from root page

          Graph 1

	      a
	      |
	  +---+---+
	  |       |
	  b       c
	  |       |
	+-+-+   +-+-+
	|   |   |   |
	d   e   f   g

		- the 1st latching should be 'a'


	(latching lower level pages)
	1st latch page on the level needs its parent page's lock

           Graph 2

	     (a)
	      |
	  +---+---+
	  |       |
	  b       c
	  |       |
	+-+-+   +-+-+
	|   |   |   |
	d   e   f   g
	* "()" means latched

		- in this case, only 'b' or 'c' as next latch
		- cannot skip to level 1 ('d'~'g')


	(right side pages)
	After 2nd page on the same level should be contiguous right page
	of the latched page.
	"And its parent page should be locked already also."

          Graph 3

	     (a)
	      |
	  +---+---+
	  |       |
	 (b)      c
	  |       |
	+-+-+   +-+-+
	|   |   |   |
	(d) e   f   g
	* "()" means latched

		- in this case, at level 1, only 'e' as next latch
		- if we want to latch 'f' also, 'e' and 'c' should be latched before.

	(left side pages)
	prohibited

          Graph 4

	     (a)
	      |
	  +---+---+
	  |       |
	 (b)      c
	  |       |
	+-+-+   +-+-+
	|   |   |   |
	d  (e)  f   g
	* "()" means latched

		- in this case, we cannot latch 'd'
		- if want to latch 'd', need to release 'e'

  already <index->lock SX> latched case (*allow other <index->lock S> case only)
	(1st)
	1st page in the tree can be every page, because no other index->lock SX
	that is allowed not to start from root page.

          Graph 5

	      a
	      |
	  +---+---+
	  |       |
	  b       c
	  |       |
	+-+-+   +-+-+
	|   |   |   |
	d   e   f   g

		- all are OK as 1st latch


	(latching lower level pages)
	1st page on the level needs its ancestor page's lock
	And should not have descendant page's lock already (downward only)

          Graph 6

	     (a)
	      |
	  +---+---+
	  |       |
	  b       c
	  |       |
	+-+-+   +-+-+
	|   |   |   |
	d   e   f   g
	* "()" means latched

		- in this case, not only level 2 but also level 1 as next latch
		- but we cannot latch 'b' after 'd' latched (downward only)


	(right side pages) : (*important for validate level, stats_analyze etc..)
	After 2nd page on the same level should be contiguous right page of the latched
page.
	"And no need its parent or ancestor page's lock"

          Graph 7

	      a
	      |
	  +---+---+
	  |       |
	 (b)      c
	  |       |
	+-+-+   +-+-+
	|   |   |   |
	(d) e   f   g
	* "()" means latched

		- we can latch 'e'->'f'->'g', without latching 'a' or 'c'


	(left side pages) : (*important for BTR_MODIFY_TREE to modify next_page value
of the
left page)
	if common ancestor page is locked, left side contiguous page can be latched
		* it is enough to exclude (right side) of <with index->lock S> above.
		solved at the ancestor page's level

          Graph 8

	     (a)
	      |
	  +---+---+
	  |       |
	  b      (c)
	  |       |
	+-+-+   +-+-+
	|   |   |   |
	d   e  (f)   g
	* "()" means latched

		- we can latch 'e' next, because ancestor 'a' is already latched


  already <index->lock X> latched case (*exclude all others)
	Every order is OK

  <without index->lock>
	Single block only

-------
* The key point of the rule set is "how to allow to latch left side pages".
  - allowed thread should be only 1
	--> index->lock SX is needed
  - how to control with other index->lock S threads
	--> <index->lock  S>: start from root always
	                      downward direct child only
	    <index->lock SX>: ancestor page is needed
  This rule can avoid deadlock between left side latch and right side latch.


--------
(yasufumi comment: left side and right side for <index->lock SX> case might be
allowed "not contiguous", if the 1st latched page's descendant. because the 1st
page lock blocks the other threads for the branch under the 1st page.)
[2] change for index->lock level
  btr_validate_index()
	index->lock     X -> SX
		btr_validate_level() changed also

  btr_validate_level()
	index->lock     X -> SX
	block->lock     X -> SX
		Using several mtr.
		Locking block down from root page
		 along with left side of index until target level.
		Scanning the (level - 1) and the (level) with locking to right side.
			(* change: previously, (level - 1) blocks were not locked)

			|
			|
			+------
			+-------

		* block->lock is SX because used recursively.
			(S is not allowed to be used in recursively)

  btr_cur_search_to_nth_level()
  btr_cur_open_at_index_side_func()
  btr_cur_open_at_rnd_pos_func()
	BTR_SEARCH_LEAF
	BTR_MODIFY_LEAF
		index->lock             S (not changed)
		block->lock (not-leaf)  NO -> S

		<latch procedure>
			latch root block (S)
			latch root-1 block (S)
			....
			latch leaf+1 block (S)

			(latch leaf block)      (not changed)
			(release index->lock)   (not changed)

			release root ~ leaf+1 blocks

	BTR_MODIFY_TREE
		index->lock             X -> SX
		block->lock (not-leaf)  NO -> X

		<latch procedure>
			(Because has index->lock with SX latch,
			 no other threads can modify tree structure.
			 So, we don't need block->lock for just confirm
			 needed blocks.)
			confirm needed blocks
			latch root block (SX), if not will be modified
					 (X) , if will be modified
				(SX is for segment operation should be done later.)
			lock other not-leaf blocks (X) in tree order,
				(blocks which will be modified only)

	BTR_SEARCH_TREE (new)
		index->lock             SX
		block->lock (not-leaf)  S

		<latch procedure>
			block->lock is same to BTR_SEARCH_LEAF

	BTR_CONT_MODIFY_TREE
	BTR_CONT_SEARCH_TREE (new)
		Only lock the target block.
		To lock the target by SX, BTR_CONT_SEARCH_TREE is added.
		(The reason why not S for cont_search_tree is,
		 btr_validate_level() needs recursive latching.)

		ut_ad(own index->lock)  X -> SX

		block->lock (not-leaf)  NO (not changed)

		target block->lock (not-leaf)
			X       (BTR_CONT_MODIFY_TREE) (not changed)
			SX      (BTR_CONT_SEARCH_TREE) (new)

	BTR_SEARCH_PREV
	BTR_MODIFY_PREV
		index->lock             S (not changed)
		block->lock (not-leaf)  NO -> S

		<latch procedure>
			Basically same to BTR_SEARCH_LEAF, BTR_MODIFY_LEAF.
			But if the node_ptr to leaf page is the leftest record in the page,
			locks also the left uncle page.
			(If also node_ptr is the leftest in the grand-parent page,
			 its left page also. ....)

  dict_stats_analyze_index()
	index->lock     S -> SX
		Because dict_stats_analyze_index_level()
		scans the each levels in the tree.
		"latch right page without parent page's latch"
		needs SX of index->lock for the new latching order.


  ibuf_tree_root_get()
	index->lock     X -> SX
	root page       X -> SX
		Mainly for flst access
		called from ibuf_add_free_page(), ibuf_delete_rec()
		ibuf_insert_low(), ibuf_remove_free_page()
		No performance problem also for ibuf_is_empty().


  row_ins_sec_index_entry_low()
	wider index->lock for temp table with fast index creation
		BTR_MODIFY_LEAF
			S (not changed)

		!BTR_MODIFY_LEAF (BTR_MODIFY_TREE base intended)
			X -> SX (because BTR_MODIFY_TREE)

  row_purge_remove_sec_if_poss_tree()
	index->lock for temp table with fast index creation
		X -> SX (because BTR_MODIFY_TREE intended)

  row_purge_remove_sec_if_poss_leaf()
	index->lock for temp table with fast index creation
		S (not changed)  (BTR_MODIFY_LEAF intended)

  row_purge_upd_exist_or_extern_func()
	"Free possible externally stored fields" path
	index->lock     X -> SX
		just prevent the other modifying tree

	root page       X -> SX
		for page freeing in the segment

  row_undo_ins_remove_clust_rec()
	wider index->lock for if (online = dict_index_is_online_ddl(index))
		S (not changed)  (BTR_MODIFY_LEAF)

  row_undo_ins_remove_sec_low()
	BTR_MODIFY_LEAF
		index->lock     S (not changed)

	BTR_MODIFY_TREE
		index->lock     X -> SX

  row_undo_mod_del_mark_or_remove_sec_low()
	wider index->lock for temp table with fast index creation
	BTR_MODIFY_LEAF
		S (not changed)

	BTR_MODIFY_TREE
		X -> SX (because BTR_MODIFY_TREE)

  row_undo_mod_del_unmark_sec_and_undo_update()
	wider index->lock for temp table with fast index creation
		BTR_MODIFY_LEAF
			S (not changed)

		BTR_MODIFY_TREE
			X -> SX (because BTR_MODIFY_TREE)

  row_undo_mod_clust()
	wider index->lock for if (online = dict_index_is_online_ddl(index))
		S (not changed)  (BTR_MODIFY_LEAF)

  row_ins_clust_index_entry_low()
	wider index->lock for if (mode == BTR_MODIFY_LEAF &&
dict_index_is_online_ddl(index))
		S (not changed)

  row_upd_clust_step()
	wider index->lock for if (dict_index_is_online_ddl(index))
		S (not changed)  (BTR_MODIFY_LEAF)

  row_upd_sec_index_entry()
	wider index->lock for temp table with fast index creation
		S (not changed)  (BTR_MODIFY_LEAF)


[3] other changes for block->lock
  write IO           S -> SX
	to block the other SX-lock of updating not-user data in the page

  fsp list accesses  X -> SX
	updating not-user records can be relaxed to SX

  calling fut_get_ptr(RW_S_LATCH -> RW_SX_LATCH)