WL#10314: InnoDB: Lock-sys optimization: sharded lock_sys mutex

Affects: Server-8.0   —   Status: Complete

The Lock-sys orchestrates access to tables and rows. Each table, and each row,
can be thought of as a resource, and a transaction may request access rights to
a resource. As two transactions operating on a single resource can lead to
problems if the two operations conflict with each other, Lock-sys keeps lists
of already GRANTED lock requests and checks each new request for conflicts; if
a conflict is found, the new request has to start WAITING for its turn.

Lock-sys stores both GRANTED and WAITING lock requests in lists known as queues.
To allow concurrent operations on these queues, we need a mechanism to latch
them in a safe and quick fashion.

In the past, a single latch protected access to all of these queues.
This scaled poorly, and the management of queues became a bottleneck.
In this WL, we introduce a more granular approach to latching.

Functional Requirements
-----------------------

We just want the system to work correctly according to the already existing
tests, in particular those related to data locks, such as those found in the
innodb suite.
No new requirements.

Non-Functional Requirements
---------------------------

NF1. Improve the QPS/TPS for high concurrency (128 - 1024 clients) OLTP RW
Sysbench tests, in particular for the pareto distribution, for example:
```
  $sb11 ./sb_exec/lua/OLTP_RW-trx.lua --db-driver=mysql \
   --table-size=10000000 --tables=8 --threads=$clients --time=$duration \
  --thread-init-timeout=0  --rate=0  --rand-type=pareto  --rand-seed=0  \
  --mysql-user=$user --mysql-password=$pass --mysql-host=$host \
  --mysql-port=$port --mysql-db=$db --events=0 run
```

NF2. Do not deteriorate QPS/TPS for other tests.

Latching protocol changes:
--------------------------

Before this WL we had a single "lock mutex". In the code it was declared as
`lock_sys->mutex`, with id `LATCH_ID_LOCK_SYS`, tracked in Performance Schema as
`mysql_pfs_key_t lock_mutex_key`, and with level `SYNC_LOCK_SYS` which placed it
between `SYNC_TRX_SYS` and `SYNC_LOCK_WAIT_SYS`.

This WL will replace this single "lock mutex" with:
- "global latch"
	- a sharded read-write lock
	- in code declared as `lock_sys->latches.global_latch`,
	- tracked in PFS as `lock_sys_global_rw_lock_key`
	- with id `LATCH_ID_LOCK_SYS_GLOBAL`,
	- and sync level `SYNC_LOCK_SYS_GLOBAL`, which is between `SYNC_TRX_SYS` and
	  `SYNC_LOCK_SYS_SHARDED`
	- while conceptually it is a single rw-lock, it is implemented using
	  Sharded_rw_lock consisting of 64 rw_lock_t instances
- "table shard latches"
	- an array of 512 mutexes, one for each "table shard"
	- in code declared as `lock_sys->latches.table_shards.mutexes`
	- tracked in PFS as `lock_sys_table_mutex_key`
	- with id `LATCH_ID_LOCK_SYS_TABLE`
	- and sync level `SYNC_LOCK_SYS_SHARDED`, which is between
	  `SYNC_LOCK_SYS_GLOBAL` and `SYNC_LOCK_WAIT_SYS`
- "page shard latches"
	- an array of 512 mutexes, one for each "page shard"
	- in code declared as `lock_sys->latches.page_shards.mutexes`
	- tracked in PFS as `lock_sys_page_mutex_key`
	- with id `LATCH_ID_LOCK_SYS_PAGE`
	- and sync level `SYNC_LOCK_SYS_SHARDED`, which is between
	  `SYNC_LOCK_SYS_GLOBAL` and `SYNC_LOCK_WAIT_SYS`
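The latch layout above can be sketched as follows. This is a hedged
illustration using standard C++ primitives: the real code uses InnoDB's own
rw_lock_t and ib_mutex_t types, the global latch is itself sharded (see
Sharded_rw_lock below), and the hash function here is made up.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <shared_mutex>

// Illustrative constants matching the WL: one global rw-latch, 512 table
// shards and 512 page shards.
constexpr std::size_t SHARDS = 512;

struct Latches {
  std::shared_mutex global_latch;  // conceptually a single rw-lock
  std::array<std::mutex, SHARDS> table_shard_mutexes;
  std::array<std::mutex, SHARDS> page_shard_mutexes;
};

// A page's lock queue is assigned to a shard by hashing its location;
// the multiplier below is just an illustrative hash, not InnoDB's.
std::size_t page_shard_id(std::uint64_t space_id, std::uint64_t page_no) {
  return static_cast<std::size_t>((space_id * 2654435761ULL + page_no) %
                                  SHARDS);
}
```

A thread operating on one queue s-latches the global_latch and then the single
shard mutex; a "stop the world" operation x-latches the global_latch alone.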

Performance improvement:
------------------------
From the user perspective, there should be no important functional difference.
Hopefully, users will just notice a higher number of queries per second, in
particular for loads similar to OLTP RW with pareto distribution for
128, 256, 512, and 1024 clients - tests like these put a lot of pressure on the
Lock-sys as threads concurrently try to create and remove data locks in lock
queues, and these queues were protected by a single latch in the past.

Visibility in performance_schema:
---------------------------------
Other than that, a really curious user might notice that performance_schema
and information_schema no longer report the old, removed, mutex
(wait/synch/mutex/innodb/lock_mutex), but instead show:

- The 64 shards of the "global latch".
```
mysql> select name,count(1) from rwlock_instances where name like
'%innodb/lock_sys%' group by 1;
+--------------------------------------------------+----------+
| name                                             | count(1) |
+--------------------------------------------------+----------+
| wait/synch/sxlock/innodb/lock_sys_global_rw_lock |       64 |
+--------------------------------------------------+----------+
1 row in set (0.01 sec)
```

- The 512 "table shard latches" and 512 "page shard latches":
```
mysql> select name,count(1) from mutex_instances where name like
'%innodb/lock_sys%' group by 1;
+----------------------------------------------+----------+
| name                                         | count(1) |
+----------------------------------------------+----------+
| wait/synch/mutex/innodb/lock_sys_table_mutex |      512 |
| wait/synch/mutex/innodb/lock_sys_page_mutex  |      512 |
+----------------------------------------------+----------+
2 rows in set (0.02 sec)
```
- All of them are visible in setup_instruments,
events_waits_summary_global_by_event_name and other performance_schema tables:
```
mysql> SELECT name FROM performance_schema.setup_instruments WHERE NAME LIKE
'%/innodb/lock_sys%';
+--------------------------------------------------+
| name                                             |
+--------------------------------------------------+
| wait/synch/mutex/innodb/lock_sys_page_mutex      |
| wait/synch/mutex/innodb/lock_sys_table_mutex     |
| wait/synch/sxlock/innodb/lock_sys_global_rw_lock |
+--------------------------------------------------+
3 rows in set (0.01 sec)
mysql> SELECT event_name FROM events_waits_summary_global_by_event_name WHERE
 event_name LIKE '%innodb/lock_sys%'  LIMIT 10;
+--------------------------------------------------+
| event_name                                       |
+--------------------------------------------------+
| wait/synch/mutex/innodb/lock_sys_page_mutex      |
| wait/synch/mutex/innodb/lock_sys_table_mutex     |
| wait/synch/sxlock/innodb/lock_sys_global_rw_lock |
+--------------------------------------------------+
3 rows in set (0.02 sec)
```
Architecture:
=============

The general idea is to take advantage of the fact that most operations involve
one or two Lock-sys queues and are independent of any operations on other
queues.
In the extreme, one could imagine protecting each queue with a separate latch.
To avoid having too many latch objects, and having to create and remove them on
demand, we use a more conservative approach:
the queues are grouped into a fixed number of shards, and each shard is
protected by its own mutex.

However, there are several rare events in which we need to "stop the world" -
latch all queues, to prevent any activity inside lock-sys.
One way to accomplish this would be to simply latch all the shards one by one,
but it turns out to be way too slow in debug runs, where such "stop the world"
events are very frequent due to lock_sys validation.

To allow for efficient latching of everything, we'll introduce a global_latch,
which is a read-write latch.
Most of the time, we operate on one or two shards, in which case it is
sufficient to s-latch the global_latch and then latch the shard's mutex.
For the "stop the world" operations, we x-latch the global_latch, which prevents
any other thread from latching any shard.
This hierarchical approach is similar conceptually to how we handle locking
of tables and rows:
- x-latching global_latch "is like a" table X lock
- s-latching global_latch "is like a" table IX lock
- latching the shard latch "is like a" row X lock

However, it turned out that on the ARM architecture the default implementation
of our read-write latch (rw_lock_t) is too slow, because increments and
decrements of the number of s-latchers are implemented as a
read-update-try-to-write loop, which means multiple threads try to modify the
same cache line, disrupting each other.
Therefore, we use a sharded version of read-write latch (Sharded_rw_lock), which
internally uses multiple instances of rw_lock_t, spreading the load over several
cache lines. Note that this sharding is a technical internal detail of the
global_latch, which for all other purposes can be treated as a single entity.
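A minimal sketch of the Sharded_rw_lock idea follows; the class name, the
shard count of 64, and the thread-id-based shard selection are illustrative
stand-ins for the real implementation in InnoDB's sync layer.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <shared_mutex>
#include <thread>

class Sharded_rw_lock_sketch {
 public:
  static constexpr std::size_t N = 64;

  // Each s-latcher picks one underlying rw-lock (here: hashed by thread id),
  // so the reader counters live on different cache lines.
  std::size_t s_lock() {
    const std::size_t shard =
        std::hash<std::thread::id>{}(std::this_thread::get_id()) % N;
    shards_[shard].lock_shared();
    return shard;  // the caller passes this back to s_unlock()
  }
  void s_unlock(std::size_t shard) { shards_[shard].unlock_shared(); }

  // An x-latcher must block s-latchers on every shard.
  void x_lock() {
    for (auto &s : shards_) s.lock();
  }
  void x_unlock() {
    for (auto &s : shards_) s.unlock();
  }

 private:
  std::array<std::shared_mutex, N> shards_;
};
```

x-latching is therefore more expensive than for a plain rw-lock, which is
acceptable here because exclusive global latching is rare.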

This is how it looks conceptually:
```
  [                           global latch                                ]
                                  |
                                  v
  [table shard 1] ... [table shard 512] [page shard 1] ... [page shard 512]

```

So, for example, accessing the two queues for two records involves the
following steps:
1. s-latch the global_latch
2. identify the 2 pages to which the records belong
3. identify the 2 lock_sys hash buckets which contain the queues for the given pages
4. identify the 2 shard ids which contain these two buckets
5. latch mutexes for the two shards in the order of their addresses
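The steps above can be sketched as follows; all names, the bucket count, and
the hash are illustrative stand-ins for the real lock_sys hash tables.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>

constexpr std::size_t N_BUCKETS = 4096;  // illustrative hash table size
constexpr std::size_t N_SHARDS = 512;
std::array<std::mutex, N_SHARDS> page_shard_mutexes;

// Step 3: page -> hash bucket (the multiplier is a made-up hash).
std::size_t bucket_for_page(std::uint64_t space_id, std::uint64_t page_no) {
  return static_cast<std::size_t>((space_id * 2654435761ULL + page_no) %
                                  N_BUCKETS);
}
// Step 4: bucket -> shard id.
std::size_t shard_for_bucket(std::size_t bucket) { return bucket % N_SHARDS; }

// Step 5: latch the two shard mutexes in the order of their addresses, so two
// threads latching the same pair can never deadlock on them.
void latch_two_pages(std::uint64_t space_a, std::uint64_t page_a,
                     std::uint64_t space_b, std::uint64_t page_b) {
  std::mutex *ma =
      &page_shard_mutexes[shard_for_bucket(bucket_for_page(space_a, page_a))];
  std::mutex *mb =
      &page_shard_mutexes[shard_for_bucket(bucket_for_page(space_b, page_b))];
  if (ma == mb) {
    ma->lock();  // both queues fell into the same shard: one latch suffices
  } else if (ma < mb) {
    ma->lock();
    mb->lock();
  } else {
    mb->lock();
    ma->lock();
  }
}
```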

All of the steps above (except 2, as we usually know the page already) are
accomplished with the help of a single line:

    locksys::Shard_latches_guard guard{*block_a, *block_b};

And to "stop the world" one can simply x-latch the global latch by using:

    locksys::Exclusive_global_latch_guard guard{};

The underlying latches class does not expose many public functions; the
intention is to use friend guard classes instead, like the Shard_latches_guard
demonstrated above.

Details:
========

Latch guards:
-------------

We will introduce:
- `Shard_latch_guard`
	- class which s-latches the global_latch and latches the shard mutex for its
    lifetime
- `Shard_latches_guard`
	- class which s-latches the global_latch and latches two shard mutexes in
    correct order for its lifetime
- `Exclusive_global_latch_guard`
	- class which x-latches the global_latch for its lifetime

While the above classes handle most use cases, and Exclusive_global_latch_guard
is used only for rare events such as deadlock resolution, there are some
special cases in our code which require more fine-grained tools:
- `Shared_global_latch_guard`
	- this class only s-latches the global_latch, but none of the shards. This
    allows shards to be latched individually later, when iterating over the
    list of a transaction's locks in an efficient way during the trx lock
    release phase
- `Naked_shard_latch_guard`
	- this class only latches a shard, not the global_latch. It is used in
    combination with Shared_global_latch_guard
- `Try_exclusive_global_latch`
	- class which tries to x-latch global_latch for its lifetime
	- its only use case is in srv_printf_innodb_monitor(), which tries to avoid
    interfering with the workload when reporting INNODB MONITOR OUTPUT
- `Unsafe_global_latch_manipulator`
	- allows one to manually latch and unlatch the exclusive global_latch on
    demand, in a non-structured way, which is unfortunately needed in the
    current implementation of the
      srv_printf_innodb_monitor() =>
      srv_printf_locks_and_transactions() =>
      lock_print_info_all_transactions() =>
      lock_trx_print_locks() =>
      lock_rec_fetch_page()
    legacy code-path

The latter two might hopefully be removed in the future, simplifying the
design, if srv_printf_innodb_monitor() gets redesigned first.

Changes around trx->mutex:
--------------------------
(The following explanation will also be added as a comment to the source.)

Interaction between Lock-sys and trx->mutex-es is rather complicated.
In particular we allow a thread performing Lock-sys operations to request
another trx->mutex even though it already holds one for a different trx.
Therefore one has to prove that it is impossible to form a deadlock cycle in
the imaginary wait-for-graph in which edges go from a thread trying to obtain
a trx->mutex to the thread which currently holds it.

In the past it was simple, because Lock-sys was protected by a global mutex,
which meant that there was at most one thread which could try to possess more
than one trx->mutex - one cannot form a cycle in a graph in which only one
node has both incoming and outgoing edges.

With this WL it is much harder to prove, because we will shard the Lock-sys
mutex, and now multiple threads can perform Lock-sys operations in parallel, as
long as they happen in different shards.

Here's my attempt at the proof.

Assumption 1.
  If a thread attempts to acquire more than one trx->mutex, then it either has
  the exclusive global latch, or it attempts to acquire exactly two of them,
  and just before calling mutex_enter for the second time it saw
  trx1->lock.wait_lock==nullptr, trx2->lock.wait_lock!=nullptr, and it held the
  latch for the shard containing trx2->lock.wait_lock.

@see asserts in trx_before_mutex_enter

Assumption 2.
  The Lock-sys latches are taken before any trx->mutex.

@see asserts in sync0debug.cc

Assumption 3.
  Changing trx->lock.wait_lock from NULL to non-NULL requires latching
  trx->mutex and the shard containing new wait_lock value.

@see asserts in lock_set_lock_and_trx_wait()

Assumption 4.
  Changing trx->lock.wait_lock from non-NULL to NULL requires latching the shard
  containing old wait_lock value.

@see asserts in lock_reset_lock_and_trx_wait()

Assumption 5.
  If a thread is latching two Lock-sys shards then it's acquiring and releasing
  both shards together (that is, without interleaving it with trx->mutex
  operations).

@see Shard_latches_guard

Theorem 1.
  If the above Assumptions hold, then it's impossible for trx_mutex_enter() call
  to deadlock.

By proving the theorem, and observing that the assertions hold over multiple
runs of the test suite on a debug build, we gain more and more confidence that
trx_mutex_enter() calls cannot deadlock.

The intuitive, albeit imprecise, version of the proof is that by Assumption 1
each edge of the deadlock cycle leads from a trx with NULL trx->lock.wait_lock
to one with non-NULL wait_lock, so no two such edges can be chained, and a
cycle of two or more edges is impossible.

The difficulty lies in that wait_lock is a field which can be modified over time
from several threads, so care must be taken to clarify at which moment in time
we make our observations and from whose perspective.

We will now formally prove Theorem 1.
Assume otherwise, that is, that we are in a thread which has just started a
call to mutex_enter(trx_a->mutex) and caused a deadlock.

Fact 0. There is no thread which possesses the exclusive Lock-sys latch, since
        to form a deadlock one needs at least two threads inside Lock-sys.
Fact 1. Each thread participating in the deadlock holds one trx mutex and waits
        for the second one it tried to acquire.
Fact 2. Thus each thread participating in the deadlock has gone through the
        "else" branch inside trx_before_mutex_enter(), so it satisfies
        Assumption 1.
Fact 3. Our thread owns_lock_shard(trx_a->lock.wait_lock).
Fact 4. Another thread has latched trx_a->mutex as the first of its two
        latches.

Consider the situation from the point of view of this other thread, which is now
in the deadlock waiting for mutex_enter(trx_b->mutex) for some trx_b!=trx_a.
By Fact 2 and Assumption 1, it had to take the "else" branch on the way there,
and thus it saw trx_a->lock.wait_lock == nullptr at some moment in time.
This observation was either before or after our observation that
trx_a->lock.wait_lock != nullptr (again Fact 2 and Assumption 1).

If our thread observed the non-NULL value first, then a change from non-NULL
to NULL has happened in between, which by Assumption 4 requires a shard latch,
which only our thread possesses - and our thread could not have manipulated
the wait_lock, as it is stuck in the deadlock.

If the other thread observed NULL first, then the value has changed to
non-NULL, which requires trx_a->mutex according to Assumption 3, yet this
mutex was held the entire time by the other thread (which observed the NULL
just before it deadlocked), so it could not have changed it either.

So, there is no way the value of wait_lock has changed from NULL to non-NULL
or vice versa, yet one thread sees NULL and the other non-NULL - this
contradiction ends the proof.

Changes around dict_table_t::autoinc_trx:
-----------------------------------------
This field is used to store a pointer to the trx which currently holds a
GRANTED AUTO_INC lock. There can be at most one such transaction, as these
locks are mutually exclusive. This field is "peeked at" by
row_lock_table_autoinc_for_mysql() to see if the current transaction already
has a GRANTED AUTO_INC lock, in which case it can follow a fast path. Such
checks are done by comparing the field with the trx's own pointer, without any
latch - this requires changing the type to atomic<>, but is otherwise correct,
as the boolean result of such a comparison is guaranteed to be meaningful,
given that we only change the value of this field during granting and
releasing, and releasing cannot happen concurrently with the thread running
the transaction.
However, some comments around this code, assertions, and the type of the field
have to be modified, to better explain why and how all this works.
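The "peek" can be sketched like this; the types and names are simplified
stand-ins for dict_table_t and trx_t.

```cpp
#include <atomic>

struct Trx {};

struct Dict_table {
  // nullptr when no transaction holds a GRANTED AUTO_INC lock; written only
  // on grant and release, which never race with the owning thread itself.
  std::atomic<Trx *> autoinc_trx{nullptr};
};

// Latch-free fast-path check done by the transaction's own thread: only the
// boolean result of the comparison is used, never the pointee.
bool already_holds_autoinc(const Dict_table &table, const Trx *trx) {
  return table.autoinc_trx.load(std::memory_order_acquire) == trx;
}
```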

Changes around dict_table_t::n_rec_locks:
-----------------------------------------
This field counts how many record locks (granted or waiting) are currently
associated with the given table. As such locks can now be created and destroyed
in parallel, as long as they are in different shards, this field needs to
become atomic<>, and readers interested in a meaningful value should acquire
the exclusive global latch - otherwise the value can get modified before we act
on the result.
Here again we need to update some comments.
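A sketch of the counter change, with simplified types; in the real code only
readers holding the exclusive global latch may treat the value as stable.

```cpp
#include <atomic>
#include <cstdint>

struct Dict_table {
  // Incremented/decremented from different shards without any common latch.
  std::atomic<std::uint64_t> n_rec_locks{0};
};

void on_rec_lock_created(Dict_table &t) {
  t.n_rec_locks.fetch_add(1, std::memory_order_relaxed);
}
void on_rec_lock_removed(Dict_table &t) {
  t.n_rec_locks.fetch_sub(1, std::memory_order_relaxed);
}
```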

Changes around hit list:
------------------------
For High Priority transactions we have a mechanism which tries to abort
conflicting transactions of lower priority to avoid waiting on record locks.
The procedure of identifying and aborting conflicting transactions will
require the exclusive global latch, as (among other reasons) it may require
accessing unrelated Lock-sys queues in which the blocking transactions are
themselves waiting.
To avoid taking this exclusive global latch in the probable case that our HP
transaction is not in the waiting state at all, we need a reliable,
thread-safe way of checking if the transaction is waiting.
It is TBD if this can be done simply by using
`hp_trx->lock.que_state == TRX_QUE_LOCK_WAIT`
- some places in the code seem to modify this field without any latch, which
looks unsafe.
An alternative, correct approach would be to use
`hp_trx->lock.blocking_trx.load() != nullptr`
instead, which requires only a minor change in comments, to clarify that it is
important to clear this field when the wait finishes (this is already being
done in trunk, but the comments leave some doubt).
Fixing que_state handling is out of scope of this WL.

Changes around lock_release():
------------------------------
Releasing all locks of a transaction requires iterating over its locks and,
for each of them, performing some actions in the respective lock queue.
A simple, but inefficient, way of doing this is to acquire the exclusive
global lock-sys latch.
It is much better for parallelism to instead acquire just a shared global
latch, and then latch one by one only the shard containing the particular lock
as we iterate. The difficulty with this approach is that the latching order
rules require acquiring Lock-sys latches BEFORE trx latches, and trx->mutex
protects the list of the transaction's locks. In particular it protects the
list from concurrent B-tree modifications causing relocation of locks between
pages (and thus shards).
So, there is a chicken-and-egg problem: we want to know which shard to latch,
for which we need to know the next lock, but to iterate over the list we need
trx->mutex, which we can only take AFTER latching the shard.
To overcome this, we first latch trx->mutex, note down the shard id of the
last lock in the list, release trx->mutex, latch the particular shard, latch
trx->mutex again, make sure that the lock is still the tail of the list, and
only then proceed. This might seem complicated, but is actually much faster
than the "stop the world" approach.
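The re-latching dance can be sketched like this; it is heavily simplified: a
vector stands in for the transaction's lock list, plain mutexes stand in for
lock_sys shard latches, and the shared global latch is omitted.

```cpp
#include <array>
#include <cstddef>
#include <mutex>
#include <vector>

struct Lock {
  std::size_t shard_id;  // shard containing this lock's queue
};

struct Trx {
  std::mutex mutex;           // protects `locks`
  std::vector<Lock *> locks;  // the transaction's lock list; tail = back()
};

std::array<std::mutex, 512> shard_mutexes;

// Release locks from the tail, one shard latch at a time.
void release_all_locks(Trx &trx) {
  for (;;) {
    trx.mutex.lock();
    if (trx.locks.empty()) {
      trx.mutex.unlock();
      return;
    }
    std::size_t shard = trx.locks.back()->shard_id;  // note shard of the tail
    trx.mutex.unlock();  // obey latch order: shard latch must come first

    shard_mutexes[shard].lock();
    trx.mutex.lock();
    // While neither latch was held, B-tree reorganization may have moved the
    // tail lock to another shard; if the tail no longer matches, just retry.
    if (!trx.locks.empty() && trx.locks.back()->shard_id == shard) {
      trx.locks.pop_back();  // stand-in for the real per-queue release work
    }
    trx.mutex.unlock();
    shard_mutexes[shard].unlock();
  }
}
```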

Changes around lock_trx_release_read_locks():
---------------------------------------------
The lock_trx_release_read_locks() function is used mostly by group
replication appliers to release read gap locks. In testing it turned out to be
a bottleneck if the exclusive global latch is used to iterate over the
transaction's locks.
Similarly to `lock_release()`, we should instead acquire a shared global
latch, and latch shards one by one as we iterate.
The problem with this is that other threads can modify the list of our locks
concurrently (for example because of implicit-to-explicit conversion, or
B-tree reorganization), and we cannot simply compare the current lock to the
tail, because we are not removing all locks, just a subset of them, so the
trick of operating only at the tail of the list is insufficient.
To notice such situations (and restart the iteration) we will introduce
`uint64_t trx->lock.trx_locks_version`, incremented each time a lock is added
to or removed from the trx lock list.
After several failed restarts we can switch back to the old
`lock_trx_release_read_locks_in_x_mode()`.
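A sketch of the version-counter restart follows; the types are illustrative,
the real field is trx->lock.trx_locks_version and the real fallback is
lock_trx_release_read_locks_in_x_mode().

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

struct Trx_locks {
  std::mutex trx_mutex;
  std::uint64_t trx_locks_version{0};
  std::vector<int> locks;  // stand-in for the real lock list

  void add_lock(int l) {
    std::lock_guard<std::mutex> g(trx_mutex);
    locks.push_back(l);
    ++trx_locks_version;  // any mutation invalidates in-flight iterations
  }
};

// Returns true if one pass over the list completed without the list changing
// under us; the caller falls back to the exclusive-latch path after
// max_restarts failed attempts.
bool try_iterate(Trx_locks &t, int max_restarts) {
  for (int attempt = 0; attempt <= max_restarts; ++attempt) {
    t.trx_mutex.lock();
    std::uint64_t seen = t.trx_locks_version;
    t.trx_mutex.unlock();
    // ... here we would latch the next lock's shard and process it ...
    t.trx_mutex.lock();
    bool unchanged = (t.trx_locks_version == seen);
    t.trx_mutex.unlock();
    if (unchanged) return true;
  }
  return false;  // caller switches to the exclusive-latch fallback
}
```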

Changes:
========
- separate the whole latching logic into a dedicated class locksys::Latches
  and document the design extensively in its header
- introduce the latch guards mentioned above
- all new functions will be in locksys namespace
- all usages of lock_mutex_enter()/lock_mutex_exit() will be replaced with
  appropriate latch guards, preferably locksys::Shard_latch_guard
- table->n_rec_locks must become atomic, as it can now be
  incremented/decremented in parallel during creation/destruction of record
  locks for given table
- dict/mem.cc doesn't really need to include lock0lock.h when compiled for
  UNIV_LIBRARY
- remove lock_mutex from PSI
- add lock_sys_global_rw_lock to PSI
- add lock_sys_page_mutex to PSI
- add lock_sys_table_mutex to PSI
- all places where we use exclusive global latch will be documented to specify
  the remaining reasons we have to resort to such strong synchronization
- the table->autoinc_trx field should be atomic as it is "peeked" without any
  latch, and confusing/wrong comments and asserts around it have to be cleaned
  up, to clarify why it is correct
- `lock_rec_expl_exist_on_page()` should return a `bool` instead of potentially
  dangling pointer to a `lock_t`
- `lock_print_info_summary` and logic inside `srv_printf_innodb_monitor()` in
  general needs at least some small refactoring so that the latch guards can be
  used
- `lock_mutex_own()` debug predicate would have to be replaced with more
  specific `owns_exclusive_global_latch()`, `owns_shared_global_latch()`,
  `owns_page_shard(page)`, `owns_table_shard(table)`,...
- `bool Sharded_rw_lock::try_x_lock` needs to be implemented
- the control flow of `lock_rec_insert_check_and_lock()` (and its copy-paste
  counterpart `lock_prdt_insert_check_and_lock`) can be simplified by removing
  code duplication, before we can use lock guards
- the code around `lock_rec_queue_validate()` could be simplified by removing
  code duplication, and using more structured latching
- update sync0debug so it has proper rules for latch order