WL#10314: InnoDB: Lock-sys optimization: sharded lock_sys mutex
Affects: Server-8.0
Status: Complete
The Lock-sys orchestrates access to tables and rows. Each table, and each row, can be thought of as a resource, and a transaction may request access rights to a resource. As two transactions operating on a single resource can lead to problems if the two operations conflict with each other, Lock-sys remembers lists of already GRANTED lock requests and checks new requests for conflicts; if a conflict exists, the new request has to start WAITING for its turn. Lock-sys stores both GRANTED and WAITING lock requests in lists known as queues. To allow concurrent operations on these queues, we need a mechanism to latch these queues in a safe and quick fashion. In the past a single latch protected access to all of these queues. This scaled poorly, and the management of queues became a bottleneck. In this WL, we introduce a more granular approach to latching.
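For illustration only, here is a minimal sketch of the queue concept described above; the types and the single conflict rule are simplified stand-ins, not the actual InnoDB structures (which distinguish many lock modes and keep the queues in hash tables keyed by page):
```
#include <vector>

// Simplified lock modes and states; real InnoDB has many more modes.
enum class Mode { SHARED, EXCLUSIVE };
enum class State { GRANTED, WAITING };

struct Request {
  int trx_id;
  Mode mode;
  State state{State::WAITING};
};

// One queue per resource (e.g. per record): GRANTED and WAITING requests are
// kept together, in arrival order.
struct Queue {
  std::vector<Request> requests;

  // A new request is GRANTED only if it conflicts with no already GRANTED one;
  // otherwise it joins the queue in the WAITING state.
  void enqueue(Request req) {
    req.state = State::GRANTED;
    for (const auto &other : requests) {
      const bool conflict =
          other.mode == Mode::EXCLUSIVE || req.mode == Mode::EXCLUSIVE;
      if (other.state == State::GRANTED && other.trx_id != req.trx_id &&
          conflict) {
        req.state = State::WAITING;
        break;
      }
    }
    requests.push_back(req);
  }
};
```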
Functional Requirements
-----------------------
We just want the system to work correctly according to already existing tests, in particular those related to data locks, such as those found in the innodb suite. No new requirements.

Non-Functional Requirements
---------------------------
NF1. Improve the QPS/TPS for high concurrency (128 - 1024 clients) OLTP RW Sysbench tests, in particular for the pareto distribution. I mean something like:
```
$sb11 ./sb_exec/lua/OLTP_RW-trx.lua --db-driver=mysql \
  --table-size=10000000 --tables=8 --threads=$clients --time=$duration \
  --thread-init-timeout=0 --rate=0 --rand-type=pareto --rand-seed=0 \
  --mysql-user=$user --mysql-password=$pass --mysql-host=$host \
  --mysql-port=$port --mysql-db=$db --events=0 run
```
NF2. Do not deteriorate QPS/TPS for other tests.
Latching protocol changes:
--------------------------
Before this WL we had a single "lock mutex". In the code it was declared as `lock_sys->mutex`, with id `LATCH_ID_LOCK_SYS`, tracked in Performance Schema as `mysql_pfs_key_t lock_mutex_key`, and with level `SYNC_LOCK_SYS` which placed it between `SYNC_TRX_SYS` and `SYNC_LOCK_WAIT_SYS`.
This WL will replace this single "lock mutex" with:
- "global latch" - a sharded read-write lock
  - in code declared as `lock_sys->latches.global_latch`
  - tracked in PFS as `lock_sys_global_rw_lock_key`
  - with id `LATCH_ID_LOCK_SYS_GLOBAL`
  - and sync level `SYNC_LOCK_SYS_GLOBAL`, which is between `SYNC_TRX_SYS` and `SYNC_LOCK_SYS_SHARDED`
  - while conceptually it is a single rw-lock, it is implemented using Sharded_rw_lock consisting of 64 rw_lock_t instances
- "table shard latches" - an array of 512 mutexes, one for each "table shard"
  - in code declared as `lock_sys->latches.table_shards.mutexes`
  - tracked in PFS as `lock_sys_table_mutex_key`
  - with id `LATCH_ID_LOCK_SYS_TABLE`
  - and sync level `SYNC_LOCK_SYS_SHARDED`, which is between `SYNC_LOCK_SYS_GLOBAL` and `SYNC_LOCK_WAIT_SYS`
- "page shard latches" - an array of 512 mutexes, one for each "page shard"
  - in code declared as `lock_sys->latches.page_shards.mutexes`
  - tracked in PFS as `lock_sys_page_mutex_key`
  - with id `LATCH_ID_LOCK_SYS_PAGE`
  - and sync level `SYNC_LOCK_SYS_SHARDED`, which is between `SYNC_LOCK_SYS_GLOBAL` and `SYNC_LOCK_WAIT_SYS`

Performance improvement:
------------------------
From the user perspective, there should be no important functional difference. Hopefully, they will just notice a higher number of queries per second, in particular for loads similar to OLTP RW with pareto distribution for 128, 256, 512, 1024 clients - tests like these cause a lot of pressure on the Lock-sys as threads concurrently try to create and remove data locks from lock queues, and these queues were protected by a single latch in the past.

Visibility in performance_schema:
---------------------------------
Other than that, a really curious user might notice that performance_schema and information_schema no longer report the old, removed, mutex (wait/synch/mutex/innodb/lock_mutex), but instead show:
- The 64 shards of the "global latch":
```
mysql> select name,count(1) from rwlock_instances where name like '%innodb/lock_sys%' group by 1;
+--------------------------------------------------+----------+
| name                                              | count(1) |
+--------------------------------------------------+----------+
| wait/synch/sxlock/innodb/lock_sys_global_rw_lock  |       64 |
+--------------------------------------------------+----------+
1 row in set (0.01 sec)
```
- The 512 "table shard latches" and 512 "page shard latches":
```
mysql> select name,count(1) from mutex_instances where name like '%innodb/lock_sys%' group by 1;
+----------------------------------------------+----------+
| name                                         | count(1) |
+----------------------------------------------+----------+
| wait/synch/mutex/innodb/lock_sys_table_mutex |      512 |
| wait/synch/mutex/innodb/lock_sys_page_mutex  |      512 |
+----------------------------------------------+----------+
2 rows in set (0.02 sec)
```
- All of them visible in setup_instruments, events_waits_summary_global_by_event_name and other performance_schema tables:
```
mysql> SELECT name FROM performance_schema.setup_instruments WHERE NAME LIKE '%/innodb/lock_sys%';
+--------------------------------------------------+
| name                                             |
+--------------------------------------------------+
| wait/synch/mutex/innodb/lock_sys_page_mutex      |
| wait/synch/mutex/innodb/lock_sys_table_mutex     |
| wait/synch/sxlock/innodb/lock_sys_global_rw_lock |
+--------------------------------------------------+
3 rows in set (0.01 sec)

mysql> SELECT event_name FROM events_waits_summary_global_by_event_name WHERE event_name LIKE '%innodb/lock_sys%' LIMIT 10;
+--------------------------------------------------+
| event_name                                       |
+--------------------------------------------------+
| wait/synch/mutex/innodb/lock_sys_page_mutex      |
| wait/synch/mutex/innodb/lock_sys_table_mutex     |
| wait/synch/sxlock/innodb/lock_sys_global_rw_lock |
+--------------------------------------------------+
3 rows in set (0.02 sec)
```
Architecture:
=============
The general idea is to take advantage of the fact that most operations involve one or two Lock-sys queues, and are independent of any other operations on other queues.
In the extreme, one could imagine protecting each queue with a separate latch. To avoid having too many latch objects, and having to create and remove them on demand, we will use a more conservative approach. The queues will be grouped into a fixed number of shards, and each shard is protected by its own mutex.

However, there are several rare events in which we need to "stop the world" - latch all queues, to prevent any activity inside Lock-sys. One way to accomplish this would be to simply latch all the shards one by one, but it turns out to be way too slow in debug runs, where such "stop the world" events are very frequent due to lock_sys validation.

To allow for efficient latching of everything, we will introduce a global_latch, which is a read-write latch. Most of the time, we operate on one or two shards, in which case it is sufficient to s-latch the global_latch and then latch the shard's mutex. For the "stop the world" operations, we x-latch the global_latch, which prevents any other thread from latching any shard.

This hierarchical approach is conceptually similar to how we handle locking of tables and rows:
- x-latching the global_latch "is like a" table X lock
- s-latching the global_latch "is like a" table IX lock
- latching a shard's mutex "is like a" row X lock

However, it turned out that on the ARM architecture the default implementation of a read-write latch (rw_lock_t) is too slow, because increments and decrements of the number of s-latchers are implemented as a read-update-try-to-write loop, which means multiple threads try to modify the same cache line, disrupting each other. Therefore, we use a sharded version of the read-write latch (Sharded_rw_lock), which internally uses multiple instances of rw_lock_t, spreading the load over several cache lines. Note that this sharding is a technical internal detail of the global_latch, which for all other purposes can be treated as a single entity.

This is how it looks conceptually:
```
                    [ global latch ]
                           |
                           v
[table shard 1] ... [table shard 512] [page shard 1] ... [page shard 512]
```
So, for example, accessing two queues for two records involves the following steps:
1. s-latch the global_latch
2. identify the 2 pages to which the records belong
3. identify the 2 lock_sys hash buckets which contain the queues for the given pages
4. identify the 2 shard ids which contain these two buckets
5. latch the mutexes for the two shards in the order of their addresses

All of the steps above (except 2, as we usually know the page already) are accomplished with the help of a single line:

    locksys::Shard_latches_guard guard{*block_a, *block_b};

And to "stop the world" one can simply x-latch the global latch by using:

    locksys::Exclusive_global_latch_guard guard{};

The locksys::Latches class does not expose too many public functions, as the intention is rather to use friend guard classes, like the Shard_latches_guard demonstrated above.
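To make the hierarchy above concrete, here is a minimal sketch in portable C++, under the assumption that std::shared_mutex/std::mutex stand in for the real Sharded_rw_lock and shard mutexes; the page_shard_id() hash is hypothetical and only illustrates that a page deterministically maps to one of the 512 shards (the real code derives the shard from the lock_sys hash bucket):
```
#include <algorithm>
#include <cstdint>
#include <mutex>
#include <shared_mutex>

constexpr std::size_t SHARDS = 512;

// Simplified stand-in for lock_sys->latches; the real global latch is a
// Sharded_rw_lock (64 rw_lock_t instances), not a std::shared_mutex.
struct Latches {
  std::shared_mutex global_latch;
  std::mutex page_shard[SHARDS];
  std::mutex table_shard[SHARDS];
};

// Hypothetical page-to-shard mapping: a hash of the page identity modulo the
// number of shards.
std::size_t page_shard_id(uint64_t space_id, uint64_t page_no) {
  return (space_id * 7919 + page_no) % SHARDS;
}

// Roughly what Shard_latches_guard does for two records on two pages:
// s-latch the global latch ("table IX"), then latch the shard mutexes
// ("row X") in a fixed order so two such threads can not deadlock.
void with_two_page_queues_latched(Latches &l, uint64_t space_a, uint64_t page_a,
                                  uint64_t space_b, uint64_t page_b) {
  std::shared_lock<std::shared_mutex> global{l.global_latch};
  std::mutex *a = &l.page_shard[page_shard_id(space_a, page_a)];
  std::mutex *b = &l.page_shard[page_shard_id(space_b, page_b)];
  if (a == b) {
    std::lock_guard<std::mutex> shard{*a};  // both queues live in one shard
    // ... operate on the two queues ...
  } else {
    if (a > b) std::swap(a, b);             // latch in address order
    std::scoped_lock shards{*a, *b};
    // ... operate on the two queues ...
  }
}

// "Stop the world": x-latching the global latch excludes all shard users,
// like Exclusive_global_latch_guard does.
void with_all_queues_latched(Latches &l) {
  std::unique_lock<std::shared_mutex> global{l.global_latch};
  // ... validate / print / resolve deadlocks ...
}
```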
Details:
========

Latch guards:
-------------
We will introduce:
- `Shard_latch_guard` - class which s-latches the global_latch and latches the shard mutex for its lifetime
- `Shard_latches_guard` - class which s-latches the global_latch and latches two shard mutexes in the correct order for its lifetime
- `Exclusive_global_latch_guard` - class which x-latches the global_latch for its lifetime

While the above classes handle most of the use cases, and Exclusive_global_latch_guard is used only for rare events such as deadlock resolution, there are some special cases in our code which require more fine-grained tools:
- `Shared_global_latch_guard` - this class only s-latches the global_latch, but none of the shards. This allows latching shards individually later, when iterating over the list of a transaction's locks in an efficient way during the trx lock release phase
- `Naked_shard_latch_guard` - this class only latches a shard, but not the global_latch. It is used in combination with Shared_global_latch_guard
- `Try_exclusive_global_latch` - class which tries to x-latch the global_latch for its lifetime - its only use case is in srv_printf_innodb_monitor(), which tries to avoid interfering with the workload when reporting INNODB MONITOR OUTPUT
- `Unsafe_global_latch_manipulator` - allows latching and unlatching the exclusive global_latch manually, on demand, in a non-structured way, which is unfortunately needed in the current implementation of the srv_printf_innodb_monitor() => srv_printf_locks_and_transactions() => lock_print_info_all_transactions() => lock_trx_print_locks() => lock_rec_fetch_page() legacy code-path

The latter two might hopefully be removed in the future, simplifying the design, if srv_printf_innodb_monitor() gets redesigned first.

Changes around trx->mutex:
--------------------------
(The following explanation will also be added as a comment to the source.)

Interaction between Lock-sys and trx->mutex-es is rather complicated. In particular we allow a thread performing Lock-sys operations to request another trx->mutex even though it already holds one for a different trx. Therefore one has to prove that it is impossible to form a deadlock cycle in the imaginary wait-for-graph in which edges go from a thread trying to obtain a trx->mutex to the thread which holds it at the moment.

In the past it was simple, because Lock-sys was protected by a global mutex, which meant that there was at most one thread which could try to possess more than one trx->mutex - one can not form a cycle in a graph in which only one node has both incoming and outgoing edges.

With this WL it is much harder to prove, because we will shard the Lock-sys mutex, and now multiple threads can perform Lock-sys operations in parallel, as long as they happen in different shards.

Here's my attempt at the proof.

Assumption 1. If a thread attempts to acquire more than one trx->mutex, then it either has the exclusive global latch, or it attempts to acquire exactly two of them, and just before calling mutex_enter for the second time it saw trx1->lock.wait_lock==nullptr, trx2->lock.wait_lock!=nullptr, and it held the latch for the shard containing trx2->lock.wait_lock. @see asserts in trx_before_mutex_enter

Assumption 2. The Lock-sys latches are taken before any trx->mutex. @see asserts in sync0debug.cc

Assumption 3. Changing trx->lock.wait_lock from NULL to non-NULL requires latching trx->mutex and the shard containing the new wait_lock value. @see asserts in lock_set_lock_and_trx_wait()

Assumption 4.
Changing trx->lock.wait_lock from non-NULL to NULL requires latching the shard containing the old wait_lock value. @see asserts in lock_reset_lock_and_trx_wait()

Assumption 5. If a thread is latching two Lock-sys shards, then it acquires and releases both shards together (that is, without interleaving it with trx->mutex operations). @see Shard_latches_guard

Theorem 1. If the above Assumptions hold, then it is impossible for a trx_mutex_enter() call to deadlock.

By proving the theorem, and observing that the assertions hold for multiple runs of the test suite on a debug build, we gain more and more confidence that trx_mutex_enter() calls can not deadlock.

The intuitive, albeit imprecise, version of the proof is that by Assumption 1 each edge of the deadlock cycle leads from a trx with NULL trx->lock.wait_lock to one with non-NULL wait_lock, which means such a cycle can have only one edge. The difficulty lies in that wait_lock is a field which can be modified over time from several threads, so care must be taken to clarify at which moment in time we make our observations and from whose perspective.

We will now formally prove Theorem 1. Assume otherwise, that is, that we are in a thread which has just started a call to mutex_enter(trx_a->mutex) and caused a deadlock.

Fact 0. There is no thread which possesses the exclusive Lock-sys latch, since to form a deadlock one needs at least two threads inside Lock-sys.
Fact 1. Each thread participating in the deadlock holds one trx mutex and waits for the second one it tried to acquire.
Fact 2. Thus each thread participating in the deadlock had gone through the "else" branch inside trx_before_mutex_enter(), so it verifies Assumption 1.
Fact 3. Our thread owns_lock_shard(trx_a->lock.wait_lock).
Fact 4. Another thread has latched trx_a->mutex as the first of its two latches.

Consider the situation from the point of view of this other thread, which is now in the deadlock waiting for mutex_enter(trx_b->mutex) for some trx_b!=trx_a. By Fact 2 and Assumption 1, it had to take the "else" branch on the way there, and thus it has seen trx_a->lock.wait_lock == nullptr at some moment in time. This observation was either before or after our observation that trx_a->lock.wait_lock != nullptr (again Fact 2 and Assumption 1). If our thread observed the non-NULL value first, then it means a change from non-NULL to NULL has happened, which by Assumption 4 requires a shard latch, which only our thread possesses - and we could not have manipulated the wait_lock, as we are in a deadlock. If the other thread observed NULL first, then it means that the value has changed to non-NULL, which requires trx_a->mutex according to Assumption 3, yet this mutex was held the entire time by the other thread, since it observed the NULL just before the deadlock, so it could not have changed it, either. So, there is no way the value of wait_lock has changed from NULL to non-NULL or vice versa, yet one thread sees NULL and the other non-NULL - this contradiction ends the proof.

Changes around dict_table_t::autoinc_trx:
-----------------------------------------
This field is used to store a pointer to the trx which currently holds a GRANTED AUTO_INC lock. There can be at most one such transaction, as these locks are mutually exclusive. This field is "peeked at" by row_lock_table_autoinc_for_mysql() to see if the current transaction already has the AUTO_INC lock GRANTED, in which case it can follow a fast path.
Such checks are done by comparing the field with our own trx pointer without any latch - this requires changing the type to atomic<>, but is otherwise correct, as the boolean result of such a comparison is guaranteed to be meaningful, given that we only change the value of this field during granting and releasing, and releasing can not happen concurrently with the thread running the transaction. However, some comments around this code, assertions, and the type of the field have to be modified, to better explain why and how all this works.

Changes around dict_table_t::n_rec_locks:
-----------------------------------------
This field counts how many record locks (granted or waiting) are currently associated with the given table. As such locks can now be created and destroyed in parallel as long as they are in different shards, this field needs to become atomic<>, and readers interested in a meaningful value should acquire the exclusive global latch - otherwise the value can get modified before we act on the result. Here again we need to update some comments.

Changes around hit list:
------------------------
For High Priority transactions we have a mechanism which tries to abort conflicting transactions of lower priority to avoid waiting on record locks. The procedure of identifying and aborting conflicting transactions will require the exclusive global latch, as (among other reasons) it may require accessing unrelated Lock-sys queues in which the blocking transactions are waiting themselves. To avoid taking this exclusive global latch in the probable case of our HP transaction not being in the waiting state at all, we need a reliable, thread-safe way of checking if the transaction is waiting. It is TBD if this can simply be done using `hp_trx->lock.que_state == TRX_QUE_LOCK_WAIT` - some places in the code seem to modify this field without any latch, which looks unsafe. An alternative, correct approach would be to use `hp_trx->lock.blocking_trx.load() != nullptr` instead, which requires only a minor change in comments, to clarify that it is important to clear this field when the wait finishes (this is already being done in trunk, but the comments leave some doubt). Fixing que_state handling is out of scope of this WL.

Changes around lock_release():
------------------------------
Releasing all locks of a transaction requires iterating over its locks, and for each of them performing some actions in the respective lock queue. A simple, but inefficient, way of doing it is to acquire the exclusive global lock_sys latch. It is much better for parallelism to instead acquire just a shared global latch, and then, as we iterate, latch one by one only the shard containing the particular lock. The difficulty with this approach is that the latching order rules require acquiring Lock-sys latches BEFORE trx latches, and trx->mutex protects the list of the transaction's locks. In particular it protects it from concurrent B-tree modifications causing relocation of locks between pages (and thus shards). So, there is a chicken-and-egg problem: we want to know which shard to latch, for which we need to know what the next lock is, but to iterate over the list we need trx->mutex, which we can only take AFTER latching the shard. To overcome this, we will first latch trx->mutex, note down the shard id of the last lock in the list, release the trx->mutex, latch the particular shard, latch trx->mutex again, make sure that the lock is still the tail of the list, and only then proceed. This might seem complicated, but is actually much faster than the "stop the world" approach.
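The following sketch only illustrates the latch ordering described above; Trx, Lock, shard_id and the std latches are invented stand-ins (see also the sketch in the Architecture section), not the actual lock_release() code:
```
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <vector>

struct Lock { std::size_t shard_id; /* ... */ };

struct Trx {
  std::mutex mutex;           // stand-in for trx->mutex
  std::vector<Lock *> locks;  // stand-in for the trx's list of locks
};

// Sketch of the lock_release() loop: a shared global latch for the whole
// operation, then per lock: peek at the tail under trx->mutex, drop
// trx->mutex, latch the tail's shard, re-take trx->mutex and verify that the
// tail is still the same lock in the same shard before releasing it.
void release_all_locks(Trx &trx, std::shared_mutex &global_latch,
                       std::mutex (&page_shard)[512]) {
  std::shared_lock<std::shared_mutex> global{global_latch};
  for (;;) {
    trx.mutex.lock();
    if (trx.locks.empty()) {
      trx.mutex.unlock();
      return;                              // all locks released
    }
    Lock *tail = trx.locks.back();
    const std::size_t shard = tail->shard_id;
    trx.mutex.unlock();                    // must not hold it while latching

    // Latching order: Lock-sys shard BEFORE trx->mutex.
    std::lock_guard<std::mutex> shard_guard{page_shard[shard]};
    std::lock_guard<std::mutex> trx_guard{trx.mutex};
    if (trx.locks.empty() || trx.locks.back() != tail ||
        tail->shard_id != shard) {
      continue;  // the list or the lock's shard changed meanwhile - retry
    }
    // remove_from_queue(tail);  // hypothetical: unlink from its lock queue
    trx.locks.pop_back();
  }
}
```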
Changes around lock_trx_release_read_locks():
---------------------------------------------
The lock_trx_release_read_locks() is used mostly by group replication appliers to release read gap locks. In testing it turned out to be a bottleneck if the global exclusive latch is used to iterate over the transaction's locks. Similarly to the `lock_release()` function we should instead acquire a shared global latch, and latch shards one by one as we iterate. The problem with this is that other threads can modify the list of our locks concurrently (for example because of implicit-to-explicit conversion, or B-tree reorganization), and we can't simply compare the current lock to the tail, because we are not removing all locks, just a subset of them, so the trick of operating only at the tail of the list is insufficient. To notice such situations (and restart the iteration) we will introduce `uint64_t trx->lock.trx_locks_version`, incremented each time a lock is added to or removed from the trx lock list. After several failed restarts we can switch back to the old `lock_trx_release_read_locks_in_x_mode()`. (A sketch of this versioned restart loop is given after the list of changes below.)

Changes:
========
- separate the whole latching logic into a dedicated class locksys::Latches and document the design extensively in its header
- introduce the latch guards mentioned above
- all new functions will be in the locksys namespace
- all usages of lock_mutex_enter()/lock_mutex_exit() will be replaced with appropriate latch guards, preferably locksys::Shard_latch_guard
- table->n_rec_locks must become atomic, as it can now be incremented/decremented in parallel during creation/destruction of record locks for a given table
- dict/mem.cc doesn't really need to include lock0lock.h when compiled for UNIV_LIBRARY
- remove lock_mutex from PSI
- add lock_sys_global_latch_rw_lock to PSI
- add lock_sys_page_mutex to PSI
- add lock_sys_table_mutex to PSI
- all places where we use the exclusive global latch will be documented to specify the remaining reasons we have to resort to such strong synchronization
- the table->autoinc_trx field should be atomic, as it is "peeked at" without any latch, and confusing/wrong comments and asserts around it have to be cleaned up, to clarify why it is correct
- `lock_rec_expl_exist_on_page()` should return a `bool` instead of a potentially dangling pointer to a `lock_t`
- `lock_print_info_summary` and the logic inside `srv_printf_innodb_monitor()` in general need at least some small refactoring so that the latch guards can be used
- the `lock_mutex_own()` debug predicate has to be replaced with the more specific `owns_exclusive_global_latch()`, `owns_shared_global_latch()`, `owns_page_shard(page)`, `owns_table_shard(table)`, ...
- `bool Sharded_rw_lock::try_x_lock` needs to be implemented
- the control flow of `lock_rec_insert_check_and_lock()` (and its copy&paste `lock_prdt_insert_check_and_lock()`) can be simplified by removing code duplication, before we can use lock guards
- the code around `lock_rec_queue_validate()` could be simplified by removing code duplication, and using more structured latching
- update sync0debug so it has proper rules for latch order
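As promised above, here is a hedged sketch of the trx_locks_version restart loop used by lock_trx_release_read_locks(); the helper names, the placeholder bodies and the restart threshold are hypothetical, and the fallback stands in for the existing lock_trx_release_read_locks_in_x_mode():
```
#include <atomic>
#include <cstdint>

// Simplified stand-ins: the real counter lives in trx->lock.trx_locks_version.
struct TrxLock { std::atomic<uint64_t> trx_locks_version{0}; /* ... */ };
struct Trx { TrxLock lock; /* ... */ };

// Placeholder for the per-shard pass: would iterate the trx's read locks shard
// by shard under a shared global latch, returning false as soon as it notices
// that trx_locks_version no longer equals `version`.
bool try_release_read_locks(Trx &trx, uint64_t version) {
  return trx.lock.trx_locks_version.load() == version;
}

// Placeholder for the fallback which takes the exclusive global latch.
void release_read_locks_in_x_mode(Trx &) {}

constexpr int MAX_RESTARTS = 5;  // hypothetical threshold

void release_read_locks(Trx &trx) {
  for (int attempt = 0; attempt < MAX_RESTARTS; ++attempt) {
    const uint64_t version = trx.lock.trx_locks_version.load();
    if (try_release_read_locks(trx, version)) {
      return;  // finished without a concurrent change of the lock list
    }
    // A lock was added to or removed from the list concurrently; restart.
  }
  release_read_locks_in_x_mode(trx);  // too many restarts: use the big latch
}
```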
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.