WL#13468: Improved CATS implementation
Affects: Server-8.0
Status: Complete
Contention-Aware Transaction Scheduling (CATS) is the algorithm we now use when deciding which transaction should be granted a record lock: it prioritizes transactions for which many other transactions are waiting. The current implementation has some subtle bugs (the computed weight might be wrong) and performance issues (to update the weights we traverse the wait-for-graph while holding the highly contended lock_sys->mutex). This WL aims to solve both sets of problems by providing an implementation in which the whole weight-computing procedure occurs in a separate thread, without holding lock_sys->mutex, and with logic that is hopefully simpler to reason about.

Also, we currently have two scheduling algorithms, the other being First Come First Served (FCFS), and a (meta-)heuristic which decides which of the two to use at any given time. This WL will improve the implementation of CATS to the point where FCFS becomes redundant (as it is often slower, and easy to "emulate" by setting equal weights in CATS). This WL will remove FCFS from the code, further simplifying lock_sys's logic.
Functional Requirements
-----------------------
FR1. It should be possible to query the current weight assigned by the CATS algorithm to a transaction waiting for a lock. Thus INFORMATION_SCHEMA.INNODB_TRX should have a new column TRX_SCHEDULE_WEIGHT which, for a waiting transaction (TRX_STATE == "LOCK WAIT"), shows the weight of the transaction computed by the algorithm. The value for a transaction which is not currently waiting should be NULL. As usual for rows in this table, the data can be cached and thus slightly outdated.

Non-Functional Requirements
---------------------------
NF1. All existing sys QA performance tests perform equal or better.
Information Schema changes:
---------------------------
At the heart of the CATS algorithm are weights assigned to transactions: their values determine their relative priority in the queue waiting for record locks (the transaction with the higher weight should be granted the lock first). Therefore, to debug, monitor and verify the implementation, it is helpful to be able to see what these weights are. This WL will add a new column TRX_SCHEDULE_WEIGHT to INFORMATION_SCHEMA.INNODB_TRX (https://dev.mysql.com/doc/refman/8.0/en/innodb-trx-table.html), not to be confused with the already existing TRX_WEIGHT, which is computed by a different algorithm and is used to pick the victim to roll back during deadlock resolution. As stated in FR1, the weights are only computed (and thus meaningful) for transactions which are currently in the "LOCK WAIT" state. For a waiting transaction the value in this column should be equal to the weight computed by the CATS algorithm. For other transactions this value should be NULL.

New counters:
-------------
lock_rec_release_attempts - "Number of times we attempted to release record locks" - Note that a single attempt may lead to zero or more actual record locks being released, as a single structure may represent zero or more record locks.
lock_rec_grant_attempts - "Number of times we attempted to grant locks for a record" - Note that a single attempt may result in zero or more actual record locks being granted.
lock_schedule_refreshes - "Number of times the wait-for graph was analyzed to update schedule weights of transactions"

Other behavior changes:
-----------------------
Another change observable by the end user is that CATS will now always be used instead of FCFS (not just under heavy contention, as in the past), so the order in which transactions are granted locks will be different.
Architecture:
=============
The general idea is to take advantage of the fact that, for deadlock detection purposes, we already have a dedicated thread which builds a (subset of the) wait-for-graph and performs analysis on it. Given that the CATS weight of a transaction is defined as the number of transactions which are waiting for it (including itself), we can see how having access to the wait-for-graph can be useful. The implementation of the deadlock detector is currently in lock0wait.cc, and you can find there `lock_wait_snapshot_waiting_threads()`, which constructs a snapshot of the edges between the transactions which are currently waiting. (To be specific: there is at most one "reason for wait" recorded per transaction, even though the rules of conflicts between locks say that a transaction is blocked by all the transactions holding conflicting locks. This is a slight deviation from the theoretical definition of the CATS weight found in the paper, but in practice it is much easier to compute and empirically gives very good performance.)

We will create a very simple BFS algorithm which walks through this snapshot to compute the CATS weight of each waiting transaction, and then "publishes" it to the trx->lock.cats_weight atomic, which can be used for queue sorting. As this takes O(waiters) time, we can afford to perform it each time we analyze the wait-for-graph. This lets us remove much of the code dedicated to updating/maintaining the trx->age.

Details:
========

Weight boosting
---------------
The new algorithm for computing the weights gives us some flexibility in how we define the "initial weight" of a transaction before we "accumulate" the initial weights into the final value for each subtree. Instead of using 1 for all transactions, we can choose to "boost" the weight of a given transaction (and, transitively, of all the transactions which block it) by assigning it a larger initial weight. Empirical tests suggest that good performance is achieved if we "boost" transactions which spend "too much" time waiting. In particular, a rule which seems both simple and well-performing is this one: "If the number of new slot reservations since my slot reservation is larger than twice the number of slots in use, then it is high time to boost my transaction's priority". The rationale for this rule (apart from the performance of OLTP_RW tests with a Pareto distribution and a thread count around 1024) is that if 2*n transactions got suspended during our wait, and the current number of waiters is n, then at least n transactions bypassed us, which seems unfair. Also, in a fair world, where suspensions and wakeups are balanced, 2*n suspensions mean around 2*n wakeups, so one would expect around n other transactions to wake up before it is our turn.
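To make the two stages above concrete, here is a minimal, self-contained C++ sketch of assigning the initial weights (with the boosting rule) and accumulating them bottom-up over the snapshotted edges. It is an illustration only, not the actual lock0wait.cc code: the flat (waiter, blocker) edge list, the helper names and the kWeightBoost value are all assumptions made for this sketch.

  #include <cstddef>
  #include <cstdint>
  #include <queue>
  #include <unordered_map>
  #include <utility>
  #include <vector>

  using trx_id = std::uint64_t;

  // Illustrative boost value; the real constant is a tuning decision.
  constexpr std::uint64_t kWeightBoost = 1000000;

  // The boosting rule: boost if more than 2 * slots_in_use transactions
  // reserved a wait slot after we reserved ours.
  std::uint64_t initial_weight(std::uint64_t my_reservation_no,
                               std::uint64_t table_reservations,
                               std::uint64_t slots_in_use) {
    return table_reservations - my_reservation_no > 2 * slots_in_use
               ? kWeightBoost
               : 1;
  }

  // Bottom-up accumulation: weight(trx) = initial(trx) + sum of the weights
  // of all transactions whose blocking_trx == trx. Leaves are folded into
  // their blockers first; transactions on a deadlock cycle never reach a
  // zero waiter count and keep their partial weights.
  std::unordered_map<trx_id, std::uint64_t> accumulate_weights(
      const std::vector<std::pair<trx_id, trx_id>> &waiter_to_blocker,
      std::unordered_map<trx_id, std::uint64_t> weight /* initial weights */) {
    std::unordered_map<trx_id, std::size_t> n_waiters;
    std::unordered_map<trx_id, trx_id> blocker_of;
    for (const auto &edge : waiter_to_blocker) {
      blocker_of[edge.first] = edge.second;
      ++n_waiters[edge.second];
      weight.try_emplace(edge.second, 1);  // a blocker need not be waiting
    }
    std::queue<trx_id> ready;  // transactions which nobody waits for
    for (const auto &kv : weight) {
      if (n_waiters.find(kv.first) == n_waiters.end()) ready.push(kv.first);
    }
    while (!ready.empty()) {
      const trx_id trx = ready.front();
      ready.pop();
      const auto it = blocker_of.find(trx);
      if (it == blocker_of.end()) continue;  // not waiting itself
      weight[it->second] += weight[trx];     // fold subtree into blocker
      if (--n_waiters[it->second] == 0) ready.push(it->second);
    }
    return weight;
  }

Transactions left on a cycle are exactly those whose waiter count never drops to zero in this pass, which is why the computation of their weights is delayed to the cycle-handling stage (see `lock_wait_update_weights_on_cycle()` in the Changes section below).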
Sorting optimizations
---------------------
The sorting of a lock queue needs to operate on an immutable copy of all the trx->lock.cats_weight values, as changing these values in parallel with the sorting can lead to inconsistent comparison results and crashing of std::sort. One important optimization is that most transactions do not block any other transaction, and thus their weight is nominal. By processing these transactions separately we can reduce the time spent on them from O(n log n) to O(n). Another important optimization is that lock->trx->lock.blocking_trx points to the transaction which is the main reason the given lock is waiting, so we can narrow the list of candidates for being granted the lock to only those transactions which were blocked by the releasing transaction.

In lock_grant_cats() we will only consider granting a `lock` if we (`in_trx`) are the reason it was waiting (`lock->trx->lock.blocking_trx == in_trx`). This serves two goals.

First, it is a performance optimization: if `blocking_trx != in_trx`, then even though we are removing `in_trx`'s lock, there will still be at least one more lock (the one of `blocking_trx`) for which `lock` should wait, so there is no point in wasting time considering `lock` as a candidate for granting. Observe, however, that the lock held by `blocking_trx` can be either GRANTED or WAITING. The latter case is possible because, when we first add `lock` to the queue, the algorithm used there checks for conflicts with both GRANTED and WAITING locks, so it may happen that `blocking_trx`'s lock is WAITING. (We check conflicts with WAITING locks when adding to the queue to avoid starvation of WAITING X locks by incoming S lockers. Which brings us to the second reason...)

Second, to avoid starvation of X locks, we should try to avoid situations in which S lockers overtake waiting X lockers. The original implementation of CATS did little to prevent that: yes, it respected WAITING locks when performing the enqueue operation, but once you were in the queue, lock_grant_cats() did not bother to check for conflicts with WAITING locks at all. So it often happened that when there were 20 GRANTED S locks followed by WAITING X locks, and we came with another S lock, it was initially enqueued in the WAITING state (good!), but as soon as any of the 20 S locks got released, the WAITING S lock was considered a valid candidate and competed with the WAITING X lock on equal terms - and won (even if lighter!), because the X lock was still blocked by the remaining 19 granted S locks. Thus the S lock got granted in place of the released S lock, and we were left where we started: 20 granted S locks, and starved X locks. Not good. The paper suggested that we should do something about it.

Consider what would happen if we allowed a `trx` which was originally blocked by a `blocking_trx` (different from `in_trx`) to be GRANTED its `lock`. In particular, consider the scenario in which such a transaction gets the lock granted, and let's analyze what that would tell us about the past. The `trx` was blocked by `blocking_trx` before `in_trx` released its lock, but now `trx`'s `lock` can be granted, even though `blocking_trx` has still not released its own lock (perhaps it is even still waiting for it). How can it be that `blocking_trx` was an obstacle before, when `trx` was enqueued, and is not an obstacle now? The main difference is that lock_grant_cats() ignores WAITING locks, while the enqueue operation does not. So we conclude that `blocking_trx` is probably in the WAITING state, was in that state before as well, and moreover that `trx`'s lock conflicts with it while not conflicting with any of the GRANTED locks. So the story is this: `trx` came to the queue after `blocking_trx`, which was (and still is) WAITING, and now `trx` tries to overtake it. One of the possible scenarios fitting this description is the one in which an incoming S lock is trying to starve a WAITING X lock. We want to avoid such starvation, so we let the incoming lock wait until the original reason for its waiting goes away.

Does this mean that CATS works like FCFS? No. The edges implied by the `blocking_trx` field do not have to form a straight-line queue; they may form a tree with multiple branches, and what CATS adds is a policy for which child to pick in such cases.
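Below is a small sketch of this granting policy under simplified, assumed shapes; `first_conflicting_granted()` and `grant()` are hypothetical stand-ins (not actual lock_sys functions), and the queue is a flat vector rather than the real hash-bucket list. It shows the two ideas above: only waiters whose recorded reason for waiting is the releasing transaction compete, heaviest first, and a candidate which still conflicts with a remaining granted lock merely has its blocking_trx updated instead of being granted.

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  struct Trx;

  struct Lock {
    Trx *trx;
    bool granted;
  };

  struct Trx {
    Trx *blocking_trx;              // main reason this trx's lock waits
    std::int32_t cats_weight_copy;  // immutable snapshot of the weight
  };

  // Hypothetical helpers: find a granted lock still conflicting with
  // `waiting`, and grant a lock (moving it to the front of the queue).
  Lock *first_conflicting_granted(const std::vector<Lock *> &queue,
                                  const Lock *waiting);
  void grant(Lock *lock);

  // Called when `in_trx` releases its lock on a record.
  void grant_candidates(std::vector<Lock *> &queue, Trx *in_trx) {
    // Only locks which named `in_trx` as their reason for waiting compete.
    std::vector<Lock *> candidates;
    for (Lock *lock : queue) {
      if (!lock->granted && lock->trx->blocking_trx == in_trx) {
        candidates.push_back(lock);
      }
    }
    // Heaviest transactions are considered first.
    std::sort(candidates.begin(), candidates.end(),
              [](const Lock *a, const Lock *b) {
                return a->trx->cats_weight_copy > b->trx->cats_weight_copy;
              });
    for (Lock *lock : candidates) {
      if (Lock *conflict = first_conflicting_granted(queue, lock)) {
        lock->trx->blocking_trx = conflict->trx;  // new reason for waiting
      } else {
        grant(lock);  // no conflict with any remaining granted lock
      }
    }
  }

Note that, just as described above, waiters blocked by some other (possibly still WAITING) transaction are not even considered, which is what prevents the S-lockers-overtaking-X-lockers scenario.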
Updating of the "reason for waiting"
------------------------------------
The `lock_rec_has_to_wait_cats(wait_lock, &granted, new_granted_index)` function as of today is used to check if the `wait_lock` conflicts with any of the `granted` locks. If it doesn't, it can be granted. If it does, then we need to update wait_lock->trx->lock.blocking_trx so that it contains the correct reason for waiting. There is some flexibility here with respect to what to do if there are multiple granted locks which conflict with wait_lock: which one should we choose? Historically we chose the one with the smallest index in the granted[] array. With this WL we iterate over the granted locks in reverse order, which conceptually corresponds to chronological order. This is not for performance or correctness of lock_grant_cats() - it would work just as well without the reversal. Rather, it is to make sure that if there is a deadlock cycle, it will eventually be discovered, as explained below.

We only observe the edges of the wait-for-graph which are stored explicitly in trx->lock.blocking_trx, which means we see at most one outgoing edge per node, even though there might be multiple reasons to wait. So it is possible that the wait-for-graph contains a cycle which we are not aware of, because the current values of blocking_trx do not expose the problem. However, if trx1->lock.blocking_trx is not on a deadlock cycle itself, then sooner or later it will commit or roll back, there will be progress leading to lock_grant_cats(), and trx1 will update the reason for waiting stored in trx1->lock.blocking_trx to something else. Our hope is that eventually (perhaps after several such updates) we will set trx1->lock.blocking_trx to the real cause of the deadlock, which is the next node on the deadlock cycle. We want to prove that there is an upper bound on the number of such updates before we succeed in exposing the deadlock.

The main problem we need to overcome is that in CATS (as opposed to FCFS) new transactions can be heavier than old transactions, and thus be granted the lock and be moved to the front of the queue (as we move granted locks to the front). This could cause a problem if we naively scanned the queue in its natural order and picked the first conflicting trx as the new blocking_trx, because we might then keep selecting newer and newer trxs and never get to the point of noticing that there is an old trx further in the queue which is the real reason for the deadlock. To overcome this, we iterate over the granted locks in the reversed order, which means that there is a bounded number of transactions we will consider before we finally expose the real problem. This works because, whenever we create a new granted lock, or grant a waiting lock, we move it to the front of the queue. There is also a case in which we reuse an already existing granted lock_t struct and only set a bit in its bitmap to 1, without moving the struct to the front of the queue, but this does not change the claim much: there is a finite number of structs which will be considered before the real cause of the deadlock, and when all of these transactions finish what they are doing, the deadlock will be exposed.
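As a small illustration of this reverse scan, here is a sketch under assumed shapes (`Lock` is a placeholder and `locks_conflict()` is a hypothetical stand-in for the real conflict check); it assumes, per the above, that granted[] was collected from the front of the queue, so lower indexes hold the more recently granted locks:

  #include <vector>

  struct Lock;  // illustrative placeholder for lock_t

  // Hypothetical predicate standing in for the real has-to-wait check.
  bool locks_conflict(const Lock *waiting, const Lock *granted);

  // Returns the new "reason for waiting" for wait_lock, or nullptr if
  // wait_lock conflicts with no granted lock and thus can be granted.
  const Lock *choose_blocker(const Lock *wait_lock,
                             const std::vector<const Lock *> &granted) {
    // Reverse array order ~ chronological order: pick the conflicting lock
    // granted longest ago, which bounds the number of blocking_trx updates
    // needed before a deadlock cycle is exposed.
    for (auto it = granted.rbegin(); it != granted.rend(); ++it) {
      if (locks_conflict(wait_lock, *it)) return *it;
    }
    return nullptr;
  }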
New invariant: granted locks are before waiting locks
-----------------------------------------------------
Once we remove FCFS from the code, leaving only CATS, a new invariant will hold: as CATS always puts granted locks at the front of the queue, all waiting locks must be after all granted locks at all times.

This is not very important from a correctness point of view, but it makes it much easier to argue that the new implementation is equivalent to the old one, in which we only checked for conflicts between a granted and a waiting lock if the waiting lock was after the granted lock in the queue. That is, the old CATS implementation took the relative position of two locks into account when deciding if one has to wait for the other. The reason the old CATS implementation had this condition was, in turn, to emulate the behaviour of the FCFS algorithm, which did just that.

One might ponder how it is even possible to have a granted lock B behind a waiting lock A if A has-to-wait-for B - the answer is that our has-to-wait-for relation is not symmetric. In particular, locks which have the LOCK_INSERT_INTENTION flag have to wait for locks which lock the gap (but are not themselves LOCK_INSERT_INTENTION locks), such as LOCK_GAP (a.k.a. gap-only lock) or LOCK_ORDINARY (a.k.a. next-key lock), yet such gap locks do not have to wait for LOCK_INSERT_INTENTION locks at all. One could argue that even though the asymmetry exists, it does not follow that there is any benefit in letting a waiting insert-intention lock ignore a granted gap lock simply because the gap lock is behind it in the queue, as this does not really change the sad reality that one has to wait for all gap locks to be released before performing the insert operation anyway (the inserting thread will spuriously wake up, notice the gap lock is still there, enqueue another insert-intention lock and go to sleep again).

The reason FCFS had this rule seems to be deadlock avoidance. In the old FCFS code, we only ever ran the deadlock detector when adding a waiting lock to the queue, but not when granting a lock. Since we can grant a LOCK_GAP even if there is a LOCK_INSERT_INTENTION in front of it, it was important not to consider the LOCK_INSERT_INTENTION as "blocked by" the LOCK_GAP, for such a "backward edge" would not be detected, for the simple reason that we did not run deadlock detection at all when granting the LOCK_GAP. With the recent changes in deadlock detection, which is now run in a separate thread after each change to the wait-for-graph, this is no longer a problem: it does not matter whether the edge is "forward" or "backward", it only matters that lock_wait_request_check_for_cycles() is called after the edge is added/modified, which we do. Anyway, this new invariant (granted locks are before waiting locks) seems easy to verify at runtime and is a useful assumption for simplifying some existing functions, such as lock_add_priority().

Changes:
========
1. We no longer need FCFS, so:
1.1. We can remove `bool lock_use_fcfs(const lock_t *lock)`, which was used to decide if we should use FCFS or CATS.
1.2. We can remove the `std::atomic` global variable `n_waiting`, which tracked how many transactions are currently waiting for a lock. This information was used to decide between FCFS and CATS, depending on whether n_waiting was above some threshold.
1.3. We can remove the `const int LOCK_CATS_THRESHOLD` mentioned above.
1.4. We can remove `lock_t *lock_rec_get_first(hash_table_t *hash, const RecID &rec_id)`, which was only used by the FCFS variants of the code.
1.5. We can remove the `use_fcfs` argument from the various functions which had two ways of handling locks depending on the algorithm:
1.5.1. `void lock_cancel_waiting_and_release(lock_t *lock, bool use_fcfs)`
1.5.2. `void lock_rec_dequeue_from_page(lock_t *in_lock, bool use_fcfs)`
1.5.3. `void lock_rec_grant(lock_t *in_lock, bool use_fcfs)`
1.6. We can simplify `bool RecLock::lock_add_priority(lock_t *lock, const lock_t *conflict_lock)` to always put granted locks at the front of the queue, as in the CATS algorithm.
1.7. The mechanism for "forcing" CATS (as opposed to FCFS) in MTRs can be removed:
1.7.1. mysql-test/suite/innodb/include/force_cats.inc can be removed
1.7.2. mysql-test/suite/innodb/include/discourage_cats.inc can be removed
1.7.3. mysql-test/suite/innodb/include/cats_config.inc can be removed
1.7.4. All usages of the above helpers - in particular all the MTRs which run once with FCFS and then again with CATS - can be simplified by removing the "CATS forcing".
2. We no longer perform the updating of weights from the transactions' threads, so:
2.1. We can remove `uint64_t lock_sys_t::mark_age_updated`, which was used as an epoch counter during the update's DFS.
2.2. We can remove `uint64_t trx_t::age_update`, which was used together with `mark_age_updated` during the DFS to mark visited nodes.
2.3. We can remove `void lock_update_trx_age(trx_t *trx, int32_t age)`.
2.4. We can remove `void lock_update_age(lock_t *new_lock, ulint heap_no)`.
3. We can rename `int32_t trx_t::age` to `std::atomic<int32_t> trx_lock_t::cats_weight`. Historically, the name "age" related to the amount of time the transaction had spent in the system, which was important in Variance Aware Transaction Scheduling (VATS). After the implementation was changed to CATS the name "age" was kept, but the semantics no longer have anything to do with time. Making it atomic makes it accessible from the thread dedicated to weight updating in parallel with the transaction threads which sort the queues. Moving it from trx_t to trx_lock_t seems reasonable, as it is a lock_sys related field.
4. Implement the new algorithm:
4.1. Introduce `lock_wait_compute_initial_weights()`, which assigns 1 or WEIGHT_BOOST to each waiting transaction.
4.1.1. The boosting heuristic can be implemented by comparing the already existing `srv_slot_t::reservation_no` to `lock_wait_table_reservations` to learn how many transactions went to sleep in other slots since a given slot was taken.
4.2. Introduce `lock_wait_accumulate_weights()`, which takes the initial values computed in `lock_wait_compute_initial_weights()` and computes, for each transaction, the sum of the weights of all transactions which wait for it (including itself). In the deadlock-free case this is simply the weight of the subtree rooted at the given transaction. In the case of cycles, we delay the computation of the weights of the transactions on the cycle to the next stage, `lock_wait_update_weights_on_cycle()`.
4.3. Introduce `lock_wait_update_weights_on_cycle()`, which, given a deadlock cycle and the chosen victim, virtually unfolds the cycle into a straight path and updates the weights of the transactions on the cycle as if the victim were removed.
4.4. Introduce `lock_wait_publish_new_weights()`, which stores the newly computed weights in the trx->lock.cats_weight fields.
4.5. Plug the new functionality into the relevant places of the deadlock detector, so that it can take advantage of the already computed wait-for-graph, cycles and victims.
5. Change the implementation of `lock_grant_cats()`:
5.1. As the `trx->lock.cats_weight` fields can be modified in parallel, we need to load their values into an immutable snapshot before sorting (otherwise std::sort could crash if comparison results were inconsistent across time).
5.2. We can take advantage of the observation that most transactions have the initial weight, and thus, instead of spending O(n log n) time on them, we can process them in O(n) time by simply appending them to the end of the sorted result (see the sketch after this list).
5.3. We can take advantage of the trx->lock.blocking_trx field to narrow the candidate set to the transactions which were blocked by the releasing transaction.
5.4. We can remove the logic for trx->age updating.
5.5. Rename it to `lock_rec_grant_by_heap_no()`.
6. Add the new column TRX_SCHEDULE_WEIGHT to INFORMATION_SCHEMA.INNODB_TRX.
7. Add the new counters:
7.1. "lock_rec_release_attempts", incremented in lock_rec_grant().
7.2. "lock_rec_grant_attempts", incremented in lock_rec_grant() and in the other places from which lock_rec_grant_by_heap_no() is called.
7.3. "lock_schedule_refreshes", incremented in lock_wait_compute_and_publish_weights_except_cycles().
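To illustrate points 5.1 and 5.2, here is a minimal sketch under assumed, simplified shapes (not the actual lock0lock.cc types): each weight is loaded exactly once into an immutable snapshot, only candidates above the nominal weight of 1 are sorted, and the rest are appended in O(n).

  #include <algorithm>
  #include <atomic>
  #include <cstdint>
  #include <vector>

  struct Trx {
    std::atomic<std::int32_t> cats_weight;  // republished by the detector
  };

  struct Lock {
    Trx *trx;
  };

  struct Candidate {
    Lock *lock;
    std::int32_t weight;  // immutable copy used during sorting
  };

  std::vector<Candidate> order_candidates(const std::vector<Lock *> &waiting) {
    std::vector<Candidate> heavy, nominal;
    for (Lock *lock : waiting) {
      // Load each weight exactly once: concurrent republishing must not be
      // able to make the comparator inconsistent while std::sort runs.
      const std::int32_t w =
          lock->trx->cats_weight.load(std::memory_order_relaxed);
      if (w > 1) {
        heavy.push_back({lock, w});
      } else {
        nominal.push_back({lock, w});  // the common case
      }
    }
    // Only the (typically few) heavy candidates pay for the sort.
    std::sort(heavy.begin(), heavy.end(),
              [](const Candidate &a, const Candidate &b) {
                return a.weight > b.weight;
              });
    // Nominal-weight candidates keep their relative order at the end.
    heavy.insert(heavy.end(), nominal.begin(), nominal.end());
    return heavy;
  }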
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.