WL#13468: Improved CATS implementation

Affects: Server-8.0   —   Status: Complete

Contention-Aware Transaction Scheduling (CATS) is the algorithm we now use to
decide which transaction should be granted a record lock: we prioritize the
transactions for which many other transactions are waiting.
As of now the implementation has some subtle bugs (the computed weight might be
wrong) and performance issues (to update the weights we traverse the
wait-for-graph while holding the highly contended lock_sys->mutex).
This WL aims to solve both sets of problems by providing an implementation in
which the whole weight computation occurs in a separate thread, without holding
lock_sys->mutex, and with logic that is hopefully simpler to reason about.

Also, as of now, we have two scheduling algorithms, the other being First Come
First Served (FCFS), and a (meta-)heuristic which decides which of the two to
use at any given time. This WL will improve the implementation of CATS to the
point where FCFS becomes redundant (it is often slower, and easy to "emulate"
by setting equal weights in CATS). This WL will remove FCFS from the code,
further simplifying the lock_sys's logic.

Functional Requirements
-----------------------

FR1. It should be possible to query the current weight assigned by the CATS
algorithm to a transaction waiting for a lock. Thus
INFORMATION_SCHEMA.INNODB_TRX should have a new column TRX_SCHEDULE_WEIGHT
which, for a waiting transaction (TRX_STATE == "LOCK WAIT"), shows the weight
of the transaction as computed by the algorithm. The value for a transaction
which is not currently waiting should be NULL. As usual for rows in this
table, the data can be cached and thus slightly outdated.


Non-Functional Requirements
---------------------------

NF1. All existing sys QA performance tests perform equally well or better.


Information Schema changes:
---------------------------

At the heart of the CATS algorithm are weights assigned to transactions: their
values imply their relative priority in the queue waiting for record locks
(the transaction with the higher weight should be granted the lock first).
Therefore, to debug, monitor and verify the implementation it is helpful to be
able to see what these weights are.

This WL will add a new column TRX_SCHEDULE_WEIGHT to
INFORMATION_SCHEMA.INNODB_TRX
https://dev.mysql.com/doc/refman/8.0/en/innodb-trx-table.html
(not to be confused with the already existing TRX_WEIGHT, which is computed by
a different algorithm and is used to determine the victim to be rolled back
during deadlock resolution).

As stated in FR1, the weights are only computed (and thus meaningful) for
transactions which are currently in the "LOCK WAIT" state. For a waiting
transaction the value in this column should be equal to the weight computed by
the CATS algorithm. For other transactions this value should be NULL.

New counters:
-------------

lock_rec_release_attempts
- "Number of times we attempted to release record locks"
- Note that a single attempt may lead to zero or more actual record locks being
released, as there might be zero or more record locks represented together by a
single structure

lock_rec_grant_attempts
- "Number of times we attempted to grant locks for a record"
- Note that a single attempt may result in zero or more actual record locks
being granted

lock_schedule_refreshes
- "Number of times the wait-for graph was analyzed to update schedule weights of
transactions"

Other behavior changes:
-----------------------

Another change observable by the end user is that CATS will now always be used
instead of FCFS (not just under heavy contention, as in the past), thus the
order in which transactions are granted locks will be different.

Architecture:
=============

The general idea is to take advantage of the fact that for deadlock-detection
purposes we already have a dedicated thread which builds a (subset of the)
wait-for-graph and performs analysis on it.
Given that the CATS weight of a transaction is defined as the number of
transactions which are waiting for it (including itself), it is clear that
having access to the wait-for-graph is useful.
The implementation of the deadlock detector currently resides in lock0wait.cc,
where you can find `lock_wait_snapshot_waiting_threads()`, which constructs a
snapshot of the edges between the transactions which are currently waiting.
(To be specific: there is at most one "reason for wait" per transaction, even
though the rules of conflicts between locks say that a transaction is blocked
by all the transactions holding conflicting locks. This is a slight deviation
from the theoretical definition of the CATS weight found in the paper, but in
practice it is much easier to compute and gives empirically very good
performance.)

We will create a very simple BFS algorithm which walks through this snapshot
to compute the CATS weight of each waiting transaction, and then "publishes"
it to the trx->lock.cats_weight atomic, which can be used for queue sorting.
As this takes O(waiters) time, we can afford to perform it each time we
analyze the wait-for-graph. This lets us remove much of the code dedicated to
updating/maintaining trx->age.
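
To make the idea concrete, here is a minimal self-contained sketch of such a
BFS accumulation. The edge representation and all names below are illustrative
stand-ins, not the actual implementation; the real code operates on the
snapshot built by `lock_wait_snapshot_waiting_threads()`:

  #include <cstdint>
  #include <queue>
  #include <unordered_map>
  #include <vector>

  // Hypothetical stand-in for one edge of the snapshot: each waiting
  // transaction has at most one "reason for wait" (its blocker).
  struct WaitEdge {
    uint64_t waiter;   // id of the waiting transaction
    uint64_t blocker;  // id of the transaction it directly waits for
  };

  // Computes, for every transaction in the snapshot, the number of
  // transactions waiting for it (transitively), plus one for itself.
  // Transactions on a deadlock cycle never become "ready" and are simply
  // skipped here - their weights are handled separately (see point 4.3).
  std::unordered_map<uint64_t, uint64_t> compute_cats_weights(
      const std::vector<WaitEdge> &edges) {
    std::unordered_map<uint64_t, uint64_t> weight;
    std::unordered_map<uint64_t, uint64_t> blocker_of;
    std::unordered_map<uint64_t, size_t> pending_waiters;
    for (const auto &e : edges) {
      weight.emplace(e.waiter, 1);  // initial weight: 1 or WEIGHT_BOOST
      weight.emplace(e.blocker, 1);
      blocker_of[e.waiter] = e.blocker;
      ++pending_waiters[e.blocker];
    }
    // BFS over the reversed forest: start from transactions nobody waits
    // for, and push each node's accumulated weight up to its blocker once
    // all of the node's own waiters have been accounted for.
    std::queue<uint64_t> ready;
    for (const auto &w : weight) {
      if (pending_waiters.count(w.first) == 0) ready.push(w.first);
    }
    while (!ready.empty()) {
      const uint64_t trx = ready.front();
      ready.pop();
      const auto it = blocker_of.find(trx);
      if (it == blocker_of.end()) continue;  // waits for no one
      weight[it->second] += weight[trx];
      if (--pending_waiters[it->second] == 0) ready.push(it->second);
    }
    return weight;
  }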


Details:
========


Weight boosting
---------------

The new algorithm for computing the weights gives us some flexibility in how
we define the "initial weight" of a transaction before we "accumulate" the
initial weights into final values for each subtree. Instead of using 1 for all
transactions, we can choose to "boost" the weight of a given transaction (and,
transitively, of all transactions which block it) by assigning it a larger
initial weight. Empirical tests suggest that good performance is achieved if
we "boost" transactions which have spent "too much" time waiting. In
particular, the rule which seems both simple and well-performing is this one:

  "If the number of new slot reservations since my slot reservation is
  larger than twice the number of slots in use, then it is high time to
  boost my transaction's priority".

The rationale for this rule (apart from the performance of OLTP_RW tests with
a Pareto distribution and a thread count of ~1024) is that if 2*n transactions
got suspended during our wait, and the current number of waiters is n, then at
least n transactions bypassed us, which seems unfair. Also, in a fair world,
where suspensions and wakeups are balanced, 2*n suspensions mean around 2*n
wakeups, and one would expect around n other transactions to wake up before it
is our turn to wake up.
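
For illustration, a hedged sketch of this heuristic. The counter names follow
`srv_slot_t::reservation_no` and `lock_wait_table_reservations` mentioned in
point 4.1.1 below, but the exact types, the `lock_wait_slots_in_use` variable
and the WEIGHT_BOOST value are assumptions of this sketch:

  #include <cstdint>

  // Assumed stand-ins for the real lock-wait table state: a monotonically
  // increasing count of slot reservations ever made, and the current
  // number of slots in use.
  extern uint64_t lock_wait_table_reservations;
  extern uint64_t lock_wait_slots_in_use;

  // Illustrative value only; the actual constant is a tuning decision.
  constexpr uint64_t WEIGHT_BOOST = 1000000;

  // Initial weight for a waiting transaction whose slot was reserved when
  // the reservation counter had the value slot_reservation_no: boost it if
  // more than 2 * (slots in use) reservations have happened since then.
  uint64_t initial_weight(uint64_t slot_reservation_no) {
    const uint64_t newer = lock_wait_table_reservations - slot_reservation_no;
    return (newer > 2 * lock_wait_slots_in_use) ? WEIGHT_BOOST : 1;
  }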


Sorting optimizations
---------------------

The sorting of a lock queue needs to operate on an immutable copy of all the
trx->lock.cats_weight values, as changing these values in parallel with the
sorting can lead to inconsistent comparison results and crashes in std::sort.

One important optimization is that most transactions do not block any other
transaction, and thus their weight is nominal. By processing these
transactions separately we can reduce the time spent on them from O(n log n)
to O(n).
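
A minimal sketch of these two points combined. The `WaitingLock` struct is a
hypothetical stand-in for the real queue entries; the weights are assumed to
have been loaded from the trx->lock.cats_weight atomics beforehand:

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  struct trx_t;  // opaque here

  // Hypothetical stand-in for a queue entry: the weight is a snapshot of
  // trx->lock.cats_weight loaded before sorting starts.
  struct WaitingLock {
    trx_t *trx;
    uint64_t weight;
  };

  // Sorts candidates by descending weight. Entries with the nominal
  // weight (nobody waits for them) are simply left at the end, costing
  // O(n) for them instead of taking part in the O(n log n) sort.
  void sort_candidates(std::vector<WaitingLock> &candidates) {
    auto mid = std::partition(
        candidates.begin(), candidates.end(),
        [](const WaitingLock &w) { return w.weight > 1; });
    // The weights are an immutable snapshot, so the comparator stays
    // consistent for the whole duration of the sort.
    std::sort(candidates.begin(), mid,
              [](const WaitingLock &a, const WaitingLock &b) {
                return a.weight > b.weight;
              });
  }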

Another important optimization uses the fact that lock->trx->lock.blocking_trx
points to the transaction which is the main reason the given lock is waiting.
So we can narrow the list of candidates for being granted the lock to only
those transactions which were blocked by the releasing transaction.
In lock_grant_cats() we will only consider granting a `lock` if we (`in_trx`)
are the reason it was waiting (`lock->trx->lock.blocking_trx == in_trx`).
This serves two goals.

First, it is a performance optimization: if `blocking_trx != in_trx`, then
even though we are removing `in_trx`'s lock, there will still be at least one
more lock (the one of `blocking_trx`) for which `lock` should wait, so there
is no point in wasting time considering `lock` as a candidate for granting.
Observe, however, that the lock held by `blocking_trx` can be either GRANTED
or WAITING. The latter case is possible because when we first add `lock` to
the queue, the algorithm used there checks for conflicts with both GRANTED
and WAITING locks, so it may happen that `blocking_trx`'s lock is WAITING.
We check conflicts with WAITING locks when adding to the queue to avoid
starvation of WAITING X locks by incoming S lockers. Which brings us to the
second reason...

The second reason is that, to avoid starvation of X locks, we should try to
avoid situations where S lockers overtake waiting X lockers.
The original implementation of CATS did little to prevent that: yes, it
respected WAITING locks when performing the enqueue operation, but once you
were in the queue, lock_grant_cats() did not bother to check for conflicts
with WAITING locks at all. So it often happened that when there were 20
GRANTED S locks, followed by WAITING X locks, and we came with another S
lock, it was initially enqueued in the WAITING state (good!). But as soon as
any of those 20 S locks got released, the WAITING S lock was considered a
valid candidate and competed with the WAITING X lock on equal terms, and
would win (even if lighter!) because the X lock was still blocked by the
remaining 19 granted S locks. Thus the S lock got granted in place of the
released S lock, and we are left where we started: 20 granted S locks, and
starved X locks. Not good. The paper suggested that we should do something
about it.
Consider what would happen if we allowed a `trx` which was originally blocked
by a `blocking_trx` (different from `in_trx`) to be GRANTED its `lock`.
In particular, consider the scenario in which such a transaction gets the
lock granted. Let's analyze what that would tell us about the past.
The `trx` was blocked by `blocking_trx` before `in_trx` released its lock,
but now `trx`'s `lock` can be granted to it, even though `blocking_trx` has
still not released its lock (perhaps it is even still waiting itself).
How can it be that `blocking_trx` was an obstacle before, when `trx` was
enqueued, and now it is not?
Well, the main difference is that lock_grant_cats() ignores WAITING locks,
while the enqueue operation does not ignore them.
So we conclude that `blocking_trx` is probably in the WAITING state, and was
in that state before, and moreover `trx`'s lock does conflict with it, while
not conflicting with any of the GRANTED locks.
So the story is this: `trx` came to the queue after `blocking_trx`, which was
(and still is) WAITING, and now `trx` tries to overtake it.
One of the possible scenarios which fit this description is the one in which
an incoming S lock is trying to starve a WAITING X lock.
We want to avoid such starvation, so we let the incoming lock wait until the
original reason for waiting goes away.
Does this mean that CATS works like FCFS?
No. The edges implied by the `blocking_trx` field do not have to form a
straight-line queue; they may form a tree with multiple branches, and what
CATS adds is a policy for which child to pick in such cases.
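
A hedged sketch of this narrowing. The struct definitions are minimal
stand-ins for the real ones, and the queue representation is an assumption;
only the `blocking_trx == in_trx` filter is the point here:

  #include <vector>

  // Minimal assumed stand-ins for the real structures.
  struct trx_t;
  struct trx_lock_t { trx_t *blocking_trx; };
  struct trx_t { trx_lock_t lock; };
  struct lock_t { trx_t *trx; };

  // Keep only those waiting locks whose recorded reason for waiting is the
  // releasing transaction in_trx. Any other waiting lock is still blocked
  // by its blocking_trx's lock (GRANTED or WAITING), so considering it now
  // would waste time - and could let an S lock overtake a WAITING X lock.
  std::vector<lock_t *> grant_candidates(
      const std::vector<lock_t *> &waiting_queue, const trx_t *in_trx) {
    std::vector<lock_t *> candidates;
    for (lock_t *lock : waiting_queue) {
      if (lock->trx->lock.blocking_trx == in_trx) {
        candidates.push_back(lock);
      }
    }
    return candidates;
  }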

Updating of "reason for waiting"
--------------------------------

The `lock_rec_has_to_wait_cats(wait_lock, &granted, new_granted_index)`
function is used today to check whether `wait_lock` conflicts with any of the
`granted` locks.
If it doesn't, it can be granted.
If it does, then we need to update wait_lock->trx->lock.blocking_trx so that
it contains the correct reason for waiting.
There is some flexibility here with respect to what to do if there are
multiple granted locks which conflict with wait_lock: which one should we
choose?
Historically we've chosen the one with the smallest index in the granted[]
array. With this WL we iterate over the granted locks in reverse order.
Conceptually this corresponds to chronological order.
This is not for performance or correctness of lock_grant_cats() - it would
work just as well without this reversal.
Rather, this is to make sure that if there is a deadlock cycle, then it will
eventually be discovered, as explained below.
We only observe the edges of the wait-for-graph which are stored explicitly
in trx->lock.blocking_trx, which means we see at most one outgoing edge per
node, even though there might be multiple reasons to wait.
So it is possible that the wait-for-graph contains a cycle which we are not
aware of, because the current values of blocking_trx do not expose the
problem.
However, if trx1->lock.blocking_trx is not in a deadlock cycle itself, then
sooner or later it will commit or roll back, and there will be progress
leading to lock_grant_cats(), and trx1 will update the reason for waiting
stored in trx1->lock.blocking_trx to something else.
Our hope is that eventually (perhaps after several such updates) we will set
trx1->lock.blocking_trx to the real cause of the deadlock, which is the next
node on the deadlock cycle.
We want to prove that there is an upper bound on the number of such updates
before we succeed in exposing the deadlock.
The main problem we need to overcome is that in CATS (as opposed to FCFS) new
transactions can be heavier than old transactions, and thus be granted the
lock and moved to the front of the queue (as we move granted locks to the
front).
This could cause a problem if we naively scanned the queue in natural order
and picked the first conflicting trx as the new blocking_trx, because it is
possible that we would keep selecting newer and newer trxs and never get to
the point of noticing that there is an old trx further in the queue which is
the real reason for the deadlock.
To overcome this, we iterate over the granted locks in reverse order, which
means that there is a bounded number of transactions we will consider before
we finally expose the real problem.
This works because whenever we create a new granted lock, or grant a waiting
lock, we move it to the front of the queue. There is also a case where we
reuse an already existing granted lock_t struct and only set a bit in its
bitmap to 1 without moving the struct to the front of the queue, but this
does not change our claim much: there is a finite number of structs which
will be considered prior to the real cause of the deadlock, and when all of
these transactions finish what they are doing, the deadlock will be exposed.
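
A hedged sketch of the conflict check with reverse iteration. The struct
definitions are minimal stand-ins, and `conflicts` stands in for the real
predicate deciding whether one lock has to wait for another:

  #include <cstddef>
  #include <functional>
  #include <vector>

  struct trx_t;
  struct trx_lock_t { trx_t *blocking_trx; };
  struct trx_t { trx_lock_t lock; };
  struct lock_t { trx_t *trx; };

  // Returns true if wait_lock still has to wait, recording the conflict
  // found as the new "reason for waiting".
  bool has_to_wait(
      lock_t *wait_lock, const std::vector<lock_t *> &granted,
      const std::function<bool(const lock_t *, const lock_t *)> &conflicts) {
    // Granted locks get moved to the front of the queue, so iterating
    // granted[] in reverse approximates chronological order. This bounds
    // how many times blocking_trx can be re-pointed at ever newer trxs
    // before the real cause of a deadlock cycle is finally recorded.
    for (size_t i = granted.size(); i-- > 0;) {
      if (conflicts(wait_lock, granted[i])) {
        wait_lock->trx->lock.blocking_trx = granted[i]->trx;  // new reason
        return true;
      }
    }
    return false;  // no conflict with any granted lock: can be granted
  }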

New invariant: granted locks are before waiting locks
-----------------------------------------------------

Once we remove FCFS from the code, leaving only CATS, a new invariant will
hold: as CATS always puts granted locks at the front of the queue, all
waiting locks must be behind all granted locks at all times.
This is not very important from the correctness point of view, but it makes
it much easier to argue that the new implementation is equivalent to the old
one, in which we only checked for conflicts between a granted and a waiting
lock if the waiting lock was after the granted lock in the queue.
That is, the old CATS implementation took the relative position of two locks
into account when deciding if one has to wait for the other.
The reason the old CATS implementation had this condition was, in turn, to
emulate the behaviour of the FCFS algorithm, which did just that.
One might ponder how it is even possible to have a granted lock B behind a
waiting lock A if A has-to-wait-for B - the answer is that our
has-to-wait-for relation is not symmetric. In particular, locks which have
the LOCK_INSERT_INTENTION flag have to wait for locks which lock the gap (but
are not LOCK_INSERT_INTENTION locks), such as LOCK_GAP (a.k.a. gap-only lock)
or LOCK_ORDINARY (a.k.a. next-key lock), yet such gap locks do not have to
wait for LOCK_INSERT_INTENTION locks at all.
One could argue that even though the asymmetry exists, it does not follow
that there is any benefit from letting a waiting insert-intention lock ignore
a granted gap lock simply because it is behind it in the queue, as it will
not really change the sad reality that one has to wait for all gap locks to
be released before performing the insert operation anyway (the inserting
thread will spuriously wake up, notice the gap lock is still there, enqueue
another insert-intention lock, and go to sleep again).
The reason FCFS had this rule seems to be deadlock avoidance.
In the old FCFS code, we only ever ran the deadlock detector when adding a
waiting lock to the queue, but not when granting a lock. Since we can grant a
LOCK_GAP even if there is a LOCK_INSERT_INTENTION in front of it, it was
important not to consider the LOCK_INSERT_INTENTION as "blocked by" LOCK_GAP,
for such a "backward edge" would not be detected, for the simple reason that
we didn't run deadlock detection at all during the granting of LOCK_GAP.
With the recent changes in deadlock detection, which is now run in a separate
thread after each change in the wait-for-graph, this is no longer a problem:
it is not important whether the edge is "forward" or "backward", it is just
important to call lock_wait_request_check_for_cycles() after it is
added/modified, which we do.

Anyway, this new invariant (granted locks are before waiting locks) seems
easy to verify at runtime and is a useful assumption for simplifying some
existing functions, such as lock_add_priority().
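
For illustration, a minimal sketch of such a runtime check, assuming a
simplified queue representation (the real queues live in the lock_sys hash
tables):

  #include <vector>

  // Assumed stand-in: each lock in a record's queue knows whether it is
  // still waiting.
  struct lock_t { bool is_waiting; };

  // Checks the invariant for a single queue: once the first waiting lock
  // has been seen, no granted lock may follow it.
  bool granted_before_waiting(const std::vector<const lock_t *> &queue) {
    bool seen_waiting = false;
    for (const lock_t *lock : queue) {
      if (lock->is_waiting) {
        seen_waiting = true;
      } else if (seen_waiting) {
        return false;  // granted lock behind a waiting one
      }
    }
    return true;
  }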



Changes:
========

1. We no longer need FCFS, so:
1.1. We can remove `bool lock_use_fcfs(const lock_t *lock)` which was used to
decide if we should use FCFS or CATS
1.2. We can remove the `std::atomic n_waiting` global variable which tracked
how many transactions are currently waiting for a lock. This information was
used to choose between FCFS and CATS: CATS was used when n_waiting was above
some threshold.
1.3. We can remove `const int LOCK_CATS_THRESHOLD` mentioned above.
1.4. We can remove `lock_t *lock_rec_get_first(hash_table_t *hash, const RecID
&rec_id)` which was only used by the FCFS variants of the code
1.5. We can remove `use_fcfs` argument from various functions which had two ways
of handling locks depending on the algorithm.
1.5.1 `void lock_cancel_waiting_and_release(lock_t *lock, bool use_fcfs)`
1.5.2. `void lock_rec_dequeue_from_page(lock_t *in_lock, bool use_fcfs)`
1.5.3. `void lock_rec_grant(lock_t *in_lock, bool use_fcfs)`
1.6. We can simplify `bool RecLock::lock_add_priority(lock_t *lock, const lock_t
*conflict_lock)` to always put granted locks in the front of the queue as in the
CATS algorithm
1.7. The mechanism for "forcing" CATS (as opposed to FCFS) in MTRs can be
removed
1.7.1. mysql-test/suite/innodb/include/force_cats.inc can be removed
1.7.2. mysql-test/suite/innodb/include/discourage_cats.inc can be removed
1.7.3. mysql-test/suite/innodb/include/cats_config.inc can be removed
1.7.4. All usages of the above helpers; in particular, all the MTRs which run
once with FCFS and then again with CATS can be simplified by removing the
"CATS forcing"

2. We no longer update the weights from transactions' threads, so:
2.1. We can remove `uint64_t lock_sys_t::mark_age_updated` which was used as
an epoch counter during the update's DFS
2.2. We can remove the `uint64_t trx_t::age_update` which was used together with
the `mark_age_updated` during the DFS to mark visited nodes
2.3. We can remove `void lock_update_trx_age(trx_t *trx, int32_t age)`
2.4. We can remove `void lock_update_age(lock_t *new_lock, ulint heap_no)`

3. We can rename `int32_t trx_t::age` to `std::atomic<int32_t>
trx_lock_t::cats_weight`. Historically the name "age" related to the concept
of the time the transaction had spent in the system, which was important in
Variance-Aware Transaction Scheduling (VATS). After changing the
implementation to CATS, the name "age" was kept, but the semantics has
nothing to do with time anymore. Making the field atomic makes it accessible
from the thread dedicated to weight updating in parallel with the transaction
threads which sort the queue. Moving it from trx_t to trx_lock_t seems
reasonable, as this is a lock_sys-related field.

4. Implement the new algorithm:
4.1. Introduce `lock_wait_compute_initial_weights()` which assigns 1 or
WEIGHT_BOOST to each waiting transaction
4.1.1 Boosting heuristic can be implemented by comparing the already existing
`srv_slot_t::reservation_no` to `lock_wait_table_reservations` to know how many
transactions went to sleep in other slots since a given slot was taken.
4.2. Introduce `lock_wait_accumulate_weights()` which takes the initial
values computed in `lock_wait_compute_initial_weights()` and computes for
each transaction the sum of the weights of all transactions which wait for it
(including itself). In the deadlock-free case this is simply the weight of
the subtree rooted at the given transaction. In case of cycles, we delay the
computation of the weights of transactions on the cycle to the next stage:
`lock_wait_update_weights_on_cycle()`
4.3. Introduce `lock_wait_update_weights_on_cycle()` which, given a deadlock
cycle and the chosen victim, virtually unfolds the cycle into a straight path
and updates the weights of the transactions on the cycle as if the victim had
been removed (see the sketch at the end of this section).
4.4. Introduce `lock_wait_publish_new_weights()` which stores the newly computed
weights in the trx->lock.cats_weight fields
4.5. Plug the new functionality into the relevant places of the deadlock
detector so it can take advantage of the already computed wait-for-graph,
cycles and victims.

5. Change implementation of `lock_grant_cats()`:
5.1. As the `trx->lock.cats_weight` values can be modified in parallel, we
need to load them into an immutable snapshot before sorting (otherwise
std::sort could crash if comparison results were inconsistent over time)
5.2. We can take advantage of the observation that most transactions have the
nominal initial weight, and thus instead of wasting O(n log n) time, we can
process them in O(n) time by simply appending them to the end of the sorted
result.
5.3. We can take advantage of the trx->lock.blocking_trx field to narrow the
candidates set to the transactions which were blocked by the releasing
transaction
5.4. We can remove the logic for trx->age updating
5.5. Rename to `lock_rec_grant_by_heap_no()`

6. Add new column TRX_SCHEDULE_WEIGHT to INFORMATION_SCHEMA.INNODB_TRX

7. Add new counters
7.1. "lock_rec_release_attempts" incremented in lock_rec_grant()
7.2. "lock_rec_grant_attempts" incremented in lock_rec_grant() and other places
     where lock_rec_grant_by_heap_no() is called from
7.3. "lock_schedule_refreshes" incremented in
lock_wait_compute_and_publish_weights_except_cycles()
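
Finally, a speculative sketch of the idea behind
`lock_wait_update_weights_on_cycle()` from point 4.3. The representation of
the cycle is an assumption of this sketch; the point is only how removing the
victim turns the cycle into a path along which the weights then accumulate as
in the acyclic case:

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // subtree_weights[i] is the already accumulated off-cycle weight of the
  // i-th transaction on the cycle, where cycle[i] waits for
  // cycle[(i + 1) % n]. Removing the victim at victim_pos breaks both of
  // its edges, leaving a straight path of n - 1 transactions.
  std::vector<uint64_t> unfold_cycle_weights(
      const std::vector<uint64_t> &subtree_weights, size_t victim_pos) {
    const size_t n = subtree_weights.size();
    std::vector<uint64_t> weight(subtree_weights);
    uint64_t carried = 0;
    // The path starts right behind the victim: the transaction the victim
    // was waiting for no longer has an on-cycle waiter, and every further
    // transaction accumulates the weight of everything waiting for it.
    for (size_t step = 1; step < n; ++step) {
      const size_t i = (victim_pos + step) % n;
      weight[i] += carried;
      carried = weight[i];
    }
    weight[victim_pos] = 0;  // the victim is about to be rolled back
    return weight;
  }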