WL#13574: Include MDL and ACL locks in MTS deadlock detection infra-structure

Affects: Server-8.0 — Status: Complete

Description
Requirements
High Level Architecture
Low Level Design

EXECUTIVE SUMMARY

This worklog aims at integrating the thread serialization infra-structure needed to deliver a multi-threaded replica that keeps the same commit order as the source, with the MDL and ACL access serialization infra-structures. This integration is needed to mitigate three-actors deadlocks that stem from combining waits on the order of commit with waits on MDL and/or ACL locks.

By producing the changes described in this WL, deadlocks that rise from executing client issued statements related with global locks and/or ACL statements on a replica that is actively processing a change-stream, should be mitigated and, eventually, broken. Note that by broken deadlock one should understand that the server will eventually reach a consistent state or a state in which is possible to operate in order to make it reach a consistent state.

Other than retrying to apply the event as many as slave_transaction_retries times in the hope that the one of the actors in the deadlock has exited the critical section, there is no deadlock resolution involving replica multi-threaded applier workers and commit order. This work is not intended to change that mechanism, it's intended that such mechanism is enforced in every identified deadlock, without exception.

USER STORIES

As a user, I wish to enable the multi-threaded applier and commit order preservation on a replica and execute any statement while that same replica is applying events, without the server being indefinitely hang on a deadlock.

GLOSSARY

Terms that may need a definition prior to reading further.

lock: in a computer system, a lock is a shared resource that manages the access to a critical execution block. Processes or thread can request to acquire a lock and, once the request is granted, safely proceed to executing the critical block.
deadlock: in a multi-process or multi-threaded computer system, a deadlock is a no-progress state where two or more processes or threads acquire ownership of a locking mechanism in a way they end-up waiting on each other, indefinitely.
dependency: in a multi-process or multi-threaded computer system, a process or thread P is dimmed dependent on a process or thread Q when Q has the ability to unilaterally determine if P can progress (or not) with it's execution.
graph: a graph is a structure amounting to a set of objects in which some pairs of the objects are in some sense related. The objects correspond to abstractions called nodes and each of the related pairs of nodes is called an edge.
multi-threaded replica: a replica running mode where the SQL applier is composed of several worker threads, applying the transactions from the change-stream in parallel.
worker: a thread that is part of a multi-threaded replica set of threads that will apply the transactions coming in the change-stream in parallel.
commit order: the order by which transactions are commit in the source. This is order is kept and logged so that it may be used to apply transactions in the exact same order on the replicas.
dependency tracking: mechanism in which dependency between transactions is tracked in order to understand if transactions can be applied in parallel on the replica.
MDL: metadata locking management infra-structure and API.
ACL: access control list.
queue: in a computer system, a queue is a FIFO data-structure.
lock-free: in a computer system, a lock-free data structure is one that doesn't need locks in order for multiple processes or threads to execute critical sections of code where such data structures may be accessed or changed, concurrently.
integral: a type is integral if is bool, char, char8_t, char16_t, char32_t, wchar_t, short, int, long, long long, or any implementation-defined extended integer types, including any signed, unsigned, and cv-qualified variants.

IDENTIFIED REQUIREMENTS

R1. A multi-threaded applier with enabled commit order preservation MUST be able to break deadlocks mixing commit order serialization and global locks like the ones taken by SET READ_ONLY = ON or FLUSH TABLES WITH READ LOCK.

R2. A multi-threaded applier with enabled commit order preservation MUST be able to break deadlocks mixing commit order serialization and ACL related locks.

R3. A multi-threaded applier with enabled commit order preservation MUST keep the current retry behavior by enforcing the slave_transaction_retries variable value for the amount of retries, when requested to back-off a transaction deadlocked by R1 and R2 described scenarios.

R4. A multi-threaded applier with enabled commit order preservation MUST keep the current replication channel behavior by stopping both the coordinator and worker threads when the configured amount of transaction retries is reached unsuccessfully, when requested to back-off a transaction deadlocked by R1 and R2 described scenarios.

R5. The observed performance decrease resulting from the changes introduced by this worklog must not be higher than 5%.

STATE-OF-THE-ART

Current approach

The current implementation of the multi-threaded applier uses thread condition signaling as a wait/release mechanism for serializing the applier worker threads turn to commit.

Using a linked list to implement a commit order queue, where each record contains:

the identifier of the worker to commit,
the thread condition to wait on,
a pointer to the next node in the list

the coordinator thread is able to push into it the worker information in the order the commits should be executed. The workers can then apply the transactions in parallel up until the commit stage.

At the commit stage, each worker thread must wait for it's turn to commit according to an algorithm that could be synthesized in the following manner.

Each applier worker:

Checks if it is at the head of the queue.
If not at the head of the queue, wait on the worker's thread condition.
If at the head of the queue:
1. Commit the worker processed transaction or rollback in case a deadlock was found.
2. Pop the head of the queue, moving it to the next node.
3. Signal the thread condition for the worker at the new head.

Deadlock detection with current approach

In terms of the underlying logic needed to serialize the commit order, the current approach has no inherent shortcoming.

However, by using a a waiting/release mechanism specific to the commit order queue to serialize the access to the commit stage, it doesn't take into consideration or gets taken into consideration by some of the other locking infra-structure used by the server during deadlock detection analysis.

Although data access deadlocks identified by the storage engines are already reported and handled by the multi-threaded applier, others like MDL global locks and deadlock detection graphs aren't, making the applier running with commit order preservation prone to three-actor deadlocks that mix commit order queue locks and MDL locks.

An example of a deadlock while executing FLUSH TABLE WITH READ LOCK, assuming table DDLs have already been executed in the replica and the source replicated the following set of sequentially dependent statements:

  INSERT INTO t VALUES (1); -> worker 1
  INSERT INTO t VALUES (2); -> worker 2

At the same time, an application using two different connections will execute:

Connection 1:

  BEGIN:
  INSERT INTO t VALUES (1);
  ROLLBACK;

Connection 2:

  FLUSH TABLES WITH READ LOCK;

A possible execution diagram that will lead to a deadlock:

  ==========================+==========================+==========================
   Application connections  |     Worker thread 1      |     Worker thread 2
  ==========================+==========================+==========================
                            |                          |
  Connection 1              |                          |
  ··························|                          |
  BEGIN;                    |                          |
               Acquired MDL |                          |
                global lock |                          |
                            |                          |
  INSERT INTO t VALUES (1); |                          |
                            |··························|··························
              Acquired data |             Acquired MDL |             Acquired MDL
                 lock for 1 |              global lock |              global lock
  ··························|                          |
                            |             Release MDL  |INSERT INTO t VALUES (2);
                            |         global lock and  |
                            |            wait on data  |            Acquired data
                            |              lock for 1  |               lock for 2
                            |         and hold exec of |
                            |INSERT INTO t VALUES (1); |           Wait on commit
                            |··························|             order signal
                            |                          |         to be emitted by
                            |                          |                 Worker 1
                            |                          |··························
  Connection 2              |                          |
  ··························|                          |
  FLUSH TABLES WITH READ    |                          |
  LOCK;                     |                          |
                            |                          |
            Try to hold off |                          |
   shared access of the MDL |                          |
            global lock but |                          |
         unable because one |                          |
          holder is waiting |                          |
       on the commit signal |                          |
      So, will wait holding |                          |
            the exclusivity |                          |
                on the lock |                          |
  ··························|                          |
                            |                          |
  Connection 1              |                          |
  ··························|                          |
  ROLLBACK;                 |                          |
               Release data |                          |
                 lock for 1 |                          |
  ··························|                          |
                            |··························|
                            |              Wait on MDL |
                            |              global lock |
                            |        held in exclusive |
                            |          by Connection 2 |
                            |                          |
  ==========================+==========================+==========================

The above could be avoided if the thread for Connection 2 would be able to signal and back off Worker 2.

PROPOSED APPROACH

Using graphs for deadlock detection

One technique used to track and identify dependencies between different entities is a directed graph, more specifically, dependency graphs. By representing each entity by a node and the dependency between entities by an edge, we can not only track the dependencies but also identify higher-level dependencies by traversing the paths in the graph.

One set of graph paths is particularly important for the work in this WL, the paths that start and end in the same node, named circularity's or circular paths. In a dependency graph a circularity means that a given entity is, eventually, dependent on itself, meaning that, while using a dependency graph as a locking request tracking mechanism, a circularity represents a deadlock.

A possible formalization of a dependency graph:

Let N be the set of nodes in the graph that represent the threads that may acquire locks.
Let it be A: N x N the set of directional edges that represent a dependency between two elements of N, (x, y) is a member of A if x is a dependency of y or in other words, if y is waiting for a lock acquired by x.
Let it be S: A^i, i ∈ ℕ the set of sequences of A that form paths, ((x1, y1), (x2, y2), ..., (xj, yj)), j ∈ ℕ where for n, m ∈ ℕ, n. In other words, a path is formed when the second component of a given dependency pair is equal to the first component of the element right after, in the sequence.

Let it be C ⊂ S the sub-set of sequences that form circularities, ((x1, y1), (x2, y2), ..., (xj, yj)), j ∈ ℕ ∧ yj = x1

To summarize:

Given a set of threads that may wait on each other acquired locks.

Given a sequence of more that two threads waiting on each other acquired locks.

A circularity is found if the above sequence starts and ends in the same thread.

Translating the previously provided deadlock example into the above described formalization:

Let it be N = {C1, C2, W1, W2} where C1 and C2 are the client connection threads and W1 and W2 are the worker threads, the graph being formed for the above scenario is:

G = {(C1, W1)}

G = {(C1, W1), (W1, W2)}

G = {(C1, W1), (W1, W2), (W2, C2)}

G = {(W1, W2), (W2, C2)}

G = {(W1, W2), (W2, C2), (C2, W1)}

As per the definition of circularity, a sequence s can be extracted from the last state of the graph that belongs to C, meaning, where the second component of the last pair is the first component of the first pair:

s = ((W1, W2), (W2, C2), (C2, W1))

The MySQL server already has an implementation of a dependency graph, used to track metadata locks requests and acquisitions: the MDL infra-structure.

ACL locks, Global locks and the MDL infra-structure

Both ACL related statements and statements exclusively acquiring Global locks use the MDL infra-structure to acquire the ACL related table locks and the global locks, respectively. Therefore, by integrating the commit order serialization with the MDL infra-structure one is implicitly enabling deadlock detection between the commit order serialization mechanism and ACL and global locks.

Therefore, the overall proposed approach is to integrate the commit order serialization with the MDL infra-structure in order to full-fill the enumerated requirements.

One graph to rule them all

As the main issue with the current architecture is the lack of communication between the commit order signaling mechanism and the MDL infra-structure, we propose the use of the MDL infra-structure as the signaling mechanism to serialize the commit order.

By submitting an MDL wait ticket for each worker thread when it needs to wait and releasing the worker by granting the ticket, we are adding the wait request to the MDL graph, allowing for the MDL deadlock detection mechanism to take into account the commit serialization when finding possible deadlocks. Enclosed within the wait ticket is the logic that will allow for any threads searching for a deadlock, to check if the holder of each active wait ticket is part of a circular graph path.

The commit queue should be kept to determine the worker commit order, conceptually no changes are needed to that part of the architecture, we should still have a coordinator that populates the queue and each committing worker should be able to release the following worker to commit.

Re-writing the commit order serialization algorithm, it could be synthesized in the following manner.

Each applier worker:

Checks if it's at the head of the queue.

If at the head of the queue or the ticket is granted:

Commit the worker processed transaction or rollback in case a deadlock was found.

Pop the head of the queue, moving it to the next node.

Grant the ticket held by the worker at the new head.

If not at the head of the queue:

Adds an MDL wait ticket, holding a request to proceed to commit.

Processes the MDL graph for possible deadlocks, searching the nodes in the graph that belong to the graph paths starting at the worker node.

Waits on the requested ticket.

If the ticket isn't granted, it means that the worker should back-off the transaction.

The wait ticket

One of the central pieces in the MDL infra-structure is the MDL wait ticket, materialized in the MDL infra-structure code by the class MDL_wait_for_subgraph.

We need to specialize MDL_wait_for_subgraph in order to allow any other thread trying to find a deadlock, to assert if a given worker's graph node, already waiting on the commit order wait ticket, is part of a graph path that forms a circularity.

Since the position in the commit order queue implies a dependency between threads, where the thread occupying a higher index in the queue depends on the threads occupying a lower index, while traversing the graph searching for a circularity and when a commit order queue wait_for ticket is reached, we need to:

match the ticket's graph node against the initial node, to check if a circularity as become full-circle and, if so, stop the search

allow the search to proceed to the commit order queue wait_for ticket for the worker with the immediate lower index (the dependency represented in the commit order queue)

In theory, there is no possibility of deadlock between worker threads if they are just waiting on the commit order queue ticket, since the commit order dependencies form a linear order, any circularity must contain also other dependencies. Therefore, we don't need to search for deadlocks with workers that appear first in the commit order queue and are already waiting on the commit order ticket, meaning, the above algorithm can be optimized to allow the graph search to process to the worker with the immediate lower index that isn't waiting already on a commit order wait_for ticket.

Avoid unnecessary work & algorithmic complexity

Taking the algorithm described in the previous section, one could argue that in terms of graph traversing heuristics it would be enough to recursively search for the worker Wi that immediately appears before V. That is correct and would keep the order of complexity of the algorithm closer to being linear and not dependent of the number of worker threads. On the other hand, one wouldn't find deadlocks affecting worker threads appearing previously in the commit queue and would be relinquishing that responsibility for future deadlock searches.

A deadlock stemming from the commit order that could be identified earlier would force the workers to back-off earlier and prevent work that we know will be discarded. So, by taking advantage of the linear dependency of the worker threads in the commit queue and trying to find deadlocks earlier, we're increasing the order of complexity of the algorithm but we're also trying to decrease the amount of unnecessary work.

Analyzing the order of complexity of the proposed algorithm in a possible worst-case scenario:

Let n be the number of worker threads in the commit order queue and n is always an even number.

Let Wi be the ith worker thread in the commit order queue.

W1, W3, ... Wk, k = 2i + 1, 0 <= i < n / 2 are already waiting on a commit order ticket (COT = commit order ticket):

W0 W1 W2 W3 Wn-4 Wn-3 Wn-2 Wn-1 Wn +-----+-----+-----+-----+ +-----+-----+-----+-----+-----+ | | COT | | COT | ... | COT | | COT | | COT | +-----+-----+-----+-----+ +-----+-----+-----+-----+-----+

W2, W4, ... Wk, k = 2i, 1 <= i < n / 2 are already waiting on an MDL lock ticket (MLT = MDL lock ticket), where the lock is held by the worker immediately on the left, Wk-1:

W0 W1 W2 W3 Wn-4 Wn-3 Wn-2 Wn-1 Wn +-----+-----+-----+-----+ +-----+-----+-----+-----+-----+ | | COT | MLT | COT | ... | COT | MLT | COT | MLT | COT | +-----+-----+-----+-----+ +-----+-----+-----+-----+-----+

Worker W0 isn't waiting, yet.

Worker Wn creates and adds a new commit order wait ticket and starts a search for a deadlock.

Worker Wn executes the described algorithm and starts iterating over the commit order queue to visit higher priority worker's sub-graph.

The search initiated by Wn will, recursively, execute the search algorithm (n - 1) / 2 + (n - 3) / 2 + ... + 1 <=> Sum[k=0..(n/2)-1] (n - (2k + 1)) / 2 times for a single deadlock search.

For worker m that waits on the commit order queue, the number of paths being traversed in sub-sequent search will be increased by m. The mathematical expression that represents the order of complexity for the n workers (where m is the amount of workers waiting on the commit order queue) is: n = 2*m, P(m) = 1 + Sum[k=1..m-1] T(k), m >= 1.

This amounts to an exponential complexity order with an exponential factor of n / 2 - 1, which can be proven by induction:

Hip: P(m) = 1 + Sum[k=1..m-1] P(k), m >= 1 <=> P(m) = 2^(m - 1) Proof: For 1 ----- P(1) = 1 + Sum[k=1..1-1] P(k) = 1 + Sum[k=1..0] P(k) = 1 + 0 = 1 = 2^0 = 2^(1 - 1) For m + 1 --------- P(m + 1) = 1 + Sum[k=1..m+1-1] P(k) = 1 + Sum[k=1..m] P(k) = 1 + Sum[k=1..m-1] P(k) + P(m) per hip -> = 2^(m - 1) + 2^(m - 1) = 2 . 2^(m - 1) = 2^m = 2^(m + 1 - 1) QED

Since n = 2*m, the order of complexity or O(2^(n/2 - 1)).

The decision to parallelize transactions and distribute them amongst applier worker threads is based on a parallelization algorithm running in the source of replication that identifies and tracks dependencies between transactions. Because of that, the above scenario could be dimmed impossible due the nonexistence of worker thread inter-dependency other than the commit order queue. However, the parallelization heuristics is based on the data received through the replication channel (logical timestamps) that may be generated by a flawed dependency tracking system or by a malicious replication source.

Therefore, all comes down to weighing the probability of the worst case scenario against the advantages of early deadlock detection and deciding which is more fruitful. For now, our assessment is that, given the type of statements that can interfere with dependency tracking and the specific sequence of the statements being applied by the worker threads in the commit queue, the worst case scenario should be uncommon and hard to observe.

Nonetheless, this is a scenario we should keep an eye on and, if needed, revisit this assessment and find alternative paths.

A lock-free commit order queue

Since there is no longer a thread condition to wait on, there is no longer the need for a mutex that protects that condition. Hence, the only reason to keep the mutex would be to serialize the access to the commit order queue. But, in that case, we would have two different serialization mechanisms, again, and other deadlock scenarios could stem from that.

One possible solution would be to exchange a mutex by a read-write lock and optimizing a bit for readers, since our queue is a single-producer single-consumer multiple-observers queue. But then again, the lock could become a contention point where threads may be scheduled out when everything that they need is to read a value. The main problem that may rise from using a read/write lock is the pushing and poping operations being left to wait for a long time because of readers wanting to iterate over the queue. In a read-write lock protecting a normal queue scenario, push and pop will need write access and iteration will need read access. If some thread requests read access to the lock just before the last reader finishes iterating and this scenario occurs consecutively several times, it will keep the lock in read mode for a large amount of time, leaving push and pop waiting too much.

A more suitable solution would be to provide a lock-free queue and a shared-exclusive access lock that allows for: a concurrent push and pop with shared access on the lock; a non-concurrent iteration operation with exclusive access on the lock. A spin-lock can be used for the described scenario, minimizing the possibility of the threads being scheduled out while waiting.

Since the only thing we need to keep in the queue is the identifier of the workers, we can build a lock-free queue on top of a size-bounded circular queue that only keeps integral elements. By doing so, we avoid the memory management constraints of a lock-free linked list, the usual architecture for a more generic purpose lock-free queue.

If the lock-free circular queue is size-bounded to the number of workers plus one, it should be enough to never run out of space, since the implemented algorithm will never add to the queue the identifier of a worker that is already in the queue.

To summarize, in order to remove the need for a mutex or a common read-write lock guarding the commit order queue (removing a contention point), the following should be provided:

A size-bounded, integral element only, lock-free circular queue.

A shared-exclusive spin-lock to protect queue iterations from inconsistent views of the data.

An implementation that:

Acquires shared access for push and pop, allowing for these operations to occur concurrently.

Acquires exclusive access for iterating over the queue, allowing for a consistent view of the data.

IMPACT ANALYSIS

Performance

The above described changes may have an impact in performance since we will need to search the MDL graph for each worker that needs to wait for it's turn. Performance testing should be extensive and a focus when optimizing the code patch.

Storage engines

No special concerns storage engines.

Security context

No special concerns regarding security.

Upgrade/Downgrade

No special concerns regarding upgrade/downgrade.

Cross-version replication

No special concerns regarding cross-version replication.

User interface

No foreseeable changes regarding user-interface.

Observability

The current implementation already has observability instrumentation in place regarding worker executing stages. The added functionality MUST keep that instrumentation intact.

Deployment / installation

No changes to deployment or installation processes.

Protocol

No foreseeable changes to the MySQL protocol.

API AND DATA STRUCTURES

Changes in the current API

The existing Commit_order_manager class will be extended to include a wait_on_graph method, which will replace the logic of waiting on the thread condition.

bool Commit_order_manager::wait_on_graph(Slave_worker *worker);

The set of new APIs needed to implement the above described approach:

Commit_order_queue: will encapsulate all the logic of accessing the data of the worker and it's associated position in the commit queue.

Integrals_lockfree_queue: will encapsulate all the logic regarding a lock-free circular queue.

Commit_order_lock_graph: will encapsulate the MDL graph waiting ticket logic, applied to the commit order waiting requirements.

The interface for the above classes:

class Commit_order_queue { public: class Node { public: std::uint32_t m_worker_id; MDL_context *m_mdl_context; Aligned_atomic<enum_transaction_stage> m_stage; }; Commit_order_queue(size_t n_workers); Node &operator[](size_t idx); bool is_empty(); std::uint32_t pop(); void push(std::uint32_t id); std::uint32_t front(); private: std::vector<Commit_order_queue::Node> m_workers; Integrals_lockfree_queue<std::uint32_t> m_commit_queue; }; template <typename T, T Null = std::numeric_limits<T>::max(), T Erased = Null, typename A = std::nullptr_t> class Integrals_lockfree_queue { static_assert(std::is_integral<T>::value, "class `Integrals_lockfree_queue` requires an integral " "type as a first template parameter"); public: using pointer_type = T *; using const_pointer_type = T const *; using reference_type = T &; using const_reference_type = T const &; using value_type = T; using const_value_type = T const; using index_type = unsigned long long; using element_type = std::atomic<T>; Integrals_lockfree_queue(A &alloc, size_type size); Integrals_lockfree_queue(size_type size); value_type operator[](size_type index); bool is_empty(); size_type head(); size_type tail(); T front(); T back(); value_type pop(); Integrals_lockfree_queue &push(value_type to_push); private: Unique_ptr<unsigned char[], A> m_array; Aligned_atomic<size_type> m_head; Aligned_atomic<size_type> m_tail; }; class Commit_order_lock_graph : public MDL_wait_for_subgraph { public: Commit_order_lock_graph(MDL_context &ctx, Commit_order_manager &mngr, uint32 worker_id); MDL_context *get_ctx() const; uint32 get_worker_id() const; bool accept_visitor(MDL_wait_for_graph_visitor *dvisitor) override; uint get_deadlock_weight() const override; private: MDL_context &m_ctx; Commit_order_manager &m_mngr; uint32 m_worker_id; };

The set of utility classes needed in the implementation of the feature:

Unique_ptr: a smart-pointer that can only have one reference being held and that allows for an Allocator to be provided. It's in every way similar to std::unique_ptr but having the possibility to define an allocator helps with memory tracking and instrumentation.

Aligned_atomic: an std::atomic wrapper that uses memory chunks with size multiple of the CPU cache-line to store the underlying atomic, in order to prevent false sharing.

Sentry: a templated class that receives a predicate in the constructor and executes said predicate when the object instance is disposed of. In other words, a generic purpose RAII sentry class.

Shared_spin_lock: a shared-exclusive lock based on std::atomic that make the thread spin when needing to wait. This technique is often used to prevent thread from being scheduled out when need to wait for small amounts of time.

IMPLEMENTATION PLAN

Step#1: provide the set of utility classes needed to develop the commit order queue, that being: Unique_ptr, Aligned_atomic, Sentry, Shared_spin_lock, Integrals_lockfree_queue.

Step#2: provide the implementation of the commit order queue Commit_order_queue, built to abstract the logic of managing such queue and upon the Integrals_lockfree_queue lock-free queue for keeping the commit order.

Step#3: provide the integration of Commit_order_queue and Commit_order_lock_graph with the Commit_order_manager code.