In this blog series, I'm describing how InnoDB locks data (tables and rows) in order to provide the illusion to clients that their queries are executed one after another, and how this was improved in recent releases.
In InnoDB Data Locking – Part 1 "Introduction" I introduced the basic concepts required to understand the current post:
- databases, tables, rows (like files on a shared drive, spreadsheets inside a file, and rows inside a spreadsheet)
- serializability of transactions (ability to explain states observed over time with a convincing story about relative order of parallel operations)
- timeouts (for misbehaving lock owners, and to resolve deadlocks)
- reader-writer lock (shared/exclusive access rights)
- starvation (permanent inflow of readers starving a writer waiting for its turn)
- queueing (FIFO, or priority)
- read views (read-only snapshots which allow stale reads concurrent to new writes)
- granularity of a lock (access right to everything vs. only the resources needed )
- deadlocks caused by granularity, and ways to overcome them by lock ordering
Summary of this post: I'll describe how deadlock detection works in InnoDB 8.0.18, and thus introduce the following concepts:
- wait-for graph
- a deadlock cycle
- a deadlock victim
An example of a deadlock
In InnoDB Data Locking – Part 1 "Introduction" we've seen a simple example of a scenario in which two people can not finish what they've started, because each is waiting for the other to release a resource: ABe already had Read access to file A and requested Write access to file B, but had to wait for BAsil to first release his Read access rights to that file; however, BAsil can not do that until he finishes what he planned, which was to obtain Write access to file A, currently in use by ABe.
Why can’t they be more polite?
Perhaps it is instructive to first answer a lingering question: why can't one of them simply release the Read access right as soon as they are done reading the first file, and before requesting the Write access right for the next file? For example, why not let ABe release his Read access right to file A, so that BAsil can obtain the Write access right to file A, finish his work, release all held access rights, and thus let ABe continue without any further delays? This would surely eliminate the deadlock, so why not do this seemingly smart thing? The short answer is: it would effectively be indistinguishable from cutting a transaction into two smaller ones, which might have surprising results. Remember that to achieve the promised serializability the server has to come up with a convincing lie about the order of all transactions. If we cut ABe's transaction into two independent transactions, then the server would be allowed to behave as if the order was:
ABe.part1 << BAsil << ABe.part2
(Or, if it was a really nasty server, it could even pretend that ABe.part2 happened before ABe.part1, but thankfully a single-master InnoDB never cheats as much)
What does such a split and reordering mean in practice? Let’s say that ABe’s plan was to “set the Balance in file B to the number of Apples in file A“:
ABe1. open file A for Read
ABe2. read number of Apples from file A  #let's call this number $ApplesSeenByABe
ABe3. open file B for Write
ABe4. write $ApplesSeenByABe to file B as the Balance
and that the plan of BAsil was “set the number of Apples in file A to the Balance in file B“:
BAsil1. open file B for Read
BAsil2. read the Balance from file B  #let's call this number $BalanceSeenByBAsil
BAsil3. open file A for Write
BAsil4. write $BalanceSeenByBAsil to file A as the number of Apples
Clearly, both of them wanted to make both files contain the same number, and you can see for yourself that whatever the relative order of the two plans (transactions), the end result should be that both files contain the same number, no matter what the initial condition was.
For example, assume initially there were 10 Apples in file A, and Balance was 0 in file B.
If ABe's transaction happened before BAsil's, then ABe would change Balance to 10, and BAsil's transaction would not change much, leading to a final state in which Apples=Balance=10.
And if BAsil's transaction happened before ABe's, then BAsil would see Balance=0 and set Apples to 0, and then ABe would fruitlessly copy this 0 back to B, leading to a final state in which both Apples and Balance are 0.
But, if we allowed the server to release the Read access right to file A held by ABe prematurely, then the following interleaving can happen:
ABe1.   open file A for Read
ABe2.   read number of Apples from file A  #$ApplesSeenByABe=10
BAsil1. open file B for Read
BAsil2. read the Balance from file B  #$BalanceSeenByBAsil=0
ABe2b.  prematurely close file A  #<-- HERE !!!
BAsil3. open file A for Write
BAsil4. write $BalanceSeenByBAsil (which is 0) to file A as the number of Apples
ABe3.   open file B for Write
ABe4.   write $ApplesSeenByABe (which is 10) to file B as the Balance
The end result here is a mismatch between Apples=0 in file A and Balance=10 in file B, and neither of them realizes there even is a problem until looking into the files again when doing the monthly report. Oops!
This end result is not consistent with any possible ordering of the two transactions, but it is consistent with a history in which ABe’s transaction got split into two:
ABe’s First transaction:
ABe1. open file A for Read
ABe2. read number of Apples from file A  #let's call this number $ApplesSeenByABe
ABe’s Second transaction:
ABe3. open file B for Write
ABe4. write $ApplesSeenByABe to file B as the Balance
and the server insists it executed them in the order:
ABe.first << BAsil << ABe.second
(This is intuitively why releasing locks too early can be seen as committing a transaction and starting a new one)
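If you prefer to see the arithmetic rather than trace it by hand, here is a tiny Python sketch (a toy simulation of the plans above, not anything database-related) which replays the two serial histories and the problematic interleaving:

def serial_abe_then_basil(apples, balance):
    balance = apples   # ABe: set Balance to the number of Apples
    apples = balance   # BAsil: set Apples to the Balance
    return apples, balance

def serial_basil_then_abe(apples, balance):
    apples = balance   # BAsil goes first
    balance = apples   # then ABe
    return apples, balance

def interleaved_with_premature_release(apples, balance):
    apples_seen_by_abe = apples        # ABe2: reads 10
    balance_seen_by_basil = balance    # BAsil2: reads 0
    # ABe2b: ABe's Read access to file A is released prematurely here
    apples = balance_seen_by_basil     # BAsil4: writes 0 to file A
    balance = apples_seen_by_abe       # ABe4: writes 10 to file B
    return apples, balance

print(serial_abe_then_basil(10, 0))               # (10, 10)
print(serial_basil_then_abe(10, 0))               # (0, 0)
print(interleaved_with_premature_release(10, 0))  # (0, 10) -- matches neither serial order

The interleaved run produces Apples=0 and Balance=10, which matches neither serial outcome: exactly the mismatch described above.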
How to avoid deadlocks?
So, if releasing access rights (locks) prematurely is not an option, then what are other options to avoid a deadlock?
We've already seen one previously: request all the needed access rights up front. This is easier said than done in many cases, and still requires care to request them in the correct order. In ABe's and BAsil's case, the following revised plans guarantee that no deadlock can occur ("deadlock freedom"):
new-ABe1. open file A for Read
new-ABe2. open file B for Write
new-ABe3. read number of Apples from file A  #let's call this number $ApplesSeenByABe
new-ABe4. write $ApplesSeenByABe to file B as the Balance
new-BAsil1. open file A for Write
new-BAsil2. open file B for Read
new-BAsil3. read the Balance from file B  #let's call this number $BalanceSeenByBAsil
new-BAsil4. write $BalanceSeenByBAsil to file A as the number of Apples
How can we prove deadlock freedom? For that it will be helpful to introduce the concept of a waits-for graph. A directed graph is a set of dots ("nodes") and arrows ("edges") connecting some of them. What matters is the relation between the nodes: which two are connected and which aren't, not their exact position in the drawing, so you can lay out the same relation in whatever way you find useful.
Our waits-for relation will involve transactions (people) waiting for resources (files) which are currently in possession of other transactions. Thus there are two kinds of nodes (transactions and resources) and two kinds of edges
(T --waits-for-access-to--> R and R --is-accessed-by--> T)
For example, this picture shows ABe, BAsil, file A and file B at the beginning of our story, where nobody had access to any resource yet.
ABe            fileA

BAsil          fileB
Now, let's see how this picture evolves when we execute steps of their original plans in the following order:
ABe1. open file A for Read
ABe   <--is-accessed-by--  fileA

BAsil                      fileB
ABe2. read number of Apples from file A
(this does not request any new access rights nor grants them, so let’s move on)
BAsil1. open file B for Read
ABe   <--is-accessed-by--  fileA

BAsil <--is-accessed-by--  fileB
BAsil2. read the Balance from file B
(no change to the picture)
ABe3. open file B for Write
(this operation will be blocked, ABe will have to wait)
ABe   <--is-accessed-by--  fileA
  \
   \--waits-for-access-to--\
                            \
                             v
BAsil <--is-accessed-by--  fileB
BAsil3. open file A for Write
(this operation will also be blocked, BAsil will have to wait)
ABe   <--is-accessed-by--  fileA <-.
  \                                |
   \--waits-for-access-to--\       |
                            \      |
                             v     |
BAsil <--is-accessed-by--  fileB   |
   \                               |
    \-----waits-for-access-to------/
Now, what we observe in the real world is that both ABe and BAsil are stuck waiting, and eventually at least one of them will time out.
But how does this real-world problem manifest in the picture? As a cycle:
ABe --> fileB --> BAsil --> fileA --> ABe ...
Whenever we can see such a cycle, we can be sure that in reality there is a set of transactions which can not proceed due to the deadlock, thus we call it a “deadlock cycle”.
Why is that? Because the only edges going out of transactions are of the "waits-for-access" kind and lead to resources, and the only edges going out of resources are of the "is-accessed-by" kind and lead to transactions, so the cycle must alternate between transactions and resources, have an even length, and involve an equal number of transactions and resources. Each transaction on the cycle is blocked waiting for a resource it can not obtain, because that resource is in possession of another transaction which is also waiting. None of the transactions in the cycle can proceed, thus the cycle will persist until we resolve the deadlock.
So, why is a deadlock impossible in the modified version of the plans (new-ABe1..new-ABe4 and new-BAsil1..new-BAsil4)? Because if you look at the picture:
ABe            fileA

BAsil          fileB
And try to imagine any deadlock cycle involving ABe and BAsil, it soon becomes clear that it would have to have at least one incoming edge and one outgoing edge for both ABe and BAsil:
ABe   <--...               fileA
  \
   \_____..

BAsil <--...               fileB
   \
    \__...
otherwise they would not be on a cycle. But, having an incoming edge means already holding some resource, and having an outgoing edge means requesting a resource, so we can refine our unfinished sketch to:
ABe   <--is-accessed-by-..   fileA
  \
   \--waits-for-access-to-..

BAsil <--is-accessed-by-..   fileB
   \
    \--waits-for-access-to-..
And if we look at their execution plans, we can easily identify where these edges must be coming from and where they are leading, because we know that each of them first acquired access to file A (steps new-ABe1 and new-BAsil1) before requesting access to file B (steps new-ABe2 and new-BAsil2), which means that the only way to finish this picture is this:
      ______________is-accessed-by_______________
     /                                           \
    |    ABe   <--is-accessed-by--------------  fileA
    |      \
    |       \--waits-for-access-to-----------\
    v                                         \
  BAsil                                        v
     \                                       fileB
      \                                        ^
       \--waits-for-access-to------------------/
But this is not a deadlock cycle – note the direction of arrows, they do not form a cycle! 🙂
Also, this picture depicts a violation of the assumption that Write access to file A should be exclusive, as we can see two arrows going out of file A, so we have a contradiction with our assumptions.
This picture can be interpreted as two paths leading from file A to file B.
In general, if everybody follows a rule to acquire resources in the same order, the picture will always look like all the paths follow this order. And this is why a cycle is impossible: it would have to somehow get back from a later resource to an earlier resource, but all edges lead forward only.
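As a small illustration of this rule, here is a sketch using plain Python threading locks (a simplification: these are exclusive-only, unlike InnoDB's shared/exclusive access rights, and the names and helper functions are invented for this example):

import threading

# A fixed global order in which every transaction must acquire files.
LOCKS = {"fileA": threading.Lock(), "fileB": threading.Lock()}
GLOBAL_ORDER = ["fileA", "fileB"]

def acquire_in_global_order(needed):
    """Acquire the locks for all needed files, always walking the global order."""
    acquired = []
    for name in GLOBAL_ORDER:
        if name in needed:
            LOCKS[name].acquire()
            acquired.append(name)
    return acquired

def release_all(names):
    # Release in reverse order of acquisition (a common convention).
    for name in reversed(names):
        LOCKS[name].release()

# Both new-ABe and new-BAsil need {fileA, fileB}: whichever of them grabs fileA
# first will also get fileB, finish and release, so the other one may wait,
# but can never deadlock.
held = acquire_in_global_order({"fileA", "fileB"})
release_all(held)

Because every caller sorts its requests by the same global order before acquiring anything, no waits-for path can ever lead from a later resource back to an earlier one.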
Deadlock detection
How can we find deadlocks in an automatic way? There are many possible algorithms, but one has to take into account that the graph is potentially huge (many concurrent transactions, many possessed resources) and constantly changing as new edges and nodes appear and disappear.
Dealing with these dynamic changes can be simplified by the observation that removing an edge or a node can not introduce a deadlock cycle. It is only when a new edge is added that a new cycle can be formed, and moreover this new edge must be the missing link that closes the cycle. This suggests an approach in which the search for a deadlock cycle is only performed when a new edge is added – which mostly occurs when a transaction T requests an access right to a resource R and has to wait – and should focus on answering a reachability question: is there already a path leading from R to T?
(Another non-trivial case when a new edge might appear is when a transaction is being granted a Read access to a resource before another transaction already waiting for a Write access to it – we ignore this here, assuming that algorithms guarding against starvation preclude this case. An even more complicated case is when a transaction requests a LOCK_GAP and there already is a transaction waiting for a LOCK_INSERT_INTENTION on this gap. InnoDB allows "readers" to bypass "inserters" in such cases, which means a new edge leading to our node is added, which also requires us to check for deadlocks in the opposite direction. A bug in this area was one of the motivations for reimplementing the deadlock detection algorithm.)
If such a path exists then adding the edge will cause a deadlock, otherwise the deadlock will not be introduced by this particular request, so we can continue.
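In Python-like pseudocode (a toy model of the idea, not InnoDB's actual data structures), the check performed when transaction T is about to start waiting for resource R could look like this:

def would_deadlock(accessed_by, waits_for, resource, transaction):
    """accessed_by: dict resource -> set of transactions currently holding it
       waits_for:   dict transaction -> resource it is blocked on (or None)
       Returns True if adding transaction --waits-for--> resource closes a cycle."""
    visited = set()
    stack = [resource]
    while stack:
        r = stack.pop()
        for holder in accessed_by.get(r, ()):
            if holder == transaction:
                return True                  # found a path R ~> T: a deadlock cycle
            if holder not in visited:
                visited.add(holder)
                blocked_on = waits_for.get(holder)
                if blocked_on is not None:
                    stack.append(blocked_on)
    return False

# The ABe/BAsil situation just before step BAsil3:
accessed_by = {"fileA": {"ABe"}, "fileB": {"BAsil"}}
waits_for = {"ABe": "fileB", "BAsil": None}
print(would_deadlock(accessed_by, waits_for, "fileA", "BAsil"))  # True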
Once a deadlock cycle is identified, we need to somehow "resolve it". As explained earlier we can't just remove an edge (release a single access right prematurely). We have to remove a whole node (roll back one of the transactions). We pick one of the nodes on the cycle to be a "deadlock victim" and sacrifice it to let others proceed. There are various strategies for choosing the right deadlock victim. (InnoDB avoids killing high-priority transactions used by group replication, and favours sacrificing "lighter" transactions as defined by INFORMATION_SCHEMA.INNODB_TRX.TRX_WEIGHT.)
If only things were so simple. The way I've told the story, you might get the impression that there is always at most one edge outgoing from each node. While it seems true that a transaction can clearly state a single resource it is currently waiting on, it is generally not true that each resource is accessed by just one transaction at a time. In other words, there might be multiple "--is-accessed-by-->" edges outgoing from a single resource to several transactions. The reasons include:
- there may be several transactions having a shared (Read) access right to a resource granted at the same time
- to avoid starving transactions already waiting for a resource, newcomers have to wait until the existing waiters finish their wait, which in some sense is like the old waiters "blocking" the resource they are "waiting for". Thankfully, this just adds loops to the graph which can be ignored
- (actually in InnoDB transactions often request access to both a row and the gap before it at the same time, and it is a matter of perspective if you prefer to model it as having two outgoing edges from one transaction to two separate resources [a gap and a row], or you prefer to model it as a single resource [a gap+row] with additional complex access rights [ReadRow, ReadGap, ReadGapAndRow, WriteRow, WriteGap, WriteGapAndRow, WriteRowAndReadGap, ...] with non-trivial "compatibility" rules which might permit several transactions to share the resource, and thus you have multiple edges outgoing from the resource. As of today, InnoDB uses the latter approach, flexibly using the "Lock splitting" technique to reap the benefits of the former approach)
However, it remains true that, basically, one has to use some kind of search (BFS, DFS, ...) through the graph to see if, after adding the new edge from T to R, there is now a path leading back to the place we've started. This was roughly the old implementation, used in MySQL up to and including 8.0.17.
To give a more detailed description of how it worked, we have to recall from the previous article that the Lock System represents requested access rights as objects in memory called "locks". The "wait-for graph" was not explicitly stored in memory in 8.0.17, but could be inferred on the fly from the lists of lock objects for each resource and a single pointer in each transaction pointing to the lock it is currently waiting to be granted. Given the lock the transaction is currently waiting for, you could inspect the list of all the locks related to the same resource and check which of them conflict with your lock and who owns them – this way you could deduce which transaction your transaction is waiting for. This means that, while conceptually simple, the DFS over the "wait-for graph" in 8.0.17 required rather complicated low-level code which iterated over lists of locks, latching the whole Lock System while doing so. Also, to avoid an N^2 penalty from checking the same lock multiple times for conflicts with several transactions in the same queue, it was a DFS over locks – marking visited locks rather than marking visited transactions or resources. Given that locks are "requests for access", they are more like edges than nodes of the "wait-for graph", which means the DFS was rather difficult to wrap your head around. To avoid stack overflow, it was both flattened to a loop with manual stack management and subject to various hard limits on the number of steps taken. The bottom line: it wasn't the most beautiful piece of code and was a bottleneck on a critical path. I have no big hopes that this video will help you understand how it worked, but let's give it a try:
In this video transactions (T1, T2, ...) are depicted as triangles and locks are circles: green for (G)ranted or red for (W)aiting. If a transaction is waiting for a lock, it points to the lock it is waiting for. The algorithm scans the list of locks related to the resource T1 is waiting for [row1], and for each lock which conflicts with the one T1 is waiting for, it checks if the lock owner (say T2) is also waiting for something, in which case it descends "recursively". Once all locks before the waiting lock in question are checked in this way, it backtracks. If at any point it finds a path leading back to T1, it reports a deadlock cycle.
Now, let me give you a high level intuitive idea of the change we’ve introduced in 8.0.18.
New deadlock detection algorithm
One problem with the old algorithm was that to operate safely and correctly it had to “stop the world” for the duration of graph traversal, and the graph could be potentially huge as it contained both transactions and resources held by them.
The first observation is that you can work on a much smaller graph which consists only of transactions, but does not explicitly represent resources: an edge from transaction A to transaction B means that transaction A "waits for a resource being accessed by transaction B".
Or in short: A --waits-for--> B. In more mathematical words: there exists a resource R such that A --waits-for-access-to--> R --is-accessed-by--> B. So, a cycle in the big graph must correspond to a cycle in the smaller graph and vice-versa.
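Here is a minimal Python sketch (again a toy model, not InnoDB code) of this collapse from the two-kinds-of-nodes graph to the transactions-only graph:

def collapse(waits_for_resource, accessed_by):
    """waits_for_resource: dict transaction -> resource it is blocked on
       accessed_by:        dict resource -> set of transactions holding it
       Returns dict transaction -> set of transactions it waits for."""
    waits_for_trx = {}
    for trx, resource in waits_for_resource.items():
        holders = accessed_by.get(resource, set())
        waits_for_trx[trx] = {h for h in holders if h != trx}
    return waits_for_trx

print(collapse({"ABe": "fileB", "BAsil": "fileA"},
               {"fileA": {"ABe"}, "fileB": {"BAsil"}}))
# {'ABe': {'BAsil'}, 'BAsil': {'ABe'}}  -- the two-transaction deadlock cycle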
Thus, we could imagine a deadlock detection algorithm which works on this smaller graph instead and produces the same results. But this raises the question of how to create and maintain such a waits-for graph without wasting too many resources.
Note that even though the set of nodes shrank, we may still have many edges (if there were many edges outgoing from R, they will be "inherited" by A).
And if we still have to "stop the world" to recreate such a graph each time, then it might defeat the purpose.
The second observation is less trivial: we can create a “sparse graph” by selecting just the oldest outgoing edge for each transaction, and use that instead of the original “dense graph” to detect deadlocks.
Let me first give you some intuition, at a "hand-waving" level, of why this is so:
The edges in the “dense graph” which form a cycle can not vanish from the “dense graph”, precisely because their nodes are deadlocked. And edges which belong to unblocked transactions will eventually vanish from the “dense graph”, so in a finite time the “sparse graph” will have to select the edges forming a deadlock cycle.
Proof that deadlock detection can be restricted to the “sparse graph”
Now, to make this proof more formal we will assume that:
- the number of transactions in the system is bounded (by Tnum),
- the number of resources accessed by a single transaction is bounded (by Rnum),
- resources are granted in a way preventing starvation (at minimum: at most Tnum other transactions can be granted a resource transaction T waits for before it is its turn),
- transactions do not wait on anything other than resources (in particular, they finish or request the next resource in finite time).
The proof will be by contradiction, so let's assume that at some point in time there appears a deadlock cycle inside the "dense graph" such that it goes unnoticed forever: none of its nodes is ever chosen to be the deadlock victim. (If there is more than one such cycle, pick any of them, say the one which formed the earliest and whose sequence of involved transaction ids is lexicographically first, or use some other deterministic rule which avoids the axiom of choice :D).
Let’s color the cycle’s edges and nodes red.
Now, let’s look at the sparse graph: clearly it must be missing at least one of the red edges, as otherwise the algorithm would have to choose at least one of the red nodes as a victim to resolve the red deadlock cycle which would contradict the assumption that the red nodes went unnoticed forever.
Quite possibly the sparse graph already contains some of the red edges – observe that once a red edge appears in the sparse graph, it will stay in it forever, because the only way a red edge can vanish is when one of its endpoint transactions finishes, which we assumed never happens. So the set of red edges in the sparse graph only grows over time, and as it is bounded (by Tnum), it must eventually stop growing.
Let’s fast forward to this moment in time.
Let’s color blue all of the nodes and edges in the dense graph that were not already red, so that each node and edge is either red or blue at this moment in time, and the edges and nodes added in the future will be black.
Now, a mathematical trick from order theory: we will put a pair of counters storing natural numbers in each node and will update them following some rules. Intuitively, the first counter limits how many resources this transaction has yet to acquire, and the second counter limits how many other transactions can bypass it while it waits for a resource before starvation.
Initially we set all counters to <Rnum,Tnum>, and also each time a new (black) node appears, we put <Rnum,Tnum> in it. Each time a new edge in the sparse graph appears, it might be either because the transaction requested a new resource (we decrement the first counter of the pair, and reset the second counter back to Tnum) or because the resource it waits for got granted to another transaction which was luckier (we decrement the second counter). We don't know which one happens and let the adversary pick the worst. But we know that, because of starvation freedom, the second counter can not drop below zero (as there are at most Tnum transactions before our turn), and because each transaction requests a bounded number of resources, the first counter can't drop below zero either.
So, now let’s look at one of the red nodes which does not have a red outgoing edge in the sparse graph. (Recall that the “sparse graph” contains all of the nodes of the “dense graph”, so in particular it contains all of the red nodes).
It must have *some* outgoing edge, because there is at least one outgoing edge in the dense graph (the red one that we haven’t picked). Because we pick the oldest edge, and black edges are newer than the red one which is still available, it must be that we’ve selected a blue edge.
As each node in the sparse graph has at most one outgoing edge, let’s follow the only possible path in the sparse graph which starts in this node.
As the number of nodes is finite it has to either loop back to some node it has visited earlier or stop in a node with no outgoing edges.
Let’s note down the sequence of numbers along the path (until it loops) <R1,T1>,..,<Rk,Tk>.
In case it loops, the algorithm will detect the cycle and choose one of its nodes as a deadlock victim (and we assumed it is not our red node). This means that one of the loop's nodes will be removed from the graph, and the node before it will need updating: either it will be granted the resource, and thus have no outgoing edge (at least for a finite time), or it will have to wait because someone else got the resource before it (in which case we decrement its second counter). In any case the path will become lexicographically smaller.
In case the path stops in a node which does not have an outgoing edge, it means that in a finite time this last node will either finish (and thus, again, either the path will become shorter or the node before it will get updated to something lexicographically smaller) or will request a resource, in which case the path may become longer, but the first counter will have to drop, so it will also be lexicographically smaller.
OK, so in a finite time the sequence of numbers associated with the path will become smaller lexicographically.
Now is the moment you've seen coming: there are no infinitely decreasing chains of bounded-length sequences of bounded numbers! Haha. After all, each such sequence can be seen as a very big 2*Tnum-digit integer in base Tnum+Rnum+1, and it gets decremented by at least 1 each time, and you can not decrement a natural number infinitely many times, can you?
What does it mean? It means that after a finite time our red node will have to switch the outgoing edge to something else. But there are only a finite number of blue edges. So after a finite time our red node will have to pick the red edge, but this contradicts the assumption that no more red edges will appear.
Contradiction ends the proof.
But – you may object – the assumption that the number of resources accessed by a transaction is bounded is unfair: after all, the number of rows in a database can grow arbitrarily. And you may suspect that equating "starvation freedom" with "each request is granted in an a priori bounded time" is not the usual definition. Fine. But you must admit that the number of rows is finite, and that to avoid starvation a request must be granted in finite time, so let's agree on a compromise: you can put any pair of natural numbers, as big as you want, into each new node. You see, the proof will still work, because there is no infinitely long decreasing chain of bounded-length sequences of natural numbers either, BUAHAHA! 🙂
You can see this by induction: clearly it holds for sequences of length <=1. And if we look at just the first element in each sequence, then it must be non-increasing, starting from some finite natural number. If there were an infinitely long decreasing chain of sequences not longer than N+1, then we could group them based on this first element, and at least one of the groups would contain an infinitely long decreasing chain of sequences all starting with the same number, because we have just a finite number of groups and an infinite number of elements in them. If we chop this first element off of them, we get an infinitely long decreasing chain of sequences not longer than N. Contradiction ends the proof. Haha 🙂 Another way to see this: it's a very boring countdown, but as the last number keeps getting smaller and smaller, eventually it will have to reach zero and the previous one will have to drop; so in finite time the last-but-one will get smaller, so eventually it will reach zero, and the one before it will have to decrease, etc.
But, but... – you may object – the assumption that sequences are of bounded length is unfair, once we've already granted that the system can grow over time! Yes, it is finite at any given time, but not bounded. Ugh, that's a tougher one: are there infinitely long decreasing chains of finite sequences of natural numbers? Well, yes there are: <1>, <0,1>, <0,0,1>, <0,0,0,1>, ... So, yes, if your server can handle an arbitrarily high number of simultaneous transactions, then this proof might not work – but there are only so many sockets, RAM modules and clients in the world.
OK, so much for the theory and astronomically huge numbers. In practice deadlocks are spotted almost as soon as they appear most of the time, because most of the time the situation is quite simple: transactions don't get starved almost to death, cycles don't involve almost all transactions, we don't get extremely unlucky at selecting edges, etc. The theoretical proof above is just to give us some warm cozy feeling that even in the worst case imaginable it should work. Eventually.
What it buys us is that instead of a big messy graph, we can simply store a single pointer per transaction which points to the transaction it currently waits for (or nullptr). And graphs in which each node has at most one outgoing edge are trivial to navigate: there is only one way forward, so you can find a cycle in constant memory and linear time. But there is even more you can do once you don't have to deal with such a complex data structure as a graph!
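For illustration, here is a sketch (a toy model, not the actual implementation) of such a cycle search over a mapping with at most one outgoing edge per transaction, using the classic two-pointer technique to stay within constant memory:

def on_a_cycle(waits_for, start):
    """waits_for: dict transaction -> transaction it waits for (or None).
       Returns True if following the pointers from `start` runs into a cycle."""
    slow = fast = start
    while fast is not None and waits_for.get(fast) is not None:
        slow = waits_for.get(slow)                 # one step
        fast = waits_for.get(waits_for.get(fast))  # two steps
        if slow is not None and slow == fast:
            return True
    return False

print(on_a_cycle({"ABe": "BAsil", "BAsil": "ABe"}, "ABe"))  # True
print(on_a_cycle({"ABe": "BAsil", "BAsil": None}, "ABe"))   # False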
The third observation is that you don’t have to “stop the world” to search for a cycle in the sparse graph. You see, IF there is a deadlock cycle, THEN it must involve only the transactions which are already waiting. They are not going anywhere 🙂
It turns out InnoDB already had an array which held all the currently waiting transactions, so detecting a cycle is as simple as iterating over this array to note down the reasons they are waiting, and running a simple linear algorithm to detect a cycle in the copied data. As the algorithm works on a copy, you don't need to "stop the world" while it runs, and if you find a cycle in it, you have a guarantee that the cycle still exists in the real dense graph, because it can't resolve itself (well, almost: remember timeouts and other ways to kill a query). You might think that we need to "stop the world" at least for the duration of taking the snapshot. Ha ha. No.
The fourth observation is that we only need to make sure that the transactions we iterate over do not get freed from memory while making the snapshot, so it is safe to copy the current reason for the wait from them – they can still be granted resources (edge gets removed), or change the reason they wait (edge's endpoint changes) while we iterate. We just need to avoid torn reads of the pointer itself (using atomics), but otherwise it doesn't matter if the snapshot is consistent. Again, what matters is that IF there is a red cycle THEN it stays there until we notice it. So, the important fragment of the snapshot will be consistent. Yes, there might appear some false positives: if you stitch together patches of edges from different moments in time, then the resulting "frankenstein-graph" might look like there is a cycle where in fact there was never one. But this can be avoided with some minor extra work to track ABA issues and by stopping the world only to finally double-check that the deadlock cycle candidate really is a deadlock (we have almost no false positives, and real deadlocks should happen rarely enough that "stopping the world" for each of them is not a big deal). And the reason we have almost no false positives is that, if you really think it through, even if the cycle was stitched from edges from different moments in time, then because transactions use two-phase locking, the only way a cycle can be a false positive is if somehow it already got resolved due to one of the transactions being aborted (the chances of which you can minimize by running the deadlock detection from the same thread which is responsible for detecting timeouts) or you've mistaken one transaction for another (say, because it reused an id or pointer address or slot number or whatever you use for identification), so you just need to be careful with ABA errors.
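Putting the third and fourth observations together, a very rough sketch of the whole background check could look like this (all names here are invented for illustration, it reuses on_a_cycle() from the previous sketch, and the real implementation obviously differs):

import threading
from dataclasses import dataclass

@dataclass
class Waiter:
    trx_id: str
    blocking_trx_id: str   # in the real system this field is read atomically

lock_sys_latch = threading.Lock()

def deadlock_check(waiting_transactions):
    # Step 1: snapshot each waiter's "whom am I waiting for" pointer, no global latch held.
    snapshot = {w.trx_id: w.blocking_trx_id for w in waiting_transactions}
    # Step 2: search the snapshot for a cycle (on_a_cycle() is the helper shown earlier).
    candidates = [t for t in snapshot if on_a_cycle(snapshot, t)]
    if not candidates:
        return None
    # Step 3: only now "stop the world" to double-check that the candidate cycle still
    # exists in the live graph and to pick a victim (stubbed out here).
    with lock_sys_latch:
        return candidates[0]

print(deadlock_check([Waiter("ABe", "BAsil"), Waiter("BAsil", "ABe")]))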
Observability (demo)
InnoDB keeps track of some statistics related to deadlock detection. You can see them in INFORMATION_SCHEMA.INNODB_METRICS:
- lock_deadlocks – Number of deadlocks
- lock_timeouts – Number of lock timeouts
- lock_deadlock_false_positives – Number of times a heuristic found a spurious candidate deadlock cycle in the wait-for graph
- lock_deadlock_rounds – Number of times a wait-for graph was scanned in search for deadlocks
- lock_threads_waiting – Number of query threads sleeping waiting for a lock
The above descriptions come from information_schema.innodb_metrics itself:
SELECT name, comment
FROM INFORMATION_SCHEMA.INNODB_METRICS
WHERE name IN ('lock_deadlocks','lock_timeouts','lock_deadlock_false_positives','lock_deadlock_rounds')
You can for example find how many deadlocks occurred so far by using:
SELECT `count` FROM INFORMATION_SCHEMA.INNODB_METRICS
WHERE NAME="lock_deadlocks";
Also, you can get information about the latest deadlock by using:
SHOW ENGINE INNODB STATUS;
and looking into the (possibly absent) LATEST DETECTED DEADLOCK section.
For example, let's simulate the ABe and BAsil example in the SQL world. (Actually, we will use a slightly more sensible and granular approach in which we request access to individual "cells" ("Apples" or "Balance"), as opposed to whole "files" ("fileA" and "fileB"), but the resulting deadlock will have the same structure. We have to do it this way because table-level locks in InnoDB interfere with table-level locks of the MySQL server itself in ways which would make the explanation more complicated.)
First let’s prepare the stage:
CREATE DATABASE test;
use test;
CREATE TABLE fileA (name VARCHAR(10) PRIMARY KEY, value INT);
CREATE TABLE fileB (name VARCHAR(10) PRIMARY KEY, value INT);
INSERT INTO fileA (name,value) VALUES ("Apples",10);
INSERT INTO fileB (name,value) VALUES ("Balance",0);
Now, in ABe’s client:
ABe> BEGIN;
ABe> SELECT value FROM fileA WHERE name='Apples' FOR SHARE;
+-------+
| value |
+-------+
|    10 |
+-------+
We can confirm that, as expected, ABe got his access to "Apples" granted:
> SELECT ENGINE_TRANSACTION_ID as trx_id,OBJECT_NAME as `table`,INDEX_NAME,LOCK_DATA,LOCK_MODE,LOCK_STATUS FROM performance_schema.data_locks WHERE LOCK_TYPE='RECORD';
+-----------------+-------+------------+-----------+---------------+-------------+
| trx_id          | table | INDEX_NAME | LOCK_DATA | LOCK_MODE     | LOCK_STATUS |
+-----------------+-------+------------+-----------+---------------+-------------+
| 284271835218160 | filea | PRIMARY    | 'Apples'  | S,REC_NOT_GAP | GRANTED     |
+-----------------+-------+------------+-----------+---------------+-------------+
The performance_schema.data_locks table contains a row for each access right a transaction has requested. Please note how a transaction is identified by ENGINE_TRANSACTION_ID, a table is identified by OBJECT_NAME, and that the actual resource being requested is a row in a particular index (thus INDEX_NAME is important if there are multiple indexes per table), with LOCK_DATA identifying the row. As mentioned earlier, InnoDB uses a very complex set of LOCK_MODEs which lets a transaction specify if it needs access to the row, the gap before it, or some combination of those; as ABe has specified that he SELECTs the row FOR SHARE and could point to it directly via the primary key, only the record (but not the gap before it) had to be locked, and only in S(hared) mode.
Now, let BAsil start in a separate connection:
BAsil> BEGIN;
BAsil> SELECT value FROM test.fileB WHERE name='Balance' FOR SHARE;
+-------+
| value |
+-------+
|     0 |
+-------+
And confirm that he got the requested access right, too:
> SELECT ENGINE_TRANSACTION_ID as trx_id,OBJECT_NAME as `table`,INDEX_NAME,LOCK_DATA,LOCK_MODE,LOCK_STATUS FROM performance_schema.data_locks WHERE LOCK_TYPE='RECORD';
+-----------------+-------+------------+-----------+---------------+-------------+
| trx_id          | table | INDEX_NAME | LOCK_DATA | LOCK_MODE     | LOCK_STATUS |
+-----------------+-------+------------+-----------+---------------+-------------+
| 284271835218160 | filea | PRIMARY    | 'Apples'  | S,REC_NOT_GAP | GRANTED     |
| 284271835219208 | fileb | PRIMARY    | 'Balance' | S,REC_NOT_GAP | GRANTED     |
+-----------------+-------+------------+-----------+---------------+-------------+
Now, let BAsil continue, and try to modify the Apples to 0:
BAsil> UPDATE test.fileA SET value=0 WHERE name='Apples';
You'll notice that this query does not return immediately, but instead starts to wait. If you aren't quick enough and it waits too long, a timeout will hit. You can check the default timeout with:
SELECT @@innodb_lock_wait_timeout;
+----------------------------+
| @@innodb_lock_wait_timeout |
+----------------------------+
|                         50 |
+----------------------------+
To not have to hurry when playing with scenarios like this, you can change it globally to something larger like this:
SET GLOBAL innodb_lock_wait_timeout=1000000;
(As this value is checked before going to sleep, run the above command up front.)
You can confirm that BAsil is WAITING for an X(clusive) lock on the record 'Apples' in table filea (but not on the gap before it) in the following way:
> SELECT ENGINE_TRANSACTION_ID as trx_id,OBJECT_NAME as `table`,INDEX_NAME,LOCK_DATA,LOCK_MODE,LOCK_STATUS FROM performance_schema.data_locks WHERE LOCK_TYPE='RECORD';
+-----------------+-------+------------+-----------+---------------+-------------+
| trx_id          | table | INDEX_NAME | LOCK_DATA | LOCK_MODE     | LOCK_STATUS |
+-----------------+-------+------------+-----------+---------------+-------------+
| 3064            | fileb | PRIMARY    | 'Balance' | S,REC_NOT_GAP | GRANTED     |
| 3064            | filea | PRIMARY    | 'Apples'  | X,REC_NOT_GAP | WAITING     |
| 284271835218160 | filea | PRIMARY    | 'Apples'  | S,REC_NOT_GAP | GRANTED     |
+-----------------+-------+------------+-----------+---------------+-------------+
One peculiar detail to notice here is that the ENGINE_TRANSACTION_ID of BAsil has changed from the very long pseudo-random 284271835219208 to the more sane-looking and small 3064. InnoDB distinguishes between read-only and read-write transactions. InnoDB tries to be optimistic and initially assumes that a transaction is read-only until it performs some update. The idea is to avoid having to assign a transaction id from a monotonic sequence, because this process is expensive to synchronize. So a transaction has no real id assigned until it is really necessary. As soon as BAsil issued an UPDATE statement, it became apparent to InnoDB that it is a read-write transaction, and thus a real transaction id had to be assigned to it.
Now, let’s get back to ABe and try to continue by setting Balance to 10:
ABe> UPDATE fileB SET value=10 WHERE name='Balance';
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
This is what a deadlock looks like from the point of view of the victim. This time it was ABe who was picked as the victim, but it could have been BAsil.
From the point of view of the survivor, it simply looks like the requested access was granted, so BAsil's query simply returns:
Query OK, 1 row affected (23.52 sec)
(Those 23 seconds are how long it took me to manually follow this script while typing the blog at the same time :D)
So, BAsil can finish without any problem:
BAsil> COMMIT;
Query OK, 0 rows affected (0.01 sec)
And the final state of the database is consistent with a history where ABe has rolled back and BAsil has committed:
> SELECT * FROM test.fileA;
+--------+-------+
| name   | value |
+--------+-------+
| Apples |     0 |
+--------+-------+
1 row in set (0.00 sec)

> SELECT * FROM test.fileB;
+---------+-------+
| name    | value |
+---------+-------+
| Balance |     0 |
+---------+-------+
1 row in set (0.00 sec)
And what about ABe? He should simply retry his whole transaction. This is in general how you should deal with a deadlock error: do not panic, simply retry.
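For example, a retry loop in application code could look like this (a sketch assuming the mysql-connector-python driver and the tables from this demo; adapt the error handling to whatever client library you actually use):

import mysql.connector
from mysql.connector import errorcode

def set_balance_to_apples(cnx, max_retries=3):
    """Copy the number of Apples into Balance, retrying if we are picked as a deadlock victim."""
    for attempt in range(max_retries):
        try:
            cnx.start_transaction()
            cur = cnx.cursor()
            cur.execute("SELECT value FROM test.fileA WHERE name='Apples' FOR SHARE")
            (apples,) = cur.fetchone()
            cur.execute("UPDATE test.fileB SET value = %s WHERE name = 'Balance'", (apples,))
            cnx.commit()
            return
        except mysql.connector.Error as err:
            if err.errno == errorcode.ER_LOCK_DEADLOCK:  # error 1213, as seen above
                cnx.rollback()  # the victim's changes were rolled back anyway; start over
                continue
            raise
    raise RuntimeError("transaction still deadlocking after %d retries" % max_retries)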
We can confirm that this deadlock was properly counted:
> SELECT `count` FROM INFORMATION_SCHEMA.INNODB_METRICS WHERE NAME="lock_deadlocks";
+-------+
| count |
+-------+
|     1 |
+-------+
And see this (most recent) incident in the SHOW ENGINE INNODB STATUS output:
------------------------
LATEST DETECTED DEADLOCK
------------------------
2020-07-27 14:21:59 0x2410
*** (1) TRANSACTION:
TRANSACTION 3064, ACTIVE 154 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1200, 2 row lock(s)
MySQL thread id 10, OS thread handle 22236, query id 30 localhost ::1 root updating
UPDATE test.fileA SET value=0 WHERE name='Apples'

*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 5 page no 4 n bits 72 index PRIMARY of table `test`.`fileb` trx id 3064 lock mode S locks rec but not gap
Record lock, heap no 2 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 7; hex 42616c616e6365; asc Balance;;
 1: len 6; hex 000000000bf7; asc       ;;
 2: len 7; hex 810000009a0110; asc        ;;
 3: len 4; hex 80000000; asc     ;;

*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 4 page no 4 n bits 72 index PRIMARY of table `test`.`filea` trx id 3064 lock_mode X locks rec but not gap waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 6; hex 4170706c6573; asc Apples;;
 1: len 6; hex 000000000bf5; asc       ;;
 2: len 7; hex 81000000990110; asc        ;;
 3: len 4; hex 8000000a; asc     ;;

*** (2) TRANSACTION:
TRANSACTION 3065, ACTIVE 631 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1200, 2 row lock(s)
MySQL thread id 9, OS thread handle 6128, query id 32 localhost ::1 root updating
UPDATE fileB SET value=10 WHERE name='Balance'

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 4 page no 4 n bits 72 index PRIMARY of table `test`.`filea` trx id 3065 lock mode S locks rec but not gap
Record lock, heap no 2 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 6; hex 4170706c6573; asc Apples;;
 1: len 6; hex 000000000bf5; asc       ;;
 2: len 7; hex 81000000990110; asc        ;;
 3: len 4; hex 8000000a; asc     ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 5 page no 4 n bits 72 index PRIMARY of table `test`.`fileb` trx id 3065 lock_mode X locks rec but not gap waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 7; hex 42616c616e6365; asc Balance;;
 1: len 6; hex 000000000bf7; asc       ;;
 2: len 7; hex 810000009a0110; asc        ;;
 3: len 4; hex 80000000; asc     ;;

*** WE ROLL BACK TRANSACTION (2)
There's a lot to unpack here. First of all, each deadlock involves some number of transactions in a cycle, referred to as (1), ..., (N). Here we have a cycle involving only two transactions, (1) and (2). The *** markers mark important sections which:
- try to describe transaction (1):
- data which identifies the transaction itself,
- what access right (needed by previous trx on the cycle) it held at the moment of the deadlock,
- and what access right it was waiting to obtain,
- and then the same for transaction (2):
- …
- <and if there were more transactions they are all described in the same way>
- Finally, the verdict to sacrifice transaction (2) is announced.
The way access rights held and requested by a transaction are described is pretty low-level and requires some practice to parse correctly. But more importantly it requires understanding of how locks are implemented and tied to InnoDB tree pages. You see, InnoDB stores records grouped into pages, and a page is identified by a pair of space_id and page_no. There might be multiple records on a single page. A common case for a transaction is to select a sequence of rows requesting similar access rights to all of them, which gives an opportunity for compression: instead of storing each access right request individually, we group all requests of the same kind for the same transaction for a given page and we use a bitmap to indicate which of the records on this page it is about. This also means that the Lock System does not have to know the actual "names" of the tables and records, so no memory is wasted on strings and primary keys. (The price for that is that to present the information in human-readable form, an actual database page has to be consulted to decode how space_id and page_no map to names of table and index, and how a position in the bitmap (the heap_no) maps to an actual key value. Sometimes (for example due to a buffer-pool cache miss) such information is unavailable and thus will be missing in the output. Another drawback is that each B-tree page reorganization requires cooperation with the Lock System to update the mapping.)
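As a mental model (a toy sketch in Python, not InnoDB's actual structures), you can think of one such lock object as a (transaction, page, mode) triple plus a bitmap over heap_no positions:

from dataclasses import dataclass

@dataclass
class RecordLock:
    trx_id: int
    space_id: int
    page_no: int
    mode: str        # e.g. "S,REC_NOT_GAP" or "X,REC_NOT_GAP"
    bitmap: int = 0  # bit i set => the record at heap_no i on this page is covered

    def add_record(self, heap_no):
        self.bitmap |= 1 << heap_no

    def covers(self, heap_no):
        return bool(self.bitmap & (1 << heap_no))

# Transaction 3064 holding S,REC_NOT_GAP on heap_no 2 of the page (space_id=5, page_no=4):
lock = RecordLock(trx_id=3064, space_id=5, page_no=4, mode="S,REC_NOT_GAP")
lock.add_record(2)
print(lock.covers(2), lock.covers(3))  # True False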
So, for example:
*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 5 page no 4 n bits 72 index PRIMARY of table `test`.`fileb` trx id 3064 lock mode S locks rec but not gap
Record lock, heap no 2 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 7; hex 42616c616e6365; asc Balance;;
 1: len 6; hex 000000000bf7; asc       ;;
 2: len 7; hex 810000009a0110; asc        ;;
 3: len 4; hex 80000000; asc     ;;
means transaction (1) holds the following access rights:
- Rights for the particular records listed below, all of which are inside the page identified by space_id=5 and page_no=4. There are 72 bits in the bitmap. The page belongs to the PRIMARY index of table test.fileb. The transaction has id 3064. All the rights listed below have the same mode "S" and are for the records themselves, but not for the gaps before them:
- Right to the record residing on this page's heap at position no 2 (which means that the 72-bit long bitmap had its 2nd bit set, numbering from zero). This record has 4 fields and is in a compact format. It has info bits equal to 0. Here are the fields of this record (note this is very low-level stuff which depends on the kind of the index and the particular version of InnoDB, some of these columns are generated by InnoDB itself, etc., so it is impossible to interpret without some additional knowledge of the table shape, index structure and source code):
- 0th field has length 7 bytes and contains the string “Balance“
- 1st field has 6 bytes and contains x000000000bf7 (which is 3063, the transaction id of the last transaction which modified this row, i.e. the INSERT during the preparation steps)
- 2nd field has 7 bytes and contains x810000009a0110 (this is a "roll_ptr", a.k.a. "undo pointer", which should point to a previous version of the row, but as this is the initial version of the row after insert, it simply has its 55th bit set to 1 to indicate that fact, plus some other info)
- 3rd field has 4 bytes and contains x80000000 (which is how a positive zero looks in hex in the encoding used by SQL; in other words, this is value=0)
- <more user-defined columns would go here…>
- <in case there were more bits set to 1 in the bitmap, then more records would be listed here>
IMPORTANT: if the transaction had access rights to other pages, or to the same page but with a mode different than S,REC_NOT_GAP, they would NOT be listed here. The output only contains a description of the lock object which is involved in the deadlock cycle, but not the other lock objects held by the transaction. You may occasionally see more locks listed if they are all encoded in the bitmap of the same lock object, but in general this output will not let you know about all the locks held by the transaction, sorry. You can figure out if something is missing by looking at the "LOCK WAIT 4 lock struct(s), heap size 1200, 2 row lock(s)" part of the output: it says there should be 4 lock objects in memory, 2 of them for record locks (and the other 4-2=2 for table locks).
You can parse the other sections in a similar way.
Perhaps, for us, humans, the most important part is the one which describes the transaction itself by providing the query it was trying to execute:
*** (1) TRANSACTION:
TRANSACTION 3064, ACTIVE 154 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1200, 2 row lock(s)
MySQL thread id 10, OS thread handle 22236, query id 30 localhost ::1 root updating
UPDATE test.fileA SET value=0 WHERE name='Apples'
Again, parsing and interpreting this correctly might require looking into the source code, but at least we know which query the transaction was trying to complete.
SHOW ENGINE INNODB STATUS only shows you the most recent deadlock found, so if you want to track all of them you might want to enable logging them to the error log with innodb_print_all_deadlocks, but first check
SELECT `count` FROM INFORMATION_SCHEMA.INNODB_METRICS
WHERE NAME="lock_deadlocks"
to see how much output to expect, so that you don’t get overwhelmed.
There is also an option to disable the deadlock detection algorithm by using the innodb_deadlock_detect variable, which was useful for some customers with the old implementation. I believe that the new (8.0.18) implementation is not only faster, but, more importantly, runs in a dedicated background thread without blocking the transaction threads, so I see no reason to disable this mechanism. I recommend you keep the default (ON).
This article was about handling situations where waiters wait on each other in a deadlock. But how do we handle a more typical case, where there is no deadlock, yet we have to somehow queue the waiters and pick who to grant the lock once it becomes available? If you wonder if we use first-come-first-served, priority-queue, or some randomized approach, read our next article InnoDB Data Locking – Part 4 “Scheduling”.
Thank you for using MySQL!