WL#3262: Add mutex lock order checking to the server

Affects: Server-5.6   —   Status: Un-Assigned   —   Priority: Medium

This would let us detect bugs such as BUG#25306 (perhaps not exactly this one,
but there are others; please add them here if you have better examples: ...)

Check that mutexes are not deadlocking.


These are some common deadlock situations.
Assuming mutexes A, B, C:

Some lock orders that can cause deadlocks (when executed by different threads):

a)  Simple case


Thread 1 locks A, then thread 2 locks B, then thread 1 wants to lock B
(so it waits for thread 2) and thread 2 wants to lock A (so it waits for
thread 1). Each thread waits for the other: it's a deadlock.
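
For illustration, a minimal sketch of case a), using std::mutex and
std::thread only for brevity (the server itself uses pthread mutexes);
running it may genuinely hang:

  #include <mutex>
  #include <thread>

  std::mutex A, B;

  void thread1()
  {
    std::lock_guard<std::mutex> la(A);   // holds A
    std::lock_guard<std::mutex> lb(B);   // waits for B if thread 2 holds it
  }

  void thread2()
  {
    std::lock_guard<std::mutex> lb(B);   // holds B
    std::lock_guard<std::mutex> la(A);   // waits for A if thread 1 holds it
  }

  int main()
  {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    return 0;
  }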

b) Ring

Thread 1 locks A and waits for B, thread 2 locks B and waits for C, and
thread 3 locks C and waits for A; each thread waits for the next one in
the ring, so none can proceed.

One way to avoid this is to decide on a fixed order: mutex B must
always be locked after mutex A, i.e. it is forbidden for a thread to
lock B and then to want to lock A.  There are areas like replication where
it is not so clear that such a mutex locking order is always defined and
followed.

Some requirements for the mutex deadlock detector:
- Should be able to give warnings for both direct mutex conflicts (case a) and
rings (case b)
- Should be able to handle any number of mutexes that are dynamically created
and destroyed
- When getting a mutex conflict, write out at least the current mutex stack and
reasonable information about the path that may conflict with this one.

The same ideas could be applied to rwlocks (if we have safe_rwlock...).


Status update 2011-10-13, Marc Alff:
- Added a dependency on WL#5825 Using C++ Standard Library with MySQL code,
to avoid reinventing the STL wheel all the time.
- Proposal 4 is implemented as a prototype in bzr branch mysql-trunk-lock-order

Proposal 1

The proposal is to add to safemutex_t a member:
mutex_list *must_not_be_locked_before_me;
where mutex_list is a usual list struct { safemutex_t *s; mutex_list *next; }.
Then add to safemutex_lock() this check (before it tries to lock the mutex):
for M in must_not_be_locked_before_me
  safemutex_assert_not_owner(M); /* see if I don't own M */
Add a call which can manipulate the must_not_be_locked_before_me list:
it says that M2 should not be locked before M1, that is, it adds M2 to M1's list.
This call should be separate from the mutex init(), because there can be
situations where the order should not be enforced (e.g. single-threaded
initialization of mysqld components).
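
A hedged sketch of this proposal, assuming a simplified safemutex_t (the
helper name safemutex_add_order and all internals below are illustrative
assumptions, not existing code; the real safe_mutex protects its bookkeeping
with an internal mutex, omitted here):

  #include <pthread.h>
  #include <assert.h>

  struct safemutex_t;

  struct mutex_list
  {
    safemutex_t *s;
    mutex_list  *next;
  };

  struct safemutex_t
  {
    pthread_mutex_t  mutex;
    pthread_t        owner;                        /* valid only when locked */
    bool             locked;
    mutex_list      *must_not_be_locked_before_me;
  };

  /* See if the current thread does not own m. */
  static void safemutex_assert_not_owner(safemutex_t *m)
  {
    assert(!(m->locked && pthread_equal(m->owner, pthread_self())));
  }

  /* Declare that m2 should not be locked before m1 (adds m2 to m1's list).
     Kept separate from init() so single-threaded initialization can skip it. */
  void safemutex_add_order(safemutex_t *m1, safemutex_t *m2, mutex_list *link)
  {
    link->s= m2;
    link->next= m1->must_not_be_locked_before_me;
    m1->must_not_be_locked_before_me= link;
  }

  int safemutex_lock(safemutex_t *m)
  {
    /* The proposed check, before trying to lock the mutex. */
    for (mutex_list *l= m->must_not_be_locked_before_me; l; l= l->next)
      safemutex_assert_not_owner(l->s);

    int res= pthread_mutex_lock(&m->mutex);
    m->owner= pthread_self();
    m->locked= true;
    return res;
  }

  int safemutex_unlock(safemutex_t *m)
  {
    m->locked= false;
    return pthread_mutex_unlock(&m->mutex);
  }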

Proposal 2 (by Serg, rephrased by KristianN below)

There will be a single global mutex graph. Whenever a thread T locks mutex M
followed by mutex N, there will be an edge from M to N ("N locked after M").
If at server exit there is a loop in the graph, it means that some mutexes M1
and M2 are sometimes locked in the order M1, M2 and sometimes M2, M1 (or more
complex situations with more threads/mutexes). This is possibly (but not
necessarily) a realised or potential deadlock, and it definitely shows the lack
of a well-defined locking order for the mutexes involved. This would mean a
large graph, and probably checking at mutex destroy time or even process exit.
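
A hedged sketch of how such a graph could be recorded and checked (all names
and containers below are illustrative assumptions, not existing code; updates
to the shared graph would need a private mutex, omitted here):

  #include <cstdio>
  #include <map>
  #include <set>
  #include <vector>

  typedef const void *mutex_id;                 // e.g. the mutex address

  static std::map<mutex_id, std::set<mutex_id> > lock_graph;   // M -> {N}

  // Called when a thread locks 'locked' while already holding 'held'.
  void record_edges(const std::vector<mutex_id> &held, mutex_id locked)
  {
    for (size_t i= 0; i < held.size(); i++)
      lock_graph[held[i]].insert(locked);       // edge held[i] --> locked
  }

  // Depth-first search for a cycle, run once at mutex destroy or process exit.
  static bool has_cycle(mutex_id node, std::set<mutex_id> &on_path,
                        std::set<mutex_id> &done)
  {
    if (on_path.count(node)) return true;       // back edge: a loop
    if (done.count(node)) return false;
    on_path.insert(node);
    std::set<mutex_id> &next= lock_graph[node];
    for (std::set<mutex_id>::iterator it= next.begin(); it != next.end(); ++it)
      if (has_cycle(*it, on_path, done)) return true;
    on_path.erase(node);
    done.insert(node);
    return false;
  }

  void check_lock_graph_at_exit()
  {
    std::set<mutex_id> done;
    for (std::map<mutex_id, std::set<mutex_id> >::iterator it= lock_graph.begin();
         it != lock_graph.end(); ++it)
    {
      std::set<mutex_id> on_path;
      if (has_cycle(it->first, on_path, done))
        fprintf(stderr, "lock order loop involving mutex %p\n", it->first);
    }
  }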

Proposal 3 (By Monty; To be implemented in Maria tree)

- Extend safe_mutex to handle mutex deadlock detection.
- Add global startup option to enable/disable deadlock detection
- Deadlock detection can also be disabled for a mutex or a path.
  (To temporarily disable warnings)
- Add a unique identifier for each mutex; this is used as the basic
  identification of a mutex, as pointers to mutexes are not safe.
- Keep a linked list of all mutexes in use per thread
- For each mutex, maintain a locked-mutex-hash that contains all mutexes that
  were locked while this mutex was held. It also contains any mutex that was
  locked while any of the mutexes in the hash was held (recursively).
  For example, if you have the mutex locking paths A->B and B->C, then the
  hash for A contains both B and C.
- For each mutex, keep a used-mutex-hash of pointers to all mutexes that have
  this mutex in their locked-mutex-hash. This is needed to delete hash entries
  when the mutex is destroyed.

With this in place, you can detect a wrong mutex order by checking whether
the locked-mutex-hash of the mutex you are trying to lock contains any of
the mutexes that are held by the current thread.
Each hash entry will also contain descriptive information about where the
mutex was locked, so that one can write descriptive warnings.
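
A hedged sketch of this check, assuming a per-thread list of held mutexes and
a simplified set per mutex (names, containers and the missing protection of
the shared sets are illustrative assumptions, not the actual safe_mutex code):

  #include <cstdio>
  #include <set>
  #include <vector>

  struct lo_mutex
  {
    unsigned long         id;                 // stable identifier, not a pointer
    const char           *name;
    std::set<lo_mutex *>  locked_after_me;    // the "locked-mutex-hash"
  };

  // Mutexes currently held by this thread, oldest first.
  static thread_local std::vector<lo_mutex *> held;

  // Call before locking m; returns false on a lock order conflict.
  bool check_lock_order(lo_mutex *m)
  {
    for (size_t i= 0; i < held.size(); i++)
    {
      // m's hash says held[i] was once locked *after* m, so locking m while
      // holding held[i] is the reverse order: a potential deadlock.
      if (m->locked_after_me.count(held[i]))
      {
        fprintf(stderr, "lock order conflict: %s locked while holding %s\n",
                m->name, held[i]->name);
        return false;
      }
    }
    return true;
  }

  // Call after locking m: record m, and everything already in m's hash,
  // in the hashes of all currently held mutexes (the recursive part).
  void record_lock_order(lo_mutex *m)
  {
    for (size_t i= 0; i < held.size(); i++)
    {
      held[i]->locked_after_me.insert(m);
      held[i]->locked_after_me.insert(m->locked_after_me.begin(),
                                      m->locked_after_me.end());
    }
    held.push_back(m);
  }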

Proposal 4 (by Marc Alff)

There will be a unique graph.
This graph is oriented: an edge going from
A to B means that it is ok to lock B when holding a lock on A.
This graph is acyclic: a cycle in the graph reflects a bug in the design.
Relations in this graph are not transitive:
- if edges (A --> B) and (B --> C) exist,
  the edge (A --> C) may or may not exist.
- If locking C while holding a lock on A is considered valid,
  an explicit A --> C edge should exist.

So far this description is in line with proposal 2 from Sergei.

Points that are different:

The graph is built explicitly in the code, and embedded in the server
binary in debug builds.
In particular, edges "A --> B" must be explicitly declared.
At runtime, when locking a mutex X, the code checks for every mutex
M held by the current thread that there is an explicit M --> X edge
in the graph, or prints a message in a trace file and/or asserts.
The check happens in real time when mutex X is locked, so that failures
to comply with the predefined graph are detected on the spot
-- not post mortem at mutex destroy or server shutdown -- which gives
developers the opportunity to see the exact query and call
stack that caused the illegal lock order.
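
A hedged sketch of what the explicit declaration and the runtime check could
look like (the lo_ names, the arc list and the node names below are purely
illustrative; the actual prototype lives in the mysql-trunk-lock-order branch):

  #include <assert.h>
  #include <string.h>
  #include <cstdio>
  #include <vector>

  struct lo_arc { const char *from; const char *to; };

  // Explicitly declared edges: it is OK to lock 'to' while holding 'from'.
  static const lo_arc lo_graph[]=
  {
    { "mutex/sql/LOCK_A", "mutex/sql/LOCK_B" },   // hypothetical example edge
    // ... one entry per explicitly approved dependency ...
  };

  static bool lo_arc_exists(const char *from, const char *to)
  {
    for (size_t i= 0; i < sizeof(lo_graph) / sizeof(lo_graph[0]); i++)
      if (!strcmp(lo_graph[i].from, from) && !strcmp(lo_graph[i].to, to))
        return true;
    return false;
  }

  // Nodes (instrumentation class names) currently held by this thread.
  static thread_local std::vector<const char *> lo_held;

  void lo_check_lock(const char *node)
  {
    for (size_t i= 0; i < lo_held.size(); i++)
    {
      if (!lo_arc_exists(lo_held[i], node))
      {
        // Reported immediately, so the developer still has the query and
        // call stack that caused the illegal lock order.
        fprintf(stderr, "missing arc: %s --> %s\n", lo_held[i], node);
        assert(false);
      }
    }
    lo_held.push_back(node);
  }

  void lo_check_unlock(const char *node)
  {
    // Simplified: assumes locks are released in LIFO order.
    if (!lo_held.empty() && !strcmp(lo_held.back(), node))
      lo_held.pop_back();
  }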

By virtue of being explicitly declared in the code, the graph is unique
for the entire code base, not only for a given server run.
In particular, this enforces that test 1 executed with server run 1 and test
2 executed with server run 2 both comply with the *same* dependency graph.
Post mortem analysis based only on data collected during a single server run
does not provide this assurance.
For example, the "main" test suite when executed with mysql-test-run can
execute around 600 tests while starting a server 100 times to do that,
and this is not taking into account other test suites.

There is a huge difference between:
(a) enforcing 100 times that on average 6 tests are compatible between
  themselves, using a graph that may or may not be compatible with
  the graph of a different server run.
(b) enforcing 600 times that every test is compatible with the unique,
  explicit, dependency graph.
Any solution that only provides (a) does not resolve the problem.

By virtue of being explicitly declared in the code, the dependency
graph also serves as design documentation, which is critically needed.
By being enforced at runtime, this documentation is by definition up
to date with the code.

This solution also addresses a problem related to management of
software development.
Assuming for example that there is common code used by 10 different teams
to implement 10 different projects in parallel, that each team adds a new
mutex, and that the combined code at the end creates a graph between the 10
new/existing mutexes:
(c) when doing post mortem analysis to find loops in an undocumented graph,
the last team that merges to the common code repository is the team blamed for
closing a loop, and the team whose project gets cancelled.
(d) when doing analysis to find loops in a pre-documented graph, code review
and in particular design review should catch when a team is creating a
suspicious dependency, and design issues should be resolved in the code that
caused the problem (team X), not in the code that happened to find the problem
(the last one to merge, in case (c)).
Any solution that only provides (c) does not resolve the problem: in fact,
it makes the code base even worse.

Nodes in the graph are objects already instrumented for the performance schema.
This includes mutexes, but also rwlocks, which can benefit from this as well.
This includes every kind of mutex, not only safe_mutex: code that
implements its own mutex / rwlock / prlock / other can still benefit
from the deadlock detection.
Nodes in the graph are not mutex instances, but classes, identified
by the "sql/LOCK_open" name already assigned by the instrumentation to
mysql_mutex_t LOCK_open;
The mutex deadlock detector is *another* implementation of the
instrumentation interface, which collects data with the PSI_server->xxx()
calls that are already in place.
In other words, there are zero code changes for the existing code already
instrumented for the performance schema.

Another benefit of this approach is that not only mutex deadlocks can
be detected, but also cases of poorly scaling code.
For example, it may be deemed acceptable to hold mutex "mutex/sql/LOCK_open"
while performing file IO on "file/sql/FRM", so that there is an edge between
these nodes in the graph, but it may be unacceptable to hold mutex
"mutex/sql/LOCK_thread_count" while writing to a file: the absence of an
edge here will cause the code enforcing the graph order to complain
when this is attempted at runtime, automatically pointing out areas
in the code that could be revisited to gain performance.
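
Reusing the illustrative lo_arc type from the sketch above, such a policy
could be expressed simply by which edges are (not) declared; the node names
come from this WL, the acceptability of the edges is only assumed here:

  static const lo_arc lo_io_arcs[]=
  {
    { "mutex/sql/LOCK_open", "file/sql/FRM" },   // holding LOCK_open during FRM IO: accepted
    // intentionally no edge from "mutex/sql/LOCK_thread_count" to any file
    // node, so that attempt would be reported at runtime.
  };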

See for example bugs like http://bugs.mysql.com/bug.php?id=41158
DROP TABLE holds LOCK_open during unlink()
A complete graph with both mutexes and files, with documented edges, can prevent
this type of bug.

End of proposal 4

Jonas testing and use of lockdep
Jonas Oreland wrote:
> Found this during "start slave"
> my_lockdep_lock(0xf3f718, 0xf43938) potential deadlock
> cnt: 2
>  lock: 97(0xf3f718) [ file: slave.cc line: 5107
>  lock: 98(0xf3f698) [ file: slave.cc line: 4372
> file: slave.cc line: 5107 vs. file: slave.cc line: 4372

more... (keep in mind this is still beta, so these could be false alarms :-)

my_lockdep_lock(0xf40790, 0xf95cd8) potential deadlock
cnt: 3
 lock: 92(0xf40790) [ file: slave.cc line: 119
 lock: 97(0xf3f718) [ file: slave.cc line: 118
 lock: 25(0xeaadc8) [ file: sql_parse.cc line: 2999
file: slave.cc line: 119 vs. file: slave.cc line: 118
my_lockdep_lock(0xf40790, 0xf95cd8) potential deadlock
cnt: 3
 lock: 92(0xf40790) [ file: slave.cc line: 767
 lock: 97(0xf3f718) [ file: slave.cc line: 118
 lock: 25(0xeaadc8) [ file: sql_parse.cc line: 2999
file: slave.cc line: 767 vs. file: slave.cc line: 118

Moreover, it also hammers the log with "my_lockdep_unlock(0xf40710, 0xf46198)
incorrect unlock order".


After adding more/better tracing to the lockdep validator, I concluded the
following.

The potential deadlock detected in start/stop slave is "safe".

when starting slave:
1) lock global mutex
2) start IO thread lock(x) lock(y) unlock(y) unlock(x)
3) start SQL thread lock(y) lock(x) unlock(x) unlock(y)
4) unlock global mutex

This makes the validator scream,
and later on warn on any x->y / y->x locking.

So if you could fix this, it would be good,
as after this there will be lots of "false" warnings...
  (lockdep can not currently detect that the global mutex protects this locking)


This is a bit of guesswork, as slave.cc is as hairy as ha_ndb_binlog.cc :-)
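
A hedged sketch of the pattern Jonas describes (the names are illustrative,
not the actual slave.cc code): the outer global mutex serializes the two
helpers, so the opposite x/y order cannot deadlock in practice, but a checker
that only tracks pairwise lock order still reports x->y vs. y->x:

  #include <mutex>

  std::mutex global_mutex, x, y;

  void start_io_thread()
  {
    std::lock_guard<std::mutex> lx(x);   // lock(x) lock(y), unlock in reverse
    std::lock_guard<std::mutex> ly(y);
    // ... start the IO thread ...
  }

  void start_sql_thread()
  {
    std::lock_guard<std::mutex> ly(y);   // lock(y) lock(x), unlock in reverse
    std::lock_guard<std::mutex> lx(x);
    // ... start the SQL thread ...
  }

  void start_slave()
  {
    std::lock_guard<std::mutex> lg(global_mutex);   // 1) lock global mutex
    start_io_thread();                               // 2) x then y
    start_sql_thread();                              // 3) y then x
  }                                                  // 4) unlock global mutex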


Guilhem adds:

Yes, I added 1) and 4) to fix
http://bugs.mysql.com/bug.php?id=2921 in March 2004.
It was not obvious to find another solution (I still hope this is


Monty adds:

... And I agree with Jonas (in his later emails on this thread) that we
should fix all cases so that we never take locks in different order.


Marko Mäkelä adds:

Not knowing about this, I filed WL#5775, suggesting to extend the
UNIV_SYNC_DEBUG mechanism of InnoDB to the whole server code base. At least one
change should be relatively simple to implement: Add a handler or handlerton
debug call for checking that no storage engine mutexes are being held whenever
acquiring any mutex or rw-lock outside the storage engine.