MySQL :: Removing Scalability Bottlenecks in the Metadata Locking and THR

The MySQL Server 5.7.5 Development Milestone Release, which was published recently, contains some significant changes to the metadata locking (MDL) subsystem and to the usage of the THR_LOCK manager for InnoDB tables. This post provides more information about these changes, which resulted in nice improvements in both scalability and performance.

Sometime during the development cycle of MySQL 5.6 we realized that locks used in the metadata locking subsystem (MDL) implementation can become a scalability bottleneck when many short statements were executed in @autocommit=1 mode against InnoDB tables.

Bug #66473 is a good example (8-table Sysbench POINT_SELECT/InnoDB test). In this particular case the bottleneck was the mutex protecting the MDL_map hash which contained the MDL_lock objects describing the individual metadata locks. To some extent, we were able to solve the problem for this specific case by partitioning the MDL_map hash and the mutex protecting it in MySQL 5.6.8.

But this solution was not without its own problems — see bug #68487 (fixed in 5.6.15). Moreover, it didn’t solve the problem for the case when most of the statements accessed the same table (e.g. 1-table Sysbench POINT_SELECT/InnoDB test). This problem became even more prominent in 5.7 after some improvements were made in the InnoDB storage engine. After these changes we started to see the rwlock protecting the individual metadata lock (MDL_lock::m_rwlock) as the hottest lock in the server for such workloads.

So at some point during the Summer of 2013 we realized that a more drastic solution was needed.

We started by experimenting with a patch which disabled most of the metadata locking subsystem and thus removed the bottlenecks associated with it (and broke correctness in a big way). We immediately saw that the bottleneck moved to the THR_LOCK table-level lock manager, and the THR_LOCK::mutex in particular. Another patch was drafted that disabled THR_LOCK locking for InnoDB tables (which also broke correctness). With both these patches applied we saw the potential for significant scalability/performance improvements.

After these experiments it became obvious that we needed to address issues both in the metadata locking (MDL) subsystem and the THR_LOCK lock manager in order to improve scalability and performance:

For MDL we decided to use the fact that workloads where this subsystem can become a bottleneck mostly involve DML statements. DDL statements are rare in them. DDL-specific metadata locks are likely to conflict with other metadata locks, while DML-specific metadata locks don’t conflict with each other. So it is possible to improve scalability by making acquisition of DML-specific locks cheap (e.g. it can be as simple as checking a flag and increasing a counter) and making acquisition of DDL-specific locks a bit more expensive. WL#7304 “Improve MDL performance and scalability by implementing ‘fast-path’ for DML locks” implements exactly this idea. Essentially, it decreases the size of the critical section associated with MDL_lock::m_rwlock for DML-specific metadata locks at the expense of DDL-specific locks, and thus makes MDL_Lock::m_rwlock less hot.Moreover, we found that it was possible to develop this idea further and implement acquisition of DML-specific metadata locks as a lock-free atomic operation by using atomics support and the lock-free hash implementation already present in MySQL Server (and used in the Performance Schema). This resulted in WL#7305 “Improve MDL scalability by using lock-free hash” and WL#7306 “Improve MDL performance and scalability by implementing lock-free lock acquisition for DML”.After these changes, acquisition of a DML-specific lock consists of a look-up in a lock-free hash and then an atomic compare-and-swap operation on a single 64-bit value. This value contains counts of the various types of DML-specific locks acquired and flags indicating the presence of pending/active DDL-specific locks. If there are pending or active DDL-specific locks, we have to resort to the old approach using MDL_lock::m_rwlock lock. Similarly, the release of a DML-specific lock is in the best case a single atomic compare-and-swap operation which decrements one of these counters (for all of the details, please read the above WLs).
As a result of this change, MDL_lock::m_rwlock became insignificant in our benchmarks — we no longer see it in the list of hot locks, and MDL_map::m_lock/ MDL_map_partition::m_lock was removed completely as we now use a single lock-free hash.

One of the interesting consequences of WL#7305 was that we had to change the handling of unused MDL_lock objects (these are objects representing metadata locks which are not acquired by anyone). In 5.6 such objects are simply moved to an LRU list of unused MDL_lock objects. If the list gets bigger than --metadata_locks_cache_size elements, the least recently used object is deleted. With the lock-free algorithm, deletion becomes more expensive, so some kind of caching policy was definitely needed, but implementing an LRU list gets complicated. So instead of deleting the least recently used element we delete a random unused element once the unused/used ratio reaches a certain threshold. This meant that the --metadata_locks_cache_size option no longer made sense so it was removed.

Since MDL_map is now implemented using a single lock-free hash instead of several hash partitions, the --metadata_locks_cache_size option was removed as well.
The situation with the THR_LOCK manager looked a bit simpler. InnoDB doesn’t really need/rely on THR_LOCK locks for the execution of most of the statements. Instead InnoDB relies on its own row-level locks (for DML) and on metadata locks (for DDL and DML/DDL conflicts). The only exception was the LOCK TABLES READ statement for which acquisition of a strong THR_LOCK lock was necessary to properly isolate it from any concurrent DML. Once a new type of metadata lock was introduced, to replace THR_LOCK lock in this role, we were able to remove THR_LOCK acquisition for InnoDB tables completely. Doing this was not as easy as it might sound since there were various questions to answer:
- What priority should the new lock get compared to other DDL and DML locks?
- How do we prevent DML from starving out LOCK TABLES READ?
- What should we do about the LOW_PRIORITY clause?
As a result we ended-up adding even more lock types to the metadata locking subsystem. See WL#6671 “Improve scalability by not using thr_lock.c locks for InnoDB tables” for details.

While doing the above we were not able to completely avoid introducing incompatible changes. One thing, which I would like to note in particular, is that multi-update now is blocked by concurrent LOCK TABLES READ on any table in its join, even if such table is only used for reading.

On the other hand, the introduction of a new metadata lock type for LOCK TABLES READ allowed us to fix the long-standing bug #57474 “Deadlock when DDL under LOCK TABLES WRITE, READ + PREPARE” as a bonus.

The above changes removed the metadata locking subsystem and the THR_LOCK lock manager from the list of scalability bottlenecks for scenarios involving InnoDB tables. They became part of the MySQL 5.7.5 DMR (some of the changes were already available in the MySQL 5.7.4 DMR).

They are also partially responsible for the impressive numbers which Dimitri Kravtchuk got in his benchmarks. See his recent blog post: “Indeed, MySQL 5.7 rocks : OLTP_RO Point-Selects 8-tables Benchmark“. Dimitri also presented these numbers at Oracle OpenWorld 2014 “MySQL 5.7: Performance and Scalability Benchmarks“, and that link contains a link to his presentation.

A big Thank you! goes out to all who participated/helped with this work:

Mark Callaghan for reporting bug #66473, issues which led to the reporting of bug #68487 and urging us to look at scalability bottlenecks in the MDL subsystem
Dimitri Kravtchuk for his benchmarks and help in identifying the problems
Mikael Ronstrom for discussions about possible solutions to the problem
Jon Olav Hauglid for reviewing the design and the code of these worklogs
Shipra Jain, Vince Rezula, and Jeb Miller for QAing them
And last but not least, to Paul Dubois for documenting changes caused by these WLs

Thank you for using MySQL!