WL#6044: InnoDB: Policy-based mutex
Status: Complete
Background
==========
InnoDB distinguishes between OS mutexes, os_fast_mutex_t (e.g., POSIX mutexes on
Unices and CRITICAL_SECTION on Windows), and a home brew version, ib_mutex_t.
You cannot mix the two. ib_mutex_t uses ranking/ordering to make it
easy to check for deadlocks if UNIV_SYNC_DEBUG is defined. There is no such
check for os_fast_mutex_t. ib_mutex_t is used for all core code synchronisation.
ib_mutex_t is implemented as a Test and Test-And-Set And Wait (TTASAW) mutex.
Two configuration variables control the behaviour of ib_mutex_t:
* innodb_spin_wait_delay maps to the global variable srv_spin_wait_delay
  This variable sets the upper bound on the delay after the Test.
* innodb_sync_spin_loops maps to the global variable srv_n_spin_wait_rounds
  This variable controls the number of spins of the Test.
All ib_mutex_t instances are controlled by these two variables. If a thread
fails to acquire the mutex, it is forced to wait on a condition variable
that is attached to the ib_mutex_t. When the mutex is freed, all waiting threads
are woken up via a broadcast signal on the condition variable. One of the threads
will end up acquiring the mutex and the remaining threads will have to go
through the TTAS and Wait cycle again, and so on.
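The spin, wait and broadcast cycle described above can be sketched roughly as follows. This is an illustration only, not the InnoDB code: LegacyStyleMutex, kSpinRounds and run_legacy_demo are hypothetical names, std::condition_variable stands in for the event attached to ib_mutex_t, and a plain yield stands in for the delay bounded by srv_spin_wait_delay.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for srv_n_spin_wait_rounds (not the real default).
static const unsigned kSpinRounds = 30;

class LegacyStyleMutex {
 public:
  void enter() {
    for (;;) {
      // Spin phase: Test (plain read) first, Test-And-Set only if it looks free.
      for (unsigned round = 0; round < kSpinRounds; ++round) {
        if (!locked_.load(std::memory_order_relaxed) && try_lock()) {
          return;
        }
        std::this_thread::yield();  // stand-in for the bounded delay
      }
      // Wait phase: block until a release is broadcast, then retry from the top.
      std::unique_lock<std::mutex> guard(wait_mutex_);
      cond_.wait(guard, [this] {
        return !locked_.load(std::memory_order_relaxed);
      });
    }
  }

  bool try_lock() {
    return !locked_.exchange(true, std::memory_order_acquire);
  }

  void exit() {
    assert(locked_.load(std::memory_order_relaxed));
    {
      // Clearing the flag under wait_mutex_ prevents a lost wake-up.
      std::lock_guard<std::mutex> guard(wait_mutex_);
      locked_.store(false, std::memory_order_release);
    }
    // Broadcast: every waiter wakes up, but only one will win the mutex.
    cond_.notify_all();
  }

 private:
  std::atomic<bool> locked_{false};
  std::mutex wait_mutex_;
  std::condition_variable cond_;
};

// Hypothetical driver: several threads increment a counter under the mutex.
long run_legacy_demo(int threads, int iterations) {
  LegacyStyleMutex m;
  long counter = 0;
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([&] {
      for (int i = 0; i < iterations; ++i) {
        m.enter();
        ++counter;
        m.exit();
      }
    });
  }
  for (auto &th : pool) th.join();
  return counter;
}
```

Note that every object of this type carries a std::mutex and a std::condition_variable even if it is never contended, and that notify_all() wakes every waiter on each release.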
The Problem
===========
There are several issues with this:
1. Thundering herd due to the broadcast
2. Two global variables that control all instances
3. Size overhead due to the condition variable that is part of ib_mutex_t
This has been noted as a concern for very large buffer pools. Each block in
the buffer pool has a mutex attached to it. Therefore, the smaller the block size,
the more blocks (and mutexes) a pool of a given size contains, and the higher the
overhead.
4. Different subsystems have different requirements
5. Not flexible
i. Monitoring is cumbersome
ii. Can't switch between the OS mutex and ib_mutex_t in places where we know that
the OS implementation performs better than the homegrown version
iii. Cannot use mutexes with different characteristics in different parts of
the code, e.g., a wait mechanism based on a futex on Linux instead of the
condition variable
6. The code is not generic, so it cannot be used as a library
The Solution
============
1. Have one generic mutex type; the OS mutex and any homebrew mutexes are
subtypes of this generic type
2. Use static inheritance (templates) to reduce vptr overhead. This is mainly
for the buffer pool block mutexes. Otherwise dynamic inheritance should be OK
too.
3. Decouple the code from InnoDB so that it can be used as a library.
4. Make it easy to add new policies and make it easy to customise mutex
behaviour.
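The vptr point in item 2 can be illustrated with a small size comparison. This is a sketch under assumptions: VirtualMutexBase, VirtualSpinMutex, PlainSpinMutex and StaticWrapper are hypothetical names that do not appear in InnoDB, and the static_asserts reflect the usual object layout on mainstream compilers.

```cpp
#include <atomic>
#include <cassert>

// Dynamic inheritance: every object carries a hidden vptr.
struct VirtualMutexBase {
  virtual ~VirtualMutexBase() {}
  virtual bool try_lock() = 0;
  virtual void exit() = 0;
};

struct VirtualSpinMutex : VirtualMutexBase {
  bool try_lock() override {
    return !locked.exchange(true, std::memory_order_acquire);
  }
  void exit() override { locked.store(false, std::memory_order_release); }
  std::atomic<bool> locked{false};
};

// Same logic without virtual functions.
struct PlainSpinMutex {
  bool try_lock() {
    return !locked.exchange(true, std::memory_order_acquire);
  }
  void exit() { locked.store(false, std::memory_order_release); }
  std::atomic<bool> locked{false};
};

// Static inheritance: the implementation is chosen at compile time and the
// calls are resolved without a vtable.
template <typename MutexImpl>
class StaticWrapper {
 public:
  bool try_lock() { return impl_.try_lock(); }
  void exit() { impl_.exit(); }

 private:
  MutexImpl impl_;
};

// The wrapper adds no per-object cost; the virtual version pays for the vptr.
static_assert(sizeof(StaticWrapper<PlainSpinMutex>) == sizeof(PlainSpinMutex),
              "template wrapper should not add per-object overhead");
static_assert(sizeof(VirtualSpinMutex) > sizeof(PlainSpinMutex),
              "virtual dispatch adds a vptr to every object");
```

With one mutex per buffer pool block, a pointer-sized saving per object is multiplied by the number of blocks, which is why the static form matters mainly there.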
Split ib_mutex_t into several mutex types.
1. OSBasicMutex
This is a thin wrapper around the system mutex (POSIX mutex or Windows
CRITICAL_SECTION). These are used by the events code. We can't track the
lock/unlock of these mutexes because they are used by the condition variables,
which change the mutex state outside of InnoDB's control.
2. OSTrackedMutex
This is derived from OSBasicMutex and it can be used as a general purpose
mutex anywhere inside InnoDB code. It replaces os_fast_mutex_t. It brings
os_fast_mutex_t under UNIV_SYNC_DEBUG and we can rank them just like the homebrew
mutexes.
3. Futex - Only on Linux
This is a TTAS type of mutex that uses futexes instead of os_event_t to wait
when there is contention. The advantage is that we avoid the thundering herd
problem by using the Linux futex infrastructure.
4. TTASMutex - Test and Test-And-Set mutex (TTAS)
Simple spin mutex that only spins and never waits when there is contention,
i.e., it only does busy waits
5. TTASWaitMutex
It will busy wait up to a configured setting but then wait on a condition
variable once the busy wait exceeds the configured spin setting. This is
equivalent to the old ib_mutex_t.
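To show how the split types can share one interface, here is a minimal sketch of two of them. It assumes a common enter/try_lock/exit interface; OSBasicMutexSketch, TTASMutexSketch and count_under_mutex are hypothetical names, and std::mutex stands in for the POSIX mutex or CRITICAL_SECTION.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Thin wrapper around the system mutex (std::mutex is the stand-in here).
class OSBasicMutexSketch {
 public:
  void enter() { mutex_.lock(); }
  bool try_lock() { return mutex_.try_lock(); }
  void exit() { mutex_.unlock(); }

 private:
  std::mutex mutex_;
};

// Spin-only TTAS mutex: it never blocks on a condition variable or event.
class TTASMutexSketch {
 public:
  void enter() {
    for (;;) {
      // Test: spin on a plain read so the cache line is not bounced around.
      while (locked_.load(std::memory_order_relaxed)) {
        // Courtesy yield for this sketch; a pure spin would use a CPU pause.
        std::this_thread::yield();
      }
      // Test-And-Set: attempt the atomic exchange only when it looked free.
      if (try_lock()) return;
    }
  }

  bool try_lock() {
    return !locked_.exchange(true, std::memory_order_acquire);
  }

  void exit() {
    assert(locked_.load(std::memory_order_relaxed));
    locked_.store(false, std::memory_order_release);
  }

 private:
  std::atomic<bool> locked_{false};
};

// Generic caller: works with any type that offers enter() and exit().
template <typename Mutex>
long count_under_mutex(int threads, int iterations) {
  Mutex m;
  long counter = 0;
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([&] {
      for (int i = 0; i < iterations; ++i) {
        m.enter();
        ++counter;
        m.exit();
      }
    });
  }
  for (auto &th : pool) th.join();
  return counter;
}
```

Because the calling code is a template, switching a subsystem from the spin mutex to the OS mutex is a one-line type change, for example count_under_mutex<OSBasicMutexSketch>(4, 5000) versus count_under_mutex<TTASMutexSketch>(4, 5000).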
The mutex implementation takes a Policy template parameter that manages various
aspects of the mutex. The following policies exist; more can be added and
existing ones can be changed.
1. DefaultPolicy
Only implements the busy wait loop. It uses the current mechanism based on
srv_spin_wait_delay and srv_n_spin_wait_rounds.
2. TrackPolicy
This implements tracking only, and uses the DefaultPolicy to manage the busy
wait. It tracks the mutex name and the file and line number where the
mutex was created. These member variables add to the physical size of the
mutex implementation. This is equivalent to what we were tracking in the existing
implementation of mutexes.
3. DebugPolicy
This derives from TrackPolicy and adds checks for
mutex ordering. It is used when UNIV_SYNC_DEBUG is set. This policy additionally
tracks the filename and line number where the mutex was acquired to help in
debugging.
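The layering of the three policies might look roughly like the sketch below. All names (DefaultPolicySketch, TrackPolicySketch, DebugPolicySketch and their members) are hypothetical, and the ordering rule used here, that a thread may only acquire a mutex whose level is strictly lower than the lowest level it already holds, is an assumption made for illustration rather than a statement of the exact InnoDB rule.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Busy wait only: no per-mutex state.
struct DefaultPolicySketch {
  // Calls try_lock up to max_rounds times; returns true once it succeeds.
  template <typename TryLock>
  bool spin(TryLock try_lock, unsigned max_rounds) const {
    for (unsigned i = 0; i < max_rounds; ++i) {
      if (try_lock()) return true;
    }
    return false;
  }
};

// Adds creation tracking; these members increase the mutex size.
struct TrackPolicySketch : DefaultPolicySketch {
  void created(const char *name, const char *file, unsigned line) {
    name_ = name;
    create_file_ = file;
    create_line_ = line;
  }
  const char *name_ = "";
  const char *create_file_ = "";
  unsigned create_line_ = 0;
};

// Adds ordering checks and remembers where the mutex was acquired.
struct DebugPolicySketch : TrackPolicySketch {
  void created(const char *name, const char *file, unsigned line,
               unsigned level) {
    TrackPolicySketch::created(name, file, line);
    level_ = level;
  }

  // Returns false (and records nothing) if the acquisition breaks the order.
  bool entering(const char *file, unsigned line) {
    std::vector<unsigned> &held = held_levels();
    if (!held.empty() && level_ >= held.back()) return false;
    held.push_back(level_);
    acquire_file_ = file;
    acquire_line_ = line;
    return true;
  }

  void exited() {
    std::vector<unsigned> &held = held_levels();
    for (std::size_t i = held.size(); i > 0; --i) {
      if (held[i - 1] == level_) {
        held.erase(held.begin() + (i - 1));
        return;
      }
    }
    assert(false && "exited() without a matching entering()");
  }

  unsigned level_ = 0;
  const char *acquire_file_ = "";
  unsigned acquire_line_ = 0;

 private:
  // Levels held by the current thread, in acquisition (descending) order.
  static std::vector<unsigned> &held_levels() {
    thread_local std::vector<unsigned> levels;
    return levels;
  }
};

// Each layer of tracking costs space in every mutex that uses it.
static_assert(sizeof(TrackPolicySketch) > sizeof(DefaultPolicySketch),
              "tracking members add to the size");
static_assert(sizeof(DebugPolicySketch) > sizeof(TrackPolicySketch),
              "debug members add to the size");
```

Since the policy is a template parameter of the mutex, a release build can select the stateless policy and pay nothing for the tracking members.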
None of the above mutex types should be used directly; use them via the PolicyMutex
interface. The PolicyMutex takes a template parameter that is the mutex
implementation type and calls the relevant methods. The technique here is to use
static inheritance (using templates) and not dynamic inheritance (using
virtual functions). The PolicyMutex knows about the Performance Schema (PS) and
manages the PS integration. The mutex implementation doesn't know about PS and
shouldn't have to. The PolicyMutex uses the mutex implementation's Policy member
variable to notify the policy about mutex state changes.
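A minimal sketch of this wrapper arrangement follows. It is not the InnoDB PolicyMutex: PolicyMutexSketch, SpinMutexImpl and CountingPolicy are hypothetical names, the policy hooks locked() and released() are assumed, and a simple atomic counter marks the places where the Performance Schema calls would go.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical policy that just counts the state changes it is told about.
struct CountingPolicy {
  void locked() { ++enters; }
  void released() { ++exits; }
  unsigned enters = 0;
  unsigned exits = 0;
};

// Mutex implementation: owns its policy, knows nothing about instrumentation.
template <typename Policy>
class SpinMutexImpl {
 public:
  bool try_lock() {
    return !locked_.exchange(true, std::memory_order_acquire);
  }
  void enter() {
    while (!try_lock()) std::this_thread::yield();
  }
  void exit() { locked_.store(false, std::memory_order_release); }
  Policy &policy() { return policy_; }

 private:
  std::atomic<bool> locked_{false};
  Policy policy_;
};

// Wrapper: the only type callers use. It forwards to the implementation,
// notifies the policy, and is the single place that touches instrumentation.
template <typename MutexImpl>
class PolicyMutexSketch {
 public:
  void enter() {
    ++instrumented_calls_;  // stand-in for a Performance Schema hook
    impl_.enter();
    impl_.policy().locked();  // called while holding the mutex
  }

  bool try_lock() {
    ++instrumented_calls_;  // stand-in for a Performance Schema hook
    if (!impl_.try_lock()) return false;
    impl_.policy().locked();
    return true;
  }

  void exit() {
    impl_.policy().released();  // still holding the mutex here
    impl_.exit();
  }

  unsigned long instrumented_calls() const { return instrumented_calls_.load(); }
  auto policy() -> decltype(std::declval<MutexImpl &>().policy()) {
    return impl_.policy();
  }

 private:
  MutexImpl impl_;
  std::atomic<unsigned long> instrumented_calls_{0};
};
```

A caller would declare something like PolicyMutexSketch<SpinMutexImpl<CountingPolicy> > and never name the implementation's methods directly, so swapping either the implementation or the policy does not touch the calling code.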