WL#6044: InnoDB: Policy-based mutex

Status: Complete   —   Priority: Medium

InnoDB distinguishes between OS mutexes, fast_mutex_t (e.g., POSIX mutexes on 
Unices and CRITICAL_SECTION on Windows), and a home-brew version, ib_mutex_t. 
You cannot mix the two. ib_mutex_t uses ranking/ordering to make it easy to 
check for deadlocks when UNIV_SYNC_DEBUG is defined. There is no such check for 
fast_mutex_t. ib_mutex_t is used for all core code synchronisation. It is 
implemented as a Test and Test-And-Set And Wait (TTASAW) mutex. Two 
configuration variables control its behaviour:

 * innodb_spin_wait_delay maps to global variable srv_spin_wait_delay
   This variable sets the upper bound on the delay after the Test.

 * innodb_sync_spin_loops maps to global variable srv_n_spin_wait_rounds
   This variable controls the number of spins of the Test.

All ib_mutex_t instances are controlled by these two variables. If a thread 
fails to acquire the mutex while spinning, it is forced to wait on a condition 
variable that is attached to the ib_mutex_t. When the mutex is freed, all 
waiting threads are woken up via a broadcast signal on the condition variable. 
One of the threads will end up acquiring the mutex and the remaining threads 
will have to go through the TTAS and Wait cycle again, and so on.
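
The spin-then-wait cycle described above can be sketched roughly as follows. 
This is a minimal illustration, not the InnoDB code: the class name 
SketchTTASWaitMutex, the helper contended_count and the constants kSpinRounds 
and kSpinDelay are hypothetical, and std::condition_variable stands in for the 
event attached to ib_mutex_t.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-ins for srv_n_spin_wait_rounds and srv_spin_wait_delay.
static const unsigned kSpinRounds = 30;
static const unsigned kSpinDelay = 6;

class SketchTTASWaitMutex {
 public:
  // Test (plain read) first, then Test-And-Set (atomic exchange).
  bool try_lock() {
    return !locked_.load(std::memory_order_relaxed) &&
           !locked_.exchange(true, std::memory_order_acquire);
  }

  void lock() {
    for (;;) {
      for (unsigned round = 0; round < kSpinRounds; ++round) {
        if (try_lock()) return;
        // Pause between tests; this sketch always uses the full bound.
        for (unsigned i = 0; i < kSpinDelay; ++i) std::this_thread::yield();
      }
      // Spinning failed: sleep until the owner broadcasts a release,
      // then go around the whole cycle again.
      std::unique_lock<std::mutex> guard(event_mutex_);
      event_.wait(guard, [this] {
        return !locked_.load(std::memory_order_relaxed);
      });
    }
  }

  void unlock() {
    {
      std::lock_guard<std::mutex> guard(event_mutex_);
      locked_.store(false, std::memory_order_release);
    }
    event_.notify_all();  // Broadcast: every waiter wakes, only one wins.
  }

 private:
  std::atomic<bool> locked_{false};
  std::mutex event_mutex_;
  std::condition_variable event_;
};

// Usage: several threads increment one counter under the mutex.
long contended_count(int threads, int iterations) {
  SketchTTASWaitMutex mutex;
  long counter = 0;
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([&mutex, &counter, iterations] {
      for (int i = 0; i < iterations; ++i) {
        mutex.lock();
        ++counter;
        mutex.unlock();
      }
    });
  }
  for (std::thread& worker : pool) worker.join();
  return counter;
}
```

Note that the sketch embeds both a std::mutex and a std::condition_variable in 
every instance, and that notify_all() wakes every waiter even though only one 
can take the lock.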

The Problem
There are several issues with this:

 1. Thundering herd due to the broadcast

 2. Two global variables that control all instances

 3. Size overhead due to the condition variable that is part of ib_mutex_t
    This has been noted as a concern for very large buffer pools. Each block in 
    the buffer pool has a mutex attached to it. Therefore the smaller the block 
    size, the higher the overhead.

 4. Different subsystems have different requirements.

 5. Not flexible
    i. Monitoring is cumbersome

    ii. Can't switch between the OS mutex and ib_mutex_t in places where we 
        know that the OS implementation performs better than the homegrown 
        version.

    iii. Cannot use mutexes with different characteristics in different parts 
         of the code, e.g., use a wait mechanism based on a futex on Linux 
         instead of the condition variable.

 6. The code is not generic; it cannot be used as a library.
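
Issue 3 can be made concrete with a little arithmetic. The sketch below is only 
an illustration: the function name block_mutex_overhead is hypothetical, and 
the pool size, block sizes and per-mutex size used with it are assumptions 
chosen for the example, not measurements of ib_mutex_t.

```cpp
#include <cassert>
#include <cstdint>

// Bytes spent on per-block mutexes when a buffer pool of pool_bytes is split
// into blocks of block_bytes and each block carries one mutex of mutex_bytes.
std::uint64_t block_mutex_overhead(std::uint64_t pool_bytes,
                                   std::uint64_t block_bytes,
                                   std::uint64_t mutex_bytes) {
  return (pool_bytes / block_bytes) * mutex_bytes;
}
```

For an assumed 64 GiB pool and an assumed 80-byte mutex, 16 KiB blocks give 
335544320 bytes (320 MiB) of mutexes, while 4 KiB blocks give 1342177280 bytes 
(1280 MiB): a quarter of the block size means four times the overhead.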

The Solution
 1. Have one mutex type; the OS mutex and any homebrew mutexes are subtypes of 
    this generic type.

 2. Use static inheritance (templates) to reduce vptr overhead. This is mainly 
    for the buffer pool block mutexes. Otherwise dynamic inheritance should be 
    OK.

 3. Decouple the code from InnoDB so that it can be used as a library.

 4. Make it easy to add new policies and to customise mutexes.

Split ib_mutex_t into several mutex types:

  1. OSBasicMutex
     This is a thin wrapper around the system mutex, POSIX or Windows 
     CRITICAL_SECTION. These are used by the events code. We can't track the 
     lock/unlock of these mutexes because they are used by the condition 
     variables. The condition variables change the state outside of InnoDB's 
     control.

  2. OSTrackedMutex
     This is derived from OSBasicMutex and can be used as a general purpose 
     mutex anywhere inside InnoDB code. It replaces os_fast_mutex_t. It brings 
     os_fast_mutex_t under UNIV_SYNC_DEBUG, so we can rank them just like the 
     homebrew mutexes.

  3. Futex - Only on Linux
     This is a TTAS type of mutex that uses futexes instead of os_event_t to 
     wait when there is contention. The advantage is that we avoid the 
     thundering herd problem by using the Linux futex infrastructure.

  4. TTASMutex - Test and Test-And-Set Mutex (TTAS)
     Simple spin mutex that will only spin and never wait when there is 
     contention, i.e., it only does busy waits.

  5. TTASWaitMutex
     It will busy wait up to the configured spin setting and then, if the mutex 
     is still not free, wait on a condition variable. This is equivalent to the 
     old ib_mutex_t.
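
The relationship between the OS wrapper and the spin-only type can be sketched 
as below. This is a hedged illustration only: SketchOSMutex, SketchSpinMutex 
and guarded_increment are hypothetical names, the enter/try_enter/exit method 
names are assumed, and std::mutex stands in for the POSIX mutex or 
CRITICAL_SECTION that OSBasicMutex wraps.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Stand-in for OSBasicMutex: a thin wrapper over the platform mutex.
class SketchOSMutex {
 public:
  void enter() { os_mutex_.lock(); }
  bool try_enter() { return os_mutex_.try_lock(); }
  void exit() { os_mutex_.unlock(); }

 private:
  std::mutex os_mutex_;
};

// Stand-in for TTASMutex: busy-waits only and never sleeps.
class SketchSpinMutex {
 public:
  void enter() {
    while (!try_enter()) {
      std::this_thread::yield();  // Politeness only; still a busy wait.
    }
  }
  // Test (plain read) first, then Test-And-Set (atomic exchange).
  bool try_enter() {
    return !locked_.load(std::memory_order_relaxed) &&
           !locked_.exchange(true, std::memory_order_acquire);
  }
  void exit() { locked_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> locked_{false};
};

// Generic caller: the mutex type is fixed at compile time, so switching
// between the OS mutex and the homebrew one needs no virtual calls.
template <typename Mutex>
int guarded_increment(Mutex& mutex, int value) {
  mutex.enter();
  int result = value + 1;
  mutex.exit();
  return result;
}
```

Because both types expose the same three methods, code such as 
guarded_increment can be instantiated with either one, which is the kind of 
switching that issue 5.ii asks for.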

The Mutex implementation takes a Policy template parameter that manages various 
things relevant to the mutex. The following policies exist; more can be added 
and existing ones can be changed.

   1. DefaultPolicy
      Only implements the busy wait loop. It uses the current mechanism around 
      srv_spin_wait_delay and srv_n_spin_wait_rounds.

   3. TrackPolicy
      This implements tracking only, and uses the DefaultPolicy to manage the 
      busy wait. It tracks the mutex name and the file and line number where 
      the mutex was created. These member variables add to the physical size 
      of the mutex implementation. This is equivalent to what we were tracking 
      in the existing implementation of mutexes.

   4. DebugPolicy
      This derives from TrackPolicy and adds checks for mutex ordering. It is 
      used when UNIV_SYNC_DEBUG is set. This policy additionally tracks the 
      filename and line number where the mutex was acquired to help in 
      debugging.

None of the above mutex types should be used directly; use them via the 
PolicyMutex interface. The PolicyMutex takes a template parameter that is the 
mutex implementation type and calls the relevant methods. The technique here is 
to use static inheritance (using templates) rather than dynamic inheritance 
(using virtual functions). The PolicyMutex knows about the Performance Schema 
(PS) and manages the PS integration. The mutex implementation doesn't know about 
PS and shouldn't have to. The PolicyMutex uses the mutex implementation's Policy 
member variable to notify the policy about mutex state changes.
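
The layering described above can be sketched as follows. This is a minimal 
illustration under stated assumptions, not the actual interface: every name 
prefixed with Sketch is hypothetical, as are the method names init, enter, 
exit, locked and released. SketchVirtualMutex exists only to show what a vptr 
costs per instance.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical policy in the spirit of TrackPolicy: remembers where the
// mutex was created and observes state changes.
struct SketchTrackPolicy {
  const char* name = "";
  const char* file = "";
  int line = 0;
  unsigned long acquisitions = 0;
  bool held = false;

  void init(const char* n, const char* f, int l) { name = n; file = f; line = l; }
  void locked() { ++acquisitions; held = true; }
  void released() { held = false; }
};

// Hypothetical policy that tracks nothing and adds no data members.
struct SketchNoPolicy {
  void init(const char*, const char*, int) {}
  void locked() {}
  void released() {}
};

// Mutex implementation: takes the Policy as a template parameter, owns it
// as a member, and knows nothing about the Performance Schema.
template <typename Policy>
class SketchSpinImpl {
 public:
  typedef Policy PolicyType;
  void lock() { while (locked_.exchange(true, std::memory_order_acquire)) {} }
  void unlock() { locked_.store(false, std::memory_order_release); }
  Policy& policy() { return policy_; }

 private:
  std::atomic<bool> locked_{false};
  Policy policy_;
};

// Wrapper: takes the implementation as a template parameter, so every call
// is resolved at compile time and no vptr is stored in the object.
template <typename MutexImpl>
class SketchPolicyMutex {
 public:
  void init(const char* name, const char* file, int line) {
    impl_.policy().init(name, file, line);
  }
  void enter() {
    // A real wrapper would drive the Performance Schema hooks around here.
    impl_.lock();
    impl_.policy().locked();  // Notified while the mutex is held.
  }
  void exit() {
    impl_.policy().released();
    impl_.unlock();
  }
  typename MutexImpl::PolicyType& policy() { return impl_.policy(); }

 private:
  MutexImpl impl_;
};

// For comparison only: a virtual interface adds a vptr to every instance.
struct SketchVirtualMutex {
  virtual ~SketchVirtualMutex() {}
  std::atomic<bool> locked{false};
};
```

Instantiating SketchPolicyMutex<SketchSpinImpl<SketchNoPolicy> > gives an 
object that carries only the lock word and an empty policy, which is the 
property that matters for per-block mutexes; choosing SketchTrackPolicy instead 
trades size for the extra bookkeeping.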