Background
==========

InnoDB makes a distinction between OS mutexes, fast_mutex_t (e.g., POSIX mutexes on Unices and CRITICAL_SECTION on Windows), and a home brew version (ib_mutex_t). You cannot mix the two together. ib_mutex_t uses ranking/ordering to make it easy to check for deadlocks if #UNIV_SYNC_DEBUG is defined. There is no such check for fast_mutex_t. ib_mutex_t is used for all core code synchronisation.

ib_mutex_t is implemented as a Test, Test-And-Set, And Wait (TTASAW) mutex. There are two configuration variables that control its behaviour:

* innodb_spin_wait_delay maps to global variable srv_spin_wait_delay
  This variable sets the upper bound on the delay after the Test.

* innodb_sync_spin_loops maps to global variable srv_n_spin_wait_rounds
  This variable controls the number of spins of the Test.

All ib_mutex_t instances are controlled by these two variables. If the thread fails to acquire the mutex then it is forced to wait on a condition variable that is attached to the ib_mutex_t. When the mutex is free all threads are woken up via a broadcast signal on the condition variable. One of the threads will end up acquiring the mutex and the remaining threads will have to go through the TTAS and Wait cycle again, and so on.

The Problem
===========

There are several issues with this:

1. Thundering herd due to the broadcast

2. Two global variables that control all instances

3. Size overhead due to the condition variable that is part of the ib_mutex_t.
   This has been noted as a concern for very large buffer pools. Each block in the buffer pool has a mutex attached to it. Therefore the smaller the block size the higher the overhead.

4. Different subsystems have different requirements.

5. Not flexible
   i. Monitoring is cumbersome
   ii. Can't switch between OS mutex and ib_mutex_t, where we know that the OS implementation performs better than the homegrown version.
   iii. Cannot use mutexes with different characteristics in different parts of the code, e.g., use a wait mechanism based on a futex on Linux instead of the condition variable

6. The code is not generic, it cannot be used as a library

The solution
============

1. Have one mutex type; the OS mutex and any homebrew mutexes are subtypes of this generic type

2. Use static inheritance (templates) to reduce vptr overhead. This is mainly for the buffer pool block mutexes. Otherwise dynamic inheritance should be OK too.

3. Decouple the code from InnoDB so that it can be used as a library.

4. Make it easy to add new policies and make it easy to customise mutex behaviour.
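
The size argument in item 2 can be illustrated with a small sketch. All names below (VirtualMutexBase, VirtualSpinMutex, SpinImpl, StaticMutex, kVirtualSize, kStaticSize) are hypothetical and are not taken from the InnoDB sources. The sketch only assumes what is true on common ABIs: a class with virtual functions typically carries a hidden vptr in every object, while a template wrapper adds nothing beyond the implementation it wraps.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical names throughout; an illustration, not InnoDB code.

// Dynamic inheritance: each object typically stores a vptr next to its state.
struct VirtualMutexBase {
    virtual ~VirtualMutexBase() {}
    virtual void enter() = 0;
    virtual void exit() = 0;
};

struct VirtualSpinMutex : VirtualMutexBase {
    void enter() override {
        // Busy wait until the flag flips from false to true for this caller.
        while (m_locked.exchange(true, std::memory_order_acquire)) {
        }
    }
    void exit() override { m_locked.store(false, std::memory_order_release); }

    std::atomic<bool> m_locked{false};
};

// Static inheritance: the implementation is a template parameter, so calls
// are resolved at compile time and no vptr is stored in the object.
struct SpinImpl {
    void enter() {
        while (m_locked.exchange(true, std::memory_order_acquire)) {
        }
    }
    void exit() { m_locked.store(false, std::memory_order_release); }

    std::atomic<bool> m_locked{false};
};

template <typename Impl>
struct StaticMutex {
    void enter() { m_impl.enter(); }
    void exit() { m_impl.exit(); }

    Impl m_impl;
};

// Object sizes of the two variants, for comparison.
constexpr std::size_t kVirtualSize = sizeof(VirtualSpinMutex);
constexpr std::size_t kStaticSize = sizeof(StaticMutex<SpinImpl>);
```

The exact sizes depend on the platform, but with one such object per buffer pool block the per-object difference is multiplied by the number of blocks, which is why the static form is suggested for the block mutexes in particular.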

Split ib_mutex_t into several mutex types.

1. OSBasicMutex
   This is a thin wrapper around the system mutex, POSIX or Windows CriticalSection. These are used by the events code. We can't track the lock/unlock of these mutexes because they are used by the condition variables. The condition variables change the state outside of InnoDB control.

2. OSTrackedMutex
   This is derived from OSBasicMutex and it can be used as a general purpose mutex anywhere inside InnoDB code. It replaces os_fast_mutex_t. It brings os_fast_mutex_t under UNIV_SYNC_DEBUG and we can rank them just like the homebrew mutexes.

3. Futex - Only on Linux
   This is a TTAS type of mutex that uses Futexes instead of os_event_t to wait when there is contention. The advantage is that we avoid the thundering herd problem by using the Linux futex infrastructure.

4. TTASMutex - Test and Test-And-Set Mutex (TTAS)
   Simple spin mutex that will only spin and never wait when there is contention, i.e., it only does busy waits.

5. TTASWaitMutex
   It will busy wait up to a configured setting but then wait on a condition variable once the busy wait exceeds the configured spin setting. This is equivalent to the old ib_mutex_t.

The Mutex implementation takes a Policy template parameter that manages various things relevant to the mutex. The following policies exist; more can be added or existing ones can be changed.

1. DefaultPolicy
   Only implements the busy wait loop. It uses the current mechanism around srv_spin_wait_delay and srv_n_spin_wait_rounds.

2. TrackPolicy
   This implements tracking only, and uses the DefaultPolicy to manage the busy wait. It tracks the mutex name and the file and line number where the mutex was created. These member variables will add to the physical size of the mutex implementation. This is equivalent to what we were tracking in the existing implementation of mutexes.

3. DebugPolicy
   This derives from TrackPolicy and adds additional checks for mutex ordering. It is used when UNIV_SYNC_DEBUG is set. This policy additionally tracks the filename and line number where the mutex was acquired to help in debugging.

All of the above mutex types should not be used directly but via the PolicyMutex interface. The PolicyMutex takes a template parameter that is the mutex implementation type and calls the relevant methods. The technique here is to use static inheritance (using templates) and not dynamic inheritance (using virtual functions).

The PolicyMutex knows about the Performance Schema and manages the PS integration. The mutex implementation doesn't know about PS and shouldn't have to. The PolicyMutex uses the mutex implementation's Policy member variable to notify the policy about the mutex state changes.
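
The layering described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual InnoDB interface: CountingPolicy, SpinOnlyMutex, PolicyMutexSketch and all of their member names are hypothetical, the Performance Schema hooks are omitted, and the policy merely counts state changes so that the notification path is visible.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical policy: counts the state changes it is told about.
// It is only touched while the mutex is held, so plain counters suffice.
struct CountingPolicy {
    void locked() { ++m_n_locked; }
    void released() { ++m_n_released; }

    unsigned long m_n_locked = 0;
    unsigned long m_n_released = 0;
};

// Hypothetical spin-only implementation in the spirit of TTASMutex.
// The Policy is a template parameter and a member of the implementation.
template <typename Policy>
class SpinOnlyMutex {
public:
    typedef Policy MutexPolicy;

    void enter() {
        for (;;) {
            // Test: spin on a plain load until the mutex looks free.
            while (m_locked.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
            // Test-and-set: try to take it; retry if another thread won.
            if (!m_locked.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }

    bool try_lock() {
        return !m_locked.load(std::memory_order_relaxed) &&
               !m_locked.exchange(true, std::memory_order_acquire);
    }

    void exit() { m_locked.store(false, std::memory_order_release); }

    Policy& policy() { return m_policy; }

private:
    std::atomic<bool> m_locked{false};
    Policy m_policy;
};

// Hypothetical wrapper: forwards to the implementation chosen at compile
// time and notifies the implementation's policy about state changes.
template <typename MutexImpl>
class PolicyMutexSketch {
public:
    void enter() {
        m_impl.enter();
        m_impl.policy().locked();  // notified while the mutex is held
    }

    bool try_lock() {
        if (m_impl.try_lock()) {
            m_impl.policy().locked();
            return true;
        }
        return false;
    }

    void exit() {
        m_impl.policy().released();  // notified before the mutex is released
        m_impl.exit();
    }

    typename MutexImpl::MutexPolicy& policy() { return m_impl.policy(); }

private:
    MutexImpl m_impl;
};
```

With this arrangement, declaring `PolicyMutexSketch<SpinOnlyMutex<CountingPolicy> > m;` and calling `m.enter()` followed by `m.exit()` leaves one lock and one release recorded in the policy. Swapping in a different implementation or policy is a change to the type only, and the calling code stays the same.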