WL#6578: InnoDB: Optimize read view creation

Affects: Benchmarks-3.0   —   Status: Complete

The multi-version concurrency control (MVCC) in InnoDB requires that each
MVCC-using transaction be assigned a read view. The read view is created by
traversing the trx_sys->rw_trx_list, which is a linked list of active read-write
transactions.

This change is required to improve InnoDB performance for both RO and RW
workloads. This WL is only about performance; there is no user-level change.
All changes are internal to InnoDB.
The problems
There are several aspects to this problem, especially on high end NUMA:

 1. Locality of reference
 2. Cache coherency

The general efficiency problem is that MVCC read view creation is an O(N)
operation performed under the protection of the trx_sys_t::mutex, where N is the
length of the trx_sys_t::rw_trx_list. Another problem is that the read view is
allocated while the trx_sys_t::mutex is held.

For a pure Auto-commit Non-locking Read-only (AC-NL-RO) transaction load, even
though the trx_sys_t::rw_trx_list length is 0, the cost of acquiring the mutex
and creating an empty view while other transactions are simultaneously freeing
views is considerable. On a 32-core machine (e.g., supra03) this can result in a
meltdown because all CPUs will try to acquire the trx_sys_t::mutex
simultaneously. The solution to this problem has to deal with the
high-concurrency issues as well as provide a more efficient read view create.

Another problem related to the same infrastructure (maintaining running 
transactions) is that the trx_sys_t::rw_trx_list is also accessed by the locking 
code and the MVCC code to do two things:

1. Determine if a transaction is active when doing an implicit to explicit lock
   conversion

2. Map the transaction id to the trx_t instance

#2 is required so that the transaction/thread doing the implicit to explicit
conversion can add the locks to the transaction instance (trx_t*) that inserted
the record. For this we need to acquire the trx_sys_t::mutex and traverse the
trx_sys_t::rw_trx_list. All this adds considerable overhead on the
trx_sys_t::mutex and restricts scaling.

The solution:

1. Reduce the malloc/free calls while the trx_sys_t::mutex is held

2. Optimise read view create/close for AC-NL-RO transactions

3. Use a dictionary based data structure to do the trx_id_t to trx_t* mapping

Refactor the code and use an OO idiom.
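
As an illustration of point 3, the following is a minimal sketch of a
dictionary-based trx_id_t to trx_t* lookup. The ActiveTrxMap class, its method
names, the choice of std::map and the one-field trx_t stand-in are assumptions
made for this example only; this WL does not specify the actual container.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using trx_id_t = std::uint64_t;

// Hypothetical, minimal stand-in for InnoDB's trx_t (illustration only).
struct trx_t {
    trx_id_t id;
};

// Illustrative registry of active RW transactions keyed by transaction id.
// A lookup is O(log N) instead of a linear walk of the rw_trx_list.
class ActiveTrxMap {
public:
    // Register a transaction when it becomes an active RW transaction.
    void add(trx_t* trx) {
        assert(trx != nullptr);
        m_map[trx->id] = trx;
    }

    // Deregister a transaction when it commits or rolls back.
    void remove(trx_id_t id) { m_map.erase(id); }

    // Returns the active transaction with this id, or nullptr if the id
    // is not registered (i.e. the transaction is not active).
    trx_t* find(trx_id_t id) const {
        auto it = m_map.find(id);
        return it == m_map.end() ? nullptr : it->second;
    }

private:
    std::map<trx_id_t, trx_t*> m_map;
};
```

In this sketch a single find() call answers both questions from the problem
description: a nullptr result means the transaction is not active, and a
non-null result is the trx_t instance to attach the explicit lock to. The
caller is still expected to provide whatever synchronisation the map needs.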

Create an MVCC manager class. This class manages the read view lifecycle. To
avoid acquiring the trx_sys_t::mutex for the AC-NL-RO case we set a flag in the
read view pointer in trx_t::read_view. When the view is open the flag is cleared
and when the view is closed the flag is set. This flag uses the first bit of
trx_t::read_view. Additionally, so that purge can filter out stale views, we set
a closed flag in the view when an AC-NL-RO view is closed. The closed flag is
set to false while the view is active.

Note: AC-NL-RO transaction views are not removed from the active view list when
they are closed. Removing them would require acquiring the trx_sys_t::mutex and
we want to avoid that at all costs. If purge has to traverse the MVCC::views
list to get the oldest view, that is a smaller price to pay. This could be
optimised in the future too: we can keep track of the starting point in purge,
because views can only be added from one end and purge always starts the
traversal from the opposite end.
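
The pointer flag described above can be sketched as follows. This is an
illustration only: it assumes that "first bit" means the lowest-order bit of the
pointer value, and ReadViewSketch, VIEW_CLOSED_BIT and the three helper
functions are hypothetical names, not the InnoDB ones.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the real read view class.
struct ReadViewSketch {
    std::uint64_t low_limit_id;
};

// The object alignment leaves the lowest pointer bit unused, so it can
// carry the flag. Assumption: the flag lives in the lowest-order bit.
static_assert(alignof(ReadViewSketch) >= 2, "low bit must be free");
constexpr std::uintptr_t VIEW_CLOSED_BIT = 1;

// Return the same pointer with the "closed" flag set.
inline ReadViewSketch* mark_closed(ReadViewSketch* view) {
    return reinterpret_cast<ReadViewSketch*>(
        reinterpret_cast<std::uintptr_t>(view) | VIEW_CLOSED_BIT);
}

// Return the real object address with the flag cleared.
inline ReadViewSketch* strip_flag(ReadViewSketch* view) {
    return reinterpret_cast<ReadViewSketch*>(
        reinterpret_cast<std::uintptr_t>(view) & ~VIEW_CLOSED_BIT);
}

// A view is active if the pointer is non-null and the flag is clear.
inline bool is_active(const ReadViewSketch* view) {
    return view != nullptr
        && (reinterpret_cast<std::uintptr_t>(view) & VIEW_CLOSED_BIT) == 0;
}
```

A flagged pointer must never be dereferenced directly; callers go through
strip_flag() first. The attraction of this scheme is that closing a view only
changes a value owned by the transaction itself, which is why no
trx_sys_t::mutex acquisition is needed on that path.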

This class also maintains a free view list to reduce malloc/free overhead. When
this class is created we pre-allocate N (configurable) views and put them on the
free list for later use. When a non-AC-NL-RO transaction closes its view, that
view is immediately removed from the active list and moved to the free list. The
reason for this is that we have to acquire the trx_sys_t::mutex anyway for other
reasons.
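
A minimal sketch of the free list idea follows. ViewPool, ViewSketch and the
use of std::list are assumptions for illustration; in particular, close() here
searches the active list linearly, whereas an intrusive list would unlink a
view in O(1). Locking is left out of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Hypothetical stand-in for the real read view class.
struct ViewSketch {
    bool closed = true;
};

// Illustrative pool: views move between a free list and an active list
// instead of being allocated and freed on every open/close.
class ViewPool {
public:
    // Pre-allocate n views and put them on the free list.
    explicit ViewPool(std::size_t n) : m_free(n) {}

    // Take a view from the free list, allocating only if it is empty,
    // and link it at the head of the active list.
    ViewSketch* open() {
        if (m_free.empty()) {
            m_free.emplace_back();
        }
        m_active.splice(m_active.begin(), m_free, m_free.begin());
        ViewSketch* view = &m_active.front();
        view->closed = false;
        return view;
    }

    // Move a view from the active list back to the free list.
    void close(ViewSketch* view) {
        for (auto it = m_active.begin(); it != m_active.end(); ++it) {
            if (&*it == view) {
                it->closed = true;
                m_free.splice(m_free.begin(), m_active, it);
                return;
            }
        }
        assert(false && "view is not on the active list");
    }

    std::size_t free_count() const { return m_free.size(); }
    std::size_t active_count() const { return m_active.size(); }

private:
    std::list<ViewSketch> m_free;
    std::list<ViewSketch> m_active;
};
```

std::list::splice relinks nodes without copying them, so a ViewSketch* handed
out by open() stays valid while the view moves between the two lists.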

For AC-NL-RO transactions, the view is reused if the trx_sys_t::max_trx_id is
equal to the existing attached view's m_low_limit_id. This means that no new RW
transactions have been created since the last view was taken. To keep things
simple, another requirement is that the view's trx_ids vector must be empty. In
a pure RO load this should eliminate the read view create overhead entirely. We
can optimise it later for mixed loads. In the mixed load scenario we could
filter out the expired transaction ids; the missing piece is how to do that
without acquiring the trx_sys_t::mutex.
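
The reuse condition can be written as a single predicate. The sketch below is an
illustration under assumptions: ViewSketch, its m_ids field and can_reuse_view()
are hypothetical names standing in for the real view class, its trx_ids vector
and whatever check the implementation performs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using trx_id_t = std::uint64_t;

// Hypothetical stand-in for the real read view class.
struct ViewSketch {
    trx_id_t m_low_limit_id = 0;   // max_trx_id at the time the view was taken
    std::vector<trx_id_t> m_ids;   // RW transactions active at that time
};

// An attached AC-NL-RO view can be reused when
//  1. no RW transaction has been created since the view was taken, and
//  2. no RW transaction was active when the view was taken.
inline bool can_reuse_view(const ViewSketch* view, trx_id_t max_trx_id) {
    return view != nullptr
        && view->m_ids.empty()
        && view->m_low_limit_id == max_trx_id;
}
```

In a pure RO load max_trx_id never advances and the id vector stays empty, so
after the first view every later check succeeds and no new view is built.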

Since stale AC-NL-RO views can also be on the active view list, purge has to
filter them out. It checks the closed flag and then compares the view's
m_low_limit_id with the trx_sys_t::max_trx_id. If a RW transaction has been
created since the view was taken, the read view is skipped.
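
The purge-side filter can be sketched as below. This is not the InnoDB code:
ViewSketch and oldest_usable_view() are hypothetical, and the sketch assumes
that new views are added at the front of the list so that the oldest view is at
the back.

```cpp
#include <cassert>
#include <cstdint>
#include <list>

using trx_id_t = std::uint64_t;

// Hypothetical stand-in for the real read view class.
struct ViewSketch {
    trx_id_t low_limit_id;  // max_trx_id when the view was taken
    bool closed;            // set when an AC-NL-RO view is closed
};

// Scan from the oldest end and return the first view that is not stale.
// A view is treated as stale when it is closed and at least one RW
// transaction has been created since it was taken.
// Assumption: views are added at the front, so the oldest is at the back.
inline const ViewSketch* oldest_usable_view(const std::list<ViewSketch>& views,
                                            trx_id_t max_trx_id) {
    for (auto it = views.rbegin(); it != views.rend(); ++it) {
        const bool stale = it->closed && it->low_limit_id < max_trx_id;
        if (!stale) {
            return &*it;
        }
    }
    return nullptr;  // no usable view on the list
}
```

When nullptr is returned, the caller would have to build the purge view from
the current transaction system state instead of cloning an existing view.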

The lifecycle of an AC-NL-RO view is different from other views. The view is
created and allocated when the trx_t::read_view is NULL. It is released when the
session is closed by the server layer. We close/remove the view when we remove
it from the trx_sys_t::mysql_trx_list or the trx_t instance is allocated to a RW
transaction.

class MVCC {
public:
        /** Constructor
        @param size             Number of views to pre-allocate */
        explicit MVCC(ulint size);

        /** Destructor.
        Free all the views in the m_free list */
        ~MVCC();

        /** Allocate and create a view.
        @param view             view owned by this class created for the
                                caller. Must be freed by calling view_close()
        @param trx              transaction creating the view */
        void view_open(ReadView*& view, trx_t* trx);

        /** Close a view created by the above function.
        @param view             view allocated by view_open()
        @param own_mutex        true if caller owns trx_sys_t::mutex */
        void view_close(ReadView*& view, bool own_mutex);

        /** Release a view that is inactive but not closed. Caller must own
        the trx_sys_t::mutex.
        @param view             View to release */
        void view_release(ReadView* view);

        /** Clones the oldest view and stores it in view. No need to
        call view_close(). The caller owns the view that is passed in.
        This function is called by Purge to create its view.
        @param view             Preallocated view, owned by the caller */
        void clone_oldest_view(ReadView* view);

        /** @return the number of active views */
        ulint size() const;

        /** @return true if the view is active and valid */
        static bool is_view_active(ReadView* view);