WL#6578: InnoDB: Optimize read view creation
Affects: Benchmarks-3.0
Status: Complete
Multi-version concurrency control (MVCC) in InnoDB requires that each MVCC-using transaction be assigned a read view. The read view is created by traversing trx_sys->rw_trx_list, a linked list of active read-write transactions. This change is required to improve InnoDB performance for both read-only (RO) and read-write (RW) workloads.
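As a rough illustration of why read view creation cost grows with the length of the list, here is a minimal C++ sketch of taking a read view. It assumes a simplified model in which the ids of active RW transactions are copied into a sorted vector while a single mutex is held. SketchTrxSys, SketchReadView, open_view, changes_visible, up_limit_id and creator_id are hypothetical names (low_limit_id only echoes the m_low_limit_id field mentioned later in this WL), and the visibility rule is the conventional snapshot rule, not a transcription of InnoDB source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <list>
#include <mutex>
#include <vector>

using trx_id_t = std::uint64_t;

// Hypothetical, simplified stand-in for trx_sys_t.
struct SketchTrxSys {
    std::mutex mutex;                 // plays the role of trx_sys_t::mutex
    std::list<trx_id_t> rw_trx_list;  // ids of active RW transactions
    trx_id_t max_trx_id = 1;          // next id to be assigned
};

// Hypothetical, simplified read view.
struct SketchReadView {
    trx_id_t low_limit_id = 0;  // ids >= this had not started: invisible
    trx_id_t up_limit_id = 0;   // ids < this had finished: visible
    trx_id_t creator_id = 0;    // viewing transaction's own id (0 if RO)
    std::vector<trx_id_t> ids;  // sorted ids active when the view was taken

    // Conventional snapshot visibility check for a row version's trx id.
    bool changes_visible(trx_id_t id) const {
        if (id < up_limit_id || id == creator_id) return true;
        if (id >= low_limit_id) return false;
        // Still active when the view was taken: its changes are not visible.
        return !std::binary_search(ids.begin(), ids.end(), id);
    }
};

// O(N) in the length of rw_trx_list, and all of it runs under the mutex.
void open_view(SketchTrxSys& sys, SketchReadView& view, trx_id_t creator) {
    std::lock_guard<std::mutex> guard(sys.mutex);
    view.creator_id = creator;
    view.low_limit_id = sys.max_trx_id;
    view.ids.assign(sys.rw_trx_list.begin(), sys.rw_trx_list.end());
    std::sort(view.ids.begin(), view.ids.end());
    view.up_limit_id = view.ids.empty() ? view.low_limit_id : view.ids.front();
}
```

The copy of rw_trx_list and the sort both happen with the mutex held; that is the O(N) critical section this WL sets out to shrink.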
This WL is only about performance; there is no user-level change. All changes are internal to InnoDB.
The problems
============

There are several aspects to this problem, especially on high-end NUMA hardware:

1. Locality of reference
2. Cache coherency

The general efficiency problem is that MVCC read view creation is an O(N) operation performed under the protection of the trx_sys_t::mutex, where N is the length of the trx_sys_t::rw_trx_list. Another problem is that the read view is allocated while the trx_sys_t::mutex is held.

For a pure auto-commit non-locking read-only (AC-NL-RO) transaction load, even though the length of the trx_sys_t::rw_trx_list is 0, the cost of acquiring the mutex and creating an empty view while other transactions are simultaneously freeing views is considerable. On a 32-core machine (e.g., supra03) this can result in a meltdown, because all CPUs try to acquire the trx_sys_t::mutex simultaneously. The solution therefore has to deal with high-concurrency issues as well as make read view creation more efficient.

Another problem related to the same infrastructure (tracking running transactions) is that the trx_sys_t::rw_trx_list is also accessed by the locking code and the MVCC code to do two things:

1. Determine whether a transaction is active when doing an implicit-to-explicit lock conversion.
2. Map a transaction id to its trx_t instance.

#2 is required so that the transaction/thread doing the implicit-to-explicit conversion can add the locks to the transaction instance (trx_t*) that inserted the record. For this we need to acquire the trx_sys_t::mutex and traverse the trx_sys_t::rw_trx_list. All this adds considerable load on the trx_sys_t::mutex and restricts scaling.

The solution:
=============

1. Reduce the malloc/free calls made while the trx_sys_t::mutex is held.
2. Optimise read view create/close for AC-NL-RO transactions.
3. Use a dictionary-based data structure to do the trx_id_t to trx_t* mapping.
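The dictionary-based mapping (solution item 3) can be illustrated with a minimal sketch that contrasts the list scan with a hash lookup. It assumes std::unordered_map as the dictionary and a single mutex guarding it; the WL does not name the data structure or its latching, and SketchTrx, SketchTrxMap and find_by_scan are hypothetical. One lookup answers both questions raised by the lock conversion code: a null result means the inserting transaction is no longer active, and a non-null result is the trx_t* to which the explicit locks should be added.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <mutex>
#include <unordered_map>

using trx_id_t = std::uint64_t;

// Hypothetical stand-in for trx_t.
struct SketchTrx {
    trx_id_t id;
};

// Before: map an id to its trx by walking the list, O(N) under the mutex.
SketchTrx* find_by_scan(const std::list<SketchTrx*>& rw_trx_list, trx_id_t id) {
    for (SketchTrx* trx : rw_trx_list) {
        if (trx->id == id) return trx;
    }
    return nullptr;
}

// After: a hash map keyed by trx id, expected O(1) per lookup.
class SketchTrxMap {
public:
    void insert(SketchTrx* trx) {
        std::lock_guard<std::mutex> guard(m_mutex);
        m_by_id[trx->id] = trx;
    }

    void erase(trx_id_t id) {
        std::lock_guard<std::mutex> guard(m_mutex);
        m_by_id.erase(id);
    }

    // nullptr means "not active". Keeping the returned trx alive after the
    // lock is released is the caller's problem; this sketch ignores it.
    SketchTrx* find(trx_id_t id) {
        std::lock_guard<std::mutex> guard(m_mutex);
        auto it = m_by_id.find(id);
        return it == m_by_id.end() ? nullptr : it->second;
    }

private:
    std::mutex m_mutex;
    std::unordered_map<trx_id_t, SketchTrx*> m_by_id;
};
```

The design point is that the critical section shrinks from a walk proportional to the number of active RW transactions to a single hash probe.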
Refactor the code and use an OO idiom: create an MVCC manager class that manages the read view lifecycle.

To avoid acquiring the trx_sys_t::mutex in the AC-NL-RO case, we set a flag in the read view pointer stored in trx_t::read_view. When the view is open the flag is cleared, and when the view is closed the flag is set. The flag uses the first (least significant) bit of trx_t::read_view, which is always zero in a suitably aligned pointer. Additionally, so that purge can filter out stale views, we set a closed flag on the view when an AC-NL-RO view is closed; the closed flag is false while the view is active.

Note: AC-NL-RO transaction views are not removed from the active view list when they are closed. Removing them would require acquiring the trx_sys_t::mutex, and we want to avoid that at all costs. Having purge traverse the MVCC::views list to find the oldest view is the smaller price to pay, and this could be optimised in the future. Purge can keep track of its starting point because views are only ever added at one end of the list and purge always starts its traversal from the opposite end.

This class also maintains a free view list to reduce malloc/free overhead. When the class is created, we pre-allocate N (configurable) views and put them on the free list for later use. When a non-AC-NL-RO transaction closes its view, that view is immediately removed from the active list and moved to the free list, because we have to acquire the trx_sys_t::mutex anyway for other reasons.

For AC-NL-RO transactions, the view is reused if trx_sys_t::max_trx_id is equal to the existing attached view's m_low_limit_id, which means that no new RW transactions have been created since the view was taken. To keep things simple, we also require the view's trx_ids vector to be empty. In a pure RO load this should eliminate the read view creation overhead entirely; mixed loads can be optimised later.
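The pointer-tag flag and the AC-NL-RO reuse rule can be sketched as follows, assuming a simplified view type. SketchView, close_ro_view, untag, can_reuse and try_reopen_ro_view are hypothetical names; is_view_active only mirrors the name of the static member declared on the MVCC class. The free list is not modelled here, and how the caller reads max_trx_id safely without the mutex is outside the scope of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using trx_id_t = std::uint64_t;

// Hypothetical, minimal view holding only the fields this sketch needs.
// alignas(8) guarantees the low bit of its address is 0.
struct alignas(8) SketchView {
    trx_id_t m_low_limit_id = 0;  // max_trx_id when the view was taken
    std::vector<trx_id_t> m_ids;  // RW transactions active at that time
    bool m_closed = false;        // lets purge recognise stale RO views
};

constexpr std::uintptr_t kClosedBit = 0x1;

// Close an AC-NL-RO view without any mutex: mark the view and tag the
// trx-local pointer. The object itself stays on the active list.
inline void close_ro_view(SketchView*& view) {
    std::uintptr_t p = reinterpret_cast<std::uintptr_t>(view);
    assert((p & kClosedBit) == 0);  // must not already be closed
    view->m_closed = true;
    view = reinterpret_cast<SketchView*>(p | kClosedBit);
}

// Strip the tag to recover the real object address.
inline SketchView* untag(SketchView* view) {
    return reinterpret_cast<SketchView*>(
        reinterpret_cast<std::uintptr_t>(view) & ~kClosedBit);
}

// Open means: non-null and the tag bit is clear.
inline bool is_view_active(SketchView* view) {
    return view != nullptr &&
           (reinterpret_cast<std::uintptr_t>(view) & kClosedBit) == 0;
}

// Reuse rule: no RW transaction created since, and no active trx ids.
inline bool can_reuse(const SketchView* view, trx_id_t max_trx_id) {
    return view->m_low_limit_id == max_trx_id && view->m_ids.empty();
}

// Reopen a closed view in place; false means a fresh view must be built.
inline bool try_reopen_ro_view(SketchView*& view, trx_id_t max_trx_id) {
    SketchView* real = untag(view);
    if (!can_reuse(real, max_trx_id)) return false;
    real->m_closed = false;
    view = real;  // clearing the tag marks the view open again
    return true;
}
```

In a pure RO load every reopen takes the fast path, so neither allocation nor the trx_sys_t::mutex is involved.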
In the mixed-load scenario we could filter out the expired transaction ids; the missing piece is how to do that without acquiring the trx_sys_t::mutex.

Since stale AC-NL-RO views can also be on the active view list, purge has to filter them out. It checks the view's closed (stale) flag and then compares the view's m_low_limit_id with trx_sys_t::max_trx_id; if a RW transaction has been created since the view was taken, the read view is skipped.

The lifecycle of an AC-NL-RO view differs from that of other views. The view is created and allocated when trx_t::read_view is NULL. It is released when the session is closed by the server layer. We close and remove the view when the trx_t instance is removed from trx_sys_t::mysql_trx_list or when it is allocated to a RW transaction.

    class MVCC {
    public:
        /** Constructor
        @param size     Number of views to pre-allocate */
        explicit MVCC(ulint size);

        /** Destructor.
        Free all the views in the m_free list */
        ~MVCC();

        /** Allocate and create a view.
        @param view     view owned by this class created for the caller.
                        Must be freed by calling view_close()
        @param trx      transaction creating the view */
        void view_open(ReadView*& view, trx_t* trx);

        /** Close a view created by view_open().
        @param view      view allocated by view_open()
        @param own_mutex true if caller owns trx_sys_t::mutex */
        void view_close(ReadView*& view, bool own_mutex);

        /** Release a view that is inactive but not closed. Caller must own
        the trx_sys_t::mutex.
        @param view     View to release */
        void view_release(ReadView* view);

        /** Clones the oldest view and stores it in view. No need to call
        view_close(). The caller owns the view that is passed in. This
        function is called by purge to create its view.
        @param view     Preallocated view, owned by the caller */
        void clone_oldest_view(ReadView* view);

        /** @return the number of active views */
        ulint size() const;

        /** @return true if the view is active and valid */
        static bool is_view_active(ReadView* view);

        ...
    };
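A minimal single-threaded model of the manager can tie these pieces together: a pre-allocated free list, newest-first insertion into the active list, recycling of closed RW views without malloc/free, and purge's stale-view filter scanning from the oldest end. SketchMVCC and its members are hypothetical and are not the MVCC class declared above. Locking is omitted (callers are assumed to hold the equivalent of the trx_sys_t::mutex), and std::list::splice is used to move views between lists without allocating.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>

using trx_id_t = std::uint64_t;

struct SketchView {
    trx_id_t m_low_limit_id = 0;
    bool m_closed = false;
};

class SketchMVCC {
public:
    // Pre-allocate `size` default views on the free list.
    explicit SketchMVCC(std::size_t size) : m_free(size) {}

    // Take a view from the free list (allocate only if it is empty) and
    // add it at the newest end of the active list.
    SketchView* view_open(trx_id_t max_trx_id) {
        if (m_free.empty()) m_free.emplace_back();
        m_views.splice(m_views.begin(), m_free, m_free.begin());
        SketchView* view = &m_views.front();
        view->m_low_limit_id = max_trx_id;
        view->m_closed = false;
        return view;
    }

    // Non-AC-NL-RO close: move the node straight back to the free list.
    // The search is O(N) only because std::list is not intrusive.
    void view_close(SketchView* view) {
        for (auto it = m_views.begin(); it != m_views.end(); ++it) {
            if (&*it == view) {
                m_free.splice(m_free.begin(), m_views, it);
                return;
            }
        }
    }

    // AC-NL-RO close: only mark the view; it stays on the active list.
    static void ro_view_close(SketchView* view) { view->m_closed = true; }

    // Purge: scan from the oldest end, skipping closed views taken
    // before the most recent RW transaction was created.
    const SketchView* oldest_view(trx_id_t max_trx_id) const {
        for (auto it = m_views.rbegin(); it != m_views.rend(); ++it) {
            bool stale = it->m_closed && it->m_low_limit_id != max_trx_id;
            if (!stale) return &*it;
        }
        return nullptr;  // no view constrains purge
    }

    std::size_t size() const { return m_views.size(); }
    std::size_t free_size() const { return m_free.size(); }

private:
    std::list<SketchView> m_views;  // active views, newest at the front
    std::list<SketchView> m_free;   // pre-allocated, unused views
};
```

In this model a closed view whose m_low_limit_id still equals max_trx_id is not skipped by purge, presumably because the reuse rule could reopen it unchanged.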
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.