WL#11720: InnoDB: Parallel read of index

Affects: Server-8.0   —   Status: Complete   —   Priority: Medium

Read the sub trees of an index in parallel only if the request is a non-locking SELECT COUNT(*).

Allow for "physical" read (a.k.a read uncommitted) and logical read using MVCC rules.

Additionally refactor the persistent cursor code.

Current scope is limited to providing sufficient infrastructure for DDL operations to read the data in parallel. Making the second phase of CHECK TABLE parallel is an added bonus for now. This speeds up CHECK TABLE a little.

The parallel SELECT COUNT(*) ...; is required because we would like to test the changes independent of any other external component.

We will not implement any locking by the parallel read threads. That is a bigger project and can be done as follow up work. This is because some assumptions inside InnoDB have to change and we will have to handle lock waits and coordinate the rollback in the reader threads.

FR1: SELECT COUNT(*) FROM TABLE T; will scan the index in parallel only if the scan is a non-locking scan and --innodb-parallel-read-threads > 1. Otherwise it will fallback to the old row by row scan.

FR2: The second phase of CHECK TABLE T; will also do a parallel scan.

FR3: Support MVCC semantics.

FR4: New session level variable --innodb-parallel-read-threads to control the number of threads to use for parallel SELECT COUNT(*) ...;

1. Minimum value 1
2. Default value 4
3. Maximum value 256.

NFR1: Speed up should be 10x for SELECT COUNT(*) FROM t; on relevant hardware (e.g., tetra02).

Start threads to scan the key ranges in parallel.

The number of threads used for scanning will be the minimum of innodb_parallel_read_threads and total number of sub-trees to scan.

The key range scan is a non-locking scan.

The pages read in during the scan are kept at tail end of the LRU list so that they can be discarded quickly when there is pressure on the buffer pool for free pages. The idea is avoid flooding the buffer pool during the scan.

The latching protocol for Key_reader is:

1. Lock the dict_index_t::lock in SX mode. 2. Create the key range cursors for all the sub trees. 3. Release the dict_index_t::lock.

The subtrees to scan in parallel are selected using the following algorithm:

1. Read the left most nodes at each level starting from the root node.

2. Search a level where the number of sub-trees is greater than the number of threads specified.

3. There are two variants after this.

  a. Physical scan of leaf pages based on page number partitioning. Implemented 

in the Phy_reader class.

  b. Logical scan of leaf pages based on key range partitioning using the 

persistent cursor (Btr_pcur). Implemented in the Key_reader class.

4. Each thread than follows the left most pointer down to the leaf node and starts to scan the records in the leaf pages.

5. The scan will skip delete marked rows for both Phy_reader and Key_reader.

6. The Key reader will use the read view assigned to the transaction if there is one set. It will build the previous version that is visible to the transaction if required.

7. Pass the row to a user specified callback which will return DB_SUCCESS or an error code.

8. The reader class will stop processing if the callback returns anything other than DB_SUCCESS.