WL#12978: InnoDB:Fix imbalance during parallel scan

Affects: Server-8.0 — Status: Complete

Description
Requirements
High Level Architecture
Low Level Design

This is a follow up to: WL#11720 - InnoDB: Parallel read of index.

In the original design the B+Tree is split into sub-trees at a level below the 
root that is >= number of configured parallel read threads. For very large data 
sets this results in under utilization of parallel read threads. As an example:

1. 4 read threads.
2. 5 sub-trees to scan in parallel

The first 4 sub-trees will be scanned in parallel, when the last (5th) sub-tree is 
being scanned the other threads will be idle. If it takes 1 minute to scan each 
sub-tree (assume parallel reads scale perfectly) then the total scan time will be 
2 minutes.

The solution is to split the "remainder" partitions once more into sub-trees. We 
can do this to any level, one level below the root of the sub-trees seems to work 
fine in practice.

This way the we can keep all the parallel threads busy and reduce the time to 1m + 
time to scan remainder sub-partitions in parallel.

Additional changes are related to pre-fetching pages during the parallel scan. 
This is to reduce the IO overhead on the parallel scan threads.

FR: No new user visible functionality.

NFR1: Parallel read threads should process similar number of rows during the range 
scan. Leading to a performance improvement for very large tables (e.g., >= 64GB).

NFR2: When doing scans where the entire table has to be read into the buffer pool 
the scan should only be limited by the disk/storage read rate. With perhaps 5-7% 
overhead.

1. Remove Phy_reader class, only support logical reads
2. Add support for scanning an arbitrary range
3. Simplify the design for secondary engine loading.
4. Main thread should do pre-fetch of data pages

Allow multiple parallel range scans at the same time. To achieve this split out 
the scan  context (Scan_ctx) from the execution context (Ctx). The Scan_ctx has 
the index  and transaction information and the Ctx keeps track of the cursor for a 
specific thread during the scan.

To start a scan we need to instantiate a Parallel_reader. A parallel reader can 
contain several Scan_ctx instances and a Scan_ctx can contain several Ctx 
instances. Its' the Ctx instances that are eventually executed.

The Parallel_reader will start one thread per Scan_ctx to service read ahead 
requests. Currently, the read ahead is a physical read ahead ie. it will read one 
extent at a time.

This design allows for a single Parallel_reader to scan multiple indexes at once. 
Each index range scan has to be added via its add_job() method. This functionality 
is required to handle parallel partition scans because partitions are separate 
indexes.

To solve the imbalance problem we dynamically split the sub-trees as and when 
required. e.g., If you have 5 sub-trees to scan and 4 threads then it will tag the 
5th sub-tree as "to_be_split" during phase I (add_job()), the first thread that 
finishes scanning the first set of 4 partitions will then dynamically split the 
5th sub-tree and add the newly created sub-trees to the execution context (Ctx) 
run queue in the Parallel_reader. As the other threads complete their sub-tree 
scans they will pick up more execution contexts from the Parallel_reader run queue 
and start scanning the sub-partitions as normal.