WL#12978: InnoDB:Fix imbalance during parallel scan
Affects: Server-8.0
—
Status: Complete
This is a follow up to: WL#11720 - InnoDB: Parallel read of index. In the original design the B+Tree is split into sub-trees at a level below the root that is >= number of configured parallel read threads. For very large data sets this results in under utilization of parallel read threads. As an example: 1. 4 read threads. 2. 5 sub-trees to scan in parallel The first 4 sub-trees will be scanned in parallel, when the last (5th) sub-tree is being scanned the other threads will be idle. If it takes 1 minute to scan each sub-tree (assume parallel reads scale perfectly) then the total scan time will be 2 minutes. The solution is to split the "remainder" partitions once more into sub-trees. We can do this to any level, one level below the root of the sub-trees seems to work fine in practice. This way the we can keep all the parallel threads busy and reduce the time to 1m + time to scan remainder sub-partitions in parallel. Additional changes are related to pre-fetching pages during the parallel scan. This is to reduce the IO overhead on the parallel scan threads.
FR: No new user visible functionality. NFR1: Parallel read threads should process similar number of rows during the range scan. Leading to a performance improvement for very large tables (e.g., >= 64GB). NFR2: When doing scans where the entire table has to be read into the buffer pool the scan should only be limited by the disk/storage read rate. With perhaps 5-7% overhead.
1. Remove Phy_reader class, only support logical reads 2. Add support for scanning an arbitrary range 3. Simplify the design for secondary engine loading. 4. Main thread should do pre-fetch of data pages
Allow multiple parallel range scans at the same time. To achieve this split out the scan context (Scan_ctx) from the execution context (Ctx). The Scan_ctx has the index and transaction information and the Ctx keeps track of the cursor for a specific thread during the scan. To start a scan we need to instantiate a Parallel_reader. A parallel reader can contain several Scan_ctx instances and a Scan_ctx can contain several Ctx instances. Its' the Ctx instances that are eventually executed. The Parallel_reader will start one thread per Scan_ctx to service read ahead requests. Currently, the read ahead is a physical read ahead ie. it will read one extent at a time. This design allows for a single Parallel_reader to scan multiple indexes at once. Each index range scan has to be added via its add_job() method. This functionality is required to handle parallel partition scans because partitions are separate indexes. To solve the imbalance problem we dynamically split the sub-trees as and when required. e.g., If you have 5 sub-trees to scan and 4 threads then it will tag the 5th sub-tree as "to_be_split" during phase I (add_job()), the first thread that finishes scanning the first set of 4 partitions will then dynamically split the 5th sub-tree and add the newly created sub-trees to the execution context (Ctx) run queue in the Parallel_reader. As the other threads complete their sub-tree scans they will pick up more execution contexts from the Parallel_reader run queue and start scanning the sub-partitions as normal.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.