WL#8777: InnoDB: Support for sampling table data for generating histograms
Affects: Server-8.0
—
Status: Complete
In order to generate accurate histograms we need to analyze a large part of the data stored in a table. For moderate sized tables, we will likely read the entire table but for huge tables this will be too costly. So for these tables we need to "sample" data by reading for instance 10% of the records or every tenth record or every tenth data page. It is important that the data we read is distributed over the table (ie, we should not read the first 10 percentage from the beginning of a table). This can be done completely inside the server code but it will cause a lot of data to be read and likely pull a lot of data into the InnoDB buffer. To do this sampling more efficiently, a new storage engine API (handler API) is implemented in WL#9127. This worklog is for implementing the needed support in InnoDB. InnoDB will support the handler API in WL#9127 by reading as efficiently as possibly data pages from the clustered index for the table and provide the records stored in these pages. This reading can be more or less done without needing transaction support or being inserted into buffer pool (InnoDB should be free to do any optimizations it finds useful). The SQL standard defines two methods of sampling: SYSTEM and BERNOULLI. The SYSTEM sampling is a page-level sampling where either all the records in the page are read or none at all as opposed to BERNOULLI which is a row-level sampling. Currently we support only SYSTEM sampling and this worklog aims to achieve efficiency only for this method of sampling.
Functional ---------- F-1: For a given sampling percentage of 100, all of the rows SHOULD be returned. F-2: For a given sampling percentage of 0, no rows should be returned. F-3: Given a sampling percentage S, the probability of a row being included in the result set MUST be S/100. F-4: If a single leaf page is present in a tree, then regardless of the sampling percentage the page should be sampled. F-5: If given the same sampling seed, subsequent calls to the sampling API SHOULD return the same rows given that the underlying table contents has not changed. That is, the sampling SHOULD be deterministic. F-6: The sampling should be without replacement. That is, the same row must not be returned multiple times. F-7: The sampling SHOULD be evenly distributed over the entire table. It is not sufficient to read i.e. the first 10% of the table. This could lead to important and significant data being left out. A simple example could be a table containing worldwide orders. 50% of the orders could probably be from the USA. If all of the USA rows are located last in the table, it could easily be missed if only the first 10% of the table is read. F-8: Counters to show number of pages sampled and number of pages skipped. The % calculation should roughly be equal to sampling percentage. (Only user visible change) Non-Functional -------------- NF-1: The performance gain against the default implementation (where the entire table is scanned) should be proportional to the sample size. For ex, for a given table of a fixed data size, with a sampling rate of 50% we should be able to fetch the sample records 2x faster. This should particularly hold true for large dataset (>= 1TB of data) as it's a requirement for RAPID. NF-2: For 100% sampling there shouldn't be a dip in the performance from the earlier implementation (full table scan).
Since we support only SYSTEM sampling, to decide whether a particular page needs to be read or not we make use of a PRNG random generator. The random generator would be initialized with the sampling seed as specified by the server to ensure F-4. We use the parallel read infrastructure of WL#11720 to place the cursor on level 1 of the B-tree (if level 0 is the leaf node) and read all the records of all the pages in that level. Since Level 1 records have the page_no information of the leaf child node, we make a decision of whether to read the leaf page or not by using the random generator. If the number generated by the random generator is less than the specified sampling percentage then we read the leaf page, else we skip it. This way we would ensure the functional requirements F-3 to F-6. A natural side-effect of this approach is that we may return approximate sampling percentage of the records and not exact, which is completely fine. In future if a more refined sampling is required we should consider implementing BERNOULLI (row-level sampling). Also, added counter sampled_pages_read, sampled_pages_skipped in the module "sampling" to track the number of pages read and skipped during the sampling. sampled_page_read/(sampled_pages_read + sampled_pages_skipped) would be roughly equal to the sampling percentage.
Handler Layer ============= WL#9127 introduced three new handler layer functions which will be overridden to provide InnoDB support for efficient sampling. -> ha_innobase::ha_sample_init(double sampling_percentage, int sampling_seed, enum_sampling_method sampling_method) Initialize the PRNG random generator with the specified sampling_seed and intialize the sampling. Make sure the sampling percentage is between 0 and 100 (inclusive) and the sampling_method is SYSTEM sampling. The function would return 0 on success, and any other return value indicates an error. -> ha_innobase::ha_sample_next(uchar *buffer) 1) If new page - Fetch a random number from the initialized random generator, and check if it less than or equal to the given sampling percentage. - If true, read the requested columns of the first record in the page and write it to the buffer 2) else - read the next record in the page - if the next record is supremum go to 1) - else read the requested columns of the next record and write it to the buffer 3) return 0 on success, else return error value. -> ha_innobase::ha_sample_end() End the sampling. Returns 0 on success, and any other return value indicates an error. Sampling Internals ================== We use the parallel read framework to do the sampling. Some background on parallel read -------------------------------- As the first step a) we try to create a range that needs to be read and each range is picked up by one of the threads. The range is basically denoted by a pair of persistent cursor placed on the leaf records (http://wl.no.oracle.com/worklog/InnoDB-Sprint/?catname=InnoDB-BackLog;tid=12978 has information on how the range is created.) As the second step b) each thread picks up a range and tries to read the leaf records belonging to that range. And since the tree could have been modified in between we reposition the persistent cursor which places the cursor on the appropriate leaf record after taking a s-latch on the node. The traversal of the leaf nodes is trivial as we just follow the pointers to the next page in the leaf level. Current work ------------ As part of the current work the parallel reader was enhanced to support reading of any arbitrary level i.e. to allow creating the range on a non-leaf record. The Histogram_sampler uses this support to request for reading of records of pages in level 1 (level above the leaf node). If the level is available then we read each of the record, which points to leaf child node, and decide whether to read the entire page or skip by the help of random generator as mentioned above. If it's not available then Histogram_sampler ends up reading all the records in the leaf page regardless of the sampling percentage as this would be the case only when root page is the only page present in a tree and reading a single page is not expensive. All the reading in step b) mentioned above happens with s_latch on the index lock for read_level > 0 (i.e., non-leaf node) to follow the locking protocol (http://wl.no.oracle.com/worklog/InnoDB-Sprint/?tid=6326).
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.