WL#8777: InnoDB: Support for sampling table data for generating histograms

Affects: Server-8.0   —   Status: Complete

In order to generate accurate histograms we need to
analyze a large part of the data stored in a table. For moderate sized tables,
we will likely read the entire table but for huge tables this will be too costly.
So for these tables we need to "sample" data by reading for instance 10% of the
records or every tenth record or every tenth data page. It is important that
the data we read is distributed over the table (ie, we should not read the first
10 percentage from the beginning of a table). This can be done completely inside
the server code but it will cause a lot of data to be read and likely pull a lot
of data into the InnoDB buffer. 

To do this sampling more efficiently, a new storage engine API 
(handler API) is implemented in WL#9127. This worklog is for
implementing the needed support in InnoDB.

InnoDB will support the handler API in WL#9127 by reading as efficiently as
possibly data pages from the clustered index for the table and provide the
records stored in these pages. This reading can be more or less done without
needing transaction support or being inserted into buffer pool (InnoDB should be
free to do any optimizations it finds useful).

The SQL standard defines two methods of sampling: SYSTEM and BERNOULLI. The
SYSTEM sampling is a page-level sampling where either all the records in the
page are read or none at all as opposed to BERNOULLI which is a row-level
sampling. Currently we support only SYSTEM sampling and this worklog aims to
achieve efficiency only for this method of sampling.
Functional
----------

F-1: For a given sampling percentage of 100, all of the rows SHOULD be returned.

F-2: For a given sampling percentage of 0, no rows should be returned.

F-3: Given a sampling percentage S, the probability of a row being included in
     the result set MUST be S/100.

F-4: If a single leaf page is present in a tree, then regardless of the sampling
     percentage the page should be sampled.

F-5: If given the same sampling seed, subsequent calls to the sampling API
     SHOULD return the same rows given that the underlying table contents has
     not changed. That is, the sampling SHOULD be deterministic.

F-6: The sampling should be without replacement. That is, the
     same row must not be returned multiple times.

F-7: The sampling SHOULD be evenly distributed over the entire table. It is not
     sufficient to read i.e. the first 10% of the table. This could lead to
     important and significant data being left out. A simple example could be a
     table containing worldwide orders. 50% of the orders could probably be from
     the USA. If all of the USA rows are located last in the table, it could
     easily be missed if only the first 10% of the table is read.

F-8: Counters to show number of pages sampled and number of pages skipped. The
     % calculation should roughly be equal to sampling percentage. (Only user
     visible change)

Non-Functional
--------------

NF-1: The performance gain against the default implementation (where the entire
      table is scanned) should be proportional to the sample size. For ex, for a
      given table of a fixed data size, with a sampling rate of 50% we should be
      able to fetch the sample records 2x faster. This should particularly hold 
      true for large dataset (>= 1TB of data) as it's a requirement for RAPID.

NF-2: For 100% sampling there shouldn't be a dip in the performance from the
      earlier implementation (full table scan).
Since we support only SYSTEM sampling, to decide whether a particular page needs
to be read or not we make use of a PRNG random generator. The random generator
would be initialized with the sampling seed as specified by the server to ensure
F-4.

We use the parallel read infrastructure of WL#11720 to place the cursor on level
1 of the B-tree (if level 0 is the leaf node) and read all the records of all the 
pages in that level. Since Level 1 records have the page_no information of the 
leaf child node, we make a decision of whether to read the leaf page or not by 
using the random generator. If the number generated by the random generator is 
less than the specified sampling percentage then we read the leaf page, else we 
skip it. This way we would ensure the functional requirements F-3 to F-6.

A natural side-effect of this approach is that we may return approximate
sampling percentage of the records and not exact, which is completely fine. In 
future if a more refined sampling is required we should consider
implementing BERNOULLI (row-level sampling).

Also, added counter sampled_pages_read, sampled_pages_skipped in the module "sampling" to track the number of pages read and skipped during the sampling. 
sampled_page_read/(sampled_pages_read + sampled_pages_skipped) would be roughly equal to the sampling percentage.
Handler Layer
=============

WL#9127 introduced three new handler layer functions which will be overridden to
provide InnoDB support for efficient sampling.
  
-> ha_innobase::ha_sample_init(double sampling_percentage, int sampling_seed, 
                            enum_sampling_method sampling_method)

Initialize the PRNG random generator with the specified sampling_seed and
intialize the sampling. Make sure the sampling percentage is between 0 and 100
(inclusive) and the sampling_method is SYSTEM sampling.

The function would return 0 on success, and any other return value indicates an
error.

-> ha_innobase::ha_sample_next(uchar *buffer)

1) If new page
  - Fetch a random number from the initialized random generator, and check if it
    less than or equal to the given sampling percentage.
  - If true, read the requested columns of the first record in the page and write 
    it to the buffer
2) else
  - read the next record in the page
  - if the next record is supremum go to 1)
  - else read the requested columns of the next record and write it to the buffer
3) return 0 on success, else return error value.


-> ha_innobase::ha_sample_end()

End the sampling. Returns 0 on success, and any other return value indicates an
error.

Sampling Internals
==================

We use the parallel read framework to do the sampling.

Some background on parallel read
--------------------------------

As the first step a) we try to create a range that needs to be read and each
range is picked up by one of the threads. The range is basically denoted by a
pair of persistent cursor placed on the leaf records
(http://wl.no.oracle.com/worklog/InnoDB-Sprint/?catname=InnoDB-BackLog;tid=12978
has information on how the range is created.)

As the second step b) each thread picks up a range and tries to read the leaf
records belonging to that range. And since the tree could have been modified in
between we reposition the persistent cursor which places the cursor on the
appropriate leaf record after taking a s-latch on the node. The traversal of the
leaf nodes is trivial as we just follow the pointers to the next page in the
leaf level.

Current work
------------

As part of the current work the parallel reader was enhanced to support reading
of any arbitrary level i.e. to allow creating the range on a non-leaf record.

The Histogram_sampler uses this support to request for reading of records of
pages in level 1 (level above the leaf node). If the level is available then we
read each of the record, which points to leaf child node, and decide whether to
read the entire page or skip by the help of random generator as mentioned above.
If it's not available then Histogram_sampler ends up reading all the records in
the leaf page regardless of the sampling percentage as this would be the case 
only when root page is the only page present in a tree and reading a single page 
is not expensive.

All the reading in step b) mentioned above happens with s_latch on the index 
lock for read_level > 0 (i.e., non-leaf node) to follow the locking protocol 
(http://wl.no.oracle.com/worklog/InnoDB-Sprint/?tid=6326).