WL#9127: Define new handler API for sampling

Affects: Server-8.0   —   Status: Complete

This worklog specifies a new handler API for doing sampling of a given table.
The main use case for the new API is for building histograms. Creating a
histogram for a large table (terabyte size) can be very costly. Building a
histogram over a sample of the data is much more efficient, and it will still
give a reasonably good histogram.

Another use case in the future, might be to implement the SQL standard feature
T613, Sampling. This feature allows for "TABLESAMPLE" to be specified after a
 clause, and is implemented by several major DBMS (SQL Server,
PostgreSQL, Oracle, DB2 etc.):

  SELECT * FROM t1 TABLESAMPLE SYSTEM (5);

The above example would return approximately 5 percentage of the data in the
table 't1'.

User Documentation
==================

No user-visible changes. No user documentation needed.
F-1: The sampling percentage MUST be within the range [0.0 100.0] (inclusive).

F-2: Given a sampling percentage S, the probability of a row being included in
     the result set MUST be S/100. This also means that given a sampling
     percentage of 0, no rows must be returned.

F-3: If given the same sampling seed, subsequent calls to the sampling API
     SHOULD return the same rows given that the underlying table contents has
     not changed. That is, the sampling SHOULD be deterministic.

F-4: The sampling API SHOULD have a default implementation which can be used
     if the storage engine has not provided an implementation.

F-5: The sampling method MUST be sampling without replacement. That is, the
     same row must not be returned multiple times.

F-6: If given a sampling percentage of 100, all of the rows SHOULD be returned.

F-7: The sampling SHOULD be evenly distributed over the entire table. It is not
     sufficient to read i.e. the first 10% of the table. This could lead to
     important and significant data being left out. A simple example could be a
     table containing worldwide orders. 50% of the orders could probably be from
     the USA. If all of the USA rows are located last in the table, it could
     easily be missed if only the first 10% of the table is read.
The SQL standard defines two methods of sampling: SYSTEM and BERNOULLI. For this
worklog, we will only provide support for SYSTEM. The difference between SYSTEM
and BERNOULLI is basically that SYSTEM is page level sampling while BERNOULLI is
row level sampling. We refer to the SQL standard for more details if wanted.

We introduce one new strongly typed enum and six new functions:

enum class enum_sampling_method
{
  SYSTEM
};

class handler
{
public:
  int handler::ha_sample_init(double sampling_percentage, int sampling_seed,
                              enum_sampling_method sampling_method);
  int handler::ha_sample_next(uchar *buf);
  int handler::ha_sample_end();

private:
  virtual int handler::sample_init();
  virtual int handler::sample_next(uchar *buf);
  virtual int handler::sample_end();
};

It follows the same pattern as other API functions for reading from tables. The
normal workflow will be something like the following code:

  // Specify through the tables "read_set" which columns we want to read.
  bitmap_set_all(table->read_set);
  bitmap_clear_all(table->write_set);

  // Initialize sampling, where approx. 10% of the rows will be read.
  table->file->ha_sample_init(10.0, 0, enum_sampling_method::SYSTEM);

  while (table->file->ha_sample_next(table->record[0]))
  {
    // Do stuff with the data.
  }

  // End the sampling/close the handler.
  table->file->ha_sample_end();
The three new functions prefixed with ha_* are the functions that will be used
by the server to read sampled data. The three new functions without the prefix,
are the virtual functions that the storage engines need to override in order to
support sampling. We will provide a default implementation for the three virtual
functions, which is described in more detail later.

Functions specification
-----------------------
-- int handler::ha_sample_init(double sampling_percentage, int sampling_seed,
                               enum_sampling_method sampling_method);

Initialize sampling, where approx "sampling_percentage" of the rows in the table
will be read. The parameter "sampling_percentage" must be greater than or equal
to 0.0, and less than or equal to 100.0. The function returns 0 on success, and
any other return value indicates an error.


-- int handler::ha_sample_next(uchar *buffer);

Read the next row via sampling. The row is read into the buffer given by the
parameter "buffer". Returns 0 on success and HA_ERR_END_OF_FILE if no more rows
are available. Any other return value indicates an error.


-- int handler::ha_sample_end();

Ends the sampling. Returns 0 on success, and any other return value indicates
an error.


-- virtual int handler::sample_init();

-- virtual int handler::sample_next(uchar *buf);

-- virtual int handler::sample_end();


Default implementation
----------------------
The three virtual functions will have a default implementation that will be used
if the storage engine has not provided an implementation. The following sections
describes the default implementation (basically a full table scan with
skip-read).

When calling "sample_init()", the following step happens:
  - Call and return the result from "ha_rnd_init(false)".

When calling "sample_next()", the following steps happend:
  - Fetch the next row from the table scan by calling "ha_rnd_next".
  - Return a random number from the rand-structure, and check if it less or
    equal than the given sampling percentage. Remember that the rand-structure
    will return values in the range [0.0 100.0] inclusive, and the probability
    of including a row is "sampling percentage" / 100.
  - If the above point is satisfied, return the last row fetched from table
    scan. If not, fetch the next row from table scan and do the above point
    again.
  - If an error is detected at any point, the error is returned from
    "sample_next()" immediately.

When calling "sample_end()", we simply call and return the result from
"ha_rnd_end()" to properly end/close the table scan.


All in all, the code will look something like the following:

  // *** CODE BEGIN *** \\

  // Extend existing enum with "SAMPLING".
  enum {NONE=0, INDEX, RND, SAMPLING} inited;

  // Two member variables, that can be used by engine specific implementations
  // if desired.
  double m_sampling_percentage;
  rand_struct m_sampling_rand;

  int handler::ha_sample_init(double sampling_percentage, double sampling_seed,
                              enum_sampling_method sampling_method)
  {
    DBUG_ENTER("handler::ha_sample_init");
    DBUG_ASSERT(sampling_percentage >= 0.0);
    DBUG_ASSERT(sampling_percentage <= 100.0);
    DBUG_ASSERT(inited == NONE);

    // Initialise the random structure.
    randominit(&m_sampling_rand, sampling_seed, sampling_seed);
    m_sampling_percentage= sampling_percentage;

    int result;
    inited= (result= sample_init()) ? NONE : SAMPLING;
    DBUG_RETURN(result);
  }

  int handler::ha_sample_end()
  {
    DBUG_ENTER("handler::ha_sample_end");
    DBUG_ASSERT(inited == SAMPLING);
    inited= NONE;
    DBUG_RETURN(sample_end());
  }

  int handler::ha_sample_next(uchar *buf)
  {
    DBUG_ENTER("handler::ha_sample_next");
    DBUG_ASSERT(inited == SAMPLING);
    m_update_generated_read_fields= table->has_gcol();
    int result;
    MYSQL_TABLE_IO_WAIT(PSI_TABLE_FETCH_ROW, MAX_KEY, result,
      { result= sample_next(buf); })
    DBUG_RETURN(result);
  }

  int handler::sample_init(double sampling_percentage, double sampling_seed,
                           enum_sampling_method sampling_method)
  {
    return ha_rnd_init(false);
  }

  int handler::sample_end()
  {
    // Temporary set inited to RND, since we are calling ha_rnd_end().
    inited= RND;
    int result= ha_rnd_end();
    DBUG_ASSERT(inited == NONE);
    return result;
  }

  int handler::sample_next(uchar *buf)
  {
    // Temporary set inited to RND, since we are calling ha_rnd_next().
    inited= RND;
    int res= ha_rnd_next(buf);

    // my_rnd will return a random number n in the range 0.0 <= n < 1.0.
    while (!res && my_rnd(&m_sampling_rand) > (m_sampling_percentage / 100.0))
      res= ha_rnd_next(buf);

    inited= SAMPLING;
    return res;
  }

  // *** CODE END *** \\


Other remarks
-------------
- The storage engine is free to return approximately correct number of rows.
  Returning 11 000 rows from an 1 000 000 row table given a sampling amount of
  10% is OK.

- The columns we want to read, will be marked by using the tables "read_set".