WL#7209: Handler interface changes for new cost model

Affects: Server-5.7 — Status: Complete

Description
Requirements
High Level Architecture
Low Level Design

The handler provides cost estimates for table operations to the
optimizer's cost model. These cost estimates are calculated either in
the main handler class or by the storage engines. Most of the cost
estimates are calculated as a single double value represented as
"number of random IO operations". The main exception is the cost
estimate for multi-range read scans which uses the Cost_estimate
object that can contain both IO and CPU (and Memory and Communication)
cost.

The cost estimates will represent
"execution time" instead of being based on using the cost of a random
IO operation as cost unit. This worklog will change and extend the
handler interface for cost estimates so that the storage engines can
produce cost estimates using "execution time" as the new cost unit and return
the cost estimate in a Cost_estimate object.

User Documentation
==================

No functional changes.
No user documentation required.

This worklog should not introduce any functional changes. It is mostly
about extending the handler interface with new cost estimate functions
which in the initial version will map to the existing cost estimate
functions.

Due to the use of floating point arithmetic and existing casting
between integer and floating point values the in cost calculations,
minor rounding errors might cause changes to query plans in the case
where alternative query plans have almost identical cost estimates.

EXISTING INTERFACE:
===================

The existing handler interface has the following main functions for
providing cost estimates to the optimizer's cost model:

* Cost of a complete table scan:

  virtual double scan_time();

* Cost of reading a number records from a number of ranges using an index:

  virtual double read_time(uint index, uint ranges, ha_rows rows);

* Cost of reading a number of records from an index (without having
  to read data from the base table):

  virtual double index_only_read_time(uint keynr, double records);

* Cost of doing a multi-range read scan (either by the standard MRR
  implementation or by an engine specific MRR implementation, e.g,
  DS-MRR):

  virtual ha_rows multi_range_read_info_const(uint keyno, RANGE_SEQ_IF *seq,
                                              void *seq_init_param, 
                                              uint n_ranges, uint *bufsz,
                                              uint *flags, 
                                              Cost_estimate *cost);

  virtual ha_rows multi_range_read_info(uint keyno, uint n_ranges, uint keys,
                                        uint *bufsz, uint *flags, 
                                        Cost_estimate *cost);

Note that the first three functions returns the cost as a single
double value representing the cost as number of IO operations while the
two last functions returns the cost in a Cost_estimate object. This
makes it possible to distinguish between the cost of IO and the cost
of CPU. The current cost model for MRR returns both IO and CPU
estimates in the Cost_estimate object.

NEW INTERFACE:
==============

Replace the existing cost estimate functions with new corresponding functions.
Change the return value from being a double number to use a Cost_estimate
object. These new functions will when the new cost model is implemented return
the cost estimate as execution time instead of number of random IO operations.
Suggestions for function declarations:

  /**
    Cost estimate for doing a complete table scan.

    @returns the estimated cost
  */
  virtual Cost_estimate table_scan_cost();

  /**
    Cost estimate for reading a number of entries from an index.

    The cost estimate will only include the cost of reading data that
    is contained in the index. If the record needs to be read, use
    read_cost() instead.

    @param index     the index number
    @param ranges    the number of ranges to be read
    @param rows      total number of rows to be read

    @returns the estimated cost
  */
  
  virtual Cost_estimate index_scan_cost(uint index, double ranges, double rows);

  /**
    Cost estimate for reading a set of ranges from the table using an index
    to access it.

    @param index     the index number
    @param ranges    the number of ranges to be read
    @param rows      total number of rows to be read

    @returns the estimated cost
  */
  virtual Cost_estimate read_cost(uint index, double ranges, double rows);

For the multi_range_read_info() and multi_range_read_info_const(), the
suggestion is to not add new versions for these until it is actually needed.
These functions already return the cost estimate in form of a Cost_estimate
object and will in many cases already be implemented by utilizing the more basic
cost estimate functions.

Benefits:

-can implement the new cost estimate functions in handler and storage engines
while keeping the code for the old cost model mostly unchanged.

-replaces use of double as cost estimate with using the Cost_estimate object.
Using the Cost_estimate object gives the possibility for the handler and storage
engines to report cost estimates for IO, CPU usage, communication (and memory cost).

Changes that have been considered but not included:

1. Include a range count argument to index_scan_cost() to make it more similar
to the arguments that read_cost() take.

2. Combine index_scan_cost() and read_cost() into one function that takes as an
argument whether it will read from only the index or need to access fields not
contained in the index.

ALTERNATIVES:
=============

The following alternative interface changes have been considered:

Alternative 1:
==============

Keep the existing cost estimate functions but define that they instead of
returning the cost estimate as "number of random disk accesses", they will
return the cost estimate as estimated execution time.

Some disadvantages of this alternative are:

-since they return only a single double value it will not be possible for the
storage engine to distinguish between what contributes to the cost, eg. IO,
communication, CPU, memory (see the Cost_estimate object).

-it will make it more difficult to keep both the old and new implementation of
the cost model in the code base for 5.7.

-some third party storage engines do overload these cost functions, these will
be executed even if we do changes to the default implementation.


Alternative 2:
==============

Replace the existing cost estimate functions with new corresponding functions.
These new functions will return the cost estimates as execution time instead of
number of random IO operations. The main difference between this alternative and
the preferred alternative is that these functions return the cost as a double
instead of using the Cost_estimate object. Suggestions for function declarations:

  /**
    Cost estimate for doing a complete table scan.
  */
  virtual double table_scan_cost();

  /**
    Cost estimate for reading a number of entries from an index.
  */
  virtual double index_scan_cost(uint index, uint ranges, double rows);

  /**
    Cost estimate for reading a set of ranges from the table using and index
    of access it.
  */
  virtual double read_cost(uint index, uint ranges, double rows);

For multi_range_read_info() and multi_range_read_info_const(), the suggestion is
to not add new version for these until it is actually needed. These functions
already returns the cost estimate in form of a Cost_estimate object and will in
many cases already be implemented by utilizing the more basic cost estimate
functions.

The code has been pushed to the mysql-trunk-wl7209 tree.