WL#7182: Optimizer Cost Model API

Status: Complete

Description
Requirements
High Level Architecture
Low Level Design

This worklog will introduce a cost model API and a cost model module to be used
by the optimizer. The main motivation for doing this is:

* remove use of hard coded cost constants from the optimizer and handler code.
The existing cost constants will after this worklog be moved inside the cost
module and only accessible through the new API.

* move common parts of cost calculations from the server and handler code into
the new cost module.

User Documentation
==================

No user documentation required.

This worklog should not introduce any functional changes. It is mostly about
moving hard coded cost constants and some of the current cost calculations from
the optimizer and handler code into a separate module with a new API.

Due to the use of floating point arithmetic in cost calculations minor rounding
errors might cause changes to query plans in the case where alternative query
plans have almost identical cost estimates.

Overview of the new API for cost estimates
==========================================

A high level presentation of the cost estimate API is given in WL#6564
together with a figure that shows how this API fits into the overall
task of refactoring cost estimation code and make the hard coded cost
constants configurable.

The new cost estimate API consists of two new classes, Cost_model_server
and Cost_model_table. These classes will be used by the optimizer,
handler and storage engines to get cost estimates for basic server and
storage engine operations. The initial implementation of these classes
will mostly provide functions for getting the "cost constants" for the
existing basic operations. These functions will replace the use of the
currently hard coded cost constants. Some of the current cost
calculations are also moved from the optimizer and handler code into
these new classes. This will allow us to support multiple valid versions
of the cost constants at the same time and allow for changing cost constants
on a running server without influence on currently optimized queries.

Later re-factoring and extensions of the cost model may move more parts of
the cost estimation code from the optimizer source code into these
classes. Thus, it should be expected that these classes will be
extended with more functionality.


1. Cost estimates for server operations
=======================================

The Cost_model_server class will be used for providing cost estimates
for operations that are not dependent on a particular table. For each
query there will be one object of this class. This object will be created as
part of the THD and initialized each time a new query starts. The THD object
will provide access to this object by a new function:

  const Cost_model_server* THD::cost_model() const;

The reason for letting each THD have its own object of this class is to let
different session run with different cost constants. This allows the server's
cost constants to be changed online without influencing on on-going optimizations.

The initial version of this class as implemented in this worklog will
provide the values for the existing cost constants and cost functions
for using temporary tables.


Interface:
==========

The initial implementation of the Cost_model_server class will have
the following public functions:

class Cost_model_server:

  /**
    Temporary table types that the cost model differentiate between.
  */
  enum enum_tmptable_type { MEMORY_TMPTABLE, DISK_TMPTABLE };

  /**
    Initialize the cost model object for a query.

    This function must be called before calling any cost estimation
    functions for a query. It should also be called when starting
    optimization of a new query in case any cost estimate constants
    have changed.
  */
  void init();

  /**
    Cost of processing a number of records and evaluating the query condition
    on the records.

    @param rows number of rows to evaluate

    @return Cost of evaluating the records
  */
  double row_evaluate_cost(double rows) const;

  /**
    Cost of doing a number of key compare operations.

    @param keys number of key compare operations

    @return Cost of comparing the keys
  */
  double key_compare_cost(double keys) const;

  /**
    Cost estimate for creating a temporary table.

    @param tmptable_type storage type for the temporary table

    @return Cost estimate
  */
  double tmptable_create_cost(enum_tmptable_type tmptable_type) const;

  /**
    Cost estimate for inserting and reading records from a
    temporary table.

    @param tmptable_type storage type for the temporary table
    @param write_rows    number of rows that will be written to table
    @param read_rows     number of rows that will be read from table

    @return The estimated cost
  */
  double tmptable_readwrite_cost(enum_tmptable_type tmptable_type, 
                                 double write_rows, double read_rows) const;


2. Cost estimates for operations on table data
==============================================

Some cost estimates will depend which table a particular operation is performed
on. This will mainly be cost estimates that depends the table definition (eg.
data type for columns or record size) or which storage engine or storage device
type the table is stored on. The Cost_model_table class will provide these
cost estimates. Each TABLE object will have a corresponding 
Cost_estimate_table object. This object will be available from the TABLE
object by a new function:

  Cost_model_table* TABLE::cost_model() const;

The Cost_model_table object must be initialized with a pointer to the query's
Cost_model_server object since it will do calls into the Cost_model_server
object for some of the "cost constants".

The initial version of this class will provide "cost constants" and cost
estimates for IO related functions. It will also provide wrapper functions for
the "cost constants" that the Cost_table_server class implements. The reason for
this is that in the handler, storage engines and in some of the lower parts of
the optimizer, the THD object is not directly available.


Interface:
==========

The initial implementation of the Cost_model_table class will have
the following public functions:

class Cost_model_table:

  /**
    Initializes the cost model object.

    This function must be called before calling any cost estimation
    functions for a query. It should also be called when starting
    optimization of a new query in case any cost estimate constants
    have changed.
  
    @param cost_model_server the main cost model object for this query
  */
  void init(const Cost_model_server *cost_model_server);

  /**
    Cost of processing a number of records and evaluating the query condition
    on the records.

    @param rows number of rows to evaluate

    @return Cost of evaluating the records
  */
  double row_evaluate_cost(double rows) const;

  /**
    Cost of doing a number of key compare operations.

    @param keys number of key compare operations

    @return Cost of comparing the keys
  */

  double key_compare_cost(double keys) const;

  /**
    Cost of reading a random block from disk.
  */
  double io_block_read_cost() const;

  /**
    Cost of reading a number of random blocks from a table.

    @param blocks number of blocks to read

    @return Cost estimate
  */
  double io_block_read_cost(double blocks) const;

  /**
    The fixed part of the cost for doing a sequential seek on disk.

    For a harddisk, this corresponds to half a rotation (see comment 
    for get_sweep_read_cost() in handler.cc).
  */
  double disk_seek_base_cost() const;

  /**
    Cost estimate for a sequential disk seek where a given number of blocks
    are skipped.

    @param seek_blocks number of blocks to seek past

    @return The cost estimate for the seek operation
  */
  double disk_seek_cost(double seek_blocks) const;


3. Cost estimates for Handler operations
========================================

The handler will continue to provide cost estimates for the main
operations that the storage engine executes. This is similar to how it
is done today. This worklog will not implement any changes in the
interface to the existing handler functions for cost estimates. Still,
see WL#7209 for proposed changes to this interface.

As an alternative, it would be possible to replace calling the handler
directly by adding corresponding functions to the Cost_model_table object.
These functions would then just forward the call to the handler
API. This would require us to store information about the handler
inside the Cost_model_table object and add some extra overhead.

Source code for new cost model API and cost module
==================================================

This worklog will implement the following two new classes:

  Cost_model_server
  Cost_model_table

The implementation of the new code will be in two new source code
files named:

  sql/opt_costmodel.h
  sql/opt_costmodel.cc


Creating and initialization of Cost_model_server object:
========================================================

The Cost_model_server object will be added as a private variable to
the THD class. The THD class is extended with a new member function that
will be used for initializing the Cost_model_server object.

  /**
    Initialize the optimizer cost model.
  
    This function should be called each time a new query is started.
  */
  void init_cost_model();

The Cost_model_server object will be initialized when starting a
new query. This initialization takes place in lex_start(). 

To get access to the Cost_model_server_object a new member function is
added to the THD object:
 
  const Cost_model_server* THD::cost_model() const;


Creating and initialization of Cost_model_table objects:
========================================================

The Cost_model_table object will be added as a private variable to the
TABLE class. To get access to this object the following function is added:

  Cost_model_table* cost_model();

The Cost_model_table object must be initialized with the query's
Cost_model_server object in order to get access to some of the "cost
constants". To initialize the Cost_model_table object, the following function is
added to the TABLE class:

  /**
    Initialize the optimizer cost model.
 
    This function should be called each time a new query is started.

    @param cost_model_server the main cost model object for this query
  */
  void init_cost_model(const Cost_model_server* cost_model_server);
 
This initialization must happen after the query is created
but before calling cost functions on the Cost_model_table
object. Since not all queries go through the main optimizer, this
initialization has to be done on several places:

* For most "normal" queries: make_join_statistics() in sql/sql_optimizer.cc

* Internally created temporary tables: create_tmp_table() in sql/sql_tmp_table.cc

* UPDATE statements: mysql_update() in sql/sql_update.cc

* DELETE statements: mysql_delete() in sql/sql_delete.cc

* "Queries" that list the help text: prepare_simple_select()


Using the new cost model API:
=============================

The optimizer and handler code will use the new API for the following cases:

* All (with one exception) use of hard coded cost constants are replaced
  by calls to the new API

* Some of the cost estimation for use of temporary tables are moved to the new
cost module

* Some cost estimation of disk accesses in the handler is moved into the new
cost module