WL#7168: API for estimates for how much of table and index data that is in memory buffer

Affects: Server-Prototype Only   —   Status: Complete

For storage engines that cache parts of the database in a main memory
buffer, the time to read data from a table or index will be very
dependent on whether the data is already in the memory buffer or has
to be read from disk.

In order to take this into account, the code for computing cost
estimates for data access needs to know whether data is in memory or will
be read from disk. This worklog will extend the handler with a new API
for providing estimates for where table and index data is stored. It
will also contain a default implementation that will be used if the
storage engine does not provide this information.

Separate worklogs should be created for the individual storage engines
to provide the estimates and for making them available to the handler.
The estimates will be used by the cost estimation functions that will
be implemented as part of improving the optimizer cost model.
This worklog will not introduce any functional or non-functional changes. It
just provides a handler API that storage engines will provide data for. When
this new API is taken into use, there will be changes to how the optimizer
selects between query plans.
The worklog extends the handler with:

 a) Data structures where storage engines can provide estimates for how much of
data from tables and indexes that are stored in a main memory buffer.
 b) An API that the cost model can use to access this information.
 c) A default implementation/heuristic that can be used if the storage engine
has not provided this information.


Data structures for storing estimates:
======================================

The following data structures will be extended in order for storage engines
to store information about how much of tables or indexes that are in a main
memory buffer;

For estimate of how much of the table that is in a memory buffer:

-extend the ha_statistics class with a new variable:

  /**
    Estimate for how much of the table that is availabe in a memory buffer.
    Valid range is [0..1]. If negative means that the storage engine
    has not supplied any value for it.
  */
  double table_in_mem_estimate;

The server will initialize this to -1.0 when creating new handler objects. This
will be used for detecting whether the storage engine has provided a value or not.

For estimate of how much of each index that is in a memory buffer:

-extend the st_key struct with a new variable:

  /**
    Estimate for how much of the index data that is currently available in
    a memory buffer. Valid range is [0..1]. This will be initialize to a 
    negative number. If it still has a negative values when used means that the
    storage engine has not supplied a value.
  */
  double in_memory_estimate;

The server will initialize this to -1.0 when creating new handler objects. This
will be used for detecting whether the storage engine has provided a value or not.

When the server calls handler::info() with the HA_STATUS_VARIABLE flag set,
it will be the storage engine's responsibility to update these variables with
correct estimates. handler::info() is called on the start of every query, and
it is important for the optimizer that these estimates do not change during the
optimization of a query.

Note about partition pruning: since handler::info(HA_STATUS_VARIABLE) will be
used for retrieving these estimates, this will allow estimates that correctly
takes into account partition pruning.


Handler API for use to get access to the estimates:
===================================================

Two new functions are added to the handler class. These should be used by the
cost calculation functions to access the estimates that the storage engines have
provided:

  /**
    Return an estimate of how much of the table that is currently stored
    in main memory.

    This estimate should be the fraction of the table that currently
    is available in a main memory buffer. The estimate should be in the
    range from 0.0 (nothing in memory) to 1.0 (entire table in memory).

    @return The fraction of the table in main memory buffer
  */
  double table_in_memory_estimate() const;

  /**
    Return an estimate of how much of the index that is currently stored
    in main memory.

    This estimate should be the fraction of the index that currently
    is available in a main memory buffer. The estimate should be in the
    range from 0.0 (nothing in memory) to 1.0 (entire index in memory).

    @param keyno the index to get an estimate for

    @return The fraction of the index in main memory buffer
  */
  double index_in_memory_estimate(uint keyno) const;


Default implementation:
=======================

If the storage engine does not implement support for estimating the
amount of the data and indexes that is in the memory buffer, the
handler will use a default implementation for this. This implementation
will be based on the size of the table/index and the size of the memory buffer
that the storage engine uses for caching data in (if available).

Proposed high level heuristics:

 a) if the handler knows the size of the main memory buffer, then if the size
table or index is less than 10 percent of this buffer, then assume the
table/index is cached in memory.
 b) if the handler does not know the size of the main memory buffer, then if the
table or index is less than 10 MB, then assume it is cached in a memory buffer.


An initial working prototype for this API has been implemented. In addition to
implement the interface as specified in the High-level specification section,
this also implements the needed support in the Memory engine.