WL#2474: Batched range read handler functions

Affects: Server-6.0   —   Status: Complete

This task is for a refinement of the Multi Read Range interface (WL #2126).
The new interface supports usage of pushed predicates and includes a special
handler function returning cost estimates for the batched range acccess.     
Contents
--------

1. Current MRR handler functions
2. Envisioned implementations of the MRR interface
3. Limitations of the current MRR interface
4. Batched key access method
5. Range sequence interface
6. New MRR handler functions


1. Current MRR handler functions
---------------------------------

At present Multi-Range Read (MRR) interface is represented by two virtual 
functions:
  virtual int read_multi_range_first(
    KEY_MULTI_RANGE **found_range_p,
    KEY_MULTI_RANGE *ranges,
    uint range_count,
    bool sorted,
    HANDLER_BUFFER *buffer);
  virtual int read_multi_range_next(
    KEY_MULTI_RANGE **found_range_p);

The following type definintions fully specify the KEY_MULTI_RANGE structure 
used by these functions:

enum ha_rkey_function {
  HA_READ_KEY_EXACT,      
  HA_READ_KEY_OR_NEXT,   
  HA_READ_KEY_OR_PREV,   
  HA_READ_AFTER_KEY, 
  HA_READ_BEFORE_KEY,
  HA_READ_PREFIX, 
  HA_READ_PREFIX_LAST, 
  HA_READ_PREFIX_LAST_OR_PREV, 
  HA_READ_MBR_CONTAIN,
  HA_READ_MBR_INTERSECT,
  HA_READ_MBR_WITHIN,
  HA_READ_MBR_DISJOINT,
  HA_READ_MBR_EQUAL
};

typedef struct st_key_range
{
  const byte *key;
  uint length;
  enum ha_rkey_function flag;
} key_range;

/* For key ranges */

#define NO_MIN_RANGE     1
#define NO_MAX_RANGE     2
#define NEAR_MIN	     4
#define NEAR_MAX	     8
#define UNIQUE_RANGE    16
#define EQ_RANG         32
#define NULL_RANGE      64
#define GEOM_FLAG      128

typedef struct st_key_multi_range
{
  key_range start_key;
  key_range end_key;
  char  *ptr;        
  uint  range_flag;   
} KEY_MULTI_RANGE;

A structure of the type KEY_MULTI_RANGE specifies an interval of an index. The 
first two fields represent two keys for the index as the endpoints of the 
interval. 
The last field characterizes the type of the interval:
open, closed, half-open etc :
1. (a,b) = { x | a < x < b }
2. [a,b] = { x | a <= x <= b } 
3. [a,b) = { x | a <= x < b } 
4. (a,b] = { x | a < x <= b } 
5. (a,inf) = { x | x > a }
6. [a,inf) = { x | x => a } 
7. (-inf,b) = { x | x < b } 
8. (-inf,b] = { x | x <= b } 
9. (-inf,inf)

Thus an interval of the type 3 is specified with range_flag = NEAR_MAX, of the 
type 9 with range_flag = (NO_MIN_RANGE | 
NO_MAX_RANGE).

A singleton is represented by a structure where the second field is ignored and 
the range_flag field is equal to EQ_RANGE.

The field 'ptr' of a KEY_MULTI_RANGE object currently is not used anywhere.

The handler functions read_multi_range_first and read_multi_range_next are 
iterator functions typical for MySQL handler interface. The first one is used 
to initiate an iteration process over the data records accessed in a multiple 
range scan and fetches the first record if there is any. The second function is 
used to iterate over this set records one by one. At each step of the iteration 
process the record being accessed are placed into the record buffer attached to 
the handler.

The specifics of MRR handler functions is that they support iteration over a 
records accessed as result of another iteration ¨C an iteration over a sequence 
of index ranges. This sequence is represented by an array of index ranges. 


2. Envisioned implementations of the MRR interface
--------------------------------------------------

MRR interface allows us to process requests for a key access to a table in a 
batch manner. 
It makes sense to use batch processing of requests for key access when 
performing massive operations that require accessing row data through index. 
Range index scan and equi-join that uses index to the join attributes of the 
inner table (see a paragraph on Batched Key Access below) are such operations.
  
At present we see two scenarios when such processing brings significant 
benefits.

Scenario A.
  - a portion of keys is accumulated in a buffer
  - the keys in the buffer are sorted by their rowid/pk
  - data records are accessed according to the sorted key sequence.
  	
MyISAM/InnoDb can capitalize on this scenario when performing both index range 
scans and equi-join operations.

Scenario B.
  - a portion of ranges, possibly single key ranges, is accumulated in a buffer
    on the central node to where the query is submitted
  - the ranges are sent to the execution nodes that access data records
  - the accessed records are packed into packages and sent back to the central
    node
  - the received packages with data records are placed to a buffer
  - data records are read from the buffer

NDB cluster currently works by this scenario when executing multiple range 
index scan. It also can benefit from this batch processing when performing an 
equi-join by an attribute.  
   

3. Limitations of the current MRR interface
-------------------------------------------

Let's consider the query
  SELECT * FROM t WHERE keyp1 >= 1000 AND keyp1 < 2000
                        AND keyp2=10000
with a compound index defined on (keyp1,keyp2).

Current optimizer will construct only one range for this query [{1000,10000}, 
{2000,MIN_INT}).
However, if we have at least 1 and on average 100 index entries for any value 
of keyp1 from [1000,2000) then scanning this range definitely is not a good 
choice. More efficient would be to scan the sequence of single point ranges 
[{1000,10000}],...[{1999,10000}]. 

To create an array of ranges for this sequence first we have to build 1000 
different keys that will take 8*1000 bytes in total. The array of range 
structures will take
additionally sizeof(MULTI_KEY_RANGE)*1000 bytes.

Yet, multi-range batch processing not necessarily requires materialization of 
the whole sequence of ranges.
At the scenario A we could build keys one by one in a key buffer and perform an 
index look-up for each new key.
At the scenario B we also don't need all ranges at once. We need them in 
portions small enough to be placed in the buffer sent to the execution node.

A requirement to represent a range exactly by a structure of the 
KEY_MULTI_RANGE type is another limitation of the current MRR interface. When 
using batched key access we have an array of single point ranges to be passed 
to the MRR interface. More exactly we have an array of pointers to the key 
values of a fixed size representing these ranges.
A representation of these ranges by an array of the KEY_MULTI_RANGE structures 
requires at least sizeof(key_range) extra bytes for each array element. Besides 
the info about the length and the type of the key (flag) is not needed to be 
stored in each element of the array as all keys are of the same size and form 
single point ranges.

For the batched key access any space wasted in the memory buffer is unwanted. 
This space can be used to store more partially joined records. That's why it's 
desirable to have as compact representation of the ranges as possible.

We foresee at least the following possible representation of single point 
ranges to be used by batched key access method.
  typedef st_mrr_single_point_ref
  {
    char *record_list;
    char * key;
  }
Here 'record_list' contains a head of list of pointers (or their offsets) to 
records in the batch buffer with the same key value while 'key' points to this 
value.
Another possible representation uses the structure: 
  typedef st_mrr_single_point_dir
  {
    char *record_list;
    char key[];
  };
It makes sense to utilize this representation in the cases when sizeof(key) < 
size(char *).


4. Batched key access method
----------------------------

We've already mentioned the batched key access method. It can be employed for 
join operations yielding a significant performance gains. The procedure by 
which this method works can be presented as follows.
1. Get the next result record of the partial join of tables T1,...,Tk that met 
the condition pushed for these tables and put it the in a memory buffer after 
already written records and, possibly, keys.

2. Extract a key from the record. Look for this key in a search structure 
placed at the end of the buffer. If such key is found then the record is linked 
to the list of records with this key. Otherwise an element pointing to the key 
is added to the search structure. At this moment the list of records associated 
with the key consists only of one lastly read record. If the image of the key 
cannot be found in the record we put it into buffer after the last record.

3. When the whole buffer has been filled the multi_range_read_init method is 
called with the sequence of ranges that can be extracted from the heap 
structure. 
The call initiates a procedure that accesses records of the next joined table T 
by keys from the heap structure.

4. When multi_range_read_init finishes the accessed records are ready to be 
joined to the records in the buffer. To accomplish the join operation the 
multi_range_read_next method is called repeatedly. At each its call it receives 
the next accessed record r from table T and a pointer to the list of records 
containing the keys by which record r has been accessed. The list consists of 
all matches for record r. The newly joined records either are put to another 
buffer for the next join operation if there is any, or are sent to the output 
stream.

In scenario A the record r is read from table T just before joining with 
matching records from the buffer.  In scenario B the record is read from 
internal MRR buffer. 

The following picture presents the layout of the data in the buffer used by 
Batched Key Access method.

| linked records [and keys] --->

      [key K]  
      ^        ...
      |         |
      |         +---->[ * ][record i with key K]
      | to first record |
      |   with key K ^  +---->[ * ][ record i+1 with key K]
      |              |          |           
      |              |         ...
      |    +------>[ * ][last record with key K]         
      |    |       
    [ * ][ * ]                  <--- search key structure |




5. Range sequence interface
----------------------------

All said above brings us to the idea of exploiting iterators when representing 
sequences of ranges rather than just static arrays of range structures. To 
ensure the utmost flexibility we'll represent a class of range sequences 
through its interface consisting of two member functions 'init' and 'next'.

If the MRR interface accepted objects of C++ classes we would define an 
abstract class RangeSeq with two virtual member functions named accordingly. 
The objects of the classes derived from this abstract class would serve as 
concrete range sequences and would be passed as parameters to the 
multi_range_read_init function. Having to work in the C environment of the 
handler implementations we cannot do this. Instead of this we define the 
following structure consisting of two callback functions: 
typedef struct st_range_seq_if
{
  range_seq (*init)(void *init_params, uint n, uint flags);
  uint (*next) (range_seq s, KEY_MULTI_RANGE *range);
} RANGE_SEQ_IF;
where range_seq is defined as follows:
typedef void *range_seq;

The 'init' member functions of the RANGE_SEQ_IF interface structure shall 
interpret the first parameter as a pointer to the structure that let us to 
perform all necessary setup for iteration through the ranges to be scanned. The 
structure can contain a pointer to an array of materialized representations of 
ranges to be run through or, alternatively, it can point to a structure by 
which this ranges can be generated. The specific info placed in this structure 
is to be used only by another member function ¨C 'next'. The 'init' function is 
to return a handler of the type range_seq to be passed to the successive calls 
of the 'next' member function. The latter shall fill in the KEY_MULTI_RANGE 
structure passed to it as the second parameter. The structure shall specify the 
range yielded at the next step of iteration trough the range sequence set 
by 'init' if the sequence has not ended yet. After this it will prepare to be 
able to fetch info for the following range in the sequence.

The second and third parameters the init function could be accessed through the 
first parameter. By some reasons we prefer separate those parameters 
from 'init_params' (see below notes on the default implementation of the range 
sequence interface). 

Although the 'next' function receives a general range structure it has not to 
fill all fields of the structure.
The parameter 'flags' dictates what fields are to be filled.

The flag HA_MRR_SINGLE_POINT says that all fields related to the second key are 
to be skipped as well as the 'range_flag' field. 
The flag HA_MRR_FIXED_KEY says that the length of the key is to be skipped.
The flag HA_MRR_NO_ASSOCIATION says that the field 'ptr' is not to be used by 
the MRR interface function and may be ignored by the 'next' function. Otherwise 
the 'ptr' value of the yielded range is to be remembered by MRR interface in 
order to be returned by any call of the multi_range_read_next function that 
reads records accessed by a key from the this range. This allows us to get 
matches for the records read in the batched key access procedure.

The 'next' member function returns 0, if it has succeeded to get the next 
range. Otherwise it returns 1.


6. New MRR handler functions
-------------------------------------
The new multi_read_range interface contains three functions:
  multi_range_read_init,
  multi_range_read_next,
  malti_range_read_info.

The synopsis for the virtual functions multi_range_read_init and 
multi_range_read_next is as follows:
int multi_range_read_init(
  RANGE_SEQ_IF *seq;    /* range sequence interface */
  void *seq_init_param; /* first param for seq.init */
  uint n,               /* number of ranges in seq. */
  uint mode,           /* flags                    */
  HANDLER_BUFFER *buf   /* memory buffer to be used */
)
int multi_range_read_next (
  char **range_info    /* out: info associated range */
)

When working by scenario A a typical implementation of the 
multi_range_read_init function first calls seq.init with 
parameters 'seq_init_param', 'n' and 'mode'. This call sets everything for 
getting ranges to be scanned. Then calling seq.next the method iterates through 
these ranges and when doing so it gets rowids for keys in these ranges and 
places them in the buffer 'buf'. When buffer is full the method prepares it for 
calls of the multi_range_read_next by sorting the rowids in the buffer to 
optimize the following access to the records referenced by these rowids. After 
this iterative calls of the multi_range_read_next method will read records with 
rowids from the buffer and will put them into the buffer table->record[0]. If 
this record is to be joined with matching records from the first join operand
the reference returned in the range_info is to be used.
Originally this reference is placed to the range structure by which key the 
record was accessed. Then this reference is attached to each rowid for this key 
and is placed to the buffer together with the rowid. When a call of 
multi_range_read_next discovers that there are no rowids in the buffer first it 
reads rowids for keys in the remaining ranges and fills the buffer, after this 
sorts the buffer, takes the top rowid and accesses the corresponding record.

The 'mode' parameter can contain not only flags interpreted by seq.init. 
Additionally it can contain flag HA_MRR_SORTED. This flag is used when the read 
records must be returned in order compatible with the index used for the 
ranges. It does not make sense to set this flag if the sequence of ranges 
generated by the range iterator does not comply with the index order.
 
At the first glance retrieving table records in rowid order cannot comply the 
requirement to return records in index order unless the index is defined as a 
clustered primary key. Actually the read records can be put first in a separate 
piece of memory (a part of that specified by the 'buf' parameter). After this 
they can be returned by calls of multi_range_read_next in the index order.
 
When working by scenario B a typical implementation of the 
multi_range_read_init function first also has to call the seq.init that 
performs initial setting for iteration through ranges. In a general case the 
iteration itself is triggered by a requirement to form a package of ranges to 
be sent to remote nodes. After the package has been sent a return package of 
read records is received. The records are
placed into the memory of the recipient and are read by       
multi_range_read_next calls. When the last of them has been read a new portion 
is requested. This is repeated until all ranges from the sent package are 
scanned. After this the next package of ranges is sent to remote nodes.

In this scenario the multi_range_read_next call can return not all records that 
are accessed by keys from the range sequence. If a condition is pushed to the 
accessed table then only the records that meet this condition are returned
to the query node.
A condition can be pushed to the table by a call of the cond_push handler 
function.

The 'mode' also can contains information on the type of the join to be 
performed. For semi-join and anti-join operations we don't need any record to 
be joined and the buffer table->record[0] has not to be filled.
The flag HA_MRR_SEMI_JOIN/HA_MRR_ANTI_JOIN says that the join operation is a 
semi-join/anti-join.
There is a special flag to specify outer join operations as well ¨C 
HA_MRR_OUTER_JOIN. Here a left outer join operation always is assumed.

The flags HA_MRR_SEMI_JOIN, HA_MRR_ANTI_JOIN, HA_MRR_OUTER_JOIN are used only 
for operations with equality join predicates. Thus if one of this flag is set 
the ranges in the range sequence must be singletons.  

When the flag HA_MRR_SEMI_JOIN is set a call of the multi_range_read_next 
function returns a reference to the list of records associated with the next 
scanned key for which there is a match in the inner table.
This is opposite to the situation when the flag HA_MRR_ANTI_JOIN is set. Here a 
call of the multi_range_read_next function returns a reference to the list of 
records associated with the next scanned key for which there is no match in the 
inner table.
In both cases the matched records are not even accessed.

With the flag HA_MRR_OUTER_JOIN a call of the multi_range_read_next function 
returns matched inner records, possibly projected to some fields, when there 
are matches. They are returned in addition to the list of outer records 
associated with the matching key. If the there is no matches for a key only a 
list of records associated with
the key is returned. The return code of the multi_range_read_next function says 
whether there was a match for a key or not.

There are some variants of join operation when only the first/last record with 
a given key is joined. The corresponding flags for these variant are to be 
added later.

When a matched record is found only fields that are referred in the query can 
be put into the buffer
table->record[0]. If all these fields can be read from the index the data 
record has not to be accessed at all.
The flag HA_MRR_INDEX_ONLY indicate this case.

In scenario A it's not necessary to allocate a buffer for rowids in this case.

If the 'seq' parameter is equal to NULL then it's assumed that seq_init_param 
is an array of universal range representations of type KEY_MULTI_RANGE and n is 
number of ranges in this array. This explains why in some cases we can't do 
without a separate parameter that specifies the number of ranges in the 
sequence.

The synopsis the multi_range_read_info handler function is:

int  multi_range_read_info(
  uint keyno,          /* index number                  */
  uint n,              /* est. number of ranges in seq. */
  uint keys,           /* est. number of keys in ranges */
  uint *bufsz,         /* size of the buffer to be used */          
  uint *flags,         /* in/out: flags                 */
  COST_VECT *cost      /* est. cost of the MRR access   */
)

where COST_VECT is defined as follows:
typedef st_cost_vect 
{ 
  double io_count;     /* number of I/O *               */
  double avg_io_cost;  /* cost of an average I/O oper.  */
  double cpu_cost;     /* cost of operations in CPU     */
  double mem_cost;     /* cost of used memory           */ 
  double import_cost;  /* cost of remote operations     */
} COST_VECT

The multi_range_read_info function is to be called by the optimizer. The 
function passes the following as input parameters:
  - the number of the index to be scanned   
  - an estimate of number of ranges in the range sequence 
  - an estimate for the number of keys in these ranges
  - a suggested size of the external buffer to be used in the MRR access method
  - suggested modes for the MRR method.

The function can return the size of external buffer that it considers a more 
appropriate. In particular it can return 0 in the 'bufsz' parameter if the MRR 
handler functions do not use any buffer at all of use an 'internal' buffer 
allocated by the method itself. If the flag HA_MRR_LIMITS is passed in 
the 'mode' parameter then the returned size of the buffer cannot be greater 
than the suggested one.
In general the 'mode' parameter should contain all flags to be passed to the 
multi_range_read_init method.

The 'cost' parameter is used to get the cost of the whole multi-range read 
operation. The cost C is calculated as:

  C = f1 * io_count * avg_io_cost + f2 * cpu_cost +
      f3 * mem_cost + f4 * import_cost
where f1, f2, f3, f4 are system variables that allow to tune the cost of usage 
of different resources ¨C CPU, main memory and nework/other servers ¨C with 
respect to the cost of IO operations.

All these components of the cost are supposed to be used by some future 
variants of the EXPLAIN command.

The rules to calculate these costs depend basically on a given implementation 
of the MRR handler functions. 

The function multi_range_read_info is supposed to get cost estimates in the 
cases when the range sequence is unknown yet or too long. 

If all members of the range sequence are known before the query is executed 
then another MRR function - multi_range_read_info_const - can be used to get 
cost estimates. This function is assumed to return better estimates as 
implementations of this function are allowed to perform index lookups for range 
endpoints.

The synopsis the multi_range_read_info_const handler function is:

int  multi_range_read_info_const(
  uint keyno,          /* index number                  */
  RANGE_SEQ_IF *seq;   /* range sequence interface      */
  void *seq_init_param;/* first param for seq.init      */
  uint n,              /* number of ranges in seq       */
  uint *bufsz,         /* size of the buffer to be used */          
  uint *flags,         /* in/out: flags                 */
  COST_VECT *cost      /* est. cost of the MRR access   */
)

The parameters seq, seq_init_param and n are used to walk through the sequence 
of ranges whose scan is to be estimated. These parameters have the same meaning 
as in the function multi_range_read_init.
The remaining parameters are quite similar to those of the same name used by 
the function multi_range_read_info.
CONTENTS
1. Interface description
1.1 Coding style-compliant MRR Interface definitions
1.2 Coding style-compliant Range Sequence interface definitions
1.3 Definitions of used HA_MRR_* flags
2. Range sequence interface implementations
2.1 The "array walker" implementation
2.2 SEL_ARG tree walker implementation
3. MRR interface implementations 
3.1 The default implementation
3.2 Changes in ha_ndbcluster.cc
4. SQL Layer changes
4.1 Update range analysis code to use the MRR interface
4.2 Update the QUICK_RANGE_SELECT code to use the new MRR interface
5. What will not be implemented


1. Interface description
========================
The MRR interface consists of two parts:

* The MRR interface itself
   MRR interface is implemented by
     - Generic implementation (converts MRR calls to non-MRR calls)
     - ha_ndblcuster-specific implementation
     - [in development] MyISAM/InnoDB/BDB implementation
   MRR interface is used by
     - Range access method code
     - [in development] Batched key access
     - [possible] Loose Index scan

* The Range sequence interface. This interface is part of MRR interface.
  Range sequence interface is implemented by
     - range access method optimizer
     - range access method row retrieval code
  Range sequence is used by 
     - Any table handler that implements MRR interface
   

1.1 Coding style-compliant MRR Interface definitions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

typedef struct st_key_range
{
  const byte *key;
  uint length;
} RANGE_ENDPOINT;


typedef struct st_key_multi_range
{
  RANGE_ENDPOINT start_key;
  RANGE_ENDPOINT end_key;
  enum ha_rkey_function function;
  char  *ptr;
  uint  range_flag;
} KEY_RANGE;

/*
  Get cost and other information about MRR scan over currently unknown ranges

  SYNOPSIS 
    multi_range_read_info()
      keyno           Index number
      n_ranges        Estimated number of ranges (i.e. intervals) in the 
                      range sequence.
      n_rows          Estimated total number of records contained within all 
                      of the ranges
      bufsz    INOUT  IN:  Size of the buffer available for use
                      OUT: Size of the buffer that will be actually used, or 
                           0 if buffer is not needed.
      flags    INOUT  A combination of HA_MRR_* flags
      cost     OUT    Estimated cost of MRR access

  DESCRIPTION
    This function obtains the cost and other info about about the MRR scan.
    The caller has some information about what kinds of ranges will be scanned
    (e.g. "1000 {keypart1=const} ranges"}, but the constant themselves are
    assumed not to be known when this function is called.

    The flags parameter is a combination of those flags: HA_MRR_SORTED, 
    HA_MRR_INDEX_ONLY, HA_MRR_NO_ASSOCIATION, HA_MRR_LIMITS.

  RETURN
    0     - OK, *cost contains cost of the scan, *bufsz and *flags contain scan
            parameters.
    other - Error or can't perform the requested scan
*/

int multi_range_read_info(uint keyno, uint n_ranges, uint n_rows, uint *bufsz,
                          uint *flags, COST_VECT *cost);


/*
  Get cost and other information about MRR scan over a known list of ranges

  SYNOPSIS
    multi_range_read_info_const()
      keyno           Index number
      seq             Range sequence to be traversed
      seq_init_param  First parameter for seq->init()
      n_ranges        Number of ranges in the sequence
      bufsz    INOUT  IN:  Size of the buffer available for use
                      OUT: Size of the buffer that is expected to be actually
                           used, or 0 if buffer is not needed.
      flags    INOUT  A combination of HA_MRR_* flags
      cost     OUT    Estimated cost of MRR access

  DESCRIPTION
    Obtain the cost and other information about an MRR scan when the list of
    ranges to be scanned is known.

  NOTE
    Reasons for having this function:
    * Let high-roundtrip-cost handlers, like NDB/Federated produce the 
      estimates in one roundtrip. (One might ask, why don't we use 
      condition-pushdown-based system for the purpose of estimating number of
      records that match the table condition?... good question... in any case
      there is another reason)
    * We need a specialized cost-of-MRR-over-constant-ranges function.
      = It must be different from non-MRR-cost function (since MRR is supposed
        to be better and have a smaller cost)
      = It must be different from multi_range_read_info() because here we know
        the ranges.
    Therefore, we have this function.

  RETURN
    0 - OK, *cost contains cost of the scan, *bufsz and *flags contain scan
        parameters.
    1 - Error or can't perform the requested scan
*/

int multi_range_read_info_const(uint keyno, RANGE_SEQ_IF *seq,
                                void *seq_init_param, uint n_ranges,
                                uint *bufsz, uint *flags, COST_VECT *cost);


/*
  Initialize the MRR scan

  SYNOPSIS
    multi_range_read_init()
      seq             Range sequence to be traversed
      seq_init_param  First parameter for seq->init()
      n_ranges        Number of ranges in the sequence
      mode            Flags, see the description for details
      buf             INOUT, memory buffer to be used.

  DESCRIPTION
    Initialize the MRR scan. This function may do heavyweight scan 
    initialization like row prefetching/sorting/etc (NOTE: but better not do
    it here as we may not need it, e.g. if we never satisfy WHERE clause on
    previous tables. For many implementations it would be natural to do such
    initializations in _next() call)

    mode is a combination of the following flags: HA_MRR_SORTED,
    HA_MRR_INDEX_ONLY, HA_MRR_NO_ASSOCIATION 

  NOTES
    One must have called index_init() before calling this function. Several
    multi_range_read_init() calls may be made in course of one query.

    Until WL#2623 is done (see section 3.2), the following will also hold:
    The caller will guarantee that if "seq->init == mrr_ranges_array_init"
    then seq_init_param will be an array of KEY_MULTI_RANGE structures (with
    n_ranges elements). This property will only be used by NDB handler until
    WL#2623 is done (see section 3.2 again).
     
    Buffer memory management is done according to the following scenario:
    The caller allocates the buffer and provides it to the callee by filling
    in the members of HANDLER_BUFFER structure (see current code for
    definition of HANDLER_BUFFER).
    The callee consumes all or some fraction of the provided buffer space, and
    sets the HANDLER_BUFFER members accordingly.
    The callee may use the buffer memory until the next multi_range_read_init()
    call is made, all records have been read, or until index_end() call is
    made, whichever comes first.

  RETURN
    0 - OK
    1 - Error
*/

int multi_range_read_init(RANGE_SEQ_IF *seq, void *seq_init_param,
                          uint n_ranges, uint mode, HANDLER_BUFFER *buf);


/*
  Get next record in MRR scan

  SYNOPSIS
    multi_range_read_next()
       range_info  OUT  If HA_MRR_NO_ASSOCIATION flag is in effect or an error
                        occured, nothing (but the caller is still allowed to 
                        store an arbitrary value there)
                        Otherwise, the opaque value associated with the range
                        that contains the returned record.

  NOTE
    It should be possible to process range lists of arbitrary length without
    any kind of overflows.
    
  RETURN
    0      OK
    other  Error code
*/

int multi_range_read_next(char **range_info);


/* 
  Operation cost vector class 
*/

class COST_VECT
{ 
public:
  double io_count;     /* number of I/O                 */
  double avg_io_cost;  /* cost of an average I/O oper.  */
  double cpu_cost;     /* cost of operations in CPU     */
  double mem_cost;     /* cost of used memory           */ 
  double import_cost;  /* cost of remote operations     */

  double total_cost() 
  {
    /* f{i} will be #define'd constants */
    return f1 * io_count * avg_io_cost + f2 * cpu_cost +
           f3 * mem_cost + f4 * import_cost;
  }
};


1.2 Coding style-compliant Range Sequence interface definitions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
typedef void *range_seq;

typedef struct st_range_seq_if
{
  /*
    Initialize the traversal of range sequence
    
    SYNOPSIS
      init()
        init_params  The seq_init_param parameter 
        n_ranges     The number of ranges obtained 
        flags        HA_MRR_SINGLE_POINT, HA_MRR_FIXED_KEY

    RETURN
      An opaque value to be used as RANGE_SEQ_IF::next() parameter
  */
  range_seq (*init)(void *init_params, uint n_ranges, uint flags);


  /*
    Get the next range in the range sequence

    SYNOPSIS
      next()
        seq    The value returned by RANGE_SEQ_IF::init()
        range  OUT
    
    RETURN
      0 - Ok, the range structure filled with info about the next range
      1 - No more ranges
  */
  uint (*next) (range_seq seq, KEY_RANGE *range);
} RANGE_SEQ_IF;


1.3 Definitions of used HA_MRR_* flags
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/*
  The below two are not used (and not handled) in this milestone of this WL
  entry because there seems to be no use for them at this stage of
  implementation. (TODO is that so??)
*/
#define HA_MRR_SINGLE_POINT n
#define HA_MRR_FIXED_KEY  n

/* 
  Indicates that RANGE_SEQ_IF::next(&range) doesn't need to fill in the
  'range' parameter.
*/
#define HA_MRR_NO_ASSOCIATION n

/* 
  The MRR user will provide ranges in key order, and MRR implementation
  must return rows in key order.
*/
#define HA_MRR_SORTED n

/* Don't retrieve full records */
#define HA_MRR_INDEX_ONLY n

/* 
  The passed memory buffer is of maximum possible size, the caller can't
  assume larger buffer.
*/
#define HA_MRR_LIMITS n


2. Range sequence interface implementations
===========================================

2.1 The "array walker" implementation
-------------------------------------

This implementation will be used by QUICK_RANGE_SELECT code. When a
QUICK_RANGE_SELECT is constructed, the list of ranges to be scanned is
materialized in an array. This implementation will walk over this array.

There will be two RANGE_SEQ_IF-member-compliant functions:

range_seq is a pointer to 
struct
{
  QUICK_RANGE *cur;
  QUICK_RANGE *end;
};

/* Set the cur and end pointers */
mrr_ranges_array_init();

/* Advance over one element in the array */
mrr_ranges_array_next()
{
  if (cur == end)
    return 1; /* No more ranges */
  else
  {
    //fill range from *cur 
    cur++;
    return 0;
  }
}


2.2 SEL_ARG tree walker implementation
--------------------------------------
This range sequence interface implementation will traverse the SEL_ARG* tree
representation of range list.

struct state 
{
  SEL_ARG **stack;
  byte *min_key;
  byte *max_key;
};

init()
{
  /* Initialize the traversal */
}

next()
{
  /* Traverse the SEL_ARG tree until we've built the next interval */
}


3. MRR interface implementations 
================================

3.1 The default implementation
------------------------------
This is the only implementation that will be created in scope of this WL
entry. The planned implementations are WL#2475 (MyISAM/InnoDB/etc) and 
WL#2623 (NDB).

The purpose of this implementation is to allow the query executioner to always
use the MRR interface, no matter if the underlying engine supports MRR or not.
In a nutshell, this is "MRR calls to non-MRR calls adapter"

Below is a sketch of the functions to be implemented:

int
handler::multi_range_read_info(uint key_no, uint n_ranges, uint keys,
                               uint *bufsz, uint *flags, COST_VECT *cost)
{
  *bufsz= 0;  // No need for buffer

  /* Produce the same cost as current code does */
  if (*flags & HA_MRR_INDEX_ONLY)
  {
    /* get_index_only_read_time is implemented in opt_range.cc already */
    cost->io_cost= get_index_only_time(key_no, keys);
  }
  else
    cost->io_count= this->read_time(n_ranges, keys);

  cost->avg_io_cost= 1; // Assume random seeks
  cost->cpu_cost=    cost->import_cost= cost->mem_cost= 0;
  return 0;
}

int 
handler::multi_range_read_info_const(uint keyno, RANGE_SEQ_IF *seq,
                                     void *seq_init_param, uint n_ranges,
                                     uint *bufsz, uint *flags, COST_VECT *cost)
{
  KEY_RANGE range;
  range_seq seq_it;
  ha_rows rows, total_rows;
  *bufsz= 0;
  total_rows= 0;

  seq->init(seq_init_param, n_ranges, flags);
  while (!seq->next(seq_it, &range))
  {
    if (HA_POS_ERROR == (rows= this->records_in_range(keyno, &range.start_key,
                                                      &range.end_key)))
    {
      /* Can't scan one range => can't do MRR scan at all */
      return 1;
    }
    total_rows += rows;
  }

  cost->io_count=    this->read_time(n_ranges, total_rows);
  cost->avg_io_cost= 1; // Assume random seeks
  cost->cpu_cost=    cost->import_cost= cost->mem_cost= 0;
  return 0;
}


For handler::multi_range_read_init() and handler::multi_range_read_next(), 
adopt the code from handler::read_multi_range_first() and 
handler::read_multi_range_next(). It should be easy as that code uses simple
loops.


3.2 Changes in ha_ndbcluster.cc
-------------------------------
The proper changes will be done in WL#2623. Within the scope of this WL
entry the following will be done:
The ha_ndbcluster::read_multi_range_first and read_multi_range_next functions
will be changed to use the new interface. The new functions will not be fully
conformant implementations: They'll only handle the case where 
  "seq->init == mrr_ranges_array_init"

This limitation is expected to be lifted by implementation of WL#2623.

4. SQL Layer changes
====================

4.1 Update range analysis code to use the MRR interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is partially already done.

See section #(SEL_ARG tree) for one thing.

4.2 Update the QUICK_RANGE_SELECT code to use the new MRR interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Update the QUICK_*_SELECT to use the new MRR interface. 
Also remove the QUICK_RANGE class and replace all its uses with uses of 
KEY_RANGE as these two structures contain identical information.

//btw we will not ask the table handler to return the container interval.

5. What will not be implemented
===============================

Support for HA_MRR_SEMI_JOIN, ANTI_JOIN or OUTER_JOIN will not be implemented
in this WL entry as there is no way to test that functionality.