WL#2475: Batched range read functions for MyISAM/InnoDB

Affects: Server-6.0   —   Status: Complete   —   Priority: Low

This task is to implement batched range read functions for MyISAM and InnoDb 
engines.
The implementation shall optimize the batched index access of table rows for a 
sequence of disjoint ranges.
Eventually the functions has to be implemented according to specifications 
presented in WL#2474. But the first implementation should rather follow the 
interface of WL #2126 (this is a requirement of Monty).
  
<contents>
1. Introduction
2. Applicability criteria
3.1 The case with MRR/MyISAM for index_only scans
3.2 The case with MRR/MyISAM when there are BLOBs and sorted output is needed
4. Implementation of MRR interface functions
5. Cost model
5.1 Number and parameters of "sort and sweep" steps
5.2 Cost of one "sort and sweep" step
6. Memory costs per record
7. Note about index condition split up
8. How to fit it into current table handler functions
</contents>

1. Introduction
---------------

This worklog entry describes an implementation of MRR interface for MyISAM
and InnoDB table engines (It seems that everything said here can be applied
to BDB as well but for now we don't include BDB).

The advantage of MRR/MyISAM implementation over default (MRR-to-non-MRR calls
converter) MRR implementation will be better performance. The speedup will be
attained by accessing table records sequentially, in disk "sweeps" instead of
random disk accesses done by the default MRR implementation. In order to
arrange for sequential table record access, MRR/MyISAM will have to use more
memory and CPU time but we expect this added cost to be smaller then cost
spared by using faster disk IO.

The table access will be performed using the following algorithm:

while (have more ranges to scan)
{
  scan index range(s) and accumulate N rowids in a buffer;
  
  sort the rowids buffer;
  /* Access table records in a sweep: */
  for each rowid in the rowids buffer
  {
    retrieve full table record;
    pass it to output;
  }
}

Note that it doesn't make sense to use MRR/MyISAM if we don't need to access 
full table records.

Also note that the records are not passed to output in key order. It is
possible to design variations of this algorithm that do produce records in
key order (see a file attached to WL#2475), but this will not be implemented
in the scope of this Worklog entry.

Since a table handler must always provide the MRR interface, MRR/MyISAM and
MRR/InnoDB will have an option to bypass the "native" implementation and use
generic MRR implementation instead.

The decision between using the MRR/MyISAM implementation and bypassing it
will be based on whether MRR/MyISAM implementation is applicable and whether
it is cheaper to use than the default MRR implementation.


2. Applicability criteria
-------------------------
The described MRR/MyISAM algorithm will not be used in the following cases:

3.1 The case with MRR/MyISAM for index_only scans
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MRR/MyISAM doesn't provide any benefit when we only need to retrieve fields
covered by the index. Therefore, when index_only scan is requested
(with HA_MRR_INDEX_ONLY flag), we will use the default MRR implementation.

3.2 The case with MRR/MyISAM when there are BLOBs and sorted output is needed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Consider the case where
 * The output needs to be sorted in key order (HA_MRR_SORTED flag is set),
 * The set of needed fields includes a BLOB field.

Here there are two possible execution scenarios:

 A. Perform the scan (using default MRR implementation) that returns records
    in key order.
 
 B. Perform the scan (using MRR/MyISAM) that returns records in no particular
    order and then sort the records (using filesort).

Here is a pseudocode representation of actions that will be done in scenario
B:
{
  table->multi_read_range_init();
  while (table->multi_range_read_next()) (*)
  {
    if (WHERE condition is satisfied) (X)
      put (rowid, sort_value) pair into buffer;
  }
  
  // the following is done by filesort().
  sort the buffer of {rowid, sort_value} by sort_value;
  for each pair in the buffer
  {
    retrieve full table record; (**)
    pass it to output;
  }
}

As we can see in scenario B each satisfying full table record will be read two
times:
  1. first time, at location (*), using disk "sweeps".
  2. second time, at location (**), using random disk accesses.

Comparing this with scenario A (full table records retrieved once, using
random disk accesses), we see scenario B will be too expensive except for
[quite rare] cases where "condition X AND NOT range_condition" has very high
selectivity.

Conclusion: to avoid making some kinds of queries run slower, we will not use
MRR/MyISAM when we're requested to produce BLOBs in key order.

4. Implementation of MRR interface functions
--------------------------------------------

A general consideration: NoInterruptingIoAssumption:
we'll assume that amount and frequency of disk I/O between 
multi_range_read_next() calls will not be high enough to break the read sweep,
so we will not buffer ouput tuples, but rather read them one-by-one.
The reasoning here is that subqueries (evaluated in the table condition) will
be unnested, and since MRR will be used together with Batched Key Access,
there will be no interrupting I/O from the same thread.

Also Note: SergeyP experiments with interrupting I/O from other threads have 
shown that even certain amounts of interrupting do not "break the disk sweeps". 

*************************************
* Note after discussion with Monty: *
*************************************
It *is* better to read full table rows at once, as this disk sweep won't be
interrupted. This is outside of this WL entry, but in the future we should
make use of full table row (or field tuple) buffer.

Experiment: Reading two 200M files in 8K chunks one after another is about 
two times faster than reading them in a sequence.

/* 
  Note: the below pseudocode describes handling of all supported HA_MRR_* 
  flags 
*/
int multi_range_read_info(uint keyno, uint n_ranges, uint keys, uint *bufsz,
                          uint *flags, COST_VECT *cost)
{
  record the cost/flags/mem_usage of default  MRR implementation;

  /* Check the rules described in sections 3.1 and 3.2 */
  if (flags & (HA_MRR_INDEX_ONLY | HA_MRR_SORTED))
  {
    return cost/flags/buffer size of the default implementation;
  }

  /*
    We assume HA_MRR_LIMITS is always set - we don't request memory buffer
    larger then the one provided.
  */
 
  /*
    Get the cost of MRR/MyISAM scan as a function of: 
     - number of records
     - table parameters
     - available buffer size
     - test(HA_MRR_NO_ASSOCIATION): 
         if HA_MRR_NO_ASSOCIATION is used, 
           we'll store rowids in the buffer
         else
           we'll store (rowid, range_data_ptr) pairs
  */

  Compare the cost of default MRR implementation and MRR/MyISAM 
    implementation and choose/return the best.
}


ha_rows multi_range_read_info_const(uint keyno, RANGE_SEQ_IF *seq,
                                    void *seq_init_param, uint n_ranges,
                                    uint *bufsz, uint *flags, COST_VECT *cost)
{
  /*
    Call the default implementation: this will allow us to get parameters
    (cost/flags/mem_usage) of default MRR implementation, and will also return 
    E(number of records).
  */
  handler::multi_range_read_info_const(...);

  /* Check rules described in sections 3.1 and 3.2 */
  if (flags & (HA_MRR_INDEX_ONLY | HA_MRR_SORTED))
  {
    return cost/flags/buffer size of the default implementation;
  }

  /*
    Get the cost of MRR/MyISAM scan in the same way as it is done in
    multi_range_read_info()
  */

  Compare the cost of default MRR implementation and MRR/MyISAM 
    implementation and choose/return the best.
}


int multi_range_read_init(RANGE_SEQ_IF *seq, void *seq_init_param,
                          uint n_ranges, uint mode, HANDLER_BUFFER *buf)
{
  fill_buffer();
  /* 
    if the above call has scanned through all intervals in *seq, then adjust
    *buf to indicate that the remaining buffer space will not be used.
  */
  return 0;
}


int multi_range_read_next(char **range_info)
{
  if (!have rowids in the buffer && fill_buffer())
    return EOF;
  return read_full_row(buffer.pop_front());
}


fill_buffer()
{
  if (no more index tuples to read)
    return 1;

  read index tuples and fill the buffer (either with rowids, or with
                                         {rowid,range_id} pairs);
  sort the buffer contents by rowid;
  return 0;
}


5. Cost model
-------------

/* 
  The below formula is true for all components of COST_VECT except
  memory_cost:
*/
cost(MRR/MyISAM scan) = index_read_cost + sorts_and_sweeps_cost;

/* 
  This cost is is only disk-io cost, random seeks are assumed
*/
index_read_cost= {use handler::index_only_read_time() function}


5.1 Number and parameters of "sort and sweep" steps
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MRR/MyISAM uses several "sort and do sweep read" steps.

The number of steps is determined as follows:

/* Length of one entry in the buffer: */
buffer_entry_length = rowid_length + 
                      sizeof(void*) * (!test(HA_MRR_NO_ASSOCIATION))


/* Max. number of entries in the buffer: */
max_buff_entries = buffer_size / buffer_entry_length;

/* 
  Number of iterations we'll make with full buffer (here "range_records" is
  estimated total number of records in all ranges)
*/
n_full_steps= floor(range_records / max_buff_entries);
records_in_full_step= max_buff_entries;

/* 
  Number of records we'll be processing in last iteration, where the buffer
  will not be full 
*/
records_in_last_step= range_records % max_buff_entries;


For sorts_and_sweeps_cost we have:

if (n_full_steps != 0)
  sorts_and_sweeps_cost.mem_cost= buffer_size;
else
  sorts_and_sweeps_cost.mem_cost= records_in_last_step * buff_entry_length;

For all other components of the COST_VECT structure, regular addition rules
apply:

sorts_and_sweeps_cost = 
   n_full_steps * sort_and_sweep_cost(records_in_full_step) + 
   (records_in_last_step==0)? 0 : sort_and_sweep_cost(records_in_last_step);


5.2 Cost of one "sort and sweep" step
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The CPU cost component is the cost of qsort() call:

sort_and_sweep_cost(n_records).cpu_cost = 
  {cost of qsort() call on an array of n_records elements }

<todo>Come up with some formula for CPU cost of qsort. Cost of one comparison
  is rowid comparison cost: (1 / TIME_FOR_COMPARE_ROWID).
</todo>

The IO cost component is the cost of disk sweep.
The opt_range.cc code has the get_sweep_read_cost() function that calculates
the cost of one record retrieval sweep. We'll re-use the cost model from that 
function, but with one exception: get_sweep_read_cost() assumes that if the 
table is accessed within the join, then the sweep is "totally broken" and cost
of reading one block is the cost of random disk seek.  We will not use this 
assumption because we've agreed on using NoInterruptingIoAssumption (search 
above). So we end up with this cost model and formulas:

The cost model is as follows:

We need to retrieve n_rows rows from file that occupies n_blocks blocks.  We 
assume that offsets of rows we need are independent variates with uniform 
distribution in [0..max_file_offset] range.

We'll denote block as "busy" if it contains row(s) we need to retrieve it and
as "empty" if we don't need to retrieve it (i.e. doesn't contain rows we
need).

Probability that a block is empty is 

    (1 - 1/n_blocks)^n_rows 
    
(this applies to any block in file). Let x_i be a variate taking value of 1 if
block #i is empty and 0 otherwise.

  Then E(x_i) = (1 - 1/n_blocks)^n_rows;

  E(n_empty_blocks) = E(sum(x_i)) = sum(E(x_i)) =
    = n_blocks * ((1 - 1/n_blocks)^n_rows) ~= n_blocks * exp(-n_rows/n_blocks)

  E(n_busy_blocks) = n_blocks*(1 - (1 - 1/n_blocks)^n_rows) =
   ~= n_blocks * (1 - exp(-n_rows/n_blocks)).

Average size of "hole" between neighbor non-empty blocks is

  E(hole_size) = n_blocks/E(n_busy_blocks).

  The total cost of reading all needed blocks in one "sweep" is:

  E(n_busy_blocks) * (DISK_SEEK_BASE_COST + DISK_SEEK_PROP_COST * E(hole_size))

*************************************
* Note after discussion with Monty: *
*************************************
Actually the cost of one disk seek in the sweep is not proportional to disk 
seek distance. It has more to do with plate rotations. If the next page to
read is on the same cylinder, then the read is fast, otherwise we'll need to
wait for ~0.5 rotations. The results of a quick test (see attached
skip_read_results.png), at first glance, confirm this.

On the other hand, the rows we need to retrieve are not uniformly distributed,
the data is usually somewhat clustered (so median distance between pages we'll
need to read is smaller).

6. Memory costs per record
--------------------------
The following information might be required for WL#2771:

Assuming there are no limits on buffers size, memory usage per returned record 
is as follows:

For default MRR implementation 

  mem_per_record=0
    
as it uses O(1) memory.

For MRR/MyISAM 
   
  mem_per_record = rowid_length + 1 byte * !test(HA_MRR_NO_ASSOCIATION)
.


7. Note about index condition split up
--------------------------------------
The "Index condition split up" (1) optimization can be used together with
MRR/MyISAM: we could filter out non-matching index tuples in fill_buffer()
function. The question how to arrange "Index condition split up" to be taken
into account in MRR/MyISAM cost calculations remains uncleared (but this
should not be a significant problem).

The MRR/MyISAM implementation as described in this WL entry will not use 
"Index condition split up".

(1) no WL# entry exist yet for this optimization, tentative spec text exist 
as result of pair work of Timour and SergeyP at Malta DevConf.

8. How to fit it into current table handler functions
-----------------------------------------------------

Use of table handler functions: 
We can't use one table handler, because we'll need to make the following
sequences of calls:

handler->index_init()
handler->index_next();
 ...
handler->index_next(); // (*), rowid buffer is now full

handler->rnd_pos(); // retrieving full records after sorting
handler->rnd_pos();
handler->rnd_pos();

...
handler->index_next(); // (**)
...

The call (**) needs to continue scanning the index from the position it was at
right after the (*) call. This will not happen automatically - calling
rnd_pos() will cause the table handler to forget the previous position.

We see two solutions to this problem:

A. Restore the index scan position by calling
    
     handler->index_read({last_index_tuple_value, last_index_tuple_rowid}).

   (UPDATE: this just have worked with InnoDB and MyISAM code needed some
    small adjustments)

B. Create two handler objects, one to be used for scanning the index, and
   another for making rnd_pos() calls.
   (this is how index_merge works. This is not a very elegant solution as
   we're already "inside the table handler", and creating another table
   handler seems to be an overkill).

UPDATE: We're using method A.
Copyright (c) 2001-2007 by MySQL AB. All rights reserved.

WL#2475 Batched range read functions for MyISAM/InnoDb - LLD
============================================================

<contents>
0. Changes against title and HLS
1. Cost functions implementation
2. MRR ordered-full-record retrieval algorithm
2.1 Index scan position restore
2.1.1 MyISAM: position at {key_tuple, rowid} pair
2.1.2 InnoDB: position at {key_tuple, rowid} pair
2.1.3 For the future: introduce position save/restore functions
2.2 Record retrieval algorithm
3. EXPLAIN Output changes
4. Index condition pushdown
4.1 Index condition pushdown for default MRR implementation
4.1.1 Call index condition function in MyISAM
4.1.2 Call index condition function in InnoDB
4.1.3  SQL Layer support for Index condition pushdown
4.1.3.1 Execution
4.1.3.2 Optimization
4.1.3.2.1 Detect presense of index condition in the optimizer
4.1.3.2.1 Effects of index condition on MRR interface
4.2 Index condition pushdown with DS-MRR, executioner
5. Other notes

</contents>

TODO:
 - specify how condition pushdown will work.
 - add a note that we don't support SINGLE_MATCH
 - add a note about memory sizes calculation
 
0. Changes against title and HLS
--------------------------------
* Introduced new name "Disk-Sweep MRR, DS-MRR" so we don't have to repeat
  "MyISAM or InnoDB" everywhere.
* Added index condition pushdown.


1. Cost functions implementation
--------------------------------
Cost functions will be the same for MyISAM and InnoDB (The following 
half-pseudocode is produced by adding some more detail to the description of
cost calculations in the HLS):

int multi_range_read_info(uint keyno, uint n_ranges, uint keys, uint *bufsz,
                          uint *flags, COST_VECT *cost)
{
  record the cost/flags/mem_usage of default MRR implementation;
  
  if ((flags & HA_MRR_INDEX_ONLY) ||
      {doing a scan on clustered PK) ||
      ((flags & HA_MRR_SORTED) && output-field-set-contains-blobs))
  {
    return cost/flags/buffer size of the default implementation;
  }

  /* Indicate that we will return unordered output */
  *flags &= ~HA_MRR_SORTED;

  /* 
    The following call will also adjust *bufsz if the expectation is that only
    part of the buffer will be used.
  */
  cost= get_disk_sweep_mrr_cost();

  Compare the cost of default MRR implementation and MRR/MyISAM 
    implementation and choose/return the best.
}


ha_rows multi_range_read_info_const(uint keyno, RANGE_SEQ_IF *seq,
                                    void *seq_init_param, uint n_ranges,
                                    uint *bufsz, uint *flags, COST_VECT *cost)
{
  Get E(#records), cost/flags/mem_usage of default MRR implementation;

  Do the same that multi_range_read_info() does.
}


/* 
  Return cost of Disk-Sweep MRR scan. This function itself is engine-angostic.
*/

get_disk_sweep_mrr_cost(int &buffer_size)
{
  /* Calculate number of passes: */
  buffer_entry_length = rowid_length + 
                        sizeof(void*) * (!test(HA_MRR_NO_ASSOCIATION))

  /* Max. number of entries in the buffer: */
  max_buff_entries = buffer_size / buffer_entry_length;

  /* 
    Number of iterations we'll make with full buffer (here "range_records" is
    estimated total number of records in all ranges)
  */
  n_full_steps= floor(range_records / max_buff_entries);
  records_in_full_step= max_buff_entries;

  /* 
    Number of records we'll be processing in last iteration, where the buffer
    will not be full 
  */
  records_in_last_step= range_records % max_buff_entries;

  if (!n_full_steps)
    buffer_size= records_in_last_step * buffer_entry_length;

  sorts_and_sweeps_cost= 
     n_full_steps * sort_and_sweep_cost(records_in_full_step) + 
     (records_in_last_step==0)? 0 : sort_and_sweep_cost(records_in_last_step);

  if (n_full_steps != 0)
    sorts_and_sweeps_cost.mem_cost= buffer_size;
  else
    sorts_and_sweeps_cost.mem_cost= records_in_last_step * buff_entry_length;

  index_read_cost= handler::index_only_read_time(...);

  cost(MRR/MyISAM scan) = index_read_cost + sorts_and_sweeps_cost;
  return cost(MRR/MyISAM scan);
}

/* Get cost of one sort-and-sweep step */
get_sort_and_sweep_cost(n_records)
{
  COST_VECT res;
  res.cpu_cost = {cost of qsort() call on an array of n_records elements }
  (res.io_count, res.avg_io_cost)= 
    get_sweep_read_cost(); //adopt from index_merge implementation
  return res;
}


2. MRR ordered-full-record retrieval algorithm
----------------------------------------------


2.1 Index scan position restore
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We'll need to be able to make the following sequences of calls:

  handler->index_init()                \
  handler->index_next[_same]();         |- fill the rowids buffer
  --//--                                | 
  handler->index_next[_same](); // (*) /

  handler->rnd_pos();   \
  --//--                 |-- retrieve full records in a disk sweep
  handler->rnd_pos();   /
  
  handler->index_next(); // (**)  \
  ...                              |-- fill rowids buffer with next portion
  handler->index_next();          /
  
The call (**) will need to continue scanning the index from the position it 
was at right after the (*) call. This will not happen automatically. We'll
make it possible to do in the following way (following the answer #A to
dilemma posed in Section 8 of the HLS):

  handler->index_init()                \
  handler->index_next[_same]();         |- fill the rowids buffer
  --//--                                | 
  handler->index_next[_same](); // (*) /

  // remember the {key_tuple, rowid} pair.

  handler->rnd_init();  \
  handler->rnd_pos();    |-- retrieve full records in a disk sweep
  --//--                 |
  handler->rnd_pos();   /
  
  handler->index_init();                        \
  handler->index_read({key_tuple, rowid}); (***) |
  handler->index_next(); // (**)                 |-- fill rowids buffer with
  ...                                            |   next portion
  handler->index_next();                        /

That is, we'll "restore the index scan" by making the call (***). Some
effort will be required to make this work, see following sections.

CAVEAT: I'm not sure the above will work for R-TREE indexes... If it will
not we'll disable MRR/MyISAM for them.

2.1.1 MyISAM: position at {key_tuple, rowid} pair
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With the current code, calling
  ha_myisam::index_read({key_tuple, rowid}, 
                        length(key_tuple)+length(rowid),
                        HA_READ_KEY_EXACT); // or whatever flag
                        
will not produce the desired effect, because ha_key_cmp function will not
compare the rowid keypart. ha_key_cmp() will compare the rowid keypart if it
is passed appropriate comparison flags.  We'll change ha_myisam::index_read
to check if it has been passed a {key_tuple, rowid} pair, and if yes, arrange
for appropriate flag to be passed down to ha_key_cmp().

2.1.2 InnoDB: position at {key_tuple, rowid} pair
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Experiments in the debugger show that if one calls

  ha_innobase::index_read({key_tuple, rowid}, 
                          length(key_tuple)+length(rowid),
                          HA_READ_KEY_EXACT);
  
then InnoDB will find the {key_tuple, rowid} pair and "restore the scan
position" correctly (looking at the InnoDB code along the execution path
also gives some evidence that calls like the above will do what is desired)

2.1.3 For the future: introduce position save/restore functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the future we should introduce a position save/restore functions into the
handler interface, so a table handler will be able to save and later on
restore index scan position. This is more efficient than the approach used
in this WL.  This task will not be executed in scope of this WL (but when it
is done, this WL code will need to be changed to use save/restore functions).

(position save/restore could also be used for index_merge code, so that code
doesn't use the unclean way to create table handler objects that it is
currently using. However, a cloning handler::fork() function would be a better 
approach for index_merge code).


2.2 Record retrieval algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

int multi_range_read_init(RANGE_SEQ_IF *seq, void *seq_init_param,
                          uint n_ranges, uint mode, HANDLER_BUFFER *buf)
{
  fill_buffer();
  /* 
    if the above call has scanned through all intervals in *seq, then adjust
    *buf to indicate that the remaining buffer space will not be used.
  */
  return 0;
}


int multi_range_read_next(char **range_info)
{
  if (!have rowids in the buffer && fill_buffer())
    return EOF;
  return read_full_row(buffer.pop_front());
}


fill_buffer()
{
  restore index position; // <-- added in LLD 
  if (no more index tuples to read)
    return 1;
  
  while (rowid buffer is not full)
  {
    get next {index_tuple, rowid} pair;
    if (index condition doesn't hold)   // See below for details about 
      continue;                         // "index condition"
    put rowid, or {rowid, range_id} pair into the buffer;
  }
  save index position; // <-- added in LLD 
  sort the buffer contents by rowid;
  return 0;
}


3. EXPLAIN Output changes
-------------------------
The "Extra" column in EXPLAIN output will reflect: 

* If Disk-Sweep-MRR (= MRR/MyISAM or MRR/InnoDB) is used. If it is,
  the "Extra" column will contain "Using MultiRangeRead"

* If Condition pushdown is used. If it is, the "Extra" column will contain
  "Using condition pushdown; "


4. Index condition pushdown
---------------------------
We will implement "Index Condition Pushdown" (ICP) within the scope of this
worklog entry. "Index Condition" here means a part of the WHERE condition
that can be evaluated when we have only fields covered by the index. Note
that the condition doesn't need to be SARGable.

ICP will work when
 - DS-MRR implementation is used, or
 - default MRR implementation is used for accessing MyISAM or InnoDB table.

In both cases, cost functions will take ICP into account.

4.1 Index condition pushdown for default MRR implementation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Codewise, it is difficult to implement ICP entirely on SQL layer in the 
default MRR implementation, because we'll have to do something like the
following:

  get_next_record()
  {
     do {
       // get {key_tuple,rowid} for next row;
       
     while (!index condition is satisfied);
     
     //save index scan position;
     file->rnd_init();
     file->rnd_pos();
     file->index_init();
     //restore scan position;     
  }

Performing all of the above operations is likely to cause much CPU overhead.
It seems that an easier solution would be that SQL layer provides table
engine a pointer to C function that performs the filtering. The calling
contract for the function will be as follows:

/*
  Perform index condition filtering, assuming the fields covered by the index
  are in table->record[0].

  RETURN
    TRUE  - Ok, index condition is satisfied
    FALSE - Index condition is not satisfied
*/
my_bool (*index_cond_func)();


4.1.1 Call index condition function in MyISAM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The call should be placed somewhere in mi_rkey() function.


4.1.2 Call index condition function in InnoDB
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The index condition check function call should be placed in 
row_search_for_mysql() function. InnoDB calls 
row_sel_get_clust_rec_for_mysql() to access full table record.


4.1.3  SQL Layer support for Index condition pushdown
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We need to make two provisions on SQL layer to support Index Condition
Pushdown:

4.1.3.1 Execution
~~~~~~~~~~~~~~~~~
The operation of "pushing the condition down" will be peformed within the
handler->cond_push() call. SQL layer will provide make_cond_for_index() 
function that will extract index condition from the table condition.

make_cond_for_index() shall produce a condition that has the following
properties:
 - It is a most complete expression that can be evaluated when you have 
   only index fields tuple.
 - It may depend on
    = constants and fields of "const" tables
    = fields that are fully covered by the given index.

Implementation of make_cond_for_index() will be modeled after 
make_cond_for_table():
 * It will use self-recursion for AND/OR, like make_cond_for_table() does.
 * For all other kinds of Items, it will recurse down the item tree
   (ultimately reaching Item_field* and item_field->field->part_of_key)
   and determine if a particular expression can be evaluated only using the
   given index.


4.1.3.2 Optimization
~~~~~~~~~~~~~~~~~~~~
We need to know which fraction of records retrieved by scanning the ranges
will be filtered out by index condition (as the total cost of MRR scan depends
on number of full table records that we'll need to retrieve). That is rather
difficult to achieve because:
  a. We don't have any solution to the problem "Produce a selectivity
     estimate for some condition C when it will be used to filter out records
     within list of ranges R".
  b. Index condition extraction is an expensive operation, and we'd like to
     avoid doing it at optimization phase.
     
Therefore we will not extract index condition but do the following instead:

4.1.3.2.1 Detect presense of index condition in the optimizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We'll modify the range optimizer to try to detect if the index condition will
be present.

Within the range optimizer (i.e. in get_mm_tree):

Introduce RANGE_OPT_PARAM::index_conds_bitmap, 
  (index_conds_bitmap[i]==TRUE) <=> "There is extra condition for index i".

At range optimizer start, index_conds_bitmap[i] == FALSE;

Within the range optimizer:

When processing a condition cond:


# For SARGable elementary conditions (i.e. leafs AND/OR tree):
get_mm_leaf(expr)
{
  ...
  /* 
    The condition will be guaranteed to be true for rows that fit within
    ranges.
  */
  index_conds_bitmap= 0; 
}


# For non-SARGable elementary conditions (i.e. leafs AND/OR tree):
<somewhere in get_mmt_tree() >{
   if (cond->used_tables() & ~(head->table-map | const_table_map))
   {
     // Ok, the condition uses only this table. Find out if we 
     // can check it for index tuples of some indexes.

     context.index_cond_present_bitmap= 0;
     context.index_cond_bitmap= ~0;
       
     cond->walk(&context,
                Item_field::process() {
                  if (this field is not covered by any index)
                    return TRUE; //abort
                  indexes= {bitmap of indexes that cover this field};
                  context->index_cond_present_bitmap |= indexes;
                  context->index_cond_bitmap         &= indexes;
                });
     index_conds_bitmap= context.index_cond_bitmap |
                         context.index_conds_present_bitmap;
   }
}


# For AND:
tree_and(tree1, tree2)
{
  index_conds_bitmap= 
    tree1.index_conds_bitmap(tree1) union
    tree2.index_conds_bitmap(tree2);

  for each index IDX
  {
    /*
      Check if produced list of intervals is not equivalent to the condition
      on that index. This is the case when the resulting SEL_ARG* tree for the
      index IDX represents a condition in form:
       
        non_singlepoint_range(keypart_N) AND some_range(keypart_M) AND other_kp
       
      (here N > M, other_kp doesn't refer to keypart_N or keypart_M).
    */

    if (the resulting SEL_ARG* tree has the above form)
      index_conds_bitmap[IDX] = TRUE;
  }
}

# For OR:
tree_or(tree1, tree2)
{
  /*
    (index_cond1 AND non_index_cond1) OR (index_cond2 AND non_index_cond2) =     
     = (index_cond1 OR index_cond2) OR non_index_result_cond
     
    Here we assume that if index_cond_1 and index_cond_2 are non-confluent
    (not just "TRUE"), then "index_cond_1 OR index_cond_2" will be
    non-confluent too.
  */
  index_conds_bitmap= 
    tree1.index_conds_bitmap intersect tree2.index_conds_bitmap;

  /*
    tree_or also can perform this conversion: 
    
    (key1condA AND key2cond) OR (key1condB AND key3cond) = 
    (key1condA OR key1condB) AND remainder_cond --> (key1condA OR key1condB).

    Here we should do 
    index_conds_bitmap[key2]= TRUE;
    index_conds_bitmap[key3]= TRUE;
  */
}

4.1.3.2.1 Effects of index condition on MRR interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
multi_range_read_info_[const]() functions will accept additional flag: 
HA_MRR_IDX_COND_SELECTIVITY_05. The SQL layer will set the flag depending on
the results of range analysis (see previous section).  The MRR implementation
will analyze the flag and make obvious adjustments to the estimate of number of
full records it is going to retrieve.

4.2 Index condition pushdown with DS-MRR, executioner
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since DS-MRR will perform index-only scans, table handler will not check ICP
condition itself. Therefore DS-MRR function will have to check the condition.


5. Other notes
--------------
* In scope of this WorkLog entry we will not support semi-join-related
  HA_MRR_* flags.

* The implementation will correctly process the presense/absense of 
  HA_MRR_NO_ASSOCIATION flag, but there will be no way to check that.

* In scope of the first milestone of this worklog entry we will not
  implement MRR for MERGE table type.

* It would be nice to see if our cost model matches the reality. We could 
  run some benchmark in a simple settings (one query on the system, synthetic
  dataset, etc) and collect the data to plot dependency of query execution
  time from certain parameters. While running the benchmark, we could also
  collect the numbers obtained by DS-MRR cost functions, and see if there is a
  reasonable correspondence between the two.

SCHEDULE
========
 - Implement Disk-Sweep-MRR executioner code       24 hrs
MS1: Disk-Sweep MRR works for InnoDB Table - Apr,3  End-of-day

Index position restore for MyISAM
 - Implementation                                  16 hrs
 - Testing                                          8 hrs
MS2: Disk-Sweep MRR works for MyISAM Table - Apr,11 End-of-day

Reflect DS-MRR in EXPLAIN Output
 - Implementation and testing                      12 hrs

 - Implement Disk-Sweep-MRR cost functions         16 hrs
MS3: DS-MRR works for both engines, the choice is
     based on mrr_*info() functions - Apr,20 End-of-day

Implement Index Condition Split-up
 - Make a generic Split-up function                16 hrs
 - Perform filtering on execution in MRR/DS         4 hrs
 - Account for filtering in cost function           4 hrs

MS4: Index Condition Split-up works - Apr,28 End-of-day

Other checks
 - Check if MRR works for GEOM, disable if not      8 hrs
- Testing and fixing whatever pops up              24 hrs
MS5: Code ready for review -  May,12 End-of-day

- Post-review fixes                                20 hrs
MS6: Post-review fixes done May,22 - May,26, End-of-day