WL#6986: Make switching of index due to small limit cost-based

Affects: Server-Prototype Only   —   Status: Complete   —   Priority: Medium

The decision to switch to an index that returns rows in the order requested by
"ORDER BY" is not cost-based in all places.
For queries with "ORDER BY ... LIMIT N", the optimizer tries to switch to an
index that supports the "ORDER BY" order. This currently happens in two
places, one of which is make_join_select(). Here the switch is not cost-based:
if a suitable index exists, the optimizer will switch to it. This can lead to
poor performance in cases where the query needs to read a large part of, or
the entire, index in order to find qualifying records.

The goal of this worklog is to make the decision in make_join_select() of
whether to switch to a new index in order to support "ORDER BY ... LIMIT N"
cost-based. Note that since we have little knowledge of how much data will
actually need to be read before the query has produced N records, this worklog
will not prevent queries from ending up reading the entire index, but it
should prevent us from switching in cases where it is likely that we would
have to read a lot from the index.

Functional requirements:
------------------------
F1) The feature shall use the cost model present in "test_if_cheaper_ordering"
in "make_join_select" to decide whether to switch to an index that returns
results in the requested order. This may result in changed query execution
plans.
  
Non-functional requirements:
----------------------------
NF1) The performance shall be as good as or better than before for 95%
        of all test queries.
NF2) A performance degradation of 5% for up to 5% of the queries is 
        tolerable in the general case
NF3) A higher performance degradation clearly caused by skewed or 
        covariant data [1] is acceptable as long as NF1 is satisfied.

For queries that have "ORDER BY ... LIMIT N" it is in many cases beneficial to
use an index that supports the "ORDER BY" order, since we (a) can avoid having
to do a file sort of the result and (b) can stop after having produced the
first N result records. The optimizer already supports this by checking
whether it is possible to switch to such an index. This test is done after the
join order has been decided and is today done in two different places in the
code:

1. In make_join_select(): If the current access method is via a quick
   select, the query has a low limit value, and there exists an alternative
   index that can be used for reading in "order by" order, then we switch
   to use this index. This switch is not cost-based and we do the switch
   even in cases where we might have to read a large portion of the new
   index.

2. When calling test_if_skip_sort_order() we use
   test_if_cheaper_ordering() to decide if we should switch to an
   alternative index in order to avoid file sort. The decision to switch
   index is cost based and tries to take into account an estimate of how
   much it will likely need to read from the index in the case where a
   limit is given.

The switch to a new index done in make_join_select() can increase execution
time if the query needs to read a large part of, or the entire, index in order
to find enough qualifying records.

Problem Statement:

Consider the following query:

SELECT patientId, time
FROM t1
WHERE patientId > 41 AND patientId < 43 AND time > 0 AND illness="Headache"
ORDER BY time LIMIT 1;

If the table has indexes on "patientId" and "time", the optimizer can choose
between a range scan on "patientId > 41 AND patientId < 43" and a range scan
on "time > 0". It can also choose the index on "time" to return rows in the
order requested by the ORDER BY clause.
In this case, say the range on "patientId" matches 5 rows and the range on
"time" matches 1000 rows. Without the ORDER BY, the index on "patientId" would
have been chosen.
But because the index on "time" returns rows in the requested order and the
limit is very small, the optimizer decides to switch to the "time" index. This
is based on the assumption that we will likely encounter a row satisfying
"patientId > 41 AND patientId < 43" and "illness = 'Headache'" early on,
resulting in an optimal plan.
This leads to the problem at hand: what if the range "patientId > 41 AND
patientId < 43" does not match any rows? If the optimizer chooses the index on
"time", searching for the one record we are looking for may make query
execution scan the entire index, which is definitely not a good thing to do.
In that case it would have been wiser to choose the index on "patientId".

Even in cases where the two possible ranges match comparable numbers of rows
(unlike in this example), it is better to make a cost-based decision when the
optimizer considers switching to an index that returns records in the
requested order. With the example above, if we had estimated the cost based on
the number of rows we might have to scan if "time" were chosen, using
"quick_condition_rows" (which in this case would be 0), the optimizer could
have avoided the switch.

Cost based switching in test_if_cheaper_ordering:

Returning to the example above, say the range on "patientId" matches 5 records
and the range on "time" matches 1000. If we switch to "time" to get results in
the requested order, we expect to scan about 200 records to find the one
record requested by the LIMIT clause.

To arrive at 200, it is presumed that data is randomly distributed and the
following equation is used:

  Number of records to be scanned =
      select_limit * quick_rows[index] / quick_condition_rows

In the example above:
  select_limit is 1
  quick_rows for the "time" index is 1000
  quick_condition_rows is 5

From this, the optimizer can compare the cost of scanning 200 records against
scanning 5 records and doing a filesort. In this case the latter is cheaper,
thereby choosing "patientId" over "time".

So it would be wiser to use this cost model in "make_join_select" as well.

As stated above, the cost model present in "test_if_cheaper_ordering" makes an
assumption about the distribution of the data.

In the example presented above, let us modify the statistics a little:

Say the index on "patientId" matches 200 records and the index on "time"
matches 1000 records. Using the equation above, the optimizer arrives at 5 as
the number of records expected to be scanned via the "time" index. Since more
records satisfy the range on "patientId", the probability of finding matching
records early is higher. Hence the index on "time" is chosen.

Now what if the data is distributed in such a way that these 200 records are
found at the end of the index when "time" is used, i.e. they are the 801st to
1000th entries of that index? Then 801 records must be scanned to get the one
record needed by the query.

This again results in a bad plan.
Ideally, the cost model in "test_if_cheaper_ordering" should be refined to
produce a more accurate estimate. But with the information currently available
about the distribution of the data, the optimizer has to live with this
assumption. A customer has suggested that we provide a configuration variable
to turn off this optimization, which could be considered.

This worklog makes the choice of whether or not to switch index in
make_join_select() cost-based, in order to reduce the likelihood of switching
in cases where a large part of the index would be read. To address this
problem, we can reuse the cost model that is implemented in
test_if_cheaper_ordering().

This worklog should also give the user a configuration variable to turn off
the optimization if needed.

Related bugs:

BUG#57001 INCORRECT OPTIMIZATION OF ORDER BY CLUSTERED_KEY LIMIT

Implementation:

Currently in make_join_select() we have the following heuristic to choose a
range scan when there is an ORDER BY with a low limit:

For all the possible "usable_keys", check whether any key gives results in the
order requested by the ORDER BY clause, and pass it as input to
"test_quick_select" to re-check for a possible range scan.

This will be changed to use "test_if_cheaper_ordering":
 if (recheck_reason == LOW_LIMIT)
            {

              /*
                Do a cost based search on all the keys for an index which can
                give ordered results
               */

              test_if_cheaper_ordering(tab, join->order, tab->table, usable_keys,
                                       -1, select_limit,
                                       &best_key, &read_direction,
                                       &select_limit);

              /*
               * test_if_cheaper_ordering compares range/ref scan vs choosing
               * index for ordering.  But, in this case we might have table
               * scan/index scan in hand.  So re-check if there is
               * a possibility for a range scan that can give ordered results
               * even though it was not the cheapest.
               */

              if (best_key < 0)
              {
                for (uint idx= 0; idx < tab->table->s->keys; idx++)
                {
                  if (usable_keys.is_set(idx) &&
                      tab->table->quick_keys.is_set(idx) &&
                      tab->table->quick_rows[idx] ==
                        tab->table->quick_condition_rows)
                  {
                    const int read_direction=
                      test_if_order_by_key(join->order, tab->table, idx);
                    if (read_direction != 0)
                      best_key= idx;
                    break;
                  }
                }
              }

              interesting_order= (read_direction == -1 ? ORDER::ORDER_DESC :
                                  ORDER::ORDER_ASC);

              if (best_key < 0)
                recheck_reason= DONT_RECHECK; // No usable keys

              /*
                If the current plan is to use a range access on an
                index that provides the order dictated by the ORDER BY
                clause there is no need to recheck index usage; we
                already know from the former call to
                test_quick_select() that a range scan on the chosen
                index is cheapest. Note that previous calls to
                test_quick_select() did not take order direction
                (ASC/DESC) into account, so in case of DESC ordering
                we still need to recheck.
               */
              if (sel->quick && (sel->quick->index != MAX_KEY) &&
                  best_key == (int)sel->quick->index &&
                  (interesting_order != ORDER::ORDER_DESC ||
                   sel->quick->reverse_sorted()))
              {
                recheck_reason= DONT_RECHECK;
              }

              if (best_key >= 0)
              {
                usable_keys.clear_all();
                usable_keys.set_bit(best_key);
              }
            }

The main change from the previous implementation is in the call to
"test_if_cheaper_ordering". Unlike the previous heuristic, this is how the
index is chosen inside test_if_cheaper_ordering:

for (nr=0; nr < table->s->keys ; nr++)
  {
    int direction;
    uint used_key_parts;

    if (keys.is_set(nr) &&
        (direction= test_if_order_by_key(order, table, nr, &used_key_parts)))
    {
      /* 
        Don't use an index scan with ORDER BY without limit.
        For GROUP BY without limit always use index scan
        if there is a suitable index. 
        Why we hold to this asymmetry hardly can be explained
        rationally. It's easy to demonstrate that using
        temporary table + filesort could be cheaper for grouping
        queries too.
      */ 
      if (is_covering ||
          select_limit != HA_POS_ERROR || 
          (ref_key < 0 && (group || table->force_index)))
      { 
        double rec_per_key;
        double index_scan_time;
        KEY *keyinfo= table->key_info+nr;
        if (select_limit == HA_POS_ERROR)
          select_limit= table_records;
        /* 
          If tab=tk is not the last joined table tn then to get first
          L records from the result set we can expect to retrieve
          only L/fanout(tk,tn) where fanout(tk,tn) says how many
          rows in the record set on average will match each row tk.
          Usually our estimates for fanouts are too pessimistic.
          So the estimate for L/fanout(tk,tn) will be too optimistic
          and as result we'll choose an index scan when using ref/range
          access + filesort will be cheaper.
        */
        select_limit= (ha_rows) (select_limit < fanout ?
                                 1 : select_limit/fanout);
        /*
          We assume that each of the tested indexes is not correlated
          with ref_key. Thus, to select first N records we have to scan
          N/selectivity(ref_key) index entries. 
          selectivity(ref_key) = #scanned_records/#table_records =
          refkey_rows_estimate/table_records.
          In any case we can't select more than #table_records.
          N/(refkey_rows_estimate/table_records) > table_records
          <=> N > refkey_rows_estimate.
         */
        if (select_limit > refkey_rows_estimate)
          select_limit= table_records;
        else
          select_limit= (ha_rows) (select_limit *
                                   (double) table_records /
                                    refkey_rows_estimate);
        rec_per_key= keyinfo->rec_per_key[keyinfo->user_defined_key_parts - 1];
        set_if_bigger(rec_per_key, 1);
        /*
          Here we take into account the fact that rows are
          accessed in sequences rec_per_key records in each.
          Rows in such a sequence are supposed to be ordered
          by rowid/primary key. When reading the data
          in a sequence we'll touch not more pages than the
          table file contains.
          TODO. Use the formula for a disk sweep sequential access
          to calculate the cost of accessing data rows for one 
          index entry.
        */
        index_scan_time= select_limit/rec_per_key *
                         min(rec_per_key, table->file->scan_time());
        if ((ref_key < 0 && is_covering) || 
            (ref_key < 0 && (group || table->force_index)) ||
            index_scan_time < read_time)
        {
          ha_rows quick_records= table_records;
          ha_rows refkey_select_limit= (ref_key >= 0 &&
                                        table->covering_keys.is_set(ref_key)) ?
                                        refkey_rows_estimate :
                                        HA_POS_ERROR;
          if ((is_best_covering && !is_covering) ||
              (is_covering && refkey_select_limit < select_limit))
            continue;
          if (table->quick_keys.is_set(nr))
            quick_records= table->quick_rows[nr];
          if (best_key < 0 ||
              (select_limit <= min(quick_records,best_records) ?
               keyinfo->user_defined_key_parts < best_key_parts :
               quick_records < best_records))
          {
            best_key= nr;
            best_key_parts= keyinfo->user_defined_key_parts;
            if (saved_best_key_parts)
              *saved_best_key_parts= used_key_parts;
            best_records= quick_records;
            is_best_covering= is_covering;
            best_key_direction= direction; 
            best_select_limit= select_limit;
         }
        }   
      }      
    }
  }