WL#1393: Optimizing filesort with small limit

Affects: Server-5.6   —   Status: Complete

Many web customers have to do
"SELECT ... ORDER BY non_index_column LIMIT X",

When X *  is smaller than sort_buff_size we can use
the following algoritm to speed up the sort:

- Create a queue to hold 'limit' keys.
- Scan through the table and store the first (last if DESC) keys in the queue
- Return values from queue

This is much faster than the current algoritm that works as:
-- Repat for all matching row in table
  - Scan until we have filled sort_buffer
  - Write 'limit' keys to merge file
- Merge buffers, until we found limit keys

[ Note added by Peter Gulutzan 2006-11-21, modified 2009-08-31 ]

Tom Kyte wrote an article in Oracle Magazine, apparently extracting
from one of his books. He describes Oracle's trick
for their LIMIT equivalent, that is, for
"SELECT ... FROM (SELECT ... ORDER BY x) ... WHERE ROWNUM < n".
He says that when n is small, Oracle will
(a) fetch n rows into array in fast memory
(b) sort array
(c) repeat till no more rows:
    fetch
    if fetched value < last value in array
      put fetched value in appropriate place in array,
      shifting following items forward,
      letting last value in array be removed
Perhaps this is the same thing that Monty was trying to say, I can't tell.

I have attached a C program file wl1393.c to this worklog.
It demonstrates the proposed method and uses gettimeofday to compare
the speed of the old versus proposed methods. The proposed method
is shown to be, on an x86-64 after compiling with gcc, 10 to 20 times
faster than a quicksort when total-number-of-rows = 1,000,000 and limit = 50,
except in the worst case where the original record set is totally reverse order.

Queries which are amenable to the optimization are:

SELECT  FROM  ORDER BY  LIMIT N.

We should also handle sub-queries of the above form.

The function find_all_keys() is extended with an optional priority
queue argument (PQ), and becomes

A>   while (get_next_sortkey())
     {
       if (using priority queue)
B>       push sort key into queue
       else
       {
         if (no free space in sort_keys buffers)
         {
           sort sort_keys buffer;
C>         dump sorted sequence to 'tempfile';
           dump BUFFPEK describing sequence location into 'buffpek_pointers';
         }
D>       put sort key into 'sort_keys';
       }
     }
     if (sort_keys has some elements && dumped at least once)
       sort-dump-dump as above;
     else
E>     don't sort, leave sort_keys array to be sorted by caller.

The choice of whether to use PQ is cost based, considerations are:
A> the initial table scan has the same I/O cost for both algorithms,
B> 'push into sort queue' has CPU cost O(log(LIMIT)) for each key we push
C> significant savings in I/O cost can be assumed if we dump lots of
   buffers to tempfile. We will also save I/O and CPU cost by omitting
   the subsequent merge operation.
D> 'put sort key into 'sort_keys'' has no cost
E> If all rows are in memory, the final sorting by the caller is
   always done (using quicksort or similar) 
   this has CPU cost O(LIMIT * log(LIMIT))


For small input size (everything fits in memory) and high limit, we
should *not* use PQ, since a simple quick-sort will be cheaper.

For larger input size, and small limit, PQ will always outperform
sort/merge, as long as the PQ and sort_key buffer both fit in memory.

If the rows contain large fields, which are not part of the sort, we
may create a PQ for the sort-fields and keys only, and then do a final
lookup in the input table before returning the result.

If PQ is determined to not yield any performance gain, we fallback to
the original sort/merge algorithm.


The current cost analysis in the optimizer does not calculate the cost
of a filesort. Rather than extending the current cost model, we
propose to let filesort make a local decision whether to use PQ or
not.

EXPLAIN will not show whether the PQ algorithm is actually in use.
This will be handled by WL#5558

The implementation needs to maintain the semantics of
SQL_CALC_FOUND_ROWS, for both PQ and sort/merge. 
This can be implemented by counting rows during the initial file scan.

/**
  Function for making sort-key from input data.
  @param param Sort parameters.
  @param to    Where to put the key.
  @param pos   The input data.
 */
typedef void (*keymaker_function)(Sort_param *param, uchar *to, uchar *pos);

/**
  A priority queue with a fixed, limited size.

  This is a wrapper on top of QUEUE and the queue_xxx() functions.
  It keeps the top N elements which are inserted.
 */
class Bounded_queue
{
public:
  Bounded_queue();
  ~Bounded_queue();

  /**
    Initialize the queue.

    @param max_elements   The size of the queue.
    @param offset_to_key  Offset to key in elements stored in the queue.
    @param max_at_top     Set to 1 if you want biggest element on top.
    @param compare        Compare function for elements, takes 3 arguments.
    @parem compare_length Lenght of the data used for sorting.
    @param keymaker       Function which generates keys for elements.
    @param sort_param     Sort parameters.
    @param sort_keys      Array of pointers to keys to sort.

    @retval 0 OK, 1 Could not allocate memory.

    We do *not* take ownership of any of the input pointer arguments.
   */
  int init(ha_rows max_elements, uint offset_to_key, pbool max_at_top,
           queue_compare compare, size_t compare_length,
           keymaker_function keymaker, Sort_param *sort_param,
           uchar **sort_keys);

  /**
    Pushes an element on the queue.
    If the queue is already full, we discard the smallest (or biggest) element.

    @param element        The element to be pushed.
   */
  void push(uchar *element);

  /**
    Removes an element from the queue.

    @retval Pointer to the removed element.
   */
  uchar *pop();

  /**
    The number of elements in the queue.
   */
  uint num_elements() const { return m_queue.elements; }

  /**
    Is the queue initialized?
   */
  bool is_initialized() const { return m_queue.max_elements > 0; }
};



/*
  Calculate cost of merge sort

    @param num_buffers    # of merge buffers
    @param max_n_elems    max # of keys in memory buffer
    @param last_n_elems   # of keys in last buffer
    @param elem_size      Size of each element.

    Calculates cost of merge sort by simulating call to merge_many_buff().

  @retval
    computed cost of merge sort
*/

double get_merge_many_buffs_cost_fast(ha_rows num_buffers, ha_rows max_n_elems,
                                      ha_rows last_n_elems, uint elem_size);


/**
  Test whether priority queue is worth using to get top elements of an
  ordered result set. If it is then allocates buffer for required amount of
  records

  @param param     Sort parameters.
  @param info      Filesort information.
  @param table     Table to sort.
  @param num_rows  Estimate of number of rows in source record set.
  @param memavl    Memory available for sorting.

  DESCRIPTION
    Function tests applicability of PQ to get L top records for queries like
    SELECT ... FROM t ORDER BY a1,...,an LIMIT max_rows;
    It tests whether there is enough memory to contain max_rows elements,
    whether max_rows is small enough to make PQ perform faster that qsort
    on table containing 'num_rows' records. If available memory is enough to
    store only 'max_rows' sort keys without addon data, it modifies 'table'
    to sort with references to tuples instead of additional data. In evaluations
    it uses 'param' to get sort key, addon data and reference sizes, 'table' -
    to get table scan time.

   @retval
    allocated buffer - if it's ok to use PQ
    NULL - PQ will be slower than merge-sort, or there is not enough memory.
*/

uchar **check_if_pq_applicable(Sort_param *param, FILESORT_INFO *info,
                               TABLE *table, ha_rows num_rows, ulong memavl);


/**
  Provides an appropriate EXPLAIN text for filesort().

  @param thd        Thread handle.
  @param join       Join handle.
  @param table      Table to sort.

  @retval "; Using filesort" or "; Using filesort(PQ)"

  Performs the same cost analysis as filesort() in order to determine
  whether priority queue (PQ) is appliccable for the sort or not.

  @note
  Some joins are actually simplified during execution, @sa make_simple_join()
  and may thus be sorted using a PQ. This will not be shown in the explain text.
 */
const char *explain_filesort(THD *thd, JOIN *join, TABLE *table);