WL#1509: Pack values of non-sorted fields in the sort buffer
Status: Complete
For queries with GROUP BY and ORDER BY clauses, an external file sort may be performed. How this operation is carried out for a query depends on the value of the max_length_for_sort_data system variable. If the value is greater than the total length of the sorted fields plus the total length of the fields to be retrieved, the values of all these fields are read into the sort buffer. Otherwise, only the values of the sort fields plus a reference to the row are read into the buffer. The former requires more time for sorting but guarantees fast retrieval of the result set, as the values to be returned are read sequentially from an external sort file.

Currently, the values of the additional fields appended to the values of the sort fields are not packed. As a result, a lot of space in the sort buffer and in the external sort file is wasted. For example, a varchar(255) value that is only 3 characters long takes 255 bytes. This causes a significant slowdown of the external sort procedure, as less sorted data fits in the sort buffer, and consequently more merge passes have to be performed.

Thus, values of appended fields of variable-length types should be packed. If the total length of the sorted fields and the appended packed values for a row exceeds the value of max_length_for_sort_data, a reference to the row is appended to the values of the sorted fields instead.

User Documentation
==================
http://dev.mysql.com/doc/refman/5.7/en/order-by-optimization.html
http://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-3.html
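The strategy choice described above can be sketched as follows. This is a simplified illustration only, not the server's actual code; the function and enumerator names are invented for this sketch:

```cpp
#include <cassert>
#include <cstddef>

// Hedged sketch (not server code) of the strategy choice described in
// the text: if max_length_for_sort_data exceeds the sort keys plus all
// retrieved field values, everything goes into the sort buffer;
// otherwise only the sort keys plus a row reference are stored.
enum class SortPayload { kAddonFields, kRowReference };

SortPayload choose_payload(std::size_t sort_length, std::size_t addon_length,
                           std::size_t max_length_for_sort_data) {
  if (max_length_for_sort_data > sort_length + addon_length)
    return SortPayload::kAddonFields;  // fast, sequential result retrieval
  return SortPayload::kRowReference;   // smaller records, rows re-read later
}
```

With the default max_length_for_sort_data of 1024, a row whose sort keys and retrieved fields total 300 bytes would use "addon fields", while a 2100-byte row would fall back to row references.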
Functional Requirements

F-1: In order to see whether the fields are packed or not, the user must inspect the optimizer trace. The current format is:

  "filesort_summary": {
    "rows": 100,
    "examined_rows": 100,
    "number_of_tmp_files": 0,
    "sort_buffer_size": 25080,
    "sort_mode": "<sort_key, additional_fields>"
  }

The "sort_mode" is "<sort_key, additional_fields>" when building a sorted index. This will be extended with "<sort_key, packed_additional_fields>".

Non-functional Requirements

NF-1: No performance regressions in queries that do not use filesort.
NF-2: No performance regressions in our standard performance tests (sysbench and DBT3).
NF-3: No performance regressions for filesort queries that do not use "addon fields".
NF-4: Significant performance improvement for data where records can be packed, and this packing avoids a buffer overflow.
NF-5: An equally significant performance degradation must be expected for the cases where the extra 2-byte length field results in a buffer overflow (columns are packable, but the actual data is not).
NF-6: If all data fits in the buffer, no queries should be slower.

The exact meaning of "significant" is determined by QA. I assume they use some statistics: variance, confidence intervals, ...
This worklog is a pure optimization, i.e. no new externally visible functionality will be implemented. By packing the records in the filesort buffer, we expect major performance improvements for sorting of variable-sized data (CHAR/VARCHAR). Note that all nullable fields can also be packed. The actual in-memory sorting is not changed by this WL; we keep the same fixed-size, byte-by-byte comparable keys. The packing is not applicable if filesort decides to use a priority queue for sorting (the "order by with limit" optimization).

For background, see the reference manual, chapter 11.2 "How MySQL Does Sorting (filesort)":
http://dev.mysql.com/doc/internals/en/filesort.html

The packing is only applicable if filesort decides to use "addon fields" (the modified algorithm in the reference manual), rather than "record references" (the original algorithm).

The actual performance benefit depends on the data to be sorted. If the data contains many strings that are shorter than their maximum length, or many NULL values for nullable columns, we may potentially gain a lot, both for main-memory sorting (more records fit in the buffer) and disk-based sorting (fewer write-read-merge-write cycles). For pathological cases we may actually see a performance degradation: if all strings have maximum length, and/or there are no NULL values, the records to be sorted must still be extended with an "actual length" field, which means we might have space for fewer records in main memory, and must resort to disk-based sort-merge.

"Addon fields" are only used if the maximum record size is smaller than max_length_for_sort_data. The default value for this variable is 1024, and it has been unchanged since it was introduced 10 years ago. We *could* increase this value: we are now able to compress records, and most notably, people have a lot more RAM these days than 10 years ago. For this WL, we have decided not to alter max_length_for_sort_data.

We could make the new feature user-controllable (e.g. optimizer_switch_pack_filesort).
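To make the space trade-off concrete, here is some toy arithmetic, not server code, contrasting fixed-size and packed storage of an addon field. The 2-byte overhead shown is the record-level length field mentioned in the requirements; the per-field encoding used by the server (field->pack()) differs in detail:

```cpp
#include <cassert>
#include <cstddef>

// Toy arithmetic (not server code) for the varchar(255) example in the
// text. Unpacked, an addon field always occupies its declared maximum
// length; packed, it costs roughly the actual data plus a small length
// overhead (assumed here to be 2 bytes, for illustration).
std::size_t unpacked_addon_size(std::size_t max_field_length) {
  return max_field_length;  // always the declared maximum
}

std::size_t packed_addon_size(std::size_t actual_length) {
  return 2 + actual_length;  // assumed 2-byte length overhead + actual bytes
}
```

A 3-character value in a varchar(255) column drops from 255 to about 5 bytes; a full 255-character value grows to about 257 bytes, which is the pathological degradation case of NF-5.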
We have decided not to do this.

The patch for Bug#12856915 VALGRIND FAILURE IN FILESORT/CREATE_SORT_INDEX introduced valgrind suppressions for writing sort buffers to disk. This worklog should obviate the need for such suppressions.

FUTURE ENHANCEMENTS, *NOT* COVERED BY THIS WORKLOG:

If "addon fields" are used, there is data duplication: the fields in the ORDER BY list are stored twice, first as byte-by-byte comparable fields used for sorting, then in ordinary field format in the "addon fields". We could save some space by skipping them in the "addon fields", but we would have to undo the effects of Field::make_sort_key() when reading the sorted result.

For some queries we could also pack the sort keys. If the ORDER BY list has a trailing char/varchar column, we could do

  memcmp(key1, key2, min(len(key1), len(key2)))

Benchmarking shows that this speeds up sorting if the actual key size is smaller than 90% of the max key size. (Results measured on Intel, where reading of unaligned integers is very fast.)
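The suggested comparison for variable-length sort keys could look roughly like this. This is a sketch of the idea only; the worklog itself does not implement it:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Sketch of the *future* enhancement above: byte-wise comparison of two
// packed sort keys ending in a variable-length char/varchar part.
// Compares the common prefix; if the prefixes are equal, the shorter
// key sorts first.
int compare_packed_keys(const unsigned char *key1, std::size_t len1,
                        const unsigned char *key2, std::size_t len2) {
  const int cmp = std::memcmp(key1, key2, std::min(len1, len2));
  if (cmp != 0) return cmp;
  if (len1 < len2) return -1;
  return len1 > len2 ? 1 : 0;
}
```

This preserves the byte-by-byte comparability of the keys while avoiding the trailing space/zero padding that fixed-size keys require.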
The major changes to be implemented are:
- each record to be sorted is stored in a packed format
- consecutive records in the sort buffer are packed
- the sort buffer now holds an unknown number of records, so the fixed allocation of record pointers needs to change
- unpacking of records (in records.cc) must be adapted to the new variable-size format
- the "merge chunks" will also be variable-sized

There is actually very little new code to be written. The main work is to adapt existing code to the new record format, and the new sort buffer format:

/**
  There are two record formats for sorting:
    |<key a><key b>...    |<rowid>|
    / sort_length         / ref_l /

  or with "addon fields"
    |<key a><key b>...    |<null bits>|<field a><field b>...|
    / sort_length         /         addon_length           /

  The packed format for "addon fields"
    |<key a><key b>...    |<length>|<null bits>|<field a><field b>...|
    / sort_length         /         addon_length                    /

  <key>        Fields are fixed-size, specially encoded with
               Field::make_sort_key() so we can do byte-by-byte compare.
  <length>     Contains the *actual* packed length (after packing) of
               everything after the sort keys. The size of the length
               field is 2 bytes, which should cover most use cases:
               addon data <= 65535 bytes. This is the same as the max
               record size in MySQL.
  <null bits>  One bit for each nullable field, indicating whether the
               field is null or not. May have size zero if no fields
               are nullable.
  <field xx>   Are stored with field->pack(), and retrieved with
               field->unpack(). Addon fields within a record are stored
               consecutively, with no "holes" or padding. They will
               have zero size for NULL values.
*/

/**
  A wrapper class around the buffer used by filesort().
  The sort buffer is a contiguous chunk of memory, containing both
  records to be sorted, and pointers to said records:

  |rec 0|record 1  |rec 2|  ............  |ptr to rec2|ptr to rec1|ptr to rec0|

  Records will be inserted "left-to-right". Records are not necessarily
  fixed-size, they can be packed, and will be stored without any "gaps".
  Record pointers will be inserted "right-to-left", as a side-effect of
  inserting the actual records.
  We wrap the buffer in order to be able to do lazy initialization of
  the pointers: the buffer is often much larger than what we actually
  need. With this allocation scheme, and lazy initialization of the
  pointers, we are able to pack variable-sized records in the buffer,
  and thus possibly have space for more records than we initially
  estimated.
*/

The new, dynamic record format should *not* be used if no packing is possible, i.e. if all addon fields are non-nullable, fixed-size Fields. Note that char(n) is packed the same way as varchar(n), and can be packed.

The size of the sort buffer is still computed as max_keys_per_buffer * max_size_of_each_record + space_for_record_pointers. The actual length of each record is computed by make_sortkey() and stored within the record, as indicated above. This means that find_all_keys() may be able to squeeze more records into the buffer than max_keys_per_buffer.

We move the responsibility for memory management of the sort buffer from find_all_keys() to the sort buffer itself. find_all_keys() asks the sort buffer for a fixed-size buffer, one at a time for each record. make_sortkey() generates the key and addon fields, and returns the actual length used. find_all_keys() then tells the sort buffer how much space was actually used.

As before: if the entire sorted result fits in memory, save_index() copies data into a new buffer, which is unpacked by rr_unpack_from_buffer().

With larger data sets: merge_buffers() must store fields and records consecutively in the temporary files. This means that the "merge chunks" are no longer fixed-size, but depend on the actual data. The existing BUFFPEK struct is no longer appropriate, and is replaced by a new class Merge_chunk which can handle variable-sized records. rr_unpack_from_file() needs to be adapted to the variable-sized record format.

The new Merge_chunk structure looks like this:

/**
  Descriptor for a merge chunk to be sort-merged.
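As an illustration of the packed record layout, here is a toy packer. It is deliberately simplified: all fields are assumed non-nullable (so the null-bits section is empty), each field gets a one-byte length prefix, and the record-level length field is 2 bytes little-endian and counts itself. The real code uses Field::make_sort_key(), field->pack() and field->unpack():

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy packer (not server code) for the layout
//   |<sort keys>|<length>|<field a><field b>...|
// with null bits omitted (all fields assumed non-nullable). The 2-byte
// length field counts everything after the sort keys, itself included.
std::vector<std::uint8_t> pack_record(
    const std::vector<std::uint8_t> &sort_key,
    const std::vector<std::string> &fields) {
  std::vector<std::uint8_t> rec(sort_key);  // fixed-size sort key first
  const std::size_t len_pos = rec.size();
  rec.push_back(0);  // placeholder for the 2-byte
  rec.push_back(0);  // (little-endian) length field
  for (const std::string &f : fields) {
    rec.push_back(static_cast<std::uint8_t>(f.size()));  // toy length prefix
    rec.insert(rec.end(), f.begin(), f.end());           // actual bytes only
  }
  const std::uint16_t addon_len =
      static_cast<std::uint16_t>(rec.size() - len_pos);
  rec[len_pos] = static_cast<std::uint8_t>(addon_len & 0xff);
  rec[len_pos + 1] = static_cast<std::uint8_t>(addon_len >> 8);
  return rec;
}
```

For a 4-byte sort key and the fields "abc" and "x", the record is 12 bytes: 4 key bytes, 2 length bytes, then 1+3 and 1+1 bytes for the two fields; the stored addon length is 8, however long the columns' declared maximum lengths are.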
  It is a POD because
  - we need offsetof() for the merge done using a priority QUEUE, and
  - we read/write them from/to files.

  We have accessors (getters/setters) for the following members:

  m_file_position  Current position in the (temporary) file to be sorted.
  m_buffer_start   Start of main-memory buffer for this chunk.
  m_buffer_end     End of main-memory buffer for this chunk.
  m_current_key    The current key for this chunk.
  m_rowcount       Number of unread rows in this chunk.
  m_mem_count      Number of rows in the main-memory buffer.
  m_max_keys       If we have fixed-size rows: max number of rows in buffer.
*/

This is essentially the same as the old BUFFPEK, except that we have added the member m_buffer_end in order to support variable-sized chunks. For better code readability/maintainability, all members have been made private, with appropriate accessor functions instead.
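A minimal sketch of such a class is shown below. It is illustrative only: member types are simplified to standard C++ types, and the accessor names are assumptions, not the server's actual API. Note that it has no constructors and no default member initializers, so it remains trivial and standard-layout (a POD), as required for offsetof() and raw file I/O:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative Merge_chunk-like descriptor: all members private and
// accessed through getters/setters, with m_buffer_end added so the
// chunk's in-memory buffer can be variable-sized.
class Merge_chunk {
 public:
  std::size_t file_position() const { return m_file_position; }
  void set_file_position(std::size_t pos) { m_file_position = pos; }

  unsigned char *buffer_start() const { return m_buffer_start; }
  unsigned char *buffer_end() const { return m_buffer_end; }
  void set_buffer(unsigned char *start, unsigned char *end) {
    m_buffer_start = start;
    m_buffer_end = end;
  }

  unsigned char *current_key() const { return m_current_key; }
  void set_current_key(unsigned char *key) { m_current_key = key; }

  std::size_t rowcount() const { return m_rowcount; }
  void set_rowcount(std::size_t n) { m_rowcount = n; }
  void decrement_rowcount(std::size_t n) { m_rowcount -= n; }

  std::size_t mem_count() const { return m_mem_count; }
  void set_mem_count(std::size_t n) { m_mem_count = n; }

  std::size_t max_keys() const { return m_max_keys; }
  void set_max_keys(std::size_t n) { m_max_keys = n; }

 private:
  std::size_t m_file_position;    // current position in the temp file
  unsigned char *m_buffer_start;  // start of this chunk's memory buffer
  unsigned char *m_buffer_end;    // end of buffer: variable-sized chunks
  unsigned char *m_current_key;   // current key within this chunk
  std::size_t m_rowcount;         // unread rows remaining in the chunk
  std::size_t m_mem_count;        // rows currently in the memory buffer
  std::size_t m_max_keys;         // max rows if rows are fixed-size
};
```

Keeping the class trivially copyable also means an array of chunk descriptors can still be written to and read back from the temporary file in one memcpy-style operation, just as with the old BUFFPEK.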
Copyright (c) 2000, 2025, Oracle Corporation and/or its affiliates. All rights reserved.