WL#11590: More flexible filesort
Affects: Server-8.0
Status: Complete
When an index cannot be used to satisfy an ORDER BY clause, filesort algorithms are used to produce ordered results. This WL makes filesort more flexible and less oriented around fixed-size rows by:

* Allocating buffers incrementally instead of allocating the maximum possible size up-front.
* Allowing a sort to start even though merging could fail in the worst case, as the worst case is unlikely in practice.
* Generating sort keys optimistically, i.e., trying to write the sort key to the buffers even if it would not fit in the worst case (starting new buffers only if the write fails).

The last point also enables:

* Improving correctness by obsoleting max_sort_length for modern (NO PAD) collations.

The primary user-visible benefit of this worklog is that a user can freely set the default sort buffer size much higher than today, without having to worry about excessive memory usage for small sorts. This means that large ORDER BY and GROUP BY queries will become much faster, while still keeping small such queries at low memory usage (which also makes them faster, since allocating memory takes time).
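As a rough illustration of the incremental allocation and optimistic key writing described above, consider the following minimal C++ sketch. None of these names (VarlenSortBuffer, KeyMaker, try_write_key) are the actual server classes, and the real code is considerably more involved; the point is only the shape of the algorithm: blocks are allocated on demand rather than at the maximum possible size, and a sort key is written into whatever space happens to be left, with a new block started only if the write actually fails.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Writes one sort key into [dst, dst + cap); returns the number of
// bytes written, or std::nullopt if cap was too small, i.e., the
// optimistic write failed. (Hypothetical interface.)
using KeyMaker =
    std::function<std::optional<size_t>(unsigned char *dst, size_t cap)>;

class VarlenSortBuffer {
 public:
  explicit VarlenSortBuffer(size_t max_total) : max_total_(max_total) {}

  // Returns false when the memory budget is exhausted; the caller would
  // then sort what it has, flush the run to disk, and retry the row.
  bool try_write_key(const KeyMaker &make_key) {
    for (;;) {
      if (!blocks_.empty()) {
        unsigned char *dst = blocks_.back().data() + used_;
        size_t cap = blocks_.back().size() - used_;
        // Optimistic: attempt the write with whatever space is left,
        // instead of first checking against a worst-case key size.
        if (std::optional<size_t> n = make_key(dst, cap)) {
          used_ += *n;
          return true;
        }
      }
      // The write did not fit (or no block exists yet): start a new
      // block, doubling sizes instead of allocating the maximum
      // possible size up-front.
      size_t wanted = std::min(next_size_, max_total_ - allocated_);
      if (wanted == 0) return false;  // budget (think sort_buffer_size) hit
      blocks_.emplace_back(wanted);
      allocated_ += wanted;
      used_ = 0;
      next_size_ *= 2;
    }
  }

 private:
  std::vector<std::vector<unsigned char>> blocks_;  // grown on demand
  size_t used_ = 0;               // bytes used in the newest block
  size_t allocated_ = 0;          // total bytes allocated so far
  size_t next_size_ = 32 * 1024;  // first block is small
  size_t max_total_;              // hard cap on total allocation
};

A side benefit of keeping the data in separate blocks rather than reallocating one big buffer is that already-written sort keys stay at stable addresses, so pointers to them remain valid for the duration of the sort.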
Functional requirements

F1 All values should sort correctly (i.e., as before, except ignoring max_sort_length). String values of PAD SPACE collations are exempt from this requirement.

F2 It should be possible to sort short values in long fields (e.g. VARCHAR(1000) COLLATE utf8mb4_0900_ai_ci) with small sort buffers without getting an error.

Non-functional requirements

NF1 Sort speed should be substantially unchanged (within ~1%) or better than with the existing code when the sort buffer is correctly sized (e.g. sorting 30 kB of data in a 32 kB sort buffer). Windows is exempt from this requirement.

NF2 Sort speed should be better than with the existing code when the sort buffer is too large (e.g. sorting 30 kB of data in a 2 MB sort buffer). Windows is exempt from this requirement.

NF3 Memory usage when sorting should be proportional to the data being sorted, except when the data is larger than the sort buffer; in that case, memory usage should be proportional to the sort buffer.
Filesort was traditionally very much oriented around fixed-size rows. Recent improvements have increased flexibility, but a fair amount of pessimization remains:

* Filesort buffers are always allocated in one go, at the largest possible size (the maximum specified by the administrator). A more flexible buffer allocation strategy would mean that an administrator could safely set a higher maximum buffer size, without having to worry that sorting 100 small rows would allocate 32 MB of RAM (which is both wasteful and takes time).

* Several pessimistic assumptions are made during initialization, causing _any_ sort to fail if there is not room for at least 15 maximum-size rows in the sort buffer (we merge up to 15 disk files in one go, and need to keep at least one row for each in memory at any given time). Given that newer collations have huge maximum sizes but almost never produce such rows in practice (e.g. VARCHAR(1000) COLLATE utf8mb4_ja_0900_as_cs can in theory produce over 64 kB per row, but in reality, most typical rows will be around 100–200 bytes), one has to specify large sort buffers for the sort to start at all, even when this is completely unneeded.

* As a band-aid for the previous point, there is a variable called max_sort_length (default 1024 bytes), which truncates longer sort keys as needed. However, this silently and unpredictably causes wrong sorting; its only remaining purpose is to cap the worst case so that sorting of long values can go through at all. In particular, blobs, geometry objects and JSON can produce effectively unbounded sort keys, but rarely do so in practical sorts.

This worklog aims to remove all these pessimizations and move to a true variable-length sort:

* Filesort buffers should be allocated incrementally as needed; if you only have 64 kB of data, you should not allocate 1 MB.

* The check for running out of room in the merge phase (needing 15 rows) should be delayed to runtime, i.e., the sort should only fail if we actually _do_ run out of room, as opposed to failing whenever there exists any conceivable case where we could (see the sketch at the end of this section).

* max_sort_length should become a no-op; all values should be correctly and deterministically sorted, even long ones. If the user really wants to sort on a substring (e.g. the first 1024 bytes of a blob), they can do so themselves by sorting on an expression. We can remove the variable later, and/or add a deprecation warning when it is set.

There is one exception to the last point: as long as we need to serialize everything into sort keys (i.e., we do not support collation-aware comparison in the sort/merge stage itself), long string values (typically CHAR/VARCHAR/BLOB) with PAD SPACE collations (i.e., all the pre-8.0 collations) will still need to be affected by max_sort_length, as they all need to be padded to the same width, and that width could become potentially infinite, particularly for blobs.

Note that as a side effect of (mostly) removing max_sort_length, top-N through the priority queue will apply in somewhat fewer situations, as our current PQ implementation relies on fixed-length rows.
In particular, an ORDER BY with LIMIT on such values could previously be done using the priority queue (as the sort keys would be truncated to 1024 bytes, and thus bounded), but now needs to go through a regular filesort. One could implement top-N through the PQ with a fallback to filesort if the values exceed the maximum size, but that is outside the scope of this worklog.
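To make the delayed merge-phase check concrete, here is a minimal sketch under the same caveats as before: every name (Run, read_next_row, merge_pass) is hypothetical, not the server's API. Each of up to 15 runs keeps its current row in a shared buffer; space is carved out for the rows actually read, and the "out of room" error is raised only on a real overflow, instead of at initialization based on 15 worst-case rows.

#include <algorithm>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Stand-in for one sorted run spilled to disk (in-memory here, purely
// to keep the sketch self-contained).
struct Run {
  std::vector<std::vector<unsigned char>> rows;
  size_t next = 0;

  // Reads the run's next row into [buf, buf + cap). Returns the row's
  // actual size, std::nullopt if cap bytes were not enough, or 0 at
  // end of run (rows are assumed non-empty).
  std::optional<size_t> read_next_row(unsigned char *buf, size_t cap) {
    if (next == rows.size()) return 0;
    const std::vector<unsigned char> &row = rows[next];
    if (row.size() > cap) return std::nullopt;  // does not fit right now
    std::copy(row.begin(), row.end(), buf);
    ++next;
    return row.size();
  }
};

// One merge pass over up to 15 runs (MERGEBUFF in the server), sharing
// a single buffer. Returns false only if the rows we actually read do
// not fit -- the runtime counterpart of the old up-front requirement
// that 15 worst-case rows must fit before the sort may even start.
bool merge_pass(std::vector<Run> &runs, std::vector<unsigned char> &buffer) {
  size_t used = 0;
  // Current head row of each run: (pointer into buffer, length).
  std::vector<std::pair<const unsigned char *, size_t>> heads;
  for (Run &run : runs) {
    // Optimistic: offer the run whatever space is left, rather than a
    // worst-case-sized slot reserved during initialization.
    std::optional<size_t> n =
        run.read_next_row(buffer.data() + used, buffer.size() - used);
    if (!n) return false;   // genuinely out of room: fail only now
    if (*n == 0) continue;  // run already exhausted
    heads.emplace_back(buffer.data() + used, *n);
    used += *n;
  }
  // ... repeatedly emit the smallest head and refill that run's slot
  // (slot reuse and compaction elided; the point is the check above).
  return true;
}

Under this scheme a merge can still fail, but only when the actual rows exhaust the buffer, not when some hypothetical maximum-size row would; this is the behavior requirement F2 asks for.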