WL#11590: More flexible filesort
Affects: Server-8.0
Status: Complete
When an index cannot be used to satisfy an ORDER BY clause, filesort algorithms are used to produce ordered results. This WL makes filesort more flexible and less oriented around fixed-size rows by:

* Allocating buffers incrementally instead of allocating the maximum possible size up-front.
* Allowing a sort to start even though merging could fail in the worst case, as the worst case is unlikely in practice.
* Generating sort keys optimistically, i.e., trying to write the sort key to the buffers even if it would not fit in the worst case (starting new buffers only if the write fails).

The last point also enables:

* Improving correctness by obsoleting max_sort_length for modern (NO PAD) collations.

The primary user-visible benefit of this worklog is that a user can freely set the default sort buffer size much higher than today, without having to worry about excessive memory usage for small sorts. This means that large ORDER BY and GROUP BY queries will become much faster, while still keeping small such queries at low memory usage (which also makes them faster, since allocating memory takes time).
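As a rough illustration of the incremental allocation and optimistic key writing described above, consider the following minimal C++ sketch. None of these names (VarlenSortBuffer, KeyMaker, try_write_key) are the actual server classes, and the real code is considerably more involved; the point is only the shape of the algorithm: blocks are allocated on demand rather than at the maximum possible size, and a sort key is written into whatever space happens to be left, with a new block started only if the write actually fails.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Writes one sort key into [dst, dst + cap); returns the number of
// bytes written, or std::nullopt if cap was too small, i.e., the
// optimistic write failed. (Hypothetical interface.)
using KeyMaker =
    std::function<std::optional<size_t>(unsigned char *dst, size_t cap)>;

class VarlenSortBuffer {
 public:
  explicit VarlenSortBuffer(size_t max_total) : max_total_(max_total) {}

  // Returns false when the memory budget is exhausted; the caller would
  // then sort what it has, flush the run to disk, and retry the row.
  bool try_write_key(const KeyMaker &make_key) {
    for (;;) {
      if (!blocks_.empty()) {
        unsigned char *dst = blocks_.back().data() + used_;
        size_t cap = blocks_.back().size() - used_;
        // Optimistic: attempt the write with whatever space is left,
        // instead of first checking against a worst-case key size.
        if (std::optional<size_t> n = make_key(dst, cap)) {
          used_ += *n;
          return true;
        }
      }
      // The write did not fit (or no block exists yet): start a new
      // block, doubling sizes instead of allocating the maximum
      // possible size up-front.
      size_t wanted = std::min(next_size_, max_total_ - allocated_);
      if (wanted == 0) return false;  // budget (think sort_buffer_size) hit
      blocks_.emplace_back(wanted);
      allocated_ += wanted;
      used_ = 0;
      next_size_ *= 2;
    }
  }

 private:
  std::vector<std::vector<unsigned char>> blocks_;  // grown on demand
  size_t used_ = 0;               // bytes used in the newest block
  size_t allocated_ = 0;          // total bytes allocated so far
  size_t next_size_ = 32 * 1024;  // first block is small
  size_t max_total_;              // hard cap on total allocation
};

A side benefit of keeping the data in separate blocks rather than reallocating one big buffer is that already-written sort keys stay at stable addresses, so pointers to them remain valid for the duration of the sort.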
Functional requirements

F1 All values should sort correctly (i.e., as before, except ignoring max_sort_length). String values of PAD SPACE collations are exempt from this requirement.

F2 It should be possible to sort short values in long fields (e.g. VARCHAR(1000) COLLATE utf8mb4_0900_ai_ci) with small sort buffers without getting an error.

Non-functional requirements

NF1 Sort speed should be substantially unchanged (within ~1%) or better than with the existing code when the sort buffer is correctly sized (e.g. sorting 30 kB of data in a 32 kB sort buffer). Windows is exempt from this requirement.

NF2 Sort speed should be better than with the existing code when the sort buffer is too large (e.g. sorting 30 kB of data in a 2 MB sort buffer). Windows is exempt from this requirement.

NF3 Memory usage when sorting should be proportional to the data being sorted, except when the data is larger than the sort buffer; in that case, memory usage should be proportional to the sort buffer.
Filesort was traditionally very much oriented around fixed-size rows. Recent improvements have increased flexibility, but a fair amount of pessimization remains:

* Filesort buffers are always allocated in one go, at the largest possible size (the maximum specified by the administrator). A more flexible buffer allocation strategy would mean that an administrator could safely set a higher maximum buffer size, without having to worry that sorting 100 small rows would allocate 32 MB of RAM (which is both wasteful and takes time).

* Several pessimistic assumptions are made during initialization, causing _any_ sort to fail if there is not room for at least 15 maximum-size rows in the sort buffer (we merge up to 15 disk files in one go, and need to keep at least one row for each in memory at any given time). Given that newer collations have huge maximum sizes but almost never produce such rows in practice (e.g. VARCHAR(1000) COLLATE utf8mb4_ja_0900_as_cs can in theory produce over 64 kB per row, but in reality, most typical rows will be around 100–200 bytes), one has to specify large sort buffers for the sort to start at all, even when this is completely unneeded.

* As a band-aid for the previous point, there is a variable called max_sort_length (default 1024 bytes), which truncates longer sort keys as needed. However, this silently and unpredictably causes wrong sorting; its only remaining purpose is to cap the worst case so that sorting of long values can go through at all. In particular, blobs, geometry objects and JSON can produce effectively unbounded sort keys, but rarely do so in practical sorts.

This worklog aims to remove all these pessimizations and move to a true variable-length sort:

* Filesort buffers should be allocated incrementally as needed; if you only have 64 kB of data, you should not allocate 1 MB.

* The check for running out of room in the merge phase (needing 15 rows) should be delayed to runtime, i.e., the sort should only fail if we actually _do_ run out of room, as opposed to failing whenever there exists any conceivable case where we could (see the sketch at the end of this section).

* max_sort_length should become a no-op; all values should be correctly and deterministically sorted, even long ones. If the user really wants to sort on a substring (e.g. the first 1024 bytes of a blob), they can do so themselves by sorting on an expression. We can remove the variable later, and/or add a deprecation warning when it is set.

There is one exception to the last point: as long as we need to serialize everything into sort keys (i.e., we do not support collation-aware comparison in the sort/merge stage itself), long string values (typically CHAR/VARCHAR/BLOB) with PAD SPACE collations (i.e., all the pre-8.0 collations) will still need to be affected by max_sort_length, as they all need to be padded to the same width, and that width could become potentially infinite, particularly for blobs.

Note that as a side effect of (mostly) removing max_sort_length, top-N through the priority queue will apply in somewhat fewer situations, as our current PQ implementation relies on fixed-length rows.
In particular, an ORDER BY with LIMIT on such values could previously be done using the priority queue (as the sort keys would be truncated to 1024 bytes, and thus bounded), but now needs to go through a regular filesort. One could implement top-N through the PQ with a fallback to filesort if the values exceed the maximum size, but that is outside the scope of this worklog.
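To make the delayed merge-phase check concrete, here is a minimal sketch under the same caveats as before: every name (Run, read_next_row, merge_pass) is hypothetical, not the server's API. Each of up to 15 runs keeps its current row in a shared buffer; space is carved out for the rows actually read, and the "out of room" error is raised only on a real overflow, instead of at initialization based on 15 worst-case rows.

#include <algorithm>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Stand-in for one sorted run spilled to disk (in-memory here, purely
// to keep the sketch self-contained).
struct Run {
  std::vector<std::vector<unsigned char>> rows;
  size_t next = 0;

  // Reads the run's next row into [buf, buf + cap). Returns the row's
  // actual size, std::nullopt if cap bytes were not enough, or 0 at
  // end of run (rows are assumed non-empty).
  std::optional<size_t> read_next_row(unsigned char *buf, size_t cap) {
    if (next == rows.size()) return 0;
    const std::vector<unsigned char> &row = rows[next];
    if (row.size() > cap) return std::nullopt;  // does not fit right now
    std::copy(row.begin(), row.end(), buf);
    ++next;
    return row.size();
  }
};

// One merge pass over up to 15 runs (MERGEBUFF in the server), sharing
// a single buffer. Returns false only if the rows we actually read do
// not fit -- the runtime counterpart of the old up-front requirement
// that 15 worst-case rows must fit before the sort may even start.
bool merge_pass(std::vector<Run> &runs, std::vector<unsigned char> &buffer) {
  size_t used = 0;
  // Current head row of each run: (pointer into buffer, length).
  std::vector<std::pair<const unsigned char *, size_t>> heads;
  for (Run &run : runs) {
    // Optimistic: offer the run whatever space is left, rather than a
    // worst-case-sized slot reserved during initialization.
    std::optional<size_t> n =
        run.read_next_row(buffer.data() + used, buffer.size() - used);
    if (!n) return false;   // genuinely out of room: fail only now
    if (*n == 0) continue;  // run already exhausted
    heads.emplace_back(buffer.data() + used, *n);
    used += *n;
  }
  // ... repeatedly emit the smallest head and refill that run's slot
  // (slot reuse and compaction elided; the point is the check above).
  return true;
}

Under this scheme a merge can still fail, but only when the actual rows exhaust the buffer, not when some hypothetical maximum-size row would; this is the behavior requirement F2 asks for.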