WL#1509: Pack values of non-sorted fields in the sort buffer
Status: Complete
For queries with GROUP BY and ORDER BY clauses, an external file sort may be performed. How this operation is carried out for a query depends on the value of the max_length_for_sort_data system variable. If the value is greater than the total length of the sorted fields plus the total length of the fields to be retrieved, the values of all these fields are read into the sort buffer. Otherwise, only the values of the sort fields plus a reference to the row are read into the buffer. The former requires more time for sorting but guarantees fast retrieval of the result set, as the values to be returned are read sequentially from an external sort file.

Currently, the values of the additional fields appended to the values of the sort fields are not packed. As a result, a lot of space in the sort buffer and in the external sort file is wasted. For example, a varchar(255) value that is only 3 characters long takes 255 bytes. This causes a significant slowdown of the external sort procedure, as less sorted data fits in the sort buffer, and consequently more merge passes have to be performed.

Thus, values of appended fields of variable-length types should be packed. If the total length of the sorted fields and the appended packed values for a row exceeds the value of max_length_for_sort_data, a reference to the row is appended to the values of the sorted fields instead.

User Documentation
==================
http://dev.mysql.com/doc/refman/5.7/en/order-by-optimization.html
http://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-3.html
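The strategy choice described above can be sketched as follows. This is a simplified illustration only, not the server's actual code; the function and enumerator names are invented for this sketch:

```cpp
#include <cassert>
#include <cstddef>

// Hedged sketch (not server code) of the strategy choice described in
// the text: if max_length_for_sort_data exceeds the sort keys plus all
// retrieved field values, everything goes into the sort buffer;
// otherwise only the sort keys plus a row reference are stored.
enum class SortPayload { kAddonFields, kRowReference };

SortPayload choose_payload(std::size_t sort_length, std::size_t addon_length,
                           std::size_t max_length_for_sort_data) {
  if (max_length_for_sort_data > sort_length + addon_length)
    return SortPayload::kAddonFields;  // fast, sequential result retrieval
  return SortPayload::kRowReference;   // smaller records, rows re-read later
}
```

With the default max_length_for_sort_data of 1024, a row whose sort keys and retrieved fields total 300 bytes would use "addon fields", while a 2100-byte row would fall back to row references.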
Functional Requirements

F-1: In order to see whether the fields are packed or not, the user must inspect the optimizer trace. The current format is:

  "filesort_summary": {
    "rows": 100,
    "examined_rows": 100,
    "number_of_tmp_files": 0,
    "sort_buffer_size": 25080,
    "sort_mode": "<sort_key, additional_fields>"
  }

The "sort_mode" is "<sort_key, additional_fields>" when building a sorted index. This will be extended with "<sort_key, packed_additional_fields>".

Non-functional Requirements

NF-1: No performance regressions in queries that do not use filesort.
NF-2: No performance regressions in our standard performance tests (sysbench and DBT3).
NF-3: No performance regressions for filesort queries that do not use "addon fields".
NF-4: Significant performance improvement for data where records can be packed, and this packing avoids a buffer overflow.
NF-5: An equally significant performance degradation must be expected for the cases where the extra 2-byte length field results in a buffer overflow (columns are packable, but the actual data is not).
NF-6: If all data fits in the buffer, no queries should be slower.

The exact meaning of "significant" is determined by QA. I assume they use some statistics: variance, confidence intervals, ...
This worklog is a pure optimization, i.e. no new externally visible functionality will be implemented. By packing the records in the filesort buffer, we expect major performance improvements for sorting of variable-sized data (CHAR/VARCHAR). Note that all nullable fields can also be packed. The actual in-memory sorting is not changed by this WL; we keep the same fixed-size, byte-by-byte comparable keys. The packing is not applicable if filesort decides to use a priority queue for sorting (the "order by with limit" optimization).

For background, see the reference manual, chapter 11.2 "How MySQL Does Sorting (filesort)":
http://dev.mysql.com/doc/internals/en/filesort.html

The packing is only applicable if filesort decides to use "addon fields" (the modified algorithm in the reference manual), rather than "record references" (the original algorithm).

The actual performance benefit depends on the data to be sorted. If the data contains many strings that are shorter than their maximum length, or many NULL values for nullable columns, we may potentially gain a lot, both for main-memory sorting (more records fit in the buffer) and disk-based sorting (fewer write-read-merge-write cycles). For pathological cases we may actually see a performance degradation: if all strings have maximum length, and/or there are no NULL values, the records to be sorted must still be extended with an "actual length" field, which means we might have space for fewer records in main memory, and must resort to disk-based sort-merge.

"Addon fields" are only used if the maximum record size is smaller than max_length_for_sort_data. The default value for this variable is 1024, and it has been unchanged since it was introduced 10 years ago. We *could* increase this value: we are now able to compress records, and most notably, people have a lot more RAM these days than 10 years ago. For this WL, we have decided not to alter max_length_for_sort_data.

We could make the new feature user-controllable (e.g. optimizer_switch_pack_filesort).
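To make the space trade-off concrete, here is some toy arithmetic, not server code, contrasting fixed-size and packed storage of an addon field. The 2-byte overhead shown is the record-level length field mentioned in the requirements; the per-field encoding used by the server (field->pack()) differs in detail:

```cpp
#include <cassert>
#include <cstddef>

// Toy arithmetic (not server code) for the varchar(255) example in the
// text. Unpacked, an addon field always occupies its declared maximum
// length; packed, it costs roughly the actual data plus a small length
// overhead (assumed here to be 2 bytes, for illustration).
std::size_t unpacked_addon_size(std::size_t max_field_length) {
  return max_field_length;  // always the declared maximum
}

std::size_t packed_addon_size(std::size_t actual_length) {
  return 2 + actual_length;  // assumed 2-byte length overhead + actual bytes
}
```

A 3-character value in a varchar(255) column drops from 255 to about 5 bytes; a full 255-character value grows to about 257 bytes, which is the pathological degradation case of NF-5.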
We have decided not to do this.

The patch for Bug#12856915 VALGRIND FAILURE IN FILESORT/CREATE_SORT_INDEX introduced valgrind suppressions for writing sort buffers to disk. This worklog should obviate the need for such suppressions.

FUTURE ENHANCEMENTS, *NOT* COVERED BY THIS WORKLOG:

If "addon fields" are used, there is data duplication: the fields in the ORDER BY list are stored twice, first as byte-by-byte comparable fields used for sorting, then in ordinary field format in the "addon fields". We could save some space by skipping them in the "addon fields", but we would have to undo the effects of Field::make_sort_key() when reading the sorted result.

For some queries we could also pack the sort keys. If the ORDER BY list has a trailing char/varchar column, we could do

  memcmp(key1, key2, min(len(key1), len(key2)))

Benchmarking shows that this speeds up sorting if the actual key size is smaller than 90% of the max key size. (Results measured on Intel, where reading of unaligned integers is very fast.)
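The suggested comparison for variable-length sort keys could look roughly like this. This is a sketch of the idea only; the worklog itself does not implement it:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Sketch of the *future* enhancement above: byte-wise comparison of two
// packed sort keys ending in a variable-length char/varchar part.
// Compares the common prefix; if the prefixes are equal, the shorter
// key sorts first.
int compare_packed_keys(const unsigned char *key1, std::size_t len1,
                        const unsigned char *key2, std::size_t len2) {
  const int cmp = std::memcmp(key1, key2, std::min(len1, len2));
  if (cmp != 0) return cmp;
  if (len1 < len2) return -1;
  return len1 > len2 ? 1 : 0;
}
```

This preserves the byte-by-byte comparability of the keys while avoiding the trailing space/zero padding that fixed-size keys require.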
The major changes to be implemented are:
- each record to be sorted is stored in a packed format
- consecutive records in the sort buffer are packed
- the sort buffer now holds an unknown number of records, so the fixed allocation of record pointers needs to change
- unpacking of records (in records.cc) must be adapted to the new variable-size format
- the "merge chunks" will also be variable-sized

There is actually very little new code to be written. The main work is to adapt existing code to the new record format, and the new sort buffer format:

/**
  There are two record formats for sorting:
    |<key a><key b>...    |<rowid>|
    / sort_length         / ref_l /

  or with "addon fields"
    |<key a><key b>...    |<null bits>|<field a><field b>...|
    / sort_length         /         addon_length           /

  The packed format for "addon fields"
    |<key a><key b>...    |<length>|<null bits>|<field a><field b>...|
    / sort_length         /         addon_length                    /

  <key>        Fields are fixed-size, specially encoded with
               Field::make_sort_key() so we can do byte-by-byte compare.
  <length>     Contains the *actual* packed length (after packing) of
               everything after the sort keys. The size of the length
               field is 2 bytes, which should cover most use cases:
               addon data <= 65535 bytes. This is the same as the max
               record size in MySQL.
  <null bits>  One bit for each nullable field, indicating whether the
               field is null or not. May have size zero if no fields
               are nullable.
  <field xx>   Are stored with field->pack(), and retrieved with
               field->unpack(). Addon fields within a record are stored
               consecutively, with no "holes" or padding. They will
               have zero size for NULL values.
*/

/**
  A wrapper class around the buffer used by filesort().
  The sort buffer is a contiguous chunk of memory, containing both
  records to be sorted, and pointers to said records:

  |rec 0|record 1  |rec 2|  ............  |ptr to rec2|ptr to rec1|ptr to rec0|

  Records will be inserted "left-to-right". Records are not necessarily
  fixed-size, they can be packed, and will be stored without any "gaps".
  Record pointers will be inserted "right-to-left", as a side-effect of
  inserting the actual records.
  We wrap the buffer in order to be able to do lazy initialization of
  the pointers: the buffer is often much larger than what we actually
  need. With this allocation scheme, and lazy initialization of the
  pointers, we are able to pack variable-sized records in the buffer,
  and thus possibly have space for more records than we initially
  estimated.
*/

The new, dynamic record format should *not* be used if no packing is possible, i.e. if all addon fields are non-nullable, fixed-size Fields. Note that char(n) is packed the same way as varchar(n), and can be packed.

The size of the sort buffer is still computed as max_keys_per_buffer * max_size_of_each_record + space_for_record_pointers. The actual length of each record is computed by make_sortkey() and stored within the record, as indicated above. This means that find_all_keys() may be able to squeeze more records into the buffer than max_keys_per_buffer.

We move the responsibility for memory management of the sort buffer from find_all_keys() to the sort buffer itself. find_all_keys() asks the sort buffer for a fixed-size buffer, one at a time for each record. make_sortkey() generates the key and addon fields, and returns the actual length used. find_all_keys() then tells the sort buffer how much space was actually used.

As before: if the entire sorted result fits in memory, save_index() copies data into a new buffer, which is unpacked by rr_unpack_from_buffer().

With larger data sets: merge_buffers() must store fields and records consecutively in the temporary files. This means that the "merge chunks" are no longer fixed-size, but depend on the actual data. The existing BUFFPEK struct is no longer appropriate, and is replaced by a new class Merge_chunk which can handle variable-sized records. rr_unpack_from_file() needs to be adapted to the variable-sized record format.

The new Merge_chunk structure looks like this:

/**
  Descriptor for a merge chunk to be sort-merged.
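As an illustration of the packed record layout, here is a toy packer. It is deliberately simplified: all fields are assumed non-nullable (so the null-bits section is empty), each field gets a one-byte length prefix, and the record-level length field is 2 bytes little-endian and counts itself. The real code uses Field::make_sort_key(), field->pack() and field->unpack():

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy packer (not server code) for the layout
//   |<sort keys>|<length>|<field a><field b>...|
// with null bits omitted (all fields assumed non-nullable). The 2-byte
// length field counts everything after the sort keys, itself included.
std::vector<std::uint8_t> pack_record(
    const std::vector<std::uint8_t> &sort_key,
    const std::vector<std::string> &fields) {
  std::vector<std::uint8_t> rec(sort_key);  // fixed-size sort key first
  const std::size_t len_pos = rec.size();
  rec.push_back(0);  // placeholder for the 2-byte
  rec.push_back(0);  // (little-endian) length field
  for (const std::string &f : fields) {
    rec.push_back(static_cast<std::uint8_t>(f.size()));  // toy length prefix
    rec.insert(rec.end(), f.begin(), f.end());           // actual bytes only
  }
  const std::uint16_t addon_len =
      static_cast<std::uint16_t>(rec.size() - len_pos);
  rec[len_pos] = static_cast<std::uint8_t>(addon_len & 0xff);
  rec[len_pos + 1] = static_cast<std::uint8_t>(addon_len >> 8);
  return rec;
}
```

For a 4-byte sort key and the fields "abc" and "x", the record is 12 bytes: 4 key bytes, 2 length bytes, then 1+3 and 1+1 bytes for the two fields; the stored addon length is 8, however long the columns' declared maximum lengths are.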
  It is a POD because
  - we need offsetof() for the merge done using a priority QUEUE, and
  - we read/write them from/to files.

  We have accessors (getters/setters) for the following members:

  m_file_position  Current position in the (temporary) file to be sorted.
  m_buffer_start   Start of main-memory buffer for this chunk.
  m_buffer_end     End of main-memory buffer for this chunk.
  m_current_key    The current key for this chunk.
  m_rowcount       Number of unread rows in this chunk.
  m_mem_count      Number of rows in the main-memory buffer.
  m_max_keys       If we have fixed-size rows: max number of rows in buffer.
*/

This is essentially the same as the old BUFFPEK, except that we have added the member m_buffer_end in order to support variable-sized chunks. For better code readability/maintainability, all members have been made private, with appropriate accessor functions instead.
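A minimal sketch of such a class is shown below. It is illustrative only: member types are simplified to standard C++ types, and the accessor names are assumptions, not the server's actual API. Note that it has no constructors and no default member initializers, so it remains trivial and standard-layout (a POD), as required for offsetof() and raw file I/O:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative Merge_chunk-like descriptor: all members private and
// accessed through getters/setters, with m_buffer_end added so the
// chunk's in-memory buffer can be variable-sized.
class Merge_chunk {
 public:
  std::size_t file_position() const { return m_file_position; }
  void set_file_position(std::size_t pos) { m_file_position = pos; }

  unsigned char *buffer_start() const { return m_buffer_start; }
  unsigned char *buffer_end() const { return m_buffer_end; }
  void set_buffer(unsigned char *start, unsigned char *end) {
    m_buffer_start = start;
    m_buffer_end = end;
  }

  unsigned char *current_key() const { return m_current_key; }
  void set_current_key(unsigned char *key) { m_current_key = key; }

  std::size_t rowcount() const { return m_rowcount; }
  void set_rowcount(std::size_t n) { m_rowcount = n; }
  void decrement_rowcount(std::size_t n) { m_rowcount -= n; }

  std::size_t mem_count() const { return m_mem_count; }
  void set_mem_count(std::size_t n) { m_mem_count = n; }

  std::size_t max_keys() const { return m_max_keys; }
  void set_max_keys(std::size_t n) { m_max_keys = n; }

 private:
  std::size_t m_file_position;    // current position in the temp file
  unsigned char *m_buffer_start;  // start of this chunk's memory buffer
  unsigned char *m_buffer_end;    // end of buffer: variable-sized chunks
  unsigned char *m_current_key;   // current key within this chunk
  std::size_t m_rowcount;         // unread rows remaining in the chunk
  std::size_t m_mem_count;        // rows currently in the memory buffer
  std::size_t m_max_keys;         // max rows if rows are fixed-size
};
```

Keeping the class trivially copyable also means an array of chunk descriptors can still be written to and read back from the temporary file in one memcpy-style operation, just as with the old BUFFPEK.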
Copyright (c) 2000, 2025, Oracle Corporation and/or its affiliates. All rights reserved.