WL#9554: Varlen keys for sorting multibyte String values

Affects: Server-8.0 — Status: Complete

Description
===========

Unicode collations are sorted by means of variable-length weight strings. In the
worst case, these weight strings can become very long. For instance,
utf8mb4_0900_as_cs has a strnxfrm_multiply of 24, meaning that every byte in the
input string can potentially expand to 24 bytes of weight. In other words, a
VARCHAR(100), which can hold up to 400 bytes, gets 9600 bytes allocated for
filesort, even if it just contains a simple 'a' (which is six bytes of weight
plus some level separators). The default setting of max_sort_length=1024 masks
this problem somewhat by truncating the weight strings, at the cost of incorrect
and unpredictable sorting for strings that actually need long weight strings.
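
The arithmetic above can be sketched as follows. This is only an illustration of
the numbers quoted in the text, not server code: the function names are
hypothetical, and treating max_sort_length as a simple cap on the key size is a
simplifying assumption.

```python
def worst_case_sort_key_bytes(max_chars, max_bytes_per_char, strnxfrm_multiply):
    """Upper bound on the weight-string size for a fixed-length sort key.

    Hypothetical helper: multiplies the maximum byte length of the column
    by the collation's strnxfrm_multiply factor.
    """
    return max_chars * max_bytes_per_char * strnxfrm_multiply


def capped_sort_key_bytes(worst_case, max_sort_length):
    """Key size after truncation to max_sort_length (simplified assumption)."""
    return min(worst_case, max_sort_length)


# VARCHAR(100) in utf8mb4 (up to 4 bytes per character),
# collation utf8mb4_0900_as_cs (strnxfrm_multiply = 24).
worst = worst_case_sort_key_bytes(100, 4, 24)
capped = capped_sort_key_bytes(worst, 1024)

print(worst)   # 9600
print(capped)  # 1024
```

With the default cap, every row still reserves 1024 bytes for this column, and
any weight string longer than that is cut off before comparison.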

This WL aims to introduce variable-length keys when sorting NO PAD collations.
(PAD SPACE collations still need fixed-length keys, because their strings are
conceptually extended to infinity with spaces.) It builds on the existing
semantics for sorting JSON using variable-length keys. It doesn't try to replace
strnxfrm with strnncollsp, which would also be an interesting avenue, but can
happen in a later WL.
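
As a rough illustration of why variable-length keys help, the sketch below
compares fixed-length and variable-length sort keys for a few short names. It is
a toy model under stated assumptions: toy_weights is a hypothetical stand-in for
strnxfrm (two nonzero bytes per character, Basic Multilingual Plane only), zero
bytes stand in for key padding, and nothing here reflects the server's actual
key layout.

```python
def toy_weights(s):
    """Hypothetical weight function: 2 big-endian bytes per character.

    Assumes every character has a code point between 1 and 65535, so no
    weight is all zeros and zero bytes can safely be used as padding.
    """
    return b"".join(ord(ch).to_bytes(2, "big") for ch in s)


def fixed_length_key(s, max_chars):
    """Weight string padded with zero bytes to the worst-case length."""
    return toy_weights(s).ljust(2 * max_chars, b"\x00")


def variable_length_key(s):
    """Weight string stored at its actual length, with no padding."""
    return toy_weights(s)


MAX_CHARS = 100  # models a VARCHAR(100) column
names = ["bo", "al", "alice", "a"]

sorted_fixed = sorted(names, key=lambda n: fixed_length_key(n, MAX_CHARS))
sorted_varlen = sorted(names, key=variable_length_key)

fixed_bytes = sum(len(fixed_length_key(n, MAX_CHARS)) for n in names)
varlen_bytes = sum(len(variable_length_key(n)) for n in names)

print(sorted_fixed == sorted_varlen)  # True: same order either way
print(fixed_bytes, varlen_bytes)      # 800 20
```

In this toy model both key forms give the same order, because a shorter weight
string that is a prefix of a longer one compares lower in either case, while the
variable-length keys occupy a small fraction of the space.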

User Documentation
==================

https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-2.html

Requirements
============

F1 - Correct sort order should be preserved for all types. (The order need not
be identical to before in cases where the existing sort is already
nondeterministic, such as when keys hit the max_sort_length limit or compare
equal.)

NF1 - Sysbench sorting benchmarks should not be markedly slower than before.

NF2 - Sorting of sparse keys (e.g. a list of names in VARCHAR(100) COLLATE 
utf8mb4_0900_as_cs) should be significantly faster than before.