WL#9554: Varlen keys for sorting multibyte String values
Affects: Server-8.0
Status: Complete
Unicode collations are sorted by means of variable-length weight strings. In the worst case, these weight strings can become very long; for instance, utf8mb4_0900_as_cs has a strnxfrm_multiply of 24, meaning that every byte in the input string can potentially become 24 bytes of weight. In other words, a VARCHAR(100), which can be up to 400 bytes, gets 9600 bytes allocated for filesort, even if it contains just a simple 'a' (which is six bytes of weight plus some level separators).

The default setting max_sort_length=1024 masks this problem somewhat by truncating the weight strings, at the cost of incorrect and unpredictable sorting for strings that actually need long weight strings.

This WL aims to introduce variable-length keys when sorting NO PAD collations. (PAD collations still need to be fixed length, because they are conceptually extended to infinity.) It builds on the existing semantics for sorting JSON using variable-length keys. It does not try to replace strnxfrm with strnncollsp; that would also be an interesting avenue, but it can happen in a later WL.

User Documentation
==================

https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-2.html
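The worst-case allocation described in the summary of this WL can be checked with a few lines of arithmetic. The following is a minimal sketch, not the server's code: the function name and parameters are hypothetical, and it assumes a simplified model in which the reserved size is the maximum input size in bytes times strnxfrm_multiply, capped at max_sort_length when that limit applies.

```python
def fixed_sort_key_bytes(max_chars, mbmaxlen, strnxfrm_multiply,
                         max_sort_length=None):
    """Bytes reserved per key under fixed-length sort keys (simplified model).

    All names here are illustrative assumptions, not server identifiers,
    except strnxfrm_multiply and max_sort_length, which the text mentions.
    """
    max_input_bytes = max_chars * mbmaxlen          # VARCHAR(100) in utf8mb4: 400
    worst_case = max_input_bytes * strnxfrm_multiply
    if max_sort_length is not None:
        # Truncation bounds the allocation, but can make the sort incorrect
        # for strings whose weight strings are longer than the limit.
        return min(worst_case, max_sort_length)
    return worst_case


# VARCHAR(100), 4 bytes per character, strnxfrm_multiply of 24
print(fixed_sort_key_bytes(100, 4, 24))         # 9600
print(fixed_sort_key_bytes(100, 4, 24, 1024))   # 1024
```

Under this model every row pays the same price regardless of its content, which is why a column holding short strings is the case that benefits most from variable-length keys.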
F1 - Correct sort order should be preserved for all types. (The order need not be identical to the previous one in cases where the existing sort is already nondeterministic, such as when hitting the max_sort_length limit or on equal keys.)

NF1 - Sysbench sorting benchmarks should not be markedly slower than before.

NF2 - Sorting of sparse keys (e.g. a list of names in VARCHAR(100) COLLATE utf8mb4_0900_as_cs) should be significantly faster than before.
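To make F1 and NF2 concrete, here is a toy illustration of why variable-length keys can preserve the order while using far less space for sparse keys. It rests on loud assumptions: toy_weights is a hypothetical stand-in that emits two big-endian bytes per character (it is not a real collation weight function and only handles characters below U+10000), the inputs contain no NUL characters, and the fixed-length keys are zero-padded. How keys are actually laid out in the sort buffer is outside the scope of this sketch.

```python
def toy_weights(s: str) -> bytes:
    """Hypothetical weight string: two big-endian bytes per character."""
    return b"".join(ord(c).to_bytes(2, "big") for c in s)


def fixed_key(s: str, width: int) -> bytes:
    """Fixed-length key: the weight string zero-padded to the worst-case width."""
    return toy_weights(s).ljust(width, b"\x00")


def varlen_key(s: str) -> bytes:
    """Variable-length key: only the bytes the string actually needs."""
    return toy_weights(s)


MAX_CHARS = 100                 # as in VARCHAR(100)
FIXED_WIDTH = MAX_CHARS * 2     # worst case for the toy weight function

names = ["bob", "al", "alice", "a"]

by_fixed = sorted(names, key=lambda s: fixed_key(s, FIXED_WIDTH))
by_varlen = sorted(names, key=varlen_key)

fixed_total = sum(len(fixed_key(s, FIXED_WIDTH)) for s in names)
varlen_total = sum(len(varlen_key(s)) for s in names)

print(by_fixed == by_varlen)      # True
print(by_varlen)                  # ['a', 'al', 'alice', 'bob']
print(fixed_total, varlen_total)  # 800 22
```

In this toy model a shorter key that is a prefix of a longer one sorts first, which matches what zero-padding produces, so the two orders agree while the variable-length keys occupy a small fraction of the space.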