WL#3664: strnxfrm() changes for prefix keys and NOPAD
Affects: Server-5.6 — Status: Complete — Priority: Low
This is to solve BUG#24858. Padding in indexes cause problems for Falcon. The my_strnxfrm functions will take additional arguments: weight count and flags. The patch quality assurrance (i.e. that it works for Falcon) is provided by Kevin Lewis.
The filesort algorithm (which is executed when processing queries having ORDER BY on unindexed columns) uses cs->coll->strnxfrm() calls to create binary sortable keys. A cs->coll->strnxfrm() call resolves to a call of one of the following functions depending on collation: my_strnxfrm_8bit_bin my_strnxfrm_any_uca my_strnxfrm_big5 my_strnxfrm_bin my_strnxfrm_cp932 my_strnxfrm_czech my_strnxfrm_gbk my_strnxfrm_latin1_de my_strnxfrm_mb_bin my_strnxfrm_simple my_strnxfrm_sjis my_strnxfrm_tis620 my_strnxfrm_ucs2 my_strnxfrm_ucs2_bin my_strnxfrm_ucs2_uca, my_strnxfrm_utf8 my_strnxfrm_win1250ch As filesort supports only fixed length records, all the above functions currently pad the result up to maximun length with weights corresponding to space character. (0x20 or 0x0020 or 0x0209, never 0x00). Unlike other storage engines, Falcon stores a byte sortable version of the string in the index according to the specified collation. Variable length strings are fine, but space padding causes inconsistent results, especially with prefix searches (partial length indexes). We need to do some changes in the above functions to be able to switch off padding to reuse these functions for Falcon. It is not claimed that, if used in indexes, the results will be useful for LIKE or for covering indexes. Function definition ------------------- Proposed new prototype is: int (*strnxfrm)(CHARSET_INFO *cs, uchar *dst, uint dstlen, uint nweights, const uchar *src, uint srclen, uint flags); There will be no change for cs or dst or dstlen or src or srclen. Proposed new arguments are: 1. "flags", to pass a bit mask of flags. Currently only one flag is going to be supported: #define MY_STRNXFRM_FLAG_PAD 1 /* Whether to pad */ But it is very likely that we'll need more flags in the future: #define MY_STRNXFRM_FLAG_xxx 2 /* Another flag */ #define MY_STRNXFRM_FLAG_yyy 4 /* Yet another flag */ #define MY_STRNXFRM_FLAG_zzz 8 /* and so on */ The name is not certain. Peter prefers PAD_WITH_SPACE because that's the WL#896 term. 2. "nweights" - to create prefix keys. nweights will be the number from prefix key definition: For example, it will be 3 for KEY(c1(3)). One may pass nweights = 1000000 as a convention, that is, 1000000 is an impossibly-high number and so if you pass 1000000 then (in Kevin Lewis's words) "all the weights from the src that will fit into dstLen should be done." In response to a suggestion of using "the maximum 32-bit unsigned value, 4294967295", instead of 1000000, Bar replied: "This number is ok." The meaning of "nweights" ------------------------- Although nweights should mean "number of weights", that needs more clarification. The following doesn't correspond to the UCA definition, so MySQL may use a different term for documentation of prefix keys. For this context (only), "weight" means: the primary weight of a single character, but not counting when the weight is ignorable, and counting double when one character is equivalent to two (as with german2 and umlauts), and counting half when two characters are equivalent to one (as with czech ch). Here are some tricky examples, where MY_STRNXFRM_FLAG_PAD equals 1 in all cases, and dstlen = 1000 in all cases except Example #5. Tricky Example #1: latin1_german2_ci weight, with expansion. Suppose: character set latin1, collation latin1_german2_ci. Suppose: string value = 'ü'. The number of weights is: 2. 0x55, 0x45. Therefore, if nweights = 1, strnxfrm() produces 0x55! Tricky Example #1a: latin1_german2_ci weight, with expansion. Suppose: character set latin1, collation latin1_german2_ci. Suppose: string value = 'ü', repeated 1100 times. The number of weights is: 2200. 0x55, 0x45, ... Therefore, if maximum Falcon key length = 1100, it's illegal! Tricky Example #2: ucs2_unicode_ci weight, with ignorable. Suppose: character set ucs2, collation latin1_unicode_ci. Suppose: string value = 0x0000. The number of weights is: 0. Therefore, if nweights = 1, strnxfrm() produces 0x0209! Tricky Example #3: ucs2_unicode_ci weight, with ignorable. Suppose: character set ucs2, collation latin1_unicode_ci. Suppose: string value = 0x0000, 0x0020. The number of weights is: 1. 0x0209. Therefore, if nweights = 1, strnxfrm() produces 0x0209! Tricky Example #4: utf8_czech_ci weight, with contraction. Suppose: character set utf8, collation utf8_czech_ci. Suppose: string value = 'ch'. The number of weights is: 1. 0x0ee2. Therefore, if nweights = 1, strnxfrm() produces 0x0ee2! Tricky Example #5: ucs2_general_ci, with bad dstlen. Suppose: character set ucs2, collation ucs2_general_ci. Suppose: string value = 0x2020, the Unicode for 'dagger'. The number of weights is: 1. 0x2020. But -- to make things hard -- suppose dstlen = 1 (bytes). Therefore, if nweights = 1, strnxfrm() produces 0x20! It won't really happen, though, because specification of KEY(s1(N)) will cause space allocation of (N times maximum weight length) bytes, so dstlen can't be bad. Tricky Example #6: ucs2_general_ci, with 0x20 in weight. Suppose: character set ucs2, collation ucs2_unicode_ci. Suppose: string value = YI SYLLABLE HNIEP. The number of weights is: 1. 0x2020. Therefore, if nweights = 1, strnxfrm() produces 0x2020. (Okay, maybe this last one isn't tricky, but let's hope nobody dreams of truncating trailing 0x20 or 0x00.) Use for prefix keys ------------------- The plan is to apply "nweights" for prefix key definitions as well, for example KEY(s1(3)) will no longer mean "the three leftmost characters" but "three weights". See also these bug reports: BUG#20353 "Prefix key search return incorrect results for ignorable characters" BUG#20447 "Problem with prefix keys with contractions and expansions" Falcon may use strnxfrm functions for any indexing, prefix or not. Relation to WL#896 ------------------ The above changes will also help to implement the WEIGHT_STRING() SQL function which is a part of WL#896: Syntax: WEIGHT_STRING ( string /* input */ AS CHAR ( int ) ] /* for padding */ [ FROM int ] /* first output weight */ [ FOR int ] /* number of output weights */ [ flags ] /* special effects */ ) For example, WEIGHT_STRING(c1 AS CHAR(3) PAD_WITH_SPACE) will internally do this call: cs->cset->strnxfrm(cs, dst, dstlen, 3, src, srclen, MY_STRNXFRM_FLAG_PAD); Falcon has what WL#896 refers to as a "one-level index of weights". The "nweights" parameter is what WL#896 refers to as "number of output weights". UCA version ----------- UCA weights are from ftp://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt which is not the current version. If/when WL#2673 happens, users will have to rebuild all UCA-based Falcon indexes. Stall ----- There must be no violation of the principle, as Bar states it: "ORDER BY" must always return exactly the same order in all these cases: - no index - prefix index - full-length index Currently, in the Falcon storage engine, for multi-column indexes, all columns except the last are padded with 0x00 to 5-byte boundaries. There are some unhandled doubts about how padding will work. It's safest to wait until everyone knows what Falcon indexes will really look like.
Above design discussed in detail by PeterG, Bar, Kevin, Ann, JimS. No further LLD needed,=.
Copyright (c) 2000, 2017, Oracle Corporation and/or its affiliates. All rights reserved.