WL#3716: WEIGHT_STRING function
Affects: Server-5.6
—
Status: Complete
Allow a function named WEIGHT_STRING(), which takes a string and returns the collation weights. This task was formerly part of WL#896 Primary, Secondary and Tertiary Sorts. The function is useful for testing the results from WL#3664 strnxfrm() changes for prefix keys and NOPAD. ----- DOCS NOTE (Email from Bar, 2007-10-08): > Also, I have a suggestion for 5.2 manual, for the same page, > after the last paragraph telling about "sort value": > http://dev.mysql.com/doc/refman/5.2/en/case-sensitivity.html > Please put a link to the WEIGHT_STRING article... ----- [ Note added by Peter Gulutzan 2009-10-18 ] Code is complete. But priority is lower than it was before. So we can delay putting this in a production release. A good time would be: along with WL#896 "Primary, Secondary and Tertiary Sorts". This decision follows Alexander Barkov's email today "Re: WL#3716 WEIGHT_STRING function".
The WEIGHT_STRING function will return a binary string of weights. Comparable functions are Oracle's NLSSORT and Microsoft's tertiary_weights. Syntax ------ WEIGHT_STRING ( string /* input */ [ AS { CHAR | BINARY} ( int ) ] /* for padding */ [ LEVEL numeric-list-or-range /* e.g. "LEVEL 1,2,3" or "LEVEL 1-3" */ [ flags ] /* special effects */ ) The "string" should be an expression that returns a string data type. Ordinarily it will be CHAR or VARCHAR. If it's TEXT, it's treated the same as CHAR or VARCHAR. If it's BLOB or BINARY or VARBINARY, then output = input. If it's something else, e.g. FLOAT or INT or DECIMAL or ENUM, then accept it and return the bytes of numbers as big-endian. Generally, show what MySQL uses for filesort. "AS CHAR ( int )" causes padding on the right. The pad character is always space, that is, 0x20 or 0x0020. The padding occurs before the conversion to a weight. The value of "int" is the number of characters, not the number of bytes. The minimum value of 'int' is 1. If 'int' is smaller than the length of 'string', truncation will occur with no warning. If "AS CHAR ( int )" is omitted, there is no padding and no truncation. "AS BINARY ( int )" is like CHAR(int) but causes padding of 0x00, so one doesn't have to use "CAST(string AS BINARY(int))". "LEVEL numeric-list-or-range" is the start weight and following weights for the output string. A list is a comma-separated series of positive integers: "first,second,third,...", for example "1,2". A range is a pair of integers separated by "-": "first-last", for example "1-3" or (spaces don't matter) "1 - 6". If any number is less than 1, it is treated as 1. If the second number in a range is less than the first, it is treated as the first. If any number is greater than the maximum, it is treated as the maximum. (The "maximum" varies; it's never greater than 6; with some collations it's only 1.) If LEVEL is omitted, MySQL assumes: LEVEL 1 - maximum If and only if one specifies "LEVEL numeric-list" (not "range"), one may follow any "number" with [ASC|DESC][REVERSE] If one says "DESC", then the weights come out NOTed or negated for that level. Probably "DESC" is useful for Level 3 for putting Lowercase before Uppercase, for Level 4 for putting Katakana before Hiragana. If one says "REVERSE", then the weights come out in reverse order for that level, that is, starting with the last character and ending with the first character. (REVERSE is a good word because it's like MySQL's REVERSE function, but unicode.org UCA document calls this the "backward" feature.) Probably "REVERSE" is only useful for Level 2 for the (supposed) French sorting of accents right-to-left. If one does not say "DESC" or "REVERSE", the default is: weights come out not NOTed, and not reversed. And eventually (though not for this task) there could be combinations: LEVEL ALL /* Every level the collation can have */ LEVEL ALL FOR COMPARE /* All normally used for "WHERE" */ LEVEL ALL FOR ORDER /* All normally used for "ORDER BY" */ ("FOR COMPARE" means "for WHERE, DISTINCT, MIN, etc."; we know that ORDER BY involves comparison too but couldn't think of any better words, after considering FOR WHERE and FOR EQUALITY.) Incidentally, "LEVEL ALL" could be the same as "LEVEL 1,2,3,4,5,6" for a Japanese collation containing the four levels of WL#2555, plus a fifth-level comparison for putting halfwidth before fullwidth, plus level6 = where "ignorable" values would be shifted. It's very likely that 6 is the "maximum maximum". "flags" is something that we'll use for special cases. In version 5.2 there are, as yet, no defined flags. For the long term Peter and Bar should think about: VARIABLE_BLANK | VARIABLE_NON_IGNORABLE | VARIABLE_SHIFTED What do do with UCA "Variable Weighting" characters. NULLS_HIGH | NULLS_LOW | NULLS_FIRST | NULLS_LAST | NULLS_NULL Return for NULL can be maximum, minimum, suprememum, infimum, or NULL. Probably this only has practical meaning for Level 1. COMPRESS Don't take a full byte for a tertiary upper|lower case difference. This is primarily useful for Level 2 and later, since the number of choices (different kinds of accent, different kinds of case, etc.) is less than 256. VERSION_UCA_4_0 | VERSION_UCA_5_0 The version of Unicode allkeys, 4.0.0 or 5.0.0. NORMALIZE IBM's ICU allows "normalization off/on" flags. Examples -------- WEIGHT_STRING ('w' COLLATE latin2_czech_ci) WEIGHT_STRING (@a LEVEL 1,2,3) WEIGHT_STRING ('' LEVEL 2 REVERSE) select hex(weight_string(_ucs2 ox0061 collate ucs2_unicode_ci as char(2))); weight_string(convert(latin1_general_ci_column using ucs2) collate ucs2_unicode_w4a levels 2-2) "LEVEL x, x - 1" ---------------- Peter's original hope was that expresions like "LEVEL 3,2" would mean that one treated level 3 as being higher priority than level 2 (good for compatibility with Microsoft Japanese quirks). But that is not to be. Instead, MySQL will give a syntax error if, in a list of levels, it encounters a level which has a lower number than a level mentioned earlier in the list. Output ------ If the result length is less than or equal to the maximum possible length for the VARBINARY data type, then the result data type is VARBINARY. Otherwise the result data type is BLOB. The actual return length varies depending on input length, number of weights, first weight, flag values, and MySQL version. If input is NULL, then output is NULL. In non-strict mode, errors may cause the return value to be NULL. WEIGHT_STRING will not be a new reserved word. See also: WL#3804 WEIGHT_STRING deferred features
Design has been discussed, in detail, by Falcon team (primarily Kevin), Bar and PeterG; no further LLD needed.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.