WL#3716: WEIGHT_STRING function

Affects: Server-5.6   —   Status: Complete

Allow a function named WEIGHT_STRING(),
which takes a string and returns the
collation weights.

This task was formerly part of WL#896
Primary, Secondary and Tertiary Sorts.

The function is useful for testing the results from
WL#3664 strnxfrm() changes for prefix keys and NOPAD.


DOCS NOTE (Email from Bar, 2007-10-08): 

> Also, I have a suggestion for 5.2 manual, for the same page,
> after the last paragraph telling about "sort value":

> http://dev.mysql.com/doc/refman/5.2/en/case-sensitivity.html

> Please put a link to the WEIGHT_STRING article...


[ Note added by Peter Gulutzan 2009-10-18 ]

Code is complete. But priority is lower than it was before.
So we can delay putting this in a production release.
A good time would be: along with WL#896 "Primary, Secondary
and Tertiary Sorts". This decision follows Alexander Barkov's
email today "Re: WL#3716 WEIGHT_STRING function".
The WEIGHT_STRING function will return a binary string    
of weights. Comparable functions are Oracle's NLSSORT    
and Microsoft's tertiary_weights.    


string                            /* input */    
[ AS { CHAR | BINARY} ( int ) ]   /* for padding */
[ LEVEL numeric-list-or-range     /* e.g. "LEVEL 1,2,3" or "LEVEL 1-3" */    
[ flags ]                         /* special effects */    
The "string" should be an expression that returns a string    
data type. Ordinarily it will be CHAR or VARCHAR.    
If it's TEXT, it's treated the same as CHAR or VARCHAR.    
If it's BLOB or BINARY or VARBINARY, then output = input.
If it's something else, e.g. FLOAT or INT or DECIMAL    
or ENUM, then accept it and return the bytes of numbers
as big-endian. Generally, show what MySQL uses for filesort.
"AS CHAR ( int )" causes padding on the right. The pad    
character is always space, that is, 0x20 or 0x0020.    
The padding occurs before the conversion to a weight.    
The value of "int" is the number of characters, not the    
number of bytes. The minimum value of 'int' is 1.    
If 'int' is smaller than the length of 'string',    
truncation will occur with no warning.    
If "AS CHAR ( int )" is omitted, there is no padding    
and no truncation.    
"AS BINARY ( int )" is like CHAR(int) but causes
padding of 0x00, so one doesn't have to use 
"CAST(string AS BINARY(int))".    

"LEVEL numeric-list-or-range" is the start weight and
following weights for the output string. A list is a
comma-separated series of positive integers:
"first,second,third,...", for example "1,2". A range
is a pair of integers separated by "-":
"first-last", for example "1-3" or (spaces don't matter)
"1 - 6". If any number is less than 1, it is treated as
1. If the second number in a range is less than the
first, it is treated as the first. If any number is
greater than the maximum, it is treated as the maximum.
(The "maximum" varies; it's never greater than 6; with
some collations it's only 1.)
If LEVEL is omitted, MySQL assumes:
LEVEL 1 - maximum

If and only if one specifies "LEVEL numeric-list"
(not "range"), one may follow any "number" with
If one says "DESC", then the weights come out NOTed
or negated for that level. Probably "DESC" is useful
for Level 3 for putting Lowercase before Uppercase,
for Level 4 for putting Katakana before Hiragana.
If one says "REVERSE", then the weights come out in
reverse order for that level, that is, starting with
the last character and ending with the first character.
(REVERSE is a good word because it's like MySQL's
REVERSE function, but unicode.org UCA document calls
this the "backward" feature.)
Probably "REVERSE" is only useful for Level 2 for the
(supposed) French sorting of accents right-to-left.
If one does not say "DESC" or "REVERSE", the default is:
weights come out not NOTed, and not reversed.

And eventually (though not for this task) there could be combinations:
LEVEL ALL                  /* Every level the collation can have */
LEVEL ALL FOR COMPARE      /* All normally used for "WHERE" */
LEVEL ALL FOR ORDER        /* All normally used for "ORDER BY" */
("FOR COMPARE" means "for WHERE, DISTINCT, MIN, etc."; we know
that ORDER BY involves comparison too but couldn't think of any
better words, after considering FOR WHERE and FOR EQUALITY.)
Incidentally, "LEVEL ALL" could be the same as "LEVEL 1,2,3,4,5,6"
for a Japanese collation containing the four levels of WL#2555, plus
a fifth-level comparison for putting halfwidth before
fullwidth, plus level6 = where "ignorable" values
would be shifted. It's very likely that 6 is the "maximum maximum".

"flags" is something that we'll use for special cases.
In version 5.2 there are, as yet, no defined flags.
For the long term Peter and Bar should think about:

What do do with UCA "Variable Weighting" characters.

Return for NULL can be maximum, minimum, suprememum, infimum, or NULL.
Probably this only has practical meaning for Level 1.

Don't take a full byte for a tertiary upper|lower case difference.
This is primarily useful for Level 2 and later, since the number
of choices (different kinds of accent, different kinds of case, etc.)
is less than 256.

The version of Unicode allkeys, 4.0.0 or 5.0.0.

IBM's ICU allows "normalization off/on" flags.

WEIGHT_STRING ('w' COLLATE latin2_czech_ci)    
WEIGHT_STRING (@a LEVEL 1,2,3)    

select hex(weight_string(_ucs2 ox0061    
collate ucs2_unicode_ci    
as char(2)));    
weight_string(convert(latin1_general_ci_column using ucs2)    
collate ucs2_unicode_w4a levels 2-2)  

"LEVEL x, x - 1"

Peter's original hope was that expresions like
"LEVEL 3,2" would mean that one treated level
3 as being higher priority than level 2 (good
for compatibility with Microsoft Japanese quirks).
But that is not to be. Instead, MySQL will give a
syntax error if, in a list of levels, it encounters
a level which has a lower number than a level
mentioned earlier in the list.


If the result length is less than or equal to the    
maximum possible length for the VARBINARY data type,    
then the result data type is VARBINARY. Otherwise    
the result data type is BLOB.    
The actual return length varies depending on input length,    
number of weights, first weight, flag values, and    
MySQL version.    
If input is NULL, then output is NULL.    
In non-strict mode, errors may cause the return    
value to be NULL.    
WEIGHT_STRING will not be a new reserved word.    

See also:
WL#3804 WEIGHT_STRING deferred features

Design has been discussed, in detail, by Falcon 
team (primarily Kevin), Bar and PeterG; no further
LLD needed.