WL#3664: strnxfrm() changes for prefix keys and NOPAD

Affects: Server-5.6   —   Status: Complete

This is to solve BUG#24858.  Padding in indexes cause problems 
for Falcon. The my_strnxfrm functions will take additional
arguments: weight count and flags.

The patch quality assurrance (i.e. that it works for Falcon) is
provided by Kevin Lewis.
The filesort algorithm (which is executed when processing 
queries having ORDER BY on unindexed columns) uses
cs->coll->strnxfrm() calls to create binary sortable keys.
A cs->coll->strnxfrm() call resolves to a call of one of
the following functions depending on collation:

my_strnxfrm_8bit_bin
my_strnxfrm_any_uca
my_strnxfrm_big5
my_strnxfrm_bin
my_strnxfrm_cp932
my_strnxfrm_czech
my_strnxfrm_gbk
my_strnxfrm_latin1_de
my_strnxfrm_mb_bin
my_strnxfrm_simple
my_strnxfrm_sjis
my_strnxfrm_tis620
my_strnxfrm_ucs2
my_strnxfrm_ucs2_bin
my_strnxfrm_ucs2_uca,
my_strnxfrm_utf8
my_strnxfrm_win1250ch

As filesort supports only fixed length records,
all the above functions currently pad the result
up to maximun length with weights corresponding to space character.
(0x20 or 0x0020 or 0x0209, never 0x00).

Unlike other storage engines, Falcon stores a byte sortable version of the
string in the index according to the specified collation. Variable length
strings are fine, but space padding causes inconsistent results, especially
with prefix searches (partial length indexes).

We need to do some changes in the above functions
to be able to switch off padding to reuse these
functions for Falcon.

It is not claimed that, if used in indexes, the
results will be useful for LIKE or for covering
indexes.

Function definition
-------------------

Proposed new prototype is:

int (*strnxfrm)(CHARSET_INFO *cs,
                uchar *dst, uint dstlen,
                uint nweights,
                const uchar *src, uint srclen,
                uint flags); 

There will be no change for cs or dst or dstlen
or src or srclen.

Proposed new arguments are:

1. "flags", to pass a bit mask of flags.
Currently only one flag is going to be supported:

#define MY_STRNXFRM_FLAG_PAD 1 /* Whether to pad   */

But it is very likely that we'll need more flags in the future:

#define MY_STRNXFRM_FLAG_xxx 2 /* Another flag     */
#define MY_STRNXFRM_FLAG_yyy 4 /* Yet another flag */
#define MY_STRNXFRM_FLAG_zzz 8 /* and so on        */

The name is not certain. Peter prefers PAD_WITH_SPACE
because that's the WL#896 term.

2. "nweights" - to create prefix keys.
nweights will be the number from prefix key definition:
For example, it will be 3 for KEY(c1(3)).

One may pass nweights = 1000000 as a convention, that is,
1000000 is an impossibly-high number and so if you pass
1000000 then (in Kevin Lewis's words) "all the
weights from the src that will fit into dstLen should be
done." In response to a suggestion of using
"the maximum 32-bit unsigned value, 4294967295",
instead of 1000000, Bar replied: "This number is ok."

The meaning of "nweights"
-------------------------

Although nweights should mean "number of
weights", that needs more clarification. The
following doesn't correspond to the UCA definition,
so MySQL may use a different term for documentation
of prefix keys.

For this context (only), "weight" means: the primary weight
of a single character, but not counting when the weight is
ignorable, and counting double when one character
is equivalent to two (as with german2 and umlauts),
and counting half when two characters are equivalent
to one (as with czech ch).

Here are some tricky examples, where MY_STRNXFRM_FLAG_PAD
equals 1 in all cases, and dstlen = 1000 in all cases
except Example #5.

Tricky Example #1: latin1_german2_ci weight, with expansion.
Suppose: character set latin1, collation latin1_german2_ci.
Suppose: string value = 'ü'.
The number of weights is: 2. 0x55, 0x45.
Therefore, if nweights = 1, strnxfrm() produces 0x55!

Tricky Example #1a: latin1_german2_ci weight, with expansion.
Suppose: character set latin1, collation latin1_german2_ci.
Suppose: string value = 'ü', repeated 1100 times.
The number of weights is: 2200. 0x55, 0x45, ...
Therefore, if maximum Falcon key length = 1100, it's illegal!

Tricky Example #2: ucs2_unicode_ci weight, with ignorable.
Suppose: character set ucs2, collation latin1_unicode_ci.
Suppose: string value = 0x0000.
The number of weights is: 0.
Therefore, if nweights = 1, strnxfrm() produces 0x0209!

Tricky Example #3: ucs2_unicode_ci weight, with ignorable.
Suppose: character set ucs2, collation latin1_unicode_ci.
Suppose: string value = 0x0000, 0x0020.
The number of weights is: 1. 0x0209.
Therefore, if nweights = 1, strnxfrm() produces 0x0209!

Tricky Example #4: utf8_czech_ci weight, with contraction.
Suppose: character set utf8, collation utf8_czech_ci.
Suppose: string value = 'ch'.
The number of weights is: 1. 0x0ee2.
Therefore, if nweights = 1, strnxfrm() produces 0x0ee2!

Tricky Example #5: ucs2_general_ci, with bad dstlen.
Suppose: character set ucs2, collation ucs2_general_ci.
Suppose: string value = 0x2020, the Unicode for 'dagger'.
The number of weights is: 1. 0x2020.
But -- to make things hard -- suppose dstlen = 1 (bytes).
Therefore, if nweights = 1, strnxfrm() produces 0x20!
It won't really happen, though, because specification of
KEY(s1(N)) will cause space allocation of (N times
maximum weight length) bytes, so dstlen can't be bad.

Tricky Example #6: ucs2_general_ci, with 0x20 in weight.
Suppose: character set ucs2, collation ucs2_unicode_ci.
Suppose: string value = YI SYLLABLE HNIEP.
The number of weights is: 1. 0x2020.
Therefore, if nweights = 1, strnxfrm() produces 0x2020.
(Okay, maybe this last one isn't tricky, but let's hope
nobody dreams of truncating trailing 0x20 or 0x00.)

Use for prefix keys
-------------------

The plan is to apply "nweights" for prefix key
definitions as well, for example KEY(s1(3)) will
no longer mean "the three leftmost characters"
but "three weights". See also these bug reports:
BUG#20353 "Prefix key search return incorrect results for ignorable characters"
BUG#20447 "Problem with prefix keys with contractions and expansions"

Falcon may use strnxfrm functions for any indexing,
prefix or not.

Relation to WL#896
------------------

The above changes will also help to implement the WEIGHT_STRING()
SQL function which is a part of WL#896:

Syntax:
WEIGHT_STRING
(
string                      /* input */    
AS CHAR ( int ) ]           /* for padding */    
[ FROM int ]                /* first output weight */    
[ FOR int ]                 /* number of output weights */    
[ flags ]                   /* special effects */    
)

For example, WEIGHT_STRING(c1 AS CHAR(3) PAD_WITH_SPACE)
will internally do this call:

cs->cset->strnxfrm(cs, dst, dstlen, 3, 
                   src, srclen, MY_STRNXFRM_FLAG_PAD); 

Falcon has what WL#896 refers to as a
"one-level index of weights".

The "nweights" parameter is what WL#896 refers to as
"number of output weights".

UCA version
-----------

UCA weights are from
ftp://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt
which is not the current version. If/when WL#2673 happens,
users will have to rebuild all UCA-based Falcon indexes.

Stall
-----

There must be no violation of the principle, as Bar states it:
"ORDER BY" must always return exactly the same order in all
these cases:
- no index
- prefix index
- full-length index

Currently, in the Falcon storage engine, for multi-column
indexes, all columns except the last are padded with 0x00
to 5-byte boundaries. There are some unhandled doubts
about how padding will work. It's safest to wait until
everyone knows what Falcon indexes will really look like.
Above design discussed in detail by PeterG,
Bar, Kevin, Ann, JimS. No further LLD needed,=.