WL#896: Primary, Secondary and Tertiary Sorts

Affects: Server-6.x — Status: Assigned

Description
Dependent Tasks
High Level Architecture
Low Level Design

The Unicode Collation Algorithm has guidelines for sorting  
characters. An introductory article is:  
"Collations"
http://web.archive.org/web/20030401193640/www.dbazine.com/gulutzen1.html
Our authoritative text is:  
"Unicode Technical Standard #10 Unicode Collation Algorithm"  
http://www.unicode.org/reports/tr10/  
which we'll call "The UCA document".  
  
Trying to summarize in one sentence:  
For sort keys and indexes, compare the weights of the
characters, so that collation will be possible (approximately)
as the UCA document requires, with up to four levels.
Multi-level comparisons work thus:
  Compare the level-1 weights.
  If the result is equal, compare the level-2 weights.
  If the result is still equal, compare the level-3 weights.
  And so on, for the number of weights in the collation.
Variation is possible, for example comparing
the level-3 weights before the level-2 weights.
  
Alexander Barkov (Bar) and Sergei Golubchik believed this task
(WL#896) could be in 5.0, which was the original plan. Clearly
it will be later.  
  
This has some possible interest for a Japanese customer  
[ name deleted, see Progress Reports ] which suggested a patch
for MySQL 4.1, but (as Sergei points out) they can use their own
patch with 4.1 while we work toward WL#896.  

Sergei, Bar, and I (Peter Gulutzan) agree that the Japanese-customer
patch (also known as utf8_general_cs)  
is good for their particular purpose but not for our usual  
customers. They do two passes: sort according to the MySQL  
official collation (which I suppose is sjis_japanese_ci),  
then for all values which are "equal" in that collation,  
compare using memcmp. This means that a level-3 difference  
trumps a level-2 difference, e.g. 'Aé' < 'ae'. The  
Japanese-customer solution is good -- these are intelligent
people and the solution works -- but we merely say that our
plan must fit other, non-Japanese, situations as well.

WL#896 is a prerequisite for new Hungarian collations
(WL#2993) and standard Japanese (WL#2555).


The "allkeys.txt" file discussed in the HLS is here:
http://www.unicode.org/Public/UCA/latest/allkeys.txt

See also:
BUG#34130 incorrect french order in utf8_unicode_ci

WL#2673: Unicode Collation Algorithm new version
WL#3664: strnxfrm() changes for prefix keys and NOPAD

Issues
------

The mandatory new collations will be utf8_unicode_600_w1234,
ucs2_unicode_600_w1234, utf8mb4_unicode_600_w1234,
utf16_unicode_600_w1234, utf32_unicode_600_w1234.
See WL#5844 for the naming convention details.
They will follow exactly the allkeys.txt guideline for level-1,
including "lower case first" (see below).
The implementor can decide what version of allkeys.txt to use,
that is, '600' is merely an assumption about Unicode 6.0.0.
If WL#5704 happens first, the only new collation is
unicode_600_w1234 (no need for character set names).

LOWER CASE FIRST: in allkeys.txt, lower case precedes      
upper case. We'll follow allkeys.txt. The behaviour
will not be optional, that is, we will not allow for
the "caseFirst" attribute mentioned in the UCA document.

It is possible to define any column with a _w123      
collation. However, for conditional expressions and      
distinct and grouping, there is no      
difference between ucs2_unicode_w123 and ucs2_unicode_ci.      
You only see a difference with ORDER BY expression,      
where expression -- either because a column is defined      
with a _w123 collation or because there is a COLLATE      
clause -- has a _w123 collation.      
      
We do not expect to see two-level collations, and
we do not expect to see four-level collations outside Japanese.     
We do expect to continue with some one-level collations.
(One-level collations are the norm now, and   
utf8_unicode_w1 is similar to the old utf8_unicode_ci.    

A _w123 collation is always compatible with a _ci      
collation, because we will always use the _ci collation.      
For example, one can concatenate a _w3 value      
with a _ci value, the result will have a _ci collation.      
      
Bar will try compression tricks to keep the sort key      
length reasonable, but we realize      
that lots of memory is necessary. One suggestion is:      
make an initial assumption "multiplier = 6", try to      
make keys, if overflow occurs say "multiplier++" and      
try again or else flag the keys that will need special      
handling after the sort pass. Details are not part of      
this description but should be in low-level architecture      
or in a separate worklog task. Another suggestion is:     
since the later levels are usually only one or two     
bits, and the first level is less than 16 bits, we     
could try to store all levels in one 16-bit word,     
and do multiple passes with appropriate ANDs / ORs.     
      
PAD SPACE: The new collations are all "PAD SPACE".  
      
w123a      
-----      
      
In _w123 collations the level-2 and level-3      
weights are applicable only for ORDER BY.      
      
Variant collations which have the suffix _w123a      
("weight 123 always" or "weight 123 strong") are applicable for all      
operations, including UNIQUE, > and = and <      
comparisons (but not LIKE), GROUP BY, ORDER BY.      
      
The UCA document allows the option, saying:      
" For comparison, most of the time a three-level strength      
will need to be used. In some cases, a larger number of      
levels will be needed, while in others - especially in      
searching - fewer levels will be desired."      
...      
"Users of search algorithms should be allowed to modify      
the comparison strength, thus excluding differences at      
less significant levels. This is especially useful for      
searching, but can also apply to comparison."      
      
Bar points out that latin2_czech_ci already works      
like a _w123a collation, and the code that he already      
has done is for a _w123a collation, and this would be      
compatible with the Japanese-customer patch.      
     
The standard Japanese collation (WL#2555) is _w1234a.        
    
WEIGHT_STRING    
-------------
    
The WEIGHT_STRING function will return a binary string    
of weights. Comparable functions are Oracle's NLSSORT    
and Microsoft's tertiary_weights.    
    
This is now a separate completed task, WL#3716 WEIGHT_STRING function.
See also WL#3804 WEIGHT_STRING deferred features.
 
By default, there will be no compression, we will not    
implement any of the suggestion in the UCA description    
http://www.unicode.org/reports/tr10/tr10-12.html    
section 6.1 "Reducing sort key lengths".    
    
Indexing      
--------      
      
Peter would be happy if we did WL#896 with no indexes,      
but apparently Bar has already done something, and      
Peter supposes that the Japanese customer won't be happy unless we      
have a solution with indexing.      
      
[ Note added 2011-03-04 ]
Peter asked six questions, of varying quality.
For one question, the answer was to refer to WL#826.
For four questions, answers by Serg and/or Bar were wrong
and are hereby superseded. One question was silly and is removed.

Prefix indexes, which Peter wrongly called partial indexes, are disallowed
for multi-level collations. For example, the following statements
(which involve an implicit prefix index because InnoDB has a
767-byte limit until WL#5743 "InnoDB: Lift the limit of index
key prefixes" is done) will cause an error:
CREATE TABLE t1 (a VARCHAR(1000) CHARACTER SET utf8 COLLATE utf8_unicode_600_w1234);
CREATE INDEX i ON t1 (a));
Result: ERROR xxxx (42000): prefix index for multi-level collation.
Bar believes that we may be able to do more with
prefix indexes after doing
WL#5615 Prefix key changes for tricky collations.
      
The index can be used as a covering index, because
the value which will be stored is the original string,
not the weight of the string. Peter objected that
storing values instead of weights makes searches
substantially slower (perhaps 30% - 40% if one
guesses after looking at latin2_czech_cs examples).
But this is an implementor decision, and Bar
decided we'll store values.

Expressions like "LIKE 'A_'" will not use the index for
columns that have a multi-level collation.
There are some problems that would have to be solved with
the optimizer first. An example of the problems is:
Assume a w1234 index.
It's ordered thus:
     a
     A
     AB
     aC
I'm searching with LIKE 'a_'.
Assume that we apply a method like the one
that's been used in the past for non-w1234 indexes:
start with the first key that is >= 'a'
(so it finds 'a'), and stop with the first
key that is > 'a' (so it doesn't find 'aC').
If we applied that idea in the w1234 index,
Bar explains that it should work as follows:
1. either scan all records between
   'a...'
   and
   'a...'
   using primary level only.
   where  - smallest primary non-ignorable character
   and  - biggest primary non-ignorable character in the collation.
2. or make like_range generate the range as follows:
   'a...'
    and
    'Ḁ...'
    notice capital A WITH RING BELOW in the second value - which
    is the greatest among all 'A's in DUCET.
With this range we can use all levels to scan the range.
But I think, this type of range generation is quite tricky.
It will need operation which is able to:
"Find a letter which is the same on the primary
level and greatest on the other levels", which
in the above example will convert "SMALL LETTER A"
to "CAPITAL LETTER A WITH RING BELOW".
Implementing this feature will need some extra tables.
So I think N1 is a better solution.
That means we should either support one-level search in a multi-level
index under terms of WL#896, or we'll have to disallow LIKE
optimization.

Generally speaking, a UNIQUE index is compared for level-1 only,
except for Hungarian and Japanese. We use comparison strength.
So in a _w1234a index 'A' and 'a' are duplicates.
In a _w1234 index 'A' and 'a' are not duplicates.

Open questions
--------------
- multi-level collations and InnoDB
- multi-level collations and NDB
- multi-level collations and partitioning

Trailing spaces
---------------

Trailing spaces (U+0020) will be represented at level-1.
Trailing spaces will not be represented at level-2 or level-3.
Trailing spaces will be represented at level-4.


Email archives      
--------------      
      
The contents of this task come from some emails      
exchanged in November 2004 between Alexander Barkov,      
Sergei Golubchik, and Peter Gulutzan, threads      
"MySQL 4.1 for [name of another Japanese customer deleted,
see progress reports]" and "MySQL 4.1 case sensitive patch for
[name of another Japanese customer deleted, see progress reports]".      
Earlier discussions are in scrum-tasks, scrum-daily, and support archives:      
      
Thread "Character sets" [January 2004]      
https://intranet.mysql.com/secure/mailarchive/mail.php?folder=76&mail=1080      
      
Thread "Bar's proposed scrum todo for February" [February 2004]      
https://intranet.mysql.com/secure/mailarchive/mail.php?folder=78&mail=898      
      
Thread "[req] utf8_general_cs patch" [July 2004]      
https://intranet.mysql.com/secure/mailarchive/mail.php?folder=12&mail=71247 

Thread "Re: WL#896 Primary, Secondary and Tertiary Sorts"
search in dev-private email archive

"What the @!#$% is the TERTIARY_WEIGHTS() function for?" (Microsoft blogger)
https://blogs.msdn.com/michkap/archive/2006/06/02/615571.aspx

Detailed design discussions have taken place via
email; no further LLD needed.