WL#896: Primary, Secondary and Tertiary Sorts

Affects: Server-6.x   —   Status: Assigned

The Unicode Collation Algorithm has guidelines for sorting  
characters. An introductory article is:  
"Collations"
http://web.archive.org/web/20030401193640/www.dbazine.com/gulutzen1.html
Our authoritative text is:  
"Unicode Technical Standard #10 Unicode Collation Algorithm"  
http://www.unicode.org/reports/tr10/  
which we'll call "The UCA document".  
  
To summarize in one sentence:
When building sort keys and indexes, compare the weights of the
characters, so that collation works (approximately) as the UCA
document requires, with up to four levels.
Multi-level comparisons work thus:
  Compare the level-1 weights.
  If the result is equal, compare the level-2 weights.
  If the result is still equal, compare the level-3 weights.
  And so on, for as many levels as the collation defines.
Variations are possible, for example comparing
the level-3 weights before the level-2 weights.
  
Alexander Barkov (Bar) and Sergei Golubchik believed this task
(WL#896) could be in 5.0, which was the original plan. Clearly
it will be later.  
  
This may be of interest to a Japanese customer
[ name deleted, see Progress Reports ] which suggested a patch
for MySQL 4.1, but (as Sergei points out) they can use their own
patch with 4.1 while we work toward WL#896.

Sergei, Bar, and I (Peter Gulutzan) agree that the Japanese-customer
patch (also known as utf8_general_cs)  
is good for their particular purpose but not for our usual  
customers. The patch does two passes: first sort according to the
official MySQL collation (which I suppose is sjis_japanese_ci),
then, for all values which are "equal" in that collation,
compare using memcmp. This means that a level-3 difference
trumps a level-2 difference, e.g. 'Aé' < 'ae': memcmp sees
'A' < 'a' in the first byte and never reaches the accent. The
Japanese-customer solution is good -- these are intelligent
people and the solution works -- but we merely say that our
plan must fit other, non-Japanese, situations as well.
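
To make the two-pass behaviour concrete, here is a rough sketch. It
is not the customer's patch: fold() is a hypothetical stand-in for a
case- and accent-insensitive collation (an assumption; the real first
pass would use the MySQL collation), and the second pass compares the
UTF-8 bytes, which in Python orders the same way memcmp does for
these strings.

```python
import unicodedata


def fold(s):
    """Stand-in for a case- and accent-insensitive collation (assumption).

    Decompose the string, drop combining marks, then lowercase.
    """
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def two_pass_compare(s1, s2):
    """Pass 1: insensitive compare. Pass 2: byte compare to break ties."""
    k1, k2 = fold(s1), fold(s2)
    if k1 != k2:
        return -1 if k1 < k2 else 1
    # The values are "equal" in the insensitive collation, so fall back
    # to a byte-by-byte comparison of the encoded strings.
    b1, b2 = s1.encode("utf-8"), s2.encode("utf-8")
    if b1 != b2:
        return -1 if b1 < b2 else 1
    return 0
```

Here two_pass_compare("Aé", "ae") returns -1, i.e. 'Aé' < 'ae',
whereas a level-by-level comparison in the usual order would put
'ae' first because of the level-2 accent difference.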

WL#896 is a prerequisite for new Hungarian collations
(WL#2993) and standard Japanese (WL#2555).


The "allkeys.txt" file discussed in the HLS is here:
http://www.unicode.org/Public/UCA/latest/allkeys.txt

See also:
BUG#34130 incorrect french order in utf8_unicode_ci