The Unicode Collation Algorithm has guidelines for sorting characters. An introductory article is: "Collations" http://web.archive.org/web/20030401193640/www.dbazine.com/gulutzen1.html Our authoritative text is: "Unicode Technical Standard #10 Unicode Collation Algorithm" http://www.unicode.org/reports/tr10/ which we'll call "The UCA document". Trying to summarize in one sentence: For sort keys and indexes, compare the weights of the characters, so that collation will be possible (approximately) as the UCA document requires, with up to four levels. Multi-level comparisons work thus: Compare the level-1 weights. If the result is equal, compare the level-2 weights. If the result is still equal, compare the level-3 weights. And so on, for the number of weights in the collation. Variation is possible, for example comparing the level-3 weights before the level-2 weights. Alexander Barkov (Bar) and Sergei Golubchik believed this task (WL#896) could be in 5.0, which was the original plan. Clearly it will be later. This has some possible interest for a Japanese customer [ name deleted, see Progress Reports ] which suggested a patch for MySQL 4.1, but (as Sergei points out) they can use their own patch with 4.1 while we work toward WL#896. Sergei, Bar, and I (Peter Gulutzan) agree that the Japanese-customer patch (also known as utf8_general_cs) is good for their particular purpose but not for our usual customers. They do two passes: sort according to the MySQL official collation (which I suppose is sjis_japanese_ci), then for all values which are "equal" in that collation, compare using memcmp. This means that a level-3 difference trumps a level-2 difference, e.g. 'Aé' < 'ae'. The Japanese-customer solution is good -- these are intelligent people and the solution works -- but we merely say that our plan must fit other, non-Japanese, situations as well. WL#896 is a prerequisite for new Hungarian collations (WL#2993) and standard Japanese (WL#2555). The "allkeys.txt" file discussed in the HLS is here: http://www.unicode.org/Public/UCA/latest/allkeys.txt See also: BUG#34130 incorrect french order in utf8_unicode_ci
Issues ------ The mandatory new collations will be utf8_unicode_600_w1234, ucs2_unicode_600_w1234, utf8mb4_unicode_600_w1234, utf16_unicode_600_w1234, utf32_unicode_600_w1234. See WL#5844 for the naming convention details. They will follow exactly the allkeys.txt guideline for level-1, including "lower case first" (see below). The implementor can decide what version of allkeys.txt to use, that is, '600' is merely an assumption about Unicode 6.0.0. If WL#5704 happens first, the only new collation is unicode_600_w1234 (no need for character set names). LOWER CASE FIRST: in allkeys.txt, lower case precedes upper case. We'll follow allkeys.txt. The behaviour will not be optional, that is, we will not allow for the "caseFirst" attribute mentioned in the UCA document. It is possible to define any column with a _w123 collation. However, for conditional expressions and distinct and grouping, there is no difference between ucs2_unicode_w123 and ucs2_unicode_ci. You only see a difference with ORDER BY expression, where expression -- either because a column is defined with a _w123 collation or because there is a COLLATE clause -- has a _w123 collation. We do not expect to see two-level collations, and we do not expect to see four-level collations outside Japanese. We do expect to continue with some one-level collations. (One-level collations are the norm now, and utf8_unicode_w1 is similar to the old utf8_unicode_ci. A _w123 collation is always compatible with a _ci collation, because we will always use the _ci collation. For example, one can concatenate a _w3 value with a _ci value, the result will have a _ci collation. Bar will try compression tricks to keep the sort key length reasonable, but we realize that lots of memory is necessary. One suggestion is: make an initial assumption "multiplier = 6", try to make keys, if overflow occurs say "multiplier++" and try again or else flag the keys that will need special handling after the sort pass. Details are not part of this description but should be in low-level architecture or in a separate worklog task. Another suggestion is: since the later levels are usually only one or two bits, and the first level is less than 16 bits, we could try to store all levels in one 16-bit word, and do multiple passes with appropriate ANDs / ORs. PAD SPACE: The new collations are all "PAD SPACE". w123a ----- In _w123 collations the level-2 and level-3 weights are applicable only for ORDER BY. Variant collations which have the suffix _w123a ("weight 123 always" or "weight 123 strong") are applicable for all operations, including UNIQUE, > and = and < comparisons (but not LIKE), GROUP BY, ORDER BY. The UCA document allows the option, saying: " For comparison, most of the time a three-level strength will need to be used. In some cases, a larger number of levels will be needed, while in others - especially in searching - fewer levels will be desired." ... "Users of search algorithms should be allowed to modify the comparison strength, thus excluding differences at less significant levels. This is especially useful for searching, but can also apply to comparison." Bar points out that latin2_czech_ci already works like a _w123a collation, and the code that he already has done is for a _w123a collation, and this would be compatible with the Japanese-customer patch. The standard Japanese collation (WL#2555) is _w1234a. WEIGHT_STRING ------------- The WEIGHT_STRING function will return a binary string of weights. Comparable functions are Oracle's NLSSORT and Microsoft's tertiary_weights. This is now a separate completed task, WL#3716 WEIGHT_STRING function. See also WL#3804 WEIGHT_STRING deferred features. By default, there will be no compression, we will not implement any of the suggestion in the UCA description http://www.unicode.org/reports/tr10/tr10-12.html section 6.1 "Reducing sort key lengths". Indexing -------- Peter would be happy if we did WL#896 with no indexes, but apparently Bar has already done something, and Peter supposes that the Japanese customer won't be happy unless we have a solution with indexing. [ Note added 2011-03-04 ] Peter asked six questions, of varying quality. For one question, the answer was to refer to WL#826. For four questions, answers by Serg and/or Bar were wrong and are hereby superseded. One question was silly and is removed. Prefix indexes, which Peter wrongly called partial indexes, are disallowed for multi-level collations. For example, the following statements (which involve an implicit prefix index because InnoDB has a 767-byte limit until WL#5743 "InnoDB: Lift the limit of index key prefixes" is done) will cause an error: CREATE TABLE t1 (a VARCHAR(1000) CHARACTER SET utf8 COLLATE utf8_unicode_600_w1234); CREATE INDEX i ON t1 (a)); Result: ERROR xxxx (42000): prefix index for multi-level collation. Bar believes that we may be able to do more with prefix indexes after doing WL#5615 Prefix key changes for tricky collations. The index can be used as a covering index, because the value which will be stored is the original string, not the weight of the string. Peter objected that storing values instead of weights makes searches substantially slower (perhaps 30% - 40% if one guesses after looking at latin2_czech_cs examples). But this is an implementor decision, and Bar decided we'll store values. Expressions like "LIKE 'A_'" will not use the index for columns that have a multi-level collation. There are some problems that would have to be solved with the optimizer first. An example of the problems is: Assume a w1234 index. It's ordered thus: a A AB aC I'm searching with LIKE 'a_'. Assume that we apply a method like the one that's been used in the past for non-w1234 indexes: start with the first key that is >= 'a' (so it finds 'a'), and stop with the first key that is > 'a' (so it doesn't find 'aC'). If we applied that idea in the w1234 index, Bar explains that it should work as follows: 1. either scan all records between 'a<min><min><min>...<min>' and 'a<max><max><max>...<max>' using primary level only. where <min> - smallest primary non-ignorable character and <max> - biggest primary non-ignorable character in the collation. 2. or make like_range generate the range as follows: 'a<min><min><min>...<min>' and 'Ḁ<max><max><max>...<max>' notice capital A WITH RING BELOW in the second value - which is the greatest among all 'A's in DUCET. With this range we can use all levels to scan the range. But I think, this type of range generation is quite tricky. It will need operation which is able to: "Find a letter which is the same on the primary level and greatest on the other levels", which in the above example will convert "SMALL LETTER A" to "CAPITAL LETTER A WITH RING BELOW". Implementing this feature will need some extra tables. So I think N1 is a better solution. That means we should either support one-level search in a multi-level index under terms of WL#896, or we'll have to disallow LIKE optimization. Generally speaking, a UNIQUE index is compared for level-1 only, except for Hungarian and Japanese. We use comparison strength. So in a _w1234a index 'A' and 'a' are duplicates. In a _w1234 index 'A' and 'a' are not duplicates. Open questions -------------- - multi-level collations and InnoDB - multi-level collations and NDB - multi-level collations and partitioning Trailing spaces --------------- Trailing spaces (U+0020) will be represented at level-1. Trailing spaces will not be represented at level-2 or level-3. Trailing spaces will be represented at level-4. Email archives -------------- The contents of this task come from some emails exchanged in November 2004 between Alexander Barkov, Sergei Golubchik, and Peter Gulutzan, threads "MySQL 4.1 for [name of another Japanese customer deleted, see progress reports]" and "MySQL 4.1 case sensitive patch for [name of another Japanese customer deleted, see progress reports]". Earlier discussions are in scrum-tasks, scrum-daily, and support archives: Thread "Character sets" [January 2004] https://intranet.mysql.com/secure/mailarchive/mail.php?folder=76&mail=1080 Thread "Bar's proposed scrum todo for February" [February 2004] https://intranet.mysql.com/secure/mailarchive/mail.php?folder=78&mail=898 Thread "[req] utf8_general_cs patch" [July 2004] https://intranet.mysql.com/secure/mailarchive/mail.php?folder=12&mail=71247 Thread "Re: WL#896 Primary, Secondary and Tertiary Sorts" search in dev-private email archive "What the @!#$% is the TERTIARY_WEIGHTS() function for?" (Microsoft blogger) https://blogs.msdn.com/michkap/archive/2006/06/02/615571.aspx
Detailed design discussions have taken place via email; no further LLD needed.