Make a collation where every character has a unique primary weight, with variations of the character (caused by accents or case differences) appearing together. This task description exists because of discussion re WL#923 Store regular identifiers in upper case. It will probably be rejected if we don't find a use for it with identifiers.
Characteristics --------------- The collation has: 1. Case sensitivity. For example E <> e. 2. "Accent sensitivity". For example E <> É. 3. Accented characters following base characters. For example E < É. 4. Natural-looking order. For example E will be between D and F as in most western alphabets, and katakana PE will be near hiragana PE as in Japanese. 5. Unique weights. For example if the weight of 'E' is 555, then no other character has a weight of 555. 6. The same weight for the same character, regardless of encoding. So _ucs2 ='a' = _latin1 'a'. 7. The NO PAD characteristic. For example 'A ' <> 'A'. 8. One weight for one character. No attention to ignorables or expansions or combining characters or repeat marks. 9. Simple uppercasing. For example Upper(sharp s) = S, not SS. The rules are theoretically applicable to any character set, but our only concern is UTF8. Name ---- Some possible names are: SQL_IDENTIFIER or UTF8_ORDINAL_CS or UTF_GENERAL_CS or UTF8_GENERAL_CS_AS or ORDINAL or SOMETHING_ELSE. The prefix UTF8_ is okay and conventional but unnecessary. The collation doesn't really depend on UTF8, although some people may think it's convenient to define thus: CREATE TABLE (s1 CHAR COLLATE UTF8_GENERAL_CS); In this case we don't need to specify character set because the collation name tells us that the character set must be UTF8. The name UTF8_GENERAL_CS has been proposed before, and might be what some people expect now. However, this collation is not a case-sensitive variant of UTF8_GENERAL_CI, so such a name is confusing. The suffix _AS (Accent Sensitive) is what you might see in Microsoft SQL Server. The name ORDINAL should remind people of Microsoft's "ordinal" collations. Although we're not exactly like that because Microsoft doesn't exactly follow DUCET, we're pretty close. (Other lovely Microsoft terms are "invariant" and "lexicographic".) The name SQL_IDENTIFIER will help people see that the primary use of this collation is for identifiers of SQL objects. However, in standard SQL the SQL_IDENTIFIER collation is for the SQL_IDENTIFIER character set, a restricted subset of Unicode, which we don't support. The name SOMETHING_ELSE is a placeholder. That means: if you have a better idea, speak up now. Weight Table ------------ The weight table is a list of weights in Unicode-character-code order, such that given a character, one can return a weight. There is no table for going the other way (from weight to character), although such a thing is possible given the one-to-one relationship between characters and weights. The weight values will be the line numbers in a UCA DUCET table, such as 'allkeys.txt' http://www.unicode.org/Public/UCA/latest/allkeys.txt At time of writing the 'latest' allkeys.txt version is 5.2.0. The implementor may choose a different version. There are about 40,000 lines in allkeys.txt. Although that means the table can have 16-bit weights, the real number of Unicode characters is far greater (we are going to calculate rather than look up for characters outside allkeys.txt), and there will be new characters in future Unicode versions. So allow for 24-bit or 32-bit weights. (In the "identifier" application described in WL#923, we can disallow exotic characters so this might not be a problem.) Before producing weights, the implementor may re-order allkeys.txt to make all Variable Weighting characters sort after all ignorable characters (if we decide we do want ignorables). (Such weights are marked with '*' in allkeys.txt.) For example, the original allkeys.txt (version 5.0.0) contains: 03F6 ; [*04B3.0020.0002.03F6] # GREEK REVERSED LUNATE EPSILON SYMBOL 0482 ; [*03C0.0020.0002.0482] # CYRILLIC THOUSANDS SIGN 0488 ; [.0000.0000.0000.0488] # COMBINING CYRILLIC HUNDRED THOUSANDS SIGN 0489 ; [.0000.0000.0000.0489] # COMBINING CYRILLIC MILLIONS SIGN 055A ; [*032E.0020.0002.055A] # ARMENIAN APOSTROPHE Alexander Barkov prefers this order: 0488 ; [.0000.0000.0000.0488] # COMBINING CYRILLIC HUNDRED THOUSANDS SIGN 0489 ; [.0000.0000.0000.0489] # COMBINING CYRILLIC MILLIONS SIGN 055A ; [*032E.0020.0002.055A] # ARMENIAN APOSTROPHE 0482 ; [*03C0.0020.0002.0482] # CYRILLIC THOUSANDS SIGN 03F6 ; [*04B3.0020.0002.03F6] # GREEK REVERSED LUNATE EPSILON SYMBOL This command seems to do the right job: cat allkeys.txt |tr "*" "." |sort --key=3 To produce a weight in the table, given character X: * If X exists in allkeys.txt, then use the line number within allkeys.txt. * If X does not exist in allkeys.txt, then use the code point value + a base number which is greater than the number of lines in allkeys.txt. This is inspired by, but not the same as, the Unicode recommendation for implicit weights as described in Unicode Technical Report 10: "Any code points that are not explicitly mentioned in this table [i.e. allkeys] are given a derived collation element, as described in Section 7, Weight Derivation." http://www.unicode.org/reports/tr10/#Derived_Collation_Elements For example, just looking at allkeys.txt with an editor, one sees that LATIN SMALL LETTER U is line 8023 decimal FULLWIDTH LATIN SMALL LETTER U is line 8024 LATIN CAPITAL LETTER U is line 8041 LATIN CAPITAL LETTER U WITH CIRCUMFLEX is line 8068 So the weights are 8023, 8024, 8041, 8068. Good Things and Bad Things -------------------------- GOOD THING. To an ordinary user, '<' comparisons and '>' comparisons and ORDER BY results should look slightly less bizarre than if collation was binary. Da will still be greater than DB, but in very short lists Da will be 'near' DB. Actually this is the only serious reason that we have for making this new collation, so if users don't really think the results are natural, this whole task is dubious. GOOD THING: The utf8 ordinal collation is compatible with utf8_bin in this sense: the effect of '=' and '<>' comparisons is guaranteed to be the same, for all characters and character combinations. (The effect of '>' and '<' comparisons will not be the same.) So if string1 = string2 with utf8 ordinal collation, then string1 = string2 with utf8_bin collation, and vice versa. This is a mandatory requirement so that internal searches with utf8_bin can continue to be carried out with binary comparisons. The optimizer might be able to use this fact and change ordinal comparisons to binary comparisons in some cases. GOOD THING: The new collation might help us pacify the people who demand an "accent sensitive" collation. Peter Gulutzan thinks that those people simply don't understand all the implications, so this is not a practical usage. GOOD THING: Comparisons with the new collation should in theory be faster than comparisons with a multiple-weight collation. And besides, we don't have a multiple-weight collation yet because we haven't done WL#896. BAD THING: Combining-character sequences like A and CIRCUMFLEX must be treated as two separate characters, not as A CIRCUMFLEX. Otherwise we'd lose the utf8_bin compatibility. BAD THING: Although the collation is "ordinal", the comparison method is not fully compatible with Microsoft's Ordinal and OrdinalIgnoreCase string comparison methods. http://blogs.msdn.com/michkap/archive/2007/04/25/2273289.aspx http://msdn.microsoft.com/en-us/library/ms973919.aspx Radical Tailoring ----------------- In early discussions about this feature, there wass another suggestion. Rather than taking the line number of the DUCET table entry, take the primary weight, and add the secondary + tertiary weights to it, with some shifting to ensure uniqueness. We call this "radical tailoring". There were some flaws with the original proposal, and we found it easier to come up with numbers by using the line numbers. However, if we used radical tailoring instead, for example by taking the primary weight and adding the secondary weight to it, we'd be able to use tables made for WL#896 "Primary, Secondary and Tertiary Sorts" to simulate the ordinal collation's weight table. Possibly we gave up on radical tailoring too soon. References ---------- dev-private email thread "WL#923, Identifiers, Collation"