WL#5272: Ordinal collation

Affects: Server-Prototype Only — Status: Un-Assigned

Description
High Level Architecture

Make a collation where every character has a unique
primary weight, with variations of the character
(caused by accents or case differences) appearing
together.

This task description exists because of discussion re
WL#923 Store regular identifiers in upper case.
It will probably be rejected if we don't find a
use for it with identifiers.

Characteristics
---------------

The collation has:
1. Case sensitivity. For example E <> e.
2. "Accent sensitivity". For example E <> É.
3. Accented characters following base characters. For example E < É.
4. Natural-looking order. For example E will be between
   D and F as in most western alphabets, and katakana PE
   will be near hiragana PE as in Japanese.
5. Unique weights. For example if the weight of 'E'
   is 555, then no other character has a weight of 555.
6. The same weight for the same character, regardless
   of encoding. So _ucs2 ='a' = _latin1 'a'.
7. The NO PAD characteristic.
   For example 'A ' <> 'A'.
8. One weight for one character.
   No attention to ignorables or expansions or combining
   characters or repeat marks.
9. Simple uppercasing.
   For example Upper(sharp s) = S, not SS.

The rules are theoretically applicable to any
character set, but our only concern is UTF8.

Name
----

Some possible names are:
SQL_IDENTIFIER 
or UTF8_ORDINAL_CS
or UTF_GENERAL_CS
or UTF8_GENERAL_CS_AS
or ORDINAL
or SOMETHING_ELSE.

The prefix UTF8_ is okay and conventional but unnecessary.
The collation doesn't really depend on UTF8, although some
people may think it's convenient to define thus:
CREATE TABLE (s1 CHAR COLLATE UTF8_GENERAL_CS);
In this case we don't need to specify character set
because the collation name tells us that the character
set must be UTF8.

The name UTF8_GENERAL_CS has been proposed before,
and might be what some people expect now.
However, this collation is not a case-sensitive
variant of UTF8_GENERAL_CI, so such a name is confusing.

The suffix _AS (Accent Sensitive) is what you might
see in Microsoft SQL Server.

The name ORDINAL should remind people of Microsoft's
"ordinal" collations.
Although we're not exactly like that because Microsoft
doesn't exactly follow DUCET, we're pretty close.
(Other lovely Microsoft terms are "invariant" and "lexicographic".)

The name SQL_IDENTIFIER will help people see that
the primary use of this collation is for identifiers
of SQL objects. However, in standard SQL the
SQL_IDENTIFIER collation is for the SQL_IDENTIFIER
character set, a restricted subset of Unicode, which
we don't support.

The name SOMETHING_ELSE is a placeholder. That means:
if you have a better idea, speak up now.

Weight Table
------------

The weight table is a list of weights in
Unicode-character-code order, such that
given a character, one can return a weight.

There is no table for going the other way
(from weight to character), although such
a thing is possible given the one-to-one
relationship between characters and weights.

The weight values will be the line numbers in
a UCA DUCET table, such as 'allkeys.txt'
http://www.unicode.org/Public/UCA/latest/allkeys.txt
At time of writing the 'latest' allkeys.txt version
is 5.2.0. The implementor may choose a different version.

There are about 40,000 lines in allkeys.txt.
Although that means the table can have 16-bit weights,
the real number of Unicode characters is far greater
(we are going to calculate rather than look up for
characters outside allkeys.txt), and there will be
new characters in future Unicode versions.
So allow for 24-bit or 32-bit weights.
(In the "identifier" application described in WL#923,
we can disallow exotic characters so this might not
be a problem.)

Before producing weights, the implementor may re-order
allkeys.txt to make all Variable Weighting characters sort
after all ignorable characters (if we decide we do want ignorables).
(Such weights are marked with '*' in allkeys.txt.)
For example, the original allkeys.txt (version 5.0.0) contains:
    03F6  ; [*04B3.0020.0002.03F6] # GREEK REVERSED LUNATE EPSILON SYMBOL
    0482  ; [*03C0.0020.0002.0482] # CYRILLIC THOUSANDS SIGN
    0488  ; [.0000.0000.0000.0488] # COMBINING CYRILLIC HUNDRED THOUSANDS SIGN
    0489  ; [.0000.0000.0000.0489] # COMBINING CYRILLIC MILLIONS SIGN
    055A  ; [*032E.0020.0002.055A] # ARMENIAN APOSTROPHE
Alexander Barkov prefers this order:
    0488  ; [.0000.0000.0000.0488] # COMBINING CYRILLIC HUNDRED THOUSANDS SIGN
    0489  ; [.0000.0000.0000.0489] # COMBINING CYRILLIC MILLIONS SIGN
    055A  ; [*032E.0020.0002.055A] # ARMENIAN APOSTROPHE
    0482  ; [*03C0.0020.0002.0482] # CYRILLIC THOUSANDS SIGN
    03F6  ; [*04B3.0020.0002.03F6] # GREEK REVERSED LUNATE EPSILON SYMBOL
This command seems to do the right job:
cat allkeys.txt |tr "*" "." |sort --key=3

To produce a weight in the table, given character X:
* If X exists in allkeys.txt,
  then use the line number within allkeys.txt.
* If X does not exist in allkeys.txt,
  then use the code point value + a base number which
  is greater than the number of lines in allkeys.txt.
  This is inspired by, but not the same as, the
  Unicode recommendation for implicit weights
  as described in Unicode Technical Report 10:
  "Any code points that are not explicitly mentioned in this
  table [i.e. allkeys] are given a derived collation element,
  as described in  Section 7, Weight Derivation."
  http://www.unicode.org/reports/tr10/#Derived_Collation_Elements

For example, just looking at allkeys.txt with an editor, one sees that
LATIN SMALL LETTER U is line 8023 decimal
FULLWIDTH LATIN SMALL LETTER U is line 8024
LATIN CAPITAL LETTER U is line 8041
LATIN CAPITAL LETTER U WITH CIRCUMFLEX is line 8068
So the weights are 8023, 8024, 8041, 8068.

Good Things and Bad Things
--------------------------

GOOD THING. To an ordinary user, '<' comparisons and
'>' comparisons and ORDER BY results should
look slightly less bizarre than if collation was binary.
Da will still be greater than DB, but in very short lists
Da will be 'near' DB. Actually this is the only serious
reason that we have for making this new collation, so if
users don't really think the results are natural, this
whole task is dubious.

GOOD THING: The utf8 ordinal collation is compatible
with utf8_bin in this sense:
the effect of '=' and '<>' comparisons is guaranteed to
be the same, for all characters and character combinations.
(The effect of '>' and '<' comparisons will not be the same.)
So if string1 = string2 with utf8 ordinal collation, then
string1 = string2 with utf8_bin collation, and vice versa.
This is a mandatory requirement so that internal searches
with utf8_bin can continue to be carried out with binary
comparisons.
The optimizer might be able to use this fact and
change ordinal comparisons to binary comparisons in some cases.

GOOD THING: The new collation might help us pacify the
people who demand an "accent sensitive" collation.
Peter Gulutzan thinks that those people simply don't
understand all the implications, so this is not a
practical usage.

GOOD THING: Comparisons with the new collation should in
theory be faster than comparisons with a multiple-weight
collation. And besides, we don't have a multiple-weight
collation yet because we haven't done WL#896.

BAD THING: Combining-character sequences like A and CIRCUMFLEX
must be treated as two separate characters, not as
A CIRCUMFLEX. Otherwise we'd lose the utf8_bin compatibility.

BAD THING: Although the collation is "ordinal", the comparison
method is not fully compatible with Microsoft's
Ordinal and OrdinalIgnoreCase string comparison methods.
http://blogs.msdn.com/michkap/archive/2007/04/25/2273289.aspx
http://msdn.microsoft.com/en-us/library/ms973919.aspx

Radical Tailoring
-----------------

In early discussions about this feature, there wass
another suggestion. Rather than taking the line number
of the DUCET table entry, take the primary weight, and
add the secondary + tertiary weights to it, with some
shifting to ensure uniqueness. We call this "radical
tailoring".

There were some flaws with the original proposal,
and we found it easier to come up with numbers by
using the line numbers. 

However, if we used radical tailoring instead, for example
by taking the primary weight and adding the secondary weight
to it, we'd be able to use tables made for WL#896 "Primary,
Secondary and Tertiary Sorts" to simulate the ordinal collation's
weight table. Possibly we gave up on radical tailoring too soon.

References
----------

dev-private email thread "WL#923, Identifiers, Collation"