WL#3770: Unicode-compliant comparison and sorting of combining characters

Affects: Server-7.0 — Status: Assigned

Description
High Level Architecture

We should add support for combining marks, according to the Unicode standard.

The combining marks are intended to modify other characters.
The most common examples of combining characters are the combining
diacritical marks (combining accents).

The combining marks code ranges
-------------------------------
* 0300–036F Combining Diacritical Marks 
* 1DC0–1DFF Combining Diacritical Marks Supplement 
* 20D0–20FF Combining Diacritical Marks for Symbols 
* FE20–FE2F Combining Half Marks 
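
As an illustration only, the ranges listed above can be expressed as a simple lookup. This is a minimal Python sketch, not server code; the names `COMBINING_MARK_RANGES` and `in_combining_block` are hypothetical.

```python
# Inclusive code point ranges, copied from the list above.
COMBINING_MARK_RANGES = [
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1DC0, 0x1DFF),  # Combining Diacritical Marks Supplement
    (0x20D0, 0x20FF),  # Combining Diacritical Marks for Symbols
    (0xFE20, 0xFE2F),  # Combining Half Marks
]


def in_combining_block(code_point):
    """Return True if code_point lies in one of the listed blocks."""
    return any(lo <= code_point <= hi for lo, hi in COMBINING_MARK_RANGES)


print(in_combining_block(0x030C))  # COMBINING CARON
print(in_combining_block(0x0063))  # LATIN SMALL LETTER C
```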

The affected character sets
---------------------------
All character sets having combining marks in their repertoire.
As of version 5.1, only two character sets seem to have the
combining marks:
* utf8
* ucs2


The affected collations
-----------------------
At least all UCA-based collations with tailorings
that change the order of accented letters.

For example, in utf8_czech_ci and ucs2_czech_ci:

"0063 LATIN SMALL LETTER C"  <  "010D LATIN SMALL LETTER C WITH CARON"


Problems in version 5.1 and earlier
-----------------------------------
This works correctly with the precomposed form 010D:

mysql> select _ucs2 0x0063 < _ucs2 0x010D collate ucs2_czech_ci;
+---------------------------------------------------+
| _ucs2 0x0063 < _ucs2 0x010D collate ucs2_czech_ci |
+---------------------------------------------------+
|                                                 1 |
+---------------------------------------------------+
1 row in set (0.00 sec)

However, it does not work with the decomposed form:

"0063 030C LATIN SMALL LETTER C + COMBINING CARON":

mysql> select _ucs2 0x0063 < _ucs2 0x0063030C collate ucs2_czech_ci;
+-------------------------------------------------------+
| _ucs2 0x0063 < _ucs2 0x0063030C collate ucs2_czech_ci |
+-------------------------------------------------------+
|                                                     0 |
+-------------------------------------------------------+
1 row in set (0.00 sec)

More importantly, the precomposed and decomposed
forms of the same character are not equal to each other:

mysql> select _ucs2 0x010D = _ucs2 0x0063030C collate ucs2_czech_ci;
+-------------------------------------------------------+
| _ucs2 0x010D = _ucs2 0x0063030C collate ucs2_czech_ci |
+-------------------------------------------------------+
|                                                     0 |
+-------------------------------------------------------+
1 row in set (0.00 sec)


How it should work
------------------
This behaviour should be fixed to be Unicode compliant.
The sequence "0063 030C" should be considered
equivalent to "010D", since they are different forms of the
same character, and thus they should be equal for comparison
and for sorting on weight levels 1-3.
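
The intended equivalence can be illustrated outside the server with Unicode normalization. This is a minimal Python sketch using the standard `unicodedata` module; the helper `canonically_equal` is a hypothetical name, and comparing NFD forms is only one possible way to test canonical equivalence, not necessarily how the server would implement it.

```python
import unicodedata

precomposed = "\u010d"   # 010D LATIN SMALL LETTER C WITH CARON
decomposed = "c\u030c"   # 0063 LATIN SMALL LETTER C + 030C COMBINING CARON


def canonically_equal(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# A plain code point comparison sees two different strings ...
print(precomposed == decomposed)
# ... while the normalized comparison treats them as the same character.
print(canonically_equal(precomposed, decomposed))
```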


Combining classes
-----------------
Combining classes are important to resolve a sequence
of a letter followed by a number of combining marks.
For example,

"0063 032C 030C - LATIN SMALL LETTER C + COMBINING CARON BELOW + COMBINING CARON"

Combining class for:
"030C COMBINING CARON" is 230
"032C COMBINING CARON BELOW" is 220

The pair of combining marks 032C and 030C is "exchangeable",
because their combining classes are different and non-zero
(220 and 230). I.e. before doing the actual comparison, we should
exchange the combining marks and get this canonically equivalent sequence:

"0063 030C 032C - LATIN SMALL LETTER C + COMBINING CARON + COMBINING CARON BELOW"

Finally, "0063 030C" should be resolved as being equal to "010D".

As a result, these queries should both return TRUE:

select _ucs2 0x010D = _ucs2 0x0063032C030C collate ucs2_czech_ci;
select _ucs2 0x0063 < _ucs2 0x0063032C030C collate ucs2_czech_ci;
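
The combining classes and the reordering described in this section can be checked with Python's standard `unicodedata` module. The sketch below is illustrative only; `canonical_reorder` is a hypothetical helper. It applies the canonical ordering from the Unicode standard, which places the mark with the lower combining class first, so it produces "0063 032C 030C" rather than the exchanged form shown above. Both orders are canonically equivalent.

```python
import unicodedata


def canonical_reorder(text):
    """Stable-sort each run of combining marks by combining class.

    Characters with combining class 0 (for example base letters) end a run.
    The input is assumed to be already decomposed.
    """
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch):
            run.append(ch)
        else:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


# Combining classes of the two marks used in the example.
print(unicodedata.combining("\u030c"))  # COMBINING CARON
print(unicodedata.combining("\u032c"))  # COMBINING CARON BELOW

# Both mark orders reduce to the same canonical sequence.
print(canonical_reorder("c\u030c\u032c") == canonical_reorder("c\u032c\u030c"))

# Composition (NFC) merges 0063 with 030C into 010D although 032C sits between.
print(unicodedata.normalize("NFC", "c\u032c\u030c") == "\u010d\u032c")
```

Note that composing the sequence leaves 032C after 010D, so for the first query above to return TRUE the collation would presumably have to ignore that remaining mark at the compared weight level.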


References:
http://www.unicode.org/reports/tr10/
http://www.unicode.org/reports/tr15/
http://www.unicode.org/reports/tr44/#Canonical_Combining_Class_Values
http://en.wikipedia.org/wiki/Combining_diacritical_mark
http://en.wikipedia.org/wiki/Unicode_equivalence