WL#3770: Unicode-compliant comparison and sorting of combining characters
Affects: Server-7.0
Status: Assigned
We should add support for combining marks, according to the Unicode standard. Combining marks are intended to modify other characters. The most common examples of combining characters are the combining diacritical marks (combining accents).
The combining marks code ranges
-------------------------------
* 0300–036F Combining Diacritical Marks
* 1DC0–1DFF Combining Diacritical Marks Supplement
* 20D0–20FF Combining Diacritical Marks for Symbols
* FE20–FE2F Combining Half Marks

The affected character sets
---------------------------
All character sets having combining marks in their repertoire.
As of version 5.1, only two character sets seem to have the combining marks:
* utf8
* ucs2

The affected collations
-----------------------
At least all UCA-based collations with tailorings that change the order of accented letters.
For example, in utf8_czech_ci and ucs2_czech_ci:

"0063 LATIN SMALL LETTER C" < "010D LATIN SMALL LETTER C WITH CARON"

Problems in version 5.1 and earlier
-----------------------------------
This works perfectly with the precomposed form 010D:

mysql> select _ucs2 0x0063 < _ucs2 0x010D collate ucs2_czech_ci;
+---------------------------------------------------+
| _ucs2 0x0063 < _ucs2 0x010D collate ucs2_czech_ci |
+---------------------------------------------------+
|                                                 1 |
+---------------------------------------------------+
1 row in set (0.00 sec)

However, it does not work with the decomposed form
"0063 030C LATIN SMALL LETTER C + COMBINING CARON":

mysql> select _ucs2 0x0063 < _ucs2 0x0063030C collate ucs2_czech_ci;
+-------------------------------------------------------+
| _ucs2 0x0063 < _ucs2 0x0063030C collate ucs2_czech_ci |
+-------------------------------------------------------+
|                                                     0 |
+-------------------------------------------------------+
1 row in set (0.00 sec)

And, more importantly, the precomposed and decomposed versions of the
same character are not equal to each other:

mysql> select _ucs2 0x010D = _ucs2 0x0063030C collate ucs2_czech_ci;
+-------------------------------------------------------+
| _ucs2 0x010D = _ucs2 0x0063030C collate ucs2_czech_ci |
+-------------------------------------------------------+
|                                                     0 |
+-------------------------------------------------------+
1 row in set (0.00 sec)

How it should work
------------------
This behaviour should be fixed to be Unicode compliant.
The sequence "0063 030C" should be considered equivalent to "010D",
since they are different forms of the same character. The two should
therefore be equal for comparison, and for sorting on weight levels 1-3.

Combining classes
-----------------
Combining classes are needed to resolve a sequence consisting of a letter
followed by several combining marks. For example:

"0063 032C 030C - LATIN SMALL LETTER C + COMBINING CARON BELOW + COMBINING CARON"

Combining class for:
"030C COMBINING CARON" is 230
"032C COMBINING CARON BELOW" is 220

The pair of combining marks 032C and 030C is "exchangeable" because
their combining classes differ (220 and 230): marks with different
non-zero classes do not interact, so their relative order is not significant.
I.e. before doing the actual comparison, we should exchange the combining
marks and get this sequence:

"0063 030C 032C - LATIN SMALL LETTER C + COMBINING CARON + COMBINING CARON BELOW"

And, finally, "0063 030C" should be resolved as being equal to "010D".

As a result, both of these queries should return TRUE:

select _ucs2 0x010D = _ucs2 0x0063032C030C collate ucs2_czech_ci;
select _ucs2 0x0063 < _ucs2 0x0063032C030C collate ucs2_czech_ci;

References:
http://www.unicode.org/reports/tr10/
http://www.unicode.org/reports/tr15/
http://www.unicode.org/reports/tr44/#Canonical_Combining_Class_Values
http://en.wikipedia.org/wiki/Combining_diacritical_mark
http://en.wikipedia.org/wiki/Unicode_equivalence
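The combining classes and equivalences described under "Combining classes" can be checked outside the server. The following is a minimal sketch, assuming Python's standard `unicodedata` module as a stand-in for the Unicode character database. The helper names (`combining_classes`, `canonically_equivalent`, `nfc_code_points`) are hypothetical and are not part of any server API. The sketch illustrates canonical equivalence only; it does not model collation weights.

```python
import unicodedata

# Code points taken from the examples in this worklog.
PRECOMPOSED = "\u010D"                    # LATIN SMALL LETTER C WITH CARON
DECOMPOSED = "\u0063\u030C"               # c + COMBINING CARON
WITH_CARON_BELOW = "\u0063\u032C\u030C"   # c + COMBINING CARON BELOW + COMBINING CARON


def combining_classes(text):
    """Return the canonical combining class of each code point (0 for base characters)."""
    return [unicodedata.combining(ch) for ch in text]


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def nfc_code_points(text):
    """Return the NFC form of text as a list of 4-digit hex code points."""
    return [f"{ord(ch):04X}" for ch in unicodedata.normalize("NFC", text)]


if __name__ == "__main__":
    print(combining_classes(WITH_CARON_BELOW))
    print(canonically_equivalent(PRECOMPOSED, DECOMPOSED))
    # Both orderings of the two marks normalize to the same sequence.
    print(canonically_equivalent("\u0063\u030C\u032C", WITH_CARON_BELOW))
    print(nfc_code_points(WITH_CARON_BELOW))
```

In this sketch the caron composes with the base letter even when the caron below sits between them, because the two marks have different combining classes. The normalized form still contains 032C, so the equality expected by the queries above also depends on how the collation weighs that mark, which this sketch does not cover.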
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.