WL#2673: Unicode Collation Algorithm new version
Affects: Server-5.6 — Status: Complete
For the ucs2_unicode_ci and utf8_unicode_ci collations, MySQL uses version 4.0.0 of the Unicode Collation Algorithm (UCA). It was superseded by UCA version 4.1.0, and now UCA version 5.2.0 exists. We should keep up with standard requirements.
We will use Unicode Collation Algorithm 5.2. We will also establish "Principles" and guidelines for language-specific collations including some non-Unicode collations. Principles ---------- For all collations that we make in the future: "UCA + CLDR". UCA. For the basic rules of all characters, we will use the Unicode Collation Algorithm's DUCET to find the primary weights for single characters. (UCA is available on the Unicode site for download, UCA has a clear set of rules and list of characters, UCA is not proprietary like e.g. Windows, UCA is what we've agreed about since long ago.) CLDR. For tailoring according to national or locale rules we will use the Common Locale Data Repository (CLDR). (CLDR is available on the Unicode site for download, CLDR is clearly based on official standards, CLDR seems to be more up to date than e.g. Posix, CLDR is the basis for major products like ICU, CLDR is easy to acquire and read.) Secondarily we may check intuitions of native speakers, national standards, and behaviour of other DBMSs or earlier MySQL versions. But the "UCA + CLDR" principles are more important. We will ignore the principles only if they appear to be in error according to the secondary considerations, or for major customers. The implementor will ignore UCA or CLDR principles if and only if they require a feature that MySQL does not support, or that MySQL does not wish to always support. For example, at time of writing, we do not have secondary/tertiary weights or combining-character support. The principles apply to all MySQL character sets, including non-Unicode character sets. For 8-bit character sets with 8-bit weights, there are special rules and restrictions as described in a later section. Collation names =============== We had a choice between changing current collations (making upgrades harder) or adding new collations (making many new collations which are incompatible with current ones). Due to bad experience with "Sharp S", we now believe that we should always try to avoid changing current collations. The proposal is: add new collations. New collation names will use this name style notation: - utf8_unicode_520_ci - utf8_polish_520_ci - utf8_czech_520_ci and so on, where "520" indicates Unicode version. Language collations =================== WL#2673 will add only _unicode_520_ci collations. Language collations will be implemented in separate worklog tasks. utf8_general_ci never utf8_bin never utf8_unicode_520_ci now utf8_icelandic_520_ci later utf8_latvian_520_ci later utf8_romanian_520_ci later utf8_slovenian_520_ci later utf8_polish_520_ci later utf8_estonian_520_ci later utf8_spanish_520_ci later utf8_swedish_520_ci later utf8_turkish_520_ci later utf8_czech_520_ci later utf8_danish_520_ci later utf8_lithuanian_520_ci later utf8_slovak_520_ci later utf8_spanish2_520_ci later utf8_roman_520_ci later (grumble grumble ...) utf8_persian_520_ci later utf8_esperanto_520_ci later utf8_hungarian_520_ci later utf8_sinhala_520_ci later Contractions ============ As of version 5.5, MySQL supports only contractions consisting of two ASCII letters a-zA-Z. This WL will also make it possible to use contractions between any two letters (i.e. two characters that Unicode describes as letters). This will make possible the addition of some missing collations, for example Croatian which requires "Dž" to be a contraction. All Unicode character sets ========================== We will support all the Unicode character sets. So the new collations are: ucs2_unicode_520_ci, utf8_unicode_520_ci, utf16_unicode_520_ci, utf32_unicode_520_ci, utf8mb4_unicode_520_ci. The utf8 character set has an alias 'utf8mb3'. However, there will be no 'utf8mb3_unicode_520_ci' alias for 'utf8_unicode_520_ci'. non-BMP characters ================== The existing implementations of UCA collations sort all non-BMP characters all together. The new UCA collations will use real weights for non-BMP characters, from Unicode-5.2.0 allkeys.txt. LDML ==== LDML will recognize an optional UCA version attribute when creating a user defined collation: This definition will create a 5.2.0 based collation:
These definitions will create 4.0.0 based collations: Case folding in the new collations ================================== LOWER() and UPPER() functions should also follow the new 5.2.0 rules. For example, Unicode-5.0.0 introduces these characters: 2C00;GLAGOLITIC CAPITAL LETTER AZU;Lu;0;L;;;;;N;;;;2C30; 2C30;GLAGOLITIC SMALL LETTER AZU;Ll;0;L;;;;;N;;;2C00;;2C00 Source: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt These code points were unassigned in Unicode-4.0.0. Therefore in MySQL 5.5 we get: mysql> select hex(lower(_ucs2 0x2c00 collate ucs2_unicode_ci)); +--------------------+ | hex(lower(0x2c00)) | +--------------------+ | 2C00 | +--------------------+ 1 row in set (0.00 sec) But in the new version we expect to get mysql> select hex(lower(_ucs2 0x2c00 collate ucs2_unicode_500_ci)); +--------------------+ | hex(lower(0x2c00)) | +--------------------+ | 2C30 | +--------------------+ 1 row in set (0.00 sec) Indeed collation affects folding, as already happens with Turkish I. For any Unicode collation, the results for LOWER() should match the recommendation for simple case folding in http://unicode.org/Public/UNIDATA/CaseFolding.txt Case folding in _general_ci and _bin collations =============================================== There are a few choices what to do with case folding in the old Unicode collations _general_ci and _bin collations. 1. Don't change case folding in the old collations. WL#2673 will add new _general_520_ci and _520_bin collations, with Unicode-5.2.0 case folding rules. 2. WL#2673 will change case folding rules in the existing _general_ci and _bin collations to conform to Unicode-5.2.0 rules. 3. WL#2673 will neither change the existing case folding rules in _general_ci and _bin collations, nor add new _general_520_ci and _500_bin collations with Unicode-5.2.0 folding rules. This will mean that we don't have "fresh" _general_ci and _bin collations. We'll go the third way. Applying the algorithm in non-Unicode character sets ==================================================== We will apply UCA rules to non-Unicode character sets. For example "latin1_swedish_520_ci" will use the UCA algorithm plus Swedish tailoring, and incidentally fix some problems that exist in latin1_swedish_ci. See also the comments on BUG#46659. This is a separate worklog task, WL#5170 Swedish collation. Simple collation ---------------- A simple collation has one-to-one mappings for each of the possible byte values, with no allowance for ignorables or expansions. So most 8-bit collations will differ from the Unicode collations: (1) The ignorables will have weights less than space. (2) The expansions will stop after the first letter. But for all characters a simple collation has the same ordering as a Unicode collation, with UCA rules. Expansions actually depend on character type: (a) For a letter, take the first letter of the expansion, and the weight is equal to that. For example, SHARP S expands to 'SS'. The DUCET UCA 5.2 weight of 'S' is 1410. We want SHARP S to be equal to 'S'. Therefore the weight of SHARP S is 1410. All this means for a simple collation is; S = SHARP S. Letters subject to expansion, if there's no tailoring, include: LATIN SMALL LETTER SHARP S, LATIN SMALL LIGATURE OE, LATIN CAPITAL LIGATURE OE, LATIN CAPITAL LETTER AE, LATIN SMALL LETTER AE. (b) For a non-letter, take the first letter of the expansion, and the weight is just after that. For example TRADE MARK SIGN expands to 'TM'. Stop after the first letter of expansion, 'T'. The DUCET UCA 5.2 weight of 'T' is 1433. We want TRADE MARK SIGN to be just after 'T'. Therefore the weight of TRADE MARK SIGN is 1434. All this means for a simple collation is: T < TRADE MARK SIGN < U. Non-letters subject to exspansion, if there's no tailoring, include: HORIZONTAL ELLIPSIS, VULGAR FRACTION ONE QUARTER, VULGAR FRACTION ONE HALF, VULGAR FRACTION THREE QUARTERS, TRADE MARK SIGN. The ignorable characters, which for simple collations will always be the control characters, will be sorted before space, thus: Characters which are ignorable should be in their code point order before 'space'. (These characters are 0x00 to 0x1f, 0x7f, 0x81, 0x8d, and 0x8f,). Mr Barkov's proposition is: 8-bit characters from 0x00 to 0x1f -- including 0x00! -- will have weights equal to encodings, and weights for 0x7f,81,8d,8f,90,9d will be 0x20, 0x21, 0x22, 0x23, 0x24, 0x25 respectively. So, in a non-tailored simple collation, these are the code points and weights for the characters as far as SPACE: 8bit ucs2 weight name 00 0000 ! 0000 (control) NULL 01 0001 ! 0001 (control) 02 0002 ! 0002 (control) 03 0003 ! 0003 (control) 04 0004 ! 0004 (control) 05 0005 ! 0005 (control) 06 0006 ! 0006 (control) 07 0007 ! 0007 (control) 08 0008 ! 0008 (control) BACKSPACE 0e 000e !*000e (control) 0f 000f !*010f (control) 10 0010 ! 0010 (control) 11 0011 ! 0011 (control) 12 0012 ! 0012 (control) 13 0013 ! 0013 (control) 14 0014 ! 0014 (control) 15 0015 ! 0015 (control) 16 0016 ! 0016 (control) 17 0017 ! 0017 (control) 18 0018 ! 0018 (control) 19 0019 !*0019 (control) 1a 001a !*001a (control) 1b 001b !*001b (control) 1c 001c !*001c (control) 1d 001d !*001d (control) 1e 001e !*001e (control) 1f 001f !*001f (control) 7f 007F !*0020 DELETE 81 0081 !*0021 (control) 8d 008d ! 0022 PARTIAL LINE FEED 8f 008f ! 0023 SINGLE SHIFT THREE 90 0090 ! 0024 DEVICE CONTROL STRING 9d 009d ! 0025 OPERATING SYSTEM COMMAND 09 0009 !*0201 (control) HORIZONTAL TABULATION 0a 000a !*0202 (control) LINE FEED 0b 000b !*0203 (control) VERTICAL TABULATION 0c 000c !*0204 (control) FORM FEED 0d 000d !*0205 (control) CARRIAGE RETURN 20 0020 *020a SPACE a0 00a0 *020A NO-BREAK SPACE These rules are arbitrary. If somebody wants to object, because they want more uniqueness or because they have some intuition about what makes sense or whatever, they should object now. It's too late to object about any collation once it's in a GA version of MySQL. Exceptions: For 8-bit-character-set collations which are nearly 'simple' but contain a few expansions or contractions, we will allow exceptions to the simple rules for letters, only. For example, with german2, we will have one-to-two expansion for Æ Ä Œ Ö Ü ß, but not for HORIZONTAL ELLIPSIS or for other non-letters. See also WL#5210. Background ---------- (This explanation from the original worklog text explains why a change to UCA 4.1 is necessary.) Background: BUG#11369 was a complaint about a change that occurred due to a Unicode committee vote in 2004 (IBM and Oracle in favour, Microsoft and Sybase against). The version-4.1 UCA changes include: - Support for new Unicode 4.1.0 characters - Change in weights for existing characters 0C6 LATIN CAPITAL LETTER AE 01FC LATIN CAPITAL LETTER AE WITH ACUTE 01E2 LATIN CAPITAL LETTER AE WITH MACRON 00D8 LATIN CAPITAL LETTER O WITH STROKE 01FE LATIN CAPITAL LETTER O WITH STROKE AND ACUTE 0110 LATIN CAPITAL LETTER D WITH STROKE 0141 LATIN CAPITAL LETTER L WITH STROKE 013F LATIN CAPITAL LETTER L WITH MIDDLE DOT 00D0 LATIN CAPITAL LETTER ETH 0126 LATIN CAPITAL LETTER H WITH STROKE 0600 Arabic number sign and similar characters - Addition of contractions for Thai For example, the weight change means that "L with stroke" was formerly greater than "L" (as in Polish), but now it's equal to "L" (as in any non-Polish directory where a Polish name might appear). This change would affect the Unicode collations in MySQL -- ucs2_unicode_ci, utf8_unicode_ci, utf16_unicode_ci, utf8mb3_unicode_ci -- and all collations which are tailorings of the Unicode collations, except the locale-specific collations ucs2_polish_ci, utf8_polish_ci, utf16_polish_ci, utf8mb3_polish_ci. The general collations -- ucs2_general_ci, utf8_general_ci, utf16_general_ci, utf8mb3_general_ci -- will not be regarded as tailorings of the Unicode collations. They will not be affected by the change in weights for existing characters. They may be affected by the support for new characters, but the effects do not matter, just do what's convenient. Changes in Unicode 5 -------------------- - The announcement about 5.0 http://www.unicode.org/versions/Unicode5.0.0/ doesn't indicate that UCA has changed for existing characters. However, some primary-key weights differ, as these three lines (from Unicode 4.00, 4.1.0, 5.0.0) show (for LATIN SMALL LETTER L WITH STROKE): ftp://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt 0142 ; [.0F36.0020.0002.0142] ftp://www.unicode.org/Public/UCA/4.1.0/allkeys.txt 0142 ; [.1077.0020.0002.006C][.0000.008D.0002.0335] ftp://www.unicode.org/Public/UCA/5.0.0/allkeys.txt 0142 ; [.10E5.0020.0002.006C][.0000.008D.0002.0335] References ========== The 5.2.0 UCA DUCET: http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt - See the example in BUG#11369, and the correspondence in dev-bugs or dev-public with subject heading "BUG#11369 [Ver->Doc]: Unicode ucs2_unicode_ci collation isn't following Unicode guidelin". - See Bar's comments about UCA in this docs-private email: [ mysql internal intranet ] /secure/mailarchive/mail.php?folder=101&mail=6639 - This task is on the roadmap for version 6.x [ mysql internal intranet 'inside' ] /wiki/MySQL_Server_Future. - A question about "L with stroke" appeared on a MySQL forum: http://forums.mysql.com/read.php?103,115664,115664#msg-115664 - And BUG#27877 incorrect german order in utf8_general_ci http://bugs.mysql.com/bug.php?id=27877 said utf8_general_ci should be fixed: "SHARP S" should be changed to be equal to "s", like the manual says. - WL#4475 Unicode 5.1. - Thread in dev-private email "Re: WL#1213 and Glagolitic Azu" [ mysql internal intranet ] /secure/mailarchive/mail.php?folder=4&mail=20614
Copyright (c) 2000, 2022, Oracle Corporation and/or its affiliates. All rights reserved.