WL#2673: Unicode Collation Algorithm new version

Affects: Server-5.6 — Status: Complete

Description
High Level Architecture

For the ucs2_unicode_ci and utf8_unicode_ci collations, 
MySQL uses version 4.0.0 of the Unicode Collation Algorithm 
(UCA). It was superseded by UCA version 4.1.0, and now
UCA version 5.2.0 exists. We should keep up with 
standard requirements.

We will use Unicode Collation Algorithm 5.2.

We will also establish "Principles" and
guidelines for language-specific collations
including some non-Unicode collations.

Principles
----------

For all collations that we make in the future:
"UCA + CLDR".

UCA. For the basic rules of all characters, we will use
the Unicode Collation Algorithm's DUCET to find the
primary weights for single characters.
(UCA is available on the Unicode site for download,
UCA has a clear set of rules and list of characters,
UCA is not proprietary like e.g. Windows,
UCA is what we've agreed about since long ago.)

CLDR. For tailoring according to national or locale rules
we will use the Common Locale Data Repository (CLDR).
(CLDR is available on the Unicode site for download,
CLDR is clearly based on official standards,
CLDR seems to be more up to date than e.g. Posix,
CLDR is the basis for major products like ICU,
CLDR is easy to acquire and read.)

Secondarily we may check intuitions of native speakers,
national standards, and behaviour of other DBMSs or
earlier MySQL versions. But the "UCA + CLDR" principles
are more important. We will ignore the principles only
if they appear to be in error according to the secondary
considerations, or for major customers.

The implementor will ignore UCA or CLDR principles if and only
if they require a feature that MySQL does not support, or
that MySQL does not wish to always support. For example,
at time of writing, we do not have secondary/tertiary weights
or combining-character support.

The principles apply to all MySQL character sets,
including non-Unicode character sets.
For 8-bit character sets with 8-bit weights, there are
special rules and restrictions as described in a later section.

Collation names
===============
We had a choice between changing current collations
(making upgrades harder) or adding new collations (making
many new collations which are incompatible with current ones).
Due to bad experience with "Sharp S", we now believe that
we should always try to avoid changing current collations.

The proposal is: add new collations.
New collation names will use this name style notation:

- utf8_unicode_520_ci
- utf8_polish_520_ci
- utf8_czech_520_ci
and so on, where "520" indicates Unicode version.


Language collations
===================
WL#2673 will add only _unicode_520_ci collations.

Language collations will be implemented in separate
worklog tasks.

utf8_general_ci                         never
utf8_bin                                never
utf8_unicode_520_ci                     now
utf8_icelandic_520_ci                   later
utf8_latvian_520_ci                     later
utf8_romanian_520_ci                    later
utf8_slovenian_520_ci                   later
utf8_polish_520_ci                      later
utf8_estonian_520_ci                    later
utf8_spanish_520_ci                     later
utf8_swedish_520_ci                     later
utf8_turkish_520_ci                     later
utf8_czech_520_ci                       later
utf8_danish_520_ci                      later
utf8_lithuanian_520_ci                  later
utf8_slovak_520_ci                      later
utf8_spanish2_520_ci                    later
utf8_roman_520_ci                       later (grumble grumble ...)
utf8_persian_520_ci                     later
utf8_esperanto_520_ci                   later
utf8_hungarian_520_ci                   later
utf8_sinhala_520_ci                     later

Contractions
============
As of version 5.5, MySQL supports only
contractions consisting of two ASCII letters a-zA-Z.
This WL will also make it possible to use contractions
between any two letters (i.e. two characters that Unicode
describes as letters). This will make possible
the addition of some missing collations, for example Croatian
which requires "Dž" to be a contraction.

All Unicode character sets
==========================

We will support all the Unicode character sets.
So the new collations are: ucs2_unicode_520_ci,
utf8_unicode_520_ci, utf16_unicode_520_ci,
utf32_unicode_520_ci, utf8mb4_unicode_520_ci.

The utf8 character set has an alias 'utf8mb3'.
However, there will be no 'utf8mb3_unicode_520_ci'
alias for 'utf8_unicode_520_ci'.

non-BMP characters
==================
The existing implementations of UCA collations sort all
non-BMP characters all together. The new UCA collations
will use real weights for non-BMP characters,
from Unicode-5.2.0 allkeys.txt.

LDML
====
LDML will recognize an optional UCA version attribute
when creating a user defined collation:

This definition will create a 5.2.0 based collation:




These definitions will create 4.0.0 based collations:






Case folding in the new collations
==================================
LOWER() and UPPER() functions should also follow the new 5.2.0 rules.

For example, Unicode-5.0.0 introduces these characters:

2C00;GLAGOLITIC CAPITAL LETTER AZU;Lu;0;L;;;;;N;;;;2C30;
2C30;GLAGOLITIC SMALL LETTER AZU;Ll;0;L;;;;;N;;;2C00;;2C00
Source: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

These code points were unassigned in Unicode-4.0.0. 
Therefore in MySQL 5.5 we get:

mysql> select hex(lower(_ucs2 0x2c00 collate ucs2_unicode_ci));
+--------------------+
| hex(lower(0x2c00)) |
+--------------------+
| 2C00               |
+--------------------+
1 row in set (0.00 sec)

But in the new version we expect to get

mysql> select hex(lower(_ucs2 0x2c00 collate ucs2_unicode_500_ci));
+--------------------+
| hex(lower(0x2c00)) |
+--------------------+
| 2C30               |
+--------------------+
1 row in set (0.00 sec)

Indeed collation affects folding, as already happens with Turkish I.

For any Unicode collation, the results for LOWER() should match
the recommendation for simple case folding in
http://unicode.org/Public/UNIDATA/CaseFolding.txt


Case folding in _general_ci and _bin collations
===============================================
There are a few choices what to do with case folding
in the old Unicode collations _general_ci and _bin
collations.

1. Don't change case folding in the old collations.
WL#2673 will add new _general_520_ci and _520_bin collations,
with Unicode-5.2.0 case folding rules.

2. WL#2673 will change case folding rules in the existing
_general_ci and _bin collations to conform to Unicode-5.2.0 rules.

3. WL#2673 will neither change the existing case folding rules
in _general_ci and _bin collations, nor add new _general_520_ci
and _500_bin collations with Unicode-5.2.0 folding rules.
This will mean that we don't have "fresh" _general_ci and _bin
collations.

We'll go the third way.

Applying the algorithm in non-Unicode character sets
====================================================

We will apply UCA rules to non-Unicode character sets.
For example "latin1_swedish_520_ci" will use the UCA
algorithm plus Swedish tailoring, and incidentally fix
some problems that exist in latin1_swedish_ci.
See also the comments on BUG#46659.
This is a separate worklog task, WL#5170 Swedish collation.

Simple collation
----------------

A simple collation has one-to-one mappings for
each of the possible byte values, with no allowance
for ignorables or expansions. So most 8-bit
collations will differ from the Unicode collations:
(1) The ignorables will have weights less than space.
(2) The expansions will stop after the first letter.
But for all characters a simple collation has the
same ordering as a Unicode collation, with UCA rules.

Expansions actually depend on character type:
(a) For a letter, take the first letter of the
    expansion, and the weight is equal to that.
    For example, SHARP S expands to 'SS'.
    The DUCET UCA 5.2 weight of 'S' is 1410.
    We want SHARP S to be equal to 'S'.
    Therefore the weight of SHARP S is 1410.
    All this means for a simple collation is;
    S = SHARP S. Letters subject to
    expansion, if there's no tailoring, include:
    LATIN SMALL LETTER SHARP S, LATIN SMALL LIGATURE OE,
    LATIN CAPITAL LIGATURE OE, LATIN CAPITAL LETTER AE,
    LATIN SMALL LETTER AE.
(b) For a non-letter, take the first letter of the
    expansion, and the weight is just after that.
    For example TRADE MARK SIGN expands to 'TM'.
    Stop after the first letter of expansion, 'T'.
    The DUCET UCA 5.2 weight of 'T' is 1433.
    We want TRADE MARK SIGN to be just after 'T'.
    Therefore the weight of TRADE MARK SIGN is 1434.
    All this means for a simple collation is:
    T < TRADE MARK SIGN < U. Non-letters subject to
    exspansion, if there's no tailoring, include:
    HORIZONTAL ELLIPSIS, VULGAR FRACTION ONE QUARTER,
    VULGAR FRACTION ONE HALF, VULGAR FRACTION THREE QUARTERS,
    TRADE MARK SIGN.

The ignorable characters, which for simple collations will
always be the control characters, will be sorted before space, thus:

Characters which are ignorable should be in their code
point order before 'space'. (These characters are 0x00 to 0x1f,
0x7f, 0x81, 0x8d, and 0x8f,). Mr Barkov's proposition is:
8-bit characters from 0x00 to 0x1f -- including 0x00! -- will have
weights equal to encodings, and weights for 0x7f,81,8d,8f,90,9d
will be 0x20, 0x21, 0x22, 0x23, 0x24, 0x25 respectively.
So, in a non-tailored simple collation, these are the code points
and weights for the characters as far as SPACE:
8bit   ucs2 weight   name
00     0000 ! 0000   (control) NULL
01     0001 ! 0001   (control)
02     0002 ! 0002   (control)
03     0003 ! 0003   (control)
04     0004 ! 0004   (control)
05     0005 ! 0005   (control)
06     0006 ! 0006   (control)
07     0007 ! 0007   (control)
08     0008 ! 0008   (control) BACKSPACE
0e     000e !*000e   (control)
0f     000f !*010f   (control)
10     0010 ! 0010   (control)
11     0011 ! 0011   (control)
12     0012 ! 0012   (control)
13     0013 ! 0013   (control)
14     0014 ! 0014   (control)
15     0015 ! 0015   (control)
16     0016 ! 0016   (control)
17     0017 ! 0017   (control)
18     0018 ! 0018   (control)
19     0019 !*0019   (control)
1a     001a !*001a   (control)
1b     001b !*001b   (control)
1c     001c !*001c   (control)
1d     001d !*001d   (control)
1e     001e !*001e   (control)
1f     001f !*001f   (control)
7f     007F !*0020   DELETE
81     0081 !*0021   (control)
8d     008d ! 0022   PARTIAL LINE FEED
8f     008f ! 0023   SINGLE SHIFT THREE
90     0090 ! 0024   DEVICE CONTROL STRING
9d     009d ! 0025   OPERATING SYSTEM COMMAND
09     0009 !*0201   (control) HORIZONTAL TABULATION
0a     000a !*0202   (control) LINE FEED
0b     000b !*0203   (control) VERTICAL TABULATION
0c     000c !*0204   (control) FORM FEED
0d     000d !*0205   (control) CARRIAGE RETURN
20     0020  *020a   SPACE
a0     00a0  *020A   NO-BREAK SPACE

These rules are arbitrary. If somebody wants to object,
because they want more uniqueness or because they have
some intuition about what makes sense or whatever, they
should object now. It's too late to object about any
collation once it's in a GA version of MySQL.

Exceptions: For 8-bit-character-set collations which
are nearly 'simple' but contain a few
expansions or contractions, we will allow
exceptions to the simple rules for letters, only.
For example, with german2, we will have one-to-two expansion
for Æ Ä Œ Ö Ü ß, but not for HORIZONTAL ELLIPSIS
or for other non-letters. See also WL#5210. 

Background
----------

(This explanation from the original worklog text
explains why a change to UCA 4.1 is necessary.)

Background: BUG#11369 was a complaint about a change that occurred  
due to a Unicode committee vote in 2004 (IBM and Oracle in favour,  
Microsoft and Sybase against). The version-4.1 UCA changes include:   
- Support for new Unicode 4.1.0 characters  
- Change in weights for existing characters
  0C6 LATIN CAPITAL LETTER AE  
  01FC LATIN CAPITAL LETTER AE WITH ACUTE  
  01E2  LATIN CAPITAL LETTER AE WITH MACRON  
  00D8 LATIN CAPITAL LETTER O WITH STROKE  
  01FE LATIN CAPITAL LETTER O WITH STROKE AND ACUTE  
  0110 LATIN CAPITAL LETTER D WITH STROKE  
  0141 LATIN CAPITAL LETTER L WITH STROKE  
  013F LATIN CAPITAL LETTER L WITH MIDDLE DOT  
  00D0 LATIN CAPITAL LETTER ETH  
  0126 LATIN CAPITAL LETTER H WITH STROKE   
  0600 Arabic number sign and similar characters  
- Addition of contractions for Thai
  
For example, the weight change means that "L with stroke"  
was formerly greater than "L" (as in Polish), but now it's  
equal to "L" (as in any non-Polish directory where a Polish  
name might appear). This change would affect the  
Unicode collations in MySQL -- ucs2_unicode_ci, utf8_unicode_ci,
utf16_unicode_ci, utf8mb3_unicode_ci -- and all collations
which are tailorings of the Unicode collations,
except the locale-specific collations ucs2_polish_ci,
utf8_polish_ci, utf16_polish_ci, utf8mb3_polish_ci.

The general collations -- ucs2_general_ci, utf8_general_ci,
utf16_general_ci, utf8mb3_general_ci -- will not be regarded
as tailorings of the Unicode collations. They will not be
affected by the change in weights for existing characters.
They may be affected by the support for new characters,
but the effects do not matter, just do what's convenient.

Changes in Unicode 5
--------------------

- The announcement about 5.0
http://www.unicode.org/versions/Unicode5.0.0/
doesn't indicate that UCA has changed for existing
characters. However, some primary-key weights differ,
as these three lines (from Unicode 4.00, 4.1.0, 5.0.0)
show (for LATIN SMALL LETTER L WITH STROKE):
ftp://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt
0142  ; [.0F36.0020.0002.0142]

ftp://www.unicode.org/Public/UCA/4.1.0/allkeys.txt
0142  ; [.1077.0020.0002.006C][.0000.008D.0002.0335]

ftp://www.unicode.org/Public/UCA/5.0.0/allkeys.txt
0142  ; [.10E5.0020.0002.006C][.0000.008D.0002.0335]

References
==========

The 5.2.0 UCA DUCET:
http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt

- See the example in BUG#11369, and the correspondence in dev-bugs  
or dev-public with subject heading  
"BUG#11369 [Ver->Doc]: Unicode ucs2_unicode_ci collation  
isn't following Unicode guidelin".
 
- See Bar's comments about UCA in this docs-private email: 
[ mysql internal intranet ] /secure/mailarchive/mail.php?folder=101&mail=6639 

- This task is on the roadmap for version 6.x
[ mysql internal intranet 'inside' ] /wiki/MySQL_Server_Future.

- A question about "L with stroke" appeared on a MySQL forum:
http://forums.mysql.com/read.php?103,115664,115664#msg-115664

- And BUG#27877 incorrect german order in utf8_general_ci
http://bugs.mysql.com/bug.php?id=27877
said utf8_general_ci should be fixed: "SHARP S" should be
changed to be equal to "s", like the manual says.

- WL#4475 Unicode 5.1.

- Thread in dev-private email "Re: WL#1213 and Glagolitic Azu"
[ mysql internal intranet ] /secure/mailarchive/mail.php?folder=4&mail=20614