WL#9108: Add language specific case insensitive collations of utf8mb4

Affects: Server-8.0 — Status: Complete

Description
Requirements
High Level Architecture

Because each language has its own collation rules, we are going to add
language-specific, case-insensitive collations for utf8mb4 so that text in
these languages collates correctly.
The languages are: icelandic, latvian, romanian, slovenian, polish, estonian,
spanish, spanish2, swedish, turkish, czech, danish, lithuanian, slovak, roman,
esperanto, hungarian, german2, croatian, vietnamese.

Persian and Sinhala collations will be added in a future WL.

F-1: The collations shall be compiled into the server.

F-2: The collations shall work for all characters in range [U+0, U+10FFFF].

F-3: The collations shall sort characters of the language correctly according to
     language specific rules.

F-4: The collations shall sort characters not belonging to the language according
     to their order in DUCET.

F-5: Characters that have no weight assigned in DUCET shall be sorted by
     their implicit weight, which is constructed as defined by UCA.

F-6: The collations shall be case and accent insensitive.

F-7: To simplify the collations, characters in a contraction sequence will be
     treated as separate characters.

How we sort characters with DUCET:

Please refer to the High-Level Specification of WL#9125. DUCET assigns a default
collation weight to about 30,000 Unicode characters. For characters that have no
weight assigned in DUCET, we derive an implicit weight with the method defined
by UCA.
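
As an illustration of the implicit-weight construction, here is a minimal
sketch following the formula in UTS #10 (primary weights AAAA and BBBB derived
from the code point). The function names are hypothetical, and the range check
in implicit_base() is a simplified approximation of the Unified_Ideograph
property, not the exact property data.

```python
def implicit_base(cp):
    """Pick the UCA base value for a code point (approximate ranges)."""
    # Assumption: core CJK Unified Ideographs block, approximated by one range.
    if 0x4E00 <= cp <= 0x9FD5:
        return 0xFB40
    # Assumption: CJK extension blocks, approximated by two ranges.
    if 0x3400 <= cp <= 0x4DB5 or 0x20000 <= cp <= 0x2CEA1:
        return 0xFB80
    # Any other code point without a DUCET entry.
    return 0xFBC0


def implicit_weights(cp):
    """Return the two implicit primary weights (AAAA, BBBB) for a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    aaaa = implicit_base(cp) + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb
```

Because AAAA grows with the upper bits of the code point and BBBB with the lower
15 bits, code points that share a base value keep their code point order.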

Language specific rules:

We usually sort by the weight data of characters, but for some languages this
does not give the correct order. For example,
String A: "headache"
String B: "cheers"
Usually we say A > B, because the weight of A's first character 'h' (0x1C93) is
greater than the weight of B's first character 'c' (0x1BF5). Under Czech rules
the result is different, because Czech has a special rule, "h < ch": the
sequence "ch" sorts after 'h', so B > A.

CLDR defines sets of rules for different languages. Our task is to modify our
collations to adopt those rules so that they work correctly for specific
languages.

Characters explicitly specified in a CLDR rule are sorted according to that
rule. Characters not specified in a CLDR rule are sorted by their weight
assigned in DUCET, or by their implicit weight. For example, when sorting Han
characters with the Croatian collation, the Han characters are collated by their
DUCET weight or implicit weight.
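
The effect of the Czech rule and of the fallback for untailored characters can
be sketched as follows. This is only an illustration: the weight table is a toy
stand-in for DUCET (not real DUCET values), the names DEFAULT_WEIGHT,
CZECH_TAILORING and sort_key are hypothetical, and the real collations may
handle the "ch" sequence differently.

```python
# Toy stand-in for DUCET primary weights: a..z in alphabetical order.
DEFAULT_WEIGHT = {ch: (i + 1) * 10
                  for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

# Hypothetical Czech tailoring: "ch" gets a weight between 'h' and 'i'.
CZECH_TAILORING = {"ch": DEFAULT_WEIGHT["h"] + 5}


def sort_key(text, tailoring=None):
    """Build a list of primary weights, preferring tailored entries."""
    tailoring = tailoring or {}
    text = text.lower()  # case insensitive
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in tailoring:
            # Two-character sequence named by the language rule.
            key.append(tailoring[pair])
            i += 2
            continue
        ch = text[i]
        if ch in tailoring:
            key.append(tailoring[ch])
        else:
            # Fallback: default weight, else a weight derived from the
            # code point (stand-in for an implicit weight).
            key.append(DEFAULT_WEIGHT.get(ch, 1000 + ord(ch)))
        i += 1
    return key


words = ["headache", "cheers"]
default_order = sorted(words, key=sort_key)
czech_order = sorted(words, key=lambda w: sort_key(w, CZECH_TAILORING))
```

With the toy weights, default_order puts "cheers" first, while czech_order puts
"headache" first because "ch" now weighs more than 'h'.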

Collations to add:
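We are going to add the following collations. Each collation name consists of:
a. the character set name: "utf8mb4"
b. the language's ISO code, for example "cs" for Czech
c. the UCA version: "800"
d. accent / case insensitivity: "ai_ci"
Where a language has more than one ordering, a variant name follows the ISO
code, as in "de_phonebook" and "es_traditional".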

The complete list is:
Collation name                      Language
------------------------------------------------------------
utf8mb4_cs_800_ai_ci                Czech
utf8mb4_da_800_ai_ci                Danish
utf8mb4_de_phonebook_800_ai_ci      German (phonebook order)
utf8mb4_eo_800_ai_ci                Esperanto
utf8mb4_es_800_ai_ci                Spanish
utf8mb4_es_traditional_800_ai_ci    Spanish (traditional)
utf8mb4_et_800_ai_ci                Estonian
utf8mb4_hr_800_ai_ci                Croatian
utf8mb4_hu_800_ai_ci                Hungarian
utf8mb4_is_800_ai_ci                Icelandic
utf8mb4_la_800_ai_ci                Roman (classical Latin)
utf8mb4_lt_800_ai_ci                Lithuanian
utf8mb4_lv_800_ai_ci                Latvian
utf8mb4_pl_800_ai_ci                Polish
utf8mb4_ro_800_ai_ci                Romanian
utf8mb4_sk_800_ai_ci                Slovak
utf8mb4_sl_800_ai_ci                Slovenian
utf8mb4_sv_800_ai_ci                Swedish
utf8mb4_tr_800_ai_ci                Turkish
utf8mb4_vi_800_ai_ci                Vietnamese
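
The naming scheme can be expressed as a small helper. This is a hypothetical
sketch for illustration only; collation_name() and its parameters are not part
of the server code.

```python
def collation_name(iso_code, variant=None, charset="utf8mb4",
                   uca_version="800", sensitivity="ai_ci"):
    """Assemble a collation name from the parts described above."""
    parts = [charset, iso_code]
    if variant:
        # Optional variant, e.g. "phonebook" or "traditional".
        parts.append(variant)
    parts += [uca_version, sensitivity]
    return "_".join(parts)


czech = collation_name("cs")
german_phonebook = collation_name("de", "phonebook")
```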

Collation Rules:
We apply the CLDR rules on top of DUCET; that is, we change a character's
weight wherever CLDR defines a specific rule for a language or for particular
characters. All the CLDR rules can be found at:
http://www.unicode.org/cldr/charts/29/collation/index.html
Alternatively, download and unzip the file
http://www.unicode.org/Public/cldr/29/core.zip, then find the rule files under
common/collation.

A language may have more than one type of rule defined. Following CLDR's
preference, we select the rule with the following procedure:
a. Select the rule named by "defaultCollation" if it is defined; for example,
   Swedish uses "reformed".
b. Otherwise, select the "standard" rule if it is defined.
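
The selection procedure can be sketched as below. The function name and the way
the rule types are passed in are assumptions made for illustration; the sketch
does not parse the CLDR files.

```python
def select_rule_type(available_types, default_collation=None):
    """Choose which CLDR rule type to use for one language.

    available_types   -- rule type names defined for the language
    default_collation -- value of "defaultCollation", or None if not defined
    """
    # Step a: prefer the type named by "defaultCollation".
    if default_collation is not None and default_collation in available_types:
        return default_collation
    # Step b: otherwise fall back to "standard".
    if "standard" in available_types:
        return "standard"
    return None


swedish = select_rule_type({"standard", "reformed"}, "reformed")
no_default = select_rule_type({"standard", "traditional"})
```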

Special Rules:
Most of the rules we apply change the sort order by changing the sorting weight
of individual characters. One rule is special: the reordering in the Croatian
collation. In the other collations we implement, no matter how character
weights are changed, Cyrillic characters sort greater than Greek characters.
The Croatian rule, however, makes all Cyrillic characters sort greater than
Latin characters, and makes all other characters (except digits, spaces,
punctuation and symbols) sort greater than Cyrillic characters.
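
A toy sketch of the reordering effect follows. It is an assumption-laden
illustration: scripts are detected with rough code point ranges, only three
scripts are modelled, and the code point is used as a stand-in for the weight
inside a script. All names are hypothetical.

```python
# Rough code point ranges per script (approximation for illustration).
SCRIPT_RANGES = [
    ("Latin", 0x0041, 0x024F),
    ("Greek", 0x0370, 0x03FF),
    ("Cyrillic", 0x0400, 0x052F),
]

DEFAULT_ORDER = ["Latin", "Greek", "Cyrillic"]    # order without reordering
CROATIAN_ORDER = ["Latin", "Cyrillic", "Greek"]   # Cyrillic moved after Latin


def script_of(ch):
    cp = ord(ch)
    for name, low, high in SCRIPT_RANGES:
        if low <= cp <= high:
            return name
    return None


def reorder_key(text, order):
    """Sort key: script group rank first, then code point within the group."""
    key = []
    for ch in text:
        script = script_of(ch)
        if script in order:
            rank = order.index(script)
        elif ord(ch) < 0x41:
            rank = -1            # digits, space, punctuation stay in front
        else:
            rank = len(order)    # every other script sorts after the list
        key.append((rank, ord(ch)))
    return key


samples = ["\u044f", "\u03b1", "z"]   # Cyrillic ya, Greek alpha, Latin z
default_sorted = sorted(samples, key=lambda s: reorder_key(s, DEFAULT_ORDER))
croatian_sorted = sorted(samples, key=lambda s: reorder_key(s, CROATIAN_ORDER))
```

Only the rank of the script groups changes between the two orders; the relative
order of characters inside one script is untouched.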

Normalization support:
A Unicode character may be expressible as a combination of two or more other
characters. For example, U+00FD = U+0079 ('y') + U+0301. In the Latvian
collation (utf8mb4_lv_800_ai_ci), the weight of 'y' has to change because the
Latvian CLDR rule defines "&I << y <<< Y". The weight of U+00FD should therefore
change along with that of 'y'.
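
The idea can be sketched with Python's standard unicodedata module, which
exposes the canonical decomposition. The tailoring table and primary_key() are
hypothetical, and the code point of the lowercase base letter is used as a
stand-in for a real primary weight.

```python
import unicodedata

# Hypothetical primary-level view of the Latvian rule "&I << y <<< Y":
# y and Y differ from i only below the primary level, so in an accent and
# case insensitive collation they share the weight of i.
LATVIAN_PRIMARY = {"y": "i", "Y": "i"}


def primary_key(text, tailoring):
    """Primary weights after canonical decomposition (toy weights)."""
    key = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            continue                    # accent insensitive: skip marks
        ch = tailoring.get(ch, ch)      # apply the tailoring to the base
        key.append(ord(ch.lower()))     # case insensitive toy weight
    return key


decomposed = unicodedata.normalize("NFD", "\u00fd")
```

Because U+00FD is decomposed before the lookup, the tailored weight of 'y'
applies to it as well, without a separate entry for U+00FD.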


Reference:
http://www.unicode.org/reports/tr35/tr35-collation.html