WL#9108: Add language specific case insensitive collations of utf8mb4
Affects: Server-8.0
—
Status: Complete
Because each language has its own collation rules, we are going to add language specific, case insensitive collations of utf8mb4 so that strings in those languages collate correctly. The languages are: Icelandic, Latvian, Romanian, Slovenian, Polish, Estonian, Spanish, traditional Spanish (spanish2), Swedish, Turkish, Czech, Danish, Lithuanian, Slovak, Roman (classical Latin), Esperanto, Hungarian, German phonebook order (german2), Croatian, Vietnamese. Persian and Sinhala collations will be added in a future WL.
F-1: The collations shall be compiled collations.
F-2: The collations shall work for all characters in the range [U+0, U+10FFFF].
F-3: The collations shall sort the characters of the language correctly according to the language specific rules.
F-4: The collations shall sort characters that do not belong to the language according to their order in DUCET.
F-5: For characters that have no weight assigned in DUCET, the collations shall sort them by their implicit weight, which is constructed in the way defined by UCA.
F-6: The collations shall be case / accent insensitive.
F-7: To simplify the collations, the characters in a contraction sequence will be treated as separate characters.
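As an illustration of F-5, the sketch below derives the two implicit primary weights (AAAA, BBBB) that UCA prescribes for a code point with no DUCET entry. It is not the server's implementation. The function names are hypothetical, and the Han ranges are a simplified assumption standing in for the full Unified_Ideograph property data.

```python
# Sketch of UCA implicit weight derivation (UTS #10, implicit weights).
# Assumption: the ranges below only approximate the Unified_Ideograph
# property; a real implementation would consult the Unicode data files.

CORE_HAN = [(0x4E00, 0x9FFF)]                       # CJK Unified Ideographs block
OTHER_HAN = [(0x3400, 0x4DBF), (0x20000, 0x2CEAF)]  # extension blocks (approximate)


def _in_ranges(cp, ranges):
    """Return True if cp falls inside any inclusive (lo, hi) range."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def implicit_base(cp):
    """Pick the UCA base value for the first implicit weight."""
    if _in_ranges(cp, CORE_HAN):
        return 0xFB40
    if _in_ranges(cp, OTHER_HAN):
        return 0xFB80
    return 0xFBC0  # any other code point without a DUCET weight


def implicit_weights(cp):
    """Return (AAAA, BBBB), the two primary weights for code point cp."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    aaaa = implicit_base(cp) + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb


if __name__ == "__main__":
    aaaa, bbbb = implicit_weights(0x4E00)
    print(f"U+4E00 -> {aaaa:04X} {bbbb:04X}")  # prints "U+4E00 -> FB40 CE00"
```

The pair corresponds to the collation elements [.AAAA.0020.0002][.BBBB.0000.0000], so code points without a DUCET weight sort after all explicitly weighted characters and, within one base, in code point order.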
How we sort characters with DUCET:
Please refer to the High-Level Specification of WL#9125. DUCET assigns a default collation weight to about 30,000 Unicode characters. For characters that have no weight assigned in DUCET, we derive an implicit weight with the method defined by UCA.

Language specific rules:
We usually sort by comparing the weight data of characters, but this does not give the correct order for some languages. For example:

  String A: "headache"
  String B: "cheers"

Usually we say A > B, because the weight of A's first character 'h' (0x1C93) is greater than the weight of B's first character 'c' (0x1BF5). If we sort these two strings under Czech rules, the result is different, because Czech has one special rule: "h < ch". The sequence "ch" sorts after 'h', so in Czech A < B.

CLDR defines sets of rules for different languages. Our task is to modify our collations to adopt those rules so that they work correctly for specific languages. Characters explicitly specified in a CLDR rule are sorted according to that rule. Characters not specified in a CLDR rule are sorted by their weight assigned in DUCET or by their implicit weight. For example, when sorting Han characters with the Croatian collation, the Han characters are collated by their DUCET weight or implicit weight.

Collations to add:
We are going to add the following collations. The collation name contains:
  a. character set name: "utf8mb4"
  b. language's ISO code: for example, "cs" for Czech
  c. UCA version: "800"
  d. accent / case insensitive: "ai_ci"

The complete list is:

  Collation name                      Language
  ------------------------------------------------------------
  utf8mb4_cs_800_ai_ci                Czech
  utf8mb4_da_800_ai_ci                Danish
  utf8mb4_de_phonebook_800_ai_ci      German (phonebook order)
  utf8mb4_eo_800_ai_ci                Esperanto
  utf8mb4_es_800_ai_ci                Spanish
  utf8mb4_es_traditional_800_ai_ci    Spanish (traditional)
  utf8mb4_et_800_ai_ci                Estonian
  utf8mb4_hr_800_ai_ci                Croatian
  utf8mb4_hu_800_ai_ci                Hungarian
  utf8mb4_is_800_ai_ci                Icelandic
  utf8mb4_la_800_ai_ci                Roman (classical Latin)
  utf8mb4_lt_800_ai_ci                Lithuanian
  utf8mb4_lv_800_ai_ci                Latvian
  utf8mb4_pl_800_ai_ci                Polish
  utf8mb4_ro_800_ai_ci                Romanian
  utf8mb4_sk_800_ai_ci                Slovak
  utf8mb4_sl_800_ai_ci                Slovenian
  utf8mb4_sv_800_ai_ci                Swedish
  utf8mb4_tr_800_ai_ci                Turkish
  utf8mb4_vi_800_ai_ci                Vietnamese

Collation Rules:
We apply the CLDR rules on top of DUCET. This means we change a character's weight when CLDR defines a specific rule for a language or for particular characters. All the CLDR rules can be found at: http://www.unicode.org/cldr/charts/29/collation/index.html
Or you can download and unzip the file http://www.unicode.org/Public/cldr/29/core.zip, then find the rule files under common/collations.

One language might have more than one type of rule defined. Following CLDR's preference, we select the rule with this procedure:
  a. Select the "defaultCollation" rule if it is defined; for example, Swedish uses "reformed".
  b. Otherwise, select the "standard" rule if it is defined.

Special Rules:
Most of the rules we apply to our collations change the sorting weight of individual characters in order to change the sorting order. But there is one special rule we need to note: the reorder rule of the Croatian collation. For the other collations we implemented, no matter how we change the weights of characters, Cyrillic characters sort greater than Greek characters.
The Croatian collation's rule, however, makes all Cyrillic characters sort greater than Latin characters, and makes all other characters (except digits, spaces, punctuation and symbols) sort greater than Cyrillic characters. Greek characters therefore sort after Cyrillic characters in this collation.

Normalization support:
One Unicode character might be expressed as a combination of two or more other characters. For example, U+00FD = U+0079 ('y') + U+0301. In the Latvian collation (utf8mb4_lv_800_ai_ci), the weight of the character 'y' needs to change because the Latvian CLDR rule defines "&I << y <<< Y". We therefore think that the weight of U+00FD should change together with 'y'.

Reference: http://www.unicode.org/reports/tr35/tr35-collation.html
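The normalization point above (U+00FD decomposing to 'y' + U+0301) can be checked with Python's standard unicodedata module. The sketch below is only an illustration of the idea, not the server's code. The helper names canonical_base and precomposed_followers are hypothetical; they find precomposed characters whose base letter is tailored by a rule such as "&I << y <<< Y".

```python
import unicodedata


def canonical_base(ch):
    """Return the first code point of ch's canonical (NFD) decomposition."""
    return unicodedata.normalize("NFD", ch)[0]


def precomposed_followers(tailored_bases, candidates):
    """Hypothetical helper: list the candidates that decompose into a
    tailored base letter plus combining marks, so their weight should
    change together with that base letter."""
    result = []
    for ch in candidates:
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) > 1 and decomposed[0] in tailored_bases:
            result.append(ch)
    return result


if __name__ == "__main__":
    # U+00FD decomposes to U+0079 ('y') followed by U+0301.
    decomposed = unicodedata.normalize("NFD", "\u00fd")
    print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0079', 'U+0301']

    # 'y' and 'Y' are tailored by the Latvian rule quoted above.
    followers = precomposed_followers({"y", "Y"}, ["\u00fd", "\u00dd", "\u00e9", "y"])
    print([f"U+{ord(c):04X}" for c in followers])  # ['U+00FD', 'U+00DD']
```

A plain 'y' is not returned because it is the tailored letter itself, and U+00E9 is skipped because its base letter 'e' is not touched by the rule.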
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.