WL#9751: Add Japanese collation to utf8mb4
Affects: Server-8.0
—
Status: Complete
We already have Japanese collations for the SJIS and UJIS character sets, but there is no Japanese collation for UTF8. This worklog tracks the implementation of a Japanese collation for utf8mb4, built on the utf8mb4 collations that are based on the latest Unicode version, 9.0. As bugs such as BUG#76553 show, our current collations cannot compare Japanese characters correctly. The implementation follows the rules defined for Japanese in CLDR v30.

User Documentation
==================
https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-1.html
F-1: The collation shall be a compiled one.

F-2: The collation shall work for all characters in the range [U+0, U+10FFFF].

F-3: The collation shall sort characters of the Japanese language correctly according to the language-specific rules defined by CLDR.

F-4: The collation shall sort characters that do not belong to the language according to their order in DUCET.

F-5: For characters that have no weight assigned in DUCET, the collation shall sort them by their implicit weight, constructed in the UCA way.

F-6: The collation shall be case and accent sensitive.

NF-1: The performance of this collation's functions should regress by no more than 10% compared with utf8mb4's general case and accent sensitive collation (utf8mb4_0900_as_cs).
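To make F-5 concrete, the following is a minimal sketch of how UCA derives the two collation elements of an implicit weight from a code point. The function name `implicit_weights` is hypothetical and is not taken from the server source. The caller is assumed to pass the UCA base value for the code point (for example FB40 for the CJK Unified Ideographs block, FB80 for the ideograph extension blocks); choosing the base is outside this sketch.

```python
def implicit_weights(code_point, base):
    """Return the two collation elements UCA derives for a code point
    that has no explicit entry in DUCET.

    `base` is the UCA base value for the code point's category; selecting
    it is left to the caller (an assumption of this sketch).
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    aaaa = base + (code_point >> 15)           # primary of the first element
    bbbb = (code_point & 0x7FFF) | 0x8000      # primary of the second element
    return [(aaaa, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]


# U+4E00 and U+9FFF are the ends of the CJK Unified Ideographs block.
print([hex(w) for w in implicit_weights(0x4E00, 0xFB40)[0]])  # ['0xfb40', '0x20', '0x2']
print([hex(w) for w in implicit_weights(0x9FFF, 0xFB40)[0]])  # ['0xfb41', '0x20', '0x2']
```

Each implicit weight therefore costs two collation elements per character, which is the overhead the tailored Kanji weights described in the design are meant to avoid.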
CLDR defines the rule "[strength 3]" for Japanese collation, which means that by default three levels of weight are used to implement this collation. We'll call this new collation "utf8mb4_ja_0900_as_cs". "utf8mb4" is the character set. "ja" is the language, Japanese. "0900" is the UCA version, 9.0.0. "as_cs" means we'll use the secondary and tertiary weights defined in DUCET.

1. Hiragana and Katakana

DUCET already assigns weights to Hiragana and Katakana so that they sort correctly. It uses the level 2 weight to differentiate "voiced kana" from "voiceless kana", so the "mother-father" problem (BUG#76553) is solved. Example:

  306F ; [.3D74.0020.000E]                  # HIRAGANA LETTER HA
  3070 ; [.3D74.0020.000E][.0000.0037.0002] # HIRAGANA LETTER BA

"HA" is a "voiceless kana" and "BA" is a "voiced kana". The primary weight of "HA" and "BA" is the same, 3D74, but their secondary weights differ: 0020 for "HA" and 0020 0037 for "BA".

The weights of Hiragana and Katakana also already guarantee that Hiragana sorts before Katakana. The two are distinguished by the level 3 weight. Example:

  306F ; [.3D74.0020.000E] # HIRAGANA LETTER HA
  30CF ; [.3D74.0020.0011] # KATAKANA LETTER HA

The primary weight of Hiragana "HA" and Katakana "HA" is the same, 3D74. Their secondary weight is the same too, 0020. But their tertiary weights differ: 000E for Hiragana "HA" and 0011 for Katakana "HA".

2. Length mark and iteration mark

The length mark ('ー', U+30FC) is a Japanese symbol that indicates a long vowel of two morae (syllable timing) in length. It is usually used in Katakana and Hiragana writing. CLDR has a series of rules that define how the length mark should sort when it follows different Katakana / Hiragana characters. We'll use these rules to tailor its weight.

An iteration mark is a punctuation mark that represents a duplicated character. For example, "人 hito" means person and "人々 hitobito" means people. The "々" is an iteration mark for Kanji.
Japanese also has the iteration marks "ゝ" for Hiragana and "ヽ" for Katakana. CLDR has rules that define how the iteration marks should sort. We'll use these rules to tailor their weight.

3. Kanji

CLDR defines the sort order of 6355 Kanji characters. These are the same characters as the set defined by JIS X 0208, a Japanese Industrial Standard. All of these Japanese Kanji characters are in the range [U+4E00, U+9FFF].

In DUCET, no collation element is assigned to any code point in this range, which means that in other collations their weight has to be calculated as an implicit weight. An implicit weight contains two collation elements. Because Kanji characters are undoubtedly common in Japanese text, we don't use their implicit weight. Instead we give them a tailored explicit weight according to the collating order defined in CLDR. This explicit weight has only one collation element, which is better for performance because it reduces the time spent looking up the weight table. The assigned weights are greater than the maximum weight of the Kana characters.

For other, non-Japanese Han characters, we keep their implicit weight ([FB80 - FB85, 0000, 0000][P2, S2, T2]).

4. Reorder

CLDR defines the reorder rule [Latin, Kana, Hani] for Japanese, which means Kana characters should compare greater than Latin, and Han characters should compare greater than Kana. But there are other character groups between them. For example, in DUCET, Greek (and Coptic, Cyrillic, etc.) characters are between Latin and Kana. So we can see the original character groups as [Latin, CharA, Kana, CharB, Hani, others]. The Japanese collation should rearrange them as [Latin, Kana, Hani, CharA, CharB, others].

For the Latin and Kana groups, the reorder implementation in WL#9108 can reorder them well.
But for CharA and CharB, because non-Japanese Han characters use the implicit weight defined by UCA, moving CharA and CharB after all Han characters means we need to give CharA and CharB a greater weight. The maximum implicit weight of Han characters defined by UCA is [FB85, 0020, 0002][P2, 0000, 0000]. So if a character in CharA or CharB has weight [P, S, T] in DUCET, giving it the adjusted weight [FB86, 0000, 0000][P, S, T] is enough to achieve this (the values of P, S and T do not change).

Please note that reordering only happens at the primary weight level. The primary weight ranges of all the character groups involved are:

Char Group       | Original Weight Range       | Tailored Weight Range
----------------------------------------------------------------------------
Latin            | 1C47 -- 1FB5                | 1C47 -- 1FB5
CharA            | 1FB9 -- 3D59                | [FB86, 1FB9] - [FB86, 3D59]
Kana             | 3D5A -- 3D8B                | 1FB6 -- 1FE7
CharB            | 3D89 -- 54A3                | [FB86, 3D89] - [FB86, 54A3]
Japanese Han     | [FB40, AAAA] - [FB41, BBBB] | 54A4 -- 6D76
non-Japanese Han | [FB40, XXXX] - [FB85, YYYY] | [FB40, XXXX] - [FB85, YYYY]
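The primary-weight mapping in the table can be sketched as follows. This is an illustration only, not the server implementation: the name `tailor_primary` is hypothetical, the ranges are copied from the table as listed, and the sketch assumes that Kana weights are shifted linearly so that their relative order is preserved. Japanese Han characters are not covered, because their tailored weights come from the CLDR collating order rather than from a DUCET primary weight.

```python
# Inclusive primary-weight ranges, copied from the table (hexadecimal).
LATIN = (0x1C47, 0x1FB5)
CHAR_A = (0x1FB9, 0x3D59)
KANA = (0x3D5A, 0x3D8B)
CHAR_B = (0x3D89, 0x54A3)
KANA_TAILORED_START = 0x1FB6   # Kana moves to 1FB6 -- 1FE7
PUSH_BACK_LEAD = 0xFB86        # greater than any Han implicit lead weight


def in_range(value, bounds):
    return bounds[0] <= value <= bounds[1]


def tailor_primary(primary):
    """Map a DUCET primary weight to its tailored primary-weight sequence."""
    if in_range(primary, LATIN):
        return [primary]                      # Latin keeps its weight
    # Kana is tested before CharB because the two ranges overlap as listed.
    if in_range(primary, KANA):
        # Assumption: a linear shift that preserves the order within Kana.
        return [KANA_TAILORED_START + (primary - KANA[0])]
    if in_range(primary, CHAR_A) or in_range(primary, CHAR_B):
        return [PUSH_BACK_LEAD, primary]      # pushed after all Han characters
    return [primary]                          # everything else is untouched


print([hex(w) for w in tailor_primary(0x3D74)])  # Kana "HA": ['0x1fd0']
print([hex(w) for w in tailor_primary(0x1FB9)])  # first CharA: ['0xfb86', '0x1fb9']
```

Comparing the resulting sequences element by element gives the intended group order: the Latin maximum [1FB5] sorts before Kana [1FB6 ...], Kana before the Japanese Han range starting at [54A4], and any Han implicit weight with lead FB40 to FB85 before the [FB86, P] weights of CharA and CharB.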
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.