WL#10480: Add Japanese kana sensitive collation to utf8mb4
Affects: Server-8.0
—
Status: Complete
We have implemented the utf8mb4_ja_0900_as_cs collation, which sorts characters using three levels of weight. Customer feedback, however, is that it would be good to have a collation with an additional kana-sensitive feature. The new collation is named utf8mb4_ja_0900_as_cs_ks, where 'ks' stands for 'kana sensitive'. The '_ks' suffix currently applies only to Japanese.

DUCET assigns hiragana and katakana weights that differ at the third level, so non-Japanese collations can already distinguish them. The default Japanese collating rules, however, define hiragana and katakana as different only at the quaternary level, which means that the default Japanese collation, utf8mb4_ja_0900_as_cs, compares hiragana and katakana as equal. This is why we introduce the new suffix and collation.

User Documentation
==================

* https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-2.html
* https://dev.mysql.com/doc/refman/8.0/en/charset-collation-names.html (_ks suffix)
* https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html
F-1: The collation shall be a compiled one.

F-2: The collation shall work for all characters in the range [U+0, U+10FFFF].

F-3: The collation shall sort characters of the Japanese language correctly according to the language-specific rules defined by CLDR.

F-4: The collation shall sort characters not belonging to the language according to their order in DUCET.

F-5: For characters whose weight is not assigned in DUCET, the collation shall sort them by their implicit weight value, which is constructed in the UCA way.

F-6: The collation shall be case sensitive, accent sensitive, and kana sensitive.

NF-1: The performance of this collation's functions should regress by no more than 33% compared to utf8mb4_ja_0900_as_cs, since the number of weight levels increases from 3 to 4.
Quaternary weight is necessary to implement a kana-sensitive Japanese collation: it distinguishes Katakana from Hiragana. Under the current CLDR rules defined for Japanese (e.g. &き<<<<キ), Hiragana equals Katakana at the default collating level (3).

The following sample shows how kana sensitivity affects the sort order. We have the rules &き<<<<キ, &ゅ<<<<ュ, &ゆ<<<<ユ, and &う<<<<ウ.

A. kana insensitive (default, what utf8mb4_ja_0900_as_cs does)

   きゅう = キュウ < きゆう = キユウ

   * きゅう = キュウ because the three characters in the two strings are pairwise equal on the first three levels of weight.
   * キュウ < きゆう because the tertiary weight of ュ is less than that of ゆ.
   * きゆう = キユウ because the three characters in the two strings are pairwise equal on the first three levels of weight.

B. kana sensitive

   きゅう < キュウ < きゆう < キユウ

   * きゅう < キュウ because the quaternary weight of き is less than that of キ.
   * キュウ < きゆう because the tertiary weight of ュ is less than that of ゆ.
   * きゆう < キユウ because the quaternary weight of き is less than that of キ.

(Keep in mind that we compare the strings' primary weights first. If the primary weights are equal, we compare the secondary weights, and we continue this way until we find a differing weight or reach the end of the weights. That is why キュウ < きゆう even though き <<<< キ.)

UCA defines its own way of assigning quaternary weights: a sufficiently large weight (e.g. 0xFFFF) for every normal character and 0x0000 for combining marks, and then it shifts the weight based on that. In most cases this quaternary weight is not needed. For example, a Latin character can be distinguished from Kanji, and Kanji can be distinguished from Katakana / Hiragana, by three levels of weight. For Japanese, the quaternary weight is necessary only when the strings being compared contain Katakana / Hiragana characters.

So we would like to simplify the way quaternary weight is assigned. Instead of adding piles of unnecessary 0xFFFF entries to the weight tables, we assign a quaternary weight to a character only when we know it is Katakana / Hiragana.
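The level-by-level comparison in the sample above can be sketched in a few lines of Python. This is only an illustration, not the server's implementation: the `WEIGHTS` table and the `sort_key` helper are hypothetical, and the numeric weights are invented so that they respect the relations stated above (ゅ/ュ below ゆ/ユ at the tertiary level, Hiragana below Katakana at the quaternary level).

```python
# Hypothetical toy weights: (primary, secondary, tertiary, quaternary).
# The numbers are made up; only their relative order matters here.
WEIGHTS = {
    "き": (1, 1, 2, 1), "キ": (1, 1, 2, 2),
    "ゅ": (2, 1, 1, 1), "ュ": (2, 1, 1, 2),
    "ゆ": (2, 1, 2, 1), "ユ": (2, 1, 2, 2),
    "う": (3, 1, 2, 1), "ウ": (3, 1, 2, 2),
}


def sort_key(s, levels):
    """Return one tuple of weights per level, primary level first.

    Comparing two keys compares all primary weights, then all secondary
    weights, and so on, stopping at the first difference.
    """
    return tuple(tuple(WEIGHTS[ch][lvl] for ch in s) for lvl in range(levels))


words = ["キユウ", "きゆう", "キュウ", "きゅう"]

# Four levels: kana sensitive (case B above).
kana_sensitive = sorted(words, key=lambda w: sort_key(w, 4))
print(kana_sensitive)

# Three levels: Hiragana and Katakana spellings compare equal (case A above).
print(sort_key("きゅう", 3) == sort_key("キュウ", 3))
```

With four levels the sketch yields the order きゅう < キュウ < きゆう < キユウ; with three levels the Hiragana and Katakana spellings produce identical keys.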
The value of the quaternary weight can be any positive integer. Because of the level separator (0x0000), the quaternary weight does not affect the comparison result when that result is already determined by the first three levels of weight.

Reference: http://www.unicode.org/reports/tr10/#Variable_Weighting
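As a rough illustration of why the concrete quaternary value does not matter, the sketch below builds a flat sort key with a 0x0000 separator between levels and gives a quaternary weight only to kana. The `flat_key` and `make_table` helpers and all numeric weights are hypothetical and chosen for this example; they are not taken from the actual weight tables.

```python
SEP = 0x0000  # level separator, lower than any real weight


def make_table(q_hira, q_kata):
    """Build a toy weight table; only kana receive a quaternary weight."""
    return {
        "a":  (10, 1, 1, None),      # Latin letter: no quaternary weight
        "き": (20, 1, 2, q_hira), "キ": (20, 1, 2, q_kata),
        "ゅ": (30, 1, 1, q_hira), "ュ": (30, 1, 1, q_kata),
        "ゆ": (30, 1, 2, q_hira), "ユ": (30, 1, 2, q_kata),
    }


def flat_key(s, table):
    """Concatenate the weights level by level, separated by SEP."""
    key = []
    for level in range(4):
        if level:
            key.append(SEP)
        for ch in s:
            w = table[ch][level]
            if w is not None:  # characters without a weight add nothing
                key.append(w)
    return key


# Two tables that differ only in the (positive) quaternary values.
small = make_table(1, 2)
large = make_table(100, 60000)

for table in (small, large):
    # Decided at the tertiary level, so the quaternary values are never reached.
    print(flat_key("キュ", table) < flat_key("きゆ", table))
    # Decided only at the quaternary level: Hiragana sorts before Katakana.
    print(flat_key("き", table) < flat_key("キ", table))
```

In this sketch any pair of positive values with the Hiragana weight below the Katakana weight gives the same ordering, and a string without kana simply contributes nothing after the last separator.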
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.