WL#11825: Add Chinese collation for utf8mb4
Affects: Server-8.0
—
Status: Complete
This worklog tracks the work of adding a Chinese collation for utf8mb4. The collation will be named 'utf8mb4_zh_0900_as_cs', where 'utf8mb4' is the character set, 'zh' is the ISO code for Chinese, '0900' means the collation follows Unicode standard 9.0, and 'as_cs' means accent sensitive and case sensitive.
F-1: The collation shall be a compiled one.
F-2: The collation shall work for all characters in the range [U+0, U+10FFFF].
F-3: The collation shall sort characters of the language correctly according to language-specific rules.
F-4: The collation shall sort characters not belonging to the language according to their order in the DUCET.
F-5: For characters whose weight is not assigned in the DUCET, the collation shall sort them by their implicit weight, which is constructed in the UCA way.
F-6: The collation shall be case and accent sensitive.
We'll follow the rules defined by the CLDR collation definition file for Chinese, zh.xml, to implement this collation. The zh.xml file defines the reordering rule for all character groups, the sorting order of 41294 Han characters, and a few weight shifting rules for Bopomofo (symbols used to represent the pronunciation of Han characters) and some special characters. Basically, we'll re-use the code we implemented for other utf8mb4 collations, but there are a few things we need to tune to make it work correctly for Chinese.

Reorder

The CLDR defines a rule "[reorder Hani]" for the Chinese collation. This rule means that all Han characters should sort before other characters (except for core characters such as spaces and symbols). The DUCET defines the sorting order of character groups as [core group, Latin, Cyrillic ..., Hani, others]. Let us give a name, GrpA, to the character groups between the core group and the Hani group. After reordering, the character groups should be [core group, Hani, GrpA, others].

We have implemented reordering with WL#9108 and WL#9751. For the character groups (except for Hani) which are involved in reordering, we calculate the groups' reordering parameters and apply the reordering when returning a weight. We'll do the same for the characters in GrpA (see item 4 below). For the Hani group, because of the huge number of characters, and because there is no simple calculation that derives a character's reordered weight from its weight in the DUCET, we pre-calculate the Han characters' weights with uca9dump to save CPU cost.

To implement this reordering, we change the primary weight of all characters as follows:

1. For the core group, we don't change the weights. It still sorts before all other character groups.

2. The zh.xml file defines the sort order of 41294 Han characters (let us call them Han group1). In the DUCET these characters have an implicit weight, which consists of two primary weights. We can re-use the weight range freed by moving GrpA (see item 4 below) to give each character in Han group1 a single primary weight. We'll give them weights from 0x1C47 to 0xBD94 (0xBD94 - 0x1C47 + 1 = 41294).

3. For the other Han characters (let us call them Han group2), whose sorting order is not defined in zh.xml, we still give them two primary weights. The weights are calculated the way the UCA calculates implicit weights, but the leading weight is changed from 0xFB40 - 0xFB85 to 0xBD95 - 0xBD99. This won't cause any weight conflict, because the leading weight (0xBD95 - 0xBD99) is bigger than the largest weight of Han group1 (0xBD94). Their relative order is the same as in the DUCET.

4. For the characters in GrpA, which the DUCET assigns primary weights in the range [0x1C47, 0x54A3], we need to make them sort after all Han characters. To do this, we give them a new weight range [0xBD9A, 0xF5F6] (0xF5F6 - 0xBD9A = 0x54A3 - 0x1C47). This makes GrpA sort after all Han characters in both Han group1 and Han group2, because the smallest weight in this group (0xBD9A) is greater than both the biggest weight of Han group1 (0xBD94) and the biggest leading weight of Han group2 (0xBD99).

5. For all other characters not mentioned above, we give them two primary weights. The weights are calculated the way the UCA calculates implicit weights, but the leading weight is changed from 0xFBC0 - 0xFBE1 to 0xF5F7 - 0xF618.
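The range arithmetic in items 1 to 5 can be checked with a small sketch. This is not the server implementation: the names `reorder_single_primary`, `ranges_are_ordered` and `LAYOUT` are hypothetical and exist only for this illustration, and the sketch covers only the core group and GrpA, whose single DUCET primary weights map by a constant shift. Han group1 weights come from the pre-calculated data and are not modeled here.

```python
# Primary weight ranges quoted from the worklog text (inclusive bounds).
CORE = (0x0200, 0x1C46)             # core group, unchanged by reordering
HAN_GROUP1 = (0x1C47, 0xBD94)       # single weights for the 41294 Han characters
HAN_GROUP2_LEAD = (0xBD95, 0xBD99)  # leading weights of the remaining Han characters
GRPA_DUCET = (0x1C47, 0x54A3)       # GrpA range as assigned by the DUCET
GRPA_REORDERED = (0xBD9A, 0xF5F6)   # GrpA range after reordering
OTHERS_LEAD = (0xF5F7, 0xF618)      # leading weights of all other characters

# GrpA is moved as a block, so every weight shifts by the same amount.
GRPA_SHIFT = GRPA_REORDERED[0] - GRPA_DUCET[0]


def reorder_single_primary(weight):
    """Hypothetical helper: map a core or GrpA DUCET primary weight
    to its reordered value."""
    if CORE[0] <= weight <= CORE[1]:
        return weight
    if GRPA_DUCET[0] <= weight <= GRPA_DUCET[1]:
        return weight + GRPA_SHIFT
    raise ValueError(f"0x{weight:04X} is not a core or GrpA primary weight")


def ranges_are_ordered(ranges):
    """Return True if every range ends before the next one begins."""
    return all(a[1] < b[0] for a, b in zip(ranges, ranges[1:]))


# Intended order after reordering: core, Han group1, Han group2, GrpA, Others.
LAYOUT = [CORE, HAN_GROUP1, HAN_GROUP2_LEAD, GRPA_REORDERED, OTHERS_LEAD]
```

Calling `ranges_are_ordered(LAYOUT)` confirms that the five ranges do not overlap and appear in the intended order, which is the property items 3 and 4 argue for.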
The primary weight changes look like this:

Char Group       | Origin Weight Range         | Reordered Weight Range
-----------------|-----------------------------|----------------------------
core group       | 0200 - 1C46                 | 0200 - 1C46
Han group1       | [FB40, AAAA] - [FB85, BBBB] | 1C47 - BD94
Han group2       | [FB40, CCCC] - [FB85, DDDD] | [BD95, CCCC] - [BD99, DDDD]
GrpA             | 1C47 - 54A3                 | BD9A - F5F6
Others           | [FBC0, XXXX] - [FBE1, YYYY] | [F5F7, XXXX] - [F618, YYYY]

How the UCA calculates implicit weight and how we change it

We mentioned above that we'll calculate the weights of the characters in Han group2 and Others. Let us look at how the UCA calculates implicit weights and how we change the calculation. Taking a character in the Others group as an example, the UCA calculates its implicit weight as:

  AAAA = 0xFBC0 + (CP >> 15)      // AAAA is the leading weight
  BBBB = (CP & 0x7FFF) | 0x8000   // BBBB is the second weight

CP is the character's code point. The only thing we need to change is to replace 0xFBC0 with 0xF5F7.

Weight shifting

As in other collations, when there is a rule like '&A < B', we shift the weight by adding an extra collation element for B. In the other collations, the primary weight of the extra collation element is 0x54A4, the biggest primary weight of a regular character plus one, which ensures that weights do not overlap. To achieve the same goal here, we need to set the primary weight of the extra collation element to 0xF619, which immediately follows the biggest leading primary weight of the Others group.

The following example shows how it works. Assume we have a rule '&A < B' and the weight of character A is BD9A; then we give B the weight BD9A F619. If we are comparing the strings 'AC' and 'BD', then:

1. If C and D are in the core group, Han group1 or GrpA, the comparison result is not affected by the weights of C and D, because F619 is greater than any weight in these character groups.

2. If C or D is in Han group2 or Others, where a character's second weight might be greater than F619, the comparison result is still not affected, because F619 is greater than any leading primary weight of these groups and we compare strings by comparing their weights from the first byte onward. For example:

  AC's weight is: BD9A F5F8 F6FF
  BD's weight is: BD9A F619 1C50

The comparison is decided at the second weight (F5F8 < F619), so 'AC' sorts before 'BD'.
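The modified implicit weight formula and the 'AC' versus 'BD' comparison can be reproduced with a short sketch. It models primary weights only and is not the server code: the function names are hypothetical, and U+F6FF is a code point chosen here only because the formula maps it to the weights F5F8 F6FF used in the example; the worklog does not say which character C is.

```python
OTHERS_LEAD_BASE = 0xF5F7  # replaces the UCA base 0xFBC0 for the Others group
EXTRA_CE_PRIMARY = 0xF619  # primary weight of the extra collation element


def others_implicit_weight(cp):
    """Two primary weights for a code point in the Others group,
    using the UCA formula with the replaced base."""
    leading = OTHERS_LEAD_BASE + (cp >> 15)
    second = (cp & 0x7FFF) | 0x8000
    return [leading, second]


def shifted_weight(anchor_weight):
    """Weights of B under a rule '&A < B': the weight of A followed by
    the extra collation element (hypothetical helper)."""
    return [anchor_weight, EXTRA_CE_PRIMARY]


weight_a = 0xBD9A                          # weight of A, as assumed in the example
weight_c = others_implicit_weight(0xF6FF)  # C: assumed code point in Others
weight_d = [0x1C50]                        # weight of D, as given in the example

key_ac = [weight_a] + weight_c             # BD9A F5F8 F6FF
key_bd = shifted_weight(weight_a) + weight_d  # BD9A F619 1C50

# Python compares lists element by element from the front, which mirrors
# comparing weight strings from the first weight onward.
print(key_ac < key_bd)
```

Because the largest possible leading weight in Others, `others_implicit_weight(0x10FFFF)[0]`, is 0xF618, the extra collation element 0xF619 is larger than any leading weight, so the second weight of C never takes part in the decision.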
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.