WL#9125: Add utf8mb4_800_ci_ai
Affects: Server-8.0
—
Status: Complete
We have received many requests to replace the current default character set, latin1, with utf8mb4 so that all characters in the BMP and SMP are supported. We already have the utf8mb4 character set and some collations based on it, but all of these collations are case insensitive. To support accent- and case-sensitive sorting, and later punctuation handling so that Japanese can be sorted, we have decided to add new collations that implement the latest UCA (Unicode Collation Algorithm), version 8.0.0.

This WL tracks the implementation of the new general collation for utf8mb4, which sorts all characters in the default order. The new collation will be named utf8mb4_800_ci_ai: "800" means the implementation is based on UCA 8.0.0, "ci" means case insensitive, and "ai" means accent insensitive.

Reference: http://www.unicode.org/reports/tr10/
F-1: The collation shall be a compiled one.
F-2: The collation shall work for all characters in the range [U+0, U+10FFFF].
F-3: For characters whose code point is assigned in DUCET (Default Unicode Collation Element Table), the collation shall sort them according to the assigned weight values.
F-4: For characters whose code point is not assigned in DUCET, the collation shall sort them by their implicit weight values, constructed as the UCA specifies.
F-5: The collation shall be case and accent insensitive.
F-6: To simplify the collation, the characters of a contraction sequence shall be treated as separate characters.
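The implicit weights required by F-4 can be illustrated with a short sketch. It follows the derivation in the UCA 8.0.0 specification (UTS #10), where a code point without a DUCET entry maps to the two collation elements [.AAAA.0020.0002][.BBBB.0000.0000]. The function names and the simplified ideograph ranges below are assumptions made for illustration; they are not taken from the server source, and a real implementation would consult the Unicode Character Database for the Unified_Ideograph property.

```python
# Assumption: approximate Unified_Ideograph ranges for Unicode 8.0.
# The twelve unified ideographs inside the CJK Compatibility Ideographs
# block (U+F900..U+FAFF) are left out of this sketch for brevity.
CORE_HAN = [(0x4E00, 0x9FD5)]
EXTENSION_HAN = [
    (0x3400, 0x4DB5),    # Extension A
    (0x20000, 0x2A6D6),  # Extension B
    (0x2A700, 0x2B734),  # Extension C
    (0x2B740, 0x2B81D),  # Extension D
    (0x2B820, 0x2CEA1),  # Extension E
]


def in_ranges(cp, ranges):
    """Return True if cp falls inside any (first, last) range."""
    return any(first <= cp <= last for first, last in ranges)


def implicit_base(cp):
    """Pick the base value used for the first implicit primary weight."""
    if in_ranges(cp, CORE_HAN):
        return 0xFB40
    if in_ranges(cp, EXTENSION_HAN):
        return 0xFB80
    return 0xFBC0  # any other code point


def implicit_weights(cp):
    """Return two (primary, secondary, tertiary) collation elements."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    aaaa = implicit_base(cp) + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return [(aaaa, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]


if __name__ == "__main__":
    for cp in (0x4E00, 0x20000, 0x10FFFF):
        elements = implicit_weights(cp)
        print("U+%04X" % cp, ["%04X.%04X.%04X" % ce for ce in elements])
```

Because AAAA always lies above the primary weights assigned in DUCET, code points handled this way sort after all explicitly weighted characters, and among themselves in code point order within each base group.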
DUCET:
Most of the sorting in this collation is based on data from DUCET, which defines weight data for about 30k characters. A character may have one or more collation elements, and each collation element contains three unsigned 16-bit integer values, called the level 1 (primary), level 2 (secondary), and level 3 (tertiary) weights. The primary weight sorts the base character, the secondary weight is used for accent-sensitive sorting, and the tertiary weight is used for case-sensitive sorting. For example:

0061 ; [.1BC2.0020.0002] # LATIN SMALL LETTER A
0041 ; [.1BC2.0020.0008] # LATIN CAPITAL LETTER A

With a case-insensitive collation, "a" and "A" sort as equal, because their primary weights are the same and only their tertiary weights differ.

How we import DUCET:
We import the whole DUCET, including the secondary and tertiary weights as well as the primary weight, because we will use them to implement accent- and case-sensitive collations later.

How we use DUCET:
We implemented UCA 5.2.0 in the past, and the algorithm has not changed since then, so we can reuse that source code. However, the UCA 5.2.0 collations use only the primary weight, which is all that accent- and case-insensitive sorting needs. We need to change the code so that it works with multilevel weights and decides how many weight levels to use according to the collation.
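The idea of letting the collation decide how many weight levels to use can be sketched as below. This is a minimal illustration, not the server implementation: the table holds only the two DUCET entries quoted above, and the names DUCET_SAMPLE and sort_key are hypothetical. The key layout (all non-zero weights of one level, then a 0000 separator, then the next level) follows the sort key formation described in UTS #10.

```python
# The two DUCET 8.0.0 entries quoted in this worklog, stored as
# (primary, secondary, tertiary) collation elements.
DUCET_SAMPLE = {
    "a": [(0x1BC2, 0x0020, 0x0002)],
    "A": [(0x1BC2, 0x0020, 0x0008)],
}


def sort_key(text, levels):
    """Build a sort key from the first `levels` weight levels (1 to 3)."""
    if levels not in (1, 2, 3):
        raise ValueError("levels must be 1, 2 or 3")
    elements = [ce for ch in text for ce in DUCET_SAMPLE[ch]]
    key = []
    for level in range(levels):
        if level > 0:
            key.append(0x0000)  # level separator
        # Zero weights are ignorable at this level and are skipped.
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return tuple(key)


if __name__ == "__main__":
    # A ci_ai collation compares primary weights only.
    print(sort_key("a", 1) == sort_key("A", 1))
    # A case-sensitive collation would also compare level 2 and level 3.
    print(sort_key("a", 3) < sort_key("A", 3))
```

With levels set to 1 the two letters produce identical keys, which is the behavior wanted for utf8mb4_800_ci_ai. A later accent- or case-sensitive collation could reuse the same imported table and ask for two or three levels.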
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.