WL#10818: Add utf8mb4 accent sensitive and case insensitive collation
Affects: Server-8.0
—
Status: Complete
Add an accent-sensitive, case-insensitive collation for utf8mb4. The data dictionary needs it. Language-specific collations are out of scope for this WL. The new collation is named utf8mb4_0900_as_ci, where 'as' means accent sensitive and 'ci' means case insensitive.

User Documentation
==================

* https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-2.html
F-1: The collation shall be a compiled one.
F-2: The collation shall work for all characters in the range [U+0, U+10FFFF].
F-3: For characters whose weight is assigned in DUCET (Default Unicode Collation Element Table), the collation shall sort them according to the assigned weight.
F-4: For characters whose weight is not assigned in DUCET, the collation shall sort them by their implicit weight, constructed as specified by the UCA (Unicode Collation Algorithm).
F-5: The collation shall be accent sensitive and case insensitive.
F-6: Characters in a contraction sequence shall be treated as separate characters, because this is not a language-specific collation.
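The implicit weight mentioned in F-4 can be illustrated with a short Python sketch. It follows the general formula from the UCA specification (two collation elements derived from the code point), and is not taken from the server source. The function name `implicit_collation_elements` is hypothetical, and the range-dependent base value is passed in by the caller rather than looked up, since the rule for choosing it depends on the UCA version.

```python
def implicit_collation_elements(code_point, base):
    """Return the two UCA implicit collation elements for a code point.

    Each element is a (primary, secondary, tertiary) tuple.
    `base` is the range-dependent constant from the UCA implicit-weight
    table (for example 0xFBC0 for unassigned code points). Selecting it
    is left to the caller in this sketch (an assumption, not server logic).
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    aaaa = base + (code_point >> 15)          # high bits go into the first primary
    bbbb = (code_point & 0x7FFF) | 0x8000     # low 15 bits go into the second primary
    return [(aaaa, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]


if __name__ == "__main__":
    for primary, secondary, tertiary in implicit_collation_elements(0x10FFFF, 0xFBC0):
        print(f"[.{primary:04X}.{secondary:04X}.{tertiary:04X}]")
```

Because the primary weights grow with the code point, characters that share a base sort in code point order among themselves. The UCA specification lists different bases for Han ideographs (such as 0xFB40 and 0xFB80), which this sketch does not select automatically.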
DUCET defines three levels of weight for each character. The first level sorts by base letter, the second level provides accent-sensitive sorting, and the third level provides case-sensitive sorting. For example:

    0061 ; [.1C47.0020.0002] # LATIN SMALL LETTER A
    00E1 ; [.1C47.0020.0002][.0000.0024.0002] # LATIN SMALL LETTER A WITH ACUTE
    00C1 ; [.1C47.0020.0008][.0000.0024.0002] # LATIN CAPITAL LETTER A WITH ACUTE

With the accent- and case-insensitive collation (utf8mb4_0900_ai_ci), the sorting order of these three characters is 0061 = 00E1 = 00C1, because their first-level weights are the same, 1C47 (0000 is ignored).

With the accent- and case-sensitive collation (utf8mb4_0900_as_cs), the sorting order is 0061 < 00E1 < 00C1. Their first-level weights are the same (1C47), but the second-level weight of 00E1 (0020 0024) is greater than that of 0061 (0020), and the third-level weight of 00C1 (0008 0002) is greater than that of 00E1 (0002 0002).

With the new accent-sensitive, case-insensitive collation, the sorting order should be 0061 < 00E1 = 00C1, because 00E1 and 00C1 have the same second-level weight (0020 0024) and the third level is not compared.

Accent- and case-insensitive sorting was implemented in WL#9125 and WL#9108, and accent- and case-sensitive sorting in WL#9109. Those worklogs already put multi-level sorting in place. For this accent-sensitive, case-insensitive collation, we only need to reuse the multi-level sorting logic and restrict the comparison to the first two weight levels.
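The effect of limiting the number of compared levels can be sketched in Python. This is an illustrative model only, not the server's implementation: the table holds just the three DUCET entries quoted above, and the names `DUCET`, `sort_key`, `AI_CI`, `AS_CI`, and `AS_CS` are hypothetical.

```python
# Collation elements copied from the three DUCET lines quoted above.
# Each tuple is (primary, secondary, tertiary).
DUCET = {
    0x0061: [(0x1C47, 0x0020, 0x0002)],
    0x00E1: [(0x1C47, 0x0020, 0x0002), (0x0000, 0x0024, 0x0002)],
    0x00C1: [(0x1C47, 0x0020, 0x0008), (0x0000, 0x0024, 0x0002)],
}

# Number of weight levels each kind of collation compares.
AI_CI, AS_CI, AS_CS = 1, 2, 3


def sort_key(text, levels):
    """Build a sort key from the first `levels` weight levels.

    Zero weights are ignored, and a 0 separator (lower than any real
    weight) is appended after each level.
    """
    elements = [ce for ch in text for ce in DUCET[ord(ch)]]
    key = []
    for level in range(levels):
        key.extend(ce[level] for ce in elements if ce[level] != 0)
        key.append(0)
    return tuple(key)


if __name__ == "__main__":
    a, a_acute, cap_a_acute = "\u0061", "\u00e1", "\u00c1"
    for name, levels in (("ai_ci", AI_CI), ("as_ci", AS_CI), ("as_cs", AS_CS)):
        keys = [sort_key(s, levels) for s in (a, a_acute, cap_a_acute)]
        print(name, keys[0] < keys[1], keys[1] < keys[2], keys[1] == keys[2])
```

With two levels, the keys for 00E1 and 00C1 are identical while the key for 0061 is smaller, which matches the order 0061 < 00E1 = 00C1 described above.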
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.