WL#9109: Add case and accent sensitive collations for utf8mb4
Affects: Server-8.0
—
Status: Complete
Case and accent insensitive collations were added by WL#9108 and WL#9125. This worklog adds case and accent sensitive collations. DUCET (the Default Unicode Collation Element Table) defines three levels of collation weight. The first (primary) level compares base letters, the second (secondary) level compares accents when the base letters are equal, and the third (tertiary) level compares case when the base letters and their accents are equal. Our case and accent insensitive collations use only the primary weights defined in DUCET. This WL uses the weights of all three levels.
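The level-by-level comparison described above can be sketched as follows. This is only an illustration, not the server implementation: the helper names level_weights and compare are hypothetical, and the weight table holds just the four DUCET entries for o, O, Ó and Ò that this worklog quotes in its design section.

```python
from functools import cmp_to_key

# Collation elements copied from the DUCET lines quoted in this worklog.
# Each element is (primary, secondary, tertiary).
DUCET = {
    "\u006F": [(0x1D58, 0x0020, 0x0002)],                            # o
    "\u004F": [(0x1D58, 0x0020, 0x0008)],                            # O
    "\u00D3": [(0x1D58, 0x0020, 0x0008), (0x0000, 0x0024, 0x0002)],  # O with acute
    "\u00D2": [(0x1D58, 0x0020, 0x0008), (0x0000, 0x0025, 0x0002)],  # O with grave
}


def level_weights(text, level):
    """Non-zero weights of text at one level (0 primary, 1 secondary, 2 tertiary)."""
    return [ce[level] for ch in text for ce in DUCET[ch] if ce[level] != 0]


def compare(a, b, levels):
    """Return -1, 0 or 1, looking only at the first `levels` weight levels."""
    for level in range(levels):
        wa, wb = level_weights(a, level), level_weights(b, level)
        if wa != wb:
            return -1 if wa < wb else 1
    return 0  # equal on every level that was inspected


# Primary level only: the four characters are indistinguishable.
# All three levels: they sort as o, O, O-acute, O-grave.
chars = ["\u00D2", "O", "\u00D3", "o"]
ordered = sorted(chars, key=cmp_to_key(lambda a, b: compare(a, b, 3)))
```

With levels=1 the function mimics the existing insensitive behaviour; with levels=3 it mimics the accent and case sensitive ordering this WL introduces.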
F-1: The function should still work with case / accent insensitive collations.
F-2: The function should return secondary / tertiary weights as demanded.
F-3: The function should pad the correct secondary / tertiary weights of the space character to the end of the string.
As we said in the High-Level Description, DUCET defines three levels of collation weight. With accent and case sensitive collations, we compare the primary weights of the two strings first. If they are equal, we compare the secondary weights. If those are equal as well, we compare the tertiary weights.

For example, the following four characters are equal when compared with the accent and case insensitive collations:

    006F ; [.1D58.0020.0002] # LATIN SMALL LETTER O
    004F ; [.1D58.0020.0008] # LATIN CAPITAL LETTER O
    00D3 ; [.1D58.0020.0008][.0000.0024.0002] # LATIN CAPITAL LETTER O WITH ACUTE
    00D2 ; [.1D58.0020.0008][.0000.0025.0002] # LATIN CAPITAL LETTER O WITH GRAVE

This is because their primary weights are all 0x1D58. With an accent and case sensitive collation, the order should be: 006F <<< 004F << 00D3 << 00D2. Here '<<<' means 'less than at the case level' and '<<' means 'less than at the accent level'. 006F's tertiary weight 0002 is less than 004F's tertiary weight 0008. 004F's secondary weights end after 0020 (the next position is effectively 0000), whereas 00D3 continues with 0024. Likewise, 00D3's secondary weight 0024 is less than 00D2's 0025. In this way, we can distinguish all four characters.

The strnxfrm function returns the weights of the characters in a string. With this WL, it returns the weights one level after another: the primary level first, followed by the secondary level and then the tertiary level. For example, for the strings 'o\u00D3' and 'O\u00D2', our current strnxfrm returns "0x1D58, 0x1D58" for both, so they sort equal. After this implementation, the weights returned by strnxfrm for 'o\u00D3' should be:

    0x1D58, 0x1D58, 0x0000, 0x0020, 0x0020, 0x0024, 0x0000, 0x0002, 0x0008, 0x0002

and for 'O\u00D2' they should be:

    0x1D58, 0x1D58, 0x0000, 0x0020, 0x0020, 0x0025, 0x0000, 0x0008, 0x0008, 0x0002

In this way, we are able to distinguish the two strings. Each 0x0000 in the weights above is called a weight separator.
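The level-ordered weight string can be reproduced with a few lines of code. The following is a minimal sketch under stated assumptions, not the actual strnxfrm: the function name weight_string_sketch is hypothetical, it returns a list of 16-bit weights instead of writing bytes into a buffer, and it only knows the four DUCET entries quoted above.

```python
# Collation elements (primary, secondary, tertiary) from the DUCET lines above.
DUCET = {
    "\u006F": [(0x1D58, 0x0020, 0x0002)],                            # o
    "\u004F": [(0x1D58, 0x0020, 0x0008)],                            # O
    "\u00D3": [(0x1D58, 0x0020, 0x0008), (0x0000, 0x0024, 0x0002)],  # O with acute
    "\u00D2": [(0x1D58, 0x0020, 0x0008), (0x0000, 0x0025, 0x0002)],  # O with grave
}

WEIGHT_SEPARATOR = 0x0000


def weight_string_sketch(text, levels=3):
    """Emit all primary weights, then all secondary, then all tertiary."""
    key = []
    for level in range(levels):
        if level > 0:
            key.append(WEIGHT_SEPARATOR)  # marks the boundary between levels
        # A zero weight means "ignorable at this level" and is skipped.
        key.extend(ce[level] for ch in text for ce in DUCET[ch] if ce[level] != 0)
    return key


key_a = weight_string_sketch("o\u00D3")
key_b = weight_string_sketch("O\u00D2")
print([f"0x{w:04X}" for w in key_a])
print([f"0x{w:04X}" for w in key_b])
```

Calling weight_string_sketch(text, levels=1) gives the primary-only result of the existing insensitive collations, where both example strings produce the same key.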
The separator is needed because the secondary weight range in DUCET is [0020, 0192] and the tertiary weight range is [0002, 001F], and the two ranges might overlap after weights are shifted for specific languages.

For the spaces padded to the right of a string, the weight of the space character (0x20) is defined as:

    0020 ; [*0209.0020.0002] # SPACE

If we add, for instance, one padding space to the string 'o\u00D3', our current strnxfrm returns "0x1D58, 0x1D58, 0x0209". After this implementation, it should append the space's primary weight to the end of the characters' primary weights, the space's secondary weight to the end of the characters' secondary weights, and likewise for the tertiary weight. The weights returned should be:

    0x1D58, 0x1D58, 0x0209, 0x0000, 0x0020, 0x0020, 0x0024, 0x0020, 0x0000, 0x0002, 0x0008, 0x0002, 0x0002
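The per-level padding rule can be illustrated in the same way. This is a hedged sketch only: padded_weight_string is a hypothetical name, the pad count is passed in explicitly rather than derived from a column length, and the table holds only the entries needed for the example.

```python
# Collation elements (primary, secondary, tertiary) for the example string.
DUCET = {
    "\u006F": [(0x1D58, 0x0020, 0x0002)],                            # o
    "\u00D3": [(0x1D58, 0x0020, 0x0008), (0x0000, 0x0024, 0x0002)],  # O with acute
}

# 0020 ; [*0209.0020.0002] # SPACE
SPACE_WEIGHTS = (0x0209, 0x0020, 0x0002)

WEIGHT_SEPARATOR = 0x0000


def padded_weight_string(text, pad_spaces, levels=3):
    """Append the space weight of each level to the end of that level's weights."""
    key = []
    for level in range(levels):
        if level > 0:
            key.append(WEIGHT_SEPARATOR)
        key.extend(ce[level] for ch in text for ce in DUCET[ch] if ce[level] != 0)
        key.extend([SPACE_WEIGHTS[level]] * pad_spaces)  # padding for this level
    return key


padded = padded_weight_string("o\u00D3", pad_spaces=1)
print([f"0x{w:04X}" for w in padded])
```

With levels=1 the same call yields the three primary weights that the existing insensitive collations return for the padded string.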