MySQL Blog Archive
For the latest blogs go to blogs.oracle.com/mysql
MySQL 8.0.1: Accent and case sensitive collations for utf8mb4

In MySQL 8.0 we have been working to improve our support for utf8 as we make the transition to switch it to the default character set.  For more details see our earlier posts:

In today’s post I wanted to describe the improvements to support accent and case sensitive collations.

Introducing as_cs collations

While previous releases of MySQL have supported both case insensitive and sensitive collations, there has not been support for a case sensitive collation when using UTF-8 as the character set:

In MySQL 8.0.0 we improved our character set support with the addition of new accent and case insensitive (ai_ci) collations.  In MySQL 8.0.1 the corresponding accent and case sensitive collations (as_cs) have also been added, as well as a Japanese collation:

Implementation Details

Similar to the ai_ci collations described in the previous post, all these as_cs collations using the weight data defined in DUCET to sort.

DUCET defines three levels of weight for about 30k characters. Each character has one or more collation element. For example, the collation element of latin character ‘a’ is [.1C47.0020.0002]. In each collation element, there are three hex value separated by dot mark. Here 1C47 is called primary weight, 0020 is secondary weight and 0002 is tertiary weight:

  • Primary weight is used to distinguish the base letter of character.
  • Secondary weight is used to distinguish accent of character.
  • Tertiary weight is used to show different case.

For example, ‘a’ and ‘A’, their base letter is same, LATIN A, so they have the same primary weight 1C47:

0061 ; [.1C47.0020.0002] # LATIN SMALL LETTER A
0041 ; [.1C47.0020.0008] # LATIN CAPITAL LETTER A
00E3 ; [.1C47.0020.0002][.0000.002D.0002] # LATIN SMALL LETTER A WITH TILDE
0062 ; [.1C60.0020.0002] # LATIN SMALL LETTER B

Comparing the primary weight, we can easily sort ‘B’ greater than all other three characters because 1C60 > 1C47. Comparing the secondary weight, we can sort ‘ã’ greater than ‘a’ and ‘A’ because both ‘a’ and ‘A’ have same secondary weight 0020, but ‘ã’ has 0020 002D. And ‘A’ is greater than ‘a’ because 0008 > 0002.

Ai_ci collations compare only the primary weight, but as_cs compare the primary, secondary and tertiary weights.

Additional Collation Rules

The reorder and weight inheritance rules described in the previous post also apply to these case sensitive collations.  We also implemented another parametric collating, ‘case first’. ‘Case first’ is used by utf8mb4_da_0900_as_cs for Danish. When case of character is considered to collate, most languages sort small character before capitalized character, e.g. 'a' < 'A'. But for Danish, the reverse order is expected:

We can see from above example that ‘aa’ equals ‘å’ when we use utf8mb4_da_0900_ai_ci to do the comparison, but ‘aa’ sorts after ‘å’ when utf8mb4_da_0900_as_cs is used. This is because the collating rule defined in CLDR:

&[before1]ǀ<æ<<<Æ<<ä<<<Ä<ø<<<Ø<<ö<<<Ö<<ő<<<Ő<<œ<<<Œ<å<<<Å<<<aa<<<Aa<<<AA

In the past, when MySQL didn’t have the much needed case sensitive collation, some users turn to use binary collation (e.g. utf8_bin) as alternative. Although _bin collation differentiate capital letters and smaller letter for comparison, _bin collations behaves quite differently from Unicode based collations. A binary collation compares only the code point of characters and have no idea what the character really means, while Unicode based collations compare the weights. A simple example, we usually sort ‘w’ before ‘W’, but utf8mb4_bin sorts ‘w’ after ‘W’ because the code point of ‘w’ is 0x77 and the code point of ‘W’ is 0x57.

For detailed collating rule for different languages, please refer to: http://unicode.org/repos/cldr/tags/release-30/common/collation/

Conclusion

Please try out the new as_cs collations and let us know what you think!  In my next post, I will describe the new collations for Japanese language support.