MySQL Blog Archive
For the latest blogs go to blogs.oracle.com/mysql
New collations in MySQL 8.0.0

Since MySQL 5.5, MySQL has supported the utf8mb4 character set.  With the character-set defining the repertoire of characters that can be stored (utf8mb4 can present Unicode characters from U+0000 to U+10FFFF), a collation defines how sorting order and comparisons should behave.

In MySQL 8.0.0 we have improved the utf8mb4 character set support with the addition of over 20 new collations.  We expect to add additional collations before MySQL 8 reaches GA.

Motivation

As part of our plan to switch to utf8mb4 as the default character set for MySQL 8.0, we identified collation support as one of the areas where we would like to make improvements.

As Manyi outlines in her introduction post, the utf8mb4_general_ci collation is quite old, and treats all characters in SMP as equal.  MySQL 5.7 also supports a newer utf8mb4_unicode_520_ci collation, but it still falls behind the latest version of Unicode (9.0.0).

We decided that in order to switch the default, we should also start by being based on the latest standard.  The new collations were built on the basis of the UCA (Unicode Collation Algorithm).  Language specific collations (ex, utf8mb4_spanish_ci) are implemented using UCA4.0.0, UCA5.2.0 is used to implement some general collations (ex, utf8mb4_unicode_520_ci).

Implementation Details

The new collations are build using the weight data defined in DUCET (Default Unicode Collation Element Table) to determine the sort order of characters.  We also use the case mapping and decomposition info defined in Unicodedata.txt, and the language specific collation rules defined in CLDR.

There are actually quite a lot of differences between Unicode 9.0.0 and Unicode 5.2.0 (the highest version we previously supported).  This includes about 20K+ new characters added, with weight data changed to accommodate them.  Some characters have also changed case mapping.

Below is the full list of new collations added so far in MySQL 8.0.0:

The convention used here is:

  • Character-set name (utf8mb4)
  • Language identifier (de,is etc)
  • Unicode version (0900)
  • Accent and case sensitivity (ai_ci means case and accent insensitive)

de_pb means German phone book. es_trad means ‘traditional Spain’.

Additional Improvements

Besides updating the weight data, we also added support for:

Reorder

DUCET gives characters of same group in separate weight range, for example, Latin characters have the weight range [0x1C47, 0x1FB5], Greek characters have the weight range [0x1FB9, 0x1FE5], and Cyrillic characters have the weight range [0x2022, 0x21E1]. So for most collations, Latin character should sort before Greek characters, and Greek characters should sort before Cyrillic characters.

But for Cyrillic language, it is different. It wants Cyrillic character sort after Latin characters but before any character of other languages.  Our new collation has implemented reorder to help sorting correct for Croatian language. You can see in following example:

Caption: The Greek character ‘α’ (U+03B1) sorts less than Cyrillic character ‘а’ (U+0430) with the utf8mb4_hungarian_ci collation (which implements UCA4.0.0).  However, with the newer utf8mb4_hr_0900_ai_ci collation, it sorts greater.

Inheritance of weight

To make our collations work for different languages, we change character’s weight according to the collation rules of the language. For example, we have rule ‘&I << y‘ for Lithuanian, then for character ‘ý’, it should also inherit this rule, which means ‘I << y << ý‘. All 3 characters should sort equal with case / accent insensitive collation.

Conclusion

This is just the start of the bigger effort we are putting into UTF-8 support, please stay tuned for MySQL 8.0.1 where we plan to add accent and case sensitive collations, as well as Japanese language support.

Please try out these new collations, and let us know your feedback!