In MySQL 8.0 our plan is to drastically improve support for utf8. While utf8 support itself dates back to MySQL 4.1, there exist some limitations. The “sushi = beer” problem in the title refers to Bug #76553. Sushi and beer don’t even go well together, at least not to my taste:-) I will use this bug as an example to explain issues with utf8 collations in the past and our plans for utf8 support going forward.
Problem #1 utf8mb3 vs utf8mb4
For historical reasons, the utf8 character set in MySQL refers to utf8mb3 and not utf8mb4. The 3-byte utf8 character set supports only a limited set of the characters defined in Unicode, basically those in the BMP (Basic Multilingual Plane). It does not support, for instance, emojis and other characters in the SMP (Supplementary Multilingual Plane). Additional Chinese characters (CJK Unified Ideographs Extension B), which are in the SIP (Supplementary Ideographic Plane), are not covered by utf8mb3 either.
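To make the 3-byte limit concrete, here is a small Python sketch that counts how many bytes UTF-8 needs for characters from different planes. The helper names (utf8_length, fits_in_utf8mb3) are made up for this illustration and are not MySQL functions.

```python
# Illustration only: how many bytes UTF-8 needs per character.
# utf8_length and fits_in_utf8mb3 are hypothetical helper names.

def utf8_length(ch: str) -> int:
    """Number of bytes in the UTF-8 encoding of a single character."""
    return len(ch.encode("utf-8"))

def fits_in_utf8mb3(ch: str) -> bool:
    """True if the character can be encoded in at most 3 bytes (BMP only)."""
    return utf8_length(ch) <= 3

samples = [
    ("LATIN SMALL LETTER A", "\u0061"),        # plane 0 (BMP)
    ("KATAKANA LETTER HA", "\u30cf"),          # plane 0 (BMP)
    ("SUSHI", "\U0001F363"),                   # plane 1 (SMP)
    ("BEER MUG", "\U0001F37A"),                # plane 1 (SMP)
    ("CJK EXTENSION B IDEOGRAPH", "\U00020000"),  # plane 2 (SIP)
]

for name, ch in samples:
    plane = ord(ch) >> 16  # plane number = code point divided by 0x10000
    print(f"U+{ord(ch):04X} {name}: plane {plane}, "
          f"{utf8_length(ch)} bytes, fits utf8mb3: {fits_in_utf8mb3(ch)}")
```

Every character outside the BMP has a code point of U+10000 or above, and those are exactly the code points that UTF-8 encodes in 4 bytes.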
The confusion between utf8[mb3] and utf8mb4 is exacerbated by our desire for backwards compatibility: suddenly making utf8 an alias for utf8mb4 would likely create problems during upgrades. One solution could be to remove the alias “utf8” for one GA version and later reintroduce it as an alias for utf8mb4. Our plan, however, is to change the default character set to utf8mb4. Going forward, we expect that users will need to change the character set less often, which should reduce the impact of the utf8mb3 vs utf8mb4 problem.
Problem #2 default collation of utf8mb4
Luckily, the reporter behind Bug #76553 had already figured out problem #1: he was using the utf8mb4 character set. The default collation of utf8mb4 in 5.7 and earlier is utf8mb4_general_ci. This is quite an old collation, and it treats all characters in the SMP as equal! Hence the reported Sushi = Beer problem. I guess utf8mb4_general_ci supports only a limited set of utf8 characters for performance reasons, and because emojis and other SMP characters were not that commonly used at the time it was developed.
MySQL 5.7 already provides a much newer collation, utf8mb4_unicode_520_ci, that handles characters in the SMP correctly. The Sushi-Beer problem could be solved by switching to this collation. But in MySQL 8.0 we decided to take this a step further by introducing utf8mb4_0900_ai_ci, which is based on the latest Unicode standard, and we intend to make it the default collation for utf8mb4.
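The effect can be mimicked with a toy model in Python. This is a simplified sketch, not the server's implementation: it assumes that every character outside the BMP is given one shared weight (the MySQL manual describes this as the weight of U+FFFD REPLACEMENT CHARACTER), and it stands in for the BMP weights with a plain uppercase mapping. The function names are invented for the example.

```python
# Toy model only. Real utf8mb4_general_ci weights come from internal
# tables; here BMP weights are approximated by uppercasing, and every
# supplementary character is assumed to share one weight (0xFFFD).

SHARED_SUPPLEMENTARY_WEIGHT = 0xFFFD  # assumption taken from the MySQL manual

def toy_general_ci_weight(ch: str) -> int:
    """Hypothetical weight of one character in the toy collation."""
    if ord(ch) > 0xFFFF:
        # Outside the BMP: all characters collapse to the same weight.
        return SHARED_SUPPLEMENTARY_WEIGHT
    upper = ch.upper()
    # Some characters uppercase to several characters; keep those as-is.
    return ord(upper) if len(upper) == 1 else ord(ch)

def toy_sort_key(text: str) -> list:
    """One weight per character, compared left to right."""
    return [toy_general_ci_weight(ch) for ch in text]

def toy_equal(a: str, b: str) -> bool:
    """True if the toy collation cannot tell the two strings apart."""
    return toy_sort_key(a) == toy_sort_key(b)

SUSHI = "\U0001F363"
BEER = "\U0001F37A"

print(toy_equal(SUSHI, BEER))  # sushi = beer in the toy model
print(toy_equal("a", "A"))     # case-insensitive for BMP letters
print(toy_equal("a", "b"))     # different base letters still differ
```

Because both emojis map to the same weight, any comparison, lookup, or unique key that uses such a collation sees them as the same value.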
Problem #3 Sorting level
The problem is not entirely solved for the reporter of Bug #76553 just yet. Neither utf8mb4_unicode_520_ci nor utf8mb4_0900_ai_ci solves the Haha-Papa problem (the Mother-Father issue in Japanese). MySQL will not recognize “ハ” (U+30CF KATAKANA LETTER HA), “パ” (U+30D1 KATAKANA LETTER PA), and “バ” (U+30D0 KATAKANA LETTER BA) as different characters.
To understand this problem, we need to take a look at the sorting levels defined in the standard:
Primary Level: used to denote differences between base characters (for example, “a” < “b”)
Secondary Level: Accents (for example, “as” < “às” < “at”)
Tertiary Level: Upper and lower case differences (for example, “ao” < “Ao” < “aò”)
We need the secondary sorting level before we can differentiate these characters, and many Japanese characters actually require the tertiary sorting level.
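The three levels can be sketched in a few lines of Python. This is a deliberately simplified model and not the Unicode Collation Algorithm that real collations implement: it assumes that the base character after canonical decomposition (NFD) stands for the primary level, the combining marks for the secondary level, and a case flag for the tertiary level. All function names are made up for the illustration.

```python
import unicodedata

# Simplified model of multi-level comparison (NOT the real UCA):
#   primary   = base letter, case-folded
#   secondary = combining marks left over after NFD decomposition
#   tertiary  = whether the base letter changes under case folding

def split_levels(text: str):
    """Return (primary, secondary, tertiary) tuples for a string."""
    primary, secondary, tertiary = [], [], []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        primary.append(base.casefold())
        secondary.append(marks)
        tertiary.append(base != base.casefold())
    return tuple(primary), tuple(secondary), tuple(tertiary)

def sort_key(text: str, levels: int):
    """Sort key using only the first `levels` levels (1, 2 or 3)."""
    return split_levels(text)[:levels]

def equal_at(a: str, b: str, levels: int) -> bool:
    """True if the strings compare equal at the given strength."""
    return sort_key(a, levels) == sort_key(b, levels)

HAHA = "\u30cf\u30cf"  # two KATAKANA LETTER HA
PAPA = "\u30d1\u30d1"  # two KATAKANA LETTER PA

print(equal_at(HAHA, PAPA, levels=1))  # primary only: haha = papa
print(equal_at(HAHA, PAPA, levels=2))  # secondary: the marks differ
print(sorted(["at", "\u00e0s", "as"], key=lambda s: sort_key(s, 2)))
print(sorted(["a\u00f2", "Ao", "ao"], key=lambda s: sort_key(s, 3)))
```

In this model “パ” decomposes into “ハ” plus a combining semi-voiced sound mark, so a primary-level comparison sees only the shared base character, while a secondary-level comparison sees the mark. The two sorted() calls reproduce the orderings given in the examples for the secondary and tertiary levels.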
MySQL 5.7 and earlier versions only support sorting and comparison at the primary level. We are adding accent-sensitive and case-sensitive collations to MySQL 8.0 which rely on secondary and tertiary sorting. These also include language-specific collations, e.g. utf8mb4_0900_danish_as_cs will be the accent-sensitive and case-sensitive collation for Danish. The Danish collation contains sorting rules that are specific to Danish and do not apply to other languages.
We also plan to add a Japanese collation. Japanese is a fascinating language, and our collation experts Xing Zhang and Bernt Marius Johnsen will explain this in more detail in a future blog post.
To summarize, our plan is to drastically improve support for utf8 by changing the default character set to utf8mb4 and adding a large set of collations to cater to the international user base of MySQL.
Please stay tuned; this is only the first in a series of blog posts where we share knowledge of character sets and collations, give insight into our roadmap, and discuss practices for upgrades and the like.
Thank you for using MySQL!
Manyi