MySQL 8.0.1: Japanese collation for utf8mb4

In MySQL 8.0.1, in addition to new as_cs collations (accent sensitive, case sensitive) for utf8mb4, we have also added a new collation for Japanese.

Introducing utf8mb4_ja_0900_as_cs

Collating rules for Japanese are complex. Japanese has multiple writing systems: katakana, hiragana, kanji, and romaji. On top of that, a single character can have both fullwidth and halfwidth forms. For example, how should we sort ‘あ’, ‘ア’, ‘a’, and ‘ｱ’?

According to the reorder rule for Japanese defined by CLDR, [reorder Latn Kana Hani], ‘a’ should sort before all the others because ‘a’ is a Latin letter and the others are all kana (katakana and hiragana). Then what about the sorting order of ‘あ’, ‘ア’ and ‘ｱ’? We sort them as the rule defines: ‘&あ<<<<ア=ｱ’

mysql> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)

mysql> select strcmp('a', 'あ' collate utf8mb4_ja_0900_as_cs);
+--------------------------------------------------+
| strcmp('a', 'あ' collate utf8mb4_ja_0900_as_cs)  |
+--------------------------------------------------+
| -1                                               |
+--------------------------------------------------+
1 row in set (0.00 sec)

mysql> select strcmp('あ', 'ア' collate utf8mb4_ja_0900_as_cs);
+----------------------------------------------------+
| strcmp('あ', 'ア' collate utf8mb4_ja_0900_as_cs)    |
+----------------------------------------------------+
| 0                                                  |
+----------------------------------------------------+
1 row in set (0.00 sec)


mysql> select strcmp('ア', 'ｱ' collate utf8mb4_ja_0900_as_cs);
+----------------------------------------------------+
| strcmp('ア', 'ｱ' collate utf8mb4_ja_0900_as_cs)     |
+----------------------------------------------------+
| 0                                                  |
+----------------------------------------------------+
1 row in set (0.00 sec)
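
The script reorder rule can also be observed with ORDER BY. The following is a minimal sketch, assuming the connection uses utf8mb4 as above; the derived table t and its column c are made up for this illustration. Under [reorder Latn Kana Hani] we would expect the Latin letter first, then the kana, then the kanji.

```sql
-- Hypothetical inline data set: one kanji, one hiragana, one Latin letter.
SELECT c
FROM (
  SELECT '王' AS c
  UNION ALL SELECT 'あ'
  UNION ALL SELECT 'a'
) AS t
-- Apply the Japanese collation explicitly, whatever the connection collation is.
ORDER BY c COLLATE utf8mb4_ja_0900_as_cs;
```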

Why are ‘あ’, ‘ア’ and ‘ｱ’ equal? The rule says ‘あ’ should sort before ‘ア’ on the quaternary level! Yes, that’s true. But CLDR defines the default collating strength for Japanese as 3 (tertiary). This means the quaternary difference is ignored by default. This might not be what users want. We will consider adding more collations for Japanese based on user input.
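
One way to check which differences take part in a comparison is to look at the sort keys the collation produces. This is a sketch using the built-in WEIGHT_STRING() function, which is intended for debugging; the column aliases are only illustrative. Since the three strings compare as equal, we would expect the three hex values to be identical.

```sql
-- WEIGHT_STRING() returns the binary sort key; HEX() makes it readable.
SELECT HEX(WEIGHT_STRING('あ' COLLATE utf8mb4_ja_0900_as_cs)) AS hiragana,
       HEX(WEIGHT_STRING('ア' COLLATE utf8mb4_ja_0900_as_cs)) AS katakana,
       HEX(WEIGHT_STRING('ｱ' COLLATE utf8mb4_ja_0900_as_cs)) AS halfwidth;
```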

JIS X 0208 (http://www.jisc.go.jp/app/pager?id=94516) is a specification published as a Japanese Industrial Standard. It defines a set of Japanese kanji (6,355 kanji characters in total) and their sorting order. utf8mb4_ja_0900_as_cs sorts these kanji characters as defined by JIS X 0208. Many kanji characters are not in the specification, however, because they are less common. utf8mb4_ja_0900_as_cs sorts those characters by their implicit weight (http://www.unicode.org/reports/tr10/#Implicit_Weights) as defined by UCA.

For example, take the kanji characters ‘王’, ‘人’, ‘兵’, ‘﨎’ and ‘㐀’. The first three are in the character set defined by JIS X 0208; the latter two are not.

mysql> create table jpn(a varchar(10)) collate utf8mb4_ja_0900_as_cs;
Query OK, 0 rows affected (0.90 sec)

mysql> insert into jpn values ('王'), ('﨎'), ('人'), ('㐀'), ('兵');
Query OK, 5 rows affected (0.05 sec)
Records: 5 Duplicates: 0 Warnings: 0

mysql> select a from jpn order by a;
+------+
| a    |
+------+
| 王   |
| 人   |
| 兵   |
| 﨎   |
| 㐀   |
+------+
5 rows in set (0.02 sec)
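
For contrast, the same rows can be sorted with the language-neutral utf8mb4_0900_as_cs collation by overriding the column collation in ORDER BY. This is only a sketch that reuses the jpn table from above. That collation applies no Japanese tailoring, so the order of the first three kanji may differ from the result shown for utf8mb4_ja_0900_as_cs.

```sql
-- Same table as above, sorted without the Japanese (JIS X 0208) tailoring.
SELECT a FROM jpn ORDER BY a COLLATE utf8mb4_0900_as_cs;
```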

Conclusion

This post follows on from earlier posts describing our work on improving utf8 support as part of the switch to make it the default character set. If you haven’t read our earlier posts, please do.

Please also try out this new collation and let us know your feedback!