MySQL :: MySQL 8.0: Kana-sensitive collation for Japanese

In my previous post, I wrote about the new Japanese collation for utf8mb4 introduced in MySQL 8.0.1! This collation (utf8mb4_ja_0900_as_cs) implements accent / case sensitivity for Japanese as defined by CLDR 30.

Today, I am writing about our new utf8mb4_ja_0900_as_cs_ks collation which includes support for kana sensitivity. Kana sensitive means one collation should distinguish hiragana from katakana. For Japanese, hiragana should sort smaller than katakana ( あ < ア, etc.). DUCET has organized three levels of weight for all hiragana and katakana characters to ensure hiragana sort smaller than katakana. For example:

3042 ; [.3D5A.0020.000E] # HIRAGANA LETTER A
30A2 ; [.3D5A.0020.0011] # KATAKANA LETTER A

The difference at third level (000E and 0011) ensures 0x3042 (あ) < 0x30A2 (ア). The CLDR rules require the difference of hiragana and katakana happen on quaternary level (ex, &あ<<<<ア). The default collating level is three ([strength 3]), so they are equal at first three levels.

Introducing `utf8mb4_ja_0900_as_cs_ks`

In response to the overwhelming feedback we’ve received on our utf8mb4_ja_0900_as_cs collation, we’ve decided to add a new utf8mb4_ja_0900_as_cs_ks collation that sorts hiragana and katakana as expected. The ‘_ks’ stands for kana sensitive.

This collation assigns quaternary level weight for hiragana and katakana to make them different even if their first three levels of weight are same. The following example shows different result when we compare same strings with utf8mb4_ja_0900_as_cs and utf8mb4_ja_0900_as_cs_ks:

mysql> SET @s1 = CONVERT('きゅう' USING utf8mb4);
Query OK, 0 rows affected (0.01 sec)

mysql> SET @s2 = CONVERT('キュウ' USING utf8mb4);
Query OK, 0 rows affected (0.00 sec)

mysql> SET @s3 = CONVERT('きゆう' USING utf8mb4);
Query OK, 0 rows affected (0.00 sec)

mysql> SET @s4 = CONVERT('キユウ' USING utf8mb4);
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT STRCMP(@s1 COLLATE utf8mb4_ja_0900_as_cs, @s2);
+------------------------------------------------+
| STRCMP(@s1 COLLATE utf8mb4_ja_0900_as_cs, @s2) |
+------------------------------------------------+
|                                              0 |
+------------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT STRCMP(@s2 COLLATE utf8mb4_ja_0900_as_cs, @s3);
+------------------------------------------------+
| STRCMP(@s2 COLLATE utf8mb4_ja_0900_as_cs, @s3) |
+------------------------------------------------+
|                                             -1 |
+------------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT STRCMP(@s3 COLLATE utf8mb4_ja_0900_as_cs, @s4);
+------------------------------------------------+
| STRCMP(@s3 COLLATE utf8mb4_ja_0900_as_cs, @s4) |
+------------------------------------------------+
|                                              0 |
+------------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT STRCMP(@s1 COLLATE utf8mb4_ja_0900_as_cs_ks, @s2);
+---------------------------------------------------+
| STRCMP(@s1 COLLATE utf8mb4_ja_0900_as_cs_ks, @s2) |
+---------------------------------------------------+
|                                                -1 |
+---------------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT STRCMP(@s2 COLLATE utf8mb4_ja_0900_as_cs_ks, @s3);
+---------------------------------------------------+
| STRCMP(@s2 COLLATE utf8mb4_ja_0900_as_cs_ks, @s3) |
+---------------------------------------------------+
|                                                -1 |
+---------------------------------------------------+
1 row in set (0.01 sec)

mysql> SELECT STRCMP(@s3 COLLATE utf8mb4_ja_0900_as_cs_ks, @s4);
+---------------------------------------------------+
| STRCMP(@s3 COLLATE utf8mb4_ja_0900_as_cs_ks, @s4) |
+---------------------------------------------------+
|                                                -1 |
+---------------------------------------------------+
1 row in set (0.00 sec)

mysql> SET @s1 = CONVERT('きゅう' USING utf8mb4);

Query OK, 0 rows affected (0.01 sec)

mysql> SET @s2 = CONVERT('キュウ' USING utf8mb4);

Query OK, 0 rows affected (0.00 sec)

mysql> SET @s3 = CONVERT('きゆう' USING utf8mb4);

Query OK, 0 rows affected (0.00 sec)

mysql> SET @s4 = CONVERT('キユウ' USING utf8mb4);

Query OK, 0 rows affected (0.00 sec)

mysql> SELECT STRCMP(@s1 COLLATE utf8mb4_ja_0900_as_cs, @s2);

+------------------------------------------------+

| STRCMP(@s1 COLLATE utf8mb4_ja_0900_as_cs, @s2) |

+------------------------------------------------+

| 0 |

+------------------------------------------------+

1 row in set (0.00 sec)

mysql> SELECT STRCMP(@s2 COLLATE utf8mb4_ja_0900_as_cs, @s3);

+------------------------------------------------+

| STRCMP(@s2 COLLATE utf8mb4_ja_0900_as_cs, @s3) |

+------------------------------------------------+

| -1 |

+------------------------------------------------+

1 row in set (0.00 sec)

mysql> SELECT STRCMP(@s3 COLLATE utf8mb4_ja_0900_as_cs, @s4);

+------------------------------------------------+

| STRCMP(@s3 COLLATE utf8mb4_ja_0900_as_cs, @s4) |

+------------------------------------------------+

| 0 |

+------------------------------------------------+

1 row in set (0.00 sec)

mysql> SELECT STRCMP(@s1 COLLATE utf8mb4_ja_0900_as_cs_ks, @s2);

+---------------------------------------------------+

| STRCMP(@s1 COLLATE utf8mb4_ja_0900_as_cs_ks, @s2) |

+---------------------------------------------------+

| -1 |

+---------------------------------------------------+

1 row in set (0.00 sec)

mysql> SELECT STRCMP(@s2 COLLATE utf8mb4_ja_0900_as_cs_ks, @s3);

+---------------------------------------------------+

| STRCMP(@s2 COLLATE utf8mb4_ja_0900_as_cs_ks, @s3) |

+---------------------------------------------------+

| -1 |

+---------------------------------------------------+

1 row in set (0.01 sec)

mysql> SELECT STRCMP(@s3 COLLATE utf8mb4_ja_0900_as_cs_ks, @s4);

+---------------------------------------------------+

| STRCMP(@s3 COLLATE utf8mb4_ja_0900_as_cs_ks, @s4) |

+---------------------------------------------------+

| -1 |

+---------------------------------------------------+

1 row in set (0.00 sec)

We can see that, with utf8mb4_ja_0900_as_cs, ‘きゅう’ = ‘キュウ’ < ‘きゆう’ = ‘キユウ’, but with utf8mb4_ja_0900_as_cs_ks, ‘きゅう’ < ‘キュウ’ < ‘きゆう’ < ‘キユウ’.

Conclusion

This collation is available today as a Labs release built with MySQL 8.0.1. We plan to make it available in regular MySQL releases starting with 8.0.2.

Please try it out, and let us know your feedback 🙂

Introducing utf8mb4_ja_0900_as_cs_ks

Conclusion

Introducing `utf8mb4_ja_0900_as_cs_ks`