In MySQL 8.0 we have been working to improve our support for utf8 as we make the transition to switch it to the default character set. For more details see our earlier posts:
In today’s post I wanted to describe the improvements to support accent and case sensitive collations.
Introducing as_cs collations
While previous releases of MySQL have supported both case insensitive and sensitive collations, there has not been support for a case sensitive collation when using UTF-8 as the character set:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
|
mysql57> SELECT COUNT(*) FROM information_schema.collations WHERE collation_name LIKE 'utf8\_%cs' OR collation_name LIKE 'utf8mb4%cs'; +----------+ | COUNT(*) | +----------+ | 0 | +----------+ 1 row in set (0.00 sec) mysql57> SELECT collation_name FROM information_schema.collations WHERE collation_name LIKE '%cs'; +--------------------+ | collation_name | +--------------------+ | latin1_general_cs | | latin2_czech_cs | | cp1250_czech_cs | | latin7_estonian_cs | | latin7_general_cs | | cp1251_general_cs | +--------------------+ 6 rows in set (0.00 sec) |
In MySQL 8.0.0 we improved our character set support with the addition of new accent and case insensitive (ai_ci
) collations. In MySQL 8.0.1 the corresponding accent and case sensitive collations (as_cs
) have also been added, as well as a Japanese collation:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
|
mysql> select collation_name from information_schema.collations where character_set_name='utf8mb4' and collation_name like '%as_cs' order by collation_name; +----------------------------+ | collation_name | +----------------------------+ | utf8mb4_0900_as_cs | | utf8mb4_cs_0900_as_cs | | utf8mb4_da_0900_as_cs | | utf8mb4_de_pb_0900_as_cs | | utf8mb4_eo_0900_as_cs | | utf8mb4_es_0900_as_cs | | utf8mb4_es_trad_0900_as_cs | | utf8mb4_et_0900_as_cs | | utf8mb4_hr_0900_as_cs | | utf8mb4_hu_0900_as_cs | | utf8mb4_is_0900_as_cs | | utf8mb4_ja_0900_as_cs | | utf8mb4_la_0900_as_cs | | utf8mb4_lt_0900_as_cs | | utf8mb4_lv_0900_as_cs | | utf8mb4_pl_0900_as_cs | | utf8mb4_ro_0900_as_cs | | utf8mb4_sk_0900_as_cs | | utf8mb4_sl_0900_as_cs | | utf8mb4_sv_0900_as_cs | | utf8mb4_tr_0900_as_cs | | utf8mb4_vi_0900_as_cs | +----------------------------+ 22 rows in set (0.02 sec) |
Implementation Details
Similar to the ai_ci
collations described in the previous post, all these as_cs
collations using the weight data defined in DUCET to sort.
DUCET defines three levels of weight for about 30k characters. Each character has one or more collation element. For example, the collation element of latin character ‘a’ is [.1C47.0020.0002]. In each collation element, there are three hex value separated by dot mark. Here 1C47 is called primary weight, 0020 is secondary weight and 0002 is tertiary weight:
- Primary weight is used to distinguish the base letter of character.
- Secondary weight is used to distinguish accent of character.
- Tertiary weight is used to show different case.
For example, ‘a’ and ‘A’, their base letter is same, LATIN A, so they have the same primary weight 1C47:
0061 ; [.1C47.0020.0002] # LATIN SMALL LETTER A
0041 ; [.1C47.0020.0008] # LATIN CAPITAL LETTER A
00E3 ; [.1C47.0020.0002][.0000.002D.0002] # LATIN SMALL LETTER A WITH TILDE
0062 ; [.1C60.0020.0002] # LATIN SMALL LETTER B
Comparing the primary weight, we can easily sort ‘B’ greater than all other three characters because 1C60 > 1C47. Comparing the secondary weight, we can sort ‘ã’ greater than ‘a’ and ‘A’ because both ‘a’ and ‘A’ have same secondary weight 0020, but ‘ã’ has 0020 002D. And ‘A’ is greater than ‘a’ because 0008 > 0002.
Ai_ci
collations compare only the primary weight, but as_cs
compare the primary, secondary and tertiary weights.
Additional Collation Rules
The reorder and weight inheritance rules described in the previous post also apply to these case sensitive collations. We also implemented another parametric collating, ‘case first’. ‘Case first’ is used by utf8mb4_da_0900_as_cs
for Danish. When case of character is considered to collate, most languages sort small character before capitalized character, e.g. 'a' < 'A'
. But for Danish, the reverse order is expected:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
|
mysql> set names utf8mb4; Query OK, 0 rows affected (0.00 sec) mysql> select 'a' < 'A' collate utf8mb4_0900_as_cs; +--------------------------------------+ | 'a' < 'A' collate utf8mb4_0900_as_cs | +--------------------------------------+ | 1 | +--------------------------------------+ 1 row in set (0.00 sec) mysql> select 'a' < 'A' collate utf8mb4_da_0900_as_cs; +-----------------------------------------+ | 'a' < 'A' collate utf8mb4_da_0900_as_cs | +-----------------------------------------+ | 0 | +-----------------------------------------+ 1 row in set (0.06 sec) mysql> select strcmp('aa', 'å' collate utf8mb4_da_0900_ai_ci); +--------------------------------------------------+ | strcmp('aa', 'å' collate utf8mb4_da_0900_ai_ci) | +--------------------------------------------------+ | 0 | +--------------------------------------------------+ 1 row in set (0.00 sec) mysql> select strcmp('aa', 'å' collate utf8mb4_da_0900_as_cs); +--------------------------------------------------+ | strcmp('aa', 'å' collate utf8mb4_da_0900_as_cs) | +--------------------------------------------------+ | 1 | +--------------------------------------------------+ 1 row in set (0.00 sec) |
We can see from above example that ‘aa’ equals ‘å’ when we use utf8mb4_da_0900_ai_ci to do the comparison, but ‘aa’ sorts after ‘å’ when utf8mb4_da_0900_as_cs is used. This is because the collating rule defined in CLDR:
&[before1]ǀ<æ<<<Æ<<ä<<<Ä<ø<<<Ø<<ö<<<Ö<<ő<<<Ő<<œ<<<Œ<å<<<Å<<<aa<<<Aa<<<AA
In the past, when MySQL didn’t have the much needed case sensitive collation, some users turn to use binary collation (e.g. utf8_bin) as alternative. Although _bin collation differentiate capital letters and smaller letter for comparison, _bin collations behaves quite differently from Unicode based collations. A binary collation compares only the code point of characters and have no idea what the character really means, while Unicode based collations compare the weights. A simple example, we usually sort ‘w’ before ‘W’, but utf8mb4_bin sorts ‘w’ after ‘W’ because the code point of ‘w’ is 0x77 and the code point of ‘W’ is 0x57.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
|
create table abc (a varchar(20)) character set utf8mb4; Query OK, 0 rows affected (0.49 sec) mysql> insert into abc values('white house'), ('White House'); Query OK, 2 rows affected (0.06 sec) Records: 2 Duplicates: 0 Warnings: 0 mysql> select * from abc order by a collate utf8mb4_bin; +-------------+ | a | +-------------+ | White House | | white house | +-------------+ 2 rows in set (0.00 sec) mysql> select * from abc order by a collate utf8mb4_0900_as_cs; +-------------+ | a | +-------------+ | white house | | White House | +-------------+ 2 rows in set (0.00 sec) |
For detailed collating rule for different languages, please refer to: http://unicode.org/repos/cldr/tags/release-30/common/collation/
Conclusion
Please try out the new as_cs collations and let us know what you think! In my next post, I will describe the new collations for Japanese language support.