This section describes the collations available for Unicode character sets and their differentiating properties. For general information about Unicode, see Section 1.9, “Unicode Support”.
MySQL supports multiple Unicode character sets:
utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.
utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.
utf8: An alias for
ucs2: The UCS-2 encoding of the Unicode character set using two bytes per character.
utf16: The UTF-16 encoding for the Unicode character set using two or four bytes per character. Like
ucs2but with an extension for supplementary characters.
utf16le: The UTF-16LE encoding for the Unicode character set. Like
utf16but little-endian rather than big-endian.
utf32: The UTF-32 encoding for the Unicode character set using four bytes per character.
utf8mb3 character set is deprecated and
will be removed in a future MySQL release. Please use
utf8mb4 instead. Although
utf8 is currently an alias for
utf8mb3, at some point
utf8 will become a reference to
utf8mb4. To avoid ambiguity about the
utf8, consider specifying
utf8mb4 explicitly for character set
references instead of
Basic Multilingual Plane (BMP) characters and supplementary
characters that lie outside the BMP.
ucs2 support only BMP characters.
Most Unicode character sets have a general collation (indicated
_general in the name or by the absence of
a language specifier), a binary collation (indicated by
_bin in the name), and several
language-specific collations (indicated by language specifiers).
For example, for
utf8mb4_bin are its general and binary
utf8mb4_danish_ci is one of
its language-specific collations.
Most character sets have a single binary collation.
utf8mb4 is an exception that has two:
utf8mb4_bin and (as of MySQL 8.0.17)
utf8mb4_0900_bin. These two binary collations
have the same sort order but are distinguished by their pad
attribute and collating weight characteristics. See
Collation Pad Attributes, and
Character Collating Weights.
Collation support for
utf16le is limited. The
only collations available are
utf16le_bin. These are similar to
MySQL implements the
collations according to the Unicode Collation Algorithm (UCA)
collation uses the version-4.0.0 UCA weight keys:
collations have only partial support for the Unicode Collation
Algorithm. Some characters are not supported, and combining
marks are not fully supported. This affects primarily
Vietnamese, Yoruba, and some smaller languages such as Navajo.
A combined character is considered different from the same
character written with a single unicode character in string
comparisons, and the two characters are considered to have a
different length (for example, as returned by the
CHAR_LENGTH() function or in
result set metadata).
Unicode collations based on UCA versions higher than 4.0.0 include the version in the collation name. Examples:
utf8mb4_unicode_520_ciis based on UCA 5.2.0 weight keys (http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt),
utf8mb4_0900_ai_ciis based on UCA 9.0.0 weight keys (http://www.unicode.org/Public/UCA/9.0.0/allkeys.txt).
UPPER() functions perform case
folding according to the collation of their argument. A
character that has uppercase and lowercase versions only in a
Unicode version higher than 4.0.0 is converted by these
functions only if the argument collation uses a high enough
Collations based on UCA 9.0.0 and higher are faster than
collations based on UCA versions prior to 9.0.0. They also
have a pad attribute of
NO PAD, in contrast
PAD SPACE as used in collations based on
UCA versions prior to 9.0.0. For comparison of nonbinary
NO PAD collations treat spaces at
the end of strings like any other character (see
Trailing Space Handling in Comparisons).
To determine the pad attribute for a collation, use the
COLLATIONS table, which has a
PAD_ATTRIBUTE column. For example:
mysql> SELECT COLLATION_NAME, PAD_ATTRIBUTE FROM INFORMATION_SCHEMA.COLLATIONS WHERE CHARACTER_SET_NAME = 'utf8mb4'; +----------------------------+---------------+ | COLLATION_NAME | PAD_ATTRIBUTE | +----------------------------+---------------+ | utf8mb4_general_ci | PAD SPACE | | utf8mb4_bin | PAD SPACE | | utf8mb4_unicode_ci | PAD SPACE | | utf8mb4_icelandic_ci | PAD SPACE | ... | utf8mb4_0900_ai_ci | NO PAD | | utf8mb4_de_pb_0900_ai_ci | NO PAD | | utf8mb4_is_0900_ai_ci | NO PAD | ... | utf8mb4_ja_0900_as_cs | NO PAD | | utf8mb4_ja_0900_as_cs_ks | NO PAD | | utf8mb4_0900_as_ci | NO PAD | | utf8mb4_ru_0900_ai_ci | NO PAD | | utf8mb4_ru_0900_as_cs | NO PAD | | utf8mb4_zh_0900_as_cs | NO PAD | | utf8mb4_0900_bin | NO PAD | +----------------------------+---------------+
Comparison of nonbinary string values
TEXT) that have a
collation differ from other collations with respect to
trailing spaces. For example,
'a ' compare as different strings, not
the same string. This can be seen using the binary collations
utf8mb4. The pad attribute for
SPACE, whereas for
utf8mb4_0900_bin it is
PAD. Consequently, operations involving
utf8mb4_0900_bin do not add trailing
spaces, and comparisons involving strings with trailing spaces
may differ for the two collations:
mysql> CREATE TABLE t1 (c CHAR(10) COLLATE utf8mb4_bin); Query OK, 0 rows affected (0.03 sec) mysql> INSERT INTO t1 VALUES('a'); Query OK, 1 row affected (0.01 sec) mysql> SELECT * FROM t1 WHERE c = 'a '; +------+ | c | +------+ | a | +------+ 1 row in set (0.00 sec) mysql> ALTER TABLE t1 MODIFY c CHAR(10) COLLATE utf8mb4_0900_bin; Query OK, 0 rows affected (0.02 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> SELECT * FROM t1 WHERE c = 'a '; Empty set (0.00 sec)
MySQL implements language-specific Unicode collations if the ordering based only on the Unicode Collation Algorithm (UCA) does not work well for a language. Language-specific collations are UCA-based, with additional language tailoring rules. Examples of such rules appear later in this section. For questions about particular language orderings, unicode.org provides Common Locale Data Repository (CLDR) collation charts at http://www.unicode.org/cldr/charts/30/collation/index.html.
For example, the nonlanguage-specific
utf8mb4_0900_ai_ci and language-specific
Unicode collations each have these characteristics:
The collation is based on UCA 9.0.0 and CLDR v30, is accent insensitive, and case insensitive. These characteristics are indicated by
_ciin the collation name. Exception:
utf8mb4_la_0900_ai_ciis not based on CLDR because Classical Latin is not defined in CLDR.
The collation works for all characters in the range [U+0, U+10FFFF].
If the collation is not language specific, it sorts all characters, including supplementary characters, in default order (described following). If the collation is language specific, it sorts characters of the language correctly according to language-specific rules, and characters not in the language in default order.
By default, the collation sorts characters having a code point listed in the DUCET table (Default Unicode Collation Element Table) according to the weight value assigned in the table. The collation sorts characters not having a code point listed in the DUCET table using their implicit weight value, which is constructed according to the UCA.
For non-language-specific collations, characters in contraction sequences are treated as separate characters. For language-specific collations, contractions might change character sorting order.
A collation name that includes a locale code or language name shown in the following table is a language-specific collation. Unicode character sets may include collations for one or more of these languages.
Table 1.3 Unicode Collation Language Specifiers
|German phone book order||
Croatian collations are tailored for these Croatian letters:
Danish collations may also be used for Norwegian.
For Japanese, the
utf8mb4 character set
utf8mb4_ja_0900_as_cs_ks collations. Both
collations are accent sensitive and case-sensitive.
utf8mb4_ja_0900_as_cs_ks is also kana
sensitive and distinguishes Katakana characters from Hiragana
treats Katakana and Hiragana characters as equal for sorting.
Applications that require a Japanese collation but not kana
sensitivity may use
for better sort performance.
utf8mb4_ja_0900_as_cs uses three weight
levels for sorting;
utf8mb4_ja_0900_as_cs_ks uses four.
For Classical Latin collations that are accent insensitive,
J compare as
compare as equal.
V compare as equal on the base letter
level. In other words,
J is regarded as an
regarded as an accented
Spanish collations are available for modern and traditional
Spanish. For both,
ñ (n-tilde) is a
separate letter between
o. In addition, for traditional Spanish,
ch is a separate letter between
ll is a separate letter between
Traditional Spanish collations may also be used for Asturian and Galician.
Swedish collations include Swedish rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:
Ü = Y < Ö
For any Unicode character set, operations performed using the
collation are faster than those for the
collation. For example, comparisons for the
utf8_general_ci collation are faster, but
slightly less correct, than comparisons for
utf8_unicode_ci. The reason is that
utf8_unicode_ci supports mappings such as
expansions; that is, when one character compares as equal to
combinations of other characters. For example,
ß is equal to
German and some other languages.
utf8_unicode_ci also supports contractions
and ignorable characters.
is a legacy collation that does not support expansions,
contractions, or ignorable characters. It can make only
one-to-one comparisons between characters.
To further illustrate, the following equalities hold in both
utf8_unicode_ci (for the effect of this in
comparisons or searches, see
Section 1.8.6, “Examples of the Effect of Collation”):
Ä = A Ö = O Ü = U
A difference between the collations is that this is true for
ß = s
Whereas this is true for
which supports the German DIN-1 ordering (also known as
ß = ss
collations if the ordering with
utf8_unicode_ci does not work well for a
language. For example,
works fine for German dictionary order and French, so there is
no need to create special
utf8_general_ci also is satisfactory for
both German and French, except that
s, and not to
ss. If this is acceptable for your
application, you should use
because it is faster. If this is not acceptable (for example,
if you require German dictionary order), use
utf8_unicode_ci because it is more
If you require German DIN-2 (phone book) ordering, use the
utf8_german2_ci collation, which compares
the following sets of characters equal:
Ä = Æ = AE Ö = Œ = OE Ü = UE ß = ss
utf8_german2_ci is similar to
latin1_german2_ci, but the latter does not
Æ equal to
Œ equal to
utf8_german_ci corresponding to
latin1_german_ci for German dictionary
A character's collating weight is determined as follows:
For all Unicode collations except the
_bin(binary) collations, MySQL performs a table lookup to find a character's collating weight.
utf8mb4_0900_bin, the weight is based on the code point, possibly with leading zero bytes added.
utf8mb4_0900_bin, the weight is the
utf8mb4encoding bytes. The sort order is the same as for
utf8mb4_bin, but much faster.
Collating weights can be displayed using the
WEIGHT_STRING() function. (See
String Functions and Operators.) If a collation uses a
weight lookup table, but a character is not in the table (for
example, because it is a “new” character),
collating weight determination becomes more complex:
For BMP characters in general collations (
), the weight is the code point.
For BMP characters in UCA collations (for example,
and language-specific collations), the following algorithm applies:
if (code >= 0x3400 && code <= 0x4DB5) base= 0xFB80; /* CJK Ideograph Extension */ else if (code >= 0x4E00 && code <= 0x9FA5) base= 0xFB40; /* CJK Ideograph */ else base= 0xFBC0; /* All other characters */ aaaa= base + (code >> 15); bbbb= (code & 0x7FFF) | 0x8000;
The result is a sequence of two collating elements,
bbbb. For example:
mysql> SELECT HEX(WEIGHT_STRING(_ucs2 0x04CF COLLATE ucs2_unicode_ci)); +----------------------------------------------------------+ | HEX(WEIGHT_STRING(_ucs2 0x04CF COLLATE ucs2_unicode_ci)) | +----------------------------------------------------------+ | FBC084CF | +----------------------------------------------------------+
U+04cf CYRILLIC SMALL LETTER PALOCHKAis, with all UCA 4.0.0 collations, greater than
U+04c0 CYRILLIC LETTER PALOCHKA. With UCA 5.2.0 collations, all palochkas sort together.
For supplementary characters in general collations, the weight is the weight for
0xfffd REPLACEMENT CHARACTER. For supplementary characters in UCA 4.0.0 collations, their collating weight is
0xfffd. That is, to MySQL, all supplementary characters are equal to each other, and greater than almost all BMP characters.
An example with Deseret characters and
CREATE TABLE t (s1 VARCHAR(5) CHARACTER SET utf32 COLLATE utf32_unicode_ci); INSERT INTO t VALUES (0xfffd); /* REPLACEMENT CHARACTER */ INSERT INTO t VALUES (0x010412); /* DESERET CAPITAL LETTER BEE */ INSERT INTO t VALUES (0x010413); /* DESERET CAPITAL LETTER TEE */ SELECT COUNT(DISTINCT s1) FROM t;
The result is 2 because in the MySQL
collations, the replacement character has a weight of
0x0dc6, whereas Deseret Bee and Deseret Tee both have a weight of
0xfffd. (Were the
utf32_general_cicollation used instead, the result is 1 because all three characters have a weight of
0xfffdin that collation.)
An example with cuneiform characters and
/* The four characters in the INSERT string are 00000041 # LATIN CAPITAL LETTER A 0001218F # CUNEIFORM SIGN KAB 000121A7 # CUNEIFORM SIGN KISH 00000042 # LATIN CAPITAL LETTER B */ CREATE TABLE t (s1 CHAR(4) CHARACTER SET utf32 COLLATE utf32_unicode_ci); INSERT INTO t VALUES (0x000000410001218f000121a700000042); SELECT HEX(WEIGHT_STRING(s1)) FROM t;
The result is:
0E33 FFFD FFFD 0E4A
0E4Aare primary weights as in UCA 4.0.0.
FFFDis the weight for KAB and also for KISH.
The rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it is very rare that a multi-character string consists entirely of supplementary characters. In Japan, since the supplementary characters are obscure Kanji ideographs, the typical user does not care what order they are in, anyway. If you really want rows sorted by the MySQL rule and secondarily by code point value, it is easy:
ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin
For supplementary characters based on UCA versions higher than 4.0.0 (for example,
), supplementary characters do not necessarily all have the same collating weight. Some have explicit weights from the UCA
allkeys.txtfile. Others have weights calculated from this algorithm:
aaaa= base + (code >> 15); bbbb= (code & 0x7FFF) | 0x8000;
There is a difference between “ordering by the
character's code value” and “ordering by the
character's binary representation,” a difference that
appears only with
utf16_bin, because of
utf16_bin (the binary
utf16) was a binary
comparison “byte by byte” rather than
“character by character.” If that were so, the
order of characters in
differ from the order in
example, the following chart shows two rare characters. The
first character is in the range
FFFF, so it is
greater than a surrogate but less than a supplementary. The
second character is a supplementary.
Code point Character utf8 utf16 ---------- --------- ---- ----- 0FF9D HALFWIDTH KATAKANA LETTER N EF BE 9D FF 9D 10384 UGARITIC LETTER DELTA F0 90 8E 84 D8 00 DF 84
The two characters in the chart are in order by code point
0x10384. And they are in order by
utf8 value because
0xf0. But they are not in order by
utf16 value, if we use byte-by-byte
utf16_bin collation is not
“byte by byte.” It is “by code
point.” When MySQL sees a supplementary-character
utf16, it converts to the
character's code-point value, and then compares. Therefore,
are the same ordering. This is consistent with the SQL:2008
standard requirement for a UCS_BASIC collation:
“UCS_BASIC is a collation in which the ordering is
determined entirely by the Unicode scalar values of the
characters in the strings being sorted. It is applicable to
the UCS character repertoire. Since every character repertoire
is a subset of the UCS repertoire, the UCS_BASIC collation is
potentially applicable to every character set. NOTE 11: The
Unicode scalar value of a character is its code point treated
as an unsigned integer.”
If the character set is
ucs2, comparison is
ucs2 strings should not
contain surrogates, anyway.