MySQL supports multiple Unicode character sets:
utf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character.
utf8mb4, a UTF-8 encoding of the Unicode character set using one to four bytes per character.
ucs2, the UCS-2 encoding of the Unicode character set using 16 bits per character.
utf16, the UTF-16 encoding for the Unicode character set; like
ucs2but with an extension for supplementary characters.
utf16le, the UTF-16LE encoding for the Unicode character set; like
utf16but little-endian rather than big-endian.
utf32, the UTF-32 encoding for the Unicode character set using 32 bits per character.
Basic Multilingual Plane (BMP) characters.
support BMP and supplementary characters.
utf16le was added in MySQL 5.6.1.
You can store text in about 650 languages using these character sets. This section lists the collations available for each Unicode character set and describes their differentiating properties. For general information about Unicode character sets, see Section 10.1.11, “Unicode Support”.
Most Unicode character sets have a similar set of collations, shown in the following list:
In the preceding list,
represents the character set name. For example,
represents the Danish collations, the specific names of which
Collation support for
utf16le is more
limited. The only collations available are
utf16le_bin. These are similar to
Unicode collation names may include a version number (for
to indicate the Unicode Collation Algorithm version on which
the collation is based, as described later in this section.
For such collations, there is no
alias to the corresponding
See Section 10.1.11.6, “The utf8mb3 Character Set (Alias for utf8)”.
collations were added in MySQL 5.6.5. They preserve the
pre-5.1.24 ordering of the original
collations and permit upgrades for tables created before MySQL
5.1.24. For more information, see
Section 2.11.3, “Checking Whether Tables or Indexes Must Be Rebuilt”, and
Section 2.11.4, “Rebuilding or Repairing Tables or Indexes”.
MySQL implements the
collations according to the Unicode Collation Algorithm (UCA)
collation uses the version-4.0.0 UCA weight keys:
collations have only partial support for the Unicode Collation
Algorithm. Some characters are not supported yet. Also,
combining marks are not fully supported. This affects
primarily Vietnamese, Yoruba, and some smaller languages such
as Navajo. A combined character will be considered different
from the same character written with a single unicode
character in string comparisons, and the two characters are
considered to have a different length (for example, as
returned by the
function or in result set metadata).
Collations based on UCA versions later than 4.0.0 include the
version in the collation name. Thus,
utf8_unicode_520_ci is based on UCA 5.2.0
MySQL implements language-specific Unicode collations only if
the ordering with
does not work well for a language. Language-specific
collations are UCA-based. They are derived from
with additional language tailoring rules.
UPPER() perform case folding
according to the collation of their argument. A character that
has uppercase and lowercase versions only in a Unicode version
more recent than 4.0.0 will be converted by these functions
only if the argument has a collation that uses a recent enough
For any Unicode character set, operations performed using the
collation are faster than those for the
collation. For example, comparisons for the
utf8_general_ci collation are faster, but
slightly less correct, than comparisons for
utf8_unicode_ci. The reason for this is
utf8_unicode_ci supports mappings such
as expansions; that is, when one character compares as equal
to combinations of other characters. For example, in German
and some other languages “
is equal to “
utf8_unicode_ci also supports contractions
and ignorable characters.
is a legacy collation that does not support expansions,
contractions, or ignorable characters. It can make only
one-to-one comparisons between characters.
To further illustrate, the following equalities hold in both
utf8_unicode_ci (for the effect this has in
comparisons or when doing searches, see
Section 10.1.8.7, “Examples of the Effect of Collation”):
Ä = A Ö = O Ü = U
A difference between the collations is that this is true for
ß = s
Whereas this is true for
which supports the German DIN-1 ordering (also known as
ß = ss
MySQL implements language-specific collations for the
utf8 character set only if the ordering
utf8_unicode_ci does not work well for
a language. For example,
works fine for German dictionary order and French, so there is
no need to create special
utf8_general_ci also is satisfactory for
both German and French, except that
ß” is equal to
s”, and not to
ss”. If this is acceptable
for your application, you should use
utf8_general_ci because it is faster. If
this is not acceptable (for example, if you require German
dictionary order), use
because it is more accurate.
If you require German DIN-2 (phone book) ordering, use the
utf8_german2_ci collation, which compares
the following sets of characters equal:
Ä = Æ = AE Ö = Œ = OE Ü = UE ß = ss
utf8_german2_ci is similar to
latin1_german2_ci, but the latter does not
Æ” equal to
Œ” equal to
OE”. There is no
utf8_german_ci corresponding to
latin1_german_ci for German dictionary
includes Swedish rules. For example, in Swedish, the following
relationship holds, which is not something expected by a
German or French speaker:
Ü = Y < Ö
collations correspond to modern Spanish and traditional
Spanish, respectively. In both collations,
ñ” (n-tilde) is a separate
letter between “
o”. In addition, for
traditional Spanish, “
ch” is a
separate letter between “
ll” is a separate letter
collations may also be used for Asturian and Galician.
collations may also be used for Norwegian.
compare as equal, and
V compare as equal.
collations are tailored for these Croatian letters:
For all Unicode collations except the “binary”
collations, MySQL performs a table lookup to find a
character's collating weight. This weight can be displayed
function. (See Section 12.5, “String Functions”.) If a
character is not in the table (for example, because it is a
“new” character), collating weight determination
becomes more complex:
For BMP characters in general collations (
), weight = code point.
For BMP characters in UCA collations (for example,
and language-specific collations), the following algorithm applies:
if (code >= 0x3400 && code <= 0x4DB5) base= 0xFB80; /* CJK Ideograph Extension */ else if (code >= 0x4E00 && code <= 0x9FA5) base= 0xFB40; /* CJK Ideograph */ else base= 0xFBC0; /* All other characters */ aaaa= base + (code >> 15); bbbb= (code & 0x7FFF) | 0x8000;
The result is a sequence of two collating elements,
bbbb. For example:
SELECT HEX(WEIGHT_STRING(_ucs2 0x04CF COLLATE ucs2_unicode_ci));+----------------------------------------------------------+ | HEX(WEIGHT_STRING(_ucs2 0x04CF COLLATE ucs2_unicode_ci)) | +----------------------------------------------------------+ | FBC084CF | +----------------------------------------------------------+
U+04cf CYRILLIC SMALL LETTER PALOCHKAis, with all UCA 4.0.0 collations, greater than
U+04c0 CYRILLIC LETTER PALOCHKA. With UCA 5.2.0 collations, all palochkas sort together.
For supplementary characters in general collations, the weight is the weight for
0xfffd REPLACEMENT CHARACTER. For supplementary characters in UCA 4.0.0 collations, their collating weight is
0xfffd. That is, to MySQL, all supplementary characters are equal to each other, and greater than almost all BMP characters.
An example with Deseret characters and
CREATE TABLE t (s1 VARCHAR(5) CHARACTER SET utf32 COLLATE utf32_unicode_ci); INSERT INTO t VALUES (0xfffd); /* REPLACEMENT CHARACTER */ INSERT INTO t VALUES (0x010412); /* DESERET CAPITAL LETTER BEE */ INSERT INTO t VALUES (0x010413); /* DESERET CAPITAL LETTER TEE */ SELECT COUNT(DISTINCT s1) FROM t;
The result is 2 because in the MySQL
collations, the replacement character has a weight of
0x0dc6, whereas Deseret Bee and Deseret Tee both have a weight of
0xfffd. (Were the
utf32_general_cicollation used instead, the result is 1 because all three characters have a weight of
0xfffdin that collation.)
An example with cuneiform characters and
/* The four characters in the INSERT string are 00000041 # LATIN CAPITAL LETTER A 0001218F # CUNEIFORM SIGN KAB 000121A7 # CUNEIFORM SIGN KISH 00000042 # LATIN CAPITAL LETTER B */ CREATE TABLE t (s1 CHAR(4) CHARACTER SET utf32 COLLATE utf32_unicode_ci); INSERT INTO t VALUES (0x000000410001218f000121a700000042); SELECT HEX(WEIGHT_STRING(s1)) FROM t;
The result is:
0E33 FFFD FFFD 0E4A
0E4Aare primary weights as in UCA 4.0.0.
FFFDis the weight for KAB and also for KISH.
The rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it will be very rare that a multi-character string consists entirely of supplementary characters. In Japan, since the supplementary characters are obscure Kanji ideographs, the typical user does not care what order they are in, anyway. If you really want rows sorted by MySQL's rule and secondarily by code point value, it is easy:
ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin
For supplementary characters based on UCA versions higher than 4.0.0 (for example,
), supplementary characters do not necessarily all have the same collation weight. Some have explicit weights from the UCA
allkeys.txtfile. Others have weights calculated from this algorithm:
aaaa= base + (code >> 15); bbbb= (code & 0x7FFF) | 0x8000;
There is a difference between “ordering by the
character's code value” and “ordering by the
character's binary representation,” a difference that
appears only with
utf16_bin, because of
utf16_bin (the binary
utf16) was a binary
comparison “byte by byte” rather than
“character by character.” If that were so, the
order of characters in
differ from the order in
example, the following chart shows two rare characters. The
first character is in the range
FFFF, so it is
greater than a surrogate but less than a supplementary. The
second character is a supplementary.
Code point Character utf8 utf16 ---------- --------- ---- ----- 0FF9D HALFWIDTH KATAKANA LETTER N EF BE 9D FF 9D 10384 UGARITIC LETTER DELTA F0 90 8E 84 D8 00 DF 84
The two characters in the chart are in order by code point
0x10384. And they are in order by
utf8 value because
0xf0. But they are not in order by
utf16 value, if we use byte-by-byte
utf16_bin collation is not
“byte by byte.” It is “by code
point.” When MySQL sees a supplementary-character
utf16, it converts to the
character's code-point value, and then compares. Therefore,
are the same ordering. This is consistent with the SQL:2008
standard requirement for a UCS_BASIC collation:
“UCS_BASIC is a collation in which the ordering is
determined entirely by the Unicode scalar values of the
characters in the strings being sorted. It is applicable to
the UCS character repertoire. Since every character repertoire
is a subset of the UCS repertoire, the UCS_BASIC collation is
potentially applicable to every character set. NOTE 11: The
Unicode scalar value of a character is its code point treated
as an unsigned integer.”
If the character set is
ucs2, comparison is
ucs2 strings should not
contain surrogates, anyway.
For additional information about Unicode collations in MySQL, see Collation-Charts.Org (utf8).