MySQL 6.0 has several Unicode character sets:
ucs2, the UCS-2 encoding of the Unicode
character set using 16 bits per character
utf16, the UTF-16 encoding for the
Unicode character set; like ucs2 but
with an extension for supplementary characters
utf32, the UTF-32 encoding for the
Unicode character set using 32 bits per character
utf8, a UTF-8 encoding of the Unicode
character set using one to four bytes per character
utf8mb3, a UTF-8 encoding of the
Unicode character set using one to three bytes per
character (known as utf8 before MySQL
6.0)
You can store text in about 650 languages using these character sets. This section lists the collations available for each Unicode character set. For general information about the character sets, see Section 9.1.8, “Unicode Support”.
A similar set of collations is available for each Unicode
character set. These are shown in the following list, where
xxx represents the character set
name. For example,
represents the Danish collations, the specific names of which
are xxx_danish_ciucs2_danish_ci,
utf16_danish_ci,
ucs32_danish_ci,
utf8_danish_ci, and
utf8mb3_danish_ci.
xxx_bin
xxx_czech_ci
xxx_danish_ci
xxx_esperanto_ci
xxx_estonian_ci
(default)
xxx_general_ci
xxx_hungarian_ci
xxx_icelandic_ci
xxx_latvian_ci
xxx_lithuanian_ci
xxx_persian_ci
xxx_polish_ci
xxx_roman_ci
xxx_romanian_ci
xxx_sinhala_ci
xxx_slovak_ci
xxx_slovenian_ci
xxx_spanish2_ci
xxx_spanish_ci
xxx_swedish_ci
xxx_turkish_ci
xxx_unicode_ci
The collations for utf8mb3,
utf16, and utf32 were
added in MySQL 6.0.4. In 6.0.4, the old (three-byte)
utf8 character set was renamed to
utf8mb3 and the new (four-byte)
utf8 character set was added. The collation
IDs for the utf8mb3 collations in MySQL 6.0
are the same as the IDs for the utf8
collations in MySQL 5.1. In other words, the IDs are still
associated with data for a particular encoding; it is the
encoding name that has changed.
MySQL implements the
collations according to the Unicode Collation Algorithm (UCA)
described at
http://www.unicode.org/reports/tr10/. The
collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
Currently, the
xxx_unicode_ci
collations have only partial support for the Unicode Collation
Algorithm. Some characters are not supported yet. Also,
combining marks are not fully supported. This affects
primarily Vietnamese, Yoruba, and some smaller languages such
as Navajo. The following discussion uses
xxx_unicode_ciutf8_unicode_ci for concreteness.
For any Unicode character set, operations performed using the
collation are faster than those for the
xxx_general_ci
collation. For example, comparisons for the
xxx_unicode_ciutf8_general_ci collation are faster, but
slightly less correct, than comparisons for
utf8_unicode_ci. The reason for this is
that utf8_unicode_ci supports mappings such
as expansions; that is, when one character compares as equal
to combinations of other characters. For example, in German
and some other languages “ß”
is equal to “ss”.
utf8_unicode_ci also supports contractions
and ignorable characters. utf8_general_ci
is a legacy collation that does not support expansions,
contractions, or ignorable characters. It can make only
one-to-one comparisons between characters.
To further illustrate, the following equalities hold in both
utf8_general_ci and
utf8_unicode_ci (for the effect this has in
comparisons or when doing searches, see
Section 9.1.5.6, “Examples of the Effect of Collation”):
Ä = A Ö = O Ü = U
A difference between the collations is that this is true for
utf8_general_ci:
ß = s
Whereas this is true for utf8_unicode_ci:
ß = ss
MySQL implements language-specific collations for the
utf8 character set only if the ordering
with utf8_unicode_ci does not work well for
a language. For example, utf8_unicode_ci
works fine for German and French, so there is no need to
create special utf8 collations for these
two languages.
utf8_general_ci also is satisfactory for
both German and French, except that
“ß” is equal to
“s”, and not to
“ss”. If this is acceptable
for your application, then you should use
utf8_general_ci because it is faster.
Otherwise, use utf8_unicode_ci because it
is more accurate.
utf8_swedish_ci, like other
utf8 language-specific collations, is
derived from utf8_unicode_ci with
additional language rules. For example, in Swedish, the
following relationship holds, which is not something expected
by a German or French speaker:
Ü = Y < Ö
The
and
xxx_spanish_ci
collations correspond to modern Spanish and traditional
Spanish, respectively. In both collations,
“xxx_spanish2_ciñ” (n-tilde) is a separate
letter between “n” and
“o”. In addition, for
traditional Spanish, “ch” is a
separate letter between “c”
and “d”, and
“ll” is a separate letter
between “l” and
“m”
In the
collations, xxx_roman_ciI and J
compare as equal, and U and
V compare as equal.
For all Unicode collations except the “binary”
(_bin) collations, MySQL performs a table
lookup to find a character's collating weight. This weight can
be displayed using the
WEIGHT_STRING() function. (See
Section 11.4, “String Functions”.) But if a character is not
in the table (for example, because it's a “new”
character), collating weight determination becomes more
complex:
For BMP characters in general collations (for example,
utf8_general_ci), weight = code point.
For BMP characters in UCA collations (for example,
utf8_unicode_ci), the following
algorithm applies:
if (code >= 0x3400 && code <= 0x4DB5) base= 0xFB80; /* CJK Ideograph Extension */ else if (code >= 0x4E00 && code <= 0x9FA5) base= 0xFB40; /* CJK Ideograph */ else base= 0xFBC0; /* All other characters */ aaaa= base + (code >> 15); bbbb= (code & 0x7FFF) | 0x8000;
The result is a sequence of two collating elements,
aaaa followed by
bbbb. For example:
mysql> SELECT HEX(WEIGHT_STRING(_ucs2 0x04CF COLLATE ucs2_unicode_ci));
+----------------------------------------------------------+
| HEX(WEIGHT_STRING(_ucs2 0x04CF COLLATE ucs2_unicode_ci)) |
+----------------------------------------------------------+
| FBC084CF |
+----------------------------------------------------------+
Thus, U+04cf CYRILLIC SMALL LETTER
PALOCHKA currently is, with all UCA collations,
greater than U+04c0 CYRILLIC LETTER
PALOCHKA. Eventually, after further collation
tuning, all palochkas will sort together.
For supplementary characters in general collations, the
weight is the weight for 0xfffd REPLACEMENT
CHARACTER. For supplementary characters in UCA
collations, their collating weight is
0xfffd. That is, to MySQL, all
supplementary characters are equal to each other, and
greater than almost all BMP characters.
An example with Deseret characters and
COUNT(DISTINCT):
CREATE TABLE t (s1 VARCHAR(5) CHARACTER SET utf32 COLLATE utf32_unicode_ci); INSERT INTO t VALUES (0xfffd); /* REPLACEMENT CHARACTER */ INSERT INTO t VALUES (0x010412); /* DESERET CAPITAL LETTER BEE */ INSERT INTO t VALUES (0x010413); /* DESERET CAPITAL LETTER TEE */ SELECT COUNT(DISTINCT s1) FROM t;
The result is 1, because Deseret Bee = Deseret Tee = Replacement Character, in the MySQL Unicode collation.
An example with cuneiform characters and
WEIGHT_STRING():
/* The four characters in the INSERT string are 00000041 # LATIN CAPITAL LETTER A 0001218F # CUNEIFORM SIGN KAB 000121A7 # CUNEIFORM SIGN KISH 00000042 # LATIN CAPITAL LETTER B */ CREATE TABLE t (s1 CHAR(4) CHARACTER SET utf32 COLLATE utf32_unicode_ci); INSERT INTO t VALUES (0x000000410001218f000121a700000042); SELECT HEX(WEIGHT_STRING(s1)) FROM t;
The result is:
0E33 FFFD FFFD 0E4A
0E33 and 0E4A are
primary weights as in
UCA
4.0.0. FFFD is the weight for
KAB and also for KISH.
The current rule that all supplementary characters are equal to each other is non-optimal. However, we don't expect the rule to cause trouble. These characters are very rare, so it will be very rare that a multi-character string consists entirely of supplementary characters. In Japan, since the supplementary characters are obscure Kanji ideographs, the typical user doesn't care what order they're in, anyway. If you really want to get rows sorted by MySQL's rule and secondarily by code point value, it's easy:
ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin
The utf16_bin Collation
There is a difference between “ordering by the
character's code value” and “ordering by the
character's binary representation,” a difference that
appears only with utf16_bin, because of
surrogates.
Suppose that utf16_bin (the binary
collation for utf16) was a binary
comparison “byte by byte” rather than
“character by character.” If that were so, then
the order of characters in utf16_bin would
differ from the order in utf8_bin. For
example, here is a chart showing two rare characters. The
first character is in the range
E000-FFFF, so it is
greater than a surrogate but less than a supplementary. The
second character is a supplementary.
Code point Character utf8 utf16 ---------- --------- ---- ----- 0FF9D HALFWIDTH KATAKANA LETTER N EF BE 9D FF 9D 10384 UGARITIC LETTER DELTA F0 90 8E 84 D8 00 DF 84
The two characters in the chart are in order by code point
value, because 0xff9d <
0x10384. And they are in order by
utf8 value, because 0xef <
0xf0. But they are not in order by
utf16 value, if we use byte-by-byte
comparison, because 0xff > 0xd8.
So MySQL's utf16_bin collation is not
“byte by byte.” It is “by code
point.” When MySQL sees a supplementary-character
encoding in utf16, it converts to the character's code-point
value, and then compares. Therefore
utf8_bin and utf16_bin
are the same ordering. This is consistent with the SQL:2008
standard requirement for a UCS_BASIC collation:
“UCS_BASIC is a collation in which the ordering is
determined entirely by the Unicode scalar values of the
characters in the strings being sorted. It is applicable to
the UCS character repertoire. Since every character repertoire
is a subset of the UCS repertoire, the UCS_BASIC collation is
potentially applicable to every character set. NOTE 11 —
The Unicode scalar value of a character is its code point
treated as an unsigned integer.”
If the character set is ucs2, then
comparison is byte-by-byte, but ucs2
strings shouldn't contain surrogates, anyway.
For additional information about Unicode collations in MySQL, see Collation-Charts.Org (utf8).

User Comments
Note for Hebrew speakers: in utf8_general_ci, dots (Niqqud symbols) are treated as seperate characters, so
a. If you have dotted words in your table, they won't be ordered correctly.
b. You can have words with the same letters and different dots in a unique index column.
On the other hand, in utf8_unicode_ci, dots are igonred, so:
a. The order will be correct;
b. Words with the same letters and different dots will be regarded as equal, and you won't be able to have them in a unique index column.
I still didn't find a collation that treats both issues correctly.
Add your own comment.