MySQL 5.0 has two Unicode character sets:
ucs2, the UCS-2 encoding of the Unicode
character set using 16 bits per character
utf8, a UTF-8 encoding of the Unicode
character set using one to three bytes per character
You can store text in about 650 languages using these character sets. This section lists the collations available for each Unicode character set and describes their differentiating properties. For general information about the character sets, see Section 9.1.10, “Unicode Support”.
A similar set of collations is available for each Unicode
character set. These are shown in the following list, where
xxx represents the character set
name. For example,
represents the Danish collations, the specific names of which
are xxx_danish_ciucs2_danish_ci and
utf8_danish_ci.
xxx_bin
xxx_czech_ci
xxx_danish_ci
xxx_esperanto_ci
xxx_estonian_ci
(default)
xxx_general_ci
xxx_hungarian_ci
xxx_icelandic_ci
xxx_latvian_ci
xxx_lithuanian_ci
xxx_persian_ci
xxx_polish_ci
xxx_roman_ci
xxx_romanian_ci
xxx_slovak_ci
xxx_slovenian_ci
xxx_spanish_ci
xxx_spanish2_ci
xxx_swedish_ci
xxx_turkish_ci
xxx_unicode_ci
The
collations were added in MySQL 5.0.13. The
xxx_esperanto_ci
collations were added in MySQL 5.0.19.
xxx_hungarian_ci
MySQL implements the
collations according to the Unicode Collation Algorithm (UCA)
described at
http://www.unicode.org/reports/tr10/. The
collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
Currently, the
xxx_unicode_ci
collations have only partial support for the Unicode Collation
Algorithm. Some characters are not supported yet. Also,
combining marks are not fully supported. This affects
primarily Vietnamese, Yoruba, and some smaller languages such
as Navajo.
xxx_unicode_ci
MySQL implements language-specific Unicode collations only if
the ordering with
does not work well for a language. Language-specific
collations are UCA-based. They are derived from
xxx_unicode_ci
with additional language tailoring rules.
xxx_unicode_ci
For any Unicode character set, operations performed using the
collation are faster than those for the
xxx_general_ci
collation. For example, comparisons for the
xxx_unicode_ciutf8_general_ci collation are faster, but
slightly less correct, than comparisons for
utf8_unicode_ci. The reason for this is
that utf8_unicode_ci supports mappings such
as expansions; that is, when one character compares as equal
to combinations of other characters. For example, in German
and some other languages “ß”
is equal to “ss”.
utf8_unicode_ci also supports contractions
and ignorable characters. utf8_general_ci
is a legacy collation that does not support expansions,
contractions, or ignorable characters. It can make only
one-to-one comparisons between characters.
To further illustrate, the following equalities hold in both
utf8_general_ci and
utf8_unicode_ci (for the effect this has in
comparisons or when doing searches, see
Section 9.1.7.8, “Examples of the Effect of Collation”):
Ä = A Ö = O Ü = U
A difference between the collations is that this is true for
utf8_general_ci:
ß = s
Whereas this is true for utf8_unicode_ci,
which supports the German DIN-1 ordering (also known as
dictionary order):
ß = ss
MySQL implements language-specific collations for the
utf8 character set only if the ordering
with utf8_unicode_ci does not work well for
a language. For example, utf8_unicode_ci
works fine for German dictionary order and French, so there is
no need to create special utf8 collations.
utf8_general_ci also is satisfactory for
both German and French, except that
“ß” is equal to
“s”, and not to
“ss”. If this is acceptable
for your application, you should use
utf8_general_ci because it is faster.
Otherwise, use utf8_unicode_ci because it
is more accurate.
includes Swedish rules. For example, in Swedish, the following
relationship holds, which is not something expected by a
German or French speaker:
xxx_swedish_ci
Ü = Y < Ö
The
and
xxx_spanish_ci
collations correspond to modern Spanish and traditional
Spanish, respectively. In both collations,
“xxx_spanish2_ciñ” (n-tilde) is a separate
letter between “n” and
“o”. In addition, for
traditional Spanish, “ch” is a
separate letter between “c”
and “d”, and
“ll” is a separate letter
between “l” and
“m”
In the
collations, xxx_roman_ciI and J
compare as equal, and U and
V compare as equal.
For additional information about Unicode collations in MySQL, see Collation-Charts.Org (utf8).

User Comments
Note for Hebrew speakers: in utf8_general_ci, dots (Niqqud symbols) are treated as seperate characters, so
a. If you have dotted words in your table, they won't be ordered correctly.
b. You can have words with the same letters and different dots in a unique index column.
On the other hand, in utf8_unicode_ci, dots are igonred, so:
a. The order will be correct;
b. Words with the same letters and different dots will be regarded as equal, and you won't be able to have them in a unique index column.
I still didn't find a collation that treats both issues correctly.
Add your own comment.