MySQL has two Unicode character sets:
ucs2, the UCS-2 encoding of the Unicode character set using 16 bits per character
utf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character
You can store text in about 650 languages using these character sets. This section lists the collations available for each Unicode character set and describes their differentiating properties. For general information about the character sets, see Section 10.1.10, “Unicode Support”.
A similar set of collations is available for each Unicode
character set. In the collation names used in this section,
xxx represents the character set
name. For example, xxx_danish_ci
represents the Danish collations, the specific names of which
are ucs2_danish_ci and utf8_danish_ci.
The xxx_esperanto_ci collations were added in MySQL 5.0.13. The
xxx_hungarian_ci collations were added in MySQL 5.0.19.
MySQL implements the xxx_unicode_ci
collations according to the Unicode Collation Algorithm (UCA).
The collation uses the version-4.0.0 UCA weight keys.
Currently, the xxx_unicode_ci
collations have only partial support for the Unicode Collation
Algorithm. Some characters are not supported yet. Also,
combining marks are not fully supported. This affects
primarily Vietnamese, Yoruba, and some smaller languages such
as Navajo. A combined character is considered different
from the same character written as a single Unicode
character in string comparisons, and the two are
considered to have different lengths (for example, as
returned by the CHAR_LENGTH()
function or in result set metadata).
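The length difference can be seen with a short sketch. It assumes the function meant above is CHAR_LENGTH(), and it uses hexadecimal literals (an assumption made here so that the statements do not depend on the client encoding): 0xC3A9 is the single character “é”, and 0x65CC81 is “e” followed by the combining acute accent U+0301.

```sql
-- Single precomposed character versus base letter plus combining mark.
-- The first length is 1 and the second is 2.
SELECT CHAR_LENGTH(_utf8 0xC3A9)   AS single_character,
       CHAR_LENGTH(_utf8 0x65CC81) AS combined_character;

-- Compare the two forms under a UCA-based collation. The result depends
-- on how far the server supports combining marks, so none is stated here.
SELECT _utf8 0xC3A9 = _utf8 0x65CC81 COLLATE utf8_unicode_ci AS same_string;
```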
MySQL implements language-specific Unicode collations only if
the ordering with xxx_unicode_ci
does not work well for a language. Language-specific
collations are UCA-based. They are derived from
xxx_unicode_ci with additional language tailoring rules.
For any Unicode character set, operations performed using the
xxx_general_ci collation are faster than those for the
xxx_unicode_ci collation. For example, comparisons for the
utf8_general_ci collation are faster, but
slightly less correct, than comparisons for
utf8_unicode_ci. The reason for this is that
utf8_unicode_ci supports mappings such
as expansions; that is, when one character compares as equal
to combinations of other characters. For example, in German
and some other languages “ß”
is equal to “ss”.
utf8_unicode_ci also supports contractions
and ignorable characters. utf8_general_ci
is a legacy collation that does not support expansions,
contractions, or ignorable characters. It can make only
one-to-one comparisons between characters.
To further illustrate, the following equalities hold in both
utf8_general_ci and
utf8_unicode_ci (for the effect this has in
comparisons or when doing searches, see
Section 10.1.7.8, “Examples of the Effect of Collation”):
Ä = A
Ö = O
Ü = U
A difference between the collations is that this is true for
utf8_general_ci:
ß = s
Whereas this is true for utf8_unicode_ci,
which supports the German DIN-1 ordering (also known as
dictionary order):
ß = ss
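As a rough sketch, these equalities can be checked with comparisons that force a collation through the COLLATE clause. The hexadecimal literals are an assumption made here so that the statements do not depend on the client encoding: 0xC384 is “Ä” and 0xC39F is “ß” in UTF-8. A result of 1 means the comparison is true.

```sql
-- Ä = A: expected to be 1 under both collations.
SELECT _utf8 0xC384 = _utf8 'A' COLLATE utf8_general_ci AS general_ci,
       _utf8 0xC384 = _utf8 'A' COLLATE utf8_unicode_ci AS unicode_ci;

-- ß = s: a one-to-one comparison, expected to be 1 only for utf8_general_ci.
SELECT _utf8 0xC39F = _utf8 's' COLLATE utf8_general_ci AS general_ci,
       _utf8 0xC39F = _utf8 's' COLLATE utf8_unicode_ci AS unicode_ci;

-- ß = ss: an expansion, expected to be 1 only for utf8_unicode_ci.
SELECT _utf8 0xC39F = _utf8 'ss' COLLATE utf8_general_ci AS general_ci,
       _utf8 0xC39F = _utf8 'ss' COLLATE utf8_unicode_ci AS unicode_ci;
```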
MySQL implements language-specific collations for the
utf8 character set only if the ordering with
utf8_unicode_ci does not work well for
a language. For example, utf8_unicode_ci
works fine for German dictionary order and French, so there is
no need to create special utf8 collations.
utf8_general_ci also is satisfactory for
both German and French, except that
“ß” is equal to
“s”, and not to
“ss”. If this is acceptable
for your application, you should use
utf8_general_ci because it is faster.
Otherwise, use utf8_unicode_ci because it
is more accurate.
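One way to apply this choice, shown as a minimal sketch: declare the faster collation on the column and ask for the more accurate one only in the statements that need it. The table name person and the column name last_name are hypothetical and used only for illustration.

```sql
-- Hypothetical table: the column defaults to the faster collation.
CREATE TABLE person (
  last_name VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci
);

-- Override the collation for one query where “ß” should match “ss”.
SELECT last_name
FROM person
WHERE last_name = _utf8 'strasse' COLLATE utf8_unicode_ci
ORDER BY last_name COLLATE utf8_unicode_ci;
```

An explicit COLLATE clause in a query generally prevents use of an index built with the column's own collation, so if most queries need utf8_unicode_ci it is usually simpler to declare the column with that collation instead.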
xxx_swedish_ci includes Swedish rules. For example, in Swedish, the following
relationship holds, which is not something expected by a
German or French speaker:
Ü = Y < Ö
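The relationship can be probed with a short sketch, assuming the collation in question is utf8_swedish_ci. The hexadecimal literals are an assumption made here to avoid depending on the client encoding: 0xC39C is “Ü” and 0xC396 is “Ö”. If the collation behaves as described, both columns should be 1.

```sql
-- Ü = Y and Y < Ö under the Swedish rules.
SELECT _utf8 0xC39C = _utf8 'Y' COLLATE utf8_swedish_ci AS u_umlaut_equals_y,
       _utf8 'Y' < _utf8 0xC396 COLLATE utf8_swedish_ci AS y_before_o_umlaut;
```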
The xxx_spanish_ci and xxx_spanish2_ci
collations correspond to modern Spanish and traditional
Spanish, respectively. In both collations,
“ñ” (n-tilde) is a separate
letter between “n” and
“o”. In addition, for
traditional Spanish, “ch” is a
separate letter between “c” and
“d”, and “ll” is a separate letter
between “l” and “m”.
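The effect of treating “ch” as a separate letter can be sketched by sorting the same words under both collations, assuming the two collations in question are utf8_spanish_ci and utf8_spanish2_ci. The word list is made up for illustration. If the collations behave as described, “chico” sorts before “cuna” under the modern rules and after it under the traditional rules.

```sql
-- Modern Spanish: "ch" is ordered as "c" followed by "h".
SELECT w
FROM (SELECT _utf8 'cuna' AS w
      UNION ALL SELECT _utf8 'chico'
      UNION ALL SELECT _utf8 'dama') AS words
ORDER BY w COLLATE utf8_spanish_ci;

-- Traditional Spanish: "ch" is a letter of its own between "c" and "d".
SELECT w
FROM (SELECT _utf8 'cuna' AS w
      UNION ALL SELECT _utf8 'chico'
      UNION ALL SELECT _utf8 'dama') AS words
ORDER BY w COLLATE utf8_spanish2_ci;
```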
The xxx_spanish2_ci collations may also be used for Asturian and Galician.
The xxx_danish_ci collations may also be used for Norwegian.
In the xxx_roman_ci collations, I and J compare as equal, and
U and V compare as equal.
For additional information about Unicode collations in MySQL, see Collation-Charts.Org (utf8).