As of MySQL version 4.1, there are two new character sets for storing Unicode data:
ucs2, the UCS-2 encoding of the Unicode
character set using 16 bits per character
utf8, a UTF-8 encoding of the Unicode
character set using one to three bytes per character
These two character sets support the characters from the Basic Multilingual Plane (BMP) of Unicode Version 3.0. BMP characters have these characteristics:
Their code values are between 0 and 65535 (or U+0000 .. U+FFFF)
They can be encoded with a fixed 16-bit word, as in ucs2
They can be encoded with 8, 16, or 24 bits, as in utf8
They are sufficient for almost all characters in major languages
The ucs2 and utf8 character sets do not support supplementary characters that lie
outside the BMP.
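The size rules above can be checked outside MySQL. The following is a minimal sketch in Python, using only the standard codecs; the helper names is_bmp and encoded_sizes are hypothetical and not part of MySQL. It relies on the fact that, for BMP characters, UTF-16 big-endian produces the same bytes as UCS-2.

```python
def is_bmp(ch: str) -> bool:
    """Return True if the single character lies in the Basic Multilingual Plane."""
    return 0 <= ord(ch) <= 0xFFFF


def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 big-endian byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


# Three BMP characters and one supplementary character (U+1D11E).
for ch in ["A", "\u00e9", "\u20ac", "\U0001D11E"]:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X} bmp={is_bmp(ch)} utf8={utf8_len} utf16be={utf16_len}")
```

The BMP characters take one to three bytes in UTF-8 and exactly two bytes as a 16-bit word. The supplementary character needs four bytes in either encoding, which is why it does not fit the ucs2 and utf8 character sets described here.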
A similar set of collations is available for each Unicode
character set. For example, each has a Danish collation, the
names of which are ucs2_danish_ci and
utf8_danish_ci. All Unicode collations are
listed in the section “Unicode Character Sets”.
The MySQL implementation of UCS-2 stores characters in big-endian byte order and does not use a byte order mark (BOM) at the beginning of values. Other database systems might use little-endian byte order or a BOM. In such cases, conversion of values will need to be performed when transferring data between those systems and MySQL.
MySQL uses no BOM for UTF-8 values.
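As an illustration of the conversion step, here is a minimal sketch in Python that normalizes data from another system into the layout described above (big-endian, no BOM). The function names are hypothetical, and treating BOM-less foreign input as little-endian is an assumption made only for this example; a real transfer must use whatever byte order the source system documents.

```python
import codecs


def to_mysql_ucs2(text: str) -> bytes:
    """Encode BMP-only text as big-endian 16-bit units with no BOM."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("supplementary characters lie outside the BMP")
    return text.encode("utf-16-be")


def from_foreign_utf16(data: bytes) -> str:
    """Decode 16-bit data that may start with a BOM.

    Assumption for this sketch: input without a BOM is little-endian.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic utf-16 codec reads the BOM, picks the byte order, and drops it.
        return data.decode("utf-16")
    return data.decode("utf-16-le")


def drop_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM, if present, before sending the value on."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A little-endian value with a BOM, as another system might export it.
foreign = codecs.BOM_UTF16_LE + "A\u00e9".encode("utf-16-le")
converted = to_mysql_ucs2(from_foreign_utf16(foreign))
print(converted.hex())  # 004100e9
```

The converted value has the high byte of each character first and carries no BOM.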
Client applications that need to communicate with the server
using Unicode should set the client character set accordingly;
for example, by issuing a
SET NAMES 'utf8' statement.
ucs2 cannot be used as a client
character set, which means that it does not work for
SET NAMES or
SET CHARACTER SET. (See Section 9.1.4, “Connection Character Sets and Collations”.)
The following sections provide additional detail on the Unicode character sets in MySQL.