The initial implementation of Unicode support (in MySQL 4.1) included two character sets for storing Unicode data:
ucs2, the UCS-2 encoding of the Unicode
character set using 16 bits per character
utf8, a UTF-8 encoding of the Unicode
character set using one to three bytes per character
These two character sets support the characters from the Basic Multilingual Plane (BMP) of Unicode Version 3.0. BMP characters have these characteristics:
Their code values are between 0 and 65535 (or
U+0000 .. U+FFFF)
They can be encoded with a fixed 16-bit word, as in
ucs2
They can be encoded with 8, 16, or 24 bits, as in
utf8
They are sufficient for almost all characters in major languages
Characters not supported by the aforementioned character sets include supplementary characters that lie outside the BMP. As of MySQL 6.0 (versions 6.0.4 and up), Unicode support is extended to include supplementary characters, which requires new character sets that have a broader range and therefore take more space. The following table shows a brief feature comparison of previous and current Unicode support.
| Before MySQL 6.0 | MySQL 6.0 |
| All Unicode 3.0 characters | All Unicode 5.0 characters |
| No supplementary characters | With supplementary characters |
ucs2 character set |
No change |
utf8 character set for up to three bytes |
Renamed to utf8mb3
|
New utf8 character set for up to four bytes |
|
New utf16 character set |
|
New utf32 character set |
Most of these changes are upward compatible. For example, the
old utf8 character set (the three-byte
version) has been renamed to utf8mb3, so
tables created before MySQL 6.0 that used
utf8 are treated as using
utf8mb3 after an in-place upgrade to MySQL
6.0. However, the table contents will not have changed in any
way.
The new utf8 character set, as well as
utf16 and utf32, support
supplementary characters. The Unicode character sets in MySQL
6.0 are:
ucs2, the UCS-2 encoding of the Unicode
character set using 16 bits per character
utf16, the UTF-16 encoding for the
Unicode character set; like ucs2 but with
an extension for supplementary characters
utf32, the UTF-32 encoding for the
Unicode character set using 32 bits per character
utf8, a UTF-8 encoding of the Unicode
character set using one to four bytes per character
utf8mb3, a UTF-8 encoding of the Unicode
character set using one to three bytes per character (known
as utf8 before MySQL 6.0)
A similar set of collations is available for each Unicode
character set. For example, each has a Danish collation, the
names of which are ucs2_danish_ci,
utf16_danish_ci,
ucs32_danish_ci,
utf8_danish_ci, and
utf8mb3_danish_ci. All Unicode collations are
listed at Section 9.1.12.1, “Unicode Character Sets”, which also
describes collation properties for supplementary characters.
Note that although many of the supplementary characters come from East Asian languages, what MySQL 6.0 adds is support for more Japanese and Chinese characters in Unicode character sets, not support for new Japanese and Chinese character sets.
The following discussion provides additional detail on the
Unicode character sets in MySQL. For details on potential
incompatibility issues for your applications, see
Section 9.1.9, “Upgrading from Previous to Current Unicode Support”. That section also
describes how to convert tables from utf8mb3
to the new (four-byte) utf8 character set,
and what constraints may apply in doing so.
The MySQL implementation of UCS-2, UTF-16, and UTF-32 stores characters in big-endian byte order and does not use a byte order mark (BOM) at the beginning of values. Other database systems might use little-endian byte order or a BOM, in which case, conversion of values will need to be performed when transferring data between those systems and MySQL.
MySQL uses no BOM for UTF-8 values.
Client applications that need to communicate with the server
using Unicode should set the client character set accordingly;
for example, by issuing a SET NAMES 'utf8'
statement. ucs2, utf16,
and utf32 cannot be used as a client
character set, which means that they do not work for
SET NAMES or SET CHARACTER
SET. (See Section 9.1.4, “Connection Character Sets and Collations”.)
The ucs2 Character Set
(UCS-2 Unicode Encoding)
In UCS-2, every character is represented by a two-byte Unicode
code with the most significant byte first. For example:
LATIN CAPITAL LETTER A has the code
0x0041 and it is stored as a two-byte
sequence: 0x00 0x41. CYRILLIC SMALL
LETTER YERU (Unicode 0x044B) is
stored as a two-byte sequence: 0x04 0x4B. For
Unicode characters and their codes, please refer to the
Unicode Home Page.
In MySQL, the ucs2 character set is a fixed
16-bit encoding for Unicode BMP characters.
The utf16 Character Set
(UTF-16 Unicode Encoding)
The utf16 character set is the
ucs2 character set with an extension that
enables encoding of supplementary characters:
For a BMP character, utf16 and
ucs2 have identical storage
characteristics: The same code values, same encoding, and
same length.
For a supplementary character, utf16 has
a special sequence for representing the character using 32
bits. This is called the “surrogate” mechanism:
For a number greater than 0xffff, take 10
bits and add them to 0xd800 and put them
in the first 16-bit word, take 10 more bits and add them to
0xdc00 and put them in the next 16-bit
word. Consequently, all supplementary characters require 32
bits, where the first 16 bits are a number between
0xd800 and 0xdbff, and
the last 16 bits are a number between
0xdc00 and 0xdfff.
Examples are in
15.5
Surrogates Area in the Unicode 4.0 document.
Because utf16 supports surrogates and
ucs2 does not, there is a validity check that
applies only in utf16: You cannot insert a
top surrogate without a bottom surrogate, or vice versa. For
example:
INSERT INTO t (ucs2_column) VALUES (0xd800); /* legal */ INSERT INTO t (utf16_column)VALUES (0xd800); /* illegal */
There is no validity check for characters that are technically
valid but are not true Unicode (that is, characters that Unicode
considers to be “unassigned code points” or
“private use” characters or even
“illegals” like 0xffff). For
example, since U+F8FF is the Apple Logo, this
is legal:
INSERT INTO t (utf16_column)VALUES (0xf8ff); /* legal */
Such characters cannot be expected to mean the same thing to everyone.
Because MySQL must allow for the worst case (that one character
requires four bytes) the maximum length of a
utf16 column or index is only half of the
maximum length for a ucs2 column or index.
For example, at the time of writing the maximum length of a
Falcon index key is 1100 bytes (assuming the default Falcon page
size), so these statements create tables with the longest
allowed indexes for ucs2 and
utf16 columns:
CREATE TABLE tf (s1 VARCHAR(550) CHARACTER SET ucs2) ENGINE=FALCON; CREATE INDEX i ON tf (s1); CREATE TABLE tg (s1 VARCHAR(275) CHARACTER SET utf16) ENGINE=FALCON; CREATE INDEX i ON tg (s1);
The utf32 Character Set
(UTF-32 Unicode Encoding)
The utf32 character set is fixed length (like
ucs2 and unlike utf16).
utf32 uses 32 bits for every character,
unlike ucs2 (which uses 16 bits for every
character), and unlike utf16 (which uses 16
bits for some characters and 32 bits for others).
utf32 takes twice as much space as
ucs2 and more space than
utf16, but utf32 has the
same advantage as ucs2 that it is predictable
for storage: The required number of bytes for
utf32 equals the number of characters times
4. Also, unlike utf16, there are no tricks
for encoding in utf32, so the stored value
equals the code value.
To demonstrate how the latter advantage is useful, here is an
example that shows how to determine a utf8
value given the utf32 code value:
/* Assume code value = 100cc LINEAR B WHEELED CHARIOT */
CREATE TABLE tmp (utf32 CHAR(1) CHARACTER SET utf32,
utf8 CHAR(1) CHARACTER SET utf8);
INSERT INTO tmp VALUES (0x000100cc,NULL);
UPDATE tmp SET utf8 = utf32;
SELECT HEX(utf32),HEX(utf8) FROM tmp;
MySQL is very forgiving about additions of unassigned Unicode
characters, private-use-area characters, and other code values
not present in the official
Unicode
5.0 Character Database. There is in fact only one
validity check for utf32: No code value may
be greater than 0x10ffff. For example, this
is illegal:
INSERT INTO t (utf32_column) VALUES (0x110000); /* illegal */
The utf8 Character Set
(Four-Byte UTF-8 Unicode Encoding)
UTF-8 (Unicode Transformation Format with 8-bit units) is an alternative way to store Unicode data. It is implemented according to RFC 3629. RFC 3629 describes encoding sequences that take from one to four bytes. (An older standard for UTF-8 encoding is given by RFC 2279, which describes UTF-8 sequences that take from one to six bytes. RFC 3629 renders RFC 2279 obsolete; for this reason, sequences with five and six bytes are no longer used.)
The idea of UTF-8 is that various Unicode characters are encoded using byte sequences of different lengths:
Basic Latin letters, digits, and punctuation signs use one byte.
Most European and Middle East script letters fit into a two-byte sequence: extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.
Korean, Chinese, and Japanese ideographs use three-byte or four-byte sequences.
Tip: To save space with UTF-8,
use VARCHAR instead of
CHAR. Otherwise, MySQL must reserve four
bytes for each character in a CHAR CHARACTER SET
utf8 column because that is the maximum possible
length. For example, MySQL must reserve 40 bytes for a
CHAR(10) CHARACTER SET utf8 column.
Before MySQL 6.0, the character set named
utf8 contained only BMP characters and used a
maximum of three bytes per character. As of MySQL 6.0, that
character set uses a different name, utf8mb3,
but its characteristics remain the same as before.
In terms of table content (for old tables created before MySQL
6.0), the old utf8 (now
utf8mb3)) and new utf8 are
compatible:
For a BMP character, the 5.1 and 6.0 versions of
utf8 have identical storage
characteristics: same code values, same encoding, same
length.
For a supplementary character, utf8 in
5.1 cannot store the character at all, while
utf8 in 6.0 requires four bytes to store.
Since utf8 in 5.1 cannot store the
character at all, you do not have any supplementary
characters in utf8 columns in your 5.1
database, so you need not worry about converting characters
or losing data.
The utf8mb3 Character
Set (Three-Byte UTF-8 Unicode Encoding)
The utf8mb3 character set in MySQL 6.0 is the
same as the character set that was called
utf8 before 6.0. In other words,
utf8mb3 in MySQL 6.0 has exactly the same
characteristics as utf8 in MySQL 5.1:
No support for supplementary characters (BMP characters only)
Maximum of three bytes per multi-byte character
For the few cases where it's desirable to have complete
compatibility in MySQL 6.0 with the old utf8
character set, you can define with CHARACTER SET
utf8mb3.
Exactly the same set of characters is available in
utf8mb3 as in ucs2. That
is, they have the same repertoire.

User Comments
Connect with the same characterset as your data to display correctly. This example connects to the MySQL-server using UTF-8:
mysql --default-character-set=utf8 -uyour_username -p -h your_databasehost.your_domain.com your_database
If you get into trouble from a PHP-based web application, check the characterset configurations of these components:
1) the MySQL database
2) php.ini
3) httpd.conf
4) your server
if you get data via php from your mysql-db (everything utf-8)
but still get '?' for some special characters in your browser
(<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />),
try this:
after mysql_connect() , and mysql_select_db() add this lines:
mysql_query("SET NAMES utf8");
worked for me.
i tried first with the utf8_encode, but this only worked for äüöéè...
and so on, but not for kyrillic and other chars.
Add your own comment.