The Asian character sets that we support include Chinese,
Japanese, Korean, and Thai. These can be complicated. For
example, the Chinese sets must allow for thousands of
different characters. See Section 220.127.116.11.1, “The
cp932 Character Set”, for
additional information about the
cp932 and sjis character sets.
big5 (Big5 Traditional Chinese)
cp932 (SJIS for Windows Japanese)
eucjpms (UJIS for Windows Japanese)
euckr (EUC-KR Korean)
gb2312 (GB2312 Simplified Chinese)
gbk (GBK Simplified Chinese)
sjis (Shift-JIS Japanese)
tis620 (TIS620 Thai)
ujis (EUC-JP Japanese)
The big5_chinese_ci collation sorts on
number of strokes.
In MySQL, the
sjis character set
corresponds to the Shift_JIS character
set defined by IANA, which supports JIS X0201 and JIS X0208 characters.
However, the meaning of “SHIFT JIS” as a
descriptive term has become very vague and it often includes
the extensions to
Shift_JIS that are
defined by various vendors.
For example, “SHIFT JIS” used in Japanese
Windows environments is a Microsoft extension of
Shift_JIS and its exact name is
Microsoft Windows Codepage : 932 or
cp932. In addition to the characters supported by
Shift_JIS, cp932 supports extension characters such
as NEC special characters, NEC selected—IBM extended
characters, and IBM selected characters.
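To make the extension concrete, here is a minimal sketch that uses Python's built-in shift_jis and cp932 codecs as stand-ins. This is an assumption made for illustration only: it does not exercise MySQL's own sjis and cp932 conversion code. The byte pair 0x8740 lies in the NEC special character area, and the helper name try_decode is hypothetical.

```python
from typing import Optional

# Illustration only: Python codecs stand in for the two encodings.
# 0x8740 is the circled digit one in the NEC special character area.
NEC_SPECIAL = b"\x87\x40"


def try_decode(raw: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


cp932_result = try_decode(NEC_SPECIAL, "cp932")
sjis_result = try_decode(NEC_SPECIAL, "shift_jis")

print("cp932:", cp932_result)
print("shift_jis:", sjis_result)
```

With these codecs, the cp932 decode yields U+2460, while the plain Shift_JIS decode is rejected because the byte pair is not assigned in JIS X 0208.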
Since MySQL 4.1, many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:
MySQL automatically converts character sets.
Character sets are converted using Unicode (ucs2).
The sjis character set does not
support the conversion of these extension characters.
There are several conversion rules from so-called “SHIFT JIS” to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).
The cp932 character set is designed
to solve these problems. It is available as of MySQL 4.1.12.
Before MySQL 4.1, it was safe to use any version of
“SHIFT JIS” in conjunction with the
sjis character set. However, because
MySQL supports character set conversion beginning with 4.1,
it is important to separate IANA Shift_JIS and cp932
into two different character sets because they provide
different conversion rules.
The cp932 character set differs from
sjis in the following ways:
cp932 supports NEC special
characters, NEC selected—IBM extended characters,
and IBM selected characters.
Some cp932 characters have two
different code points, both of which convert to the same
Unicode code point. When converting from Unicode back to
cp932, one of the code points must be
selected. For this “round trip conversion,”
the rule recommended by Microsoft is used.
The conversion rule works like this:
If the character is in both JIS X 0208 and NEC special characters, use the code point of JIS X 0208.
If the character is in both NEC special characters and IBM selected characters, use the code point of NEC special characters.
If the character is in both IBM selected characters and NEC selected—IBM extended characters, use the code point of IBM extended characters.
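The duplicate code points behind this rule can be observed with a short sketch. It relies on Python's built-in cp932 codec, not on MySQL, so the code point chosen on the way back reflects that codec's tables and is only assumed to resemble the behavior described here. The variable names are hypothetical.

```python
# Illustration only: Python's cp932 codec stands in for the conversion.
# The "because" sign exists both in JIS X 0208 and in NEC special characters.
JIS_X_0208_FORM = b"\x81\xe6"
NEC_SPECIAL_FORM = b"\x87\x9a"

# One kanji exists both in NEC selected IBM extended characters (0xED40)
# and in IBM selected characters (0xFA5C).
NEC_SELECTED_FORM = b"\xed\x40"
IBM_SELECTED_FORM = b"\xfa\x5c"

# Both code points of each pair decode to a single Unicode character.
because_a = JIS_X_0208_FORM.decode("cp932")
because_b = NEC_SPECIAL_FORM.decode("cp932")
kanji_a = NEC_SELECTED_FORM.decode("cp932")
kanji_b = IBM_SELECTED_FORM.decode("cp932")

# Converting back must pick exactly one code point per character.
because_round_trip = because_b.encode("cp932")
kanji_round_trip = kanji_a.encode("cp932")

print("U+%04X" % ord(because_a), "->", because_round_trip.hex())
print("U+%04X" % ord(kanji_a), "->", kanji_round_trip.hex())
```

The printed hex values show which of the two candidate code points the codec selects for each character, which can then be compared against the rule above.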
The table shown at
provides information about the Unicode values of
cp932 characters. For
cp932 table entries with characters
under which a four-digit number appears, the number
represents the corresponding Unicode
(ucs2) encoding. For table entries
in which an underlined two-digit value appears, there is a range of
cp932 character values that
begin with those two digits. Clicking such a table entry
takes you to a page that displays the Unicode value for
each of the
cp932 characters that
begin with those digits.
The following links are of special interest. They correspond to the encodings for the following sets of characters:
NEC special characters:
NEC selected—IBM extended characters:
IBM selected characters:
For some characters, conversion to and from
ucs2 is different for
sjis and cp932. The
following tables illustrate these differences.
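A few such differences can be sketched with Python's built-in shift_jis and cp932 codecs. This assumes those codecs as stand-ins for the two mappings; MySQL's own conversion tables are not consulted, and the three sample byte pairs and the helper code_point are chosen here for illustration.

```python
# Illustration only: Python codecs stand in for the sjis and cp932 mappings.
SAMPLES = [b"\x81\x60", b"\x81\x61", b"\x81\x7c"]


def code_point(raw, encoding):
    """Decode one double-byte character and format it as U+XXXX."""
    return "U+%04X" % ord(raw.decode(encoding))


# Map each byte pair (as hex) to its (shift_jis, cp932) Unicode code points.
differences = {
    raw.hex(): (code_point(raw, "shift_jis"), code_point(raw, "cp932"))
    for raw in SAMPLES
}

for code, (sjis_cp, cp932_cp) in differences.items():
    print(code, "shift_jis:", sjis_cp, "cp932:", cp932_cp)
```

With these codecs, 0x8160 decodes to U+301C under Shift_JIS but to U+FF5E under cp932, so data written through one mapping and read back through the other may not compare equal.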
Users of any Japanese character sets should be aware that
has an important effect. See
Section 5.1.2, “Server Command Options”.