Documentation Home
MySQL 5.6 Reference Manual
Related Documentation Download this Manual
PDF (US Ltr) - 31.6Mb
PDF (A4) - 31.6Mb
PDF (RPM) - 30.5Mb
HTML Download (TGZ) - 7.6Mb
HTML Download (Zip) - 7.6Mb
HTML Download (RPM) - 6.5Mb
Man Pages (TGZ) - 189.6Kb
Man Pages (Zip) - 304.1Kb
Info (Gzip) - 3.0Mb
Info (Zip) - 3.0Mb
Excerpts from this Manual

10.1.10.7 Asian Character Sets

The Asian character sets that we support include Chinese, Japanese, Korean, and Thai. These can be complicated. For example, the Chinese sets must allow for thousands of different characters. See Section 10.1.10.7.1, “The cp932 Character Set”, for additional information about the cp932 and sjis character sets.

For answers to some common questions and problems relating support for Asian character sets in MySQL, see Section A.11, “MySQL 5.6 FAQ: MySQL Chinese, Japanese, and Korean Character Sets”.

  • big5 (Big5 Traditional Chinese) collations:

    • big5_bin

    • big5_chinese_ci (default)

  • cp932 (SJIS for Windows Japanese) collations:

    • cp932_bin

    • cp932_japanese_ci (default)

  • eucjpms (UJIS for Windows Japanese) collations:

    • eucjpms_bin

    • eucjpms_japanese_ci (default)

  • euckr (EUC-KR Korean) collations:

    • euckr_bin

    • euckr_korean_ci (default)

  • gb2312 (GB2312 Simplified Chinese) collations:

    • gb2312_bin

    • gb2312_chinese_ci (default)

  • gbk (GBK Simplified Chinese) collations:

    • gbk_bin

    • gbk_chinese_ci (default)

  • sjis (Shift-JIS Japanese) collations:

    • sjis_bin

    • sjis_japanese_ci (default)

  • tis620 (TIS620 Thai) collations:

    • tis620_bin

    • tis620_thai_ci (default)

  • ujis (EUC-JP Japanese) collations:

    • ujis_bin

    • ujis_japanese_ci (default)

The big5_chinese_ci collation sorts on number of strokes.

10.1.10.7.1 The cp932 Character Set

Why is cp932 needed?

In MySQL, the sjis character set corresponds to the Shift_JIS character set defined by IANA, which supports JIS X0201 and JIS X0208 characters. (See http://www.iana.org/assignments/character-sets.)

However, the meaning of SHIFT JIS as a descriptive term has become very vague and it often includes the extensions to Shift_JIS that are defined by various vendors.

For example, SHIFT JIS used in Japanese Windows environments is a Microsoft extension of Shift_JIS and its exact name is Microsoft Windows Codepage : 932 or cp932. In addition to the characters supported by Shift_JIS, cp932 supports extension characters such as NEC special characters, NEC selected—IBM extended characters, and IBM selected characters.

Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:

  • MySQL automatically converts character sets.

  • Character sets are converted using Unicode (ucs2).

  • The sjis character set does not support the conversion of these extension characters.

  • There are several conversion rules from so-called SHIFT JIS to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).

The MySQL cp932 character set is designed to solve these problems.

Because MySQL supports character set conversion, it is important to separate IANA Shift_JIS and cp932 into two different character sets because they provide different conversion rules.

How does cp932 differ from sjis?

The cp932 character set differs from sjis in the following ways:

  • cp932 supports NEC special characters, NEC selected—IBM extended characters, and IBM selected characters.

  • Some cp932 characters have two different code points, both of which convert to the same Unicode code point. When converting from Unicode back to cp932, one of the code points must be selected. For this round trip conversion, the rule recommended by Microsoft is used. (See http://support.microsoft.com/kb/170559/EN-US/.)

    The conversion rule works like this:

    • If the character is in both JIS X 0208 and NEC special characters, use the code point of JIS X 0208.

    • If the character is in both NEC special characters and IBM selected characters, use the code point of NEC special characters.

    • If the character is in both IBM selected characters and NEC selected—IBM extended characters, use the code point of IBM extended characters.

    The table shown at https://msdn.microsoft.com/en-us/goglobal/cc305152.aspx provides information about the Unicode values of cp932 characters. For cp932 table entries with characters under which a four-digit number appears, the number represents the corresponding Unicode (ucs2) encoding. For table entries with an underlined two-digit value appears, there is a range of cp932 character values that begin with those two digits. Clicking such a table entry takes you to a page that displays the Unicode value for each of the cp932 characters that begin with those digits.

    The following links are of special interest. They correspond to the encodings for the following sets of characters:

    • NEC special characters (lead byte 0x87):

      https://msdn.microsoft.com/en-us/goglobal/gg674964
    • NEC selected—IBM extended characters (lead byte 0xED and 0xEE):

      https://msdn.microsoft.com/en-us/goglobal/gg671837
      https://msdn.microsoft.com/en-us/goglobal/gg671838
    • IBM selected characters (lead byte 0xFA, 0xFB, 0xFC):

      https://msdn.microsoft.com/en-us/goglobal/gg671839
      https://msdn.microsoft.com/en-us/goglobal/gg671840
      https://msdn.microsoft.com/en-us/goglobal/gg671841
  • cp932 supports conversion of user-defined characters in combination with eucjpms, and solves the problems with sjis/ujis conversion. For details, please refer to http://www.sljfaq.org/afaq/encodings.html.

For some characters, conversion to and from ucs2 is different for sjis and cp932. The following tables illustrate these differences.

Conversion to ucs2:

sjis/cp932 Valuesjis -> ucs2 Conversioncp932 -> ucs2 Conversion
5C005C005C
7E007E007E
815C20152015
815F005CFF3C
8160301CFF5E
816120162225
817C2212FF0D
819100A2FFE0
819200A3FFE1
81CA00ACFFE2

Conversion from ucs2:

ucs2 valueucs2 -> sjis Conversionucs2 -> cp932 Conversion
005C815F5C
007E7E7E
00A281913F
00A381923F
00AC81CA3F
2015815C815C
201681613F
2212817C3F
22253F8161
301C81603F
FF0D3F817C
FF3C3F815F
FF5E3F8160
FFE03F8191
FFE13F8192
FFE23F81CA

Users of any Japanese character sets should be aware that using --character-set-client-handshake (or --skip-character-set-client-handshake) has an important effect. See Section 5.1.4, “Server Command Options”.


User Comments
  Posted by on September 27, 2005
As of MySQL 4.1.14,
Please notice that for Traditional Chinese (BIG5), collation 'big5_chinese_ci' uses stroke count of the characters on ordering; while in Simplified Chinese (GB2312), collation 'gb2312_chinese_ci' uses Pinyin of the characters on ordering.
Sign Up Login You must be logged in to post a comment.