MySQL 5.7 Reference Manual
Asian Character Sets

The Asian character sets that we support include Chinese, Japanese, Korean, and Thai. These can be complicated. For example, the Chinese sets must allow for thousands of different characters. See “The cp932 Character Set” for additional information about the cp932 and sjis character sets. See “The gb18030 Character Set” for additional information about character set support for the Chinese National Standard GB 18030.

For answers to some common questions and problems relating to support for Asian character sets in MySQL, see Section A.11, “MySQL 5.7 FAQ: MySQL Chinese, Japanese, and Korean Character Sets”.

  • big5 (Big5 Traditional Chinese) collations:

    • big5_bin

    • big5_chinese_ci (default)

  • cp932 (SJIS for Windows Japanese) collations:

    • cp932_bin

    • cp932_japanese_ci (default)

  • eucjpms (UJIS for Windows Japanese) collations:

    • eucjpms_bin

    • eucjpms_japanese_ci (default)

  • euckr (EUC-KR Korean) collations:

    • euckr_bin

    • euckr_korean_ci (default)

  • gb2312 (GB2312 Simplified Chinese) collations:

    • gb2312_bin

    • gb2312_chinese_ci (default)

  • gbk (GBK Simplified Chinese) collations:

    • gbk_bin

    • gbk_chinese_ci (default)

  • gb18030 (China National Standard GB18030) collations:

    • gb18030_bin

    • gb18030_chinese_ci (default)

    • gb18030_unicode_520_ci

  • sjis (Shift-JIS Japanese) collations:

    • sjis_bin

    • sjis_japanese_ci (default)

  • tis620 (TIS620 Thai) collations:

    • tis620_bin

    • tis620_thai_ci (default)

  • ujis (EUC-JP Japanese) collations:

    • ujis_bin

    • ujis_japanese_ci (default)

The big5_chinese_ci collation sorts on number of strokes.
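
The character sets and collations listed above can be checked against a running server. The following is a minimal sketch, assuming a MySQL 5.7 server (gb18030 appears only in 5.7.4 and later); it uses the standard SHOW CHARACTER SET and SHOW COLLATION statements and does not modify any data.

```sql
-- List the Asian character sets discussed here, with their default collations.
SHOW CHARACTER SET WHERE Charset IN
  ('big5', 'cp932', 'eucjpms', 'euckr', 'gb2312',
   'gbk', 'gb18030', 'sjis', 'tis620', 'ujis');

-- List every collation available for a single character set.
SHOW COLLATION WHERE Charset = 'gb18030';
```

In the SHOW COLLATION output, the Default column marks the collation that is used when a character set is named without an explicit COLLATE clause.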

For additional information about Asian collations in MySQL, see Collation-Charts.Org (big5, cp932, eucjpms, euckr, gb2312, gbk, sjis, tis620, ujis).

The cp932 Character Set

Why is cp932 needed?

In MySQL, the sjis character set corresponds to the Shift_JIS character set defined by IANA, which supports JIS X0201 and JIS X0208 characters.

However, the meaning of SHIFT JIS as a descriptive term has become very vague and it often includes the extensions to Shift_JIS that are defined by various vendors.

For example, SHIFT JIS used in Japanese Windows environments is a Microsoft extension of Shift_JIS and its exact name is Microsoft Windows Codepage : 932 or cp932. In addition to the characters supported by Shift_JIS, cp932 supports extension characters such as NEC special characters, NEC selected—IBM extended characters, and IBM selected characters.

Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:

  • MySQL automatically converts character sets.

  • Character sets are converted using Unicode (ucs2).

  • The sjis character set does not support the conversion of these extension characters.

  • There are several conversion rules from so-called SHIFT JIS to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).

The MySQL cp932 character set is designed to solve these problems.

Because MySQL supports character set conversion, it is important to separate IANA Shift_JIS and cp932 into two different character sets because they provide different conversion rules.
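
As an illustration of keeping the two character sets separate, the following sketch stores text that originates from Japanese Windows clients in a cp932 column and declares cp932 as the connection character set. The table and column names are hypothetical, and the example assumes that the client really does send cp932-encoded data.

```sql
-- Hypothetical table: text from Japanese Windows clients is stored as cp932,
-- so the vendor extension characters have a defined conversion.
CREATE TABLE customer_notes (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  note VARCHAR(255)
) DEFAULT CHARACTER SET cp932 COLLATE cp932_japanese_ci;

-- Declare that this client sends, and expects to receive, cp932 data.
SET NAMES cp932;
```

If the client actually sends IANA Shift_JIS data without vendor extensions, declaring sjis instead is the matching choice; the point of the sketch is only that the declared character set should agree with what the client uses.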

How does cp932 differ from sjis?

The cp932 character set differs from sjis in the following ways:

For some characters, conversion to and from ucs2 is different for sjis and cp932. The following tables illustrate these differences.

Conversion to ucs2:

sjis/cp932 Value | sjis -> ucs2 Conversion | cp932 -> ucs2 Conversion

Conversion from ucs2:

ucs2 Value | ucs2 -> sjis Conversion | ucs2 -> cp932 Conversion
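
The conversion behavior of the two character sets can be inspected directly with CONVERT() and character set introducers. The following is a sketch only: the byte pair 0x8160 and the code point U+FF5E are sample inputs chosen for illustration, and any other value of interest can be substituted. No expected output is shown because the result depends on the conversion rules of the server in use.

```sql
-- Convert the same two-byte value to ucs2, once as sjis and once as cp932.
SELECT
  HEX(CONVERT(_sjis  0x8160 USING ucs2)) AS sjis_to_ucs2,
  HEX(CONVERT(_cp932 0x8160 USING ucs2)) AS cp932_to_ucs2;

-- Convert one ucs2 code point to each of the two character sets.
SELECT
  HEX(CONVERT(_ucs2 0xFF5E USING sjis))  AS ucs2_to_sjis,
  HEX(CONVERT(_ucs2 0xFF5E USING cp932)) AS ucs2_to_cp932;
```

A result of 3F in either column means that the character could not be converted and was replaced with '?'.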

Users of any Japanese character sets should be aware that using --character-set-client-handshake (or --skip-character-set-client-handshake) has an important effect. See Section 5.1.3, “Server Command Options”.

The gb18030 Character Set

In MySQL, the gb18030 character set, introduced in MySQL 5.7.4, corresponds to the Chinese National Standard GB 18030-2005: Information technology — Chinese coded character set, which is the official character set of the People's Republic of China (PRC).

Characteristics of the MySQL gb18030 Character Set
  • Supports all code points defined by the GB 18030-2005 standard. Unassigned code points in the ranges (GB+8431A439, GB+90308130) and (GB+E3329A36, GB+EF39EF39) are treated as '?' (0x3F). Conversion of unassigned code points returns '?'.

  • Supports UPPER and LOWER conversion for all GB18030 code points. Case folding defined by Unicode is also supported (based on CaseFolding-6.3.0.txt).

  • Supports conversion of data to and from other character sets.

  • Supports SQL statements such as SET NAMES.

  • Supports comparison between gb18030 strings, and between gb18030 strings and strings of other character sets. A conversion is performed if the strings have different character sets. Comparisons that include or ignore trailing spaces are also supported.

  • The private use area (U+E000, U+F8FF) in Unicode is mapped to gb18030.

  • There is no mapping between (U+D800, U+DFFF) and GB18030. Attempted conversion of code points in this range returns '?'.

  • If an incoming sequence is illegal, an error or warning is returned. If an illegal sequence is used in CONVERT(), an error is returned. Otherwise, a warning is returned.

  • For consistency with utf8 and utf8mb4, UPPER is not supported for ligatures.

  • Searches for ligatures also match uppercase ligatures when using the gb18030_unicode_520_ci collation.

  • If a character has more than one uppercase character, the chosen uppercase character is the one whose lowercase is the character itself.

  • The minimum multibyte length is 1 and the maximum is 4. The character set determines the length of a sequence using the first 1 or 2 bytes.
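
Several of the characteristics above, in particular conversion from other character sets and the variable sequence length, can be observed with a small test table. The following is a minimal sketch, assuming MySQL 5.7.4 or later; the table name gb18030_demo is hypothetical. The sample characters are written as utf8mb4 hex literals so that the statements do not depend on the client encoding.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE gb18030_demo (
  s VARCHAR(20)
) DEFAULT CHARACTER SET gb18030;

-- 0x61       is the ASCII letter a
-- 0xE4B8AD   is the UTF-8 encoding of U+4E2D
-- 0xF0A08080 is the UTF-8 encoding of U+20000
INSERT INTO gb18030_demo (s) VALUES
  (CONVERT(_utf8mb4 0x61       USING gb18030)),
  (CONVERT(_utf8mb4 0xE4B8AD   USING gb18030)),
  (CONVERT(_utf8mb4 0xF0A08080 USING gb18030));

-- LENGTH() counts bytes; CHAR_LENGTH() counts characters.
SELECT HEX(s)         AS gb18030_bytes,
       LENGTH(s)      AS byte_len,
       CHAR_LENGTH(s) AS char_len
FROM gb18030_demo;
```

Each row holds a single character, so char_len should be 1 throughout, while byte_len varies with the character and stays within the range of 1 to 4 bytes.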

Supported Collations
  • gb18030_bin: A binary collation.

  • gb18030_chinese_ci: The default collation, which supports Pinyin. Sorting of non-Chinese characters is based on the order of the original sort key. The original sort key is GB(UPPER(ch)) if UPPER(ch) exists. Otherwise, the original sort key is GB(ch). Chinese characters are sorted according to the Pinyin collation defined in the Unicode Common Locale Data Repository (CLDR 24). Non-Chinese characters are sorted before Chinese characters with the exception of GB+FE39FE39, which is the code point maximum.

  • gb18030_unicode_520_ci: A Unicode collation. Use this collation if you need to ensure that ligatures are sorted correctly.
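
The effect of each collation can be compared by sorting the same rows with an explicit COLLATE clause. The following is a sketch under stated assumptions: the table name gb18030_sort_demo is hypothetical, the sample rows are arbitrary, and no result order is shown because it is best confirmed on the server in use.

```sql
-- Hypothetical table holding a mix of Latin and Chinese characters.
CREATE TABLE gb18030_sort_demo (
  s VARCHAR(20)
) DEFAULT CHARACTER SET gb18030;

-- 0xE4B8AD is the UTF-8 encoding of U+4E2D; 0xE998BF is U+963F.
INSERT INTO gb18030_sort_demo (s) VALUES
  ('a'),
  ('B'),
  (CONVERT(_utf8mb4 0xE4B8AD USING gb18030)),
  (CONVERT(_utf8mb4 0xE998BF USING gb18030));

-- Binary order.
SELECT s, HEX(s) FROM gb18030_sort_demo ORDER BY s COLLATE gb18030_bin;

-- Default collation: Pinyin order for Chinese characters.
SELECT s, HEX(s) FROM gb18030_sort_demo ORDER BY s COLLATE gb18030_chinese_ci;

-- Unicode collation.
SELECT s, HEX(s) FROM gb18030_sort_demo ORDER BY s COLLATE gb18030_unicode_520_ci;
```

Including HEX(s) in the select list makes the rows identifiable even when the client cannot display the characters.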

User Comments
  Posted by on September 27, 2005
As of MySQL 4.1.14, please note that for Traditional Chinese (BIG5), the collation 'big5_chinese_ci' orders characters by stroke count, while for Simplified Chinese (GB2312), the collation 'gb2312_chinese_ci' orders characters by Pinyin.