Pre-General Availability Draft: 2017-12-16
The Asian character sets that we support include Chinese,
Japanese, Korean, and Thai. These can be complicated. For
example, the Chinese sets must allow for thousands of
different characters. See Section 10.1.10.7.1, “The cp932 Character Set”, for
additional information about the cp932 and
sjis character sets. See
Section 10.1.10.7.2, “The gb18030 Character Set”, for additional information
about character set support for the Chinese National Standard
GB 18030.
For answers to some common questions and problems relating to support for Asian character sets in MySQL, see Section A.11, “MySQL 8.0 FAQ: MySQL Chinese, Japanese, and Korean Character Sets”.
big5 (Big5 Traditional Chinese) collations:
cp932 (SJIS for Windows Japanese) collations:
eucjpms (UJIS for Windows Japanese) collations:
euckr (EUC-KR Korean) collations:
gb2312 (GB2312 Simplified Chinese) collations:
gbk (GBK Simplified Chinese) collations:
gb18030 (China National Standard GB18030) collations:
sjis (Shift-JIS Japanese) collations:
tis620 (TIS620 Thai) collations:
ujis (EUC-JP Japanese) collations:
The big5_chinese_ci collation sorts on number of strokes.
In MySQL, the sjis character set corresponds to the Shift_JIS character set defined by IANA, which supports JIS X0201 and JIS X0208 characters.
However, the meaning of “SHIFT JIS” as a descriptive term has become very vague, and it often includes the extensions to Shift_JIS that are defined by various vendors.
For example, “SHIFT JIS” used in Japanese Windows environments is a Microsoft extension of Shift_JIS and its exact name is Microsoft Windows Codepage : 932 or cp932. In addition to the characters supported by Shift_JIS, cp932 supports extension characters such as NEC special characters, NEC selected—IBM extended characters, and IBM selected characters.
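The split between the two encodings can be observed outside MySQL as well. The following is a minimal sketch, assuming that Python's standard `shift_jis` and `cp932` codecs are a reasonable stand-in for the distinction described above. They are not MySQL's conversion tables, and the helper name `can_encode` is hypothetical, introduced only for this illustration.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character in text has a mapping in codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# U+3042 HIRAGANA LETTER A is an ordinary JIS X 0208 character.
jis_x_0208 = "\u3042"
# U+2460 CIRCLED DIGIT ONE is one of the NEC special characters.
nec_special = "\u2460"

# Python's codecs are used here as an assumption, not as MySQL's implementation.
for codec in ("shift_jis", "cp932"):
    print(codec, can_encode(jis_x_0208, codec), can_encode(nec_special, codec))
```

In this sketch the plain JIS X 0208 character encodes under both codecs, while the NEC special character encodes only under `cp932`.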
Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:
MySQL automatically converts character sets.
Character sets are converted using Unicode (ucs2).
The sjis character set does not support the conversion of these extension characters.
There are several conversion rules from so-called “SHIFT JIS” to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).
The cp932 character set is designed to solve these problems.
Because MySQL supports character set conversion, it is important to separate IANA Shift_JIS and cp932 into two different character sets because they provide different conversion rules.
The cp932 character set differs from sjis in the following ways:
cp932 supports NEC special characters, NEC selected—IBM extended characters, and IBM selected characters.
Some cp932 characters have two different code points, both of which convert to the same Unicode code point. When converting from Unicode back to cp932, one of the code points must be selected. For this “round trip conversion,” the rule recommended by Microsoft is used. (See http://support.microsoft.com/kb/170559/EN-US/.)
The conversion rule works like this:
If the character is in both JIS X 0208 and NEC special characters, use the code point of JIS X 0208.
If the character is in both NEC special characters and IBM selected characters, use the code point of NEC special characters.
If the character is in both IBM selected characters and NEC selected—IBM extended characters, use the code point of IBM extended characters.
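The first two rules above can be sketched with Python's standard `cp932` codec. This is an illustration under stated assumptions, not MySQL's implementation: the byte values in `DUPLICATES` are assumed from the published cp932 layout, the names `DUPLICATES` and `round_trip` are hypothetical, and the third rule is left out because the sketch makes no assumption about how Python's codec resolves it.

```python
# Each entry: (character, preferred cp932 bytes, alternate cp932 bytes).
# Byte values are an assumption of this sketch, not read from MySQL.
DUPLICATES = [
    ("\u2252", b"\x81\xe0", b"\x87\x90"),  # JIS X 0208 vs. NEC special
    ("\u3231", b"\x87\x8a", b"\xfa\x58"),  # NEC special vs. IBM selected
]


def round_trip(raw: bytes) -> bytes:
    """Decode cp932 bytes to Unicode, then encode back to cp932."""
    return raw.decode("cp932").encode("cp932")


for char, preferred, alternate in DUPLICATES:
    # Both code points decode to the same Unicode character, so the
    # conversion back has to pick one of them.
    print(
        f"U+{ord(char):04X}",
        preferred.hex().upper(), "->", round_trip(preferred).hex().upper(),
        alternate.hex().upper(), "->", round_trip(alternate).hex().upper(),
    )
```

Under these assumptions, converting either code point to Unicode and back yields the preferred code point, which is what makes the round trip lossy for the alternate one.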
The table shown at https://msdn.microsoft.com/en-us/goglobal/cc305152.aspx provides information about the Unicode values of cp932 characters. For cp932 table entries with characters under which a four-digit number appears, the number represents the corresponding Unicode (ucs2) encoding. For table entries in which an underlined two-digit value appears, there is a range of cp932 character values that begin with those two digits. Clicking such a table entry takes you to a page that displays the Unicode value for each of the cp932 characters that begin with those digits.
The following links are of special interest. They correspond to the encodings for the following sets of characters:
NEC special characters (lead byte
NEC selected—IBM extended characters (lead byte
IBM selected characters (lead byte
https://msdn.microsoft.com/en-us/goglobal/gg671839 https://msdn.microsoft.com/en-us/goglobal/gg671840 https://msdn.microsoft.com/en-us/goglobal/gg671841
cp932 supports conversion of user-defined characters in combination with eucjpms, and solves the problems with ujis conversion. For details, please refer to http://www.sljfaq.org/afaq/encodings.html.
For some characters, conversion to and from ucs2 is different for sjis and cp932. The following tables illustrate these differences.
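As a hedged illustration of this kind of difference, the sketch below decodes two byte sequences with Python's standard `shift_jis` and `cp932` codecs and prints the resulting Unicode code points. Python's codec tables are used as an assumption here and may not match MySQL's sjis and cp932 conversion in every detail; the names `SAMPLES` and `to_code_point` are hypothetical.

```python
# Two double-byte sequences that are valid in both encodings.
SAMPLES = [b"\x81\x60", b"\x81\x7c"]


def to_code_point(raw: bytes, codec: str) -> str:
    """Decode a single character and format its code point as U+XXXX."""
    return f"U+{ord(raw.decode(codec)):04X}"


for raw in SAMPLES:
    print(
        raw.hex().upper(),
        to_code_point(raw, "shift_jis"),
        to_code_point(raw, "cp932"),
    )
# Prints:
# 8160 U+301C U+FF5E
# 817C U+2212 U+FF0D
```

The same bytes reach different Unicode code points depending on the conversion rule, so data converted with one rule does not necessarily convert back cleanly with the other.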
Users of any Japanese character sets should be aware that
has an important effect. See
Section 5.1.4, “Server Command Options”.
In MySQL, the
gb18030 character set
corresponds to the “Chinese National Standard GB
18030-2005: Information technology — Chinese coded
character set”, which is the official character set
of the People's Republic of China (PRC).
Supports all code points defined by the GB 18030-2005 standard. Unassigned code points in the ranges (GB+8431A439, GB+90308130) and (GB+E3329A36, GB+EF39EF39) are treated as '?' (0x3F). Conversion of unassigned code points returns '?'.
Supports UPPER and LOWER conversion for all GB18030 code points. Case folding defined by Unicode is also supported (based on
Supports conversion of data to and from other character sets.
Supports SQL statements such as
Supports comparison between gb18030 strings, and between gb18030 strings and strings of other character sets. If the strings have different character sets, a conversion is performed. Comparisons that include or ignore trailing spaces are also supported.
The private use area (U+E000, U+F8FF) in Unicode is mapped to gb18030.
There is no mapping between (U+D800, U+DFFF) and GB18030. Attempted conversion of code points in this range returns '?'.
If an incoming sequence is illegal, an error or warning is returned. If an illegal sequence is used in CONVERT(), an error is returned. Otherwise, a warning is returned.
For consistency with utf8mb4, UPPER is not supported for ligatures.
Searches for ligatures also match uppercase ligatures when using the gb18030_unicode_520_ci collation.
If a character has more than one uppercase character, the chosen uppercase character is the one whose lowercase is the character itself.
The minimum multibyte length is 1 and the maximum is 4. The character set determines the length of a sequence using the first 1 or 2 bytes.
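The length rule in the last item can be sketched as follows. This is a minimal illustration that assumes the byte ranges of the published GB 18030 layout (single byte up to 0x7F, lead byte 0x81 to 0xFE, second byte 0x30 to 0x39 for four-byte sequences); it is not MySQL's internal code, and the function names `gb18030_seq_len` and `split_gb18030` are hypothetical.

```python
def gb18030_seq_len(data: bytes, pos: int = 0) -> int:
    """Return the length of the gb18030 sequence that starts at data[pos].

    Only the first one or two bytes are inspected. The byte ranges are an
    assumption of this sketch, taken from the GB 18030 layout.
    """
    first = data[pos]
    if first <= 0x7F:
        return 1  # single-byte (ASCII) range
    if not 0x81 <= first <= 0xFE:
        raise ValueError(f"invalid lead byte 0x{first:02X}")
    if pos + 1 >= len(data):
        raise ValueError("truncated sequence")
    second = data[pos + 1]
    if 0x30 <= second <= 0x39:
        return 4  # four-byte sequence
    if 0x40 <= second <= 0xFE and second != 0x7F:
        return 2  # two-byte sequence
    raise ValueError(f"invalid second byte 0x{second:02X}")


def split_gb18030(data: bytes) -> list[bytes]:
    """Split an encoded string into its individual character sequences."""
    chars = []
    pos = 0
    while pos < len(data):
        n = gb18030_seq_len(data, pos)
        if pos + n > len(data):
            raise ValueError("truncated sequence")
        chars.append(data[pos:pos + n])
        pos += n
    return chars


# 'A', U+4E2D, and U+20000 encoded with Python's gb18030 codec.
encoded = "A\u4e2d\U00020000".encode("gb18030")
print([len(c) for c in split_gb18030(encoded)])  # [1, 2, 4]
```

Because the second byte of a four-byte sequence falls in the digit range 0x30 to 0x39, which never occurs as the second byte of a two-byte sequence, two bytes are enough to decide the length.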
gb18030_bin: A binary collation.
gb18030_chinese_ci: The default collation, which supports Pinyin. Sorting of non-Chinese characters is based on the order of the original sort key. The original sort key is GB(UPPER(ch)) if UPPER(ch) exists. Otherwise, the original sort key is GB(ch). Chinese characters are sorted according to the Pinyin collation defined in the Unicode Common Locale Data Repository (CLDR 24). Non-Chinese characters are sorted before Chinese characters with the exception of GB+FE39FE39, which is the code point maximum.
gb18030_unicode_520_ci: A Unicode collation. Use this collation if you need to ensure that ligatures are sorted correctly.