A.11.1.  | 
        What CJK character sets are available in MySQL?
        | 
 | 
        The list of CJK character sets may vary depending on your MySQL
        version. For example, the gb18030 character
        set is not supported prior to MySQL 5.7.4. However, since the
        name of the applicable language appears in the
        DESCRIPTION column for every entry in the
        INFORMATION_SCHEMA.CHARACTER_SETS
        table, you can obtain a current list of all the non-Unicode CJK
        character sets using this query:
       mysql> SELECT CHARACTER_SET_NAME, DESCRIPTION
       FROM INFORMATION_SCHEMA.CHARACTER_SETS
       WHERE DESCRIPTION LIKE '%Chin%'
       OR DESCRIPTION LIKE '%Japanese%'
       OR DESCRIPTION LIKE '%Korean%'
       ORDER BY CHARACTER_SET_NAME;
+--------------------+---------------------------------+
| CHARACTER_SET_NAME | DESCRIPTION                     |
+--------------------+---------------------------------+
| big5               | Big5 Traditional Chinese        |
| cp932              | SJIS for Windows Japanese       |
| eucjpms            | UJIS for Windows Japanese       |
| euckr              | EUC-KR Korean                   |
| gb18030            | China National Standard GB18030 |
| gb2312             | GB2312 Simplified Chinese       |
| gbk                | GBK Simplified Chinese          |
| sjis               | Shift-JIS Japanese              |
| ujis               | EUC-JP Japanese                 |
+--------------------+---------------------------------+
 
        (For more information, see
        Section 28.3.4, “The INFORMATION_SCHEMA CHARACTER_SETS Table”.)
       
        MySQL supports three variants of the
        GB (Guojia
        Biaozhun, or National
        Standard, or Simplified Chinese)
        character sets which are official in the People's Republic of
        China: gb2312, gbk, and
        (as of MySQL 5.7.4) gb18030.
       
        Sometimes people try to insert gbk characters
        into gb2312, and it works most of the time
        because gbk is a superset of
        gb2312. But eventually they try to insert a
        rarer Chinese character and it does not work. (For an example,
        see Bug #16072).
       
        Here, we try to clarify exactly what characters are legitimate
        in gb2312 or gbk, with
        reference to the official documents. Please check these
        references before reporting gb2312 or
        gbk bugs:
 
        It is also possible to store CJK characters in Unicode character
        sets, although the available collations may not sort characters
        quite as you expect:
 
            The utf8 and ucs2
            character sets support the characters from Unicode Basic
            Multilingual Plane (BMP). These characters have code point
            values between U+0000 and
            U+FFFF.
           
            The utf8mb4, utf16,
            utf16le, and utf32
            character sets support BMP characters, as well as
            supplementary characters that lie outside the BMP.
            Supplementary characters have code point values between
            U+10000 and U+10FFFF.
  
 
        The collation used for a Unicode character set determines the
        ability to sort (that is, distinguish) characters in the set:
 
            Collations based on Unicode Collation Algorithm (UCA) 4.0.0
            distinguish only BMP characters.
           
            Collations based on UCA 5.2.0 or 9.0.0 distinguish BMP and
            supplementary characters.
           
            Non-UCA collations may not distinguish all Unicode
            characters. For example, the utf8mb4
            default collation is utf8mb4_general_ci,
            which distinguishes only BMP characters.
  
 
        Moreover, distinguishing characters is not the same as ordering
        them per the conventions of a given CJK language. Currently,
        MySQL has only one CJK-specific UCA collation,
        gb18030_unicode_520_ci (which requires use of
        the non-Unicode gb18030 character set).
       
        For information about Unicode collations and their
        differentiating properties, including collation properties for
        supplementary characters, see
        Section 12.10.1, “Unicode Character Sets”.
        | 
A.11.2.  | 
        I have inserted CJK characters into my table. Why does
        SELECT display them as
        “?” characters?
        | 
 | 
        This problem is usually due to a setting in MySQL that does not
        match the settings for the application program or the operating
        system. Here are some common steps for correcting these types of
        issues:
 
            Be certain of what MySQL version you are
            using.
           
            Use the statement SELECT VERSION(); to
            determine this.
           
            Make sure that the database is actually using the
            desired character set.
           
            People often think that the client character set is always
            the same as either the server character set or the character
            set used for display purposes. However, both of these are
            false assumptions. You can make sure by checking the result
            of SHOW CREATE TABLE
            tablename or, better
            yet, by using this statement:
           SELECT character_set_name, collation_name
    FROM information_schema.columns
    WHERE table_schema = your_database_name
        AND table_name = your_table_name
        AND column_name = your_column_name;
 
            Determine the hexadecimal value of the character
            or characters that are not being displayed
            correctly.
           
            You can obtain this information for a column
            column_name in the table
            table_name using the following
            query:
           SELECT HEX(column_name)
FROM table_name;
 
            3F is the encoding for the
            ? character; this means that
            ? is the character actually stored in the
            column. This most often happens because of a problem
            converting a particular character from your client character
            set to the target character set.
           
            Make sure that a round trip is possible. When you
            select literal (or
            _introducer hexadecimal-value),
            do you obtain literal as a
            result?
           
            For example, the Japanese Katakana character
            Pe (ペ')
            exists in all CJK character sets, and has the code point
            value (hexadecimal coding) 0x30da. To
            test a round trip for this character, use this query:
           SELECT 'ペ' AS `ペ`;         /* or SELECT _ucs2 0x30da; */
 
            If the result is not also ペ, the round
            trip failed.
           
            For bug reports regarding such failures, we might ask you to
            follow up with SELECT HEX('ペ');. Then
            we can determine whether the client encoding is correct.
           
            Make sure that the problem is not with the browser
            or other application, rather than with MySQL.
           
            Use the mysql client program to
            accomplish this task. If mysql displays
            characters correctly but your application does not, your
            problem is probably due to system settings.
           
            To determine your settings, use the
            SHOW VARIABLES statement,
            whose output should resemble what is shown here:
           mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | utf8                                   |
| character_set_connection | utf8                                   |
| character_set_database   | latin1                                 |
| character_set_filesystem | binary                                 |
| character_set_results    | utf8                                   |
| character_set_server     | latin1                                 |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
 
            These are typical character-set settings for an
            international-oriented client (notice the use of
            utf8 Unicode) connected to a server in
            the West (latin1 is a West Europe
            character set).
           
            Although Unicode (usually the utf8
            variant on Unix, and the ucs2 variant on
            Windows) is preferable to Latin, it is often not what your
            operating system utilities support best. Many Windows users
            find that a Microsoft character set, such as
            cp932 for Japanese Windows, is suitable.
           
            If you cannot control the server settings, and you have no
            idea what setting your underlying computer uses, try
            changing to a common character set for the country that
            you're in (euckr = Korea;
            gb18030, gb2312 or
            gbk = People's Republic of China;
            big5 = Taiwan; sjis,
            ujis, cp932, or
            eucjpms = Japan; ucs2
            or utf8 = anywhere). Usually it is
            necessary to change only the client and connection and
            results settings. The SET
            NAMES. statement changes all three at once. For
            example:
           SET NAMES 'big5';
 
            Once the setting is correct, you can make it permanent by
            editing my.cnf or
            my.ini. For example you might add lines
            looking like these:
           [mysqld]
character-set-server=big5
[client]
default-character-set=big5
 
            It is also possible that there are issues with the API
            configuration setting being used in your application; see
            Why does my GUI front end or browser not display
            CJK characters correctly...? for more
            information.
  
 
 | 
A.11.3.  | 
        What problems should I be aware of when working with the Big5
        Chinese character set?
        | 
 | 
        MySQL supports the Big5 character set which is common in Hong
        Kong and Taiwan (Republic of China). The MySQL
        big5 character set is in reality Microsoft
        code page 950, which is very similar to the original
        big5 character set.
       
        A feature request for adding HKSCS extensions
        has been filed. People who need this extension may find the
        suggested patch for Bug #13577 to be of interest.
        | 
A.11.4.  | 
        Why do Japanese character set conversions fail?
        | 
 | 
        MySQL supports the sjis,
        ujis, cp932, and
        eucjpms character sets, as well as Unicode. A
        common need is to convert between character sets. For example,
        there might be a Unix server (typically with
        sjis or ujis) and a
        Windows client (typically with cp932).
       
        In the following conversion table, the ucs2
        column represents the source, and the sjis,
        cp932, ujis, and
        eucjpms columns represent the destinations;
        that is, the last 4 columns provide the hexadecimal result when
        we use CONVERT(ucs2) or we assign
        a ucs2 column containing the value to an
        sjis, cp932,
        ujis, or eucjpms column.
 
        Now consider the following portion of the table.
 
        This means that MySQL converts the NOT SIGN
        (Unicode U+00AC) to sjis
        code point 0x81CA and to
        cp932 code point 3F.
        (3F is the question mark (“?”.
        This is what is always used when the conversion cannot be
        performed.)
        | 
A.11.5.  | 
        What should I do if I want to convert SJIS
        81CA to cp932?
        | 
 | 
        Our answer is: “?”. There are disadvantages to
        this, and many people would prefer a “loose”
        conversion, so that 81CA (NOT SIGN) in
        sjis becomes 81CA (FULLWIDTH NOT
        SIGN) in cp932.
        | 
A.11.6.  | 
        How does MySQL represent the Yen (¥) sign?
        | 
 | 
        A problem arises because some versions of Japanese character
        sets (both sjis and euc)
        treat 5C as a
        reverse solidus
        (\, also known as a backslash), whereas
        others treat it as a yen sign (¥).
       
        MySQL follows only one version of the JIS (Japanese Industrial
        Standards) standard description. In MySQL,
        5C is always the reverse solidus
        (\).
        | 
A.11.7.  | 
        Of what issues should I be aware when working with Korean
        character sets in MySQL?
        | 
 | 
        In theory, while there have been several versions of the
        euckr (Extended
        Unix Code Korea) character set, only one problem has
        been noted. We use the “ASCII” variant of EUC-KR,
        in which the code point 0x5c is REVERSE
        SOLIDUS, that is \, instead of the
        “KS-Roman” variant of EUC-KR, in which the code
        point 0x5c is WON SIGN
        (₩). This means that you cannot convert
        Unicode U+20A9 to euckr:
       mysql> SELECT
           CONVERT('₩' USING euckr) AS euckr,
           HEX(CONVERT('₩' USING euckr)) AS hexeuckr;
+-------+----------+
| euckr | hexeuckr |
+-------+----------+
| ?     | 3F       |
+-------+----------+
  | 
A.11.8.  | 
        Why do I get Incorrect string value error
        messages?
        | 
 | 
        To see the problem, create a table with one Unicode
        (ucs2) column and one Chinese
        (gb2312) column.
       mysql> CREATE TABLE ch
       (ucs2 CHAR(3) CHARACTER SET ucs2,
       gb2312 CHAR(3) CHARACTER SET gb2312);
 
        In nonstrict SQL mode, try to place the rare character
        汌 in both columns.
       mysql> SET sql_mode = '';
mysql> INSERT INTO ch VALUES ('A汌B','A汌B');
Query OK, 1 row affected, 1 warning (0.00 sec)
 
        The INSERT produces a warning.
        Use the following statement to see what it is:
       mysql> SHOW WARNINGS\G
*************************** 1. row ***************************
  Level: Warning
   Code: 1366
Message: Incorrect string value: '\xE6\xB1\x8CB' for column 'gb2312' at row 1
 
        So it is a warning about the gb2312 column
        only.
       mysql> SELECT ucs2,HEX(ucs2),gb2312,HEX(gb2312) FROM ch;
+-------+--------------+--------+-------------+
| ucs2  | HEX(ucs2)    | gb2312 | HEX(gb2312) |
+-------+--------------+--------+-------------+
| A汌B | 00416C4C0042 | A?B    | 413F42      |
+-------+--------------+--------+-------------+
 
        Several things need explanation here:
 
            The 汌 character is not in the
            gb2312 character set, as described
            earlier.
           
            If you are using an old version of MySQL, you may see a
            different message.
           
            A warning occurs rather than an error because MySQL is not
            set to use strict SQL mode. In nonstrict mode, MySQL tries
            to do what it can, to get the best fit, rather than give up.
            With strict SQL mode, the Incorrect string
            value message occurs as an error rather than a
            warning, and the INSERT
            fails.
  
 
 | 
A.11.9.  | 
        Why does my GUI front end or browser display CJK characters
        incorrectly in my application using Access, PHP, or another API?
        | 
 | 
        Obtain a direct connection to the server using the
        mysql client, and try the same query there.
        If mysql responds correctly, the trouble may
        be that your application interface requires initialization. Use
        mysql to tell you what character set or sets
        it uses with the statement SHOW VARIABLES LIKE
        'char%';. If you are using Access, you are most likely
        connecting with Connector/ODBC. In this case, you should check
        Configuring Connector/ODBC. If, for example,
        you use big5, you would enter SET
        NAMES 'big5'. (In this case, no ;
        character is required.) If you are using ASP, you might need to
        add SET NAMES in the code. Here
        is an example that has worked in the past:
       <%
Session.CodePage=0
Dim strConnection
Dim Conn
strConnection="driver={MySQL ODBC 3.51 Driver};server=server;uid=username;" \
               & "pwd=password;database=database;stmt=SET NAMES 'big5';"
Set Conn = Server.CreateObject("ADODB.Connection")
Conn.Open strConnection
%>
 
        In much the same way, if you are using any character set other
        than latin1 with Connector/NET, you must specify the
        character set in the connection string. See
        Connector/NET Connections, for more
        information.
       
        If you are using PHP, try this:
       <?php
  $link = new mysqli($host, $usr, $pwd, $db);
  if( mysqli_connect_errno() )
  {
    printf("Connect failed: %s\n", mysqli_connect_error());
    exit();
  }
  $link->query("SET NAMES 'utf8'");
?>
 
        In this case, we used SET NAMES
        to change character_set_client,
        character_set_connection, and
        character_set_results.
       
        Another issue often encountered in PHP applications has to do
        with assumptions made by the browser. Sometimes adding or
        changing a <meta> tag suffices to
        correct the problem: for example, to insure that the user agent
        interprets page content as UTF-8, include
        <meta http-equiv="Content-Type" content="text/html;
        charset=utf-8"> in the
        <head> section of the HTML page.
       
        If you are using Connector/J, see
        Using Character Sets and Unicode.
        | 
A.11.10.  | 
        I've upgraded to MySQL 8.4. How can I revert to
        behavior like that in MySQL 4.0 with regard to character sets?
        | 
 | 
        In MySQL Version 4.0, there was a single “global”
        character set for both server and client, and the decision as to
        which character to use was made by the server administrator.
        This changed starting with MySQL Version 4.1. What happens now
        is a “handshake”, as described in
        Section 12.4, “Connection Character Sets and Collations”:
 
        The effect of this is that you cannot control the client
        character set by starting mysqld with
        --character-set-server=utf8.
        However, some Asian customers prefer the MySQL 4.0 behavior. To
        make it possible to retain this behavior, we added a
        mysqld switch,
        --character-set-client-handshake,
        which can be turned off with
        --skip-character-set-client-handshake.
        If you start mysqld with
        --skip-character-set-client-handshake,
        then, when a client connects, it sends to the server the name of
        the character set that it wants to use. However, the
        server ignores this request from the client.
       
        By way of example, suppose that your favorite server character
        set is latin1. Suppose further that the
        client uses utf8 because this is what the
        client's operating system supports. Start the server with
        latin1 as its default character set:
       mysqld --character-set-server=latin1
 
        And then start the client with the default character set
        utf8:
       mysql --default-character-set=utf8
 
        The resulting settings can be seen by viewing the output of
        SHOW VARIABLES:
       mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | utf8                                   |
| character_set_connection | utf8                                   |
| character_set_database   | latin1                                 |
| character_set_filesystem | binary                                 |
| character_set_results    | utf8                                   |
| character_set_server     | latin1                                 |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
 
        Now stop the client, and stop the server using
        mysqladmin. Then start the server again, but
        this time tell it to skip the handshake like so:
       mysqld --character-set-server=utf8 --skip-character-set-client-handshake
 
        Start the client with utf8 once again as the
        default character set, then display the resulting settings:
       mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | latin1                                 |
| character_set_connection | latin1                                 |
| character_set_database   | latin1                                 |
| character_set_filesystem | binary                                 |
| character_set_results    | latin1                                 |
| character_set_server     | latin1                                 |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
 
        As you can see by comparing the differing results from
        SHOW VARIABLES, the server
        ignores the client's initial settings if the
        --skip-character-set-client-handshake
        option is used.
        | 
A.11.11.  | 
        Why do some LIKE and
        FULLTEXT searches with CJK characters fail?
        | 
 | 
        For LIKE searches, there is a very
        simple problem with binary string column types such as
        BINARY and
        BLOB: we must know where
        characters end. With multibyte character sets, different
        characters might have different octet lengths. For example, in
        utf8, A requires one byte
        but ペ requires three bytes, as shown here:
       +-------------------------+---------------------------+
| OCTET_LENGTH(_utf8 'A') | OCTET_LENGTH(_utf8 'ペ') |
+-------------------------+---------------------------+
|                       1 |                         3 |
+-------------------------+---------------------------+
 
        If we do not know where the first character in a string ends, we
        do not know where the second character begins, in which case
        even very simple searches such as
        LIKE '_A%' fail.
        The solution is to use a nonbinary string column type defined to
        have the proper CJK character set. For example: mycol
        TEXT CHARACTER SET sjis. Alternatively, convert to a
        CJK character set before comparing.
       
        This is one reason why MySQL cannot permit encodings of
        nonexistent characters. If it is not strict about rejecting bad
        input, it has no way of knowing where characters end.
       
        For FULLTEXT searches, we must know where
        words begin and end. With Western languages, this is rarely a
        problem because most (if not all) of these use an
        easy-to-identify word boundary: the space character. However,
        this is not usually the case with Asian writing. We could use
        arbitrary halfway measures, like assuming that all Han
        characters represent words, or (for Japanese) depending on
        changes from Katakana to Hiragana due to grammatical endings.
        However, the only sure solution requires a comprehensive word
        list, which means that we would have to include a dictionary in
        the server for each Asian language supported. This is simply not
        feasible.
        | 
A.11.12.  | 
        How do I know whether character X is
        available in all character sets?
        | 
 | 
        The majority of simplified Chinese and basic nonhalfwidth
        Japanese Kana characters appear in all CJK character sets. The
        following stored procedure accepts a UCS-2
        Unicode character, converts it to other character sets, and
        displays the results in hexadecimal.
       DELIMITER //
CREATE PROCEDURE p_convert(ucs2_char CHAR(1) CHARACTER SET ucs2)
BEGIN
CREATE TABLE tj
             (ucs2 CHAR(1) character set ucs2,
              utf8 CHAR(1) character set utf8,
              big5 CHAR(1) character set big5,
              cp932 CHAR(1) character set cp932,
              eucjpms CHAR(1) character set eucjpms,
              euckr CHAR(1) character set euckr,
              gb2312 CHAR(1) character set gb2312,
              gbk CHAR(1) character set gbk,
              sjis CHAR(1) character set sjis,
              ujis CHAR(1) character set ujis);
INSERT INTO tj (ucs2) VALUES (ucs2_char);
UPDATE tj SET utf8=ucs2,
              big5=ucs2,
              cp932=ucs2,
              eucjpms=ucs2,
              euckr=ucs2,
              gb2312=ucs2,
              gbk=ucs2,
              sjis=ucs2,
              ujis=ucs2;
/* If there are conversion problems, UPDATE produces warnings. */
SELECT hex(ucs2) AS ucs2,
       hex(utf8) AS utf8,
       hex(big5) AS big5,
       hex(cp932) AS cp932,
       hex(eucjpms) AS eucjpms,
       hex(euckr) AS euckr,
       hex(gb2312) AS gb2312,
       hex(gbk) AS gbk,
       hex(sjis) AS sjis,
       hex(ujis) AS ujis
FROM tj;
DROP TABLE tj;
END//
DELIMITER ;
 
        The input can be any single ucs2 character,
        or it can be the code value (hexadecimal representation) of that
        character. For example, from Unicode's list of
        ucs2 encodings and names
        (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt),
        we know that the Katakana character
        Pe appears in all CJK character
        sets, and that its code value is X'30DA'. If
        we use this value as the argument to
        p_convert(), the result is as shown here:
       mysql> CALL p_convert(X'30DA');
+------+--------+------+-------+---------+-------+--------+------+------+------+
| ucs2 | utf8   | big5 | cp932 | eucjpms | euckr | gb2312 | gbk  | sjis | ujis |
+------+--------+------+-------+---------+-------+--------+------+------+------+
| 30DA | E3839A | C772 | 8379  | A5DA    | ABDA  | A5DA   | A5DA | 8379 | A5DA |
+------+--------+------+-------+---------+-------+--------+------+------+------+
 
        Since none of the column values is 3F (that
        is, the question mark character, ?), we know
        that every conversion worked.
        | 
A.11.13.  | 
        Why do CJK strings sort incorrectly in Unicode? (I)
        | 
 | 
        CJK sorting problems that occurred in older MySQL versions can
        be solved as of MySQL 8.0 by using the
        utf8mb4 character set and the
        utf8mb4_ja_0900_as_cs collation.
        | 
A.11.14.  | 
        Why do CJK strings sort incorrectly in Unicode? (II)
        | 
 | 
        CJK sorting problems that occurred in older MySQL versions can
        be solved as of MySQL 8.0 by using the
        utf8mb4 character set and the
        utf8mb4_ja_0900_as_cs collation.
        | 
A.11.15.  | 
        Why are my supplementary characters rejected by MySQL?
        | 
 | 
        Supplementary characters lie outside the Unicode Basic
        Multilingual Plane / Plane 0. BMP characters have
        code point values between U+0000 and
        U+FFFF. Supplementary characters have code
        point values between U+10000 and
        U+10FFFF.
       
        To store supplementary characters, you must use a character set
        that permits them:
 
            The utf8 and ucs2
            character sets support BMP characters only.
           
            The utf8 character set permits only
            UTF-8 characters that take up to three
            bytes. This has led to reports such as that found in Bug
            #12600, which we rejected as “not a bug”. With
            utf8, MySQL must truncate an input string
            when it encounters bytes that it does no understand.
            Otherwise, it is unknown how long the bad multibyte
            character is.
           
            One possible workaround is to use ucs2
            instead of utf8, in which case the
            “bad” characters are changed to question marks.
            However, no truncation takes place. You can also change the
            data type to BLOB or
            BINARY, which perform no
            validity checking.
           
            The utf8mb4, utf16,
            utf16le, and utf32
            character sets support BMP characters, as well as
            supplementary characters outside the BMP.
  
 
 | 
A.11.16.  | 
        Should “CJK” be “CJKV”?
        | 
 | 
        No. The term “CJKV”
        (Chinese Japanese Korean
        Vietnamese) refers to Vietnamese character sets which
        contain Han (originally Chinese) characters. MySQL supports the
        modern Vietnamese script with Western characters, but does not
        support the old Vietnamese script using Han characters.
       
        As of MySQL 5.6, there are Vietnamese collations for Unicode
        character sets, as described in
        Section 12.10.1, “Unicode Character Sets”.
        | 
A.11.17.  | 
        Does MySQL permit CJK characters to be used in database and
        table names?
        | 
 | 
        Yes.
        | 
A.11.18.  | 
        Where can I find translations of the MySQL Manual into Chinese,
        Japanese, and Korean?
        | 
 | 
        The Japanese translation of the MySQL 5.6 manual can be
        downloaded from https://dev.mysql.com/doc/.
        | 
A.11.19.  | 
        Where can I get help with CJK and related issues in MySQL?
        | 
 | 
        The following resources are available:
 
  |