This set of Frequently Asked Questions derives from the experience of MySQL's Support and Development groups in handling many inquiries about CJK (Chinese-Japanese-Korean) issues.
Questions
A.12.1:
I have inserted CJK characters into my table. Why does
SELECT display them as 「?」
characters?
A.12.2: What GB (Chinese) character sets does MySQL support?
A.12.3: What problems should I be aware of when working with the Big5 Chinese character set?
A.12.4: Why do Japanese character set conversions fail?
A.12.5:
What should I do if I want to convert SJIS
81CA to cp932?
A.12.6:
How does MySQL represent the Yen (¥)
sign?
A.12.7:
Do MySQL plan to make a separate character set where
5C is the Yen sign, as at least one other
major DBMS does?
A.12.8: Of what issues should I be aware when working with Korean character sets in MySQL?
A.12.9: Why do I get Data truncated error messages?
A.12.10: Why does my GUI front end or browser not display CJK characters correctly in my application using Access, PHP, or another API?
A.12.11: I've upgraded to MySQL 5.1. How can I revert to behavior like that in MySQL 4.0 with regard to character sets?
A.12.12:
Why do some LIKE and
FULLTEXT searches with CJK characters
fail?
A.12.13: What CJK character sets are available in MySQL?
A.12.14:
How do I know whether character X
is available in all character sets?
A.12.15: Why don't CJK strings sort correctly in Unicode? (I)
A.12.16: Why don't CJK strings sort correctly in Unicode? (II)
A.12.17: Why are my supplementary characters rejected by MySQL?
A.12.18: Shouldn't it be 「CJKV」?
A.12.19: Does MySQL allow CJK characters to be used in database and table names?
A.12.20: Where can I find translations of the MySQL Manual into Chinese, Japanese, and Korean?
A.12.21: Where can I get help with CJK and related issues in MySQL?
Questions and Answers
A.12.1:
I have inserted CJK characters into my table. Why does
SELECT display them as 「?」
characters?
This problem is usually due to a setting in MySQL that doesn't match the settings for the application program or the operating system. Here are some common steps for correcting these types of issues:
Be certain of what MySQL version you are using.
Use the statement SELECT VERSION();
to determine this.
Make sure that the database is actually using the desired character set.
People often think that the client character set is
always the same as either the server character set or
the character set used for display purposes. However,
both of these are false assumptions. You can make sure
by checking the result of SHOW CREATE TABLE
or
— better — yet by using this statement:
tablename
SELECT character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = your_database_name
AND table_name = your_table_name
AND column_name = your_column_name;
Determine the hexadecimal value of the character or characters that are not being displayed correctly.
You can obtain this information for a column
column_name in the table
table_name using the
following query:
SELECT HEX(column_name) FROMtable_name;
3F is the encoding for the
? character; this means that
? is the character actually stored in
the column. This most often happens because of a problem
converting a particular character from your client
character set to the target character set.
Make sure that a round trip possible —
that is, when you select
literal (or
_introducer
hexadecimal-value), you obtain
literal as a
result.
For example, the Japanese
Katakana character
Pe
(ペ') exists in all CJK character
sets, and has the code point value (hexadecimal coding)
0x30da. To test a round trip for this
character, use this query:
SELECT 'ペ' AS `ペ`; /* or SELECT _ucs2 0x30da; */
For bug reports regarding such failures, we might ask
you to follow up with SELECT
HEX('ペ');. Then we can determine whether the
client encoding is correct.
Make sure that the problem is not with the browser or other application, rather than with MySQL.
Use the mysql client program (on Windows: mysql.exe) to accomplish this task. If mysql displays correctly but your application doesn't, then your problem is probably due to system settings.
To find out what your settings are, use the
SHOW VARIABLES statement, whose
output should resemble what is shown here:
mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.03 sec)
Although Unicode (usually the utf8
variant on Unix, and the ucs2 variant
on Windows) is preferable to Latin, it's often not what
your operating system utilities support best. Many
Windows users find that a Microsoft character set, such
as cp932 for Japanese Windows, is
what's suitable.
If you cannot control the server settings, and you have
no idea what your underlying computer is, then try
changing to a common character set for the country that
you're in (euckr = Korea;
gb2312 or gbk =
People's Republic of China; big5 =
Taiwan; sjis,
ujis, cp932, or
eucjpms = Japan;
ucs2 or utf8 =
anywhere). Usually it is necessary to change only the
client and connection and results settings. There is a
simple statement which changes all three at once:
SET NAMES. For example:
SET NAMES 'big5';
Once the setting is correct, you can make it permanent
by editing my.cnf or
my.ini. For example you might add
lines looking like these:
[mysqld] character-set-server=big5 [client] default-character-set=big5
It is also possible that there are issues with the API configuration setting being used in your application; see Why does my GUI front end or browser not display CJK characters correctly...? for more information.
A.12.2: What GB (Chinese) character sets does MySQL support?
MySQL supports the two common variants of the
GB (Guojia
Biaozhun, or National
Standard) character sets which are official in
the People's Republic of China: gb2312
and gbk. Sometimes people try to insert
gbk characters into
gb2312, and it works most of the time
because gbk is a superset of
gb2312 — but eventually they try to
insert a rarer Chinese character and it doesn't work. (See
Bug #16072 for an example).
Here, we try to clarify exactly what characters are
legitimate in gb2312 or
gbk, with reference to the official
documents. Please check these references before reporting
gb2312 or gbk bugs.
For a complete listing of the
gb2312 characters, ordered
according to the gb2312_chinese_ci
collation:
http://d.udm.net/bar/~bar/charts/gb2312_chinese_ci.html.
MySQL's gbk is in reality
「Microsoft code page 936」. This differs
from the official gbk for
characters A1A4 (middle dot),
A1AA (em dash),
A6E0-A6F5, and
A8BB-A8C0. For a listing of the
differences, see
http://recode.progiciels-bpi.ca/showfile.html?name=dist/libiconv/gbk.h.
For a listing of gbk/Unicode
mappings, see
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT.
For MySQL's listing of gbk
characters, see
http://d.udm.net/bar/~bar/charts/gbk_chinese_ci.html.
A.12.3: What problems should I be aware of when working with the Big5 Chinese character set?
MySQL supports the Big5 character set which is common in
Hong Kong and Taiwan (Republic of China). MySQL's
big5 is in reality Microsoft code page
950, which is very similar to the original
big5 character set. We changed to this
character set starting with MySQL version 4.1.16 / 5.0.16
(as a result of Bug #12476). For example, the following
statements work in current versions of MySQL, but not in old
versions:
mysql>CREATE TABLE big5 (BIG5 CHAR(1) CHARACTER SET BIG5);Query OK, 0 rows affected (0.13 sec) mysql>INSERT INTO big5 VALUES (0xf9dc);Query OK, 1 row affected (0.00 sec) mysql>SELECT * FROM big5;+------+ | big5 | +------+ | 嫺 | +------+ 1 row in set (0.02 sec)
A.12.4: Why do Japanese character set conversions fail?
MySQL supports the sjis,
ujis, cp932, and
eucjpms character sets, as well as Unicode. A common need is
to convert between character sets. For example, there might
be a Unix server (typically with sjis or
ujis) and a Windows client (typically
with cp932).
In the following conversion table, the
ucs2 column represents the source, and
the sjis, cp932,
ujis, and eucjpms
columns represent the destinations — that is, the last
4 columns provide the hexadecimal result when we use
CONVERT(ucs2) or we assign a
ucs2 column containing the value to an
sjis, cp932,
ujis, or eucjpms
column.
| Character Name | ucs2 | sjis | cp932 | ujis | eucjpms |
|---|---|---|---|---|---|
| BROKEN BAR | 00A6 | 3F | 3F | 8FA2C3 | 3F |
| FULLWIDTH BROKEN BAR | FFE4 | 3F | FA55 | 3F | 8FA2 |
| YEN SIGN | 00A5 | 3F | 3F | 20 | 3F |
| FULLWIDTH YEN SIGN | FFE5 | 818F | 818F | A1EF | 3F |
| TILDE | 007E | 7E | 7E | 7E | 7E |
| OVERLINE | 203E | 3F | 3F | 20 | 3F |
| HORIZONTAL BAR | 2015 | 815C | 815C | A1BD | A1BD |
| EM DASH | 2014 | 3F | 3F | 3F | 3F |
| REVERSE SOLIDUS | 005C | 815F | 5C | 5C | 5C |
| FULLWIDTH "" | FF3C | 3F | 815F | 3F | A1C0 |
| WAVE DASH | 301C | 8160 | 3F | A1C1 | 3F |
| FULLWIDTH TILDE | FF5E | 3F | 8160 | 3F | A1C1 |
| DOUBLE VERTICAL LINE | 2016 | 8161 | 3F | A1C2 | 3F |
| PARALLEL TO | 2225 | 3F | 8161 | 3F | A1C2 |
| MINUS SIGN | 2212 | 817C | 3F | A1DD | 3F |
| FULLWIDTH HYPHEN-MINUS | FF0D | 3F | 817C | 3F | A1DD |
| CENT SIGN | 00A2 | 8191 | 3F | A1F1 | 3F |
| FULLWIDTH CENT SIGN | FFE0 | 3F | 8191 | 3F | A1F1 |
| POUND SIGN | 00A3 | 8192 | 3F | A1F2 | 3F |
| FULLWIDTH POUND SIGN | FFE1 | 3F | 8192 | 3F | A1F2 |
| NOT SIGN | 00AC | 81CA | 3F | A2CC | 3F |
| FULLWIDTH NOT SIGN | FFE2 | 3F | 81CA | 3F | A2CC |
A.12.5:
What should I do if I want to convert SJIS
81CA to cp932?
Our answer is: 「?」. There are serious
complaints about this: many people would prefer a
「loose」 conversion, so that 81CA (NOT
SIGN) in sjis becomes
81CA (FULLWIDTH NOT SIGN) in
cp932. We are considering a change to
this behavior.
A.12.6:
How does MySQL represent the Yen (¥)
sign?
A problem arises because some versions of Japanese character
sets (both sjis and
euc) treat 5C as a
reverse solidus (\
— also known as a backslash), and others treat it as a
yen sign (¥).
MySQL follows only one version of the JIS (Japanese
Industrial Standards) standard description. In MySQL,
5C is always the reverse
solidus (\).
A.12.7:
Do MySQL plan to make a separate character set where
5C is the Yen sign, as at least one other
major DBMS does?
This is one possible solution to the Yen sign issue; however, this will not happen in MySQL 5.1 or 5.2.
A.12.8: Of what issues should I be aware when working with Korean character sets in MySQL?
In theory, while there have been several versions of the
euckr (Extended Unix Code
Korea) character set, only one problem has been
noted.
We use the 「ASCII」 variant of EUC-KR, in which
the code point 0x5c is REVERSE SOLIDUS,
that is \, instead of the
「KS-Roman」 variant of EUC-KR, in which the code
point 0x5c is WON
SIGN(₩). This means that you
cannot convert Unicode U+20A9 to
euckr:
mysql>SELECT->CONVERT('₩' USING euckr) AS euckr,->HEX(CONVERT('₩' USING euckr)) AS hexeuckr;+-------+----------+ | euckr | hexeuckr | +-------+----------+ | ? | 3F | +-------+----------+ 1 row in set (0.00 sec)
A.12.9: Why do I get Data truncated error messages?
For illustration, we'll create a table with one Unicode
(ucs2) column and one Chinese
(gb2312) column.
mysql>CREATE TABLE ch->(ucs2 CHAR(3) CHARACTER SET ucs2,->gb2312 CHAR(3) CHARACTER SET gb2312);Query OK, 0 rows affected (0.05 sec)
A.12.10: Why does my GUI front end or browser not display CJK characters correctly in my application using Access, PHP, or another API?
Obtain a direct connection to the server using the
mysql client (Windows:
mysql.exe), and try the same query there.
If mysql responds correctly, then the
trouble may be that your application interface requires
initialization. Use mysql to tell you
what character set or sets it uses with the statement
SHOW VARIABLES LIKE 'char%';. If you are
using Access, then you are most likely connecting with
MyODBC. In this case, you should check
「Connector/ODBC の構成」. If, for instance,
you use big5, you would enter
SET NAMES 'big5'. (Note that no
; is required in this case). If you are
using ASP, you might need to add SET
NAMES in the code. Here is an example that has
worked in the past:
<%
Session.CodePage=0
Dim strConnection
Dim Conn
strConnection="driver={MySQL ODBC 3.51 Driver};server=server;uid=username;" \
& "pwd=password;database=database;stmt=SET NAMES 'big5';"
Set Conn = Server.CreateObject("ADODB.Connection")
Conn.Open strConnection
%>
If you are using PHP, try this:
<?php
$link = mysql_connect($host, $usr, $pwd);
mysql_select_db($db);
if( mysql_error() ) { print "Database ERROR: " . mysql_error(); }
mysql_query("SET NAMES 'utf8'", $link);
?>
We encourage the use of the newer mysqli
extension, rather than mysql. Using
mysqli, the previous example could be
rewritten as shown here:
<?php
$link = new mysqli($host, $usr, $pwd, $db);
if( mysqli_connect_errno() )
{
printf("Connect failed: %s\n", mysqli_connect_error());
exit();
}
$link->query("SET NAMES 'utf8'");
?>
If you are using Connector/J, see 「文字セットと Unicode の使用」.
A.12.11: I've upgraded to MySQL 5.1. How can I revert to behavior like that in MySQL 4.0 with regard to character sets?
In MySQL Version 4.0, there was a single 「global」 character set for both server and client, and the decision as to which character to use was made by the server administrator. This changed starting with MySQL Version 4.1. What happens now is a 「handshake」, as described in 「接続のキャラクタセットおよび照合順序」:
When a client connects, it sends to the server the name of the character set that it wants to use. The server uses the name to set the
character_set_client,character_set_results, andcharacter_set_connectionsystem variables. In effect, the server performs aSET NAMESoperation using the character set name.
By way of example, suppose that your favorite server
character set is latin1 (unlikely in a
CJK area, but this is the default value). Suppose further
that the client uses utf8 because this is
what the client's operating system supports. Now, start the
server with latin1 as its default
character set:
mysqld --character-set-server=latin1
A.12.12:
Why do some LIKE and
FULLTEXT searches with CJK characters
fail?
There is a very simple problem with LIKE
searches on BINARY and
BLOB columns: we need to know the end of
a character. With multi-byte character sets, different
characters might have different octet lengths. For example,
in utf8, A requires
one byte but ペ requires three bytes, as
shown here:
+-------------------------+---------------------------+ | OCTET_LENGTH(_utf8 'A') | OCTET_LENGTH(_utf8 'ペ') | +-------------------------+---------------------------+ | 1 | 3 | +-------------------------+---------------------------+ 1 row in set (0.00 sec)
This is one reason why MySQL cannot allow encodings of nonexistent characters. If it is not strict about rejecting bad input, then it has no way of knowing where characters end.
For FULLTEXT searches, we need to know
where words begin and end. With Western languages, this is
rarely a problem because most (if not all) of these use an
easy-to-identify word boundary — the space character.
However, this is not usually the case with Asian writing. We
could use arbitrary halfway measures, like assuming that all
Han characters represent words, or (for Japanese) depending
on changes from Katakana to Hiragana due to grammatical
endings. However, the only sure solution requires a
comprehensive word list, which means that we would have to
include a dictionary in the server for each Asian language
supported. This is simply not feasible.
A.12.13: What CJK character sets are available in MySQL?
The list of CJK character sets may vary depending on your
MySQL version. For example, the eucjpms
character set was not supported prior to MySQL 5.0.3.
However, since the name of the applicable language appears
in the DESCRIPTION column for every entry
in the INFORMATION_SCHEMA.CHARACTER_SETS
table, you can obtain a current list of all the non-Unicode
CJK character sets using this query:
mysql>SELECT CHARACTER_SET_NAME, DESCRIPTION->FROM INFORMATION_SCHEMA.CHARACTER_SETS->WHERE DESCRIPTION LIKE '%Chinese%'->OR DESCRIPTION LIKE '%Japanese%'->OR DESCRIPTION LIKE '%Korean%'->ORDER BY CHARACTER_SET_NAME;+--------------------+---------------------------+ | CHARACTER_SET_NAME | DESCRIPTION | +--------------------+---------------------------+ | big5 | Big5 Traditional Chinese | | cp932 | SJIS for Windows Japanese | | eucjpms | UJIS for Windows Japanese | | euckr | EUC-KR Korean | | gb2312 | GB2312 Simplified Chinese | | gbk | GBK Simplified Chinese | | sjis | Shift-JIS Japanese | | ujis | EUC-JP Japanese | +--------------------+---------------------------+ 8 rows in set (0.01 sec)
A.12.14:
How do I know whether character X
is available in all character sets?
The majority of simplified Chinese and basic non-halfwidth
Japanese Kana characters
appear in all CJK character sets. This stored procedure
accepts a UCS-2 Unicode character,
converts it to all other character sets, and displays the
results in hexadecimal.
DELIMITER //
CREATE PROCEDURE p_convert(ucs2_char CHAR(1) CHARACTER SET ucs2)
BEGIN
CREATE TABLE tj
(ucs2 CHAR(1) character set ucs2,
utf8 CHAR(1) character set utf8,
big5 CHAR(1) character set big5,
cp932 CHAR(1) character set cp932,
eucjpms CHAR(1) character set eucjpms,
euckr CHAR(1) character set euckr,
gb2312 CHAR(1) character set gb2312,
gbk CHAR(1) character set gbk,
sjis CHAR(1) character set sjis,
ujis CHAR(1) character set ujis);
INSERT INTO tj (ucs2) VALUES (ucs2_char);
UPDATE tj SET utf8=ucs2,
big5=ucs2,
cp932=ucs2,
eucjpms=ucs2,
euckr=ucs2,
gb2312=ucs2,
gbk=ucs2,
sjis=ucs2,
ujis=ucs2;
/* If there's a conversion problem, UPDATE will produce a warning. */
SELECT hex(ucs2) AS ucs2,
hex(utf8) AS utf8,
hex(big5) AS big5,
hex(cp932) AS cp932,
hex(eucjpms) AS eucjpms,
hex(euckr) AS euckr,
hex(gb2312) AS gb2312,
hex(gbk) AS gbk,
hex(sjis) AS sjis,
hex(ujis) AS ujis
FROM tj;
DROP TABLE tj;
END//
A.12.15: Why don't CJK strings sort correctly in Unicode? (I)
Sometimes people observe that the result of a
utf8_unicode_ci or
ucs2_unicode_ci search, or of an
ORDER BY sort is not what they think a
native would expect. Although we never rule out the
possibility that there is a bug, we have found in the past
that many people do not read correctly the standard table of
weights for the Unicode Collation Algorithm. MySQL uses the
table found at
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
This is not the first table you will find by navigating from
the unicode.org home page, because MySQL
uses the older 4.0.0 「allkeys」 table, rather
than the more recent 4.1.0 table. This is because we are
very wary about changing ordering which affects indexes,
lest we bring about situations such as that reported in Bug
#16526, illustrated as follows:
mysql<CREATE TABLE tj (s1 CHAR(1) CHARACTER SET utf8 COLLATE utf8_unicode_ci);Query OK, 0 rows affected (0.05 sec) mysql>INSERT INTO tj VALUES ('が'),('か');Query OK, 2 rows affected (0.00 sec) Records: 2 Duplicates: 0 Warnings: 0 mysql>SELECT * FROM tj WHERE s1 = 'か';+------+ | s1 | +------+ | が | | か | +------+ 2 rows in set (0.00 sec)
A.12.16: Why don't CJK strings sort correctly in Unicode? (II)
If you are using Unicode (ucs2 or
utf8), and you know what the Unicode sort
order is (see 「MySQL 5.1 FAQ — MySQL Chinese, Japanese, and Korean
Character Sets」), but MySQL still
seems to sort your table incorrectly, then you should first
verify the table character set:
mysql> SHOW CREATE TABLE t\G
******************** 1. row ******************
Table: t
Create Table: CREATE TABLE `t` (
`s1` char(1) CHARACTER SET ucs2 DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
You can see that the collation is
ucs2_general_ci instead of
ucs2_unicode_ci. The reason why this is
so can be found using SHOW CHARSET, as
shown here:
mysql> SHOW CHARSET LIKE 'ucs2%';
+---------+---------------+-------------------+--------+
| Charset | Description | Default collation | Maxlen |
+---------+---------------+-------------------+--------+
| ucs2 | UCS-2 Unicode | ucs2_general_ci | 2 |
+---------+---------------+-------------------+--------+
1 row in set (0.00 sec)
A.12.17: Why are my supplementary characters rejected by MySQL?
MySQL does not support supplementary characters — that
is, characters which need more than 3 bytes — for
UTF-8. We support only what Unicode calls
the Basic Multilingual Plane / Plane 0.
Only a few very rare Han characters are supplementary;
support for them is uncommon. This has led to reports such
as that found in Bug #12600, which we rejected as 「not
a bug」. With utf8, we must
truncate an input string when we encounter bytes that we
don't understand. Otherwise, we wouldn't know how long the
bad multi-byte character is.
One possible workaround is to use ucs2
instead of utf8, in which case the
「bad」 characters are changed to question marks;
however, no truncation takes place. You can also change the
data type to BLOB or
BINARY, which perform no validity
checking.
We intend at some point in the future to add support for
UTF-16, which would solve such issues by
allowing 4-byte characters. However, we have as yet set no
definite timetable for doing so.
A.12.18: Shouldn't it be 「CJKV」?
No. The term 「CJKV」 (Chinese Japanese Korean Vietnamese) refers to Vietnamese character sets which contain Han (originally Chinese) characters. MySQL has no plan to support the old Vietnamese script using Han characters. MySQL does of course support the modern Vietnamese script with Western characters.
Bug #4745 is a request for a specialized Vietnamese collation, which we might add in the future if there is sufficient demand for it.
A.12.19: Does MySQL allow CJK characters to be used in database and table names?
This issue is fixed in MySQL 5.1, by automatically rewriting the names of the corresponding directories and files.
For example, if you create a database named
楮 on a server whose operating system
does not support CJK in directory names, MySQL creates a
directory named @0w@00a5@00ae. which is
just a fancy way of encoding E6A5AE
— that is, the Unicode hexadecimal representation for
the 楮 character. However, if you run a
SHOW DATABASES statement, you can see
that the database is listed as 楮.
A.12.20: Where can I find translations of the MySQL Manual into Chinese, Japanese, and Korean?
A Simplified Chinese version of the Manual, current for MySQL 5.1.12, can be found at http://dev.mysql.com/doc/#chinese-5.1. The Japanese translation of the MySQL 4.1 manual can be downloaded from http://dev.mysql.com/doc/#japanese-4.1.
A.12.21: Where can I get help with CJK and related issues in MySQL?
The following resources are available:
A listing of MySQL user groups can be found at http://dev.mysql.com/user-groups/.
You can contact a sales engineer at the MySQL KK Japan office using any of the following:
Tel: +81(0)3-5326-3133 Fax: +81(0)3-5326-3001 Email: dsaito@mysql.com
View feature requests relating to character set issues at http://tinyurl.com/y6xcuf.
Visit the MySQL Character Sets, Collation, Unicode Forum. We are also in the process of adding foreign-language forums at http://forums.mysql.com/.
