WL#1213: Implement 4-byte UTF8, UTF16 and UTF32
Affects: Server-5.5 — Status: Complete — Priority: Low
Pushed to 6.0.4 on Nov 27/2007. Handle the rare characters which are outside the Unicode Basic Multilingual Plane (BMP). We will need to extend what we can do with the existing ucs2 and utf8 character sets, or add new Unicode character sets. We'll need up to four bytes per character. There will be UCS4 and full UTF8 support. There will be UTF16 support. WL#1213 is a step forward to have full support of Japanese encodings. WL#1213 adds support for new characters, but does not provide proper collations. In order to have collations, the following task should be complete: WL#2673; fixed filesort.
MySQL supports only those Unicode characters in the Basic Multilingual Plane (BMP) -- the ones which we can store in 2 bytes with ucs2, or up to 3 bytes with utf8. Unicode 3.1 (2001) and subsequent Unicode versions have 45,000 - 50,000 characters outside the BMP. With a UTF-16 character set, these supplementary characters require four bytes to store with "surrogate pairs" (a high 16-bit word and a low 16-bit word). With a UTF-32 (also known as UCS-4) character set, all characters require 4 bytes to store. With a UTF-8 extended character set, which we intend to call utf8mb4, these supplementary characters require four bytes to store. Additionally there is CESU-8, which is a UTF-8 variant that takes 6 bytes to store a supplementary character: 3 bytes for the equivalent of the high part of a UTF-16 surrogate pair, 3 bytes for the equivalent of the low part of a UTF-16 surrogate pair. Until now, we have regularly said that we intend to support supplementary characters someday, and CESU-8 never. This task is necessary before MySQL can support (Japanese) JIS 2004, or (Chinese) GB18030, which contain characters that can't be converted to Unicode BMP characters. This task does not include any changes to handling of combining characters, that's WL#3770. Validity checking ----------------- There will still be illegal sequences. We should check for them. If you allow anything at all, then how will you know how long a character is? You'd have to guess the length, and as soon as you make a wrong guess, everything that follows the bad character will be interpreted as garbage. (WL#3780 proposes to find and replace all 4-byte characters, but it would be a temporary workaround.) The Unicode Consortium publishes the sequences, e.g. in http://www.unicode.org/charts/PDF/Unicode-5.0/. We have decided what Unicode version to support for this purpose: "Unicode 5.0" http://www.unicode.org/versions/Unicode5.0.0/ Formally, for this task, it is sufficient to say that MySQL must support at least the version-3.1 supplementary characters. But there was no objection to saying "Unicode 5.0". Allow all characters which are acceptable in Unicode 5.0, including the so-called "non-characters". Do validity checking on input. Validity checks are: in utf16 there must not be a top surrogate without a bottom surrogate (or a bottom surrogate without a top surrogate), in any encoding the code point values must not be greater than 0x10ffff. Since the MySQL Reference Manual says we follow RFC 3629 (http://www.rfc-editor.org/rfc/rfc3629.txt), the only valid characters are Unicode characters, and values greater than 0x10ffff will never be possible. There is another point here. Actually, we can parse the query several times. For example, stored-program definition is parsed each time it is loaded in the cache. We should perform the validity checking only once -- when the user submitted the query. utf32 ----- Character set name: utf32. Corresponds to: UCS-4, UTF-32. Every character requires 32 bits. This is the simplest new character set, but the least useful. Few people want to use 4 bytes to store one character. The original proposal said "character set name: ucs4", but utf32 is the standard name. There will be no character set named ucs4. utf16 ----- Character set name: utf16. Corresponds to: UTF-16. Most characters require 16 bits. A BMP character looks exactly the same in utf16 as in ucs2. A non-BMP character will require a "surrogate pair", two 16-bit values in sequence. For general description of surrogates, see: http://unicode.org/versions/Unicode4.0.0/ch15.pdf Though utf16 was not part of the original task, it seems to fit here. Originally there was thought about splitting 'utf16' into a separate task, but that won't happen. Byte order in utf32 and utf16 ----------------------------- Unicode allows for several variations of UTF-16 and UTF-32: UTF-16 16-bit, big-endian unless string starts with BOM UTF-16BE 16-bit, big-endian UTF-16LE 16-bit, little-endian UTF-32 32-bit, big-endian unless string starts with BOM UTF-32BE 32-bit, big-endian UTF-32LE 32-bit, little-endian Big-endian means the high byte comes first. BOM means Byte Order Mark, U+FEFF. Applications can figure out whether a UTF-16 string is big-endian or little-endian by checking whether U+FEFF looks like 0xFE 0xFF or 0xFF 0xFE. A quick description is "Byte order encoding schemes" in a Wikipedia article, http://en.wikipedia.org/wiki/UTF-16 Advantages of little-endian: * ">" and "<" comparisons of utf32 strings can be more efficient when the processor favours little-endian, as Intel does * SQL Server compatibility is enhanced because SQL Server, like most Microsoft products, favours little-endian Advantages of big-endian: * MySQL's old "ucs2" character set is always big-endian (please correct me if I'm wrong about the word "always") * Unicode says that big-endian should be the default when there is no byte order mark, and Unicode recommends "Do not tag every string in a database or set of fields with a BOM". Approved Decisions: * MySQL ignores the byte order mark (BOM) or, more precisely, treats U+FEFF as ZERO WIDTH NON-BREAKING SPACE. * MySQL utf32 and utf16 are big-endian on all platforms. * Although MySQL's utf16 + utf32 correspond to Unicode's UTF-16BE and UTF-32BE, MySQL will not use the names utf16be or utf32be, utf16 and utf32 are SQL standard names. utf8? ----- Character set name: utf8 or utf8_4 or utf8mb4. Corresponds to: UTF-8. Three alternatives have been suggested: 1. Extend the old character set (utf8). Say that it's still the old utf8 character set but it's possible to use 4 bytes. Nobody else has seen a need for two different character sets so utf8_4 or utf8mb4 would be perceived as a defect or "gotcha" in MySQL. 2. Add a new character set named utf8mb4 (utf8 maximum bytes 4) or utf8_4 (utf8 4-byte). It's necessary to know how many bytes to reserve for CHAR(1) columns with storage_engine=MyISAM, and for sort strings. The decision long ago was: 3 bytes. (The name "uf8mb4" is derived from suggestions in the email thread "Valueclick UTF8".) (The name "utf8_4" is derived from suggestions in the email thread "Re: BUG#12600 and #9337".) 3. Rename the old character set (utf8) to something else, for example utf8mb3 or utf8_3. Make a new 4-byte character set and call the new one utf8. The meaning of the name "utf8" could be controlled by a variable, for example "old_utf8=yes". But we decided not to do this, it would cause sorrow, like sql_mode. This task can't start until this utf8 matter is decided. The choice will be "Alternative 2" and the new character set will be named utf8mb4. [ The above sentence is due to a decision change in February 2010. ] utf8mb3 alias for utf8 ---------------------- In a future version it's possible that utf8 will become the 4-byte utf8, and users who want to indicate 3-byte utf8 will have to say utf8mb3. To avoid some future problems which might occur with replication when master and slave have different versions, we will begin now to allow users to say 'utf8mb3' in the CHARACTER SET clause, and 'utf8mb3_[collation_substring]' in COLLATE clauses, where [collation_substring] is 'bin', 'czech_ci', 'danish_ci', 'esperanto_ci', 'estonian_ci', etc. For example: CREATE TABLE t (s1 CHAR(1) CHARACTER SET utf8mb3; SELECT * FROM t WHERE s1 COLLATE utf8mb3_general_ci = 'x'; DECLARE x VARCHAR(5) CHARACTER SET utf8mb3 COLLATE utf8mb3_danish_ci; SELECT CAST('a' AS CHAR CHARACTER SET utf8) COLLATE utf8_czech_ci; MySQL immediately converts the 'utf8mb3' in an alias to 'utf8', so in SHOW CREATE TABLE or SELECT CHARACTER_SET_NAME FROM INFORMATION_SCHEMA.COLUMNS/etc. or SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS/etc. users will see the true name, "utf8" or "utf8_[collation_substring]". The "utf8mb3" alias is valid only in CHARACTER SET clauses, and in certain other places. For example these are legal: mysqld --character-set-server=utf8mb3 SET NAMES utf8mb3; /* and other SET statements which have similar effect */ SELECT _utf8mb3 'a'; No CESU-8, No Java Modified UTF-8 --------------------------------- Character set name: cesu8. Corresponds to: CESU-8. MySQL will not support CESU-8 or Java Modified UTF-8. Essentially the difference between UTF-8 and CESU-8 lies in the starting point for supplementary characters. In UTF-8 one encodes the code points, in CESU-8 one encodes the surrogate-pair values that exist for UTF16. See Unicode Technical Report #26 "Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)" http://unicode.org/reports/tr26/ Sun's "Modified UTF-8" is apparently the same as CESU-8, along with a special two-byte coding for U+0000: c0 80. See "Modified UTF-8" http://java.sun.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8 Since c0 80 is longer than 00, it is a "nonshortest code value" and therefore illegal according to unicode.org "Corrigendum #1: UTF-8 Shortest Form": http://unicode.org/versions/corrigendum1.html "Inheriting" repertoires and collations --------------------------------------- The new character sets have the same collations as the old (ucs2 and utf8) character sets, the only changed things are the names and the code ranges. Until WL#2673 "Unicode Collation Algorithm new version" is complete, supplementary characters will all have the same value for collating weights. For "general" collations they'll be treated uas U+FFFD REPLACEMENT CHARACTER. For "uca" collations their value for weighting will be 0xfffd. Thus supplementary characters will sort higher than BMP characters, and equal to other supplementary characters. For example, the utf16 character set will have utf16_general_ci, utf16_bin, utf16_unicode_ci, utf16_romanian_ci, etc. And its default collation will be utf16_general_ci because ucs2's default collation is ucs2_general_ci. For example, for valid BMP characters, the repertoire of ucs2 and utf16 will be the same. The utf16_bin flaw ------------------ The order of characters in utf16_bin (binary collation for utf16) will differ from the order in utf8_bin. For example: Here is a chart showing two rare characters. The first character is in the range E000-FFFF, so it is greater than a surrogate but less than a supplementary. The second character is a supplementary. Code point Character utf16 utf8 ---------- --------- ----- ---- 0FF9D HALFWIDTH KATAKANA LETTER N FF 9D EF BE 9D 10384 UGARITIC LETTER DELTA D8 00 DF 84 F0 90 8E 84 The two characters in the chart are in order by code point value, and they are in order by utf8 value, but they are not in order by utf16 value, because 0xff > 0xd8. Tentative approved decision: MySQL accepts this flaw. [ Note added 2010-11-26 ] The order of characters will be according to code point, not byte by byte. See also BUG#55980. Functionality ------------- Functionality that is available for ucs2 / utf8 will be available for the new character sets. Limitations that apply to ucs2 / utf8 will apply for the new character sets. For example: Since ucs2 is not a legal client character set, utf32 and utf16 are not legal client character sets, until completion of WL#897 "Accept SQL statements written with UCS-2". Although utf32 and utf16 are supersets of ucs2, and utf8_4 (if it exists) is a superset of utf8, there is no compatibility for comparisons. For example, WHERE _ucs2_column_value = _ucs4_column_value will cause: ERROR 1267 (HY000): Illegal mix of collations (ucs2_general_ci,IMPLICIT) and (ucs4_general_ci,IMPLICIT) for operation '=' The implementor will decide whether to keep or remove this test. (See next section.) Functionality: Decision of the implementor ------------------------------------------ The decision of the implementor is: utf8mb4 should be a superset for 3-byte utf8. When a character set is a superset of another character set, MySQL already has rules for conversion (for example one can concatenate latin1 with utf8 and get utf8). We will apply the same rules when mixing utf8mb4 and utf8. Examples: SELECT CONCAT(utf8mb4_column, utf8_column) FROM t1; Instead of an "Illegal mix of collations" error, this should return a utf8mb4 result, with collation of utf8mb4_column. SELECT * FROM utf8mb4_table, utf8_table WHERE utf8mb4_table.utf8mb4_column = utf8mb3_table.utf8mb3_column; Instead of an "Illegal mix of collations" error, this should work according to the collation of utf8mb4_column. We will not do something similar for similar new character set pairs, for example we will not say that utf16 is a superset of ucs2. Identifiers ----------- Supplementary characters will be illegal in identifiers because: (1) The character set for all metadata is UTF8, and the supplementary characters are in utf8mb4. (2) Perhaps new difficulty will arise with file names (this statement is unchecked speculation). (3) The new characters may be ignored for collation, for a while. Connectors ---------- People involved with PHP / Java / ODBC / Merlin / etc. were asked to comment on the amount of work that this task would cause them. In summary, the response was "very little". In detail, responses were: Mark Matthews: "Java supports both UTF-16, UTF-16 big endian, UTF-16 little endian, and the same with UTF-32. We won't have to do anything for Connector/J other than add the mapping into the configuration that maps from "java" names to "mysql" names for the character encodings." Georg Richter: "PHP6 uses UTF-16." Reggie Burnett: "MSDN indicates that the following are valid encodings under .NET (utf-7, utf-8, utf-16, utf-32, utf-32BE, unicodeFFEE) so it should be no problem for .NET. Since .NET supports it, I assume that also means that Windows can handle utf-32 so Connector/ODBC on Windows should work." Mike Zinner: "The GUI Tools have full UTF8 support internally via glib. On the front end we are limited to UTF-16/UCS-2 on Windows because afaik Windows does not support UTF-32/UCS-4. I will check and get back to you." [Notice the opinion difference about utf-32 Windows support.] Tim Smith: "MySQL command-line clients should not have serious problems adapting." InnoDB ------ Marko Mäkelä wrote, in a comment for BUG#9337: "... InnoDB assumes in many places that UTF-8 sequences are at most 3 bytes. I am afraid that making MySQL accept wider than 3-byte UTF-8 characters will require an incompatible modification to the InnoDB data file format, at least in row_format=compact." Marko Mäkelä wrote, in a comment for BUG#12600: I strongly recommend that new charset-collation codes are introduced in WL#2829 for the encodings of the full Unicode. This would avoid incompatible changes to data files. If WL#2829 does not already mention this, please add it to the specification. Also, please notify email@example.com when the implementation starts, as we will need to adjust some things inside InnoDB. [WL#2829 was cancelled and its description is now part of WL#1213.] [ Another Company ] ------------------- A section here about another company has been edited. To see the section, click the "Progress" bar and look for the changes on 2009-11-18. The section mentions BUG#12600, BUG#9337, and JVER tests. Other DBMSs ----------- DB2 supports supplementary characters, and supports CESU-8. SQL Server 2005 does not support supplementary characters, its NCHAR and NVARCHAR data types are always two bytes per character and are defined as "character data types that ... use the UNICODE UCS-2 character set". There are a few surrogate-aware functions (len_s, left_s, right_s, sub_s, replace_s). A slide presentation by a Microsoft employee unicode.org/iuc/iuc18/papers/a8.ppt says SQL Server is currently "surrogate safe" and in future will be "surrogate aware". Oracle10g supports supplementary characters with AL16UTF16, UTF8, and AL32UTF8 character sets. Oracle documentation makes it clear that Oracle "AL16UTF16" is really UTF-16: "Oracle's AL16UTF16 character set supports surrogate pairs." http://download-east.oracle.com/docs/cd/A91202_01/901_doc/server.901/a90236/appb.htm So MySQL's UTF16 will be the same as Oracle's AL16UTF16. Oracle documentation makes it clear that Oracle "UTF8" is really CESU-8, Oracle "AL32UTF8" is really UTF-8: "UTF8: Unicode 3.0 UTF-8 Universal character set, CESU-8 compliant AL32UTF8: Unicode 3.1 UTF-8 Universal character set" http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10759/functions023.htm "Because UTF8 can require up to three bytes for each character, one supplementary character is represented in two code points, totalling six bytes. In AL32UTF8, one supplementary character is represented in one code point, totalling four bytes." http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96529/ch6.htm "When the character set is determined to be UTF-8, the Oracle character set UTF8 is returned by default unless 4-byte characters (supplementary characters) are detected within the text. If 4-byte characters are detected, then the character set is reported as AL32UTF8." http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14225/applocaledata.htm#NLSPG014 So MySQL's UTF8MB4 will not be the same as Oracle's UTF8. It will be the same as Oracle's AL32UTF8. SQL standard ------------ For better compliance with SQL standard requirements: 1. Support the three names that the standard has for "Universal Character Sets": UTF8, UTF16, UTF32. So the name UTF32 should be preferred instead of UCS4. 2. All the Universal Character Sets must have the same repertoire. That's only possible with a 4-byte UTF8. (That is, it's only possible with what we call UTF8MB4.) 3. All the Universal Character Sets must be mutually comparable. So "WHERE utf32_column = utf16_column" shouldn't always cause an error. That's impossible unless utf32 and utf16 can have the same collation. (The standard suggests elsewhere that one collation can apply to two character sets provided they have the same repertoire, or that a collation that's applicable for a superset can be applicable for a subset.) MySQL already allows and prefers a superset collation for expressions like "utf8_column = latin1_column" but for "utf8_column = ucs2_column" there's an error. Bar says: "It [for example comparison of a utf32_unicode_ci column and a utf16_unicode_ci] should be not so hard to implement." 4. Support a more recent version of the SQL standard. The April 2006 draft of the SQL:200x standard no longer mentions Unicode 3.0, it mentions "Unicode 4.0.1". (Elsewhere it mentions Unicode 3.1 but that is possibly an error in the draft.) Of course the real target for MySQL should be 5.0. 5. The name of the default collation should be UCS_BASIC (for code point order) or UNICODE (for UCA order). MySQL will be close enough by saying that the collation is xxx_unicode_xxx. But xxx_general_xxx (the current default) is not standard. Upgrade ------- Since some changes to existing behaviour are contemplated, there are upgrade considerations. 1. If UTF8 character set name changes, or if default name changes for identifiers. (Due to a February 2010 decision, this will not happen.) 2. If xxx_general_ci ceases to be the default collation (because UCS_BASIC or UNICODE or xxx_unicode_ci is chosen as the default). Bar's comments about this consideration: > To make upgrade easier, we should probably start > always print character set and collation in "SHOW CREATE TABLE", > even if character set and collation for a column are the default ones - > currently they are not displayed: > > create table t1 (a char(1), b char(1) collate latin1_bin); > show create table t1; > > CREATE TABLE `t1` ( > `a` char(1) default NULL, -- no character set and collate clauses > `b` char(1) character set latin1 collate latin1_bin default NULL > ) ENGINE=MyISAM DEFAULT CHARSET=latin1 -- no collate clause > > > > Another way to simplify upgrade could be allowing users to reassign > the default collation for a character set. > Something like: > > ALTER CHARACTER SET utf8 DEFAULT COLLATION utf8_unicode_ci; 3. If a length is already 2/3 of the maximum length. For example, these statements currently cause no errors or warnings: CREATE TABLE t (s1 VARCHAR(16000) CHARACTER SET UTF8) engine=myisam; CREATE INDEX i ON t (s1(333)); but with 4-byte UTF8, that is, if one says CHARACTER SET UTF8MB4 rather than CHARACTER SET UTF8, the CREATE TABLE will cause a warning "Converting column 's1' from VARCHAR to TEXT" and the CREATE INDEX will cause a warning "Specified key was too long; max key length is 999 bytes". See BUG#4541 for a taste of possible reactions. (I am talking here about the case, where the user manually creates a new table and on the 3-bytes-utf8-version the statement works, but on the 4-bytes-utf8-version it does not.) References ---------- "Unicode implementation in DB2 Universal Database" http://publib.boulder.ibm.com/infocenter/db2help/index.jsp?topic=/com.ibm.db2.udb.doc/admin/c0004821.htm "Oracle Globalization" http://www.oracle.com/technology/tech/globalization/pdf/TWP_AppDev_Unicode_10gR2.pdf "SQL Server 2005 Books Online" http://msdn2.microsoft.com/en-us/library/ms243717 A handy online converter from code points to UTF-16 and UTF-8 values: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=ff9d&mode=hex Related Bugs/Requests --------------------- "BUG#9337 misleading warning when inserting invalid utf8 data" http://bugs.mysql.com/bug.php?id=9337 "BUG#12600 Inserting into LONGTEXT gives strange error message" http://bugs.mysql.com/bug.php?id=12600 "BUG#14052 No UTF-16 charset encoding available (surrogate pairs)" http://bugs.mysql.com/bug.php?id=14052 "BUG#25666 UTF-8 support beyong the BMP" http://bugs.mysql.com/bug.php?id=25666 "BUG#38885 a japanese character make sql bad" http://bugs.mysql.com/bug.php?id=38885 "Valueclick UTF8" email thread [ mysql intranet ] /secure/mailarchive/mail.php?folder=4&mail=8297 "UTF-16 or UTF-32 support?" http://lists.mysql.com/mysql/192885 "supplementary unicode characters" http://forums.mysql.com/read.php?39,130718,130718 "Surrogate pairs? Non-BMP characters?" http://forums.mysql.com/read.php?103,274048,274048#msg-274048
Design details are included in the HLS.
Copyright (c) 2000, 2016, Oracle Corporation and/or its affiliates. All rights reserved.