WL#1213: Implement 4-byte UTF8, UTF16 and UTF32

Affects: Server-5.5 — Status: Complete

Description
High Level Architecture
Low Level Design

Pushed to 6.0.4 on Nov 27/2007.

Handle the rare characters which are outside the 
Unicode Basic Multilingual Plane (BMP). We will 
need to extend what we can do with the existing 
ucs2 and utf8 character sets, or add new Unicode 
character sets. We'll need up to four bytes per 
character. 

There will be UCS4 and full UTF8 support.
There will be UTF16 support.

WL#1213 is a step forward to have full support of Japanese encodings.
WL#1213 adds support for new characters, but does not provide proper
collations. In order to have collations, the following task should be
complete: WL#2673; fixed filesort.

MySQL supports only those Unicode characters in the
Basic Multilingual Plane (BMP) -- the ones which we
can store in 2 bytes with ucs2, or up to 3 bytes with
utf8. Unicode 3.1 (2001) and subsequent Unicode
versions have 45,000 - 50,000 characters outside
the BMP.

With a UTF-16 character set, these supplementary
characters require four bytes to store with
"surrogate pairs" (a high 16-bit word and a low
16-bit word).

With a UTF-32 (also known as UCS-4) character set,
all characters require 4 bytes to store.

With a UTF-8 extended character set,
which we intend to call utf8mb4, these supplementary
characters require four bytes to store.

Additionally there is CESU-8, which is a UTF-8
variant that takes 6 bytes to store a supplementary
character: 3 bytes for the equivalent of the high
part of a UTF-16 surrogate pair, 3 bytes for the
equivalent of the low part of a UTF-16 surrogate
pair.

Until now, we have regularly said that we intend to
support supplementary characters someday, and CESU-8
never.

This task is necessary before MySQL can support
(Japanese) JIS 2004, or (Chinese) GB18030, which
contain characters that can't be converted to
Unicode BMP characters.

This task does not include any changes to handling
of combining characters, that's WL#3770.

Validity checking
-----------------

There will still be illegal sequences. We should
check for them.

If you allow anything at all, then how will you know how
long a character is? You'd have to guess the length, and
as soon as you make a wrong guess, everything that follows
the bad character will be interpreted as garbage. (WL#3780
proposes to find and replace all 4-byte characters, but it
would be a temporary workaround.)

The Unicode Consortium publishes the sequences, e.g. in
http://www.unicode.org/charts/PDF/Unicode-5.0/.

We have decided what Unicode version to support for
this purpose: "Unicode 5.0"
http://www.unicode.org/versions/Unicode5.0.0/
Formally, for this task, it is sufficient to say that
MySQL must support at least the version-3.1 supplementary
characters. But there was no objection to saying "Unicode 5.0".

Allow all characters which are acceptable in Unicode
5.0, including the so-called "non-characters". Do
validity checking on input.
Validity checks are: in utf16 there must not be a top
surrogate without a bottom surrogate (or a bottom
surrogate without a top surrogate), in any encoding
the code point values must not be greater than 0x10ffff.

Since the MySQL Reference Manual says we follow RFC 3629
(http://www.rfc-editor.org/rfc/rfc3629.txt), the only
valid characters are Unicode characters, and values
greater than 0x10ffff will never be possible.

There is another point here. Actually, we can parse the query several
times. For example, stored-program definition is parsed each time it is
loaded in the cache. We should perform the validity checking only once --
when the user submitted the query.

utf32
-----

Character set name: utf32. Corresponds to: UCS-4, UTF-32.

Every character requires 32 bits. This is the simplest
new character set, but the least useful. Few people
want to use 4 bytes to store one character.

The original proposal said "character set name: ucs4",
but utf32 is the standard name. There will be no character
set named ucs4.

utf16
-----

Character set name: utf16. Corresponds to: UTF-16.

Most characters require 16 bits. A BMP character looks
exactly the same in utf16 as in ucs2. A non-BMP character
will require a "surrogate pair", two 16-bit values in
sequence. For general description of surrogates, see:
http://unicode.org/versions/Unicode4.0.0/ch15.pdf

Though utf16 was not part of the original task, it seems
to fit here. Originally there was thought about splitting
'utf16' into a separate task, but that won't happen.

Byte order in utf32 and utf16
-----------------------------

Unicode allows for several variations of UTF-16 and UTF-32:
UTF-16 16-bit, big-endian unless string starts with BOM
UTF-16BE 16-bit, big-endian
UTF-16LE 16-bit, little-endian
UTF-32 32-bit, big-endian unless string starts with BOM
UTF-32BE 32-bit, big-endian
UTF-32LE 32-bit, little-endian

Big-endian means the high byte comes first.
BOM means Byte Order Mark, U+FEFF. Applications can
figure out whether a UTF-16 string is big-endian or
little-endian by checking whether U+FEFF looks like
0xFE 0xFF or 0xFF 0xFE.
A quick description is "Byte order encoding schemes" in a
Wikipedia article, http://en.wikipedia.org/wiki/UTF-16

Advantages of little-endian:
* ">" and "<" comparisons of utf32 strings can be more efficient
when the processor favours little-endian, as Intel does
* SQL Server compatibility is enhanced because SQL Server,
like most Microsoft products, favours little-endian

Advantages of big-endian:
* MySQL's old "ucs2" character set is always big-endian
(please correct me if I'm wrong about the word "always")
* Unicode says that big-endian should be the default when
there is no byte order mark, and Unicode recommends "Do
not tag every string in a database or set of fields with a BOM".

Approved Decisions:
* MySQL ignores the byte order mark (BOM) or, more
precisely, treats U+FEFF as ZERO WIDTH NON-BREAKING SPACE.
* MySQL utf32 and utf16 are big-endian on all platforms.
* Although MySQL's utf16 + utf32 correspond to Unicode's
UTF-16BE and UTF-32BE, MySQL will not use the names
utf16be or utf32be, utf16 and utf32 are SQL standard names.

utf8?
-----

Character set name: utf8 or utf8_4 or utf8mb4.
Corresponds to: UTF-8.

Three alternatives have been suggested:

1. Extend the old character set (utf8). Say that it's
still the old utf8 character set but it's possible to
use 4 bytes. Nobody else has seen a need for two
different character sets so utf8_4 or utf8mb4 would
be perceived as a defect or "gotcha" in MySQL.

2. Add a new character set named utf8mb4 (utf8 maximum
bytes 4) or utf8_4 (utf8 4-byte). It's necessary to know how
many bytes to reserve for CHAR(1) columns with
storage_engine=MyISAM, and for sort strings. The
decision long ago was: 3 bytes.
(The name "uf8mb4" is derived from suggestions in
the email thread "Valueclick UTF8".)
(The name "utf8_4" is derived from suggestions in
the email thread "Re: BUG#12600 and #9337".)

3. Rename the old character set (utf8) to something
else, for example utf8mb3 or utf8_3. Make a new
4-byte character set and call the new one utf8.

The meaning of the name "utf8" could be controlled by
a variable, for example "old_utf8=yes".
But we decided not to do this, it would cause sorrow, like sql_mode.

This task can't start until this utf8 matter is decided.

The choice will be "Alternative 2" and the new character
set will be named utf8mb4.

[ The above sentence is due to a decision change in February 2010. ]

utf8mb3 alias for utf8
----------------------

In a future version it's possible that utf8 will become
the 4-byte utf8, and users who want to indicate 3-byte utf8
will have to say utf8mb3. To avoid some future problems which might
occur with replication when master and slave have different
versions, we will begin now to allow users to say 'utf8mb3'
in the CHARACTER SET clause, and 'utf8mb3_[collation_substring]'
in COLLATE clauses, where [collation_substring] is 'bin',
'czech_ci', 'danish_ci', 'esperanto_ci', 'estonian_ci', etc.
For example:

CREATE TABLE t (s1 CHAR(1) CHARACTER SET utf8mb3;
SELECT * FROM t WHERE s1 COLLATE utf8mb3_general_ci = 'x';
DECLARE x VARCHAR(5) CHARACTER SET utf8mb3 COLLATE utf8mb3_danish_ci;
SELECT CAST('a' AS CHAR CHARACTER SET utf8) COLLATE utf8_czech_ci;

MySQL immediately converts the 'utf8mb3' in an alias to 'utf8',
so in SHOW CREATE TABLE or SELECT CHARACTER_SET_NAME
FROM INFORMATION_SCHEMA.COLUMNS/etc. or SELECT COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS/etc. users will see the true
name, "utf8" or "utf8_[collation_substring]".

The "utf8mb3" alias is valid only in CHARACTER SET clauses,
and in certain other places. For example these are legal:
mysqld --character-set-server=utf8mb3
SET NAMES utf8mb3; /* and other SET statements which have similar effect */
SELECT _utf8mb3 'a';

No CESU-8, No Java Modified UTF-8
---------------------------------

Character set name: cesu8. Corresponds to: CESU-8.

MySQL will not support CESU-8 or Java Modified UTF-8.

Essentially the difference between UTF-8 and CESU-8
lies in the starting point for supplementary characters.
In UTF-8 one encodes the code points, in CESU-8 one
encodes the surrogate-pair values that exist for UTF16.
See Unicode Technical Report #26
"Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)"
http://unicode.org/reports/tr26/

Sun's "Modified UTF-8" is apparently the same as CESU-8,
along with a special two-byte coding for U+0000:
c0 80. See "Modified UTF-8"
http://java.sun.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8
Since c0 80 is longer than 00, it is a "nonshortest
code value" and therefore illegal according to
unicode.org "Corrigendum #1: UTF-8 Shortest Form":
http://unicode.org/versions/corrigendum1.html

"Inheriting" repertoires and collations
---------------------------------------

The new character sets have the same collations as
the old (ucs2 and utf8) character sets, the only
changed things are the names and the code ranges.

Until WL#2673 "Unicode Collation Algorithm new version"
is complete, supplementary characters will all have the
same value for collating weights. For "general"
collations they'll be treated uas U+FFFD REPLACEMENT
CHARACTER. For "uca" collations their value for
weighting will be 0xfffd. Thus supplementary characters
will sort higher than BMP characters, and equal to
other supplementary characters.

For example, the utf16 character set will have
utf16_general_ci, utf16_bin, utf16_unicode_ci,
utf16_romanian_ci, etc. And its default collation
will be utf16_general_ci because ucs2's default
collation is ucs2_general_ci.

For example, for valid BMP characters, the repertoire
of ucs2 and utf16 will be the same.

The utf16_bin flaw
------------------

The order of characters in utf16_bin (binary collation for utf16)
will differ from the order in utf8_bin. For example:

Here is a chart showing two rare characters. The first character is
in the range E000-FFFF, so it is greater than a surrogate but
less than a supplementary. The second character is a supplementary.

Code point Character utf16 utf8
---------- --------- ----- ----
0FF9D HALFWIDTH KATAKANA LETTER N FF 9D EF BE 9D
10384 UGARITIC LETTER DELTA D8 00 DF 84 F0 90 8E 84

The two characters in the chart are in order by code point value,
and they are in order by utf8 value, but they are not in order
by utf16 value, because 0xff > 0xd8.

Tentative approved decision: MySQL accepts this flaw.

[ Note added 2010-11-26 ]
The order of characters will be according to code point,
not byte by byte. See also BUG#55980.

Functionality
-------------

Functionality that is available for ucs2 / utf8 will be
available for the new character sets.

Limitations that apply to ucs2 / utf8 will apply
for the new character sets. For example:
Since ucs2 is not a legal client character set,
utf32 and utf16 are not legal client character sets,
until completion of WL#897 "Accept SQL statements
written with UCS-2".

Although utf32 and utf16 are supersets of ucs2, and
utf8_4 (if it exists) is a superset of utf8, there
is no compatibility for comparisons. For example,
WHERE _ucs2_column_value = _ucs4_column_value
will cause:
ERROR 1267 (HY000): Illegal mix of collations
(ucs2_general_ci,IMPLICIT) and (ucs4_general_ci,IMPLICIT)
for operation '='
The implementor will decide whether to keep or remove this test.
(See next section.)

Functionality: Decision of the implementor
------------------------------------------

The decision of the implementor is:
utf8mb4 should be a superset for 3-byte utf8.
When a character set is a superset of another character
set, MySQL already has rules for conversion (for example
one can concatenate latin1 with utf8 and get utf8).
We will apply the same rules when mixing utf8mb4 and utf8.

Examples:

SELECT CONCAT(utf8mb4_column, utf8_column) FROM t1;
Instead of an "Illegal mix of collations" error,
this should return a utf8mb4 result, with collation of
utf8mb4_column.

SELECT * FROM utf8mb4_table, utf8_table
WHERE utf8mb4_table.utf8mb4_column = utf8mb3_table.utf8mb3_column;
Instead of an "Illegal mix of collations" error,
this should work according to the collation of utf8mb4_column.

We will not do something similar for similar new character set
pairs, for example we will not say that utf16 is a superset of ucs2.

Identifiers
-----------

Supplementary characters will be illegal in identifiers because:
(1) The character set for all metadata is UTF8, and the
supplementary characters are in utf8mb4.
(2) Perhaps new difficulty will arise with file names
(this statement is unchecked speculation).
(3) The new characters may be ignored for collation,
for a while.

Connectors
----------

People involved with PHP / Java / ODBC / Merlin / etc.
were asked to comment on the amount of work that
this task would cause them. In summary, the response
was "very little". In detail, responses were:

Mark Matthews: "Java supports both UTF-16, UTF-16 big
endian, UTF-16 little endian, and the same with UTF-32.
We won't have to do anything for Connector/J other than
add the mapping into the configuration that maps from
"java" names to "mysql" names for the character encodings."

Georg Richter: "PHP6 uses UTF-16."

Reggie Burnett: "MSDN indicates that the following are
valid encodings under .NET (utf-7, utf-8, utf-16, utf-32,
utf-32BE, unicodeFFEE) so it should be no problem for
.NET. Since .NET supports it, I assume that also means
that Windows can handle utf-32 so Connector/ODBC on
Windows should work."

Mike Zinner: "The GUI Tools have full UTF8 support internally
via glib. On the front end we are limited to UTF-16/UCS-2
on Windows because afaik Windows does not support
UTF-32/UCS-4. I will check and get back to you."

[Notice the opinion difference about utf-32 Windows support.]

Tim Smith: "MySQL command-line clients should not have
serious problems adapting."

InnoDB
------

Marko Mäkelä wrote, in a comment for BUG#9337:
"... InnoDB assumes in many places that UTF-8 sequences
are at most 3 bytes. I am afraid that making MySQL
accept wider than 3-byte UTF-8 characters will require
an incompatible modification to the InnoDB data file
format, at least in row_format=compact."

Marko Mäkelä wrote, in a comment for BUG#12600:
I strongly recommend that new charset-collation
codes are introduced in WL#2829 for the encodings of
the full Unicode. This would avoid incompatible changes
to data files. If WL#2829 does not already mention
this, please add it to the specification. Also, please notify
dev@innodb.com when the implementation starts, as we will need to
adjust some things inside InnoDB. [WL#2829 was cancelled
and its description is now part of WL#1213.]

[ Another Company ]
-------------------

A section here about another company has been edited.
To see the section, click the "Progress" bar and look
for the changes on 2009-11-18.

The section mentions BUG#12600, BUG#9337, and JVER tests.

Other DBMSs
-----------

DB2 supports supplementary characters, and supports CESU-8.

SQL Server 2005 does not support supplementary characters,
its NCHAR and NVARCHAR data types are always two bytes per
character and are defined as "character data types that ...
use the UNICODE UCS-2 character set". There are a few
surrogate-aware functions (len_s, left_s, right_s, sub_s,
replace_s). A slide presentation by a Microsoft employee
unicode.org/iuc/iuc18/papers/a8.ppt
says SQL Server is currently "surrogate safe" and in future
will be "surrogate aware".

Oracle10g supports supplementary characters with AL16UTF16,
UTF8, and AL32UTF8 character sets.

Oracle documentation makes it clear that Oracle "AL16UTF16" is
really UTF-16:

"Oracle's AL16UTF16 character set supports surrogate pairs."
http://download-east.oracle.com/docs/cd/A91202_01/901_doc/server.901/a90236/appb.htm

So MySQL's UTF16 will be the same as Oracle's AL16UTF16.

Oracle documentation makes it clear that Oracle "UTF8" is
really CESU-8, Oracle "AL32UTF8" is really UTF-8:

"UTF8: Unicode 3.0 UTF-8 Universal character set, CESU-8 compliant
AL32UTF8: Unicode 3.1 UTF-8 Universal character set"
http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10759/functions023.htm

"Because UTF8 can require up to three bytes for each character, one
supplementary character is represented in two code points, totalling
six bytes. In AL32UTF8, one supplementary character is represented in
one code point, totalling four bytes."
http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96529/ch6.htm

"When the character set is determined to be UTF-8, the Oracle character
set UTF8 is returned by default unless 4-byte characters (supplementary
characters) are detected within the text. If 4-byte characters are
detected, then the character set is reported as AL32UTF8."
http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14225/applocaledata.htm#NLSPG014

So MySQL's UTF8MB4 will not be the same as Oracle's UTF8.
It will be the same as Oracle's AL32UTF8.

SQL standard
------------

For better compliance with SQL standard requirements:

1. Support the three names that the standard has for
"Universal Character Sets": UTF8, UTF16, UTF32.
So the name UTF32 should be preferred instead of UCS4.

2. All the Universal Character Sets must have the same
repertoire. That's only possible with a 4-byte UTF8.
(That is, it's only possible with what we call UTF8MB4.)

3. All the Universal Character Sets must be mutually
comparable. So "WHERE utf32_column = utf16_column"
shouldn't always cause an error. That's impossible
unless utf32 and utf16 can have the same collation.
(The standard suggests elsewhere that one collation
can apply to two character sets provided they have
the same repertoire, or that a collation that's
applicable for a superset can be applicable for a
subset.) MySQL already allows and prefers a superset
collation for expressions like "utf8_column = latin1_column"
but for "utf8_column = ucs2_column" there's an error.
Bar says: "It [for example comparison of a utf32_unicode_ci
column and a utf16_unicode_ci] should be not so hard
to implement."

4. Support a more recent version of the SQL standard.
The April 2006 draft of the SQL:200x standard no
longer mentions Unicode 3.0, it mentions "Unicode
4.0.1". (Elsewhere it mentions Unicode 3.1 but that
is possibly an error in the draft.) Of course the
real target for MySQL should be 5.0.

5. The name of the default collation should be
UCS_BASIC (for code point order) or UNICODE (for UCA
order). MySQL will be close enough by saying
that the collation is xxx_unicode_xxx. But
xxx_general_xxx (the current default) is not
standard.

Upgrade
-------

Since some changes to existing behaviour are contemplated,
there are upgrade considerations.

1. If UTF8 character set name changes, or if default name
changes for identifiers. (Due to a February 2010 decision,
this will not happen.)

2. If xxx_general_ci ceases to be the default collation
(because UCS_BASIC or UNICODE or xxx_unicode_ci is chosen
as the default). Bar's comments about this consideration:

> To make upgrade easier, we should probably start
> always print character set and collation in "SHOW CREATE TABLE",
> even if character set and collation for a column are the default ones -
> currently they are not displayed:
>
> create table t1 (a char(1), b char(1) collate latin1_bin);
> show create table t1;
>
> CREATE TABLE `t1` (
> `a` char(1) default NULL, -- no character set and collate clauses
> `b` char(1) character set latin1 collate latin1_bin default NULL
> ) ENGINE=MyISAM DEFAULT CHARSET=latin1 -- no collate clause
>
>
>
> Another way to simplify upgrade could be allowing users to reassign
> the default collation for a character set.
> Something like:
>
> ALTER CHARACTER SET utf8 DEFAULT COLLATION utf8_unicode_ci;

3. If a length is already 2/3 of the maximum length. For example,
these statements currently cause no errors or warnings:
CREATE TABLE t (s1 VARCHAR(16000) CHARACTER SET UTF8) engine=myisam;
CREATE INDEX i ON t (s1(333));
but with 4-byte UTF8, that is, if one says CHARACTER SET UTF8MB4
rather than CHARACTER SET UTF8, the CREATE TABLE will cause a warning
"Converting column 's1' from VARCHAR to TEXT" and the
CREATE INDEX will cause a warning "Specified key was too long;
max key length is 999 bytes". See BUG#4541 for a taste of possible reactions.

(I am talking here about the case, where the user manually creates a
new table and on the 3-bytes-utf8-version the statement works, but on the
4-bytes-utf8-version it does not.)

References
----------

"Unicode implementation in DB2 Universal Database"
http://publib.boulder.ibm.com/infocenter/db2help/index.jsp?topic=/com.ibm.db2.udb.doc/admin/c0004821.htm

"Oracle Globalization"
http://www.oracle.com/technology/tech/globalization/pdf/TWP_AppDev_Unicode_10gR2.pdf

"SQL Server 2005 Books Online"
http://msdn2.microsoft.com/en-us/library/ms243717

A handy online converter from code points to UTF-16 and UTF-8 values:
http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=ff9d&mode=hex

Related Bugs/Requests
---------------------

"BUG#9337 misleading warning when inserting invalid utf8 data"
http://bugs.mysql.com/bug.php?id=9337

"BUG#12600 Inserting into LONGTEXT gives strange error message"
http://bugs.mysql.com/bug.php?id=12600

"BUG#14052 No UTF-16 charset encoding available (surrogate pairs)"
http://bugs.mysql.com/bug.php?id=14052

"BUG#25666 UTF-8 support beyong the BMP"
http://bugs.mysql.com/bug.php?id=25666

"BUG#38885 a japanese character make sql bad"
http://bugs.mysql.com/bug.php?id=38885

"Valueclick UTF8" email thread
[ mysql intranet ] /secure/mailarchive/mail.php?folder=4&mail=8297

"UTF-16 or UTF-32 support?"
http://lists.mysql.com/mysql/192885

"supplementary unicode characters"
http://forums.mysql.com/read.php?39,130718,130718

"Surrogate pairs? Non-BMP characters?"
http://forums.mysql.com/read.php?103,274048,274048#msg-274048

Design details are included in the HLS.